[Story] add a new datatype for formulae
Closed, Resolved (Public)

Description

We need a new datatype to store mathematical formulae.

There are two major usecases for storing formulas:

  • display on Wikipedia
  • usage to do actual calculations

We'd need to cater to both usecases.
In a first step, the format should only display formulae; in a second step, additional functionality will be added. To be forward compatible, a JSON format seems advisable. In a first version, the formula data can be simple:

{"tex":"\\sin^2 x + \\cos^2 x = 1"}

Thereafter, we can add additional information about the identifiers:

{"tex":"E=mc^2",
  "definitions": {
    "E":"Q11379",
    "m":"Q11423",
    "c":"Q2111"
  }
}
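
A minimal sketch of how a consumer might read this format in a forward-compatible way. It assumes the identifier mapping is a JSON object keyed by identifier and spelled "definitions"; the function name parse_formula is hypothetical and nothing here is an agreed schema. Unknown keys are ignored so that later versions can add fields without breaking older readers.

```python
import json


def parse_formula(raw):
    """Parse a formula value; only the "tex" field is required."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("tex"), str):
        raise ValueError('formula value needs a string "tex" field')
    # Optional in a first version; assumed to map identifier -> item ID.
    definitions = data.get("definitions", {})
    if not isinstance(definitions, dict):
        raise ValueError('"definitions" must be an object')
    # Any other keys are ignored on purpose (forward compatibility).
    return data["tex"], dict(definitions)


simple = parse_formula('{"tex": "E=mc^2"}')
annotated = parse_formula(
    '{"tex": "E=mc^2",'
    ' "definitions": {"E": "Q11379", "m": "Q11423", "c": "Q2111"},'
    ' "future-field": 1}'
)
print(simple)
print(annotated)
```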

Event Timeline

@Lydia_Pintscher: Right. Sorry for duplicating comments on different tickets.

@NiharikaKohli A subsequent task would be to allow using more than just meaningless plain text in formulae. I think it would be nice to have quantity data types in formulae, and links to Wikidata items and properties such as quantity symbols: https://www.wikidata.org/wiki/Property:P416

Should this datatype go into its own extension or should Wikibase have a soft dependency on the Math extension?

Lydia_Pintscher renamed this task from add a new datatype for formulas to [Story] add a new datatype for formulas. Aug 13 2015, 2:54 PM
Lydia_Pintscher removed a project: Epic.

Should this datatype go into its own extension or should Wikibase have a soft dependency on the Math extension?

If that's possible, I'd prefer the Math extension. Are there examples of other data types implemented in extensions?
cf. https://github.com/TU-Berlin-DIMA/dbproW15WikiData/issues/2

@Physikerwelt: You want to do it? That'd be awesome! Before you start, I think we should have a quick chat, as there are still a few things in the air about this. I'd hate for you to waste time on dead ends.

@Lydia_Pintscher: I thought this would be an optimal task for our database project course. In the database project course students apply their knowledge on databases and data modelling they obtained in classes on database management to real world problems. I think it would be super awesome if some of the effort that is spent in this class could actually be used in production.

Hah! Yeah if it gets done by the students let's actually make it so they can get the code into production. In this case I would also advise against making it its own extension as that complicates code review and deployment considerably (for little gain here as far as I can see).

@Physikerwelt I'm afraid implementing this will have next to nothing to do with database modelling. It's more an exercise in integrating the Math extension with Wikibase. The display code is the crucial bit; other than that, I imagine it's just a StringValue to Wikibase.

To answer your questions regarding datatypes being implemented by extensions: there are no examples, because we are only now making it possible to dynamically define datatypes.

@daniel: What is special about the display code?
I would imagine something like

$renderer = MathRenderer::getRenderer( $tex, array( 'id' => $wikidataId ), 'mathml' );
$checkResult = $renderer->checkTex();
...
$renderer->render();
$renderer->getHtmlOutput();

@Physikerwelt if the math extension offers a nice interface like that, then it's probably just that easy, yea :)

@daniel: @Llyrian and @WickieTheViking now have a good understanding of how Wikidata works. Moreover, they managed to set up a testing environment at wmflabs. Now the implementation must be done. Will the code go into the Wikibase repo, or can the Math extension hook into Wikidata without modifying the Wikibase extension code itself?
Moreover, it would be nice if the MathML mode could be enabled on Wikidata. If you could point me to the server configuration file, I can make a pull request for that.
https://github.com/wikimedia/operations-mediawiki-config/blob/5bc5ee989f22664648e2635dee1b0ed31711b04b/wmf-config/CommonSettings.php#L2171
Something along the lines of: if wikidata, then $wgDefaultUserOptions['math'] = 'mathml'; But maybe there is a more appropriate place to configure Wikidata-specific settings.

@Physikerwelt Yes, that's the correct hook. In the hook handler, you would do something like this:

public static function onWikibaseClientDataTypes( array &$dataTypeDefinitions ) {
    $dataTypeDefinitions['PT:math'] = array(
        'value-type' => 'string',
        'validator-factory-callback' => function() {
            $repo = WikibaseRepo::getDefaultInstance();
            return new MathValidator( ... );
        },
        'parser-factory-callback' => function( $format, FormatterOptions $options ) {
            $repo = WikibaseRepo::getDefaultInstance();
            $normalizer = new WikibaseStringValueNormalizer( $repo->getStringNormalizer() );
            return new StringParser( $normalizer );
        },
        'formatter-factory-callback' => function( $format, FormatterOptions $options ) {
            $repo = WikibaseRepo::getDefaultInstance();
            return new MathFormatter( ... );
        },
    );
}

Note the "PT:" prefix, indicating that "math" is a property type, not a value type ("VT:"). Also note that the extension point for adding data types is pretty new, and still in flux. The PT and VT prefixes were added only this week. I'll try to keep it more stable now though.
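
To illustrate the naming convention only, here is a small Python sketch (not Wikibase code) of a registry keyed by prefixed names, where "PT:" entries describe property types and "VT:" entries describe the value types that back them. The helper value_type_for and every registry entry other than the 'value-type' of "PT:math" are hypothetical.

```python
# Hypothetical registry; only the 'value-type' of "PT:math" mirrors the hook example.
definitions = {
    "VT:string": {"label": "plain string value"},
    "PT:math": {"value-type": "string"},
}


def value_type_for(property_type, registry):
    """Return the "VT:" key that backs the given property type."""
    entry = registry["PT:" + property_type]
    key = "VT:" + entry["value-type"]
    if key not in registry:
        raise KeyError(key)
    return key


print(value_type_for("math", definitions))
```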

Change 259167 had a related patch set uploaded (by Llyrian):
WIP: Add classes and modify hook in Math

https://gerrit.wikimedia.org/r/259167

Physikerwelt renamed this task from [Story] add a new datatype for formulas to [Story] add a new datatype for formulae. Dec 14 2015, 11:13 PM
Physikerwelt updated the task description.
Physikerwelt edited projects, added Math, Mathoid; removed Patch-For-Review, patch-welcome.
Physikerwelt removed a subscriber: gerritbot.

The variables are mapped, but what about the operators?

That's a good point. Currently texvcinfo extracts variables only. However, this can easily be extended to operators as well. Since I was not sure whether a significant portion of operators depends on the context, I restricted the extraction to variables in the first place. For the operators built into texvcjs, such as '+,-,\sin...\ker...\lim', we could build the links into the math rendering engine. However, my impression was that the meaning of those operators is clear anyhow. As in standard Wikipedia articles, I think only significant concepts should be connected with links. Otherwise it would be like a Wikipedia article where every word is a link.

@TomT0m: Can you point to some example pages that use user defined operators?

Change 260210 had a related patch set uploaded (by Llyrian):
Fix composer test issues and remove unneccesary comments

https://gerrit.wikimedia.org/r/260210

Change 260210 abandoned by Llyrian:
WIP: Add classes and modify hook in Math for PT:math

Reason:
Unnecessary new review

https://gerrit.wikimedia.org/r/260210

Change 259167 merged by jenkins-bot:
Implement datatype 'Math' for Wikidata

https://gerrit.wikimedia.org/r/259167

Otherwise it would be like a Wikipedia article where every word is a link.

I don't think it's a question of generating a link, but of having a generic way to make the semantics of a math formula precise.

To take only one example: the ∧ and ⋁ symbols mean something different in

...

More broadly, scientists constantly introduce new symbols, so a way to map a symbol or macro (possibly a parameterized macro) to an operation or a meaning may be necessary for some semantic representations of formulas. Operations also differ, so mapping a symbol to an item about the operation itself might be cool.

TomT0m: Can you explain the difference, please? To me, both look like \and or \or respectively.
I think we should be very precise about which annotations we want to allow.
You can check at http://api.formulasearchengine.com which identifiers are currently extracted by mathoid.
I have the feeling that it would be reasonable to start with those and add more in the future, if the need comes up.

It's not a question of appearance, but of meaning. In a Boolean algebra,
the inputs and result of the operation are sets, whereas in logic we
manipulate truth values of different kinds.

I have a feeling that it would be useful to add a feature such as a mapping
TeX macro ; arity ; meaning
with the macro being a string such as \frac, the arity being the number of
parameters (2 here), and the meaning being a Q-item such as
https://www.wikidata.org/wiki/Q1068675, depending on the type of objects the
formula ranges over. There could be a default mapping, of course, such as
standard division on the real numbers for \frac.
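
The proposed mapping could be represented along these lines. This is a sketch only: the class MacroMeaning, the domain field, and the lookup helper are hypothetical names, and the \frac entry simply reuses the item linked above as a placeholder for a default meaning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MacroMeaning:
    macro: str                    # TeX macro, e.g. r"\frac"
    arity: int                    # number of parameters the macro takes
    meaning: str                  # Wikidata item ID describing the operation
    domain: Optional[str] = None  # None marks the default mapping


# Placeholder data: the item ID is the one linked in the comment above.
MAPPINGS = [
    MacroMeaning(r"\frac", 2, "Q1068675"),
]


def lookup(macro, domain=None):
    """Prefer a domain-specific mapping and fall back to the default one."""
    default = None
    for m in MAPPINGS:
        if m.macro != macro:
            continue
        if m.domain == domain:
            return m
        if m.domain is None:
            default = m
    return default


print(lookup(r"\frac", domain="real numbers"))
```

Keeping the domain optional lets a formula fall back to the default meaning when no more specific mapping has been entered.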

EDIT: I think I have seen all but a few of the symbols on this page actually used, one way or another, in maths formulas: https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode

I think it will be hard to have type-safe operators. There are so many versions of + and even more of times (e.g. invisible times).
We are working on the mapping between tex function and meaning. (e.g. http://drmf.wmflabs.org/wiki/Definition:AntiDer and the list http://drmf.wmflabs.org/wiki/Main_Page#Definition_Pages)
But it will take a few years until that will be done.
The identifier extraction on the other hand works quite well. The most serious problem is that the integral d is often mixed up with the identifier d.
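
The differential problem is easy to reproduce with a deliberately naive extractor. The sketch below is not how texvcinfo or mathoid work; the function naive_identifiers is a hypothetical stand-in that drops TeX macros and treats every remaining letter as an identifier.

```python
import re


def naive_identifiers(tex):
    """Rough stand-in for identifier extraction: drop macros, keep single letters."""
    without_macros = re.sub(r"\\[a-zA-Z]+", " ", tex)
    return sorted(set(re.findall(r"[a-zA-Z]", without_macros)))


print(naive_identifiers(r"\int f(x) dx"))
```

For this input the extractor reports d alongside f and x, because nothing in the token stream distinguishes the d of dx from an ordinary identifier named d.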

You are working on automatic mapping? That does seem a little hard indeed. I
was thinking of optional manual mapping in the UI of the formula datatype
as a first step. Maybe we could suggest a few predefined operator mappings
for common cases later ...

The advantage of using an item as the definition of the function or operator is that we're totally open on the type of operation: to have a new one, we just have to add a Wikidata item as needed.

I removed "T125522: Show preview when entering value for math data type" which is a future improvement. Probably the task will eliminate itself when VE is enabled.

I'd prefer to keep these tasks together.
Visual Editor will not solve this as the previews we have for our datatypes don't have anything to do with it.

T90870 suggests that this is a self-contained task. So maybe we can create a follow-up task for everything that is planned for phase two?

We typically don't consider tasks in the "blocked by" section as hard blockers but rather as related tasks, because Phabricator doesn't let us differentiate between them. So we would rather link more tasks than necessary, to get a better overview of what still needs to be done for a story.

OK, I really like resolving tasks, which gives you the impression of getting things done.
I created a test property at https://test.wikidata.org/wiki/Q2209
However, it seems that the datatype is now gone on http://wikidata.beta.wmflabs.org/wiki/Special:NewProperty

Perhaps we should split this story into a "baseline" version (which blocks deployment) and a "full" version (which is blocked by all the additional "nice" features).

@daniel yes. Is there something that currently blocks deployment?

@daniel yes. Is there something that currently blocks deployment?

Not that I know of, that was the point :)

Change 269386 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Add Math property type to ontology.owl

https://gerrit.wikimedia.org/r/269386

Change 269386 merged by jenkins-bot:
Add Math property type to ontology.owl

https://gerrit.wikimedia.org/r/269386

For the record, I think the patch above should not have been merged yet. I think WMDE should try to respect different opinions, and not classify a different opinion as invalid. It is the nature of democracy that one cannot find consensual solutions for all problems, and that decisions are made that not everybody agrees with. But I will not accept that WMDE classifies opinions as invalid.

@Physikerwelt I'm confused - what opinion was not respected? Is this about the TeX vs MathML thing?

I think the "invalid" bit is a misunderstanding. Thiemo said -1 means "please improve". A -1 with no reason given is invalid. This is indeed our policy: A CR-1 vote needs to come with a reason and way forward, otherwise it's considered stonewalling, and can be ignored - though it's preferred to ask for clarification before overriding a CR-1. I suppose you consider your inline comments to be the reason for the CR-1, but that isn't clear from looking at the discussion, or the comments.

FWIW, I don't see how the ontology.owl patch is related to the MathML vs TeX discussion. All ontology.owl does is give a description of the data types, it doesn't say anything about the format of literals or the structure of values. We could mention the format in the description of the data type, but we don't do that for any of the other types, and I would consider it a bad idea. It would mean mixing two levels of abstraction: the data type is about interpretation ("mathematical expression"), the literal type is about the format or encoding (TeX, MathML, etc). The format is specified by the uri given as the literal's type, and the uri should resolve to a description of the format.

Btw, I'm not very happy about an extension defined data type being in ontology.owl, but I don't think it's a disaster either. I tried to come up with a better mechanism in I7599e7fa5391f9, but using that for math would mean a breaking change.

I wrote

I think documenting an unintended behaviour makes it worse. People might read the documentation and adjust to this unintended behaviour. That is why I think this should not be merged.

In the inline comment I proposed how to proceed

I would prefer to wait for I15fbeec282868b4267a3e3d15740f2c3ff37ea48

I respect people with a different opinion about that, but I do not understand why this argument is not considered a reason.

I would prefer to wait for I15fbeec282868b4267a3e3d15740f2c3ff37ea48

I respect people with a different opinion about that, but I do not understand why this argument is not considered a reason.

From the conversation on the change, I think it was simply unclear that you meant that to be the reason for your CR-1. I now understand that that was your intention, but I still don't understand the connection, since the entry in the owl file says nothing about the format.

I understand that you are unhappy about the "your vote is invalid" thing, but I believe no harm was done to the codebase. Do you see any concrete problems with the way the code is now?

The problematic thing is that the description says "as supported by the Math extension". I'm afraid that this might raise questions, bug reports and service requests like:

  • "What is supported by the math extension?"
  • "Why is X supported but not Y?"
  • "Please enable support of Z."

Answering those requests consumes much more time compared to a simple bug that can be fixed by changing the source code.
My experience with the Wikimedia services team is that it's common to discuss those effects on the community before merging, and that people try to avoid overriding negative votes.
For people not working full time on MediaWiki projects, it's much more comfortable if changes get merged once they are ready and well discussed, even if that takes more time.

@Physikerwelt Ah, I see what you mean. Perhaps that line should read: "Type for mathematical expressions as defined by the Math extension." Then it's clear that we are talking about a data type defined by an extension, not about a format. The description of the data type says nothing about how the value is represented (just like we don't say anything about how other kinds of values are represented).

For example, the entry for quantity simply avoids the problem:

<owl:NamedIndividual rdf:about="&wikibase;Quantity">
<rdfs:label>Quantity</rdfs:label>
<rdfs:comment>Type for numerical quantity.</rdfs:comment>
<rdf:type rdf:resource="&wikibase;PropertyType"/>
</owl:NamedIndividual>

In the same way one could write

<rdfs:comment>Type for mathematical expression.</rdfs:comment>

That way those types are described following the same pattern, and we would avoid the problem altogether.

@Physikerwelt yes, that's what I mean. There isn't supposed to be any info about the format here. However, I do find it useful to explicitly say that this type is defined by the Math extension. Otherwise, someone might remove it, since it does not seem to be used in Wikibase at all.

@daniel: That's another point. This is indeed helpful for users that are aware of MediaWiki and that there are extensions and so on. However, if this text is displayed to users that are not aware of the technicalities, it might also cause confusion. But on a meta level I think those questions should be discussed before things are merged.

Now, we have the datatype but no associated properties. To get feedback on the ease of use it would be good to create at least one property.
You are invited to support
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#defining_formula
@TomT0m, @ArthurPSmith, @Bene