
Wikidata JSON dump: file directory location should follow standard patterns
Closed, Resolved · Public

Description

The Wikidata JSON dump is currently located at

http://dumps.wikimedia.org/other/wikidata/

This does not follow the common scheme used by all other dumps. For example, the daily (incremental) dumps are at the location

http://dumps.wikimedia.org/other/incr/wikidatawiki/

Here "incr" specifies is the type of dump, and "wikidatawiki" is the official Wikimedia site name of Wikidata.org. The current scheme uses a custom string name ("wikidata") that is not a site name, and it completely fails to specify the dump type. If more projects would generate JSON dumps (e.g., a future Wikimedia Commons installation of Wikibase), then this naming pattern will not work.

I suggest using a location like:

http://dumps.wikimedia.org/other/wikibase-json/wikidatawiki/

Or maybe use "json" if you find this specific enough. While doing this, the file names should also be made more descriptive (Bug 68792).


Version: unspecified
Severity: normal
Whiteboard: u=dev c=infrastructure p=0
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=68792

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:47 AM
bzimport set Reference to bz70385.

In addition to the above, there should be a timestamp-based sub-directory for each export (even if it would contain only one file for now). For example, the daily dumps are in directories like

http://dumps.wikimedia.org/other/incr/wikidatawiki/20140903/

Using the same structure will make it easier for consumers to find dump files without needing custom code for each type of dump (a program that checks the Web to find out for which dates there are dumps could use the same code for all types of "other" dumps). Moreover, it might be good to have a directory per dump to organise multiple files in the future (md5 sum, several types of compression [Bug 68793], dump status).
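
As a rough sketch of the uniform discovery code this would enable (assuming Apache-style directory listings and the other/<dumptype>/<wiki>/<date>/ layout described above; this is illustrative, not existing tooling):

import re
import urllib.request

BASE = "https://dumps.wikimedia.org/other"

def list_dump_dates(dump_type, wiki):
    # List the YYYYMMDD subdirectories for one "other" dump, e.g.
    # list_dump_dates("incr", "wikidatawiki"). The same function works for
    # any dump type that follows the other/<dumptype>/<wiki>/<date>/ layout.
    url = f"{BASE}/{dump_type}/{wiki}/"
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Date directories appear in the listing as links like <a href="20140903/">.
    return sorted(set(re.findall(r'href="(\d{8})/"', html)))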

I'm fine with json/wikidatadumps. Wikidata folks, please sign off or suggest something you like better. This will entail: fixing the cron job, moving the existing dumps, and correcting any links that already exist (where are those?)

I'm not too fond of having "json" in the path, as we'll provide non-JSON Wikibase-specific dumps at some point (RDF, maybe more), and those should IMO be in the same place. If we can't integrate this with the usual dump process now, can we have something like /other/wikibase-dumps/wikidatawiki, which makes it clear that these are the dumps we'll provide for Wikimedia's Wikibase repo installations (which would later on exist for Commons as well... and maybe also for testwikidata)?

Also, we should probably make the old folder a redirect if we decide to change this; just fixing all links won't work.

I can certainly make the old dir a symlink. wikibase-dumps/wikidatawiki is fine too. Markus?

I think json should be in the path somewhere. It does not have to be at the top level, but it would be good if dump files of each type ended up in their own directory. The only way for tools to detect and download dumps automatically is to look at the HTML directory listings, and this listing should not change its appearance (again). Note that different types of dumps will be created at different intervals, so a combined directory that contains several types of dumps would end up looking quite messy.

We could have wikibase-dumps/wikidatawiki/json if you prefer this over something like other/wikibase-json/wikidatawiki. However, the latter seems to be more consistent with /other/incr/wikidatawiki. I don't care much about the details, but it would be good to have something systematic in the end: either other/projectname/dumptype or other/dumptype/projectname seems most logical. Also, I think that "dumptype" could already mention wikibase if desired, so that there is no need for an extra "wikibase-dumps" directory on the path. The thing to avoid is introducing a new directory structure for every new kind of dump (and "wikibase-dumps" smells a lot like this, even if there is a faint possibility that there will be more dumps of this kind in the future -- do you actually have any plans to move our RDF dumps from http://tools.wmflabs.org/wikidata-exports/rdf/ to the dumps site? It could be done, but I'm not sure it is needed.)

We will publish more dumps than the current JSON dumps, yes. Daniel wants expanded JSON dumps, for example, that include full URIs for external identifiers.

Ok, I'd prefer to either have https://dumps.wikimedia.org/wikidatawiki/json or https://dumps.wikimedia.org/wikidatawiki/wikibase-json (etc.) if that's easily possible (without messing with the xml dumps or having the xml dumps mess with our dumps).

If that's not easily possible, I'd go for https://dumps.wikimedia.org/wikibase-dumps/wikidatawiki/json (as suggested by Markus).

Would be nice to get this solved fast, as we want this changed before introducing (experimental) RDF dumps.

The current dump location should be a symlink to the new one so that we can keep b/c.

hoo raised the priority of this task from Medium to High.Mar 23 2015, 7:04 PM

Just for the record, I personally prefer https://dumps.wikimedia.org/wikidatawiki/wikibase-json over https://dumps.wikimedia.org/wikidatawiki/json as I think that makes it clear that those are Wikibase dumps.

Ok, I talked about this with @ArielGlenn and we decided that the following would be doable and nice:

Store the dumps (on the file system) in https://dumps.wikimedia.org/other/wikibase-dumps/wikidatawiki/json. Then we can have a symlink to that from https://dumps.wikimedia.org/wikidatawiki/wikibase-json and for backwards compatibility reasons from the current location (https://dumps.wikimedia.org/other/wikidata/).

On top of that we can, if we want, have (via symlinks) https://dumps.wikimedia.org/wikibase-dumps/wikidatawiki/json (although I think this is overkill for now).

Why should Wikibase be in the name?

> Why should Wikibase be in the name?

Because just having "json" dumps could mean anything IMO... also I think having wikibase in there is more future proof after we hit commons. But that's just my opinion and it's not a particularly strong one.

Ok, we talked about this in the office and came up with the following:

https://dumps.wikimedia.org/wikidatawiki/entities/ is the (user-visible) base path (the actual files would be in /other/…); it could also have a fancy HTML overview page with additional explanations. In there we have the subdirectories full and truthy (and possibly more later on). Those contain all dumps of those flavors, no matter the format.

In those we have files like (all|items|properties)-20150324(-BETA).(json|ttl|…).
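
For illustration, a hypothetical parser for file names following this proposed pattern (the pattern and the set of formats were not final at this point, and the optional compression suffix is an assumption):

import re

FILENAME = re.compile(
    r"^(?P<scope>all|items|properties)"    # which entities are included
    r"-(?P<date>\d{8})"                    # dump date, e.g. 20150324
    r"(?P<beta>-BETA)?"                    # optional BETA marker
    r"\.(?P<format>json|ttl)"              # serialization format (more could follow)
    r"(?P<compression>\.gz|\.bz2)?$"       # assumed optional compression suffix
)

def parse_dump_filename(name):
    match = FILENAME.match(name)
    return match.groupdict() if match else None

# parse_dump_filename("all-20150324-BETA.json") ->
# {'scope': 'all', 'date': '20150324', 'beta': '-BETA', 'format': 'json', 'compression': None}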

@mkroetzsch: What's your opinion on the above naming scheme? Is it ok for you? If so, I will implement it soon.

@hoo Thanks for the heads up! I do have comments.

(1) I would remove the "full" and "truthy" distinction from the path and rather make it part of the dump type (for example "statements" and "truthy-statements"). The reason is that we have many full dumps (terms, sitelinks, statements, properties), which can be readily exported in RDF and JSON, but we have only one "truthy" dump and it really is mainly for RDF (at least we did not discuss a JSON format for "single-triple statements"). Therefore, it does not seem worth making a top-level distinction in the directory structure for this. For consumers, it is easier if a dump file is addressed with four components (projectname, dumptype, date, file format); see the sketch after this comment. The truthy/full distinction would be another parameter that does not seem to add any functionality.

(2) My comment right at the beginning of this bug report was to have timestamped subdirectories, just like we have for the main dumps. Maybe you have reasons for not having these, but could you explain them here?
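
A minimal sketch of the four-component addressing from point (1), under the systematic other/<dumptype>/<projectname>/<date>/ layout discussed earlier (the file naming here is an assumption for illustration, not the scheme that was eventually deployed):

def dump_url(project, dump_type, date, file_format,
             base="https://dumps.wikimedia.org/other"):
    # Build the download URL from the four components a consumer needs:
    # project name, dump type, date, and file format.
    filename = f"{project}-{date}.{file_format}"
    return f"{base}/{dump_type}/{project}/{date}/{filename}"

# dump_url("wikidatawiki", "wikibase-json", "20150324", "json.gz") gives
# https://dumps.wikimedia.org/other/wikibase-json/wikidatawiki/20150324/wikidatawiki-20150324.json.gz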

About 2: We didn't add timestamped subdirectories because they would likely be confusing. Dumps of different formats or flavors would not be done on the same date, and dump creation usually takes more than a day. So finding the right subfolder with the format and flavor you are looking for would be hard.

@Lydia_Pintscher I understand this problem, but if you put different dumps for different times all in one directory, won't it become quite big over time and hard to use? Maybe one should group dumps by how often they are created (and have date directories only below that). For some cases, there does not seem to be any problem. For example, creating all RDF dumps from the JSON dump takes about 3-6h in total (on labs). So this is easily doable on the same day as the JSON dump generation. I am sure that we could also generate alternative JSON dumps in comparable time (maybe add an hour to the RDF if you do it in one batch). The slow part seems to be the DB export that leads to the first JSON dump -- once you have this, the other formats should be relatively quick to produce.

All of these dumps will be generated by exporting from the DB. AFAIK currently the dumps can contain edits that were made after the dump is started. We should at some point change this, but we should not block adding RDF for that. The result is that currently each dump format might represent slightly different data.

> All of these dumps will be generated by exporting from the DB.

Why would one want to do this? The JSON dump contains all information we need for building the other dumps, and it seems that the generation from the JSON dump is much faster, avoids any load on the DB, and would guarantee consistent state of all files (same revision status). Moreover, we already have code for doing it now (which will be updated to agree with any changes in RDF export structures we want).
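
A rough illustration of the dump-based pipeline described above (a hedged sketch: the one-entity-per-line layout of the JSON dump is assumed, and this is not the Wikidata Toolkit or Wikibase code):

import gzip
import json

def entities(dump_path):
    # Stream entity documents from a Wikidata JSON dump, assumed to be a
    # gzipped JSON array with one entity per line.
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)

def convert(dump_path, write_entity):
    # Drive any converter (e.g. an RDF serializer) off the same JSON dump,
    # so every derived file reflects the same revisions and no DB access is needed.
    for entity in entities(dump_path):
        write_entity(entity)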

I would propose to discuss dump partitioning for RDF in T93488, since it becomes hard to track otherwise.

> Why would one want to do this?

To be able to use the same code as is used for the linked data endpoint of Wikibase. Example: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?flavor=full (this format is not final and not yet to be relied on).

> would guarantee consistent state of all files

It would guarantee that all dump files are inconsistent in the same way. It would not achieve the consistency of the JSON dump. I'm not sure anyone has a use for the former but not the latter. Anyway, making the JSON dumps consistent allows both, independent of how the other dumps are generated.

"Consistency" of dumps in different formats is a questionable thing. What would it mean to have JSON and RDF "consistent"? Of course they'd contain same entities, that's a given, and the data would be kind of alike. But even values may differ - i.e. RDF has no standard for representing coordinates, so we have to choose something. That something will not be the same as JSON. Also, if we want to represent dates in standard way - e.g. xsd:dateTime - we'd have to modify them, slightly or substantially. Same goes for many other things which look slightly different - ranks, units, truthy statements, etc. Ultimately, we're basing on the same data set, so excepting bugs we'd have consistency on that level, but beyond that I'm not sure what it is.

@JanZerebecki:

Re using the same code: That's not essential here. All we want is that the dumps are the same. It's also not necessary to develop the code twice, since it is already there twice anyway. It's just a question of whether we want to use a slow method that keeps people waiting for the dumps for days (as they already do now with many other dumps), or a fast one that you can run anywhere (even without DB access; on a laptop if you like). The fact that we must have the code in PHP too makes it possible to go back to the slow system if it should ever be needed, so there is no lock-in. Dump file generation is also not operation-critical for Wikidata (the internal SPARQL query service will likely be based on a live feed, not on dumps). What's not to like?

Re consistency: I meant that the dumps would contain the same information, not that they reflect a consistent state of the site. If it is important for you to have a defined state, then the dump-based file generation is also your friend: one can do the same with the full history dump, where one could exactly specify the revision to dump. Probably still as fast as the DB method, but guaranteed to provide a globally consistent snapshot (yes, I know, modulo deletions). Not sure if this type of consistency is relevant though. Having a guarantee that the dump files in various formats are based on the same data, however, would be quite useful (e.g., in SPARQL, where you often mix data from truthy and full dumps in one query).

Recall that we are discussing this here because Lydia said that the slowness of the DB-based exports is the reason why we cannot have an (otherwise convenient) date-based directory structure. I agree with Lydia that this would be a blocker, but in this case it's really one that we can easily remove. The code I am talking about is at https://github.com/Wikidata/Wikidata-Toolkit, well tested, extensively documented, and partially WMF-funded. Why not make this into a community engagement success story? :-)

@Smalyshev

Re "what does consistent mean": to be based on the same input data. All dumps are based on Wikidata content. If they are based on the same content, they are consistent, otherwise they are not.

Re "discussing RDF dump partitioning in T93488": Agreed. We are not discussing which RDF dumps to have here, only whether they are likely to be well organised by distinguishing "full" and "truthy" as a primary categorisation that sits above format (RDF vs. JSON and other matters).

I don't think splitting full and truthy would be too useful, as most query engines, except for the absolutely most basic ones, will want both anyway. And for JSON we don't even have that distinction I think?

@Smalyshev Yes, this is what I was saying. @hoo was proposing to create a special directory for "truthy" based on offline discussion in the office.

For the record: while I was proposing to have the dump "flavor" at the bottom of the hierarchy and put the timestamp only into the filename, I'm coming around to the opposite view again: have the date at the base of the hierarchy.

Having the date as the base makes sense if we can make sure that all the dumps in that directory consistently reflect the state of the data at the given point in time. This is infamously untrue for the "standard" MediaWiki dumps. We could however make it true for our dumps by generating everything off a single JSON dump, as @mkroetzsch suggested.

If we want to split our RDF output into several files (terms, sitelinks, statements, etc), this consistency is essential. I think we should go that route, so I filed a ticket for implementing a script for generating RDF from JSON: T94019.

We could generate multiple dump files from the same database; it doesn't have to be from JSON. I'm also not sure why JSON and RDF should always have the same snapshot - it's a random point in time (or, given that a dump takes many hours during which data changes, a random collection of points), no better than any other one.

Also, I'm not sure why generating RDF from JSON should block this task.

@Smalyshev: You are right that it doesn't have to be based on JSON, but since that is our primary data representation, it seems sensible to use it as a basis.

I agree that it doesn't matter much to have the RDF dumps consistent with the JSON dumps. But if we make multiple RDF dumps, it's important that they are consistent with each other. The easiest way to achieve this is to base them on the same JSON dump.

Whether that should block this task is debatable of course. Perhaps it shouldn't. The idea was that putting a timestamp in the directory name only makes sense if we have consistent dumps. But we can live with inconsistencies for a while - it's not like the regular XML dumps were consistent either.

Change 201208 had a related patch set uploaded (by Hoo man):
Add new wikidata folders, define dataset folders in puppet

https://gerrit.wikimedia.org/r/201208

Change 201208 merged by ArielGlenn:
Add new wikidata folders, define dataset folders in puppet

https://gerrit.wikimedia.org/r/201208

Change 201238 had a related patch set uploaded (by Hoo man):
Adopt dumpwikidatajson.sh to the new naming pattern

https://gerrit.wikimedia.org/r/201238

Change 201238 merged by ArielGlenn:
Adopt dumpwikidatajson.sh to the new naming pattern

https://gerrit.wikimedia.org/r/201238

New dumps will be located at https://dumps.wikimedia.org/wikidatawiki/entities/

The legacy directory containing the JSON dumps will stay in place as-is.