
Partial dumps
Open, MediumPublic

Description

The dumps are growing very big and are too cumbersome for people to work with. They also contain a large amount of data that is not relevant for specific use cases. One option to solve this is to split the dumps meaningfully. We are already splitting them between Lexemes and Items + Properties. We need to figure out which other meaningful splits there are; ideally we divide the data into distinct subsets.

Current alternatives:

Relevant notes and discussions:

Details

Reference
bz44581

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:28 AM
bzimport set Reference to bz44581.
bzimport added a subscriber: Unknown Object (MLST).

A list of entities is very different from limiting by entity type. Once this bug is taken up, it should first be split into two.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

In addition to the entities given explicitly, any properties used in describing the entities should be automatically included in the dump.

What about other linked items in statements?

I wonder if this is still worth pursuing. Is there demand for it? What are the concrete use cases it would serve?

hoo subscribed.

Given the speed at which the dumps grow, we need to look into this again.

I'd suggest splitting (at least the nt dumps, as I work mostly with those and therefore have an opinion on that side) into the following:

  1. Triples regarding properties (where the property is the subject)
  2. Triples that contain language information (aka the labels)
  3. The "pure" direct triples

If someone works with those, I think it's reasonable to assume they need just one of those dumps, or are able to combine them. I'm not sure, however, how much of that is already done (I think property triples (1.) aren't in the nt dump at the moment anyway, are they?)
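For illustration, here's a minimal sketch of that three-way split, assuming the standard Wikidata RDF layout (wd:P… subjects for property entities, language-tagged literals for terms, wdt: predicates for the direct triples); the file names are hypothetical:

```python
# Minimal sketch of the three-way split proposed above, assuming the
# standard Wikidata RDF layout; file names are hypothetical.
import gzip
import re

PROPERTY_SUBJECT = re.compile(r'^<http://www\.wikidata\.org/entity/P\d+>')
LANG_TAGGED = re.compile(r'"@[a-zA-Z-]+ \.\s*$')  # ..."@en .
DIRECT_PREDICATE = '<http://www.wikidata.org/prop/direct/'

with gzip.open('wikidata-truthy.nt.gz', 'rt', encoding='utf-8') as src, \
        open('properties.nt', 'w', encoding='utf-8') as props, \
        open('terms.nt', 'w', encoding='utf-8') as terms, \
        open('direct.nt', 'w', encoding='utf-8') as direct:
    for line in src:
        if PROPERTY_SUBJECT.match(line):
            props.write(line)    # 1. triples where a property is the subject
        elif LANG_TAGGED.search(line):
            terms.write(line)    # 2. language-tagged literals (the labels)
        elif DIRECT_PREDICATE in line:
            direct.write(line)   # 3. the "pure" direct triples
```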

These types of dumps should be considered:
Terms

  1. Dump of all labels
  2. Dump of all descriptions
  3. Dump of all aliases
  4. Dumps of all labels in a specific language
  5. Dumps of all descriptions in a specific language
  6. Dumps of all aliases in a specific language
  7. Dump of all terms (optional)
  8. Dumps of all terms in a specific language (optional)

Sitelinks

  1. Dump of all sitelinks
  2. Dumps of all sitelinks in a given wiki
  3. Dumps of all entities with sitelinks in a given wiki

Statements

  1. Dump of all statements
  2. Dump of all truthy statements
  3. Dumps of all statements for a property
  4. Dumps of all truthy statements for a property
  5. Dumps of all entities with statements for a property

Other

  1. Dump of all page properties (wikibase:statements, wikibase:sitelinks)

Users could easily build a custom dump by combining several of the dump types above.
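Since N-Triples is line-oriented, such a combination can be a plain concatenation of the parts. A minimal sketch, where the file names are hypothetical examples of the types above:

```python
# Minimal sketch: N-Triples is line-oriented, so a custom dump is just
# the concatenation of the needed parts; file names are hypothetical.
import gzip
import shutil

parts = ['labels-en.nt.gz', 'sitelinks-enwiki.nt.gz', 'truthy-P31.nt.gz']
with gzip.open('custom.nt.gz', 'wb') as out:
    for part in parts:
        with gzip.open(part, 'rb') as src:
            shutil.copyfileobj(src, out)
```

Duplicate triples across parts are harmless under RDF set semantics and can be removed afterwards (e.g. with sort -u) if needed.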

Lydia_Pintscher renamed this task from Partial RDF dumps to Partial dumps.Dec 19 2019, 9:39 AM
Lydia_Pintscher raised the priority of this task from Low to Medium.
Lydia_Pintscher updated the task description.
Lydia_Pintscher added subscribers: abian, Nikki, Aklapper.

https://tools.wmflabs.org/wdumps/ provides a way to generate a partial dump, but the dump cannot be regenerated on a regular schedule.

Filtering dumps by area of interest is convenient if a good criterion can be found to identify items relevant to the topic. It would probably make sense to also include any items directly referenced, to provide the immediate context of the items.

However, for this to be useful, a great many such specialized "area of interest" dumps would have to exist, with substantial overlap. If WMF can afford that in terms of resources, it would certainly be nice to have.

But perhaps there is a different way to slice this: create a stub dump that filters out most of the statements (and sitelinks?), providing labels, descriptions, and aliases, plus instance of and subclass of.

Another approach would be to focus on structure rather than topic: e.g. export all items that have (or are the subject of) a parent taxon property, and include only terms and maybe a very limited set of properties. Similarly, dumps that contain the geographical inclusion structure, the genealogical structure, or a historical timeline may be useful.
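For the stub-dump idea above, a minimal sketch that keeps only the term triples plus instance of (P31) and subclass of (P279), assuming the standard Wikidata RDF predicate IRIs (file names hypothetical):

```python
# Minimal sketch of the stub-dump idea: keep only the term triples plus
# instance of (P31) and subclass of (P279). Predicate IRIs follow the
# standard Wikidata RDF mapping; file names are hypothetical.
import gzip

KEEP = (
    '<http://www.w3.org/2000/01/rdf-schema#label>',      # labels
    '<http://schema.org/description>',                   # descriptions
    '<http://www.w3.org/2004/02/skos/core#altLabel>',    # aliases
    '<http://www.wikidata.org/prop/direct/P31>',         # instance of
    '<http://www.wikidata.org/prop/direct/P279>',        # subclass of
)

with gzip.open('wikidata-truthy.nt.gz', 'rt', encoding='utf-8') as src, \
        gzip.open('stub.nt.gz', 'wt', encoding='utf-8') as out:
    for line in src:
        if any(p in line for p in KEEP):
            out.write(line)
```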

I might be wrong but I feel there's a large demand for couple of types of dumps and there's a long tail that we can't afford to have. For example, having a dump of all humans is very very useful (even I need it for one of my tools) and there might be a request for dumps that can be easily handled through WDQS+scraper but for example just getting list of all humans times out in WDQS (understandably)
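For what it's worth, the all-humans case can be approximated by streaming the JSON dump instead of querying WDQS; a minimal sketch, assuming the documented one-entity-per-line layout of latest-all.json.gz (the output name is hypothetical):

```python
# Minimal sketch of an "all humans" extract built by streaming the JSON
# dump rather than querying WDQS; assumes the documented one-entity-per-line
# layout of latest-all.json.gz; the output file name is hypothetical.
import gzip
import json

def is_human(entity):
    # True if the item has an instance of (P31) claim with value Q5 (human).
    for claim in entity.get('claims', {}).get('P31', []):
        value = claim.get('mainsnak', {}).get('datavalue', {}).get('value', {})
        if isinstance(value, dict) and value.get('id') == 'Q5':
            return True
    return False

with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as src, \
        gzip.open('humans.ndjson.gz', 'wt', encoding='utf-8') as out:
    for line in src:
        line = line.rstrip().rstrip(',')
        if line in ('[', ']', ''):  # skip the array brackets and blank lines
            continue
        entity = json.loads(line)
        if is_human(entity):
            out.write(json.dumps(entity, ensure_ascii=False) + '\n')
```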