
Partial dumps
Open, MediumPublic

Description

The dumps are growing very big and are too cumbersome for people to work with. They also contain a large amount of data that is not relevant for specific use cases. One option to solve this is to split the dumps meaningfully. We are already splitting them between Lexemes and Items + Properties. We need to figure out which other meaningful splits there are; ideally we divide the data into distinct subsets.

Current alternatives:

Relevant notes and discussions:

Details

Reference
bz44581

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:28 AM
bzimport set Reference to bz44581.
bzimport added a subscriber: Unknown Object (MLST).

A list of entities is very different from limiting by entity type. Once this bug is taken up, it should first be split into two.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

In addition to the entities given explicitly, any properties used in describing the entities should be automatically included in the dump.

What about other linked items in statements?

I wonder if this is still worth pursuing. Is there demand for it? What are the concrete use cases it would serve?

hoo subscribed.

Given the speed at which the dumps grow, we need to look into this again.

I'd suggest splitting (at least the nt dumps, as I work mostly with those and therefore have an opinion on that side) into the following:

  1. Triples regarding properties (where the property is the subject)
  2. Triples that contain language information (aka the labels)
  3. The "pure" direct triples

If someone works with those, I think it's reasonable to assume they need just one of those dumps, or are able to combine them. I'm not sure, however, how much of that is already done (I think property triples (1.) aren't in the nt dump at the moment anyway, are they?)
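For illustration, here's a minimal sketch of that three-way split, assuming the standard Wikidata RDF layout (wd:P… subjects for property entities, language-tagged literals for terms, wdt: predicates for the direct triples); the file names are hypothetical:

```python
# Minimal sketch of the three-way split proposed above, assuming the
# standard Wikidata RDF layout; file names are hypothetical.
import gzip
import re

PROPERTY_SUBJECT = re.compile(r'^<http://www\.wikidata\.org/entity/P\d+>')
LANG_TAGGED = re.compile(r'"@[a-zA-Z-]+ \.\s*$')  # ..."@en .
DIRECT_PREDICATE = '<http://www.wikidata.org/prop/direct/'

with gzip.open('wikidata-truthy.nt.gz', 'rt', encoding='utf-8') as src, \
        open('properties.nt', 'w', encoding='utf-8') as props, \
        open('terms.nt', 'w', encoding='utf-8') as terms, \
        open('direct.nt', 'w', encoding='utf-8') as direct:
    for line in src:
        if PROPERTY_SUBJECT.match(line):
            props.write(line)    # 1. triples where a property is the subject
        elif LANG_TAGGED.search(line):
            terms.write(line)    # 2. language-tagged literals (the labels)
        elif DIRECT_PREDICATE in line:
            direct.write(line)   # 3. the "pure" direct triples
```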

These types of dumps should be considered:
Terms

  1. Dump of all labels
  2. Dump of all descriptions
  3. Dump of all aliases
  4. Dumps of all labels in a specific language
  5. Dumps of all descriptions in a specific language
  6. Dumps of all aliases in a specific language
  7. Dump of all terms (optional)
  8. Dumps of all terms in a specific language (optional)

Sitelinks

  1. Dump of all sitelinks
  2. Dumps of all sitelinks in a given wiki
  3. Dumps of all entities with sitelinks in a given wiki

Statements

  1. Dump of all statements
  2. Dump of all truthy statements
  3. Dumps of all statements for a property
  4. Dumps of all truthy statements for a property
  5. Dumps of all entities with statements for a property

Other

  1. Dump of all page properties (wikibase:statements, wikibase:sitelinks)

Users could easily build a custom dump by combining several of the dump types above.
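Since N-Triples is line-oriented, such a combination can be a plain concatenation of the parts. A minimal sketch, where the file names are hypothetical examples of the types above:

```python
# Minimal sketch: N-Triples is line-oriented, so a custom dump is just
# the concatenation of the needed parts; file names are hypothetical.
import gzip
import shutil

parts = ['labels-en.nt.gz', 'sitelinks-enwiki.nt.gz', 'truthy-P31.nt.gz']
with gzip.open('custom.nt.gz', 'wb') as out:
    for part in parts:
        with gzip.open(part, 'rb') as src:
            shutil.copyfileobj(src, out)
```

Duplicate triples across parts are harmless under RDF set semantics and can be removed afterwards (e.g. with sort -u) if needed.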

Lydia_Pintscher renamed this task from Partial RDF dumps to Partial dumps.Dec 19 2019, 9:39 AM
Lydia_Pintscher raised the priority of this task from Low to Medium.
Lydia_Pintscher updated the task description.
Lydia_Pintscher added subscribers: abian, Nikki, Aklapper.

https://tools.wmflabs.org/wdumps/ provides a way to generate a partial dump, but the dump cannot be regenerated on a regular schedule.

Filtering dumps by area of interest is convenient if a good criterion can be found to identify items relevant to the topic. It would probably make sense to also include any items directly referenced, to provide the immediate context of the items.

However, for this to be useful, a great many such specialized "area of interest" dumps would have to exist, with substantial overlap. If WMF can afford that in terms of resources, it would certainly be nice to have.

But perhaps there is a different way to slice this: create a stub dump that filters out most of the statements (and sitelinks?), providing labels, descriptions, and aliases, plus instance of and subclass of.

Another approach would be to focus on structure rather than topic: e.g. export all items that have (or are the subject of) a parent taxon property, and include only terms and maybe a very limited set of properties. Similarly, dumps that contain the geographical inclusion structure, the genealogical structure, or a historical timeline may be useful.
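For the stub-dump idea above, a minimal sketch that keeps only the term triples plus instance of (P31) and subclass of (P279), assuming the standard Wikidata RDF predicate IRIs (file names hypothetical):

```python
# Minimal sketch of the stub-dump idea: keep only the term triples plus
# instance of (P31) and subclass of (P279). Predicate IRIs follow the
# standard Wikidata RDF mapping; file names are hypothetical.
import gzip

KEEP = (
    '<http://www.w3.org/2000/01/rdf-schema#label>',      # labels
    '<http://schema.org/description>',                   # descriptions
    '<http://www.w3.org/2004/02/skos/core#altLabel>',    # aliases
    '<http://www.wikidata.org/prop/direct/P31>',         # instance of
    '<http://www.wikidata.org/prop/direct/P279>',        # subclass of
)

with gzip.open('wikidata-truthy.nt.gz', 'rt', encoding='utf-8') as src, \
        gzip.open('stub.nt.gz', 'wt', encoding='utf-8') as out:
    for line in src:
        if any(p in line for p in KEEP):
            out.write(line)
```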

I might be wrong but I feel there's a large demand for couple of types of dumps and there's a long tail that we can't afford to have. For example, having a dump of all humans is very very useful (even I need it for one of my tools) and there might be a request for dumps that can be easily handled through WDQS+scraper but for example just getting list of all humans times out in WDQS (understandably)
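For what it's worth, the all-humans case can be approximated by streaming the JSON dump instead of querying WDQS; a minimal sketch, assuming the documented one-entity-per-line layout of latest-all.json.gz (the output name is hypothetical):

```python
# Minimal sketch of an "all humans" extract built by streaming the JSON
# dump rather than querying WDQS; assumes the documented one-entity-per-line
# layout of latest-all.json.gz; the output file name is hypothetical.
import gzip
import json

def is_human(entity):
    # True if the item has an instance of (P31) claim with value Q5 (human).
    for claim in entity.get('claims', {}).get('P31', []):
        value = claim.get('mainsnak', {}).get('datavalue', {}).get('value', {})
        if isinstance(value, dict) and value.get('id') == 'Q5':
            return True
    return False

with gzip.open('latest-all.json.gz', 'rt', encoding='utf-8') as src, \
        gzip.open('humans.ndjson.gz', 'wt', encoding='utf-8') as out:
    for line in src:
        line = line.rstrip().rstrip(',')
        if line in ('[', ']', ''):  # skip the array brackets and blank lines
            continue
        entity = json.loads(line)
        if is_human(entity):
            out.write(json.dumps(entity, ensure_ascii=False) + '\n')
```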