
Add support for datasets
Closed, Declined · Public

Description

"A dataset corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question."

For inspiration see:
http://json-stat.org/schema/
http://dataprotocols.org/json-table-schema/
http://www.w3.org/wiki/WebSchemas/Datasets

Related discussions:
https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Archive/2014/03#What.27s_the_plan_for_heavy_data.3F
https://meta.wikimedia.org/wiki/Talk:DataNamespace


Version: unspecified
Severity: enhancement

Details

Reference
bz62555

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 3:02 AM
bzimport set Reference to bz62555.
bzimport added a subscriber: Unknown Object (MLST).

So in practice we would use this where a property has hundreds of different values distinguished by qualifiers,

e.g. a table of population property values for a particular administrative region, with columns for date, sex, age from, age to, race, religion, source, basis, preferred/deprecated, etc.?

Is this just a presentation thing or a change to the data format?

There are important differences:
1. Fixed structure: theoretically you could convert any dataset into a classical statement-qualifier structure, but not always the other way round. And there is no need to.
2. No need to source each value: a dataset is part of a statement, so the source goes on the statement it is linked from, not on the page where the information is stored (which makes it much easier to manage).
3. Non-editable content: like media files, datasets are not expected to change. You could clear and repopulate them, though (like uploading a new version of a file on Commons). Defining which property corresponds to each row, defining which Q-items are used in the dataset, and checking datatype constraints should be done only on dataset creation/upload.

In any case it provides advantages over plain tabular data, like re-using the well-defined existing Wikidata datatypes (and those to come), re-using visualization options that might be developed for queries, etc.

Set to Lowest for now.

How would this be used practically within Wikidata? From my understanding, this will contain typed data of different kinds as well as some other types of information.

Basically, how would all the features of Wikidata (sources, qualifiers, ranks) and functions such as searching, sorting, querying, Lua and parsing be integrated with this?

Wontfixing this per the lack of answers to John's fundamental questions. I'm sorry, but we really shouldn't do this if we don't want to break some basics of Wikidata.

(In reply to John F. Lewis from comment #3)

How would this be used practically within Wikidata? From my understanding, this will contain typed data of different kinds as well as some other types of information.

Basically, how would all the features of Wikidata (sources, qualifiers, ranks) and functions such as searching, sorting, querying, Lua and parsing be integrated with this?

If dataset/chart/table were considered its own datatype, then sources, qualifiers, and ranks could work as with other statements. An item could contain a statement whose value is a chart, and sources, qualifiers, and ranks would be set through the normal interface (the "add source" button and so on).

As Yair said, the data would be considered a (JSON?) blob and sourced/ranked/qualified as a whole. The blob could reside on its own namespace page (Dataset:X), which wouldn't be editable the way other items are. Instead, new revisions could be uploaded, and the data typing would be tested on upload.

The upload test would consist of:

  • checking that the data conforms to the WD data representations
  • checking that every claim has the same claim-qualifier structure as the first claim (see the sketch below)
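
A minimal sketch of the second check, assuming the uploaded blob has been parsed into a Lua table of rows where each row keeps its qualifiers keyed by property ID (the function and field names are illustrative, not an existing Wikibase API):

local function sameStructure(rows)
    -- Build a signature from the sorted qualifier property IDs of one row.
    local function signature(row)
        local keys = {}
        for prop in pairs(row.qualifiers or {}) do
            keys[#keys + 1] = prop
        end
        table.sort(keys)
        return table.concat(keys, ',')
    end
    local expected = signature(rows[1])
    for i = 2, #rows do
        if signature(rows[i]) ~= expected then
            return false, i  -- row i deviates from the structure of the first claim
        end
    end
    return true
end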

Lua modules could access it like other Wikidata items (since the datatypes, structure, etc. are the same as for any item), with the difference that the data is considered static and its structure always the same.

For instance, this table [1] would be called "Dataset:Populations with multiracial identifiers in CA 2010". It would be used as the value of a claim:
"Demographics of California (Q3044234)" <census data> "Dataset:Populations with multiracial identifiers in CA 2010"
Sourced and qualified as usual ("year of creation: 2010").
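
As a rough illustration, such a claim could serialize along these lines, shown here as a Lua table mirroring the Wikibase statement layout; the "dataset" datatype and the property ID P9999 are made up for the example:

local claim = {
    mainsnak = {
        snaktype = 'value',
        property = 'P9999',    -- hypothetical "census data" property
        datatype = 'dataset',  -- hypothetical new datatype
        datavalue = {
            type = 'string',
            value = 'Dataset:Populations with multiracial identifiers in CA 2010',
        },
    },
    rank = 'normal',
    qualifiers = {},   -- e.g. "year of creation: 2010", added as usual
    references = {},   -- sourced as usual
}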

The dataset itself would be represented as claims (*) with qualifiers (--):
*Group: White
--population:15,763,625
--percentage:42.3%
*Group: Hispanic
--population:14,013,719
--percentage:37.6%
etc.

This would translate visually into a non-editable spreadsheet:
{|
|-
! Group !! population !! percentage
|-
| White || 15,763,625 || 42.3%
|-
| Hispanic || 14,013,719 || 37.6%
|}
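
The blob behind the Dataset page could be as simple as a static table of rows that all share the same structure, in the spirit of what a data module loaded with mw.loadData returns today (the field names are placeholders, not a defined format):

-- Dataset:Populations with multiracial identifiers in CA 2010 (sketch)
return {
    { group = 'White',    qualifiers = { population = 15763625, percentage = 42.3 } },
    { group = 'Hispanic', qualifiers = { population = 14013719, percentage = 37.6 } },
    -- etc.
}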

I hope this answers John's fundamental questions. Lydia, do you still think that it would break some Wikidata basics? If so, could it be considered for Wikibase-Commons?

[1] https://en.wikipedia.org/wiki/Demographics_of_California#2010_Populations_with_multiracial_identifiers

Yes, I do think so, because none of this addresses how it'd be treated in queries, for example. If you have a spreadsheet of population figures, how is this going to show up in searches for "population > 5 million" and so on? And where do you draw the line between having such data in a "tabular datatype" that consists of other datatypes and having them as statements of their own? I'm sorry folks, but there are so many conceptual issues with this...

The thing with datasets is that they contain data that is not usually included in Wikidata proper, because the effort required to enter and maintain it as regular data would be too big. By offering a simplified alternative, at least the data could be shared without copying and pasting. It also avoids creating Wikidata pages with so many statements that they cannot be loaded. And perhaps it can be used to generate visual representations.

Besides, it doesn't need to show up in searches other than in the same way that Commons files do.

As an implementation example, see:
http://datahub.io/

Example 1: "2011 Annual Report for the Vancouver Landfill"
http://datahub.io/dataset/2011-city-of-vancouver-landfill-quantities-of-nuisance-waste-and-recyclable-materials/resource/6fdf7864-415b-4a11-88f6-351645cf802f

Example 2: "Spanish Premier Football league 2013/2014"
http://datahub.io/dataset/spain-football-match-data-la-liga-primera-segunda/resource/d2a579f9-d3aa-49e8-8bc5-e63db55106d1

Example 3: http://datahub.io/dataset/municipal-organics-diversion-carbon-credits-for-carbon-neutral-reporting-2012-reporting-year/resource/b4ee5c88-08a5-4f2c-a5ab-f301a3a6d956

All of that is data that probably won't make it into Wikidata, but it might be useful for Wikimedia projects if it lives in a central, structured repository.

Perhaps Wikidata wouldn't be the right place to store it, but Wikibase could provide the technology to another site.

I agree this is needed (somewhere). If there are no native data islands, they will appear as data wrapped in code, instead of code having nice libraries (built on top of mw.loadData) to access structured datasets.

https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.loadData
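
For comparison, this is roughly how a module already consumes a static data page through mw.loadData today; a dataset namespace could expose its blobs to Lua in the same way. The module and data-page names below are invented for the example, and the data page is assumed to return a plain array of row tables:

-- Module:PopulationReport (hypothetical)
local rows = mw.loadData('Module:Sample census data')  -- hypothetical static data page

local p = {}

-- Sum the population column across all rows of the dataset.
function p.total(frame)
    local sum = 0
    for _, row in ipairs(rows) do
        sum = sum + row.population
    end
    return sum
end

return p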