[Story] Add a new datatype for geoshapes
Closed, Resolved (Public)

Description

Add support for geoshapes as a datatype to Wikidata, to accurately represent linear and polygon geographic objects.

There are over 5,000 KML files located at https://en.wikipedia.org/wiki/Special:WhatLinksHere/Template:Attached_KML; it would be nice if these could be moved to Wikidata without file conversion, and thus made available to all Wikimedia projects (note https://www.wikidata.org/wiki/Q54725 which has an article in 25 Wikipedias).

See Also:
T28059: Add support for KML/KMZ filetype
T55023: Support for GPS eXchange Format (GPX)

Current workaround: https://www.wikidata.org/wiki/Property:P3096

Event Timeline

We should look into using http://leafletjs.com for this.

This is out of scope. Once we are able to input, store, and output spatial data, we can turn to using it. One foot in front of another.

@Kolossos: You're making a very good point about geojson very likely being too big. Do you have recommendations for other ways to store it?

We should use whatever method has been used by every other project working with spatial data, and that likely means, well, a regular database (spatial extensions were integrated into the main trees long ago). MySQL and MariaDB are no strangers to spatial data. Unless technical constraints are identified, we should not impose arbitrary constraints.

I see two options:
* Handle geodata like a file and allow uploads of KML/GPX/GeoJSON to Commons.
* Handle it in a geodatabase (PostgreSQL/PostGIS), where you could make complex queries, etc.

I would prefer the second way. We also have to keep in mind that some geodata is not only big but also has a complex structure. For example, the border of Russia is composed of different parts (border to Country A, border to Country B, ...), and you want to reuse these elements to define, e.g., the border of Country A.

The OpenStreetMap data structure can handle such things by using Nodes, Ways, and Relations. We also know from WIWOSM how to link between both projects. For queries, we have software in the form of Overpass API/Overpass turbo.

On the other hand, maintaining a full OSM software stack seems more complex than the file-based approach.

I think the data should be stored in the underlying database, using whatever format the underlying database uses, and allow input of any format which can be converted to it. I think that means WKT/WKB in the database, and allowing most formats as input/output, but this should be left to the assigned implementer. If the assignee finds it easier or somehow better to store this underlying data on a sister project, then OK, but otherwise the suggestion seems unnecessarily complicated.
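
To make the "WKT in the database, other formats as input/output" idea concrete, here is a minimal sketch of converting a GeoJSON geometry to WKT. The function names are hypothetical and only Point, LineString, and Polygon are handled; it is an illustration of the conversion, not a proposal for the actual implementation.

```python
from typing import Any, Dict, Sequence


def _coords_to_wkt(coords: Sequence[Sequence[float]]) -> str:
    """Format a coordinate list as '(x y, x y, ...)', ignoring any altitude."""
    return "(" + ", ".join(f"{x!r} {y!r}" for x, y, *_ in coords) + ")"


def geojson_to_wkt(geometry: Dict[str, Any]) -> str:
    """Convert a GeoJSON geometry object (hypothetical helper) to a WKT string."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Point":
        x, y = coords[:2]
        return f"POINT ({x!r} {y!r})"
    if kind == "LineString":
        return "LINESTRING " + _coords_to_wkt(coords)
    if kind == "Polygon":
        # A GeoJSON polygon is a list of rings: the outer ring, then any holes.
        return "POLYGON (" + ", ".join(_coords_to_wkt(r) for r in coords) + ")"
    raise ValueError(f"unsupported geometry type: {kind}")


square = {
    "type": "Polygon",
    "coordinates": [[[30, 10], [40, 40], [20, 40], [10, 20], [30, 10]]],
}
print(geojson_to_wkt(square))
# → POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
```

Both formats use longitude/latitude (x, y) order, which is why the coordinates can be copied over without swapping.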

Enforcing a conversion to a geodatabase format without any migration path is not a good idea, however, and will lead to the loss of data or a reluctance to migrate data to Wikidata.

I say we use the current de facto migration path to migrate data from external projects to Wikidata, e.g., from Wikipedia or Commons. My understanding is that this is ad hoc, but again I think this is out of scope.

Again, I say using WKT/WKB in the database, and allowing most formats as input/output. It's been several years since we've known we'd need this. This is blocking numerous likely properties. The longer we wait, the more work will be required to actually implement this and we will also need to merge this into changes like T123565 and all future changes made in this domain. This is blocking pretty much every non-point geographic dataset and their integration not only in Wikidata but Wikipedia as well. Project leadership needs to step up and lead.

Storing Geo-Shapes "natively" in the database would require a completely new mechanism for versioning editable content. This may be different in a year, but until now, MediaWiki only allows "primary" (versioned) content to be a single blob of data (traditionally wikitext, JSON for Wikibase/Wikidata).

We can copy information from that blob to database tables for queries, but that would not be the primary storage format; it would be something like a materialized view. As you know, we do the same for links, image usage, etc.

For now, the primary data storage has to be a single blob: embedded in the Wikibase JSON, or a separate page with its own content model, or an uploaded file. Versioned "native" storage is currently not feasible as far as I can see.

Interactive team plans to work on implementing on-wiki GeoJSON storage on Commons. This data will be easily usable directly from maps & graphs, similar to how you can already show geoshapes directly from OpenStreetMap. See examples.

Commons Datasets with GeoJSON support were enabled two weeks ago. See the help page and an endangered habitat example. What else is needed to close this task?

@Yurik we still need a Wikibase datatype that allows us to reference GeoJSON pages.

@daniel Ah, thanks, I misunderstood. Do you want to handle these pages the same way you handle images on Commons, or do you want to treat them as entities? My guess is that you are going for the former, and will wait for the structured (meta) data project to take off first.

@Yurik yes, indeed, we want to treat them similar to how we handle commons media. It's not exactly the same though, so it needs a separate datatype implementation.

@daniel, thanks. Will you want to group them together, or have one datatype per content handler? For now, we have .map and .tab, but eventually we might want to expand that list. I am already thinking of .tabheader, which would allow multiple .tab pages to use a shared header information page (this way, weather datasets don't each have to store the header localizations; they would be stored in just one shared page instead). I'm sure we can come up with more "real" types of data, like .json for "anything hierarchical", or more specific things like .schema for JSON validation.

@Yurik think they need to be separate datatypes (at least geo separate from other data)

@aude what would be the benefit? I'm guessing the property would need to be restricted to a specific target type, but can we do that with a regex (e.g. /.\.map$/) ?

@Yurik think it would be easier to search/query if they were separate, and also display and handle other aspects differently

agree, visualization aspect is important (i wonder if visualization should be per-property though). As for searching - do we have a per-value-type search rather than per-property? I'm not arguing for-or-against separating them yet, just trying to understand what value multiple datatypes would have vs their cost. Thx :)

@Yurik Wikibase doesn't know about what properties there are but can handle things at the data type level.

not sure I understand what you mean by per-value-type search.

For querying, it would be great to be able to find points within a shape (if the shapes are cleanly enclosed polygons), or within x distance of a shape, etc. Mixing different content types in the same datatype or property would complicate this type of query.

Ah, gotcha. Yep, totally makes sense for searching :) Thanks for explaining!

or all polygons, with instance of (P31) of something, that include a particular point, etc.
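
As an illustration of the "points within a shape" query discussed above, here is a minimal ray-casting sketch. It assumes planar coordinates and GeoJSON-style polygons (outer ring first, then holes); the function names are hypothetical, and a real query service would presumably rely on a spatial index rather than a loop like this.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_in_ring(point: Point, ring: Sequence[Point]) -> bool:
    """Ray casting: count how many ring edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_polygon(point: Point, polygon: Sequence[Sequence[Point]]) -> bool:
    """True if the point is inside the outer ring and outside every hole."""
    outer, *holes = polygon
    return point_in_ring(point, outer) and not any(
        point_in_ring(point, hole) for hole in holes
    )


outer = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
hole = [(1, 1), (3, 1), (3, 3), (1, 3), (1, 1)]
print(point_in_polygon((0.5, 0.5), [outer, hole]))  # → True
print(point_in_polygon((2, 2), [outer, hole]))      # → False (inside the hole)
```

This only gives sensible answers for cleanly enclosed rings, which is the same caveat raised above about the shapes themselves.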

Change 330228 had a related patch set uploaded (by Jonas Kress (WMDE)):
Added support for CommonsDataType

https://gerrit.wikimedia.org/r/330228

Change 330230 had a related patch set uploaded (by Jonas Kress (WMDE)):
[WIP] Added support for CommonsDataType

https://gerrit.wikimedia.org/r/330230

So we will add a new datatype for geo-shapes that allows selecting .map files from commons.

The initial request mentioned .kml files.
Do we still need to support this filetype?

@Jonas KML is a separate thing and not part of this task or part of commons data.

PS. don't think there are .map files on commons. (only pages named *.map)

Yes, let's please be very clear about files vs pages. This is a fundamental distinction both for MediaWiki internally and for Commons editors.

This distinction makes little sense to casual visitors, and perhaps should go away on the technical side at some point, but we are quite far from that. For now, when discussing this with developers and editors, we have to be very clear about whether we mean files or pages. So, geo-shapes and tabular data on Commons are pages. They have a content model, but no MIME type. Their page title does have an "extension", like files do, which determines the content model (because the extension that defines the content model says so). But they are not media files, and cannot be treated as such.
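
A small sketch of the "title suffix determines the content model" idea, to show that nothing file-like is involved. The mapping and function name are hypothetical; "Map.JsonConfig" is the content type named elsewhere in this thread, and "Tabular.JsonConfig" for .tab pages is an assumption made for illustration.

```python
from typing import Optional

# Assumed mapping from page-title suffix to content model (illustrative only).
CONTENT_MODEL_BY_SUFFIX = {
    ".map": "Map.JsonConfig",
    ".tab": "Tabular.JsonConfig",  # assumption, not confirmed in this thread
}


def content_model_for_title(title: str) -> Optional[str]:
    """Return the content model implied by a Data: page title, or None."""
    if not title.startswith("Data:"):
        # Not a data page; e.g. "File:Foo.map" would be an uploaded media file.
        return None
    for suffix, model in CONTENT_MODEL_BY_SUFFIX.items():
        if title.endswith(suffix):
            return model
    return None


print(content_model_for_title("Data:Berlin.map"))  # → Map.JsonConfig
print(content_model_for_title("File:Berlin.map"))  # → None
```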

Change 330228 merged by jenkins-bot:
Added support for geo shapes

https://gerrit.wikimedia.org/r/330228

Change 330230 abandoned by Jonas Kress (WMDE):
Added support for geo-shape

Reason:
Split up big patch into small ones

https://gerrit.wikimedia.org/r/330230

But they are not media files, and cannot be treated as such.

As I also mentioned in T28059: Add support for KML/KMZ filetype, however, it's probably quite simple to write JS tools around .map pages that allow you to import KML and GPX files into these .map pages and export them again (not losslessly, of course, but that's also why it is much simpler to implement than adding support for them as media files).

I've done something similar for .tab pages to import and export CSV and xlsx files.
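
To give a feel for how small such an import tool could be, here is a rough sketch (in Python rather than JS) that extracts LineString placemarks from KML and emits a GeoJSON FeatureCollection. The function name is hypothetical, only LineStrings are handled, and styles, altitude, and other KML features are dropped, which is the lossy part mentioned above. Wrapping the result into the full .map page structure is left out.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def kml_linestrings_to_geojson(kml_text: str) -> Dict[str, Any]:
    """Convert LineString placemarks in a KML document to GeoJSON features."""
    root = ET.fromstring(kml_text)
    features: List[Dict[str, Any]] = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        path = ".//kml:LineString/kml:coordinates"
        for coords in placemark.iterfind(path, KML_NS):
            # KML tuples are "lon,lat[,alt]" separated by whitespace; drop altitude.
            line = [
                [float(v) for v in tup.split(",")[:2]]
                for tup in (coords.text or "").split()
            ]
            features.append({
                "type": "Feature",
                "properties": {"name": name},
                "geometry": {"type": "LineString", "coordinates": line},
            })
    return {"type": "FeatureCollection", "features": features}
```

Since KML and GeoJSON both order coordinates as longitude, latitude, no axis swapping is needed.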

Change 339168 had a related patch set uploaded (by Aleksey Bekh-Ivanov (WMDE)):
Add InterWikiLinkWikitextFormatter

https://gerrit.wikimedia.org/r/339168

Change 339208 had a related patch set uploaded (by Aleksey Bekh-Ivanov (WMDE)):
Make geo-shape validator configurable

https://gerrit.wikimedia.org/r/339208

Change 339220 had a related patch set uploaded (by Aleksey Bekh-Ivanov (WMDE)):
Make geo-shape storage baseUrl configurable

https://gerrit.wikimedia.org/r/339220

Change 340147 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Trivialize InterWikiLinkWikitextFormatterTest

https://gerrit.wikimedia.org/r/340147

Change 339168 merged by jenkins-bot:
Add InterWikiLinkWikitextFormatter

https://gerrit.wikimedia.org/r/339168

Change 340149 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
Cleanups and fixes to both InterWikiLink…Formatters

https://gerrit.wikimedia.org/r/340149

Change 339220 merged by jenkins-bot:
Make geo-shape storage baseUrl configurable

https://gerrit.wikimedia.org/r/339220

Change 340149 merged by jenkins-bot:
Cleanups and fixes to both InterWikiLink…Formatters

https://gerrit.wikimedia.org/r/340149

Change 340147 merged by jenkins-bot:
Trivialize InterWikiLinkWikitextFormatterTest

https://gerrit.wikimedia.org/r/340147

Change 341509 had a related patch set uploaded (by jk):
[data-values/value-view] Use fulltext search in commons suggester

https://gerrit.wikimedia.org/r/341509

WMDE-leszek subscribed.

According to @Jonas, all functionality intended for the baseline implementation has been created. This also seems to be reflected in the subtask graph (all subtasks are resolved).
There are two smaller features that we've committed to provide but that are not part of the baseline, so they'll be tracked separately. @Jonas will open tickets for them.
In conclusion, I am moving this ticket to the Done column of the sprint board. I'll leave the decision on marking the tickets as "resolved" to @Lydia_Pintscher.
And as a final word, I'd like to express the hope that this is the last "Story" ticket we have in the "Review" column of the sprint board.

Change 339208 merged by jenkins-bot:
[mediawiki/extensions/Wikibase] Make geo-shape validator configurable

https://gerrit.wikimedia.org/r/339208

If I enter a geo-shape through the API, does the validator only check that the target page:

  • Is on Commons
  • Has content type Map.JsonConfig

Or does it also look at the page name/namespace etc?
Wondering for the sake of T161726: Support new geo-shape datatype in Pywikibot

AFAIK it checks that the title matches

'/^Data:[^\\[\\]#\\\:{|}]+\.map$/u'

and checks if the page exists on commons.
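
For Pywikibot-style client code, the title check could be mirrored roughly as follows. This is a hand translation of the PHP pattern above into Python's `re` syntax, so treat it as a sketch: the constant and function names are made up, and the second check (that the page actually exists on Commons) is not covered here.

```python
import re

# Python translation of the PHP pattern quoted above. The character class
# excludes [ ] # \ : { | } from the part between "Data:" and ".map".
GEO_SHAPE_TITLE = re.compile(r"^Data:[^\[\]#\\:{|}]+\.map$")


def looks_like_geo_shape_title(title: str) -> bool:
    """Client-side pre-check only; the server remains the authority."""
    return GEO_SHAPE_TITLE.match(title) is not None


print(looks_like_geo_shape_title("Data:Berlin.map"))   # → True
print(looks_like_geo_shape_title("Data:Berlin.tab"))   # → False
print(looks_like_geo_shape_title("Berlin.map"))        # → False
```

Subpage-style titles such as "Data:Dir/Sub.map" pass this pattern, since "/" is not among the excluded characters.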

Thanks!

Is that check true for any Wikibase installation or Wikidata only?

A Wikibase installation can change the URL of the wiki storing the geo-shapes, but the checks will stay the same.
In the future we might remove the constraint of the 'Data' namespace inside the check.

Thanks for the clarification. =)

A follow up. I just noticed that Wikibase makes use of the internal geoShapeStorageFrontendUrl instead of the shared filerepo settings for the wiki itself. Would it be possible to expose geoShapeStorageFrontendUrl through the api? Also is there a similar variable for commonsMedia?

Would it be possible to expose geoShapeStorageFrontendUrl through the api?

I opened T162561: Expose geoShapeStorageFrontendUrl through siteinfo for this

Also is there a similar variable for commonsMedia?

Just found T90492: [Task] Make Wikibase Repo work with a custom File collection, not only Wikimedia Commons which answers that part of the question.

Is there some documentation of this feature?