
RDF output should contain license info about the concrete rendering, not only the abstract description document.
Open, Stalled, Low, Public

Description

(Originally reported by User:Nevalicori at [1])

The RDF output generated by Special:EntityData/Q1.ttl and friends should have a license statement about Special:EntityData/Q1.ttl (resp Q1.rdf, Q1.n3, etc), not just about the format-neutral document URL Special:EntityData/Q1.

This could be achieved by repeating the license info for each format (or at least the format of the present document), or by using dct:isFormatOf[2] to link the different renderings (formats) of the description.
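The two options above can be sketched as follows. This is only an illustration in Python, not Wikibase code: the helper name `license_triples` is hypothetical, and the `cc:` and `dct:` namespace IRIs are the commonly used ones rather than anything confirmed by this task.

```python
# Sketch only: helper name is hypothetical; URL pattern follows the task description.
BASE = "https://www.wikidata.org/wiki/Special:EntityData/"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
CC_LICENSE = "http://creativecommons.org/ns#license"
DCT_IS_FORMAT_OF = "http://purl.org/dc/terms/isFormatOf"


def license_triples(entity_id, extension, link_formats=False):
    """Return N-Triples lines that cover the concrete rendering.

    link_formats=False: repeat the license statement for the rendering.
    link_formats=True:  link the rendering to the abstract document instead.
    """
    abstract = BASE + entity_id
    concrete = abstract + "." + extension
    # The existing statement about the format-neutral document URL.
    lines = [f"<{abstract}> <{CC_LICENSE}> <{CC0}> ."]
    if link_formats:
        lines.append(f"<{concrete}> <{DCT_IS_FORMAT_OF}> <{abstract}> .")
    else:
        lines.append(f"<{concrete}> <{CC_LICENSE}> <{CC0}> .")
    return lines
```

Calling `license_triples("Q1", "ttl")` yields one statement about Special:EntityData/Q1 and one about Special:EntityData/Q1.ttl; with `link_formats=True` the second line becomes a dct:isFormatOf link.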

[1] https://www.wikidata.org/w/index.php?title=Wikidata:Contact_the_development_team&oldid=163110219#Wikidata_licensing_triples
[2] http://udfr.org/docs/onto/dct_isFormatOf.html


Version: unspecified
Severity: normal
Whiteboard: u=dev c=backend p=0
URL: https://www.wikidata.org/wiki/Special:EntityData/Q1.rdf

Details

Reference
bz71991

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:50 AM
bzimport set Reference to bz71991.
bzimport added a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

Looking at wikibase:repo/includes/rdf/RdfSerializer.php and friends, there are a few places where this could be inserted fairly safely, depending upon project preference about abstraction cleanliness.

By the looks of the way the methods are separated, RdfSerializer::buildGraphForEntityRevision() looks intended to build a clean graph without any particular awareness of the requested serialisation, with RdfSerializer::serializeRdf() performing the actual serialisation of the actual graph.

This presents a little bit of a quandary: I can't see any reason why either method (or the helper method RdfSerializer::serializeEntityRevision()) couldn't perform graph modifications themselves, but it feels messy. On the other hand, RdfBuilder doesn't know what format the built graph will be serialised as (nor should it), but it also doesn't obviously know what all of the serialisations might be, either, so a generic solution there would require a bit more effort.

As a first pass, inserting something akin to:

/* Indicate that the concrete representation is licensed as the abstract document is */
$concreteResource = $builder->getGraph()->resource( $dataURL . '.' . $this->format->getDefaultExtension() );
$concreteResource->addResource( RdfBuilder::NS_CC . ':license', 'http://creativecommons.org/publicdomain/zero/1.0/' );

around here would at least achieve the desired result, if a little more messily than one might like.

(For bonus points one could throw in the dct:hasFormat or dct:isFormatOf, too, though I'd keep the licensing statement attached to the concrete representation regardless to limit the hoops that processors need to jump through before they can determine whether something is licensed for their needs or not; self-interest side-note: I have such a discriminating processor).

Note: I don't have a running copy of Wikibase here, so this is entirely mental programming, apologies for any stupid errors.

Not sure I understand this one - is it about putting license in each entity:Q* ? I think that would be a bit redundant, especially in dumps. Especially given that we have the same license for everything.

Smalyshev lowered the priority of this task from Medium to Low. Mar 20 2015, 7:04 PM

Hold on, you're not sure you understand but you've lowered the priority anyway?

The problem is that the current licensing triple is redundant, because it's useless: there is currently no triple in the concrete serialisation which actually describes the licensing of that document.

Noting that HTTP is stateless, the request you make for the Turtle is a GET for /wiki/Special:EntityData/Q1.ttl. That document contains triples which relate it to /wiki/Q1. There is also a licensing triple whose subject is /wiki/Special:EntityData/Q1, but there is no information relating that subject to the other two.

You could invent some data (e.g., by using dct:hasVersion), except that I'm sceptical that any consuming application would be able to jump through those hoops without being given special knowledge of Wikidata (which defeats the purpose of linked open data).

In short, you're generating machine-readable licensing data that only a human being can actually interpret. The simplest change is to change the current subject of the document description triples from the abstract document to the concrete serialisation. That is, change the triples whose subject is /wiki/Special:EntityData/Q1 to /wiki/Special:EntityData/Q1.ttl (et al).
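To make the consumer's view concrete, here is a minimal sketch of a processor that only trusts license statements about the URI it actually requested. The function name `license_for` and the tuple representation of triples are assumptions for illustration; the `cc:license` IRI is the commonly used one.

```python
# Sketch only: triples are modelled as (subject, predicate, object) tuples.
CC_LICENSE = "http://creativecommons.org/ns#license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
ABSTRACT = "https://www.wikidata.org/wiki/Special:EntityData/Q1"
CONCRETE = ABSTRACT + ".ttl"


def license_for(request_uri, triples):
    """Return the licenses stated directly about the retrieved URI."""
    return [o for (s, p, o) in triples if s == request_uri and p == CC_LICENSE]


# Current output: the license is attached to the abstract document only.
current = [(ABSTRACT, CC_LICENSE, CC0)]
# Proposed output: the license is attached to the concrete serialisation.
proposed = [(CONCRETE, CC_LICENSE, CC0)]
```

With the `current` triples, `license_for(CONCRETE, current)` finds nothing, because no statement has the request URI as its subject; with `proposed` it finds the CC0 URI.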

I lowered the priority because whatever the licensing question is, I'm pretty sure it does not prevent any usage of the RDF data, as licensing of the Wikidata data is widely known and there's no copyright issue with it. Thus, I would consider it to be of a lower priority than the data issues that we're dealing with now. That doesn't mean it should not be considered or fixed - that just means it's less pressing currently than, say, figuring out how to represent dates or labels or quantities right. Of course, I can be mistaken in this regard and then the priority would be raised.

That is, change the triples whose subject is /wiki/Special:EntityData/Q1 to /wiki/Special:EntityData/Q1.ttl (et al).

Not sure it's a good idea since .ttl is just a format, we're describing the data item, not a format. I.e. Q1.ttl, Q1.rdf and Q1.nt describe the same thing, they just use different bytes to do it. Thus, I don't think it is right to make them generate different triples. Unfortunately, I don't know any way in RDF to have a triple relating to "this document" (as opposed to "data presented in this document") but if you do please suggest. Then we could attach license info to that.

Thanks for clarifying the priority issue.

While it might be repetitious, it's not unusual to license different serialisations differently (Wikidata doesn't, but other sites do: the HTML representation may well be under a more restrictive license than the RDF, for example). From a processor's point of view (at least, one that actually cares about licensing and isn't a human being), it needs to be able to understand that "this resource that I retrieved is under X licence" (which is really the point of licensing predicates).

As the resource URI really is /wiki/Special:EntityData/Q1.ttl, that's the resource which needs to have licensing data expressed about it.

You could say “all serialisations of this document are licensed under these terms”, but you'd have to relate each serialisation to the abstract document URI and be fairly confident that consumers would adopt it as a practice. We could update our processor to follow a <concrete-uri> dct:isVersionOf <abstract-uri> to look for licensing data, although I don't know if anybody else would go to the trouble (it's hard enough to get licensing data included in the first place!)
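A rough sketch of what such a fallback might look like in a processor follows. The function name `find_license` and the set of link predicates it follows are assumptions; nothing here describes an existing client.

```python
# Sketch only: one-hop fallback from the concrete URI to the abstract document.
CC_LICENSE = "http://creativecommons.org/ns#license"
DCT_IS_VERSION_OF = "http://purl.org/dc/terms/isVersionOf"
DCT_IS_FORMAT_OF = "http://purl.org/dc/terms/isFormatOf"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
ABSTRACT = "https://www.wikidata.org/wiki/Special:EntityData/Q1"
CONCRETE = ABSTRACT + ".ttl"


def find_license(request_uri, triples,
                 link_predicates=(DCT_IS_VERSION_OF, DCT_IS_FORMAT_OF)):
    """Look for a license on the request URI, then on documents it links to.

    Returns (subject_that_carried_the_license, [license_uris]).
    """
    candidates = [request_uri]
    candidates += [o for (s, p, o) in triples
                   if s == request_uri and p in link_predicates]
    for subject in candidates:
        found = [o for (s, p, o) in triples
                 if s == subject and p == CC_LICENSE]
        if found:
            return subject, found
    return None, []
```

The fallback only works if the output actually contains the linking triple; without it, the processor is back to finding nothing for the request URI.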

An alternative approach would be to send it as a Link header (with rel=license) (which is what some others do), at which point the RDF itself becomes slightly moot.
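On the consumer side, reading such a header could look roughly like this. The parser is deliberately simplified (it does not handle commas or semicolons inside quoted parameter values), and the function name `license_links` is hypothetical.

```python
import re

# Matches "<target>" followed by its parameters, up to the next comma.
_LINK = re.compile(r'<([^>]*)>([^,]*)')
_REL = re.compile(r'\brel\s*=\s*"?([^";]*)"?', re.IGNORECASE)


def license_links(link_header):
    """Return the targets of all links whose rel includes 'license'."""
    targets = []
    for target, params in _LINK.findall(link_header):
        match = _REL.search(params)
        # rel may hold several space-separated relation types.
        if match and "license" in match.group(1).lower().split():
            targets.append(target)
    return targets
```

For example, a response carrying `Link: <http://creativecommons.org/publicdomain/zero/1.0/>; rel="license"` would give the processor the license without it having to parse the RDF at all.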

@Nevalicori I'd really like to have a good way to say "this document" is under license X. What do you think would be a good way to express "this document"? We could use "#", but when processing/importing the document, that would probably resolve to the "file:" URI of the local file, which would be correct, but not very useful to have in a triple store. We could also try to inject the request URL and use it as the URI for the rendering, but the request URL the application server sees may well not be what the client used to retrieve the data...

Note that "this rendering" isn't just about the different serializations, but also about revision, language filter, inclusion/exclusion of some aspects like sitelinks, using "truthy" statements or "reified" statements, etc. There's a lot of things that would need to be represented in that URI.

@daniel: do you mean 'this concrete serialisation', 'this abstract document', or 'a specific version of this abstract document'?

(Realistically, you need a URI pattern that handles all three)

At the moment, there are definitely two canonical versions of those:

/wiki/Special:EntityData/Q1.ttl - the concrete serialisation of the current version of the document

/wiki/Special:EntityData/Q1 (or data:Q1) - the current version of the abstract document which can be serialised in a number of formats

(When dereferenced, the latter always redirects to the former - of course, it could in the future not redirect, but instead just serve the equivalent content to the former and ideally send an appropriate Content-Location instead).

Versioning can be handled through a million ways, and I'm guessing some of the wikidata stack does that already through query-parameters, so let's leave that for the moment.

I might be being dim and missing the point, but I think what you're driving at is...

For not-necessarily-canonical copies (for example, a copy that's been saved to disk and may have a file:/// URI), you can express triples with a subject of <> as the concrete serialisation and express a relationship between that and the canonical URI for the document, data:Q1 (and it's up to you whether you attempt to express any kind of relationship between <> and, say, /wiki/Special:EntityData/Q1.ttl; if the two are actually the same URI, you risk stating that a subject is derived from itself, if you're not careful!)
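As an illustration of that idea, the sketch below resolves the empty relative reference `<>` against wherever the copy was loaded from and only relates it to the canonical URI when the two differ. The function name `describe_self` and the choice of dct:isVersionOf as the linking predicate are assumptions.

```python
from urllib.parse import urljoin

CC_LICENSE = "http://creativecommons.org/ns#license"
DCT_IS_VERSION_OF = "http://purl.org/dc/terms/isVersionOf"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"
CANONICAL = "https://www.wikidata.org/wiki/Special:EntityData/Q1"


def describe_self(base_uri, canonical_uri=CANONICAL):
    """Build triples about 'this document' for a copy loaded from base_uri."""
    # An empty relative reference resolves to the base URI itself.
    here = urljoin(base_uri, "")
    triples = [(here, CC_LICENSE, CC0)]
    # Avoid stating that a subject is a version of itself.
    if here != canonical_uri:
        triples.append((here, DCT_IS_VERSION_OF, canonical_uri))
    return triples
```

A copy saved at file:///tmp/Q1.ttl ends up with the file: URI as the license subject plus a link to the canonical document, which is the "correct, but not very useful" outcome mentioned earlier in the thread.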

Is that what you meant? Is that the answer?

Oh, sorry — you don't want it to resolve to the file:/// URI - i.e., you always want it to be the canonical URI. In which case, you already have those - and in the PHP that generates the RDF it can be obtained with...

$dataURL . '.' . $this->format->getDefaultExtension()

(see the minor proposed code-change at the beginning of the ticket :)

Addshore changed the task status from Open to Stalled. Jan 23 2019, 1:13 PM

@Lucas_Werkmeister_WMDE will look at this as per the ticket analysis meeting

It seems to me that this whole task assumes that the Turtle serialization, /Q1.ttl, is really a separate document. But that’s far from clear to me: my first reaction would be to say that /Q1 is the real document, and consumers shouldn’t care whether it redirects to Q1.ttl or serves Turtle directly, and always look for license statements about /Q1. This also seems to match our documentation, which names Special:EntityData/Q1 as the data URI and only mentions Q1.ttl as a convenience feature for situations where content negotiation is not possible.

However, I’m not sure if this interpretation is consistent with the HTTP 303 See Other response we currently use to redirect from /Q1 to /Q1.ttl. According to the RFC, that response code is used to redirect to a different resource – perhaps we should use 302 Found instead to redirect to a different URI for the same resource?

Hi @Lucas_Werkmeister_WMDE ,

Not quite an assumption per se; HTTP is stateless, so it doesn't really matter what kind of redirect you used to arrive at /Q1.ttl; if that ends up being the Request-URI, then that's the URL which needs to appear in the document for an automated processor to be able to interpret licensing statements properly.

With respect to which HTTP status code is actually correct… that's a slightly different issue; and it's a little tricky because there are very few LOD clients out there against which to baseline behaviour. However, in the past where I've implemented them, whether a redirect is a 301/302 or a 303 does necessarily influence behaviour to an extent:

you request <id-of-thing>:

a) if the response is 301/302, the redirect target is interpreted as an alias for <id-of-thing> (i.e., you're potentially redirecting the concept, and so the target should be added to the list of candidate URIs to look for in the document you eventually retrieve)

b) if the response is 303, the redirect target is the URL of a document containing data about <id-of-thing>

In the Wikidata case, (b) happens in the redirect from /entity/Q1 to /Special:EntityData/Q1; the complicating factor is that Wikidata is then redirecting a second time, after content negotiation. In principle, I'd be inclined to agree that 302 Found with Vary: Accept would be the correct response here (because /Q1 is, conditional upon the value of the Accept header, an alias for /Q1.ttl). However, I have no idea how processors would behave in that scenario. It's probably fine, though :)
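The distinction between (a) and (b) can be sketched as a small model of a client following a redirect chain. This is one possible reading of the behaviour described above, not a description of any particular LOD client; the function name `interpret_redirects` is hypothetical, and the paths are the shortened ones used in this thread.

```python
def interpret_redirects(start_uri, hops):
    """Group URIs by the resource they name while following redirects.

    hops is a list of (status_code, target_uri) pairs.
    301/302: the target is an alias for the resource being followed.
    303:     the target is a separate document about that resource.
    The last group holds the URIs that name the document finally retrieved.
    """
    groups = [[start_uri]]
    for status, target in hops:
        if status in (301, 302):
            groups[-1].append(target)
        elif status == 303:
            groups.append([target])
        else:
            raise ValueError(f"not a handled redirect status: {status}")
    return groups


ENTITY = "/entity/Q1"
DATA = "/wiki/Special:EntityData/Q1"
TURTLE = "/wiki/Special:EntityData/Q1.ttl"

# Behaviour described in this thread: two 303 redirects in a row.
with_303 = interpret_redirects(ENTITY, [(303, DATA), (303, TURTLE)])
# Suggested alternative: 302 for the content-negotiated hop.
with_302 = interpret_redirects(ENTITY, [(303, DATA), (302, TURTLE)])
```

Under this model, `with_303[-1]` contains only the .ttl URL, so a license statement about /wiki/Special:EntityData/Q1 does not apply to the retrieved document, whereas `with_302[-1]` contains both URLs and the existing statement would be found.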