
Prevent mirrors of Wikipedia from including NOINDEXED pages
Closed, Declined · Public

Description

Author: swalling
This may actually be impossible, but I'm filing a bug to discuss strategies for preventing mirrors of Wikipedia from including pages we NOINDEX. Good examples of this are user pages or user talk pages, and the new Draft namespace on English Wikipedia.

Technically speaking, these pages are free content just like anything else on Wikipedia (with the exception of fair use images, etc.). However, there are good reasons for us not to want some content to be indexed by search engines and found by readers.

Numerous times, I've had Wikipedians bring up the valid point that mirrors erode our ability to control search indexing, because they mirror content we NOINDEX, but do not replicate the contents of our robots.txt.

In practice, there may be no way to prevent this. Even if that's the case, we should record why and WONTFIX this, as a point of reference.


Version: wmf-deployment
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=58805

Details

Reference
bz58758

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 2:40 AM)
bzimport set Reference to bz58758.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

This may actually be impossible, but I'm filing a bug to discuss strategies for preventing mirrors of Wikipedia from including pages we NOINDEX. Good examples of this are user pages or user talk pages, and the new Draft namespace on English Wikipedia.

I can't think of any possible way to /enforce/ this, nor should we. We definitely shouldn't redact the info from the API or dumps (which I assume are the two most common ways of mirroring us).

Now, we might be able to expose the NOINDEX to reusers and encourage people to respect it, but I can't see any way of preventing people from using the content if they really want it.

(Also, this isn't really a search thing for me and Nik. NOINDEX is core, not Cirrus)

Maybe we could drop the NOINDEX'ed namespaces (and maybe even pages) from the primary dumps?

However, looking at it, the main dump people use seems to be pages-articles-multistream, which only covers the main (:), Template:, File:, Category:, and Project: namespaces – so Draft: wouldn't enter?

(In reply to comment #3)

Maybe we could drop the NOINDEX'ed namespaces (and maybe even pages) from the primary dumps?

However, looking at it, the main dump people use seems to be pages-articles-multistream, which only covers the main (:), Template:, File:, Category:, and Project: namespaces – so Draft: wouldn't enter?

Indeed. And we wouldn't want to drop them from the "full" dump, since they do still need to get dumped :)

I doubt Ariel wants a new "full dump except things that are NOINDEXED" :)

OK, this is 'crazy', I'm sure, but it's the only thing I could come up with.

Make NOINDEX add the *.domainname to the page_props of the article's DB row. Have NOINDEX also add this value into the HTML structure of the page, invisible but with an id.

The DB gets dumped.
The DB gets imported.

A different host + MediaWiki + our dump then leads to the same wikipedia.org domain name being rendered into the HTML structure.

JS in MediaWiki core checks pages for the presence of the hidden noindex element. If it finds that the value doesn't match what it expects, the JS blanks the page (see the sketch below).

For scraped content, the only thing I can come up with is to put this noindex element somewhere smack in the middle of the content (still hidden, of course), have it contain the blanking script fully inline, and then hope they scrape the HTML DOM instead of the text content – and hope they are stupid enough not to filter out scripts :D
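A minimal sketch of what that client-side check could look like, assuming a hypothetical element id (mw-canonical-domain) whose text content is the canonical domain – nothing like this exists in MediaWiki core:

```
( function () {
	// Hypothetical marker that the proposed NOINDEX handling would have
	// embedded into the rendered HTML; the id is made up for this sketch.
	var marker = document.getElementById( 'mw-canonical-domain' );

	// If the embedded canonical domain doesn't match the host actually
	// serving the page, assume we are on a mirror and blank the page.
	if ( marker && marker.textContent !== window.location.hostname ) {
		document.body.innerHTML = '';
	}
}() );
```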

I don't think we should prevent people from downloading it if they want it. All (non-deleted) content should be available for download. We could maybe encourage people to download the articles-only dump, but it's important that all content is available.

Comment 5 made me cry and I would revert any such hackery on the spot :p

(In reply to comment #6)

I don't think we should prevent people from downloading it if they want it. All (non-deleted) content should be available for download. We could maybe encourage people to download the articles-only dump, but it's important that all content is available.

This.

(In reply to comment #0)

This may actually be impossible, but I'm filing a bug to discuss strategies for preventing mirrors of Wikipedia from including pages we NOINDEX. Good examples of this are user pages or user talk pages, and the new Draft namespace on English Wikipedia.

Preventing them is a WONTFIX.

For reference, the user namespace is not NOINDEXed by default on English Wikipedia, though __NOINDEX__ works there.

Technically speaking, these pages are free content just like anything else on Wikipedia (with the exception of fair use images, etc.).

Yes, this (along with the Right to Fork) is why we must not do this. If we exclude the pages from the dumps, it will make the freedom of the content much less meaningful. It would also encourage people to mirror by crawling the HTML (or even worse, mirroring it live), which is a poor practice and loses a lot of information from the dumps.

Numerous times, I've had Wikipedians bring up the valid point that mirrors erode our ability to control search indexing, because they mirror content we NOINDEX, but do not replicate the contents of our robots.txt.

Free content means giving up some control over what people do with it. The edit screen used to say, "If you do not want your writing to be edited mercilessly and redistributed at will, do not submit it." It no longer says that, but it's just as true under our current licenses.

Wikipedia has a high overall search engine ranking, and sites simply mirroring drafts (which by definition are generally not ready for prime time) probably won't rank that high. But I accept that this could change, that it does not apply to many other sites, and that there are probably exceptions even on Wikipedia.

People have to comply with our license (attribution, stating license, etc.), but they are allowed to distribute everything with or without marking it NOINDEX. It is reasonable to encourage mirrors to preserve the robot policies on their own HTML output, though.
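As a sketch of what preserving the policy could look like on a mirror's side, assuming the mirror already knows the source page was NOINDEXed (the function name and wiring are hypothetical):

```
// Hypothetical helper for a mirror: reproduce the source wiki's robot
// policy in the mirrored HTML output by emitting the equivalent meta tag.
function robotsMetaForMirroredPage( sourceWasNoindexed ) {
	return sourceWasNoindexed ? '<meta name="robots" content="noindex">' : '';
}
```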

Since a3aac44 in 2010 (pages last saved before then don't seem to have it judging by a check of the akwiki dump), NOINDEX and INDEX have been stored in the page_props table (along with all other DOUBLEUNDERSCORE magic words). This is dumped, so it is relatively easy to check this on a per-page basis.
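The dumps ship page_props as SQL, and the same data can also be spot-checked per page through the action API's prop=pageprops module; for example (the title queried here is just an illustration):

```
// Check whether a page carries __NOINDEX__ by looking for the 'noindex'
// page prop via the action API. Example title only.
fetch( 'https://en.wikipedia.org/w/api.php?action=query&prop=pageprops' +
	'&ppprop=noindex&titles=Draft:Example&format=json&origin=*' )
	.then( function ( res ) { return res.json(); } )
	.then( function ( data ) {
		Object.keys( data.query.pages ).forEach( function ( id ) {
			var page = data.query.pages[ id ];
			var noindexed = !!( page.pageprops && 'noindex' in page.pageprops );
			console.log( page.title, noindexed ? 'NOINDEX' : 'indexable' );
		} );
	} );
```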

I don't think the namespace robot policies are currently anywhere in the dump. I've filed this as bug 58805.