Pages should not be indexed by search engines through interwiki links from other wikis
Closed, Duplicate (Public)

Description

I've noticed some pages have been indexed by Google as if they were from English Wikipedia. The URL given in this report's URL field shows the result of a query like
"B:Pt:Página_principal"
which has the following url as the only result:
http://en.wikipedia.org/wiki/B:Pt:Página_principal

For another example, see the result for "Teoria de números/Números primos":
http://www.google.com.br/search?q=%22Teoria+de+n%C3%BAmeros%2FN%C3%BAmeros+primos%22

This shouldn't happen.


Version: unspecified
Severity: enhancement
URL: http://www.google.com.br/search?q=%22B%3APt%3AP%C3%A1gina_principal%22

Details

Reference
bz28242

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:33 PM)
bzimport set Reference to bz28242.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #0)

This shouldn't happen.

And we can't fix Google.

This may be due to the sort of redirect that is in place there. But when I tried to investigate, I got a 403 Forbidden error from http://en.wikipedia.org/wiki/B:Pt:P%C3%A1gina_principal, but I think that was because of the User Agent.

We send 302 Moved Temporarily status codes (should they be moved permanently?):

bawolff@Bawolff-L:/var/www/w/$ HEAD -S -H 'User-agent: test' \
http://en.wikipedia.org/wiki/B:Pt:P%C3%A1gina_principal

HEAD http://en.wikipedia.org/wiki/B:Pt:P%C3%A1gina_principal --> 302 Moved Temporarily
HEAD http://en.wikibooks.org/wiki/Pt:P%C3%A1gina_principal --> 302 Moved Temporarily
HEAD http://pt.wikibooks.org/wiki/P%C3%A1gina_principal --> 200 OK
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Connection: close
Date: Sat, 26 Mar 2011 18:07:16 GMT
Age: 7188
Server: Apache
Vary: Accept-Encoding,Cookie
Content-Language: pt
Content-Length: 73670
Content-Type: text/html; charset=UTF-8
Last-Modified: Thu, 03 Mar 2011 19:23:27 GMT
Client-Date: Sat, 26 Mar 2011 20:07:05 GMT
Client-Peer: 208.80.152.2:80
Client-Response-Num: 1
X-Cache: HIT from sq40.wikimedia.org
X-Cache: MISS from sq36.wikimedia.org
X-Cache-Lookup: HIT from sq40.wikimedia.org:3128
X-Cache-Lookup: MISS from sq36.wikimedia.org:80
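For anyone who wants to reproduce the trace above without the `HEAD` tool, here is a minimal Python sketch that walks a redirect chain one hop at a time and records each status code. The function names `head` and `trace_redirects` are made up for illustration, and the `test` User-Agent simply mirrors the command above; this is an assumption-laden sketch, not how anyone on this bug actually checked it.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def head(url, user_agent="test"):
    """Send one HEAD request (no redirect following); return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def trace_redirects(url, fetch=head, max_hops=10):
    """Follow redirects by hand and return a list of (url, status) hops.

    `fetch` is any callable mapping a URL to (status, location); it is a
    parameter so the chain logic can be exercised without network access.
    """
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops
```

Calling `trace_redirects("http://en.wikipedia.org/wiki/B:Pt:P%C3%A1gina_principal")` would issue real requests; what it returns today depends on the current server configuration and may not match the 2011 output above.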

Please always use sentence case when changing bug summaries (initial capital letter, lowercase everything else, except words that are always capitalized like proper nouns and variables, no trailing punctuation).

In any case, this seems like a duplicate of bug 8753 if we're to believe the new bug summary ("interwiki links should have the nofollow attribute"). However, it's unclear whether this new bug summary is accurate. The old bug summary and the opening comment are about a particular symptom ("Pages should not be indexed through interwiki links from other wikis") while the updated bug summary is about a specific solution. This discrepancy needs to be addressed.

(In reply to comment #4)

However, it's unclear whether this new bug summary is accurate. The old bug
summary and the opening comment are about a particular symptom ("Pages should
not be indexed through interwiki links from other wikis") while the updated bug
summary is about a specific solution. This discrepancy needs to be addressed.

Specific solutions to particular symptoms are appropriate since we can only implement specific solutions.

Putting the proposed solution in the summary is appropriate, since it focuses the bug. If the solution is implemented, the bug can be closed if it addresses the particular symptoms.

If, at some later time, people are dissatisfied with the solution, a new bug should be opened with their specific issues.

(In reply to comment #5)

If, at some later time, people are dissatisfied with the solution, a new bug
should be opened with their specific issues.

That doesn't seem particularly fair to the person who took the time to file a bug about their specific problem. You're changing the nature of their request and then telling them that if they don't like how the new request is implemented, they can file another bug? That seems completely backward and wrong.

If you think the issue of interwiki links not having the "nofollow" attribute needs attention, reopen bug 8753. I'm mostly reverting the bug summary here for now.

Fixed in r84820 by making it send 301 (permanent) redirects instead of 302 (temporary) redirects.

Based on googling, Google will report the target as the page's URL when following a 301, but will report the original URL when following a 302. Furthermore, interwiki redirects of that form are really permanent, so they should have a 301 redirect.
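To make that claim concrete, here is a small Python model of the behaviour as described in this comment: a crawler keeps moving to the target while it sees 301s and stops at the first URL that answers with anything else. The function `reported_url` is hypothetical, and applying the rule hop by hop to a multi-step chain is my assumption; it is not documented Google behaviour.

```python
def reported_url(hops):
    """Return the URL a crawler would report under the rule described above.

    `hops` is an ordered list of (url, status) pairs ending at the final page.
    A 301 hands the listing over to the next URL; any other status
    (302, 200, ...) means the current URL is the one reported.
    """
    if not hops:
        raise ValueError("hops must contain at least one (url, status) pair")
    for url, status in hops:
        if status != 301:
            return url
    # Chain ended on a 301 with no further hop recorded.
    return hops[-1][0]
```

Under this model the all-302 chain from the earlier trace is reported under the en.wikipedia.org URL, while the same chain served with 301s is reported under pt.wikibooks.org.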

I only changed what happens when you go to a URL of the form http://en.wikipedia.org/wiki/B:Pt:P%C3%A1gina_principal. Pages with #Redirect[[B:Some page on wikibooks]] on them will still do 302s, since they are arguably non-permanent. (Although if it has chained interwikis, the actual page with #Redirect will be a 302, but the rest in the chain will be 301.)

Marking as fixed. I unfortunately don't have any real way to test this though, since I don't control Google.
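The rule in the paragraph above can be summarised in a few lines of Python. This is only an illustration of the described behaviour; `redirect_statuses` and its parameters are names invented here and do not correspond to anything in r84820.

```python
def redirect_statuses(starts_at_redirect_page, redirect_hops):
    """Status code sent at each redirect hop of an interwiki chain.

    starts_at_redirect_page: True if the first hop is a local page that
        contains #Redirect[[...]] pointing at an interwiki target; False if
        the requested title itself carries the interwiki prefix.
    redirect_hops: number of redirects before the final 200 response.
    """
    if redirect_hops < 1:
        raise ValueError("a redirect chain needs at least one hop")
    # A #Redirect page is arguably non-permanent, so it stays a 302;
    # an interwiki-prefixed URL is permanent, so it becomes a 301.
    first = 302 if starts_at_redirect_page else 301
    # Every later hop is a plain interwiki prefix, hence 301.
    return [first] + [301] * (redirect_hops - 1)
```

For the B:Pt:Página_principal URL (two redirect hops, no #Redirect page) this gives `[301, 301]`; for a local #Redirect page pointing at a chained interwiki target it gives `[302, 301]`.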

Looks like this bug is back. See bug 26115.

*** This bug has been marked as a duplicate of bug 26115 ***