
Automatically add encoded URL lines for entries in MediaWiki:robots.txt
Closed, Declined · Public

Description

Crawlers may still index pages that should be disallowed due to encoded characters, see the following example:

Disallow: /wiki/Wikipedia:Arbitration/
Disallow: /wiki/Wikipedia%3AArbitration/
Disallow: /wiki/Wikipedia%3AArbitration%2F
Disallow: /wiki/Wikipedia:Arbitration%2F

MediaWiki should generate these extra rules automatically for users.
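For illustration, a minimal sketch (not MediaWiki code; the helper name and the restriction to ':' and '/' are assumptions taken from the example above) of how such encoded variants could be generated from a single Disallow path:

```python
def encoded_variants(path):
    """Hypothetical helper: expand one /wiki/ Disallow path into its
    percent-encoded variants for ':' and '/', the characters from the
    example above. Returns a sorted list of unique paths."""
    variants = {path}
    prefix, sep, title = path.partition('/wiki/')
    if sep:
        for encode_colon in (False, True):
            for encode_slash in (False, True):
                t = title
                if encode_colon:
                    t = t.replace(':', '%3A')
                if encode_slash:
                    t = t.replace('/', '%2F')
                variants.add(prefix + sep + t)
    return sorted(variants)

# Prints the four Disallow lines shown above.
for rule in encoded_variants('/wiki/Wikipedia:Arbitration/'):
    print('Disallow:', rule)
```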


Version: unspecified
Severity: enhancement

Details

Reference
bz29162

Event Timeline

bzimport raised the priority of this task to Lowest. · Nov 21 2014, 11:31 PM
bzimport set Reference to bz29162.
bzimport added a subscriber: Unknown Object (MLST).

FT2.wiki wrote:

Very useful; however, we need to check what happens if some of the alternate URLs for the canonical page are excluded by robots.txt and some aren't.

Does it apply the robots.txt rule that it has for the "canonical" page to all the alternatives? Or does it get confused?

Example:

/Wikipedia%3AArbitration%2FExample is stated to have /Wikipedia:Arbitration/Example as its canonical link. However, one of these is NOINDEXed via robots.txt or its header and the other isn't. Knowing the canonical URL helps to identify these as "duplicates" and "the same page". But does it guarantee that both will be treated as NOINDEXed when only one of them is? Or do we still have to cover all variants of the URL in robots.txt?

FT2.wiki wrote:

To clarify, URL variants where robots.txt or header tags prohibit spidering will probably be excluded from spidering in the first place. So Google will be left to collate those URL variants it came across where robots.txt or header tags _didn't_ prevent spidering -- and a "canonical" setting which states these are all the same page.

I.e. this setting could help avoid duplicates, but my guess is it probably _won't_ prevent URLs that robots.txt or header tags don't stop from being listed in results.
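For context, the "canonical" setting discussed here is the rel=canonical link element that MediaWiki emits in the page head. A minimal sketch (the URL is only an example, and the regex assumes MediaWiki's usual attribute order) of checking which URL a crawler is told is canonical:

```python
import re
import urllib.request

# Non-canonically encoded example URL; adjust for the page of interest.
url = 'https://en.wikipedia.org/wiki/Wikipedia%3AArbitration'

with urllib.request.urlopen(url) as resp:
    html = resp.read().decode('utf-8', errors='replace')

# Naive extraction of <link rel="canonical" href="..."> from the head.
match = re.search(r'<link rel="canonical" href="([^"]+)"', html)
print('canonical:', match.group(1) if match else 'not found')
```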

Changing product/component to Wikimedia/Site requests; MediaWiki:Robots.txt is a WMF hack, and there's no such feature in MediaWiki core.

(In reply to comment #0)

MediaWiki should generate these extra rules automatically for users.

(In reply to comment #4)

MediaWiki:Robots.txt is a WMF hack, there's no such feature in MediaWiki core.

Now, how to prioritize...

Krinkle claimed this task.
Krinkle subscribed.
  • MediaWiki provides a canonical URL that spiders should use.
  • For encoding normalisation, MediaWiki (as of 2015) performs a 301 redirect to the canonical URL. As such, these non-standard-encoded URLs are never answered with an article that a spider could index.
  • To exclude individual pages from search engines, use __NOINDEX__ instead.
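As a hedged illustration of the second point (the URL is only an example), the 301 redirect for a non-canonically encoded URL can be observed without following it:

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Do not follow redirects, so the 301 itself stays visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
# Percent-encoded variant of a canonical title; example URL only.
url = 'https://en.wikipedia.org/wiki/Wikipedia%3ASandbox'

try:
    resp = opener.open(url)
    print(resp.status, resp.geturl())
except urllib.error.HTTPError as err:
    # With redirects left unhandled, urllib raises HTTPError for the 3xx.
    print(err.code, '->', err.headers.get('Location'))
```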