
Blank lines at the end of global robots.txt cause syntax problems when MediaWiki:robots.txt is appended
Closed, Resolved (Public)

Description

The Robots Exclusion Standard (http://www.robotstxt.org/orig.html), which defines the syntax of robots.txt files, says, among other things, that:

  1. there may not be more than one record with "User-Agent: *", and
  2. sections may not contain blank lines.

Thus, the only standard-conforming (and therefore reasonably reliable) way for users editing the local part of robots.txt, as specified in [[MediaWiki:Robots.txt]], to include rules that apply to all robots is to put them at the very top with no blank lines preceding them, so that they are appended directly to the "User-Agent: *" section of the global robots.txt.

Unfortunately, even this doesn't currently work right, since the global part of Wikimedia's robots.txt contains some blank lines at the end. Please remove said lines or replace them with comments.
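To make the failure mode concrete, here is a minimal Python sketch of record splitting under the original standard, where blank lines end a record and comment-only lines are discarded. The sample rules and the helper names (split_records, orphaned_records) are hypothetical and are not taken from Wikimedia's actual robots.txt or robots.php.

```python
def split_records(text):
    """Split robots.txt text into records.

    Blank lines end a record; comment-only lines are discarded and
    therefore do not count as a record boundary.
    """
    records, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment-only line: ignored entirely
        if line:
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def orphaned_records(text):
    """Return records that do not begin with a User-agent line."""
    return [r for r in split_records(text)
            if not r[0].lower().startswith("user-agent:")]


# Hypothetical stand-ins for the global file and MediaWiki:Robots.txt.
GLOBAL_PART = "User-agent: *\nDisallow: /w/\n"
LOCAL_PART = "Disallow: /wiki/Special:Example\n"

broken = GLOBAL_PART + "\n\n" + LOCAL_PART     # blank lines before local part
working = GLOBAL_PART + LOCAL_PART             # local part appended directly
commented = GLOBAL_PART + "#\n#\n" + LOCAL_PART  # blank lines replaced by comments

print(orphaned_records(broken))
print(orphaned_records(working))
print(orphaned_records(commented))
```

In this sketch only the first concatenation leaves the local rule stranded in a record with no User-agent line, which is why either removing the blank lines or replacing them with comments would address the problem.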

(Actually, the current implementation is somewhat annoyingly fragile in general. It might be better for MediaWiki itself to parse the content of both the local and global parts of robots.txt (it's not hard), preferably with fairly relaxed parsing rules, and merge them properly into a single file guaranteed to have correct syntax. While at it, the software could try to provide notification of any unrecognized lines and other potential errors detected during the parsing.)
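A rough Python sketch of what such a parse-and-merge step might look like is below. It is not MediaWiki code. The function names, the set of fields treated as rules, and the choice to assign rules that appear before any User-agent line to "*" are all assumptions made for illustration.

```python
RULE_FIELDS = {"disallow", "allow"}  # assumption: fields treated as rules


def parse(text, groups, warnings):
    """Relaxed parse: blank lines and comments are ignored, not boundaries."""
    # Assumption: rules that appear before any User-agent line apply to "*".
    agents, seen_rule = ["*"], True
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if sep and field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value)
        elif sep and field in RULE_FIELDS:
            rule = f"{field.capitalize()}: {value}".rstrip()
            for agent in agents:
                groups.setdefault(agent, []).append(rule)
            seen_rule = True
        else:
            warnings.append(f"unrecognized line: {raw.strip()}")


def merge(global_text, local_text):
    """Merge both parts into one file with a single record per user agent."""
    groups, warnings = {}, []
    parse(global_text, groups, warnings)
    parse(local_text, groups, warnings)
    records = ["\n".join([f"User-agent: {agent}", *rules])
               for agent, rules in groups.items()]
    return "\n\n".join(records) + "\n", warnings


# Hypothetical inputs; note the trailing blank lines in the global part.
global_text = "User-agent: *\nDisallow: /w/\n\n\n"
local_text = ("Disallow: /wiki/Special:Example\n\n"
              "User-agent: ExampleBot\nDisallow: /\nFoo bar\n")

merged, warnings = merge(global_text, local_text)
print(merged)
print(warnings)
```

Because the output is regenerated from the parsed groups, there is exactly one record per user agent and exactly one blank line between records, regardless of stray blank lines in either input. User agents with no rules are silently dropped in this sketch.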


Version: unspecified
Severity: normal
URL: http://en.wikipedia.org/wiki/MediaWiki:Robots.txt

Details

Reference
bz15663

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:18 PM
bzimport set Reference to bz15663.
bzimport added a subscriber: Unknown Object (MLST).

mike.lifeguard+bugs wrote:

Adding JeLuF to CC, as they wrote this, IIRC.

mike.lifeguard+bugs wrote:

JeLuF, could you please take a look at this and/or bug 15878? There are reports that search spiders are indexing what they shouldn't be (http://meta.wikimedia.org/w/index.php?title=Talk:Spam_blacklist&oldid=1589272#COIBot_reports_showing_up_in_Google_results).

It's been over a year, would someone please fix this bug? All it should take to fix the immediate issue is to remove the blank lines from the end of the global robots.txt or to replace them with comments.

jeluf wrote:

Fixed.

There was a hardcoded \n\n in robots.php causing the problems. Should now be fine.
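The actual change to robots.php is not shown in this report. Purely as an illustration of the idea, a Python sketch of joining the two parts without introducing blank lines could look like the following; append_local is a hypothetical name.

```python
def append_local(global_text, local_text):
    """Join the global and local parts with no blank line between them."""
    head = global_text.rstrip()  # drop trailing blank lines and whitespace
    tail = local_text.strip()    # drop leading and trailing blank lines
    if not tail:
        return head + "\n"
    return head + "\n" + tail + "\n"


# Hypothetical inputs: the global part ends with blank lines.
result = append_local("User-agent: *\nDisallow: /w/\n\n\n",
                      "\nDisallow: /wiki/Special:Example\n")
print(result)
```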