
Include Sitemaps file in Wikimedia's robots.txt
Closed, Declined · Public

Description

Author: mathias.schindler

Description:
Sitemaps is a formerly Google-only protocol, now supported by several of the
larger search engines. It describes an XML-based site map of a given web site.
MediaWiki supports creating Sitemap XML files, and we generate them on a regular
basis. The protocol now allows listing the location of the XML file in
robots.txt, so that other search engines can find it.

Feature Request: Please list the location of the corresponding Sitemap XML
file in each of our robots.txt files, as described in
http://sitemaps.org/protocol.html#submit_robots

The result "could" be faster and more efficient indexing of Wikimedia content
and fewer useless requests to our servers for unchanged pages.
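
For illustration only, the addition amounts to one extra line per sitemap (or
sitemap index) file; the URL below is hypothetical and just shows the shape
such an entry could take:

# ...existing crawl rules unchanged...
Sitemap: http://en.wikipedia.org/sitemap-index-enwiki.xml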


Version: unspecified
Severity: enhancement

Details

Reference
bz9563

Event Timeline

bzimport raised the priority of this task to Lowest. · Nov 21 2014, 9:37 PM
bzimport set Reference to bz9563.
bzimport added a subscriber: Unknown Object (MLST).

river wrote:

this can't be done easily at the moment because robots.txt is shared between
all wikis.

Yes, it would be great to break away from having to do any special
contacting of search engines. If they are interested in indexing us,
they know where to look: robots.txt's "Sitemap:" entry.

We would no longer need a "*oogle/*ahoo! webmaster tools account", any
special catering, or even knowledge of who the search engines are.
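
Checking that the entry is really visible is trivial, too; any crawler-visible
line can be read back with a plain fetch (example.org here is just a stand-in
for whichever wiki is being checked):

$ GET http://example.org/robots.txt | grep '^Sitemap:'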

So how to do it?

In http://sitemaps.org/protocol.php:

Sitemaps & Cross Submits
To submit Sitemaps for multiple hosts from a single host, you need to
"prove" ownership of the host(s) for which URLs are being submitted in
a Sitemap... You can do this by modifying the robots.txt file on
www.host1.com to point to the Sitemap on www.sitemaphost.com...
...You can specify more than one Sitemap file per robots.txt file.
Sitemap: <sitemap1_location>
Sitemap: <sitemap2_location>

Note: we do not see mention of being able to use sitemap INDEX files
on the "Sitemap:" line of robots.txt, just plain sitemap.xml.gz files.

But that is good enough for me:
I'm putting e.g.,
Sitemap: http://radioscanningtw.jidanni.org/sitemap-radioscanningtw-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_0-0.xml.gz
Sitemap: http://taizhongbus.jidanni.org/sitemap-taizhongbus-wiki_-NS_5-0.xml.gz
Sitemap: http://transgender-taiwan.org/sitemap-transgender-wiki_-NS_0-0.xml.gz
in my robots.txt, which is shared between my three wikis and
hoping for the best.

Here I can pick and choose amongst all the different namespaces made
by generateSitemap.php (achieving Bug #12860), without having to get
tangled up in editing the sitemap-index-*.xml that generateSitemap.php
also creates, or in its bugs (paths: Bug #9675, xsd: Bug #13527). Indeed,
I will rm sitemap-index-*.xml.

Wait, Wikimedia sites are big, so maybe you can use
Sitemap: http://aaa.../sitemap-index-yyy.xml
Sitemap: http://bbb.../sitemap-index-zzz.xml
according to http://sitemaps.org/protocol.php:

Specifying the Sitemap location in your robots.txt file:
If you have a Sitemap index file, you can include the location of just
that file. You don't need to list each individual Sitemap listed in
the index file.

if that's indeed what it is trying to say.
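
If that reading is right, then even one shared robots.txt could cover the
whole farm with a single index line per wiki, along the lines of (these file
names are made up; the real ones are whatever generateSitemap.php writes):

Sitemap: http://en.wikipedia.org/sitemap-index-enwiki.xml
Sitemap: http://de.wikipedia.org/sitemap-index-dewiki.xml
Sitemap: http://commons.wikimedia.org/sitemap-index-commonswiki.xml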

OK, and here is the makefile I will use

T=transgender-taiwan.org
R=radioscanningtw.jidanni.org
B=taizhongbus.jidanni.org
S=$T $R $B
robots.txt: robots-base-jidanni.txt $(addsuffix .SITEMAPS,$S)
	> $@ echo \#Made by $(MAKEFILE_LIST), will get overwritten
	>> $@ cat $<
	ls sitemap-*-NS_{[0-5],1[2345]}-*.xml.gz|perl -pwe \
	'if(/trans/){s@^@$T/@}elsif(/bus/){s@^@$B/@}$(\
	)elsif(/radio/){s@^@$R/@};s@^@Sitemap: http://@' >> $@

%.SITEMAPS:
	cd ../$*/maintenance && \
	    php generateSitemap.php --server=http://$* --fspath=../
	rm sitemap-index-*-wiki_.xml
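
For the record, running it is then just a matter of (the directory layout is
specific to my setup, as the paths above show):

$ make robots.txt

The per-wiki *.SITEMAPS targets re-run generateSitemap.php first, and then the
robots.txt rule writes a header comment, copies in robots-base-jidanni.txt,
and appends one Sitemap: line per per-namespace .xml.gz file with the right
host prefix.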

Sure hope this paragraph from http://sitemaps.org/protocol.php,

When a particular host's robots.txt, say
http://www.host1.com/robots.txt, points to a Sitemap or a Sitemap
index on another host; it is expected that for each of the target
Sitemaps, such as http://www.sitemaphost.com/sitemap-host1.xml, all
the URLs belong to the host pointing to it. This is because, as noted
earlier, a Sitemap is expected to have URLs from a single host only.

won't spoil my plans.

So I removed my previous sitemap from my Google Webmaster Tools
account, with confidence from
http://www.google.com/support/webmasters/bin/answer.py?answer=64748

You can tell Google and other search engines about your sitemap by
adding the following line to your robots.txt file

We still recommend that you submit your sitemap through your Webmaster
Tools account so you can make sure that the Sitemap was processed
without any issues and to get additional statistics about your
site.

Well, Google will now just have to poke around my robots.txt to find
where my sitemap is, just like the other search engines will. Now I
need not have any special knowledge or (proactive) contact with any
particular search engine company.

jeluf wrote:

The problem is that we don't have a robots.txt per wiki. All wikis share one robots.txt. We can't add Sitemap:-lines to the robots.txt because we'd need different entries per wiki.

Are you sure you need different entries per wiki?
I just put them all in the same file, expecting that a search engine will
ignore the entries that do not apply to the host it is crawling.

$ HEAD http://transgender-taiwan.org/robots.txt \
http://radioscanningtw.jidanni.org/robots.txt \
http://taizhongbus.jidanni.org/robots.txt |grep Length
Content-Length: 1763
Content-Length: 1763
Content-Length: 1763

My logs (only two days' worth so far) show that the search engines fetch the sitemaps they want.
Can you find a statement that what I did is against the protocol?