
Install sitemap extension into bugzilla, and then update bugzilla robots.txt
Closed, Declined (Public)

Description

ATM, MediaZilla isn't indexed by search engines, which means that searching for MediaWiki bugs will *never* get one here, but directs one at most to one of those many fishy websites that pair up the bug mailing list with advertisements. Even if one then reads the bug number and searches for "mediawiki bug 4711", one still doesn't get here. So please remove robots.txt.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=13881

Details

Reference
bz33406

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 22 2014, 12:00 AM)
bzimport set Reference to bz33406.

If you know the bug number, it is quicker to use http://bugzilla.wikimedia.org/4711 to find the bug.

Fulltext, freeform search would be great, but a lot can be done with advanced search.

Given the recent struggles here with vandalism and load, though, increasing our problems by introducing bots isn't a high priority right now.

Just a disclaimer: I do not get paid for the time I spend here. If WMF wants me to jump through some hoops, that's fine, but no, thanks.

There's a nice Google Tech Talk by Spolsky where he explains the design principles of stackoverflow.com and the road bumps that impede workflows.  If WMF has some data that vandalism and load on the bugtracker outweigh the ease of use for and value of potential patches from MediaWiki users, so be it.

(In reply to comment #1)

If you know the bug number, it is quicker to use
http://bugzilla.wikimedia.org/4711 to find the bug.

Fulltext, freeform search would be great, but a lot can be done with advanced
search.

Given the recent struggles here with vandalism and load, though, increasing our
problems by introducing bots isn't a high priority right now.

True about going directly if you know the number, but most people wouldn't know that bug 1234 actually is [1].

I'm not sure what the issue is with having Google, among others, index our bugzilla instance. It doesn't open us up to any more spam.

Looking at [3], it seems we have the default BZ robots.txt installed.

A bit of searching around ([2] among others) seems to suggest we'll need to install the sitemap extension [4].

I also don't think a blanket removal of the robots.txt is a good idea. However, following the example of [5] and updating ours to something along those lines seems very sane. I'm not sure why the default is so limiting. The sitemap extension also includes an improved robots.txt.

We can easily get ops to update the robots.txt because it's a quick fix, but we might need a bit more time to get ops to actually install the extension, and then presumably a submission to Google Webmaster Tools.

[1] https://bugzilla.wikimedia.org/1234
[2] http://bugzillatips.wordpress.com/2011/05/04/search-bugzilla-using-google/
[3] https://bugzilla.wikimedia.org/robots.txt
[4] http://code.google.com/p/bugzilla-sitemap/
[5] https://bugzilla.mozilla.org/robots.txt

I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower issues we have here. Vandalism works even though we have a robots.txt so naturally it's completely ignoring that. And I know for a fact that e-mail addresses are already being harvested from our bugtracker, so robots.txt isn't helping there.

The only thing that robots.txt is doing is keeping out all the good bots, all we have now are the bad ones.

Are we sure we want this? I would imagine it's similar to why we don't really want the mailing lists indexed: the amount of cruft it could potentially introduce into the search results.

(In reply to comment #4)

I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower issues we
have here. Vandalism works even though we have a robots.txt so naturally it's
completely ignoring that. And I know for a fact that e-mail addresses are
already being harvested from our bugtracker, so robots.txt isn't helping there.

Just to be clear, I wasn't saying that we are keeping vandalism at bay by having a stricter robots.txt file. As pointed out in comment #1, there are plenty of links to the tracker all over the internet that vandals could follow if that was how they found bug trackers to play with.

In the past (perhaps less so currently?) "well behaved" spiders that respected robots.txt have routinely wreaked havoc on sites like this one that are, essentially, a bunch of cgi scripts that result in a process being forked for each request.

So, last week, we dealt with some apparent vandalism when someone brought the server to a halt by requesting a particular URL over and over.

My point was simply that if we suddenly make bugzilla visible to spiders who respect robots.txt, they would probably send a ton of queries to the server (e.g. several spiders from each search engine) to quickly discover the newly available data.

That sort of sudden visibility could very well look a lot like the vandalism we saw last week.

That said, something like https://bugzilla.mozilla.org/robots.txt is a good thing to consider.

User-agent: *
Disallow: /*.cgi
Disallow: /*show_bug.cgi*ctype=*
Allow: /
Allow: /*index.cgi
Allow: /*show_bug.cgi
Allow: /*describecomponents.cgi
Allow: /*page.cgi
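
If crawl load is the main worry, one more knob worth mentioning, purely as an illustration and not part of Mozilla's file, is the non-standard Crawl-delay directive. Several crawlers honour it; Google ignores it and relies on Webmaster Tools crawl-rate settings instead:

# Ask crawlers that honour it to pause between requests (Google ignores this):
User-agent: *
Crawl-delay: 10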

Sitemaps have already been actively submitted to Google; there was just a failure with Yahoo.

Replacing ./robots.txt. (The old version will be saved as
"./robots.txt.old". You can delete the old version if you do not need
its contents.)
Pinging search engines to let them know about our sitemap:

  Live: OK
Google: OK
   Ask: OK

Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden

Yahoo: FAILED

But also note that it might take a while.

(In reply to comment #0)

Even if one then reads the bug number and searches for
"mediawiki bug 4711", one still doesn't get here. So please remove robots.txt.

robots.txt has been updated, the sitemap read, the site re-indexed, and you can see bugzilla links in search results.

Beware, though, that this probably isn't what you want. Your query doesn't really give any better results now.

The only way I can get the bug report is with this Google query: "bug 4711 site:bugzilla.wikimedia.org". Searches on Live and Ask were similarly unfruitful.

(In reply to comment #8)
Thanks. When I try to download:

[...]
Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to
Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden

Yahoo: FAILED

[...]

I get an empty file (after a short delay). That doesn't look right to me.

(In reply to comment #9)

[...]
Beware, though, that this probably isn't what you want. Your query doesn't
really give any better results now.

The only way I can get the bug report is with this Google query: "bug 4711
site:bugzilla.wikimedia.org". Searches on Live and Ask were similarly
unfruitful.

That's right. I have no SEO knowledge, but I do notice that the pages don't contain "MediaWiki" in any (prominent) place.

I suggest waiting a few days or weeks to see if the pages gain karma from incoming links; if not, I'll file a new bug.

https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml doesn't seem to be returning any content. The scope of this bug is somewhat murky, but if this is the correct sitemap URL and it's supposed to be returning something, this bug should be re-opened.

(In reply to comment #12)

if this is the correct sitemap URL and it's supposed to be returning something,
this bug should be re-opened.

Daniel already reopened the RT ticket, too.

(In reply to comment #13)

if this is the correct sitemap URL and it's supposed to be returning something,
this bug should be re-opened.

Daniel already reopened the RT ticket, too.

Any update on this? https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml still gives no content.

Is the Bugzilla configuration (or patched sources) accessible somewhere? I didn't see anything obvious on Gerrit.

(In reply to comment #14)

(In reply to comment #13)

if this is the correct sitemap URL and it's supposed to be returning something,
this bug should be re-opened.

Daniel already reopened the RT ticket, too.

Any update on this?
https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml still gives no
content.

Is the Bugzilla configuration (or patched sources) accessible somewhere? I
didn't see anything obvious on Gerrit.

We haven't moved the bugzilla customizations to gerrit yet. We probably should.

sumanah wrote:

Adding Andre in case he can help with this.

(In reply to comment #0)

ATM, MediaZilla isn't indexed by search engines

This statement isn't correct anymore. I get results from bugzilla.wikimedia.org on google.com (though not with perfect ranking).
I don't know exactly what the benefits of the aforementioned bugzilla-sitemap would be compared to the current situation.

(In reply to comment #15)
We haven't moved the bugzilla customizations to gerrit yet. We probably should.

Might be worth a separate ticket.

(In reply to comment #15)

We haven't moved the bugzilla customizations to gerrit yet. We probably should.

We have https://gerrit.wikimedia.org/r/gitweb?p=wikimedia%2Fbugzilla%2Fmodifications.git;a=summary but it's not up to date (e.g. it's missing the SiteMap extension, which is deployed).

What would https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml offer?
It's not clear to me what needs to be done to fix this report.

Tim: It's not clear to me what needs to be done to fix this report. Could you please clarify? Otherwise I might close this as WORKSFORME as I simply don't know what's missing...

See [[Site map]]. The current sitemap is broken: it's an empty file, which is also invalid XML.

Another benefit of a sitemap, apart from letting the search engines know about all bugs, is having a "last modified" field on each page to be indexed. If a particular page (or bug, in this case) has already been indexed by the search engine, it won't be reindexed unless its last-modified date is newer than the cached copy; that should save some CPU and bandwidth because old bugs won't be re-crawled.
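
For reference, a per-bug entry in a sitemap following the sitemaps.org protocol looks roughly like the sketch below; the bug URL and date are only illustrative, not taken from the actual deployed file:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bugzilla.wikimedia.org/show_bug.cgi?id=33406</loc>
    <lastmod>2013-01-04</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>

A crawler that already holds a fresh copy of a given <loc> can skip it while <lastmod> is unchanged, which is exactly the CPU and bandwidth saving described above.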

(In reply to comment #19)

Tim: It's not clear to me what needs to be done to fix this report. Could you
please clarify? Otherwise I might close this as WORKSFORME as I simply don't
know what's missing...

Essentially, as Jesús said, the sitemap extension seems to be broken as deployed, since it doesn't return any sitemap. Its installation was suggested by Sam in comment #3. For starters, it would be nice if someone could relay the status of RT #2198.

Without a working configuration, it's hard to assess whether the bad search rankings are due to this error.

I see. To compare with a working version (Mozilla), run

wget -qO- https://bugzilla.mozilla.org/page.cgi?id=sitemap/sitemap.xml

$:andre\> wget https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml
--2013-01-04 21:03:59-- https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml
Resolving bugzilla.wikimedia.org... 208.80.152.149
Connecting to bugzilla.wikimedia.org|208.80.152.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
2013-01-04 21:04:21 (0.00 B/s) - “page.cgi?id=sitemap%2Fsitemap.xml” saved [0/0]

Note that right now Google doesn't even do anything with the sitemap because of bug 46328.

A couple of things:

From https://bugzilla.wikimedia.org/robots.txt:


User-agent: *
Disallow: /*.cgi
Disallow: /*show_bug.cgi*ctype=*
Allow: /
Allow: /*index.cgi
Allow: /*show_bug.cgi
Allow: /*describecomponents.cgi

Allow: /*page.cgi

http://www.robotstxt.org/faq/robotstxt.html seems to indicate that wildcards are unsupported in robots.txt files:


Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.

There also seems to be an assumption that Allow rules can override previous Disallow rules. I'm not sure if this is actually the case. If *.cgi is disallowed, will *show_bug.cgi become allowed with a later directive?
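
As a rough, non-authoritative sanity check, the rules above can be fed to Python's stdlib robots.txt parser. Caveat: this parser matches paths literally (no * expansion), so it models the robotstxt.org baseline rather than Googlebot, which does expand wildcards:

from urllib import robotparser

# Rules as currently served from https://bugzilla.wikimedia.org/robots.txt
rules = """\
User-agent: *
Disallow: /*.cgi
Disallow: /*show_bug.cgi*ctype=*
Allow: /
Allow: /*index.cgi
Allow: /*show_bug.cgi
Allow: /*describecomponents.cgi
Allow: /*page.cgi
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# With literal matching, "/*.cgi" never prefixes a real path, so a bug page
# falls through to "Allow: /" and is reported as fetchable.  A crawler that
# expands "*" (e.g. Googlebot) may read the very same file differently.
print(rp.can_fetch("*", "https://bugzilla.wikimedia.org/show_bug.cgi?id=1234"))

In other words, whether the later Allow lines can rescue show_bug.cgi from the earlier Disallow depends entirely on each crawler's matching rules, which is the ambiguity raised above.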

https://encrypted.google.com/search?hl=en&q=site%3Abugzilla.wikimedia.org indicates that, as stated in comment 24, bugzilla.wikimedia.org is not being indexed by Google at all currently.

(In reply to comment #24)

Works for me at this moment. Maybe it was a temporary issue. It displays a list of 17 elements (in XML).

Agreed: https://www.google.com/search?q=site%3Abugzilla.wikimedia.org

(In reply to comment #26)

(In reply to comment #24)

Works for me at this moment. Maybe it was a temporary issue. It displays a
list
of 17 elements (in XML).

[...]

Those are 17 links to bugzilla.*mozilla*.org.

(In reply to comment #27)

Those are 17 links to bugzilla.*mozilla*.org.

My apologies; clearly I meant https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml, and yes, now it is working (although very slow, and still delivering a blank page).

(In reply to comment #28)

(In reply to comment #27)

Those are 17 links to bugzilla.*mozilla*.org.

My apologies; clearly I meant
https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml, and yes, now
it
is working (although very slow, and still delivering a blank page).

So the same as:

(In reply to comment #8)
Thanks. When I try to download:

[...]
Submitting https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml to
Search::Sitemap::Pinger::Yahoo=HASH(0x7b6a970) failed: 403 Forbidden

Yahoo: FAILED

[...]

I get an empty file (after a short delay). That doesn't look right to me.

which I wrote 2012-01-04? :-)

Unfortunately, I'm not privy to RT #2198, so maybe there have been (unsuccessful) discussions there.

(In reply to comment #24)
(In reply to comment #25)

From https://bugzilla.wikimedia.org/robots.txt:
User-agent: *

Different bug, I'd say. :) It looks like this robots.txt is not in operations/puppet/files/apache/sites/bugzilla.wikimedia.org; I'm wondering where it is (or if it's puppetized at all).

On Fedora 20, I checked out the upstream Bugzilla 4.4 branch via bzr and applied https://git.wikimedia.org/summary/wikimedia%2Fbugzilla%2Fmodifications.git on top of it.

Running ./checksetup.pl, extensions/Sitemap fails with the Perl module "Search-Sitemap" not found. Ubuntu does not list a package on http://packages.ubuntu.com either, but it might still be packaged for other distributions (e.g. there is a package called "perl-Search-Sitemap" for openSUSE at http://ftp.uni-stuttgart.de/opensuse-buildservice/devel:/languages:/perl:/CPAN-S/openSUSE_12.3/noarch/perl-Search-Sitemap-2.13-5.1.noarch.rpm ).

A recent mailing list thread at https://groups.google.com/forum/#!msg/mozilla.support.bugzilla/j60P0Uw9fOU/PPDgFZMIrtsJ even implied that it might not be needed anymore, but I'm not sure if that is correct.

http://bzr.mozilla.org/bugzilla/extensions/sitemap/trunk/files has not been updated since 2010.

Strongly proposing WONTFIX.
There is no distro-packaged Search::Sitemap available, and the code is ancient and not even half-working. Let's remove this from production, from the new Bugzilla, and from https://git.wikimedia.org/summary/wikimedia%2Fbugzilla%2Fmodifications.git

Using it on boogs.wmflabs.org, I get this every time:

Pinging search engines to let them know about our sitemap:
Submitting http://boogs.wmflabs.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Ask=HASH(0x903e608) failed: 500 Can't connect to submissions.ask.com:80 (Bad hostname)

 Ask: FAILED
Live: OK

Submitting http://boogs.wmflabs.org/page.cgi?id=sitemap/sitemap.xml to Search::Sitemap::Pinger::Yahoo=HASH(0x8d93158) failed: 403 Forbidden

 Yahoo: FAILED
Google: OK

There were some failures while submitting the sitemap to certain search
engines. If you wait a few minutes and run checksetup again, we will
attempt to submit your sitemap again.

Fine, but surely this is not the only way to fix the core issue?

(In reply to comment #0)

ATM, MediaZilla isn't indexed by search engines, which means that searching
for MediaWiki bugs will *never* get one here

I like:

(In reply to comment #7)

That said, something like https://bugzilla.mozilla.org/robots.txt is a good
thing to consider.

A sitemap is not needed for search engines to index mediazilla, since all bugs are sent to wikibugs-l and end up listed on various web pages. But they aren't indexing mediazilla because of this entry in robots.txt:

Disallow: /*.cgi

But without a sitemap, search engines don't know when a bug is updated, and end up reindexing the entire site every time, producing a lot of overhead on the servers and bringing the site down. With a sitemap, only bugs updated since the last crawl would (supposedly) be crawled again, reducing the overhead on the site, although I'm not sure to what extent.

From comment 32, it just needs to generate a sitemap; there's no need to ping search engines about its existence. They'll know about it when they fetch robots.txt again and find a sitemap location there. I don't see why it's pinging search engines.
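
For what it's worth, advertising the sitemap location in robots.txt is a single standard (sitemaps.org) directive, here pointing at the URL already deployed above:

Sitemap: https://bugzilla.wikimedia.org/page.cgi?id=sitemap/sitemap.xml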

Sitemap on bmo (bugzilla.mozilla.org) seems to be generated with a different extension, or a modification of this one, according to this:
https://code.google.com/p/bugzilla-sitemap/issues/detail?id=1

From what I see, that patch doesn't ping search engines; it also saves the sitemap on the server and serves that to search engines, instead of regenerating the sitemap *every time* the URL is requested, for a period of time defined in SITEMAP_AGE. This should be more convenient. Maybe we can get the extension that bmo is using from somewhere? Or at least consider using that patch if it looks sane.

WONTFIXing in favor of fixing bug 13881.