
http://wikipedia.org/index.html takes you somewhere unexpected.
Closed, DeclinedPublic

Description

Author: steve

Description:
Go to http://wikipedia.org/index.html - note that you're not at the Wikipedia home page!

Background: If I enter the URL "https://en.wikipedia.org/zebra", or "https://wikipedia.org/zebra" or even "https://en.wikipedia.org/wiki/zebra" I get a 404 page that helpfully redirects me to the English language Zebra article after a 5 second pause. That's a nice touch. (although less so if you're a non-English speaker!)

However, it has an unintended consequence. If I go to Wikipedia's main page using https://wikipedia.org/index.html or https://wikipedia.org/index.htm or https://wikipedia.org/index.php - I get redirected to the technical article "Webserver directory index" (via the redirect "index.html", "index.htm" or "index.php"). This is not a useful behavior! For non-English speakers, it's a very bad thing!

I think most people would expect http://wikipedia.org/index.html to take them to the main page.

In case you doubt the depth of the problem, note that the "index.html" redirect comes up as a remarkably frequently-accessed page. In 2008 it was the 5th most visited page in the entire encyclopedia - with the only actual article to beat it being the one about the 2008 Olympic games! About 1.5 million hits per month go to the *articles* index.html, index.php and index.htm - which is an insanely unlikely number for a relatively obscure topic. That suggests that about 1.5% of the people trying to get to our home page (many of whom are non-English speakers) are winding up at this very obscure article about webserver directories instead of the home page!

IMHO, we should change that 404 page to treat "index.html", "index.php" and "index.htm" as special cases and redirect you to the main page (preferably without the 5 second delay) instead of this rather obscure article!

I think this should be a trivial check in whatever creates our 404 page - and will improve the Wikipedia experience for 1.5 million people every month. It should be fixed.
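A minimal sketch of the kind of check being proposed, assuming the 404 handler receives the request path without its query string. The names `INDEX_ALIASES` and `redirect_target` are hypothetical and do not come from the actual 404 page code, and the case-insensitive match is an assumption too.

```python
from typing import Optional

# Hypothetical list of file names to special-case (taken from this report).
INDEX_ALIASES = {"index.html", "index.htm", "index.php"}


def redirect_target(path: str) -> Optional[str]:
    """Return "/" for a bare index request, or None to fall through
    to the existing 404 behavior (the delayed redirect to an article)."""
    name = path.strip("/").lower()  # "/index.html" -> "index.html"
    if name in INDEX_ALIASES:
        return "/"  # send the visitor to the main page immediately
    return None


print(redirect_target("/index.html"))  # main page
print(redirect_target("/zebra"))       # unchanged: handled by the existing 404 logic
```

Anything other than the three listed names is left to the existing behavior, so the "zebra" case described above would not change.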


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz70721

Event Timeline

bzimport raised the priority of this task from to Lowest. Nov 22 2014, 3:53 AM
bzimport set Reference to bz70721.

This was already discussed on IRC, and I don't consider it important enough to create a redirect rule for this specific corner case.
It might be unexpected for some users, but it's not too hard to find the main page with one more click, as the error page offers you that link.

steve wrote:

WHAT?! This is affecting at least 1.5 million people per month! How can you possibly not consider that important?

You say "it might be unexpected for some users" - but since this obscure topic isn't remotely that interesting, we know for sure that it's unexpected for 1.5 million visitors each month. So it's DEFINITELY unexpected.

I took an informal poll around the office here at work (we develop web software) and not one person expected the result you actually get here.

I can't believe that you don't think it's worth fixing! The fix has gotta be really trivial to do - and you don't think it's worth doing it to help 1.5 MILLION people?!

Why would people be going to index.html?

That's most likely not people, but bots. I'm sure you could get some stats about the user-agents and so on from Analytics if you asked nicely, that would help inform our actions.

(I'm going to mark the bug UNCONFIRMED until we get some data, so that we can either fix it if it's a real issue or wontfix it properly?)

(In reply to Bartosz Dziewoński from comment #4)

That's most likely not people, but bots. I'm sure you could get some stats
about the user-agents and so on from Analytics if you asked nicely, that
would help inform our actions.

Agreed, that's the most likely case. Either bots or a misbehaving browser (or plugin) of some sort. In any case, stats will help us figure out what's actually going on here.

To the analyticsmobile!

steve wrote:

Even if it's a herd of misbehaving bots...wouldn't we want those bots to end up at the "expected" place?

The "index.htm" page is almost certainly being hit by bots (or perhaps just one bot) - it gets a steady 680 hits per day...plus or minus a handful.

But "index.php" and "index.html" get millions of hits (in 2008, "index.html" was the second most popular article on the entire site!!)...I don't think bots would hit the ".html" and ".php" URLs that much more often than ".htm"...but regular people and strange browser behavior certainly might.

(In reply to Steve Baker from comment #7)

Even if it's a herd of misbehaving bots...wouldn't we want those bots to end
up at the "expected" place?

Depends on what the bot expects? Maybe the bot is lazy and just puts en.wikipedia.org/articlename in and expects content? In which case, moving it to a new location for them might break behavior. Maybe the solution isn't redirecting, but getting the bot author to fix their code :)

Anyway, stats shall help (I've pinged analytics to please weigh in here).

The "index.htm" page is almost certainly being hit by bots (or perhaps just
one bot) - it gets a steady 680 hits per day...plus or minus a handful.

Good to know.

But "index.php" and "index.html" get millions of hits (in 2008, "index.html"
was the second most popular article on the entire site!!)...I don't think
bots would hit the ".html" and ".php" URLs that much more often than
".htm"...but regular people and strange browser behavior certainly might.

I'm not entirely convinced regular people are doing this. I've never once seen a person type a url and actually include the index.html part unless they're copying it from something. Strange browser behavior is more likely here imho.

steve wrote:

Depends on what the bot expects? Maybe the bot is lazy and just
puts en.wikipedia.org/articlename in and expects content?

Doesn't work...why would these hypothetical bots be viewing this particular article tens of thousands of times more often than any other comparable article? Why is it accessing "index.html" *and* "index.php"? (They both redirect to the same article)

The comparable place "index.htm" (no 'l') gets almost exactly 680 hits per day...THAT is robotic behavior and the quantity of bots is believable. 680 'broken' (arguably) bots from around the world accessing the 'wrong' (arguably) location. But 200,000 bot hits per day with this same behavior?

I actually don't think it matters - whether it's people or bots - wouldn't we want them to get to the expected place? Sure, if it were very few hits - but it's not. 1.5 million per month is 1.5% of all hits to the real front page.

(In reply to Steve Baker from comment #9)

I actually don't think it matters - whether it's people or bots - wouldn't
we want them to get to the expected place? Sure, if it were very few hits
- but it's not. 1.5 million per month is 1.5% of all hits to the real front
page.

I'm not opposed to a redirect, I just want us to be well informed as to what's going on :)

I scanned one day of the 1:1000 sampled Squid log, so multiply all numbers by 1000.

I find 1587 lines with index.html, of which only 34 are without curid.

Most lines are like https://en.wikipedia.org/wiki/index.html?curid=32681660

Out of 1587, only 68 had a user agent that did not contain crawl, spider, bot or http (http is, by unofficial convention, only used for bots).
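The two checks described above (does the URL carry a curid, and does the user agent look like a bot) can be sketched roughly as follows. This assumes the URL and user agent have already been extracted from each sampled log line. The function name `classify` is hypothetical, and matching the markers case-insensitively is an assumption, not necessarily how the original scan was done.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit

# Substrings treated as bot markers, as listed in the comment above.
BOT_MARKERS = ("crawl", "spider", "bot", "http")


def classify(url: str, user_agent: str) -> Tuple[bool, bool]:
    """Return (has_curid, looks_like_bot) for one sampled request."""
    query = parse_qs(urlsplit(url).query)
    has_curid = "curid" in query
    ua = user_agent.lower()  # assumption: case-insensitive match
    looks_like_bot = any(marker in ua for marker in BOT_MARKERS)
    return has_curid, looks_like_bot


print(classify(
    "https://en.wikipedia.org/wiki/index.html?curid=32681660",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
))
```

Counting the lines where the first value is false gives the "without curid" figure, and counting the lines where the second value is false gives the "not obviously a bot" figure.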

Of the lines with index.html?curid= the following bots were found:

   8 Android (compatible baidu spider)
  13 AhrefsBot
 113 Googlebot
   1 Mail.RU_Bot
   3 YandexBot
1337 bingbot
  21 iPhone etc (but compatible GoogleBot)
   1 Sogu web spider

Of course bingbot doesn't have to be Bing really. Some bots cloak.

Does this answer your question?

(In reply to Erik Zachte from comment #11)

I find 1587 lines with index.html, of which only 34 are without curid.

Most lines are like https://en.wikipedia.org/wiki/index.html?curid=32681660

So these don't actually visit index.html, it's just the stats that are wrong.

Using 34/1587 as the percentage of real visits, we arrive at about 1000 hits per day. This is comparable with other articles on these subjects, like "Web server" or "HTTP". This thousand includes both humans and bots, right?
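The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes the 1.5 million hits per month quoted earlier in this report as the starting figure and a 30-day month; both are assumptions about how the number was derived, not something stated in the comment.

```python
# Figures from this thread.
SAMPLED_LINES = 1587        # sampled log lines containing index.html
WITHOUT_CURID = 34          # of those, lines without a curid parameter
REPORTED_MONTHLY = 1_500_000  # hits per month attributed to the redirects

# Assumption: a 30-day month.
DAYS_PER_MONTH = 30

real_fraction = WITHOUT_CURID / SAMPLED_LINES
real_hits_per_day = REPORTED_MONTHLY * real_fraction / DAYS_PER_MONTH

print(f"fraction without curid: {real_fraction:.1%}")
print(f"estimated real hits per day: {real_hits_per_day:.0f}")
```

Roughly 2% of the recorded hits are direct visits, which works out to a little over a thousand per day.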

Out of 1587, only 68 had a user agent that did not contain crawl, spider, bot
or http (http is, by unofficial convention, only used for bots).

I'm curious how many of the non-curid URLs are non-bots.

Either way, this seems to be just a stats issue and we do not actually have millions of humans every month accidentally learning everything about webserver directory indices. I suggest re-closing this bug as WONTFIX. Steve?

steve wrote:

Ah...so to put it simply:

http://wikipedia.org/index.html does indeed redirect to the article of that name...but http://wikipedia.org/index.html?whatever doesn't...but it *does* increment the stats for the article of that name.

Then, most of the hits we're recording for this article are of the "index.html?whatever" variety, so there isn't a problem for those people.

(Interestingly: http://wikipedia.org/Zebra?curid=32681660 takes you to the same place!)

OK - then I guess we can call this a don't-fix issue. I'll pass the news on to the affected article talk pages so they can understand what's going on.

Thanks for getting this to everyone's attention :)