
URLs of the form wiki/article?curid=something should be indexable by robots on Wikinews
Closed, Resolved · Public

Description

In the last code update, the way the robots meta tag works appears to have changed: pages with the curid parameter are now set to noindex,follow. On English Wikinews, for various reasons (see [[n:Wikinews:Google news]] for the gory details), we use pages with their curid appended at the end so that Google News will syndicate us. This change has stopped us from being syndicated by Google News. I would like to request, with urgency, that pages of the form http://en.wikinews.org/wiki/Some_article?curid=some_numb have a robots policy of index,follow, or no robots meta tag at all.
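To illustrate (a sketch of the relevant output; the exact markup MediaWiki emits may differ slightly):

    <!-- what ?curid= page views now get, which blocks Google News: -->
    <meta name="robots" content="noindex,follow" />

    <!-- what we are asking for instead (or simply no robots tag at all): -->
    <meta name="robots" content="index,follow" />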

Thanks
-bawolff


Version: unspecified
Severity: critical
URL: http://en.wikinews.org/wiki/Airliner_crash_wounds_four_in_Durban,_South_Africa?curid=139870

Details

Reference
bz20818

Event Timeline

bzimport raised the priority of this task to Medium. · Nov 21 2014, 10:49 PM
bzimport set Reference to bz20818.
bzimport added a subscriber: Unknown Object (MLST).

brian.mcneil wrote:

This is rather annoying, as the ability to have DPL display links to pages with the curid was added *precisely* so that en.wikinews could get listed in Google News.

I don't think those get squid-cached properly...

wiki wrote:

I realize that having these pages not be cached is a Bad Thing (TM), and there is logic behind adding the noindex,follow to these essentially duplicate pages... but for Wikinews, this is a real problem. We worked very hard to get included on Google News, and now we're... off, and it is completely out of our hands. Is there any possibility that this change (adding the noindex,follow) could PLEASE be rolled back, at least until a better solution can be found?

> I don't think those get squid-cached properly...

curid requests should provide a Content-Location header; then the squids could use their cache for that URL. Currently Squid seems to take Content-Location into account only for purging (and only in squid3; Wikimedia is running 2.7.STABLE6), but it seems like a sensible feature. It's probably dependent on http://bugs.squid-cache.org/show_bug.cgi?id=1631 though.
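To sketch the idea (request and header values here are illustrative, using the URL from this report):

    GET /wiki/Airliner_crash_wounds_four_in_Durban,_South_Africa?curid=139870 HTTP/1.0
    Host: en.wikinews.org

    HTTP/1.0 200 OK
    Content-Location: http://en.wikinews.org/wiki/Airliner_crash_wounds_four_in_Durban,_South_Africa

With that header, a cache could in principle store and purge the ?curid= variant under the canonical URL.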

wmf.amgine3691 wrote:

As an alternative solution, I've built an RSS/Atom feed which can be served to Google News as a sitemap, so the links do not need to carry curid. The script is sensitive to FlaggedRevs, uses DynamicPageList-style URL parameters to sort out news articles, includes configurable maximum/minimum returns, and limits the number of category/notcategory parameters to search on.

It's just beginning to be tested, and I'm looking for beta volunteers.
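A hypothetical invocation might look like the following (the special page name and parameters are illustrative guesses based on the description above, not the script's actual interface):

    http://en.wikinews.org/w/index.php?title=Special:GNSM&category=Published&notcategory=Disputed&count=20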

wmf.amgine3691 wrote:

Google News SiteMap: an Atom/RSS feed Special page extension.

This Special page extension creates an Atom/RSS feed based on categories/notcategories/namespace and other URL-passed criteria, à la DynamicPageList (Wikimedia).

It's a bit crufty at the moment, including stuff from DPL which isn't relevant to an XML feed, but it does work. It is not fully tested. And it would get Wikinews back onto Google News.

attachment GNSM.tar ignored as obsolete

Doesn't look ready for WMF deployment, imho. Too much code, most of it likely unneeded. And there's no usage description, so it's hard even to understand what it is expected to do.

Moreover, I don't see how this extension could fix the problem. Wikinews:Google_news states that the usage of curid= is due to Google News only following links which contain numbers.

The issue could be fixed by having DPL add a dummy parameter and tricking the squids into ignoring it, plus adding cache invalidations for titles with curid=...

Google News gives us two options for allowing them to index us:
*Treat any article linked from the main page with a number in the URL as news. (The problem is that we don't like numbers in our article titles, hence the curid. It also stops us from putting developing articles on the main page, lest one of their titles contain a number.)
*Treat anything in a Google News sitemap as recently published. (This is slightly different from a normal sitemap: essentially an XML document listing pub date, categories, title, and URL. On their website they say they want anything published in the last three days to be on the sitemap. See the sketch after this list.)
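For reference, a minimal entry in such a Google News sitemap looks roughly like this (element names per Google's sitemap-news schema; the URL and values are illustrative):

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <url>
        <loc>http://en.wikinews.org/wiki/Airliner_crash_wounds_four_in_Durban,_South_Africa</loc>
        <news:news>
          <news:publication_date>2009-09-25</news:publication_date>
          <news:keywords>South Africa, aviation</news:keywords>
        </news:news>
      </url>
    </urlset>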

cheers.

wmf.amgine3691 wrote:

Platonides: I agree, it's probably not ready for deployment. I'm looking for feedback on it, which I haven't been able to find via other communication routes. Unfortunately, most of the code is valid because the feed is also designed for additional uses. The only parts not required are the display-related elements (wgUser).

A Google News SiteMap is registered with and polled by GN. They prefer this method over spidering a website because they only get the latest links. In effect it's an API, one which can also be used by more than just Google News.

wmf.amgine3691 wrote:

Google News SiteMap: an Atom/RSS feed Special page extension.

Update to GNSM Special page extension

  • Decrufted DPL parameters
  • Tested most parameters
    • Remaining untested: suppress errors, usecurid, usenamespace (not relevant at this point)
  • Added brief usage notes

attachment GNSM(2).tar ignored as obsolete

wmf.amgine3691 wrote:

Changes on the Google News side now require sitemap XML feeds only. I'm writing an additional feed class to produce this; however, it may require a different feed item class as well, since the url containers hold slightly different values.

wmf.amgine3691 wrote:

Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

Version 0.2.something

This version provides complete Atom/RSS/SiteMap XML output.

  • support for Sitemap <news:keywords>
  • support for Sitemap <lastmod>
  • support for Sitemap <url> tweaked
  • support for Sitemap <news:pubdate> tweaked

attachment GNSM.tar ignored as obsolete

wmf.amgine3691 wrote:

There are a couple of minor fixes pending (removing some debug code, adding a number-of-days parameter and an error message), but I won't be able to update for a few hours at least. Platonides said xe'd be reviewing this today, so I just wanted to make it known that there's a slight lag.

wmf.amgine3691 wrote:

Comment on attachment 6745
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

Concedo ("I concede").

brian.mcneil wrote:

This has *again* become very urgent for Wikinews.

A hack was introduced to put a hidden list of URLs with numbers in them on the main page. Google is no longer picking these up and insists on the URL containing a minimum of 3 digits. Redirects are not working.

Can this be addressed as a matter of urgency, or Amgine's proposed patch/extension be seriously reviewed with a view to putting it into use?

Unless I am unable to, I wish to reopen this for a proper review.

/me mutters "Where's Brion?"

brian.mcneil wrote:

Comment on attachment 6745
Google News SiteMap: an Atom/RSS/SiteMap feed Special page extension.

I do not know if this is obsolete, but it is urgently needed for enWN to maintain a listing in Google News.

I understand it was previously given a half-assed review and the vital internal SQL was not security-audited. This *needs* to be done. Wikinews must be listed in Google News to generate user contributions from the competition that's coming up, and Jimmy Wales is doing WikiVoices next week on Wikinews, where we want him to write an article and get it listed in Google News.

I do not want a Skypecast recorded where I'm telling Jimmy, "Uh, yes, you need to think of a title with a three-digit number in it so your article will appear in Google News".

Based on the fact that it is now considered fairly key that WN is listed on GNews to keep its readership and contributions up, I have upgraded this from major to critical. This needs fixing. Now.

/me mutters "Where's Brion?"

Working for identi.ca? He's not even CCed on this bug, so I don't think he's going to read you.
Note that the caching concerns were addressed in bug 21302.

  • I'm looking for the latest copy of the patch. I'd fix any issues found under code review.
  • Gigs (and I) were wondering if there is a test server with this code running someplace already? Otherwise I'll set one up.

brian.mcneil wrote:

I believe a test is running on wiki.enwn.net. ShakataGaNai on-wiki is who you'd need to speak to (wiki@consoletek.com).

I didn't pay much attention to them setting up the code management system for that though, so I don't know what the state of things is.

wiki wrote:

That would be me (ShakataGaNai, that is). For some unknown reason the test environment is hosed in all sorts of spectacular ways. Oddly enough, the only part that does work is GNSM, as seen here: http://wiki.enwn.net/index.php/Special:SpecialGNSM . I can set up a new demo environment if need be (one that actually works). Ping me off-bug at this address if you want me to.

Note that I committed it to SVN yesterday (r60172) in order to facilitate more collaboration.

Why are you all just waiting patiently instead of emailing me? This is a serious issue.

wmf.amgine3691 wrote:

We aren't waiting for you. There's been at least one working solution for more than a month. A temporary hack was working for most of that time.

The relevant breaking change was made in r45360, in January 2009. I don't think it would be good to revert it. The use of curid by DPL is incorrect: any random string would have done just as well to fool the Google bot, and any other random string wouldn't have had the undesired side effect. I've committed and deployed a change to set a dpl_id parameter on links when the DPL parameter "googlehack" is set. As far as I can see, this will fix the issue as reported. Just change your templates to use googlehack instead of showcurid (see the sketch below).
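In practice the template change would look something like this (a sketch using the <DynamicPageList> tag syntax of the Wikimedia DPL extension; the list parameters are illustrative):

    <!-- old: appends ?curid=<page id> to each link, which tripped the noindex rule -->
    <DynamicPageList>
    category=Published
    count=15
    showcurid=true
    </DynamicPageList>

    <!-- new: appends a harmless ?dpl_id=<page id> instead -->
    <DynamicPageList>
    category=Published
    count=15
    googlehack=true
    </DynamicPageList>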

Requests to review and enable a mostly unrelated site map extension should be made on a separate bug report.

brian.mcneil wrote:

The problem is that, precisely as the parameter name reveals, this is a HACK.

For initial listing in Google News, Wikinews had to completely cease listing any developing stories on the main page. This is a significant disincentive to attracting new contributors. (If any developing story simply had "1234" in the title, it would automatically be indexed by Google News.)

I do not want to disparage the key MediaWiki developers; they are stretched thin, and any issue on Wikipedia is far, far more visible than one on Wikinews. However, Amgine's extension is a powerful and flexible general-purpose solution to the issue of RSS, Atom, and Google News feeds from any MediaWiki install.

Can we have assurances that the permanent solution (and the addition of a gallery option to DPL) will be reviewed seriously and, hopefully, implemented in the near future? The gallery option is of great interest to Commons, where they would like to de-emphasise lower-resolution Featured Images, which are generally older.

Per Tim's suggestion, I have filed a separate bug for GNSM, bug 21919. While the googlehack parameter of DPL is certainly a step in the right direction, we would still really appreciate having a Google sitemap.