
Write and implement Google News SiteMaps (GNSM) extension for Wikinews
Closed, ResolvedPublic

Description

See discussion on Bug 20818. (I'm not sure if this should be filed under a different product than 'Wikimedia'; it didn't really seem to fit anywhere.) Per Tim's suggestion I'm refiling this in a separate bug from Bug 20818.

Basically, a variety of hacks are currently used to get Wikinews onto Google News. Google News spiders our main page and uses a variety of techniques to try to determine which links point to news articles and which do not. These techniques do not always work. (For example, the following articles were not listed despite the hack: http://en.wikinews.org/wiki/Payment_pending;_Canadian_recording_industry_set_for_six_billion_penalties%3F , http://en.wikinews.org/wiki/Boeing_787_%22Dreamliner%22_makes_maiden_flight , http://en.wikinews.org/wiki/Israel_%22illegally_annexing%22_east_Jerusalem,_EU_reports , and we really have no idea why.)

Google News also allows news websites to submit their articles via a "Google News Sitemap", a custom XML format that describes pages on a website with a minimal amount of metadata. When submitting news via a sitemap, Google relaxes its standards for detecting what a news URL is, and for anything it does not index it gives a reason why it was not indexed ( http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=93994 ). This would hopefully reduce the above-mentioned issue of articles not appearing on Google News, and in any case where an article still did not appear, we would at least know why.

To that end, as many of you know, Amgine has developed an extension called GNSM (for Google News SiteMap), which more or less takes the intersection extension and allows the results to be output as a sitemap accessed via a special page. (The extension is currently at http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GNSM/ ; demo page at http://wiki.enwn.net/index.php/Special:SpecialGNSM .)

This has the additional benefit of potentially (should it be enabled) also giving Wikinews on-site RSS feeds. Currently we rely on off-wiki methods of generating our RSS feeds. (Someone [CSpurrier?] runs a script which generates an RSS feed that is forwarded to FeedBurner for the main feed; some of the secondary feeds come from the Toolserver, which has a tendency to break.) It would be nice to have feeds generated on-wiki, if only because it would look more professional to have the RSS feed on the same domain as the site.

To conclude, as has been said before: while we understand that in some people's opinion GNSM may not be fully ready to be deployed yet, we would really appreciate it if it were developed to the point where it is ready, and then reviewed and deployed on Wikinews as soon as possible.

Thank you.
-bawolff


Version: unspecified
Severity: enhancement

Details

Reference
bz21919

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:57 PM
bzimport set Reference to bz21919.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 6899
diff to add support for mapping between category schemas.

I've added a diff that does the following:
*Fixes a syntax error in the i18n file
*Provides a method to map categories to keywords for the sitemap (Google uses different category names than we do: http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=116037 ). It adds a new message, gnsm_categorymap, which takes a list of items of the form "*mediawiki_category_name_without_leading_ns_prefix|keyword_to_output"
or "*mediawiki_cat_name_no_ns|MASK" to suppress outputting that category; a parsing sketch follows below. (The extension should perhaps automatically mask hidden categories; I wasn't sure.)

Cheers.

Attached:

wiki wrote:

Someone munged the FR translations in SpecialGNSM.i18n.php . Besides bad characters in almost every line, 'gnsm_noincludecats' specifically contains an extra apostrophe (') in "despace". Could someone please fix this?

Secondly, I've updated http://wiki.enwn.net/index.php/Special:Google_News_SiteMap to the current SVN version (as seen at http://wiki.enwn.net/index.php/Special:Version ).

/me notes the patch above that fixes the apostrophe issue :P

I've created a VM with the latest versions of MediaWiki and GNSM from SVN
( http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/GNSM )

Please send me mail with your pubkey if you would like access.

pdhanda wrote:

Just as a reference it may be good to look at these:
http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=74288
and
http://www.google.com/support/webmasters/bin/answer.py?answer=82299

I'll also apply Bawolff's patch and test all the changes.

pdhanda wrote:

Amgine,
I noticed that the parameters work as part of the url
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap/feed=rss&order=descending
but not as request parameters
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap?feed=rss&order=descending

http://www.mediawiki.org/wiki/Manual:Special_pages#The_Body_File
It may be better to use $wgRequest->getValues() instead of the $par

Bawolff,
Do you think the category mapping may be better as part of LocalSettings? It may be easier to configure per site.
-p

Note: Amgine is currently on a road trip and might have limited internet connectivity, so responses might take longer. (However, that might have just been for today.)

I am fairly certain that Amgine would not be opposed to changing the feed to use the request parameter method.

As for category mapping, I think it would be easier to do it as a system message, since:
*The categories Google accepts might change or be expanded
*The category scheme that we use might change
*Each language would use a different scheme
Since we can modify a system message without bothering/waiting for the devs, I feel that using a system message would be easier.

However, with that being said, if you think it'd be better to do it in some other fashion, please don't hesitate. I'm more concerned that such functionality exists than with how it is implemented.

Cheers,
Bawolff

brian.mcneil wrote:

I'm going to ask Josh at Google if there's someone there who can take a look at this. We're effectively developing a general-purpose add-on, one of whose main uses will be to get a MediaWiki install indexed by Google using a sitemap.

The category-matching tables that would be needed on-wiki to give Google appropriate categories are one thing, but I noticed on one of the other highlighted Google help pages that company stock ticker listings can also be provided. If a table mapping, say, [[Category:Microsoft]] to a stock ticker could be indicated by a parameter, that would be another very useful thing to set up on Wikimedia wikis.

(In reply to comment #6)

Amgine,
I noticed that the parameters work as part of the url
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap/feed=rss&order=descending
but not as request parameters
http://wiki.enwn.net/index.php/Special:Google_News_SiteMap?feed=rss&order=descending

http://www.mediawiki.org/wiki/Manual:Special_pages#The_Body_File
It may be better to use $wgRequest->getValues() instead of the $par

For stuff like this, you should really use $wgRequest->getVal('feed') instead of $par. You should really only use $par for one parameter, like Special:Listusers/sysop (an alias for Special:Listusers?group=sysop); otherwise you end up splitting and parsing $par yourself, which is kind of unnecessary. If you want to use $par for something, the feed parameter makes the most sense IMO.
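As a rough sketch of the pattern being suggested (the class name and default values are placeholders, not the extension's actual code; the 'feed' and 'order' parameters are the ones discussed above):

    // Read options with $wgRequest, and reserve $par for a single "pretty"
    // parameter, e.g. Special:GNSM/rss as a shorthand for Special:GNSM?feed=rss.
    class SpecialGNSMSketch extends SpecialPage {
        public function __construct() {
            parent::__construct( 'GNSMSketch' );
        }

        public function execute( $par ) {
            global $wgRequest;

            $feedType = ( $par !== null && $par !== '' )
                ? $par
                : $wgRequest->getVal( 'feed', 'sitemap' );
            $order = $wgRequest->getVal( 'order', 'descending' );

            // ... build the query and output the feed using $feedType and $order ...
        }
    }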

pdhanda wrote:

Bawolff, makes sense.

The fix for the request params is pretty simple, I can make it if we don't hear back from Amgine soon.

Thanks for the feedback everyone.
-p

gigs wrote:

I was planning on doing that as the first stage in cleaning it up.

I've got the test environment going and plan to take a look at it tonight.

wmf.amgine3691 wrote:

Sorry for the delays: I'm road tripping into a winter storm here, but I should have internet access fairly regularly for the next two days (or longer if we get snowed in.)

$par vs. $wgRequest: I used the first/fastest tool I found - start to finish here was about 3 weeks. After 5 weeks of trying to get exactly this kind of feedback, I gave up on the project. It's GPL, so fix/expand as you see fit. I don't own it, I just tried to fix a problem Wikinews asked me to address.

pdhanda wrote:

Brian, Jason,
I did a little cleanup. Let me know when you have verified it and whether you are satisfied with it now.
-p

brian.mcneil wrote:

Priyanka,

I'm looking at the changes you checked in (I'm a noob to php)

$par is used by dpl_param on line 305 thus:

$feedType = explode('/',$par,2)

What is actually in $par? It appears the result of the operation on line 305 should produce an array of items separated by '/'.

If $par is the URL path after "Special:GNSM/", then line 305 will only work if the "?" and the parameters after it are stripped before it gets there. That would also mean that "Special:GNSM/<feedtype>/<junk path and name>?option=xxx" would work.

Of course, "Special:GNSM/<feedtype>/<url-escaped-feed-title>?option=xxx" would be nice and allow for more descriptive URLs.

gigs wrote:

I don't have access to SVN right now (on vacation), but I have already done most of the changes to migrate to $wgRequest and get away from $par entirely. I have not implemented the multiple categories yet, and it still has some bugs. I'll try to get some time in on it tomorrow and see if I can integrate the new changes that were checked in.

What's current status on this? Has activity died down again?

wmf.amgine3691 wrote:

(In reply to comment #16)

What's current status on this? Has activity died down again?

Told you so.

pdhanda wrote:

Assigning this to Rob Lanphier so that he can figure out how to proceed.

I'm going to unassign this to signal that we'd like volunteer help on this. Mark Hershberger and I will both remain on the cc list, and can help a volunteer get this through the process.

wmf.amgine3691 wrote:

Let me see, written by volunteers, begged for feedback for months, rewritten by staff, left on hold for review for 10 months...

What do you want, Rob Lanphier?

Amgine, this extension only came to my attention this weekend. I think you'll do best by asking nicely for help on wikitech-l.

wmf.amgine3691 wrote:

No thanks. I've washed my hands of it.

This extension is awaiting review, approval, and roll-out to en.wikinews, with what was, at the time, critical-blocker priority for the en.wikinews community.

I believe there have been some rubber bands, sellotape and matchsticks holding things together in the meantime.

I understand that we have some capacity to review and roll out?

I would like to get this rolled out as quickly as possible. What would you like me to do to proceed?

Hi Kim, this is being discussed on IRC on MediaWiki-General: http://toolserver.org/~mwbot/logs/%23mediawiki/20101026.txt

Roan had a few concerns (see log) that would need to be addressed before deploying something like this. Mark Hershberger (hexmode on IRC) can also help with the review process.

We still need someone to volunteer as the main developer to make the fixes. Kim, is that what you're signing up for?

gigs wrote:

I will try to find some time to look at it again. I had a version that was partially rewritten.

I concur with Roan's observations about the code quality. It's really not up to modern PHP standards.

Don't count this as "volunteering to be the main developer", but count it as a "I'll try to look at it"

(In reply to comment #25)

I will try to find some time to look at it again. I had a version that was
partially rewritten.

Cool, can you show me?

Kim, Jason,

I'm contacting you via email to see if we can work with you to get this fixed up and deployed.

Mark.

Copying this from my notes; I'll put it on CR if it doesn't format well here.

  • From RK:
    • Detection for FlaggedRevs presence is fragile and I'm not sure it'll even work any more
    • Also, there's an arbitrary, unlimited number of joins against categorylinks, that's a no-go (has potential to make the DB servers sweat)
    • It's doing DIY category intersection
    • Also, at least one of the two sort modes is likely to be very DB-inefficient. Can't tell for sure cause I can't read the query too well, the code is not very pretty and the query is complex and very dynamic
    • Amgine: That's fine; if you can give me a general idea of what the query is gonna look like, I can poke at it (to optimize the above)
    • Amgine: What I've given you just now should keep a dev busy for a few hours already. Am of course available for general help, but don't think it'd be useful to do an exhaustive review before those things have been taken care of
  • From me:

Use XMLWriter or DOM to create XML. That way you don't have to do xmlEncode(). (See the sketch after these notes.)

  • Check http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=74288 ; it looks like there are some things in the GNSM format that should be handled:
    • should limit the number of URLs (1000?)
    • is there a way to integrate multiple sitemaps?
    • document robots.txt?
  • Hate the use of $params as a member variable.
    • You should just copy things from $wgRequest after checking them.
    • Having a member variable named $params seems like the beginning of confusion.
    • Also, put $wgRequest directly in the foreach, or at the very least closer to the point of use, to avoid confusion with the others.
    • If you don't intend for the result of $wgRequest to be used, then it shouldn't be in $params at all.
  • Code indentation looks lacking.
  • Possibly incorrect use of wfEmptyMsg(??). Not sure about this, but the call doesn't match the function declaration in GlobalFunctions, while examples elsewhere abound.
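As a rough illustration of the XMLWriter suggestion (the element names follow Google's published news-sitemap schema; the values and surrounding code are made up for this sketch):

    // Emit one <url> entry of a news sitemap with XMLWriter, so no hand-rolled
    // xmlEncode() is needed; the library escapes text and attributes itself.
    $w = new XMLWriter();
    $w->openMemory();
    $w->startDocument( '1.0', 'UTF-8' );
    $w->startElement( 'urlset' );
    $w->writeAttribute( 'xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9' );
    $w->writeAttribute( 'xmlns:news', 'http://www.google.com/schemas/sitemap-news/0.9' );

    $w->startElement( 'url' );
    $w->writeElement( 'loc', 'http://en.wikinews.org/wiki/Some_article' );
    $w->startElement( 'news:news' );
    $w->writeElement( 'news:publication_date', '2010-11-01T12:00:00Z' );
    $w->writeElement( 'news:keywords', 'Politics' );
    $w->endElement(); // news:news
    $w->endElement(); // url

    $w->endElement(); // urlset
    $w->endDocument();
    echo $w->outputMemory();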

See r76544 for an easier-to-read review.

gigs wrote:

Thanks Mark. That will be helpful.

As an update, Kim and I are trying to find my partially refactored version. I'm sure we have a copy somewhere. :)

And I have now found the partially refactored version.

(In reply to comment #32)

Kim, Could you commit that code?

Sure. One small snag.

Going through the steps to commit, while attempting to merge and compare with the code that was already there, the location

http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/GNSM

no longer appears to exist.

Further, the google query:

GNSM site:http://svn.wikimedia.org/

returns no hits.

Has this been deleted or moved? What is the current location?

Ok, there are some pretty major discrepancies. All the files have been renamed, for starters. I'm going to have to take a better look this coming week. ^^;;

gigs wrote:

I know I merged two of the files. The renaming, I don't remember if I did it or if it happened in SVN since my development started.

As far as my changes go, mostly from memory:

Started to clean up the parameter handling to use $wgRequest instead of the home-rolled thing. This means the documentation is going to be wrong about the naked options (i.e. you need to use var=option instead of just /option).

Major cleanup of the incorrect indent levels. All spaces should be tab characters now, per the MW standards.

Merged the XML generation into the main file. Still need to move that to XMLWriter or something similar, but the structure of what it's doing is similar now. Moved some of it out of a class where it didn't really belong.

Cut out some sections of cruft, and simplified many of the deeply nested "if" statements to use early return logic.

Broke the "notcategories" handling, I don't remember why but there was a good reason. I think it was part of the params conversion.

For an "easy" changelog, please see http://www.mediawiki.org/w/index.php?path=/trunk/extensions/GoogleNewsSitemap&title=Special:Code/MediaWiki/path

There haven't been many functional changes (bar people going through and removing stuff while doing other general cleanup along the way)

I split the code files

A version of Mark Hershberger's review above is also on Code Review: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/76544#c11017

gigs wrote:

I think the easiest thing would be to check my copy in, losing Reedy's changes.

Mine seems to be further along, and it should include all the whitespace fixing that Reedy did as well, since we duplicated effort there.

I don't care about the naming if the SVN name makes more sense.

Feel free to put it over the top

gigs wrote:

So at this point, most of the major code style problems have been addressed. There's still a lot of overlap between this and DPL though. I don't really have a clear way forward because I don't know which DPL features we want to keep in this and which ones can be stripped out.

Jason:

Sorry I didn't see your update earlier. The most recent changes I see are from you about a month ago (2010-12-31). Let me know if those are your most recent changes and I or someone else will look at what more needs to happen.

wmf.amgine3691 wrote:

There is a renewed need for this functionality on en.WN. Is there any way WMF devs could take a look at this bug?

gigs wrote:

The code style issues raised before should be addressed now. SQL is parameterized, and the output is using an XML library.

The only remaining problem is the large overlap between this and DPL. Without a clear statement of what functionality can be removed and what is required, no one can move forward with stripping out the unnecessary DPL-based functions.

It's annoying. Though even if it's not an easy task to strip out the duplication, IMHO it's not really a blocker for deployment, as long as the code is up to standard.

Though, the echo calls need tidying up...

gigs wrote:

I don't think we ever want to output the full skinned html here.

I poked at this last weekend, and a little bit this week.

I think most of the issues brought up in previous reviews have been addressed, and that this might be ready for another review.

Some remaining issues that I'm not sure about and would really like some input on:
Major ones:
*Caching. Currently it caches the feed in memcached for up to a maximum of 12 hours. (It also does checks to see if the feed is still valid, so the moment someone adds something to the relevant categories the cache is no longer considered valid, so everything is nice.)
**Is 12 hours an appropriate max time limit (given that cache entries are checked that they're still good before being used)?
**Should we be sending maxage/s-maxage headers to cache on the client side and/or in squids? (Currently it's sent with Cache-Control: private, I believe.) If we did do caching at that level, the longest (s)maxage that would be acceptable would be very short (like half an hour), since we want these things to be up to date.
**Should cache times differ between Sitemap and RSS/ATOM formats?

*It does a check to see if cl_timestamp is null, and returns the current timestamp if so. This seems very wrong. I'm not sure under what circumstances a null cl_timestamp can happen (there was one, but it looked accidental and is now removed). I'm not sure if checking for a null cl_timestamp and throwing an exception if that happens is the right way to go. (Is there any circumstance in which cl_timestamp can be null?)

*The flagged revision support is flaky.
This may be a valid concern. It's outside my knowledge of flagged revs what the best approach is. It appears that DPL also detects FlaggedRevs in the same way.

Minor ones:
*The default parameters could be tweaked. I'm not sure what the ideal defaults would be. A lot depends on which of the use cases we want to optimize for.
*[bikeshed issue] This extension does RSS feeds based on category membership. That feature may be of general interest to some people, totally independent of the Sitemap aspect. Perhaps a more general name should be chosen, like IntersectionFeed. But that's really not important.

Response to some of the review comments:
*"It's doing DIY category intersection"

That is sort of the feature request. (The intersection part, that is; I'm not sure what was meant by the DIY part.)

*"Also, at least one of the two sort modes is likely to be very DB-inefficient. Can't tell for sure cause I can't read the query too well, the code is not very pretty and the query is complex and very dynamic"

The lastedit sort order (which is really ordering by last_touched, which doesn't correspond to edits) is rather scary. (I'm not very familiar with DB scaling and such, but I think it does a full table scan of the page table and a filesort, which is very, very bad from my understanding.) Anyway, lastedit is not very important to the use case and could be removed.

The categoryadd sort order is not as bad (still not great). It might filesort if the smallest category specified is not the first category specified. It is no worse than the already deployed intersection (DynamicPageList) extension (which has many scary queries ;). Actually it's the same query as intersection.
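For reference, a sketch (illustrative only, not the extension's exact code; the category names are just examples) of what such a two-category intersection ordered by categorization time boils down to:

    // Roughly the shape of the intersection query being discussed: join
    // categorylinks once per requested category, order by when the page was
    // added to the first one, and take the newest few entries.
    $sql = "SELECT page_namespace, page_title, c1.cl_timestamp
        FROM page
        JOIN categorylinks AS c1 ON c1.cl_from = page_id AND c1.cl_to = 'Published'
        JOIN categorylinks AS c2 ON c2.cl_from = page_id AND c2.cl_to = 'Politics_and_conflicts'
        ORDER BY c1.cl_timestamp DESC
        LIMIT 20";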

*"General hating on direct output/echo/$wgOut->disable()".

This is done the same way as it's done in core. In an ideal world, this extension might be better done as an API module, but I don't think that's a critical issue. Most special pages that output an XML document call $wgOut->disable(). All the feed stuff in core calls $wgOut->disable() and outputs with echo.

*"is there a way to integrate multiple site maps?"
We are not anywhere near outputting more than 1000 articles over a 2 day period. So I don't feel this is a near or long term concern.

Thanks,
-bawolff

The other thing I wanted to mention is the order parameter (order ascending or descending). Do we really need that? Is there any valid use case for wanting an RSS feed of the earliest 20 articles?

(In reply to comment #47)

**Is 12 hours an appropriate max time limit (Given that cache entries are
checked that they're still good before being used).

IMO, yes. Someone who really knows about caching would have to give you a more definitive answer. See Roan or Tim.

How are the cache entries checked? (I'm about to look at the code, so I may find out.)

**Should we be sending maxage/s-maxage headers to cache on the client side
and/or in squids. (Currently its sent with cache-control private i believe)

I think so. Again, ask Tim or Roan to be sure.

**Should cache times differ between Sitemap and RSS/ATOM formats.

No idea. Why would you want them to be different?

  • (Is there any circumstance cl_timestamp can be null?)

Dunno.

*The flagged revision support is flaky.

We should definitely get Aaron Schulz to comment.

(More in a bit.)

Adding Aaron to get his input on FlaggedRevisions support.

(In reply to comment #48)

The other thing I wanted to mention is the order parameter (order ascending or
descending). Do we really need that. Is there any valid use case for wanting an
rss feed of the earliest 20 articles?

I don't think so, no.

(In reply to comment #47)

*"General hating on direct output/echo/$wgOut->disable()".

This is done the same way as its done in core. In ideal world, this extension
might be better done as an api module, but I don't think thats a critical
issue. Most special pages that output an xml document call $wgOut->disable().
All the feed stuff in core calls $wgOut->disable() and outputs with echo.

I'm pretty sure (but could be wrong) that using the excuse, "core works that way" is the wrong reasoning for an extension to use. If you know of another widely deployed extension that does output this way, then I think you have a case.

Now, going to look at the code.

I'm pretty sure (but could be wrong) that using the excuse, "core works that
way" is the wrong reasoning for an extension to use. If you know of another
widely deployed extension that does output this way, then I think you have a
case.

I'm pretty sure that if we didn't do that, we'd need to re-implement most of the RSS stuff in the extension, which seems pointless given the classes in core. As for other extensions, we've got SiteMatrix (when you do ?action=raw), OpenID, OAI, and some others.

gigs wrote:

I don't see any point in not using direct output either, and I don't see a reasonable alternative to it in this situation.

(In reply to comment #49)

(In reply to comment #47)

**Is 12 hours an appropriate max time limit (Given that cache entries are
checked that they're still good before being used).

IMO, yes. Someone who really knows about caching would have to give you a more
definitive answer. See Roan or Tim.

12 hours seems good to me. Given that you're checking the validity using a query before using the cached data, I guess a higher value like 24 or 48 might be better, but I have no strong feelings either way.

How are the cache entries checked? (I'm about to look at the code, so I may
find out.)

Basically it queries cat_pages for every category (using a single query) and MAX(cl_timestamp) for every category (using a big UNION). This is in getCacheInvalidationInfo().
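Roughly, that check looks something like this (a simplified sketch, not the exact code in getCacheInvalidationInfo(); in the real code the per-category timestamp lookups are combined into one UNION as described above):

    // Build a fingerprint of the relevant categories: their page counts plus
    // the newest cl_timestamp in each. If the fingerprint changes, the cached
    // feed is treated as stale.
    $categoryTitles = array( 'Published', 'Politics' ); // example categories
    $dbr = wfGetDB( DB_SLAVE );

    $fingerprint = array();
    $res = $dbr->select(
        'category',
        array( 'cat_title', 'cat_pages' ),
        array( 'cat_title' => $categoryTitles ),
        __METHOD__
    );
    foreach ( $res as $row ) {
        $fingerprint[] = $row->cat_title . ':' . $row->cat_pages;
    }

    foreach ( $categoryTitles as $cat ) {
        $fingerprint[] = $dbr->selectField(
            'categorylinks', 'MAX(cl_timestamp)', array( 'cl_to' => $cat ), __METHOD__
        );
    }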

**Should we be sending maxage/s-maxage headers to cache on the client side
and/or in squids. (Currently its sent with cache-control private i believe)

I think so. Again, ask Tim or Roan to be sure.

Yes, you should set Cache-Control: public, max-age=1800, s-maxage=1800 to get the half-hour caching you suggested. If you do that and ensure stable URLs (i.e. always use the same URL for the same data) there'll only be one app server regenerating each sitemap once every half hour at most, which sounds very reasonable to me.

**Should cache times differ between Sitemap and RSS/ATOM formats.

No idea. Why would you want them to be different?

I agree with Mark here, unless you have a good reason for them to be different, make them the same.

**Should cache times differ between Sitemap and RSS/ATOM formats.

No idea. Why would you want them to be different?

I agree with Mark here, unless you have a good reason for them to be different,
make them the same.

The reasoning for maybe doing that is that people would want published articles appearing on Google the moment they're published, whereas the general RSS feeds are less important. Thus it might make sense to cache the RSS/Atom output in Squid for half an hour or whatnot, but have the GNSM output not be Squid-cached.

It should now have an s-maxage of 1800 in r87140.

I was originally going to output the header directly in order to set max-age to 1800 as well, but it appears that the underlying RSS feed class in core already uses $wgOut->sendCacheControl(), so I just used $wgOut's methods for setting s-maxage. This means it doesn't set the max-age header for client-side caching, and it won't set the s-maxage header if the user is logged in, which is not ideal, but I don't think it's really a problem.
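In other words, something along these lines (a simplified sketch of the approach described, not the exact committed code):

    // Raise the CDN/Squid cache lifetime; OutputPage::sendCacheControl(),
    // called by the core feed code, then emits the s-maxage header.
    // No explicit max-age is set for client-side caching, and s-maxage is not
    // sent for logged-in users, as noted above.
    global $wgOut;
    $wgOut->setSquidMaxage( 1800 ); // half an hour, as discussed above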

I've deployed this to testwiki. I encourage people to test it out a bit then we can look at a push to the live site.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

(In reply to comment #57)

I've deployed this to testwiki. I encourage people to test it out a bit then we
can look at a push to the live site.

Could you also deploy r86451 of FlaggedRevs? The flagged revs part requires at least that revision. Thanks.

I've also noticed some messages seem to be missing (?)

Ok, there were some serious issues with how it did flagged revs (as in, it caused a DB error or otherwise didn't work), but I believe I fixed them in r87236. They appear to have been in the extension since the get-go, so I guess I somehow managed to never test the flagged revs part of this extension until now :s. Whoops.

On a related note, the extension won't recognize that the cache (the part stored in $wgMemc that lasts 12 hours) is outdated if a page suddenly becomes includable because it got sighted (or because all its revisions got unsighted). I do not think this is a significant issue for Wikinews, since any change in sighting also involves adding (respectively removing) Category:Published to the article, and if in doubt the extension respects ?action=purge. However, it could probably be addressed by storing and checking how many articles are sighted in namespace X (which I believe is stored somewhere in the DB, since it's shown on Special:ValidationStatistics), similar to what it does for categories currently.

Has been deployed to enwikinews -> FIXED

wmf.amgine3691 wrote:

<coughs> Temporarily undeployed. I hope.

Somewhat befuddled, I went information-hunting on IRC.

GNSM is temporarily undeployed due to a CPU usage spike on the Wikinews server. Several people are on planes at the moment. A new deployment will be tried on Friday.

The CPU spike might have been coincidental with the activation of GNSM. The next try will probably be Monday (not Friday). The actual cause of the CPU spike has not yet been determined.

btw, before I forget: when this does get re-enabled, enwikinews should have the following config setting:
$wgGNSMcommentNamespace = 102;

since enwikinews uses namespace 102 instead of the normal talk namespace for "comments". (Not that important, since it's only output in RSS mode and no one uses it, but we might as well.)
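For reference, the relevant LocalSettings.php lines would look roughly like this (the entry-point filename is an assumption based on the usual ExtensionName.php convention of the time; adjust to wherever the extension actually lives):

    # Load the extension and point it at en.wikinews' "Comments" namespace.
    require_once( "$IP/extensions/GoogleNewsSitemap/GoogleNewsSitemap.php" );
    $wgGNSMcommentNamespace = 102;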

Alrighty, sitemap appears to work:

In order to actually submit it to google, we need someone with access to "Google Webmaster Tools" for the en.wikinews domain. Hopefully someone in ops has such access? See http://www.google.com/support/webmasters/bin/answer.py?answer=74289

It'll just require someone to verify authenticity and set it up.

There are three ways to do this, from memory.

wmf.amgine3691 wrote:

(In reply to comment #66)

In order to actually submit it to google, we need someone with access to
"Google Webmaster Tools" for the en.wikinews domain.

Bawolff: have you verified with User:Brian_McNeil that this step has not already been accomplished?

I think "someone" already has Google Webmaster Tools access.

(In reply to comment #68)

(In reply to comment #66)

In order to actually submit it to google, we need someone with access to
"Google Webmaster Tools" for the en.wikinews domain.

Bawolff: have you verified with User:Brian_McNeil that this step has not
already been accomplished?

Yes, I discussed it with him on IRC. It has not been done yet. He mentioned that he had some contact info for Google on an old computer that would be difficult to find. I don't think he has access to Google Webmaster Tools.


btw, just to clarify, the relevant URL we'd want to submit to Google would be: http://en.wikinews.org/wiki/Special:NewsFeed?hourcount=48&notcategories=AutoArchived|Archived|No_publish|disputed

(In reply to comment #71)

Shouldn't adding

Sitemap:
http://en.wikinews.org/wiki/Special:NewsFeed?hourcount=48&notcategories=AutoArchived|Archived|No_publish|disputed

to http://en.wikinews.org/wiki/MediaWiki:Robots.txt work too?

According to the docs, if you've previously submitted one and just need to change the url, you can do that, but the first time you submit a news sitemap you need to do it through webmaster tools. See http://www.google.com/support/webmasters/bin/answer.py?answer=74289

NeilK might be worth talking to; IIRC he was the WMF employee talking to Google about our sitemaps support.

(CCing NeilK).

neilk wrote:

I don't know anything special about the Google News Sitemap format. I was looking into the Image extensions for Commons. Then I got interested in the question of why we don't do sitemaps generally. I've collected some info about that which I can share here, but I think that WikiNews is thinking somewhat differently about Sitemaps.

It seems that you're really interested in getting recent items indexed fast, so GNSM works (as far as I can tell) more like an RSS feeder -- there's a single URL you hit to get the latest stuff, and if you miss an hour or two, too bad.

But for Wikipedia and Commons, we want the entire collection indexed, and hopefully by more than just Google, and with richer metadata. This is going to take some kind of incremental approach, long term, although we can do some of it even with the old batch-oriented sitemap scripts.

(In reply to comment #74)

I don't know anything special about the Google News Sitemap format. I was
looking into the Image extensions for Commons. Then I got interested in the
question of why we don't do sitemaps generally. I've collected some info about
that which I can share here, but I think that WikiNews is thinking somewhat
differently about Sitemaps.

It seems that you're really interested in getting recent items indexed fast, so
GNSM works (as far as I can tell) more like an RSS feeder -- there's a single
URL you hit to get the latest stuff, and if you miss an hour or two, too bad.

But for Wikipedia and Commons, we want the entire collection indexed, and
hopefully by more than just Google, and with richer metadata. This is going to
take some kind of incremental approach, long term, although we can do some of
it even with the old batch-oriented sitemap scripts.

Yes, quite correct. Normal sitemaps, and "news" sitemaps serve rather different purposes. At this point we just need someone with access to Google Webmaster tools.

wmf.amgine3691 wrote:

As per hexmode on IRC: I'd like to ask pdhanda to sign up for and verify Google's Webmaster Tools for en.Wikinews, and then force Bawolff to be responsible for it (since xe is both a community member and the current maintainer of the extension).

wmf.amgine3691 wrote:

Is this bug still waiting for someone to verify for Google's Webmaster Tools? This is a 5-minute fix.

I now have the account webmaster@wikimedia.org and will stash the password with Robla and CTWoo. After that, updating this will take just as long as it takes, but no longer.

Created attachment 8649
file google wants

For not-too-distant future reference.

Attached:

+patch, we don't remove this keyword ;p