
Get dbpedia off OAI
Closed, Resolved (Public)

Description

Cf. an email from Sebastian Hellmann in February 2013:

"We built *a lot* of infrastructure which depends on the updates.
So if the OAI-PMH stream suddenly stopped working, it would jeopardize three or four open-source projects and cause a lot of problems further down the data chain (i.e. the people who get the data from us).

So yes, we are still using the OAI-PMH stream, and we even plan to extend its usage to more language versions of Wikipedia and many language versions of Wiktionary.
Of course, we are willing to change to the MediaWiki API if necessary (and we also have the manpower to achieve this within several months).
There were a few reasons why we didn't switch yet:

  1. we have a running system, there is no real incentive to switch unless you tell us to.
  2. we didn't have a contact from Wikimedia. I wrote one or two emails in the past, but didn't get a response.
  3. We did not find any good documentation on how to get *all* updates from Wikipedia. Query RC and then do Special:Export requests?
  4. We were afraid to get blocked, since we would be over the 1 request per second limit.

We would be happy if we could get into contact and settle this matter so that we stay compatible with the future. We are in contact with WikiData already (Anja Jentzsch worked on DBpedia before)."

I re-enabled OAI auditing earlier today, and it would seem, at the time of writing, that DBpedia is the only user of the OAI interface...


Version: wmf-deployment
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=68866
https://bugzilla.wikimedia.org/show_bug.cgi?id=67623

Details

Related Objects

Status      Assigned
Resolved    Reedy
Resolved    ori

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:31 AM
bzimport set Reference to bz68538.
bzimport added a subscriber: Unknown Object (MLST).

Note that the old search used OAI for internal updates, at least on some wikis, but this should be gone soon with the full CirrusSearch deployment.

What's the situation with the new RCStream and related services -- can these be adapted to send page text as well, or do we have a better way for them to do that kind of data fetch?

Since 20140724215329

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|         0 | NULL         |             1055 |
|         6 | lsearch2     |           126808 |
|        12 | fresheye.com |             5967 |
|        13 | dbpedia      |            38854 |
+-----------+--------------+------------------+
4 rows in set (0.37 sec)

Just purged out some old rows (only rows from March 2015 onwards remain now), which means Lucene isn't using it anymore (yay!)

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|        12 | fresheye.com |            52162 |
|        13 | dbpedia      |           410536 |
+-----------+--------------+------------------+
2 rows in set (0.22 sec)

mysql:wikiadmin@db1038 [oai]>
In T70538#1475535, @ori wrote:

How do we move forward on this?

For a start, running Reedy's query again. :)

Looks like they're still using it.

The number of queries is decreasing month on month, though.

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|        12 | fresheye.com |           134100 |
|        13 | dbpedia      |           910646 |
+-----------+--------------+------------------+
2 rows in set (0.63 sec)

mysql:wikiadmin@db1038 [oai]> delete from oaiaudit where oa_timestamp < 20150401000000;
Query OK, 365163 rows affected (16.82 sec)

mysql:wikiadmin@db1038 [oai]> delete from oaiaudit where oa_timestamp < 20150501000000;
Query OK, 253020 rows affected (11.95 sec)

mysql:wikiadmin@db1038 [oai]> delete from oaiaudit where oa_timestamp < 20150601000000;
Query OK, 186407 rows affected (7.05 sec)

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|        12 | fresheye.com |            36701 |
|        13 | dbpedia      |           203462 |
+-----------+--------------+------------------+
2 rows in set (0.14 sec)

mysql:wikiadmin@db1038 [oai]> delete from oaiaudit where oa_timestamp < 20150701000000;
Query OK, 84656 rows affected (4.24 sec)

mysql:wikiadmin@db1038 [oai]> select oa_client, ou_name, count(oa_client) from oaiaudit left join oaiuser on oa_client = ou_id group by oa_client;
+-----------+--------------+------------------+
| oa_client | ou_name      | count(oa_client) |
+-----------+--------------+------------------+
|        12 | fresheye.com |            15963 |
|        13 | dbpedia      |           139551 |
+-----------+--------------+------------------+
2 rows in set (0.06 sec)

mysql:wikiadmin@db1038 [oai]>

Hi, due to the recent switch in the Wikipedia API from HTTP to HTTPS, DBpedia stopped feeding for 1-2 weeks until we identified and fixed the problem.
Are there any plans for OAI? We certainly want to keep getting update feeds, but we can switch to a new service if we can get a similar API.

Hi, due to the recent switch in the Wikipedia API from HTTP to HTTPS, DBpedia stopped feeding for 1-2 weeks until we identified and fixed the problem.
Are there any plans for OAI? We certainly want to keep getting update feeds, but we can switch to a new service if we can get a similar API.

There is https://wikitech.wikimedia.org/wiki/RCStream

Does RCStream have a plugin that keeps a local Wikipedia copy up to date?

One of our problems is the extensive use of the Wikipedia API, which is why we keep a local clone.
For every updated item we make one or two requests to our local API, and additionally we re-parse all unmodified pages at least once every 30-45 days.

Does RCStream have a plugin that keeps a local Wikipedia copy up to date?

@Legoktm, @Krinkle, @ori: Can any of you answer that and help DBpedia here?

Does RCStream have a plugin that keeps a local Wikipedia copy up to date?

Not that I'm aware of. Someone could probably write one though...

Does RCStream have a plugin that keeps a local Wikipedia copy up to date?

RCStream doesn't make edits to your wiki or provide database queries, but you can build on top of it. It provides real-time notifications for all public change events and log entries (edits, page creations, uploads, deletions, undeletes/restores, etc.). The socket provides a constant stream of JSON-formatted events that can be filtered as widely or narrowly as needed.

To fetch content, you'd best use RESTBase whenever possible, for an easier-to-use interface and better latency. For anything else you'd query the MediaWiki API (when retrieving actual content in response to RCStream events).
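
A minimal consumer sketch, assuming the Socket.IO-based interface described on the RCStream wikitech page and the Python socketIO-client library; the event field names and the follow-up content fetch are assumptions for illustration, not an official client:

import socketIO_client
import requests

class WikiNamespace(socketIO_client.BaseNamespace):
    def on_connect(self):
        # Subscribe to one wiki's change events (assumed subscription format).
        self.emit('subscribe', 'en.wikipedia.org')

    def on_change(self, change):
        # Each event is a JSON object; edit events carry the page title
        # and the new revision id.
        if change.get('type') != 'edit':
            return
        title = change['title']
        rev_id = change['revision']['new']
        # Fetch the updated wikitext via the MediaWiki API (or RESTBase).
        r = requests.get('https://en.wikipedia.org/w/api.php', params={
            'action': 'query', 'prop': 'revisions', 'revids': rev_id,
            'rvprop': 'content', 'format': 'json'})
        print(title, rev_id, len(r.text))

socketIO = socketIO_client.SocketIO('stream.wikimedia.org', 80)
socketIO.define(WikiNamespace, '/rc')
socketIO.wait()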

We need OAI for two reasons:
(1) to get the update stream (which is solved with RCStream) and, just as important, (2) to have a local Wikipedia mirror on which we can exceed the Wikimedia API rate limits.

For (2) we do not have the resources or knowledge to adjust RCStream to perform edits. Would it be possible to get a special (bot) account with raised (or no) rate limits? That would be a way to make OAI obsolete for us.

We need OAI for two reasons:
(1) to get the update stream (which is solved with RCStream) and, just as important, (2) to have a local Wikipedia mirror on which we can exceed the Wikimedia API rate limits.

For (2) we do not have the resources or knowledge to adjust RCStream to perform edits. Would it be possible to get a special (bot) account with raised (or no) rate limits? That would be a way to make OAI obsolete for us.

Rate limits for what? API queries? (As obviously you're not making edits)

If so, yes, we have http://meta.wikimedia.beta.wmflabs.org/wiki/Special:GlobalGroupPermissions/global-bot

Read API calls. For each page we process we make two calls: one to get the wikitext and another to get the text of the first paragraph of the page (the abstract) using the TextExtracts extension.

For edit feeds that come from RCStream we can get the wikitext from the feed directly, but we still need another call for the abstract.
In addition to the edit feed we have two other feeders:

  1. one driven by changed mappings between Wikipedia infobox templates and DBpedia mappings (see http://mappings.dbpedia.org), and
  2. one that re-parses all articles that have not been processed (through the above two feeders) for more than 30 days.

For (1) and (2) we'd need two API calls for each page we process.
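
A rough sketch of those two read calls against a stock MediaWiki API endpoint, using core revision queries and the TextExtracts extension; the endpoint URL, User-Agent value and helper names are placeholders, and the calls DBpedia actually makes may differ:

import requests

API = 'https://en.wikipedia.org/w/api.php'  # or a local mirror's api.php
HEADERS = {'User-Agent': 'dbpedia-example/0.1 (https://example.org; contact@example.org)'}

def fetch_wikitext(title):
    # Call 1: latest wikitext of the page.
    r = requests.get(API, headers=HEADERS, params={
        'action': 'query', 'prop': 'revisions', 'rvprop': 'content',
        'titles': title, 'format': 'json'})
    page = next(iter(r.json()['query']['pages'].values()))
    return page['revisions'][0]['*']

def fetch_abstract(title):
    # Call 2: plain-text intro (abstract) via the TextExtracts extension.
    r = requests.get(API, headers=HEADERS, params={
        'action': 'query', 'prop': 'extracts', 'exintro': 1, 'explaintext': 1,
        'titles': title, 'format': 'json'})
    page = next(iter(r.json()['query']['pages'].values()))
    return page['extract']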

Technically, there's no rate limiting for read requests etc - https://www.mediawiki.org/wiki/API:Etiquette

Using a decent user-agent header will help massively...

Or have you actually experienced any problems?

IIRC, when this project started (5 or more years ago) there was a 1 req/sec restriction and we were processing ~130 pages/minute.
This was the reason we chose the OAI + Wikipedia mirror approach.

Can you suggest a specific header that we can use for this purpose?
If there is no problem with the rate limit, we will try to move to RCStream in the following months.

FYI, you made just shy of 150,000 OAI requests in August according to our audit logs. Not sure how that directly translates to data sent/received, but it's useful for metrics.

Something identifying DBpedia in the User-Agent (and maybe a version) would be useful enough. You could break it down further depending on the libraries you are using for the request.
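
A purely illustrative header in the format suggested on API:Etiquette; the product name, URL, contact address and library token below are placeholders, not real DBpedia values:

User-Agent: DBpediaLiveExtraction/1.0 (https://example.org/dbpedia-live; contact@example.org) http-client-library/x.y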

We had some problems during the summer with the switch from HTTP to HTTPS in the Wikipedia APIs, and some live clients were broken for a period.

I created an issue to track this here: https://github.com/dbpedia/extraction-framework/issues/415
and will keep you posted on updates.

We had some problems during the summer with the switch from HTTP to HTTPS in the Wikipedia APIs, and some live clients were broken for a period.

I created an issue to track this here: https://github.com/dbpedia/extraction-framework/issues/415
and will keep you posted on updates.

Thanks. How much time do you anticipate you will need? I'd like to set a firm deadline for sunsetting OAI as soon as possible.

We are low on human resources at the moment but will try to push this a bit. Hopefully in the following (2-3) months.

We are low on human resources at the moment but will try to push this a bit. Hopefully in the following (2-3) months.

@Jimkont, OK -- but at minimum, I'd like to set a firm date for sunsetting OAI. Would January 1, 2016 work? That would be three and a half months from the date of your comment.

@Jimkont: Would it be possible for you to define a firm date for sunsetting OAI, so we can all work towards a deadline, please? Thanks in advance!

@Aklapper: can we push the deadline back two months, to the end of February? Thanks

@Jimkont: OK. We'll plan to turn off OAI on March 1, 2016.

@Jimkont: Just a final heads-up (as there has been no news in this task for the last four months) that OAI will be deactivated on March 11th, 2016 as per T70866#2065929.

Change 276817 had a related patch set uploaded (by Ori.livneh):
Disable OAI extension

https://gerrit.wikimedia.org/r/276817

According to the OAI audit table, the last access by DBpedia was on 2016/02/01.

So I guess @Jimkont managed to do the migration.

ori claimed this task.

Thanks @Jimkont!