Introduce page creation log
Closed, Resolved (Public)

Authored By: Slakr, Jun 22 2007, 5:35 AM
Referenced Files
F3998: createpg.patch
Feb 3 2016, 11:11 PM

Description

Problem statement

We currently have a "Page deletion log", but not a "Page creation log". This proposal is to add a create/create log type, and log an event with that type upon the creation of a new page (e.g. first revision).

Original description

Adds an article creation log to Special:Log

There was discussion on making a page deletion log, and it came down to a bunch of indexes and such being added, things being changed around, and all-around confusion.

Thus, I decided to kill two birds with one stone and write this nifty little gadget. I figured if there was a "Deletion log" there should be a "Creation log" as well. This is a patch against MediaWiki 1.10.0, so that each time someone creates a page, it gets added to the log. This way, if the page gets deleted, it gets redlinked, and if it's alive, it's bluelinked. I figure it's a hack to the concept of a "deleted pages" log, but it's most definitely an enhancement to fishing through revisions to find the original page creator. Anyhooo...

There's a caveat. Since I don't know half of the languages that MW supports, there's going to be a problem. Adding a creation log requires a couple of edits (check out the patch) to add full multi-language support; otherwise, it'll just turn up as "createpg", which is very user-unfriendly. So, right now it only supports English out of the box. Sorry. :(

There's another caveat: it's only semi-backwards-compatible with your current database. That is, the patch only works from installation onward, in that entries in the creation log will only appear once someone creates a new page after you apply the patch. So, in order to get a full page creation log, you (or someone else) will need to write a script to add the appropriate entries. Otherwise, it will work fine with your existing installation.

Instructions:

  1. Grab the patch, save it into your brand spankin' new MediaWiki root directory.
  2. Run patch -p0 < createpg.patch
  3. If your installation's language is not primarily English, translate to your native language the 'createpglogtext', 'createdarticle', and 'createpglogpage' lines of languages/messages/MessagesEn.php.

Tested on MediaWiki 1.10.0, PHP 5.2.3 (fcgi, debug).

If you have any questions, comments, concerns, or if I totally botched something, please feel free to contact me.

Cheers,

Kurt Radwanski
irc: slakr@freenode or galaxynet.
en.wp: Slakr


See also: T44135: Add page creator index to MediaWiki core

Event Timeline

In the TechCom meeting today we decided that this is mostly a product decision with various impacts that need to be considered with regard to database performance...

@jcrespo: Any concerns from a DBA perspective? This change would add new entries into the logging table for each page creation. There is no plan to backfill for previous creations, so the database impact would be (roughly):

  • enwiki: +7000-8000 log entries per day
  • eswiki, frwiki, ruwiki, itwiki: +1000-2000 log entries per day
  • most other wikis: fewer than 1000 log entries per day
  • wikidatawiki: ?

The code is here: https://gerrit.wikimedia.org/r/#/c/399897/6/includes/page/WikiPage.php

Please research the size (in bytes) for wikidatawiki in a worst-case scenario, and give the total size (in bytes) of the table for wikidata, commons and enwiki now and in a year, with and without the feature. Please also research the number of extra IOPS on s3 (800 wikis). Tall tables are not an issue; the issue is large ones. We may not have the disk available, or the table may become larger than we can handle for future schema changes, requiring logical partitioning. We may or may not have the disk to handle the extra write operations.

The current sizes for reference:

enwiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |   84080214 |          14876147712 |           29504831488 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

commonswiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  248227776 |          61615374336 |           95083823104 |                248 |
+-----------+------------+----------------------+-----------------------+--------------------+

wikidatawiki:
+-----------+------------+----------------------+-----------------------+--------------------+
| tablename | table_rows | data_length in bytes | index_length in bytes | average row length |
+-----------+------------+----------------------+-----------------------+--------------------+
| logging   |  606370402 |         107029200896 |          152214437888 |                176 |
+-----------+------------+----------------------+-----------------------+--------------------+

With those sizes, if growth is proportional, there will be about 70000 new records per day on wikidata. I am not sure about the others, but I am sure that wikidata logging-related queries will stop working, or the servers will run out of space and require extra provisioning.

I still have no data from s3, which is where the extra IOPS would be worrying.

With those sizes, if growth is proportional, there will be about 70000 new records per day on wikidata. I am not sure about the others, but I am sure that wikidata logging-related queries will stop working, or the servers will run out of space and require extra provisioning.

Yeah, I'm pretty amazed how big the wikidata table is (especially since it's only 5 years old). Your estimate of 70,000 per day sounds like a reasonable ballpark. I'll need to get access to the analytics server to give you a more exact number, so stay tuned. Given the astronomical growth of the wikidata logging table, do you think it's going to run out of space regardless? That is, should we consider removing some logging options there and/or pruning the existing table? I have no idea why it is so huge, and it already seems to be difficult to query. Even running a simple indexed timestamp query can take several minutes!

wikiadmin@db1106(wikidatawiki)>explain select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
| id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
|    1 | SIMPLE      | logging | ALL  | times         | NULL | NULL    | NULL | 606682265 | Using where |
+------+-------------+---------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

wikiadmin@db1106(wikidatawiki)>select * from logging where log_timestamp > 20151026200509 LIMIT 1;
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| log_id    | log_type | log_action | log_timestamp  | log_user | log_user_text | log_namespace | log_title | log_page | log_comment | log_params                                                                      | log_deleted |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
| 262985215 | patrol   | patrol     | 20151026200510 |   201281 | Liangent-bot  |             0 | Q14279462 | 15947340 |             | a:3:{s:8:"4::curid";i:262851054;s:9:"5::previd";i:130223534;s:7:"6::auto";i:1;} |           0 |
+-----------+----------+------------+----------------+----------+---------------+---------------+-----------+----------+-------------+---------------------------------------------------------------------------------+-------------+
1 row in set (2 min 29.59 sec)

It would be good to see which types of logging actions are most heavily represented there, but I'm scared to try running any more complicated queries against it.

I still have no data from s3, which is where the extra IOPS would be worrying.

There's a bit of a chicken and egg problem here. It's not easy to get this data without the log already existing. I'm going to try to get access to the analytics server and see if I can track down some relevant EventLogging data there.

Here's the average number of extra logging table insertions we could expect on enwiki, commons, and wikidata:

  • enwiki: 6952 inserts/day
  • commonswiki: 15,511 inserts/day
  • wikidatawiki: 76,316 inserts/day

Getting the data for all of s3 will take a while...
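For a rough sense of what those insert rates could mean in bytes, here is a back-of-the-envelope sketch in Python. It uses only the table sizes and insert estimates quoted in this thread, and it assumes (this is not something measured here) that each new creation row costs the same average data-plus-index bytes as an existing logging row on that wiki. The function name and the 365-day horizon are just illustrative choices.

```python
# Figures copied from the table sizes and insert estimates quoted in this thread.
TABLES = {
    "enwiki": {"rows": 84080214, "data": 14876147712,
               "index": 29504831488, "inserts_per_day": 6952},
    "commonswiki": {"rows": 248227776, "data": 61615374336,
                    "index": 95083823104, "inserts_per_day": 15511},
    "wikidatawiki": {"rows": 606370402, "data": 107029200896,
                     "index": 152214437888, "inserts_per_day": 76316},
}


def yearly_growth_bytes(stats, days=365):
    """Estimate extra bytes per year from page creation log rows.

    Assumption: a new row costs the current average (data + index) bytes
    per existing row. Real creation rows may be smaller or larger.
    """
    bytes_per_row = (stats["data"] + stats["index"]) / stats["rows"]
    return stats["inserts_per_day"] * days * bytes_per_row


growth = {wiki: yearly_growth_bytes(stats) for wiki, stats in TABLES.items()}

for wiki, extra in growth.items():
    current = TABLES[wiki]["data"] + TABLES[wiki]["index"]
    print(f"{wiki}: ~{extra / 1e9:.1f} GB/year extra "
          f"({extra / current:.1%} of current table size)")
```

Under that assumption the wikidatawiki figure comes out at roughly 12 GB per year, which is small relative to the existing table. It says nothing about IOPS or about s3, which the thread still lacks data for.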

It'd be nice to also log pages created from a redirect, like the PageTriage extension does. Currently XTools and similar tools are unable to report these creations. It is up to the user to manually keep track of them. I'm not sure how many more inserts a day that would result in, probably not that much for Wikidata and Commons but on enwiki this scenario is common.

@kaldari Have you talked to the people wanting this feature? Maybe the people who want it want it only for "frwiki" and "enwikivoyage", and not for wikidata, so it can be enabled only on the wikis where it is required. Maybe people only want page creations with certain restrictions, and not "all pages created, period", so the logging can be trimmed somehow. For example, maybe a flag/row on recentchanges can be added instead, so we only get the new pages in the last month. Basically the question is, what is the "user story"? I am trying to understand that in order to suggest the best implementation.

Note this sounds very similar to "adding wikidata to recentchanges", which, done without enough thought, ended up filling 90% of recentchanges rows on many wikis and causing watchlist and recentchanges issues. That is why I am asking what the final goal is: logging is a heavily indexed table with a lot of storage overhead. And of course I am not saying this cannot be done; I am just saying we need more information to know the best way to do it. Otherwise, if logging should grow no matter what, we should start working on a logical partitioning/sharding framework for mediawiki (or, more likely, integrating an existing one like http://vitess.io/ ).

@jcrespo: There was a discussion on English Wikipedia at the village pump. The main use case was to help identify PR/paid accounts and sockpuppets/long-term abuse. I don't think this would be as much of an issue on Wikidata, as no one creates Wikidata items for PR or SEO purposes (and rarely for abuse/vandalism). I imagine it would also be useful on Commons for identifying repeat copyvio offenders. Thus I don't think a flag in recent changes would meet the use case. I'm all for trimming the logging, but I'm also wondering which logs would make the most sense to trim. Like I wonder if the gazillion logs in Wikidata are all just automatic patrol actions by auto-patrolled bots, in which case we could probably just delete all the entries that are older than 30 days and reduce the size of the table by 90% (this is just a guess though).

Yeah, it looks like the vast majority of log actions are patrolling:

wikiadmin@db1090(itwiki)>select count(*) from logging;
+----------+
| count(*) |
+----------+
| 48016968 |
+----------+
1 row in set (8.85 sec)

wikiadmin@db1090(itwiki)>select count(*) from logging where log_type = 'patrol';
+----------+
| count(*) |
+----------+
| 43029327 |
+----------+
1 row in set (11.29 sec)

Which is dumb, since that data is only useful for 30 days AFAIK, and is already stored as a flag in recent changes. Why do we even log patrol actions??
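As a quick check of the proportion, using only the two itwiki counts pasted above (a worked calculation, not a new measurement):

```python
# Row counts from the two itwiki queries above.
total_rows = 48016968    # select count(*) from logging
patrol_rows = 43029327   # ... where log_type = 'patrol'

patrol_share = patrol_rows / total_rows
non_patrol_rows = total_rows - patrol_rows

print(f"patrol share of logging: {patrol_share:.1%}")
print(f"rows left if all patrol entries were dropped: {non_patrol_rows}")
```

That puts patrol entries at just under 90% of itwiki's logging table, in line with the earlier guess. Whether the same ratio holds on wikidatawiki is not shown by these numbers.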

@kaldari I see a relatively "simple" solution (at least for the short term): log creation events only once a page has been deleted. If the page exists and has not been deleted, we can gather the creation information from the smaller "page" table (or revision?); if the page has since been deleted, the creation will be in logging, added only at deletion time. This is far from ideal, and it requires 2 queries (one to page and one to logging), but it would avoid duplicating information while still making vandals' actions easy to find (creations of pages that were later deleted are logged). Do you think that, perhaps with some changes, would be interesting? Most pages will never be deleted, so it could work.
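To make the two-query idea concrete, here is a minimal sketch using Python's sqlite3 with an in-memory database. The schema is deliberately simplified and partly hypothetical: the real page table has no page_creator column (the creator would have to come from the first revision), and the helper names make_db, delete_page and find_creator are invented for illustration. It is not how MediaWiki implements anything.

```python
import sqlite3


def make_db():
    """Create a toy database with simplified page and logging tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_namespace INTEGER, page_title TEXT,
                           page_creator TEXT);  -- hypothetical column
        CREATE TABLE logging (log_type TEXT, log_namespace INTEGER,
                              log_title TEXT, log_user_text TEXT);
    """)
    return db


def delete_page(db, ns, title):
    """On deletion, write the creation record into logging, then drop the page."""
    row = db.execute(
        "SELECT page_creator FROM page "
        "WHERE page_namespace = ? AND page_title = ?", (ns, title)).fetchone()
    if row is None:
        return
    db.execute("INSERT INTO logging VALUES ('create', ?, ?, ?)",
               (ns, title, row[0]))
    db.execute("DELETE FROM page WHERE page_namespace = ? AND page_title = ?",
               (ns, title))


def find_creator(db, ns, title):
    """Query 1 covers live pages; query 2 covers pages deleted afterwards."""
    row = db.execute(
        "SELECT page_creator FROM page "
        "WHERE page_namespace = ? AND page_title = ?", (ns, title)).fetchone()
    if row is not None:
        return ("page", row[0])
    row = db.execute(
        "SELECT log_user_text FROM logging WHERE log_type = 'create' "
        "AND log_namespace = ? AND log_title = ?", (ns, title)).fetchone()
    if row is not None:
        return ("logging", row[0])
    return None


db = make_db()
db.execute("INSERT INTO page VALUES (0, 'Spam_page', 'ExampleUser')")
before = find_creator(db, 0, "Spam_page")   # answered from the page table
delete_page(db, 0, "Spam_page")
after = find_creator(db, 0, "Spam_page")    # answered from logging
print(before, after)
```

The point of the design is that logging only grows when a page is deleted, so pages that are never deleted add no rows. The cost is the second lookup and a creation log that appears only after the fact.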

@jcrespo That's definitely an interesting idea! It would solve the main use case, although it might be a bit confusing having creation logs only show up after the fact. Certainly worth considering though.

In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

The issue of Wikidata's logging table mentioned in T12331#3884542 is very interesting, but seems pretty off-topic here. A separate task to discuss whether auto-patrol logs on Wikidata are needed would be nice.

I created a separate task at T184485.

@MZMcBride as long as it appears on Special:Log/Page_title things are ok. My proposal is compatible with that; it would only affect how things are stored internally at the database level. Nobody here is discussing what should happen, only how it should be implemented internally. The trivial way (more records every time) may break other features or, literally, we may not have the resources, which means we would need to buy newer machines just to implement this, at least for wikidatawiki and/or s3. A smarter implementation, which could have the same user-facing results, could be more efficient, allowing faster operation when we query those resources. Or it could be enabled on only some wikis (e.g. skip wikidatawiki) until we have the money and time to purchase those resources. That is the discussion here: implementing things so that we do not break existing features, something that has happened in the past through lack of care.

@MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need it earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

I think I can answer a few of those questions... There's no pressing need for page creation logs. We've lived without them since the beginning of Wikipedia and waiting a bit longer won't kill anyone. The Wikipedias have a pretty strong use case. Commons has a pretty strong use case. The other projects' use cases aren't as strong, but it would still be useful (for example, quickly looking up all the pages you've created or figuring out how many pages were created on a certain day). Since page creation is such an important action, it just makes sense for it to be in the logs. Personally, I favor fixing T184485 and then just adding page creation logging (in a straightforward implementation) to all the wikis. But if it looks like that's not going to happen, I would want to consider your alternative proposal more seriously.

Knowing that, I would suggest implementing it conditionally, and configuring it on pretty much every wiki except enwiki, commonswiki, and wikidatawiki (obviously with a staggered deployment, to check for regressions). These three would need more thinking and care due to their edit volume, but that could be done later.

Given that any idiot vandal can come along and permanently add multiple rows to the revision table (and thousands of them regularly do!) and given that we're already dealing with other massively large database tables such as pagelinks or categorylinks, it's pretty difficult for me to care about logging growing very moderately to include page creations. I understand and appreciate that disk space and other resources are finite and that large tables can require more maintenance, but this seems like a particularly arbitrary place to draw a line.

In my opinion, it should be possible to look at the logs (i.e., Special:Log/Page_title) of a wiki page in MediaWiki and see a chronology of "major" actions taken to the page. For a standard page, this would include page creation, page renaming, page protections, and page patrolling. For certain pages, this would also include page deletion. We're already doing most of this logging, we're just not including page creation in the logs, somewhat inexplicably. I think we should address this omission in this task.

+1. The creation log is the last missing piece to a coherent persistent page history sketch.

@MZMcBride If your answer is related to my question of "what is the user story?", we need to dig deeper. Do you need every single wiki to do that, right now? Which wikis need it earlier? For the smaller wikis that is very easy; for the larger ones we may hit a performance barrier. Can we cover 90% of the needs quickly so we do not have to wait for more available resources? Is it needed for vandalism fighting? Maybe an equivalent functionality can be implemented that provides equivalent (or even more useful) information while having a lower performance hit. Those are the kinds of questions we need to ask ourselves and each other.

To be honest, I really don't like the idea that there are common use case core logs that exist only in some installations of MediaWiki of a given version.

As quite a few issues were raised during the Last Call period of this RFC, it is not approved for implementation for the time being. It should remain in the "under discussion" stage until agreement is reached on the issues raised. Participants should feel free to request an RFC meeting if they feel it would be helpful.

@kaldari Do you think having an IRC meeting on this soon would be useful? Or do you think the current discussion here is sufficient to move forward?

@daniel: It seems that this task is basically blocked by T49415 (other than some kind of partial roll-out). Eventually, lots of things are going to be blocked by T49415. I think having an IRC meeting about T49415 would be more useful.

@kaldari would that be solved by T184485: Stop logging autopatrol actions? This has gone on Last Call, and if all goes well, it's approved in two weeks. Which raises the question - who would actually implement that?

Since T49415 is resolved, this should no longer be blocked any more. The only remaining hurdle to merging https://gerrit.wikimedia.org/r/#/c/399897/ is that we need to prevent page creation events from creating 2 different entries in recentchanges (one for the edit event and one for the creation log event). The consensus seems to be to not record the creation log event in recentchanges. Skimming through the logging code, it wasn't obvious how to do this, but I haven't had time to really investigate. Apparently, the patrolling action is also logged but doesn't insert into recentchanges, so we should be able to do whatever it's doing.

I think that can be done analogous to the implementation of the $wgAutopromoteOnceLogInRC configuration setting.

@jcrespo: I put the new page logging behind a feature flag which is set to false by default. That will allow us to not enable it on wikidatawiki (due to the volume concerns). If that sounds good to you, would appreciate a +1 on the patch :)

@kaldari Hello.

Is this log compatible with RevisionDelete and Suppression? It is important for us to be able to remove/hide nasty page titles there.

Also, will page titles be removed automatically if and when the page is deleted with suppression (deleted with the checkbox "suppress data from administrators..." marked)?

Regards.

@MarcoAurelio: I believe this should work the same as existing page move logs.

Change 399897 merged by jenkins-bot:
[mediawiki/core@master] Record a log entry on page creation

https://gerrit.wikimedia.org/r/399897

Are there any plans to activate page creation logs on wmf servers? I would really appreciate it.

@MGChecker: My plan is to activate it on Test Wikipedia this week, and then all WMF projects except Wikidata and Commons.

I'm guessing this will not log creations from redirects, as described in T12331#3881196 ? Maybe we could take on that next, as this is really hard to query for currently (even with the new mw-new-redirect tag). Perhaps a create/fromredirect log type? Relevant task at T184305

I would really like this addition with its own log action.

This is now live on all the wikis except Wikidata and Commons.

Good idea, but could you, please, add filtering by namespace? Thank you.

I suppose that would be T16711.

Change 470948 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wiki replicas: Add 'create' to the list of visible log types

https://gerrit.wikimedia.org/r/470948

Change 470948 merged by Bstorm:
[operations/puppet@production] wiki replicas: Add 'create' to the list of visible log types

https://gerrit.wikimedia.org/r/470948

This is now live on all the wikis except Wikidata and Commons.

When can it go live on Wikidata and Commons? Why is it not live there already?

Should we create a separate task for Wikidata and Wikimedia Commons?

We have consensus for it at https://commons.wikimedia.org/wiki/Commons:Village_pump#Page_creation_logs.

Yes.

I believe the issue is less about consensus and more about performance / database storage. See comments starting at T12331#3874399.

As I suggested on wiki, the space and IOPS used will be quite limited if the logs focus on non-image, non-bot page creation, as it is the activity of bots that is normally not very useful, and it was my understanding that it was not needed for the issues raised.

I suggested filing a new ticket rather than commenting here: T288346, but apparently there is a technical limitation: it is not possible to filter by namespace or by account group :-(. I think it would be more technically feasible if that limitation were overcome. I wonder if there is something, even dirty, we could add somewhere to overcome this issue (please let's continue the conversation on that ticket, if necessary).