Page MenuHomePhabricator

Archive and drop all article feedback related tables from all wikis
Closed, ResolvedPublic

Description

On enwiki:

+------------------+----------------------------+-----------+
| table_name       | index_name                 | rows_read |
+------------------+----------------------------+-----------+
| article_feedback | PRIMARY                    |      NULL |
| article_feedback | article_feedback_timestamp |      NULL |
| article_feedback | aa_page_id                 |      NULL |
| article_feedback | aa_page_user_token         |      NULL |
+------------------+----------------------------+-----------+
4 rows in set (0.25 sec)

NULL rows_read indicates the indexes havn't been used. Stats collection has been running here for over a week. Same results on S1 master and several slaves.

Indexes outweigh data on article_feedback quite a bit, taking up space and increasing write load:

+---------------------------------------+--------+-------+--------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx    |
+---------------------------------------+--------+-------+--------+
| enwiki.article_feedback               | 69.00M | 7.13G | 26.04G |
+---------------------------------------+--------+-------+--------+

Surprising; now I'm suspicious of the stats... Are these indexes really likely to be simply unused?

Is there any infrequent maintenance or reporting job that requires them?


Version: wmf-deployment
Severity: minor

Details

Reference
bz57185

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 2:19 AM
bzimport set Reference to bz57185.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to Sean Pringle from comment #0)

Surprising; now I'm suspicious of the stats... Are these indexes really
likely to be simply unused?

Is there any infrequent maintenance or reporting job that requires them?

No, they're just old tables for an extension disabled years ago (so called AFT/AFTv4).

Dario, is there any data which was not yet added to http://datahub.io/en/dataset/wikipedia-article-ratings ?

Reedy set Security to None.

(In reply to Sean Pringle from comment #0)

Surprising; now I'm suspicious of the stats... Are these indexes really
likely to be simply unused?

Is there any infrequent maintenance or reporting job that requires them?

No, they're just old tables for an extension disabled years ago (so called AFT/AFTv4).

Dario, is there any data which was not yet added to http://datahub.io/en/dataset/wikipedia-article-ratings ?

If @DarTar can tell us if the tables are still needed (or if they've all been exported/archive to the datahub above)....

If they're not, this task can be upgraded to dropping all the article feedback tables completely; data and indexes

And if we can drop the lot... aft_% or article_feedback% can all go

| aft_article_answer                           |
| aft_article_answer_text                      |
| aft_article_feedback                         |
| aft_article_feedback_properties              |
| aft_article_feedback_ratings_rollup          |
| aft_article_feedback_select_rollup           |
| aft_article_field                            |
| aft_article_field_group                      |
| aft_article_field_option                     |
| aft_article_filter_count                     |
| aft_article_revision_feedback_ratings_rollup |
| aft_article_revision_feedback_select_rollup  |
| article_feedback                             |
| article_feedback_pages                       |
| article_feedback_properties                  |
| article_feedback_ratings                     |
| article_feedback_revisions                   |
| article_feedback_stats                       |
| article_feedback_stats_types                 |

@Reedy, @Nemo_bis: there are field that were collected via AFTv4 and we didn't include in the open dataset (e.g. user tokens, IP addresses or user IDs) due to privacy restrictions. I don't think we need to keep the AFTv4 tables (article_feedback_*) and their index around, given that there is no more internal analysis being conducted at WMF on this data and the anonymized public dataset is good enough for external research. This applies also to AFTv5 data (public dataset published here, tables with an aft_ prefix)

My recommendation is to either dump the SQL and archive it on one of the internal stats machines or remove the raw data entirely from our servers if its retention (with PII included) is considered a violation of the data retention guidelines. I'd go for the latter, I'm copying @csteipp for a second opinion.

@Reedy, @Nemo_bis: there are field that were collected via AFTv4 and we didn't include in the open dataset (e.g. user tokens, IP addresses or user IDs) due to privacy restrictions. I don't think we need to keep the AFTv4 tables (article_feedback_*) and their index around, given that there is no more internal analysis being conducted at WMF on this data and the anonymized public dataset is good enough for external research. This applies also to AFTv5 data (public dataset published here, tables with an aft_ prefix)

My recommendation is to either dump the SQL and archive it on one of the internal stats machines or remove the raw data entirely from our servers if its retention (with PII included) is considered a violation of the data retention guidelines. I'd go for the latter, I'm copying @csteipp for a second opinion.

I believe @jcrespo has said that he always takes a backup of these things before dropping them. I suspect archiving them to the stats machines too (do you meant the backup file, or put the whole tables onto some stats mysql boxen?) wouldn't be out of the question.

Will retitle the bug. Thanks @DarTar! :D

Reedy renamed this task from Drop article_feedback indexes, unused to Archive and drop all article feedback related tables from all wikis.Nov 18 2015, 12:24 AM
jcrespo raised the priority of this task from Low to Medium.Mar 21 2016, 11:04 AM

These tables take a significant amount of space in production, and a potential source of privacy issues if they were exposed fully on labs. I would like to proceed with its deletion ASAP.

I would like to not retain any copy, as we currently do not have such a service outside of dumps. Backups of these tables will disappear after 3 months.

Ok with that? (I did not see confirmation from the people being asked)

mysql -A -h s1-master.eqiad.wmnet information_schema -e "SELECT (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 AS \`size (GB)\`, table_name FROM TABLES WHERE table_schema='enwiki' AND table_name like 'article\_feedback%'";
+-----------------+------------------------------+
| size (GB)       | table_name                   |
+-----------------+------------------------------+
| 33.173828125000 | article_feedback             |
|  0.260559082031 | article_feedback_pages       |
|  1.680664062500 | article_feedback_properties  |
|  0.000015258789 | article_feedback_ratings     |
|  1.291992187500 | article_feedback_revisions   |
|  0.079330444336 | article_feedback_stats       |
|  0.000030517578 | article_feedback_stats_types |
+-----------------+------------------------------+

Per Dario above, all the data deemed useful and usable has been published at https://datahub.io/en/dataset/wikipedia-article-feedback-corpus , so we're ok with deleting.

OTOH, if a directory exists somewhere (rutherfordium?), readable only by ops/roots, where to dump some compressed sql files, that could not harm much.

@Nemo_bis I always save things before deletion, however, I save those on my home, as a mere "in case something breaks", so I can recover them quickly. That is only "operational", temporary storage. I do not think there is a place for long-term storage -and maybe it shouldn't- for things like this.

I will proceed with the production deletion, but a long term decision, either full purge or a service for preservation has to be created (it can be discussed on a separate ticket).

Testing its deletion on db1073.

I'd suggest also to delete the AFT log entries from the logging table, as they don't display properly on the wikis, they no longer convey any useful information, and all the data would still be available for posterity in relevant dumps.

Should I file a subtask for this?

TTO please do, although that is more of a "content handling" issue, not a schema schange, so that can be done by anyone with a maintenance script. With this I mean it is less dangerous and should not be blocked on me (but I will help).

Mentioned in SAL [2016-08-12T09:38:06Z] <jynus> dropping aft tables from db1052 T59185

TTO please do, although that is more of a "content handling" issue, not a schema schange, so that can be done by anyone with a maintenance script. With this I mean it is less dangerous and should not be blocked on me (but I will help).

Thanks Jaime. I repurposed T115303: Expunge old AFTv5 log entries on WMF wikis for this.

Will deleting these tables has to be blocked on that, or can I proceed as intended?

Will deleting these tables has to be blocked on that, or can I proceed as intended?

No, not blocked. Go ahead :)

Mentioned in SAL [2016-08-12T10:06:42Z] <jynus> dropping aft tables from all enwiki hosts after archiving its contents T59185

Mentioned in SAL [2016-08-12T10:47:34Z] <jynus> dropping aft tables from all other hosts after archiving its contents T59185

jcrespo claimed this task.

I have dropped the following ones:

| aft_article_answer                           |
| aft_article_answer_text                      |
| aft_article_feedback                         |
| aft_article_feedback_properties              |
| aft_article_feedback_ratings_rollup          |
| aft_article_feedback_select_rollup           |
| aft_article_field                            |
| aft_article_field_group                      |
| aft_article_field_option                     |
| aft_article_filter_count                     |
| aft_article_revision_feedback_ratings_rollup |
| aft_article_revision_feedback_select_rollup  |
| article_feedback                             |
| article_feedback_pages                       |
| article_feedback_properties                  |
| article_feedback_ratings                     |
| article_feedback_revisions                   |
| article_feedback_stats                       |
| article_feedback_stats_types                 |

From:

enwiki

dewiki

eswiki
huwiki
metawiki

ptwiki
zhwiki

eswikinews
hiwiki
ptwikibooks
srwiki
testwiki

Which are the ones I found.

I have retained just in case backups on es2002; but the intention is to drop those eventually too.

Nice, thanks! Attaching for future reference:

db1057_disk.png (708×997 px, 33 KB)