
enable revision.rev_sha1 on enwiki
Closed, ResolvedPublic

Description

Full list of schema changes being built here:
http://etherpad.wikimedia.org/119deployment


Version: unspecified
Severity: normal

Details

Reference
bz34104

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:14 AM
bzimport set Reference to bz34104.

afeldman wrote:

How complete is the list of schema changes on the etherpad? Is there a target date prior to the start of deployments to freeze the schema?

The sha1 field addition may be the most time-consuming change ever done at WMF. As it's backwards compatible, I'd prefer to make the schema changes sooner rather than later.

  • array( 'addField', 'revision', 'rev_sha1', 'patch-rev_sha1.sql' ),
  • array( 'addField', 'archive', 'ar_sha1', 'patch-ar_sha1.sql' ),

The other two currently listed should be backwards compatible as well.

  • array( 'addIndex', 'page', 'page_redirect_namespace_len', 'patch-page_redirect_namespace_len.sql' ),
  • array( 'addIndex', 'logging', 'type_action', 'patch-logging-type-action-index.sql'),

patch-drop-user_options.sql, meanwhile, can be delayed until 1.20.
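
For reference, a minimal sketch of what the two addField patches apply; the rev_sha1 column definition matches the ALTER statement quoted later in this task, and ar_sha1 is assumed to be analogous:

  -- Sketch of patch-rev_sha1.sql / patch-ar_sha1.sql (ar_sha1 assumed to mirror rev_sha1)
  ALTER TABLE /*_*/revision ADD rev_sha1 varbinary(32) NOT NULL default '';
  ALTER TABLE /*_*/archive ADD ar_sha1 varbinary(32) NOT NULL default '';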

(In reply to comment #1)

How complete is the list of schema changes on the etherpad? Is there a target
date prior to the start of deployments to freeze the schema?

That's it for core. I've still got to go through the extensions, but I don't recall there being many (any?) changes; if there are, they're certainly not going to be as bad.

We do need to get people to confirm whether they want the 1.18wmf1 versions of extensions (hence, no changes at all) brought into 1.19wmf1, or the trunk versions of extensions (which might contain changes).

I'll see about getting the list generated for trunk, and we can cross things out if they aren't needed.

(In reply to comment #1)

The sha1 field addition may be the most time-consuming change ever done at WMF.
As it's backwards compatible, I'd prefer to make the schema changes sooner
rather than later.

  • array( 'addField', 'revision', 'rev_sha1', 'patch-rev_sha1.sql' ),
  • array( 'addField', 'archive', 'ar_sha1', 'patch-ar_sha1.sql' ),

At least it's only adding the column, not populating it at the same time!

Other than an index to be added on the mediawikiwiki code review table (CREATE INDEX /*i*/repo_path ON /*_*/code_paths (cp_repo_id, cp_path); -- 1.4M rows), there are no other database updates that I can see needing any work.

Obviously, adding the index is backwards compatible and can just be added, maybe even without master switching

Translate, for example, has a new table; that's fine to just do ad hoc.

The rest of the features stuff is deployed pretty much at trunk level, so Roan (or whoever) will have applied the needed schema updates.

The maintenance script for CentralAuth needs running somewhere near deploy time, but that's trivial.

afeldman wrote:

The revision.rev_sha1 column addition will be problematic on enwiki in our current production environment.

I took the enwiki slave db38 out of production and ran:

mysql -e "SET SQL_LOG_BIN = 0; ALTER TABLE /*$wgDBprefix*/revision ADD rev_sha1 varbinary(32) NOT NULL default ''" enwiki

It took over 27 hours to complete, and may take another ~8 hours for replication to catch up. We can go ahead and do this on every db individually, but it will take quite a while, and may require pushing back the release cycle. It also increased ibdata1 by 200GB.

Due to the huge amount of time required, we may want to complete the other migrations on db38 and then rebuild all of the other enwiki dbs from a hot backup of it, except for the analytics systems where this isn't an option. We could also take the opportunity to rebuild db38 from a full SQL dump of enwiki in order to switch to file-per-table mode, but this could take an additional week.

Or we can scuttle this change until a future release and, until then, examine ways to partition the revision table or store it differently, obtain SSDs for the enwiki dbs, or store sha1s in their own table and push the cost to queries that join against it (a sketch of that option follows below). As it's currently unindexed, it doesn't appear the column will be of any use except in offline cases anyway.
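
For illustration only, the side-table option mentioned above might look something like this (never implemented; the table and column names are hypothetical):

  -- Hypothetical side table keeping hashes out of the hot revision table
  CREATE TABLE /*_*/revision_sha1 (
    rs_rev_id int unsigned NOT NULL PRIMARY KEY,
    rs_sha1 varbinary(32) NOT NULL default ''
  );
  -- Readers needing hashes would then pay for a join, e.g.:
  -- SELECT rev_id, rs_sha1 FROM revision LEFT JOIN revision_sha1 ON rs_rev_id = rev_id WHERE rev_page = 1234;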

FYI, I'm also experimenting with Percona's pt-online-schema-change tool, since our revision table meets its requirement of having a single column unique index.

On db1043, an enwiki slave in eqiad, I am running:

pt-online-schema-change t=enwiki.revision --alter "ADD rev_sha1 varbinary(32) NOT NULL default ''" --sleep 0.1

This writes 1000 rows at a time to a __tmp_revision table and then sleeps 100ms between chunks. Replication couldn't keep up at all with more aggressive settings. With these settings, replication kept up for around 12 hours, but is now 3.5 hours behind. It's been running for 25 hours and the tool estimates it will require 29 more hours to complete, by which time it will be further behind on replication. We'd need a larger sleep and/or smaller chunks to avoid seriously backing up replication, which may not even work.
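
A sketch of the more conservative settings described above (illustrative values only; --chunk-size is assumed to be supported by this version of the tool):

  # Smaller chunks and a longer sleep than the run above, to give replication more room to keep up
  pt-online-schema-change t=enwiki.revision \
    --alter "ADD rev_sha1 varbinary(32) NOT NULL default ''" \
    --chunk-size 500 --sleep 0.5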

Revision rows:
enwiki - 460014306
dewiki - 87901921
commons - 42765689

So dewiki is the next-biggest wiki, with enwiki being over 5 times larger (in terms of the number of revision rows).

I do believe the sha1 is basically populated and used throughout the code, but its only real use is in the XML dumps.

We would need to comment out explicit reads/writes of the field, but it shouldn't be too painful code-wise.

Would this mean the smaller wikis should be doable somewhat easily? i.e. dewiki should be under 10 hours (scaling the 27-hour enwiki run by row count gives roughly 5 hours, so that leaves some margin for guesstimation).

Enwiki will obviously be one of the later wikis to be upgraded to 1.19.

archive rows:
enwiki - 35621907
dewiki - 13702308
commons - 5344901

I'm presuming that doing archive's ar_sha1 is going to be quite a bit quicker? revision is nearly 13 times larger than archive on enwiki.

Could we get away with just doing a delayed release of this, so the first XML dumps after 1.19 don't have any revision sha1 hashes? Oh well. Obviously they're not greedily populated, so it's not going to be much of an issue.

afeldman wrote:

rev_sha1 is doable on all wikis other than en in a reasonable time frame, including as an online change for all wikis < de. ar_sha1 is fine on all wikis off the bat.

Can use of the sha1 columns be a wmf-config setting that is set to false for enwiki? If so, the default could likely be true.

<insert additional grumpy remark about bloating production oltp databases with data only of use for offline analysis>
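
A sketch of what such a per-wiki switch could look like in wmf-config (the setting name wmfUseRevSha1Columns appears later in this task; the exact array structure here is illustrative):

  // Illustrative InitialiseSettings.php entry: off for enwiki, on elsewhere
  'wmfUseRevSha1Columns' => array(
  	'default' => true,
  	'enwiki'  => false, // column only exists on the enwiki master; writes would break replication
  ),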

If rev_sha1 is the main issue... for the WMF deployment, there are only 9 uses that might need some work.

Anything doing actual database queries needs making optional, but a lot of the uses in code fall back fairly nicely. I'd guess only 4 or 5 of them will need actively patching, so that's more than doable.

e.g.

			// Fall back gracefully when the row doesn't include the new
			// column (e.g. a wiki where rev_sha1 hasn't been added yet):
			if ( !isset( $row->rev_sha1 ) ) {
				$this->mSha1 = null;
			} else {
				$this->mSha1 = $row->rev_sha1;
			}
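
On the write side, the guard would presumably look something along these lines (a sketch; the $wgUseRevSha1Columns global is hypothetical, while Revision::base36Sha1() is the hashing helper in core):

			// Only write the new column where the deployment switch says it exists
			$row = array( /* ... the other revision fields ... */ );
			if ( $wgUseRevSha1Columns ) {
				$row['rev_sha1'] = Revision::base36Sha1( $text );
			}
			$dbw->insert( 'revision', $row, __METHOD__ );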

Asher is going to be out of the office until Feb 26. Here's the latest update from him:

"Everything is done except for the s1, s5, and s6 masters. (s1 is not getting the revision sha1 table pre-1.19, so that alter isn't included in "everything")

"I was planning to swap those three masters and re-run the migration script early next week but if someone else would like to do it, let me know. If so, make sure the eqiad secondary in the replication tree gets switched to the new master as well along with the dns and puppet changes noted at http://wikitech.wikimedia.org/view/Switch_master. "

afeldman wrote:

All migrations are done, except for revision.rev_sha1 on enwiki. That alter has only been applied to the host promoted to master yesterday - I had previously run it on one host as a timing benchmark and chose that host to become master to avoid the need for a second switch.

I'd like to start that last enwiki migration after the 1.19 deployment is complete, but it's important that the logic on whether to use revision.rev_sha1 is not based on the existence of that column on the master. It exists, but any write to it will break replication to all slaves and likely bring down the site.

(In reply to comment #9)

All migrations are done, except for revision.rev_sha1 on enwiki. That alter
has only been applied to the host promoted to master yesterday - I had
previously run it on one host as a timing benchmark and chose that host to
become master to avoid the need for a second switch.

I'd like to start that last enwiki migration after the 1.19 deployment is
complete, but it's important that the logic on whether to use revision.rev_sha1
is not based on the existence of that column on the master. It exists, but any
write to it will break replication to all slaves and likely bring down the
site.

We can just leave the whole feature disabled on enwiki for the time being; it's going to make very little difference! :)

Leaving this open since we still need to get the sha1 stuff done for enwiki, but bumping down the priority and taking this off as a 1.19wmf1 deployment blocker.

tightening summary and moving to 1.20 deployment

afeldman wrote:

With the last enwiki analytic db completed today, this feature may now be enabled.

(In reply to comment #13)

With the last enwiki analytic db completed today, this feature may now be
enabled.

After running a population script :)

(In reply to comment #14)

(In reply to comment #13)

With the last enwiki analytic db completed today, this feature may now be
enabled.

After running a population script :)

We didn't run it for any of the other wikis, did we?

Aaron, will this automatically happen as a result of deploying 1.20wmf1 to enwiki? If so, are we ready for it to happen automatically?

(In reply to comment #16)

Aaron, will this automatically happen as a result of deploying 1.20wmf1 to
enwiki? If so, are we ready for it to happen automatically?

Indeed, we haven't backported the changes.

We might as well just remove the enwiki entry in wmfUseRevSha1Columns and let it get on with it

No point trying to back the code out of 1.19wmf1

Done

It's going to be lazy populated, unless we find a reason to do it via a maintenance script. Which for obvious reasons, on enwiki especially, it's not going to be quick

(In reply to comment #18)

It's going to be lazy populated, unless we find a reason to do it via a
maintenance script. Which for obvious reasons, on enwiki especially, it's not
going to be quick

What does "lazy populated" mean here?

(In reply to comment #19)

(In reply to comment #18)

It's going to be lazy populated, unless we find a reason to do it via a
maintenance script. Which for obvious reasons, on enwiki especially, it's not
going to be quick

What does "lazy populated" mean here?

Hmm. It's not even lazily populated. I thought it was updated (if null) on load of revisions or something; it's been a while since I looked at the code.

It's only going to be set on creation of new revisions. This is no different to any of the other wikis. If we run the maintenance script to populate these (on enwiki especially), it's going to take a long time.

It's a case of whether "we" (the community, analysts, whoever) need the information, in which case, it's worth doing.

That should be opened as a new bug request if desired.

(In reply to comment #20)

It's only going to be set on creation of new revisions. This is no different to
any of the other wikis. If we run the maintenance script to populate these (on
enwiki especially), it's going to take a long time.

You've mentioned that it's going to take a long time twice now. It's not a race. Nobody's going to require that you hold your breath while the script runs. It doesn't matter how long it takes to populate (and the sooner someone starts it, the sooner it'll be finished!).

It's a case of whether "we" (the community, analysts, whoever) need the
information, in which case, it's worth doing.

That should be opened as a new bug request if desired.

I can think of few ideas more daft than spending all of this time and energy adding the columns only to not populate them. A separate bug makes sense. Filed as bug 36081.

There is a script to populate this which should be run soon.
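
For the record, a sketch of how that would presumably be invoked on the cluster (assuming populateRevisionSha1.php in core's maintenance/ directory is the script meant, run via the usual mwscript wrapper):

  # Backfills sha1 values for existing rows in batches
  mwscript populateRevisionSha1.php --wiki=enwiki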