Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)
Open, MediumPublic

Description

The most recent version of each image is stored in the image table, with historic versions stored in the oldimage table. This structure resembles cur and old, with all the nastiness that comes with it. In MW 1.5 we ditched cur/old in favor of page/revision, but we never did a similar thing for files.

We talked about this a bit at the 2011 Amsterdam Hackathon and decided that, while it's complex and hairy, we want to do it at some point in the future.


RFC

In a nutshell

Currently, the oldimage table contains information about non-current file revisions, and the image table contains information about the current file revisions as well as the unversioned information about a file in general. This means that to query all revisions, MediaWiki has to query two tables. It also means that whenever a file gains a new revision, the "current" data has to be moved across from one table to another.

The approved changes will rename oldimage to filerevision and add the missing rows for the current file revisions. Then, the image (now: file) table can be reduced to just containing the generic file information. This matches the way the page and revision tables have been organised in the database since MediaWiki 1.5 (before the page/revision tables existed, MW used cur/old tables, which resembled today's image/oldimage, and this is why we still have the notion of curid=pageid and oldid=revid). A rough sketch of the resulting table layout follows the list of approved changes below.

Approved changes

  • Add fields:
    • img_id: New primary key for image.
    • img_latest: Pointer to oldimage.oi_id for the current revision of a file (similar to how page.page_latest points to revision.rev_id)
    • oi_id: New primary key for oldimage.
    • oi_img: Pointer to image.img_id for the file this revision corresponds to (similar to how revision.rev_page points to page.page_id)
  • Rename tables:
    • Rename image to file.
    • Rename oldimage to filerevision.
  • Add missing rows in filerevision for current revisions, based on rows from image.
  • Reduce fields in file to only be the minimum amount of derived data we need indexed for the "current" revision. Currently indexed:  img_timestamp, img_user_text, img_sha1, img_media_type, img_major_mime, img_minor_mime, img_size.
    • img_timestamp: Remove. Only used for file history, redundant now with the missing rows now present in filerevision.
    • img_user_text: Remove. Only used for file history (ApiQueryAllImages, SpecialMIMEsearch), these queries can instead query or join filerevision directly.
    • img_sha1: Keep. Needed for reverse look-up in duplication detection on current file revisions.
    • img_media_type, img_major_mime, img_minor_mime: Keep. Needed for SpecialMIMESearch. Note that (while out of scope of this task) there has been talk about creating a separate RFC for removing these fields in favour of a SearchEngine-based approach.
    • img_size: Remove. Not needed as indexed field. Only used by checkImages.php and Special:MediaStatistics (SUM query), these can directly query/join against filerevision.
    • img_width, img_height: Remove. Not needed as indexed field. Only existed because oldimage didn't contain current revisions, these fields are redundant in filerevision.
    • img_bits: Remove. Not needed. Redundant now with filerevision.
    • img_description: Remove. Not needed. Redundant now with filerevision.
    • img_user: Remove. Not needed. Redundant now with filerevision. Used by ApiQueryAllImages and NewFilesPager (SpecialNewimages) which could query/join filerevision instead. Though, should probably use recentchanges.rc_bot instead. (unrelated)
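For illustration, here is a minimal sketch of what the resulting layout could look like. It assumes the img_/oi_ column prefixes are simply kept after the rename and omits most existing columns and indexes; the real DDL would be settled during implementation and code review.

```
-- Sketch only: column prefixes, types and indexes are assumptions, not final DDL.
CREATE TABLE file (                -- renamed from image
  img_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- new primary key
  img_name VARBINARY(255) NOT NULL,
  img_latest BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- points to filerevision.oi_id, like page.page_latest
  img_sha1 VARBINARY(32) NOT NULL DEFAULT '',     -- kept: duplicate detection on current revisions
  img_media_type VARBINARY(32) DEFAULT NULL,      -- kept: SpecialMIMESearch
  img_major_mime VARBINARY(32) NOT NULL DEFAULT 'unknown',
  img_minor_mime VARBINARY(100) NOT NULL DEFAULT 'unknown',
  UNIQUE KEY img_name (img_name),
  KEY img_sha1 (img_sha1(10))
) ENGINE=InnoDB;

CREATE TABLE filerevision (        -- renamed from oldimage, now holding current revisions too
  oi_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,   -- new primary key
  oi_img BIGINT UNSIGNED NOT NULL,  -- points to file.img_id, like revision.rev_page
  oi_timestamp BINARY(14) NOT NULL,
  oi_size INT UNSIGNED NOT NULL DEFAULT 0,
  oi_width INT NOT NULL DEFAULT 0,
  oi_height INT NOT NULL DEFAULT 0,
  oi_sha1 VARBINARY(32) NOT NULL DEFAULT '',
  KEY oi_img_timestamp (oi_img, oi_timestamp)
) ENGINE=InnoDB;
```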

Proposed migration strategy

The exact schema migration script will be written as part of the implementation and fine-tuned as needed during code review. We may need two separate strategies due to the size of the migration (one for the default DB updater, and an opt-in maintenance script for large farms such as the Wikimedia Foundation's). A few ideas so far (a rough SQL sketch of Tim's sequence follows the list below):

  • Tim Starling, T589#2747454:
    • Rename image to filerevision. Create a view called image.
    • Add new fields to filerevision.
    • Create file table.
    • (Disable uploads.)
    • Populate file from filerevision.
    • Also move oldimage rows into filerevision. (These will be made invisible from the image view.)
    • (Deploy new MediaWiki version that uses file/filerevision.)
    • (Re-enable uploads)
    • Drop image and oldimage.
  • Jaime Crespo mentioned the idea of potentially doing the migration offline while serving traffic from codfw instead of eqiad.
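A minimal SQL sketch of Tim's sequence, under the same naming assumptions as the sketch above. Batching, index handling, filearchive and the handling of any pre-existing primary keys are deliberately left out; this is an outline, not a production script.

```
RENAME TABLE image TO filerevision;

-- Compatibility view so existing readers keep working during the transition.
-- Once oldimage rows are merged in, it would need a filter so that only
-- current revisions stay visible.
CREATE VIEW image AS SELECT * FROM filerevision;

ALTER TABLE filerevision
  ADD COLUMN oi_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ADD COLUMN oi_img BIGINT UNSIGNED NOT NULL DEFAULT 0;

CREATE TABLE file (
  img_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  img_name VARBINARY(255) NOT NULL,
  img_latest BIGINT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE KEY img_name (img_name)
) ENGINE=InnoDB;

-- (Uploads disabled at this point.)
-- Every row currently in filerevision is a "current" revision, so file can be
-- populated directly from it, and the back-pointers filled in afterwards.
INSERT INTO file (img_name, img_latest)
  SELECT img_name, oi_id FROM filerevision;

UPDATE filerevision fr
  JOIN file f ON f.img_name = fr.img_name
SET fr.oi_img = f.img_id;

-- Then: move oldimage rows into filerevision (oi_* to img_* column mapping
-- omitted here), deploy the new MediaWiki version, re-enable uploads, and
-- finally drop the image view and the oldimage table.
```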

Details

Reference
bz26741

Related Objects

Event Timeline


Also, file moves would take some thought - currently those change the name of the archive entries, I think, and content blobs are supposed to be immutable. Maybe we should do the switch to sha1 filenames first?

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

In T28741#4940685, @Tgr wrote:

Also revision deletion would have to trigger file version deletion, but only if all revisions of the page that reference a certain version of the file have been deleted. That would have to be a whole new mechanism, and does not fit neatly into MCR which treats content objects as completely opaque.

Good catch, this indeed needs more thought.

As I said multiple times, hypernormalization would bring performance concerns- right now image and revision are handled on separate tables. Unless you also implement MCR sharding, we should try to make separate concerns into separate tables. Let's not convert Mediawiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata to its own separate table.

As I said multiple times, hypernormalization would bring performance concerns- right now image and revision are handled on separate tables. Unless you also implement MCR sharding, we should try to make separate concerns into separate tables. Let's not convert Mediawiki into Drupal. That doesn't mean image could not be a "slot", but leave metadata to its own separate table.

Currently, every upload creates a row in the oldimage (or file) table as well as the revision table and the slots table (but not the content table). If we used MCR for storing the primary image meta-data, every upload would create a row in the revision, slots, and content tables (as well as the image table if the file did not previously exist). filearchive and oldimage would go away, and with them the need to copy rows between these tables.
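For illustration only, the additional writes one upload might produce under such an MCR model; the 'file' slot role, the model id and the tt: address format are assumptions for the sketch, not an agreed design.

```
-- Hypothetical extra rows per upload under the MCR approach:
INSERT INTO content (content_size, content_sha1, content_model, content_address)
VALUES (1234, 'sha1inbase36goeshere', @file_metadata_model_id, 'tt:123456');

INSERT INTO slots (slot_revision_id, slot_role_id, slot_content_id, slot_origin)
VALUES (@rev_id, @file_role_id, LAST_INSERT_ID(), @rev_id);

-- Plus the revision row that uploads already create today, and an image row
-- only if the file did not previously exist.
```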

The idea is indeed to leave metadata of the current revision in its own separate table, but as a secondary copy, not primary information; and to remove the need for maintaining it for old or archived revisions.

I'm not sure I understand correctly what you mean by MCR sharding. What table would need to be sharded in your opinion? And which tables would grow significantly more than they do now?

Since we now have MCR, we could do T96384 and get rid of the need for filerevision.

Can we have a decision whether should we go with MCR or not (probably by TechCom)? The original RFC was proposed exactly ten years ago and it might need a revisit.

Can we have a decision whether should we go with MCR or not (probably by TechCom)? The original RFC was proposed exactly ten years ago and it might need a revisit.

RFC T589 was approved in 2017 with knowledge of MCR. As I understand it, adding primary keys (the approved direction) would be a prerequisite regardless of whether and when we refactor this using MCR. The MCR refactor would also likely involve non-trivial product and user-facing decisions and is perhaps better resourced as its own separate endeavour at a later date (and ideally by a team willing and able to commit to its long-term ownership).

The approved solution is fairly self-contained and seems more urgent given the benefits it will yield for database performance (the story for revision/archive in T20493 applies to image/oldimage as well).

I imagine the first step for MCR would be to create a slot which just contains a filerevision ID. Completely merging filerevision into revision would be a lot more work which should definitely not be attempted in a single step; and I'm not sure we have compelling reasons for doing it at all. So this task is a step towards MCR images.

T90300: Suppressed username shown on File pages probably has some overlap with this.

Good to see interested people here!

One thing that people may be aware of, but that I don't see mentioned, is that changes to image (file), in addition to oldimage (filerevision), may also impact filearchive. E.g. if we migrate to a numerical PK, we may want to keep that PK for filearchive. Further refactoring could be done there at the same time, but that would be out of scope for this particular RFC; it should, however, be taken into account so that the PKs/refactoring stay compatible. filearchive is in an even worse state than oldimage in terms of consistency, to the point of causing metadata loss.

I offer to start working on a data cleanup/consistency check as a preliminary step before the refactoring, to speed up later changes, if someone is available to help me.
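A couple of examples of the kind of consistency checks such a cleanup could start with (illustrative only; a real cleanup would cover many more cases and run in batches):

```
-- oldimage rows whose file name no longer has a row in image:
SELECT oi_name, COUNT(*) AS orphan_rows
FROM oldimage
LEFT JOIN image ON img_name = oi_name
WHERE img_name IS NULL
GROUP BY oi_name;

-- rows that collide on the (name, timestamp) pseudo-key the code relies on today:
SELECT oi_name, oi_timestamp, COUNT(*) AS duplicates
FROM oldimage
GROUP BY oi_name, oi_timestamp
HAVING COUNT(*) > 1;
```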

In T28741#6746635, @Tgr wrote:

I imagine the first step for MCR would be to create a slot which just contains a filerevision ID. Completely merging filerevision into revision would be a lot more work which should definitely not be attempted in a single step; and I'm not sure we have compelling reasons for doing it at all. So this task is a step towards MCR images.

I'm not convinced it would be a lot more work. Introducing filerevision just to make it redundant again seems like massive overhead. Moving to MCR doesn't have to be done in a single step any more than introducing filerevision would be done in a single step.

@daniel how would you imagine the end stage of the MCR migration? Put what's currently an image table row into a JSON object and store it as content on some non-main slot?

The reason commonswiki.image is large is because of img_metadata -- predominantly text extracted from PDFs. I sampled 63269 images using img_sha1, and I found that the average size of an img_metadata field is 9712 bytes. It's enormous. Without compression, the storage space used by this field on commons would be about 616 GB. Breaking down metadata sizes by MIME types shows that 81% of img_metadata by size is PDFs, 10% is DjVu, 9% is JPEGs, and the rest is a rounding error. So the size of PDF/DjVu text layers on commons is about 561 GB.

So in terms of addressing @jcrespo's complaint at T222224#6738823 about backup sizes, the most important thing we can do is to rethink storage of PDF and DjVu text layers. CodeSearch indicates that this text is used for exactly two things: Wikisource (ProofreadPage) transcription defaults and search indexing. In both cases, high latency is tolerable, so moving this text to external storage would work.

Some of the metadata is much hotter. For example, the page count of a PDF is needed every time such an image is referenced during parse or any other kind of thumbnailing, but it's currently in the same multi-megabyte serialized blob as the text layer. That's why I'm talking about reconsidering text layers, rather than just moving the existing metadata blob to external store.

I used img_sha1 like '12%'. The sampling ratio is not as obvious as I thought since this is a base-36 hash, so the first two digits are not uniform. I believe the correct sampling ratio is 36^29 / 2^160 = 1/1076.5. I edited my comment above to reflect that. The size of the field summed over the whole table is assumed to be the size of the sample multiplied by 1076.5.
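For reference, the estimate can be reproduced with queries along these lines; the figures above are Tim's, and the 1076.5 factor is the sampling ratio he derives.

```
-- Sample ~1/1076.5 of rows via a fixed two-character img_sha1 prefix, then scale up:
SELECT
  COUNT(*)                                           AS sampled_rows,
  AVG(LENGTH(img_metadata))                          AS avg_metadata_bytes,
  SUM(LENGTH(img_metadata)) * 1076.5 / POW(1024, 3)  AS estimated_total_gib
FROM image
WHERE img_sha1 LIKE '12%';

-- Break the sample down by MIME type to see where the bytes go:
SELECT img_major_mime, img_minor_mime,
       SUM(LENGTH(img_metadata)) AS sampled_bytes
FROM image
WHERE img_sha1 LIKE '12%'
GROUP BY img_major_mime, img_minor_mime
ORDER BY sampled_bytes DESC;
```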

In T28741#6749820, @Tgr wrote:

@daniel how would you imagine the end stage of the MCR migration? Put what's currently an image table row into a JSON object and store it as content on some non-main slot?

No, the image table would stay much the same, but it would be a "secondary" table, basically specialized page_props. oldimage would probably also stay as a secondary denormalized view on revisions, to provide quick access to the upload history with thumbnails etc. File archive can probably go away. The metadata blob would either be in a slot, or written to the blob store separately.

The key change is that the source of truth for all changes to the page, be it edit or upload, will be the revision table.

Hey, daniel, from a logical/data model point of view you have my support- anything that makes image workflow more like page workflow, (more unmutable/append only; stable, unique identifiers) would be a win.

My worry would be for commonswiki (and possibly enwiki) going back to having scaling problems due to inserting and scanning lots of rows on revision (or, more technically, increasing the current ones). Thanks to your work, and others', we "stopped the fires" on the revision table, but we didn't win the war :-D. Please consider that deeper refactorings, especially those increasing the size and usage of a single table, may require sharding or some other physical optimization technique (e.g. revision_page and revision_image physically separated; or user_revision/page_revision/date_revision clustered around a particular ordering, i.e. vertical partitioning) to avoid losing agile schema changes and fast backups/recoveries, and to avoid performance loss from heavily used tables no longer fitting into memory.

I don't have specific suggestions or blockers, but I would like you to take into account the non-logical parts (operational cost). This is not only to make sysadmins' lives easier: fast schema changes also mean faster deployments 0:-O. Of course, this normally only applies to the top 3 traffic wikis (enwiki, commonswiki and wikidatawiki); most other wikis do not have these limitations.

My worry would be for commonswiki (and possibly enwiki) going back to having scaling problems due to inserting and scanning lots of rows on revision (or, more technically, increasing the current ones). Thanks to your work, and others', we "stopped the fires" on the revision table, but we didn't win the war :-D.

Since we already insert a "dummy" revision for every upload, this wouldn't increase the size or number of writes on the revision table. It would be exactly the same. Having more slots on all the file pages would increase the size of the slots table (if we do it retroactively for existing data, by 30% to 50%). That table is very narrow, so I'd hope that this would be acceptable.

Please consider that deeper refactorings, especially those increasing the size and usage of a single table, may require sharding or some other physical optimization technique (e.g. revision_page and revision_image physically separated; or user_revision/page_revision/date_revision clustered around a particular ordering, i.e. vertical partitioning) to avoid losing agile schema changes and fast backups/recoveries, and to avoid performance loss from heavily used tables no longer fitting into memory.

We recently reduced the width of the revision table. If the height of the revision table (and the slots and content tables) is still a concern, we should pick up efforts on partitioning/sharding again. This used to be on the roadmap last year, but got dropped as "no longer a concern", as far as I know. This is probably not the right place to discuss it, but it seems like it would be important to have a shared understanding of the urgency of such a project.

Don't worry, I see you are already very aware of the problem I mentioned, so that gives me confidence we will be able to sort it out when there is a concrete proposal. Sadly, scaling problems don't go away; other, larger ones just appear. :-)

IMHO, this is the main wmf db issue as of now, and relevant to this ticket:

  • The size of the image table on commons. I like @tstarling's comments, and I am guessing they would be easier to implement (I don't know)?
  • The lack of stable identifiers/numerical ids (general workflow) for image versions, which makes them difficult to back up/recover/identify and can cause data loss problems. Either Daniel's proposal or the one in the original RFC would work, but my guess is those are for the most part independent.

I don't have further insights on how to best solve both, unless you have specific questions for me.

The reason commonswiki.image is large is because of img_metadata -- predominantly text extracted from PDFs. I sampled 63269 images using img_sha1, and I found that the average size of an img_metadata field is 9712 bytes. It's enormous. Without compression, the storage space used by this field on commons would be about 616 GB. Breaking down metadata sizes by MIME types shows that 81% of img_metadata by size is PDFs, 10% is DjVu, 9% is JPEGs, and the rest is a rounding error. So the size of PDF/DjVu text layers on commons is about 561 GB.

There is a related discussion happening in T271493.

This issue is really interesting. I took a look at it a bit.

  • Yes, the text part of pdf/djvu files in img_metadata is enormous. For example, the first pdf file in the list of files on commons is File:!!Abajo los solteros!! - fantasía cómico-lírica gubernamental en siete cuadros y un real decreto, en prosa (IA abajolossolteros476riba).pdf, and if you look at the metadata section you won't see much data, but if you look at the metadata using the API you'll see a huge wall of data (presumably OCR output, and seemingly next to useless). It's 64 KB; this adds up quickly.
  • Moving the text data makes a lot of sense, but to where? We need to decide that first. I have a proposal; I don't know very much about files and media handling in MW, so sorry if it's bad. Why not load the text layer by reading the metadata directly from the file in the file backend? It doesn't happen often at all. The only use case I found was ProofreadPage (but I'm sure Elastic also uses it), and both could just look up the metadata from the file backend (or through some API?). If that's not a good idea, we can have another table and move that to x1/etc. Whatever the DBAs would prefer.
  • I highly recommend fixing the text layer of the metadata first, because it makes the rest of the work much easier. Once the table is ten times smaller, adding and dropping indexes will be easier too.

(We probably should create another ticket, this is derailing a bit)

Small correction to the Wikitech-l message "He recommends executing the oldimage migration, which was approved": my recommendation was just to highlight the very real ongoing backup/recovery problems so this gets higher priority than the link problem at the moment. I don't have a strong preference for this exact approved RFC, or any other idea, as long as it helps reduce the table size and/or, hopefully, provides an easier iterable identifier, given how big the table is (and I thought this was the most relevant ticket). 0:-)

Whatever the DBAs would prefer.

For reading from the backend, we don't have that experience yet (here I mean DBAs == Data Persistence team). Maybe Performance Team members have a better idea, as they worked on thumbnailing recently and may have more first-hand experience? For a table, as long as it has easily iterable PKs, any vertical slice for infrequently accessed data is a win, but some small testing would be needed. E.g. if most files have 64 bytes of metadata and only a few have a lot, we can combine methods, but we need numbers for a representative sample.
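One way to get such numbers, sketched as a size-bucket breakdown over a sample; the '12%' prefix trick is borrowed from Tim's sampling above, and the bucket boundaries are arbitrary.

```
SELECT
  CASE
    WHEN LENGTH(img_metadata) <= 256   THEN 'a: <= 256 B'
    WHEN LENGTH(img_metadata) <= 4096  THEN 'b: <= 4 KB'
    WHEN LENGTH(img_metadata) <= 65536 THEN 'c: <= 64 KB'
    ELSE 'd: > 64 KB'
  END AS size_bucket,
  COUNT(*)                  AS files,
  SUM(LENGTH(img_metadata)) AS total_bytes
FROM image
WHERE img_sha1 LIKE '12%'
GROUP BY size_bucket
ORDER BY size_bucket;
```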

The thumbnailing service is unrelated to media DB tables. The point of the migration to Thumbor was to separate thumbnailing concerns from MediaWiki entirely. The fact that I still work on Thumbor is due to lack of ownership. I have no intention of undertaking a project as large as this data migration as a side project while being an engineering manager. I'm probably already biting off more than I can chew with migrating Thumbor to Docker/Buster/Python 3 at the moment.

While performance will certainly improve by removing all this data from the table, this migration doesn't require cross-cutting expert knowledge that only the Performance Team possesses, and as such the Performance Team doesn't need to be more involved than in an advisory role.

The rough plan described by Tim to move the PDF/DJVU text metadata (to Swift, I imagine?) sounds totally fine to me. This requires MediaWiki work, to make that data consumable from its new location and to write a migration job.

The fact that we seem once again lacking a logical team to work on this needed data migration reinforces the need I've pointed out for years for us to have a Media Infrastructure team or sub-team that would take ownership of such matters, as well as Thumbor.

Hey, Gilles,

We weren't asking you or your team to do this, just whether, given your expertise, you had any input on the performance difference in MediaWiki between table access and media retrieval, since someone on your team might know the latencies to the database and to media storage on the critical path better than anyone (you have nice metrics for this kind of thing, and have focused a lot on the editor user experience lately, especially for large files). Apologies for bringing it up.

The rough plan described by Tim to move the PDF/DJVU text metadata (to Swift, I imagine?) sounds totally fine to me. This requires MediaWiki work, to make that data consumable from its new location and to write a migration job.

From the MediaWiki perspective, the easiest place to put this data would be ExternalStore, along with page content. With BlobStore, the necessary abstraction exists. I'm not sure this would be optimal, but I believe it's the cheapest in terms of engineering effort.

From the MediaWiki perspective, the easiest place to put this data would be ExternalStore, along with page content. With BlobStore, the necessary abstraction exists. I'm not sure this would be optimal, but I believe it's the cheapest in terms of engineering effort.

The easier implementation would be to just load the file from the file backend and read its metadata directly. If it gets too heavy we can add a WAN cache on top but I'm not sure that would be needed. Am I missing something obvious?

The easier implementation would be to just load the file from the file backend and read its metadata directly. If it gets too heavy we can add a WAN cache on top but I'm not sure that would be needed. Am I missing something obvious?

Not sure that's actually easier to implement, but it would work as well, and wouldn't be too hard to do.

I don't think that pulling the original every time is desirable, it would cause a lot of unnecessary internal network traffic. Some of those documents are in the hundreds of MB. It could be a DDOS vector, even, if merely hitting a URL would trigger this mechanism.

As for reading the metadata "remotely" via range requests or only reading the beginning of the file, I'm not sure that's possible. In my experience a lot of media-related command line tools can't do it. I also don't know how that metadata can be distributed inside PDFs, a format famous for having a lot of ways things can be done.

Change 660422 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/core@master] Introduce img_id column

https://gerrit.wikimedia.org/r/660422

Another idea that might be simpler to implement: why not compress the text tag? We do that for ES.

ElasticSearch uses the text extracted from document-type files, and search index updates happen every time the page is edited or a template on the page is edited, so the data needs to be stored in some permanent and quickly accessible location. ExternalStore would be one option, or just split it into a new image_metadata table (ES is somewhat wasteful since there is no need to store metadata of old image revisions; OTOH ES already handles compression, which is nice for long texts).
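A hypothetical shape for such a split-out table, purely to illustrate the idea; the names and the choice of key are not a proposal.

```
CREATE TABLE image_metadata (
  imd_name VARBINARY(255) NOT NULL,  -- or a numeric file/filerevision id once those exist
  imd_metadata MEDIUMBLOB NOT NULL,  -- the blob currently stored in image.img_metadata
  PRIMARY KEY (imd_name)
) ENGINE=InnoDB;
```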

Actually ElasticSearch just dropped most of the text by putting a 50KB limit on the text metadata (and dropping the rest) (T271493: Implement 50kb limit on file text indexing to reduce increasing commonswiki_file on-disk size). I don't think compression should be the end goal, but it would for sure make the rest of the work much easier (adding a PK, reworking and renaming the table, MCR, etc.). It also depends on how much compression would help, but I assume it would cut the size at least in half.
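The potential gain could be estimated on a sample before committing to anything, for example with MySQL's built-in zlib COMPRESS(), which should behave similarly to an application-side gzdeflate.

```
SELECT
  SUM(LENGTH(img_metadata))           AS raw_bytes,
  SUM(LENGTH(COMPRESS(img_metadata))) AS compressed_bytes
FROM image
WHERE img_sha1 LIKE '12%'             -- same sampling trick as earlier in the thread
  AND img_major_mime = 'application'; -- PDFs dominate the metadata size
```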

Would it be reasonable to pull out the issue of moving PDF/DJVU text metadata out of the image table, and make a separate task for that for the relevant discussion? Should the refactoring of the image and oldimage tables block on that?

Would it be reasonable to pull out the issue of moving PDF/DJVU text metadata out of the image table, and make a separate task for that for the relevant discussion?

It definitely should. I'll make one right away.

Should the refactoring of the image and oldimage tables block on that?

I personally think yes and no. Yes, because this table is big and doing any sort of ALTER TABLE on it while it's in this bad a state is going to be painful, especially on commons where there's space strain on the db nodes. No, because it doesn't need to fully address the issue. My ideal solution would be to first start compressing the values (or any sort of quick/easy fix), then do the alter tables like adding a PK, renaming the table, etc., and once that's done find a long-term solution for pdf/djvu, and at last the MCR work.

Replying here to @Marostegui's comment on the new ticket (T275268#6847723):

I understand why size-on-disk is a problem, and I can think of several ways to improve that.

What I am not clear on is to what extent number-of-rows would still be a problem if bytes-per-row were low (perhaps under 100 bytes).

In other words, is it sufficient to focus on normalization, or do we need to look into sharding as well? If we do one, which is preferable? How can we decide if we need both?

I don't think that pulling the original every time is desirable, it would cause a lot of unnecessary internal network traffic. Some of those documents are in the hundreds of MB. … I also don't know how that metadata can be distributed inside PDFs, a format famous for having a lot of ways things can be done.

Generating thumbnails from PDFs can in some cases take ~20 seconds for a single page (it's executing ghostscript and "printing" the thumbnail). Even if the text layer in the PDF is stored in the page, instead of at the end of the file or whatever, there is a definite risk that extracting it will have "non-interactive" latency.

For DjVu this would probably be reasonable from that perspective: the text is stored with the page, and DjVuLibre tools can access it in deterministic and reasonably fast time (not sure about CPU/RAM efficiency, but I don't think the volume is large enough to make that concern relevant).