
Run refreshImageMetadata.php --force
Open, Needs Triage, Public

Description

1.18 extracts a lot of new image metadata properties (Exif and friends) that weren't extracted previously. Please run refreshImageMetadata.php after the 1.18 deployment; it will go through every image and re-extract all the metadata.

Also, the script outputs the name of any image whose img_metadata field was larger before the refresh than after it. This could indicate a bug in the new img_metadata code, since almost all of the old properties are still extracted, so the img_metadata field should either stay the same or grow in size. (There is one exception to this, but it should be very rare.) It'd be great if the script's stdout could be redirected to a file and posted somewhere, so I could investigate any cases of the metadata decreasing in size.

Cheers.

The command to be executed:

foreachwiki refreshImageMetadata.php --force --mediatype=BITMAP

Details

Reference
bz30961

Event Timeline

(In reply to comment #5)

Still running? Finished? Halted?

I believe it's halted pending me fixing bug 31740 (which I plan to do soon).

Yes, it was stopped for that bug, and I've now backported and deployed the bugfix and restarted the script.

It looks like it just crashed due to OOM, while doing commonswiki.

Core dumping indicates that the bulk of the leaked memory is in image metadata structures, such as XMP. It's possible that it is leaking LocalFile objects.

Thehelpfulonewiki wrote:

Has this been run successfully?

(In reply to comment #10)

Has this been run successfully?

I imagine not, as no one has addressed the memory leaks.

Was scheduled for 1.18 deployment. Adding to 1.20wmf3 deployment now.

(In reply to comment #12)

Was scheduled for 1.18 deployment. Adding to 1.20wmf3 deployment now.

There's no point. The script needs fixing first.

(In reply to comment #13)

(In reply to comment #12)

Was scheduled for 1.18 deployment. Adding to 1.20wmf3 deployment now.

There's no point. The script needs fixing first.

Aha, I didn't read the previous comment right. Makes sense now.

(In reply to comment #6)

(In reply to comment #5)

Still running? Finished? Halted?

I believe it's halted pending me fixing bug 31740 (which I plan to do soon).

That bug has been resolved as fixed in the meantime.

(In reply to comment #9)

Core dumping indicates that the bulk of the leaked memory is in image metadata structures, such as XMP. It's possible that it is leaking LocalFile objects.

(In reply to comment #13)

(In reply to comment #12)

Was scheduled for 1.18 deployment. Adding to 1.20wmf3 deployment now.

There's no point. The script needs fixing first.

What still has to be fixed? Can a new bug be opened as a blocker to this one, please? Otherwise, how about another run?

What still has to be fixed? Can a new bug be opened as a blocker to this one, please? Otherwise, how about another run?

The script apparently has memory leaks. Personally I'm not sure how or why that happens in a garbage-collected language like PHP (furthermore, one needs a test wiki with more than 10 files on it to really reproduce the issue), so fixing it is a bit beyond me.
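
For illustration (this is not the actual maintenance script, just a generic pattern): a leak in a long-running PHP script doesn't have to involve the garbage collector at all. Any process-lifetime cache that gains an entry per processed item, say a cache of file objects keyed by name, grows without bound, and no amount of GC helps because the entries are still referenced:

<?php
// Minimal sketch of an unbounded per-item cache in a long-running loop.
// The GC cannot free the entries because the static array still references
// them, so memory grows with every distinct "file" processed.
function getFileObject( $name ) {
    static $cache = array();           // lives for the whole process
    if ( !isset( $cache[$name] ) ) {
        $cache[$name] = array(
            'name'     => $name,
            'metadata' => str_repeat( 'x', 50000 ),  // stand-in for parsed Exif/XMP
        );
    }
    return $cache[$name];
}

for ( $i = 0; $i < 5000; $i++ ) {
    getFileObject( "File_$i.jpg" );    // every distinct name adds a cache entry
    if ( $i % 1000 === 0 ) {
        printf( "%5d files: %.1f MB\n", $i, memory_get_usage() / 1048576 );
    }
}

If something like this is happening (e.g. LocalFile objects being cached per title somewhere), the fix would be to prune or bypass that cache in the maintenance loop rather than to tune the GC.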

(In reply to comment #15)

The script needs fixing first.

What still has to be fixed? Can a new bug be opened as a blocker to this one, please? Otherwise, how about another run?

Could somebody please specify in a new bug report what needs to be fixed in the script (any pointers, etc.), and make it block this report?

High priority for 20 months & no changes for 9 months => resetting to normal.

High priority was probably an exaggeration, and it's especially not warranted now. As people notice a file without metadata, they can purge it if it bothers them.

As for fixing the bug, I'm not really sure how or why there is a memory leak in the script, or how to debug it.

tomasz set Security to None.
Dereckson lowered the priority of this task from Medium to Lowest. Apr 14 2016, 11:48 AM
Dereckson subscribed.

If someone is planning to work on it, please raise the priority to low (or normal if you're going to do it soon), see Z398.

matmarex raised the priority of this task from Lowest to Medium. Jul 16 2016, 9:33 PM
matmarex added a project: Multimedia.
matmarex subscribed.

The script apparently has memory leaks. Personally I'm not sure how or why that happens in a garbage-collected language like PHP (furthermore, one needs a test wiki with more than 10 files on it to really reproduce the issue), so fixing it is a bit beyond me.

The last attempt to run the script was on Oct 25 2011 (T32961#346087). I can't find out whether we were using PHP 5.2 or 5.3 back then, but it's worth noting that 5.2 was unable to garbage-collect cycles (http://php.net/manual/pl/features.gc.collecting-cycles.php). Perhaps it was just that, and we could successfully run the script today. I think we should try that first, and if it still OOMs, then look into it. And if it's still a problem, we're all four years older than when this was last touched; I bet @Bawolff could debug it with his eyes closed today ;)
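
To make the PHP 5.2 point concrete, here is a toy example (unrelated to the actual script) of the kind of garbage that only the cycle collector introduced in 5.3 can reclaim:

<?php
// Toy example: objects in a reference cycle. PHP 5.2 never freed these;
// since 5.3 they are reclaimed when the cycle collector runs (automatically
// once the root buffer fills, or explicitly via gc_collect_cycles()).
class Node {
    public $self;
    public $payload;
    public function __construct() {
        $this->self    = $this;                      // the cycle
        $this->payload = str_repeat( 'x', 100000 );  // ~100 KB per object
    }
}

for ( $i = 0; $i < 1000; $i++ ) {
    $n = new Node();
    unset( $n );   // refcount never reaches zero because of the self-reference
}

printf( "before collection: %.1f MB\n", memory_get_usage() / 1048576 );
if ( function_exists( 'gc_collect_cycles' ) ) {
    gc_collect_cycles();   // this function only exists since PHP 5.3
}
printf( "after collection:  %.1f MB\n", memory_get_usage() / 1048576 );

If the 2011 runs were on 5.2, every metadata structure caught in a cycle like this would have stuck around for the lifetime of the process.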

@matmarex Please provide a test run procedure to determine how memory behaves.

matmarex renamed this task from Run refreshImageMetadata.php to Run refreshImageMetadata.php --force. Jul 16 2016, 9:45 PM
matmarex removed a subscriber: wikibugs-l-list.

I'd say start the script, let it run for a day or two while watching its memory usage with top or something, and verify that it doesn't grow indefinitely? Anyway, there are a few things we should do first; otherwise we'll have to run it twice.

(Assigning to self so that I don't forget about this, I'm not a deployer and I'll need someone else to actually do the deed.)

I think there are no blockers here anymore. We should try doing this. But first:

  • @jcrespo The maintenance script will UPDATE all rows in the image table (I'd like it to be run for all databases; the table is biggest on commonswiki with ~34 million rows). I just want to make sure that this is reasonable.
  • @aaron You previously removed metadata refreshing from action=purge, it was apparently implicated in T132921 (9120ee007ae32). Since this will refresh metadata for all files, it should probably be throttled heavily… Could you review the refreshImageMetadata.php script to see whether it is sane?

If you're both okay with doing this, we should probably start with testwiki for another sanity check (~5000 files). Then do all the smaller wikis and Commons (I did a select count(*) from image for each wiki in all.dblist to see the sizes: P4193).

As discussed earlier in this task, the last time we tried to run the script, it OOM'd. That was five years ago and on PHP 5.2 or 5.3, though, and I suspect that was just PHP's garbage collection bugs. But whoever runs the script (I nominate @aaron) should watch its memory usage.

Here are the tips:

  • Make sure you do not create lag. I have not checked it, but make sure your transactions are short and that you check for lag every second.
  • Take your time and check concurrency: do not run one thread per wiki with 800 wikis on s3.
  • If you intend to update a large number of rows, add it to the weekly Deployments page ("week of") with an estimate of how much time it is going to take (see Long running tasks/scripts): https://wikitech.wikimedia.org/wiki/Deployments#Week_of_October_10th Also log when you start, especially for Commons. This will make sure we do not run schema changes at the same time.

It seems sane to me, and +1 to Jamie's comment.

@matmarex: You might want to go out of your way to skip audio/video files for now, btw. Parsing that stuff is super expensive, I believe, because as far as I've been told, it will fetch the entire file or something from the storage cluster to parse the metadata. Since video files tend to be large…

  • Make sure you do not create lag. I have not checked it, but make sure your transactions are short and that you check for lag every second.

The script calls wfWaitForSlaves() after every batch of 200 images; is that okay, or should we do smaller batches?
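
For context, the loop being discussed boils down to roughly this pattern (a simplified, self-contained sketch rather than the actual refreshImageMetadata.php code; waitForReplication() and fetchBatch() are stand-ins for wfWaitForSlaves() and the real image query):

<?php
// Sketch of the batch pattern: process one batch, then wait for replication
// to catch up before fetching the next batch.
define( 'BATCH_SIZE', 200 );

function waitForReplication() {
    // In MediaWiki this blocks until replica lag drops below a threshold;
    // here we just sleep so the sketch stays runnable on its own.
    usleep( 100 * 1000 );
}

function fetchBatch( $afterName, $limit ) {
    // Stand-in for: SELECT img_name FROM image WHERE img_name > $afterName
    //               ORDER BY img_name LIMIT $limit
    static $names = array( 'A.jpg', 'B.jpg', 'C.jpg' );
    $rest = array_filter( $names, function ( $n ) use ( $afterName ) {
        return strcmp( $n, $afterName ) > 0;
    } );
    return array_slice( array_values( $rest ), 0, $limit );
}

$last = '';
do {
    $batch = fetchBatch( $last, BATCH_SIZE );
    foreach ( $batch as $name ) {
        // ...re-extract and save metadata for $name here...
        $last = $name;
    }
    waitForReplication();   // the wfWaitForSlaves() call after every batch
} while ( count( $batch ) === BATCH_SIZE );

Making the batches smaller only changes BATCH_SIZE (and how often we pause); the write pattern per row stays the same.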

  • Take your time and check concurrency: do not run one thread per wiki with 800 wikis on s3.

Right, we usually run these sequentially (with foreachwiki).

  • If you intend to update a large number of rows, add it to the weekly Deployments page ("week of") with an estimate of how much time it is going to take (see Long running tasks/scripts): https://wikitech.wikimedia.org/wiki/Deployments#Week_of_October_10th Also log when you start, especially for Commons. This will make sure we do not run schema changes at the same time.

At the moment I have no idea how long it's going to take. We should do one of the smaller wikis first, and estimate based on that.

@matmarex: You might want to go out of your way to skip audio/video files for now, btw. Parsing that stuff is super expensive, I believe, because as far as I've been told, it will fetch the entire file or something from the storage cluster to parse the metadata. Since video files tend to be large…

So, refreshImageMetadata.php --force --mediatype=BITMAP, I guess. (We do also have images of ridiculous sizes… but probably fewer.)


@aaron Would you be able to run the script for testwiki this week? (Less than 5000 files. We'll see how fast it goes, hopefully it'll take minutes at most…)

I guess that's a "no". I might not be around to watch this next week (I'll be at an off-site meeting), but I'll start poking people again the week after.

Terbium is currently busy with a Wikidata sanitize script.

For testwiki:

Finished refreshing file metadata for 4644 files. 1 needed to be refreshed, 4643 did not need to be but were refreshed anyways, and 714 refreshes were suspicious.

real    12m46.582s
user    1m50.603s
sys     0m45.917s

Here are some statistics on the changes. I ran select img_name, img_metadata as img_metadata from image; before and after Reedy ran the maintenance script, and compared the results with a crafty, ugly little script.
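
For reference, a minimal sketch of the kind of comparison that produces the table below, assuming two tab-separated dumps of img_name and (PHP-serialized) img_metadata taken before and after the run; the dump file names are made up:

<?php
// Diff two dumps of "img_name<TAB>img_metadata" (PHP-serialized blobs) and
// count, per metadata key, how many files added / removed / kept / changed it.
function loadDump( $path ) {
    $rows = array();
    foreach ( file( $path, FILE_IGNORE_NEW_LINES ) as $line ) {
        list( $name, $blob ) = explode( "\t", $line, 2 );
        $meta = @unserialize( $blob );
        $rows[$name] = is_array( $meta ) ? $meta : array();
    }
    return $rows;
}

$before = loadDump( 'image-before.tsv' );   // hypothetical dump file names
$after  = loadDump( 'image-after.tsv' );

$stats = array();   // key => array( added, removed, identical, changed )
foreach ( $after as $name => $newMeta ) {
    $oldMeta = isset( $before[$name] ) ? $before[$name] : array();
    $keys = array_unique( array_merge( array_keys( $oldMeta ), array_keys( $newMeta ) ) );
    foreach ( $keys as $key ) {
        if ( !isset( $stats[$key] ) ) {
            $stats[$key] = array( 0, 0, 0, 0 );
        }
        if ( !array_key_exists( $key, $oldMeta ) ) {
            $stats[$key][0]++;   // added
        } elseif ( !array_key_exists( $key, $newMeta ) ) {
            $stats[$key][1]++;   // removed
        } elseif ( $oldMeta[$key] === $newMeta[$key] ) {
            $stats[$key][2]++;   // identical
        } else {
            $stats[$key][3]++;   // changed
            if ( is_scalar( $oldMeta[$key] ) && is_scalar( $newMeta[$key] ) ) {
                echo "Changed: $key old={$oldMeta[$key]} new={$newMeta[$key]}\n";
            }
        }
    }
}

foreach ( $stats as $key => $counts ) {
    echo $key . "\t" . implode( "\t", $counts ) . "\n";
}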

Some of these are unexpected. We're apparently failing to extract GPS data from a number of files that we could extract some years ago when they were uploaded? We should probably investigate this before running the script everywhere :/ I have a copy of the data if anyone is interested.

Key | Added | Removed | Identical | Changed
MEDIAWIKI_EXIF_VERSION0039210
DateTime0022310
Model0025930
Orientation0024900
ImageLength0010670
ImageWidth0010680
Make0025940
XResolution0020550
YResolution0020550
ResolutionUnit0020540
Copyright00590
WebStatement00220
Copyrighted00480
UsageTerms00210
iimVersion001650
YCbCrPositioning0019490
ISOSpeedRatings0016080
ExifVersion0023140
DateTimeOriginal0021650
DateTimeDigitized0021640
ComponentsConfiguration0018920
FocalLength0019550
FlashPixVersion059113500
ColorSpace059218300
CompressedBitsPerPixel006190
ExposureTime0017560
FNumber0017500
ExposureProgram0015960
ShutterSpeedValue0014780
ApertureValue0014510
BrightnessValue0012410
ExposureBiasValue0012420
MaxApertureValue0012610
SubjectDistance003580
MeteringMode0017190
LightSource006350
Flash0017210
FlashEnergy003120
SensingMethod01087590
SceneType01074360
CustomRendered01265850
ExposureMode058210840
WhiteBalance058410980
DigitalZoomRatio0905710
SceneCaptureType058311800
Contrast01173860
Saturation01173860
Sharpness01194110
SubjectDistanceRange0274110
ImageUniqueID04142910
GPSTimeStamp02147460
GPSDateStamp02126710
version002060
streams00630
length00630
offset00580
Software0013390
SubSecTime030410
SubSecTimeOriginal041590
SubSecTimeDigitized040590
FocalPlaneXResolution031790
FocalPlaneYResolution031800
FocalPlaneResolutionUnit031800
GPSVersionID02441743
frameCount206990
loopCount206550
duration206990
bitDepth206550
colorType206550
metadata307360
JPEGFileComment005390
looped00440
ImageDescription00910
Writer00340
SpecialInstructions00350
Artist00750
Source00340
ObjectName00870
CityDest00280
OriginalTransmissionRef00340
iimCategory00360
iimSupplementalCategory00350
Keywords00790
FileSource01211660
CountryDest00400
Credit00340
FlashpixVersion00420
PixelXDimension001000
PixelYDimension001000
DateTimeMetadata00790
FocalLengthIn35mmFilm01034400
GainControl0771270
WhitePoint00130
PrimaryChromaticities00130
YCbCrCoefficients00130
UserComment0421750
ProvinceOrStateDest00220
SerialNumber00750
Lens00760
PhotometricInterpretation00290
SamplesPerPixel00270
BitsPerSample00290
Rating00110
GPSLatitude01893123
GPSLongitude01893123
GPSAltitude018426517
width001330
height001330
Contact0030
GETID3_VERSION0090
filesize0090
avdataoffset0090
avdataend0090
fileformat0090
audio0050
video0080
encoding0090
mime_type0090
playtime_seconds0090
bitrate0090
playtime_string0090
GPSImgDirectionRef03760
GPSImgDirection04760
CameraOwnerName0040
OriginalDocumentID00200
xml0020
description0080
SublocationDest0010
Compression0090
PlanarConfiguration0080
Label0030
RelatedSoundFile0500
ExposureIndex0600
animated0010
originalWidth00610
originalHeight00610
Producer00190
CreationDate00190
ModDate00170
Tagged00190
Pages00190
Encrypted00190
pages00190
File size00190
Optimized00190
PDF version00190
mergedMetadata00120
text00150
warning0030
GPSAltitudeRef0020
Author0080
Creator00160
page_data00150
page_count00150
first_page00150
last_page00150
exif00132
TIFF_METADATA_VERSION00150
Title00120
Subject0030
translations00450
ReferenceBlackWhite0040
title0070
GPSStatus0130
GPSMapDatum03350
Headline0060
CountryCodeDest0060
GPSDifferential0100
warnings0900
GPSSpeedRef0020
GPSSpeed0020
GPSMeasureMode0200
GPSDOP0200
error0010

At least the "Changed" ones are harmless.

Some values were previously recorded as rational numbers, as in the EXIF data; now they are stored as floats:

Changed: GPSAltitude old=0/1 new=0
Changed: GPSAltitude old=0/1 new=0
Changed: GPSAltitude old=0/1 new=0
Changed: GPSAltitude old=0/1000 new=0
Changed: GPSAltitude old=112/1 new=112
Changed: GPSAltitude old=15907/48 new=331.3958333333333
Changed: GPSAltitude old=180/1 new=180
Changed: GPSAltitude old=26178/83 new=315.3975903614458
Changed: GPSAltitude old=29553/94 new=314.3936170212766
Changed: GPSAltitude old=38365/124 new=309.39516129032256
Changed: GPSAltitude old=38365/124 new=309.39516129032256
Changed: GPSAltitude old=46799/1000 new=46.799
Changed: GPSAltitude old=46799/1000 new=46.799
Changed: GPSAltitude old=46799/1000 new=46.799
Changed: GPSAltitude old=48797/272 new=179.40073529411765
Changed: GPSAltitude old=63/1 new=63
Changed: GPSAltitude old=93785/283 new=331.3957597173145

Some have inconsequential differences due to some float precision stuff:

Changed: GPSLatitude old=32.24146391666667 new=32.241463920166666
Changed: GPSLatitude old=32.24146391666667 new=32.241463920166666
Changed: GPSLatitude old=32.24146391666667 new=32.241463920166666
Changed: GPSLongitude old=-110.94342988333334 new=-110.94342988566666
Changed: GPSLongitude old=-110.94342988333334 new=-110.94342988566666
Changed: GPSLongitude old=-110.94342988333334 new=-110.94342988566666

And some were suffering from endianness issues:

Changed: GPSVersionID old=0.0.0.2 new=2.0.0.0
Changed: GPSVersionID old=0.0.0.2 new=2.0.0.0
Changed: GPSVersionID old=0.0.0.2 new=2.0.0.0

I've looked at a couple of files in detail, and it looks like the "Removed" values are caused by a PHP bug: https://bugs.php.net/bug.php?id=72682 (the missing data is all after a MakerNote tag). It's already fixed in the latest version (or mostly fixed; someone is still complaining on the bug, but I've gotten good results locally with the files I tried), but we clearly have something older. (And the bug was apparently introduced by a security patch, https://bugs.php.net/bug.php?id=72603, which was probably backported everywhere…)
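
If anyone wants to check whether a particular PHP build is affected, something like the following (using PHP's stock exif_read_data(); the local file path is made up) lists the tags that come back, and on an affected build the tags stored after the MakerNote, including the whole GPS section, are simply missing:

<?php
// List the EXIF sections/tags PHP's exif extension returns for one file.
$file = '/tmp/BumpassHell_8328.jpg';   // hypothetical local copy of a test file

$exif = exif_read_data( $file, null, true );   // true => group tags by section
if ( $exif === false ) {
    die( "exif_read_data() failed for $file\n" );
}

foreach ( $exif as $section => $tags ) {
    echo $section . ': ' . implode( ', ', array_keys( $tags ) ) . "\n";
}

echo isset( $exif['GPS'] ) ? "GPS section present\n" : "GPS section missing\n";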

If possible, we should retry the maintenance script using HHVM rather than PHP, which shouldn't be suffering from this problem. Or we'll need to wait a while again until we upgrade PHP.

Reedy re-ran the script with HHVM. That looks much better: it looks like nothing is getting lost, and even some new stuff is getting extracted. But there are also more changes?

Key | Added | Removed | Identical | Changed
MEDIAWIKI_EXIF_VERSION0039210
DateTime30218051
Model54025930
Orientation53024900
ImageLength53010670
ImageWidth53010680
Make53025931
XResolution0020541
YResolution0020541
ResolutionUnit0020540
Copyright00590
WebStatement00220
Copyrighted00480
UsageTerms00210
iimVersion001650
YCbCrPositioning0019490
ISOSpeedRatings0016080
ExifVersion0023140
DateTimeOriginal0021641
DateTimeDigitized1021640
ComponentsConfiguration0018920
FocalLength1019550
FlashPixVersion0019410
ColorSpace0024220
CompressedBitsPerPixel006181
ExposureTime0017551
FNumber0017491
ExposureProgram0015960
ShutterSpeedValue0014771
ApertureValue0014501
BrightnessValue0012410
ExposureBiasValue0012411
MaxApertureValue0012601
SubjectDistance003571
MeteringMode0017190
LightSource006350
Flash0017210
FlashEnergy003120
SensingMethod008670
SceneType005430
CustomRendered007110
ExposureMode0016660
WhiteBalance0016820
DigitalZoomRatio006610
SceneCaptureType0017630
Contrast005030
Saturation005030
Sharpness005300
SubjectDistanceRange004380
ImageUniqueID007050
GPSTimeStamp1609600
GPSDateStamp1608830
version002060
streams00630
length00630
offset00580
Software0013390
SubSecTime00710
SubSecTimeOriginal001000
SubSecTimeDigitized00990
FocalPlaneXResolution101100
FocalPlaneYResolution001101
FocalPlaneResolutionUnit001110
GPSVersionID004210
frameCount206990
loopCount206550
duration206990
bitDepth206550
colorType206550
metadata307360
JPEGFileComment005390
looped00440
FileSource002870
ImageDescription00910
Writer00340
SpecialInstructions00350
Artist00750
Source00340
ObjectName00870
CityDest00280
OriginalTransmissionRef00340
iimCategory00360
iimSupplementalCategory00350
Keywords00790
CountryDest00400
Credit00340
FlashpixVersion00420
PixelXDimension001000
PixelYDimension001000
DateTimeMetadata00790
FocalLengthIn35mmFilm005430
GainControl002040
WhitePoint00130
PrimaryChromaticities00130
YCbCrCoefficients00130
UserComment004960
ProvinceOrStateDest00220
SerialNumber00750
Lens00760
PhotometricInterpretation00290
SamplesPerPixel00270
BitsPerSample00290
Rating00110
GPSLatitude1605040
GPSLongitude1605040
GPSAltitude16044917
width001330
height001330
Contact0030
GETID3_VERSION0090
filesize0090
avdataoffset0090
avdataend0090
fileformat0090
audio0050
video0080
encoding0090
mime_type0090
playtime_seconds0090
bitrate0090
playtime_string0090
GPSImgDirectionRef00790
GPSImgDirection00800
CameraOwnerName0040
OriginalDocumentID00200
xml0020
description0080
SublocationDest0010
Compression0090
PlanarConfiguration0080
Label0030
RelatedSoundFile0050
ExposureIndex0060
animated0010
originalWidth00610
originalHeight00610
Producer00190
CreationDate00190
ModDate00170
Tagged00190
Pages00190
Encrypted00190
pages00190
File size00190
Optimized00190
PDF version00190
mergedMetadata00120
text00150
warning0030
GPSAltitudeRef0020
Author0080
Creator00160
page_data00150
page_count00150
first_page00150
last_page00150
exif00132
TIFF_METADATA_VERSION00150
Title00120
Subject0030
translations00450
ReferenceBlackWhite0040
title0070
GPSStatus0040
GPSMapDatum00380
Headline0060
CountryCodeDest0060
GPSDifferential0010
warnings0900
GPSSpeedRef0020
GPSSpeed0020
GPSMeasureMode0020
GPSDOP0020
error0010

Looking at the new changes:

Some DateTimes have changed by a few seconds. Looking at the problematic files, they seem to have duplicate DateTime tags with actually different values; apparently the old code took the first one, while the new code takes the last one. That should be harmless.

Changed: DateTime old=2012:06:28 20:26:27 new=2012:06:28 20:26:23
Changed: DateTime old=2012:07:06 15:06:07 new=2012:07:06 15:06:03
Changed: DateTime old=2012:07:10 09:12:08 new=2012:07:10 09:11:54
Changed: DateTime old=2012:07:10 09:12:08 new=2012:07:10 09:11:54
Changed: DateTime old=2012:07:10 10:44:25 new=2012:07:10 10:44:22
Changed: DateTime old=2012:07:10 15:26:42 new=2012:07:10 15:26:38
Changed: DateTime old=2012:07:10 18:13:33 new=2012:07:10 18:13:30
Changed: DateTime old=2012:07:11 11:20:57 new=2012:07:11 11:20:54
Changed: DateTime old=2012:07:11 11:39:15 new=2012:07:11 11:39:12
Changed: DateTime old=2012:07:11 11:42:34 new=2012:07:11 11:42:32
Changed: DateTime old=2012:07:11 11:42:59 new=2012:07:11 11:42:57
Changed: DateTime old=2012:07:11 11:44:16 new=2012:07:11 11:44:14
Changed: DateTime old=2012:07:11 11:49:14 new=2012:07:11 11:49:11
Changed: DateTime old=2012:07:11 11:49:14 new=2012:07:11 11:49:11
Changed: DateTime old=2012:07:11 11:52:09 new=2012:07:11 11:52:06
Changed: DateTime old=2012:07:11 11:54:58 new=2012:07:11 11:54:55
Changed: DateTime old=2012:07:11 11:54:58 new=2012:07:11 11:54:55
Changed: DateTime old=2012:07:11 11:54:58 new=2012:07:11 11:54:55
Changed: DateTime old=2012:07:11 11:54:58 new=2012:07:11 11:54:55
Changed: DateTime old=2012:07:11 12:00:33 new=2012:07:11 12:00:30
Changed: DateTime old=2012:07:11 12:06:32 new=2012:07:11 12:06:29
Changed: DateTime old=2012:07:11 12:09:41 new=2012:07:11 12:09:38
Changed: DateTime old=2012:07:11 12:29:08 new=2012:07:11 12:29:04
Changed: DateTime old=2012:07:11 13:54:13 new=2012:07:11 13:54:08
Changed: DateTime old=2012:07:11 14:12:27 new=2012:07:11 14:12:24
Changed: DateTime old=2012:07:11 14:12:27 new=2012:07:11 14:12:24
Changed: DateTime old=2012:07:11 14:12:27 new=2012:07:11 14:12:24
Changed: DateTime old=2012:07:11 14:24:58 new=2012:07:11 14:24:55
Changed: DateTime old=2012:07:11 14:24:58 new=2012:07:11 14:24:55
Changed: DateTime old=2012:07:13 07:27:23 new=2012:07:13 07:27:16
Changed: DateTime old=2012:07:13 08:08:36 new=2012:07:13 08:08:31
Changed: DateTime old=2012:07:13 10:05:46 new=2012:07:13 10:05:38
Changed: DateTime old=2012:07:13 10:22:46 new=2012:07:13 10:22:40
Changed: DateTime old=2012:07:13 17:12:22 new=2012:07:13 17:12:17
Changed: DateTime old=2012:07:13 17:35:03 new=2012:07:13 17:34:59
Changed: DateTime old=2012:07:17 15:30:50 new=2012:07:17 15:30:46
Changed: DateTime old=2012:07:20 12:09:43 new=2012:07:20 12:09:40
Changed: DateTime old=2012:07:23 17:58:22 new=2012:07:23 17:58:18
Changed: DateTime old=2012:07:23 19:08:02 new=2012:07:23 19:07:57
Changed: DateTime old=2012:08:20 14:15:10 new=2012:08:20 14:15:04
Changed: DateTime old=2012:08:25 07:40:58 new=2012:08:25 07:40:54
Changed: DateTime old=2012:08:29 13:47:29 new=2012:08:29 13:47:25
Changed: DateTime old=2012:08:29 20:48:39 new=2012:08:29 20:48:30
Changed: DateTime old=2012:08:30 13:12:31 new=2012:08:30 13:12:23
Changed: DateTime old=2012:08:30 14:34:36 new=2012:08:30 14:34:24
Changed: DateTime old=2012:09:01 11:17:45 new=2012:09:01 11:17:40
Changed: DateTime old=2012:09:22 14:55:18 new=2012:09:22 14:55:15
Changed: DateTime old=2012:09:22 17:40:24 new=2012:09:22 17:40:19
Changed: DateTime old=2012:09:22 19:30:38 new=2012:09:22 19:30:29
Changed: DateTime old=2012:09:24 12:52:25 new=2012:09:24 12:52:14

The various instances of single changes appear to be a case of T97253, which is exactly what we're trying to fix, yay! (https://test.wikipedia.org/wiki/File:BumpassHell_8328.jpg) That file was apparently also affected by T5892 once upon a time, heh.

Changed: Make old=IF new=Canon
Changed: XResolution old=1124073500/1852796513 new=180/1
Changed: YResolution old=1851867904/1344302703 new=180/1
Changed: DateTime old=owerShot S110 new=2005:10:25 12:56:28
Changed: ExposureTime old=11796480/65536 new=1/800
Changed: FNumber old=808583168/825898288 new=72/10
Changed: DateTimeOriginal old=0:25 12:56:28 new=2005:10:25 12:56:28
Changed: CompressedBitsPerPixel old=976367904/842675765 new=3/1
Changed: ShutterSpeedValue old=808583224/825898288 new=309/32
Changed: ApertureValue old=892484144/976367904 new=183/32
Changed: ExposureBiasValue old=842675765/196664 new=-3/3
Changed: MaxApertureValue old=65536/20250624 new=194698/65536
Changed: SubjectDistance old=2097152/11993088 new=2215/1000
Changed: FocalPlaneYResolution old=2/145162241 new=1200000/155

Right, so I believe we're fine from the correctness perspective. 4644 files is not a huge sample, but testwiki tends to have all the weird files causing problems uploaded to it, so I think it's unlikely we missed any terrible bugs that are going to wreck our metadata.

Processing 4644 image files on testwiki took 12m46.582s, so my estimate for processing 34,000,000 image files on Commons is around two months. I'll talk to Greg and schedule this to happen some time soon (not sure if we want to do this during vacation, although I think it's low risk).
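
The back-of-the-envelope math behind that estimate, assuming Commons files take about as long per file as testwiki files did (a rough assumption, since Commons is not a comparable sample):

<?php
// Scale the testwiki timing (12m46.582s for 4644 files) to ~34M Commons files.
$secondsPerFile = ( 12 * 60 + 46.582 ) / 4644;       // ≈ 0.165 s per file
$commonsFiles   = 34000000;
$days = $secondsPerFile * $commonsFiles / 86400;     // ≈ 65 days
printf( "%.3f s/file, about %.0f days for Commons\n", $secondsPerFile, $days );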

Note that the maintenance script must be run with the hhvm executable and not php5 (mwscript currently uses the latter!).

We might want to delay this by a few more days if that lets us deploy the HHVM patch for T148606 first.

@jcrespo question for you on this: @matmarex estimates it'll take roughly 2 months to complete. Should we start now-ish or wait until after the new year? I'm on the fence, honestly.

my estimate for processing 34,000,000 image files on Commons is around two months

I (and probably RelEng) will not allow a single script to run uninterruptedly for 2 months (that doesn't mean we will not allow running this over several months). It has to be puppetized, do (relatively) short runs so that it reloads its MediaWiki version and database configuration, and survive a terbium restart or a failover to codfw. In a few months we may be doing a datacenter failover or have periods of read-only time, during which your script has to be stopped or reloaded. Also, MediaWiki will be upgraded to new versions, and the script should be prepared to use those new versions.

Two months seems a bit too much. Is the script completely serialized? Have you considered asking for extra resources (a dedicated server) and doing multiple parallel runs (not on concurrent data), serializing only the database changes (I assume many rows will be unchanged)?

If this involves physically reading every single file we have, I would involve the operators of the Swift file backend, too.

/me waits for follow-up questions to be resolved :)

I (and probably RelEng) will not allow a single script to run uninterruptedly for 2 months (that doesn't mean we will not allow running this over several months). It has to be puppetized, do (relatively) short runs so that it reloads its MediaWiki version and database configuration, and survive a terbium restart or a failover to codfw. In a few months we may be doing a datacenter failover or have periods of read-only time, during which your script has to be stopped or reloaded. Also, MediaWiki will be upgraded to new versions, and the script should be prepared to use those new versions.

Right, that is reasonable. The script can be interrupted (killed) at any time, and it can be resumed later using the --start=filename command line option (where filename should be the last filename in the output before the script was killed). It can be run in smaller batches using --start and --end (e.g. hhvm refreshImageMetadata.php --force --mediatype=BITMAP --start=A --end=B).

Using --start with --mediatype=BITMAP will result in queries like the one below. There is no ideal index for this; we only have separate indexes on img_name (PRIMARY) and on img_media_type (img_media_mime). But bitmaps are the vast majority of files, so these queries should be okay as long as MySQL uses the PRIMARY index.

select * from commonswiki.image
where img_name>'B' and img_media_type='BITMAP'
order by img_name
limit 200;

I don't know how to puppetize it (or why we'd need to puppetize it); do we have docs or examples?

Two months seems a bit too much. Is the script completely serialized? Have you considered asking for extra resources (a dedicated server) and doing multiple parallel runs (not on concurrent data), serializing only the database changes (I assume many rows will be unchanged)?

What do you mean by "completely serialized"? It processes files alphabetically, one by one, in a single thread. So I guess yes?

We could easily run multiple instances of refreshImageMetadata.php with the --start and --end options (see above). But I don't think we can coordinate database writes between them in any way. And I have no idea how many we can run in parallel before some part of our infrastructure gives out :) From my perspective, this doesn't need to be done quickly (but I'd like it to be done eventually), so I could wait those two months and I'd rather not ask for extra resources.
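
If we did want to parallelize, the simplest split is by the first character of img_name; a tiny sketch that just prints non-overlapping invocations (the boundaries are arbitrary and uneven, and it's worth double-checking how the script treats --start/--end at the edges before relying on exact boundaries):

<?php
// Print non-overlapping refreshImageMetadata.php invocations, split on
// arbitrary img_name boundaries. The first range has no --start and the
// last has no --end, so together they cover the whole table.
$boundaries = array( 'A', 'F', 'K', 'P', 'U' );
$base = 'hhvm refreshImageMetadata.php --force --mediatype=BITMAP';

$ranges = array();
$prev = null;
foreach ( $boundaries as $b ) {
    $ranges[] = array( $prev, $b );
    $prev = $b;
}
$ranges[] = array( $prev, null );

foreach ( $ranges as $range ) {
    list( $start, $end ) = $range;
    $cmd = $base;
    if ( $start !== null ) {
        $cmd .= ' --start=' . $start;
    }
    if ( $end !== null ) {
        $cmd .= ' --end=' . $end;
    }
    echo $cmd . "\n";
}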

The script will currently issue an UPDATE query for every single row in the table, but the data will often be the same (in the testwiki run, 105 rows were changed, 4771 unchanged). (Should we make an effort to avoid these no-op UPDATEs, or can MySQL optimize them out?)
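
One application-level way to avoid the no-op writes, sketched here with PDO for self-containedness (MediaWiki would use its own DB layer) and assuming img_metadata stores the PHP-serialized array: re-serialize the freshly extracted metadata and only issue the UPDATE when the blob actually differs from what is stored.

<?php
// Sketch: skip the UPDATE when the re-extracted metadata serializes to the
// exact blob that is already stored in img_metadata.
function maybeUpdateMetadata( PDO $db, $imgName, $storedBlob, array $newMetadata ) {
    $newBlob = serialize( $newMetadata );   // assumption: blob format is serialize()
    if ( $newBlob === $storedBlob ) {
        return false;   // nothing changed, skip the write entirely
    }
    $stmt = $db->prepare( 'UPDATE image SET img_metadata = ? WHERE img_name = ?' );
    $stmt->execute( array( $newBlob, $imgName ) );
    return true;
}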

If this involves physically reading every single file we have, I would involve the operators of the Swift file backend, too.

It does (but assuming that we run with --mediatype=BITMAP, as proposed, it'll only be bitmap files, no videos etc.). @aaron said earlier on this task that it's okay; I guess we should CC @fgiunchedi too.

My 2¢: Instead of using a super-long running maintenance script, consider just having a script to (slowly) queue jobs in the job queue. We can set up a dedicated job runner for it, so the rest of the queue isn't impacted, and pretty well control parallelism, monitoring, etc. And it will (well, should) automatically handle slave lag and backoff appropriately.

Another thing that should be considered is whether this is a one-off run, or whether we should plan to run it more often as new metadata stuff is supported by MW.
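
A rough sketch of what such a job could look like, assuming the Job / JobQueueGroup API of that era and assuming LocalFile::upgradeRow() is the call that re-extracts and saves metadata for a single file (the class name and the 'refreshFileMetadata' queue name are made up and would need to be registered in $wgJobClasses):

<?php
// Hypothetical job that refreshes the metadata of one file.
class RefreshFileMetadataJob extends Job {
    public function __construct( Title $title, array $params = array() ) {
        parent::__construct( 'refreshFileMetadata', $title, $params );
    }

    public function run() {
        $file = wfLocalFile( $this->title );
        if ( $file && $file->exists() ) {
            // Assumption: upgradeRow() re-extracts and writes the image row,
            // as the maintenance script does per file.
            $file->upgradeRow();
        }
        return true;
    }
}

// An enqueuing pass could then slowly push one job per file title, e.g.:
// JobQueueGroup::singleton()->push(
//     new RefreshFileMetadataJob( Title::makeTitle( NS_FILE, $imgName ) )
// );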

My 2¢: Instead of using a super-long running maintenance script, consider just having a script to (slowly) queue jobs in the job queue.

+1 to that. @matmarex, puppetization can be as easy as setting up a cron on terbium. I can help with that; we just need to figure out the best way to coordinate several threads if we do that, or how to keep track of the latest successful execution/batches done. You have examples of MediaWiki maintenance scripts here: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/mediawiki/manifests/maintenance/cirrussearch.pp (in this case a cron)

Agreed too, using the job queue seems like a good idea to handle splitting into batches with start/end thresholds, handling failed batches, resuming, etc.
Re: Swift, I'm not terribly concerned about parallelism unless we're talking in the ballpark of e.g. >80-100 req/s. Also, will metadata detection read back the whole file or request a range?

So next steps are to:

  • write a maintenance script, or adapt the current one, to queue jobs to the job queue (blocked on dev)
  • provide rOPS with a crontab to run the script on Terbium (once the previous step is done; blocked on ops)

provide rOPS with a crontab to run the script on Terbium

Note you only need to be blocked here for deployment; all production crons are handled by Puppet. It is really, really easy to send us a CR, and we will help you. For example, this is how misc backups are handled:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/mariadb/backup.pp;d8f2b63993a05c59eb3e2530381039ec886b849e$28

dr0ptp4kt raised the priority of this task from Medium to Needs Triage. May 30 2017, 3:13 PM
dr0ptp4kt moved this task from Next up to Tracking on the Multimedia board.

Not that I understand the details of the discussion... but here I am as an uploader: if I have files affected by this Exif-reading issue and no purging helped, can I give you (some mystical folks with the script) a list of files and hope for the better? Like, I've read a dozen tickets on Exif rotation parameter failures in MediaWiki and the corresponding dozen discussions on the Commons village pump... just where should I go to have a bunch of files fixed, if possible?

According to the comments above, it cannot be done until the maintenance script that does it is rewritten. Unfortunately I don't have free time to do that, and I can't really justify working on it as part of my WMF job (there are a lot of other things to work on that are much closer to my responsibilities).

It may need to be done as part of the Structured Data project. Part of that project could involve copying useful values from Exif fields into a file's data page, and there's not much point in doing that until the corrupt values are fixed.