
Allow upload-by-URL from upload.wikimedia.org
Closed, ResolvedPublic

Description

This might seem ridiculous at first glance, but it would be incredibly useful for writing Commons transfer scripts (similar in concept to CommonsHelper, but calling the API from JavaScript).

It may be as simple as adding upload.wikimedia.org to $wgCopyUploadsDomains in InitialiseSettings.php. However, I don't know if the server configuration will allow this to work straight away.

See also T22512.

Details

Reference
bz42473

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.

Event Timeline


One issue with this is that the proxy server currently handling upload-by-URL requests can't do HTTPS. So we would either need to fix that bug, or give some warning that HTTPS requests will error out.

Is there already a bug "add HTTPS capability to the proxy server"?

If so, please add a dependency.

Could we now enable this feature or is there another blocker?

(In reply to comment #9)

Could we now enable this feature or is there another blocker?

I guess it should be enabled on testwiki and confirmed to work first...

Could someone please go ahead and enable this on testwiki?

(In reply to comment #11)

Could someone please go ahead and enable this on testwiki?

https://gerrit.wikimedia.org/r/47299

Thanks; however, it doesn't seem to work for me. I ran a test from test2wiki (this was easier because my JS code is set up for CORS):

HTTP POST to http://test.wikipedia.org/w/api.php

action=upload
filename=0.28589522187660577.png
text=this is a test file
comment=upload comment
token=<VALID EDIT TOKEN>
url=http%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Ftest2%2F5%2F53%2F0.28589522187660577.png
ignorewarnings=true
format=json
origin=http%3A%2F%2Ftest2.wikipedia.org

This is the response:

{"servedby":"srv193","error":{"code":"http-bad-status","info":"Error fetching file from remote source","0":"403","1":"Forbidden"}}
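For reference, the request above can be reproduced with a short script. The following is a sketch using Python's standard library to build the POST body; the token is a placeholder, and the actual POST (e.g. via `requests.post()`) is left out:

```python
from urllib.parse import urlencode

# Hypothetical helper reproducing the upload-by-URL API request above.
# A real script would first fetch an edit token via the API.
def build_upload_params(filename, source_url, token, origin):
    return {
        "action": "upload",
        "filename": filename,
        "text": "this is a test file",
        "comment": "upload comment",
        "token": token,
        "url": source_url,  # urlencode() percent-encodes this value
        "ignorewarnings": "true",
        "format": "json",
        "origin": origin,
    }

params = build_upload_params(
    "0.28589522187660577.png",
    "http://upload.wikimedia.org/wikipedia/test2/5/53/0.28589522187660577.png",
    "<VALID EDIT TOKEN>",
    "http://test2.wikipedia.org",
)
body = urlencode(params)
# POST this body to http://test.wikipedia.org/w/api.php
```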

(In reply to comment #13)

{"servedby":"srv193","error":{"code":"http-bad-status","info":"Error fetching
file from remote source","0":"403","1":"Forbidden"}}

acl to-wikimedia dst 208.80.152.0/22
acl to-wikimedia dst 91.198.174.0/24
acl to-wikimedia dst 10.0.0.0/16
acl to-wikimedia dst 10.64.0.0/16

# Do not allow any fetches from our own IP ranges

http_access deny to-wikimedia

I'm not sure if the answer is to make squid serve those requests, or add a list of sites that shouldn't use $wgCopyUploadProxy
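For the first option, a minimal sketch of what the squid change might look like, assuming the ACL file quoted above (the ACL name `from-upload-wm` is made up here). Squid evaluates `http_access` rules in order, so the allow rule must precede the existing deny:

```
acl from-upload-wm dstdomain upload.wikimedia.org
http_access allow from-upload-wm
# existing rule: do not allow any fetches from our own IP ranges
http_access deny to-wikimedia
```

Whether carving out this exception is acceptable is a security/ops question, not just a config one.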

Suspect that's a question for ops whether they're ok with letting the proxy read from the cluster..

No, an upload-by-url proxy is the wrong way to do it. If we want to copy files within the upload.wm.org realm, then we should use efficient server-side copies (e.g. Swift's X-Copy-From header), not go through the application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's probably okay if it's a limited use case, but if it's something that's going to get popular, then some other way of multiple reference to the same file should be found, rather than having the same contents copied over and over in the media storage backends.
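For context, a Swift server-side copy is a single PUT of the destination object with an X-Copy-From header naming the source; the object data never leaves the cluster. A sketch of the headers involved (the token and container/object names below are made up):

```python
# Sketch of the headers for a Swift server-side copy. The destination is
# addressed by the PUT URL; X-Copy-From names the source object, and the
# request body is empty because Swift reads the source internally.
def swift_copy_headers(auth_token, src_container, src_object):
    return {
        "X-Auth-Token": auth_token,
        "X-Copy-From": f"/{src_container}/{src_object}",
        "Content-Length": "0",
    }

headers = swift_copy_headers(
    "AUTH_tk123", "wikipedia-test2-local-public", "5/53/example.png")
# PUT https://<swift-endpoint>/v1/<account>/<dest-container>/<dest-object>
# with these headers performs the copy server-side.
```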

Maybe so. However, Commons transfer has always been done by a download-upload process (this is what CommonsHelper on toolserver does, for example). Fixing this bug would allow this tried-and-true approach to continue at a faster rate. Or, we could wait an indefinite amount of time for the file storage backend to be complexified, convoluted, etc...

(In reply to comment #14)

Suspect that's a question for ops whether they're ok with letting the proxy
read from the cluster..

Were ops ever contacted about this?

(In reply to comment #17)

Were ops ever contacted about this?

See answer in comment 15 by Faidon.

(In reply to comment #18)

See answer in comment 15 by Faidon.

My bad, I didn't realise Faidon was part of the ops team.

It seems we've reached a stalemate: ops is refusing to fulfil the request, but no alternative is being suggested.

(In reply to comment #15)

It's probably
okay if it's a limited use case, but if it's something that's going to get
popular

Just so you are aware, Faidon... I daresay hundreds of thousands of files have already been copied from WMF wikis to Commons, leading already to massive duplication on the servers. So this process is already rather popular, and this bug is a way to streamline the process.

To be clear, I would welcome an alternative internal approach, or a rationalisation of the file storage backend, but I don't see those things happening anytime soon. Going ahead and reconfiguring the proxy can be done now (as far as I can tell) and would make the process as it already exists a lot simpler.

[CC'ing Fabrice as this covers Uploading/Multimedia]

  • Bug 62820 has been marked as a duplicate of this bug.

RfC is running at Commons: https://commons.wikimedia.org/wiki/Commons:Requests_for_comment/Allow_transferring_files_from_other_Wikimedia_Wikis_server_side

I didn't conceal that it possibly won't be implemented, *but* I hope that the strong consensus and some of the community's comments will motivate the responsible people to reconsider their position. The way files are currently transferred likely adds more load to the WMF servers than if the proxies allowed fetching from WMF directly.

Status update: On [[Commons:Commons:Requests for comment/Allow transferring files from other Wikimedia Wikis server side]], we have a unanimous consensus.

(In reply to Faidon Liambotis from comment #15)

No, an upload-by-url proxy is the wrong way to do it. If we want to copy
files within the upload.wm.org realm, then we should use efficient
server-side copies (e.g. Swift's X-Copy-From header), not go through the
application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's
probably okay if it's a limited use case, but if it's something that's going
to get popular, then some other way of multiple reference to the same file
should be found, rather than having the same contents copied over and over
in the media storage backends.

Actually, we already do that with manual bots and tools that transfer media from local Wikimedia wikis to Commons once files have been cleared as freely licensed or in the public domain.

So I offer to enable it, as it won't create more copies than we currently do, and then open a new bug to work on a better solution.

tomasz set Security to None.

So I offer to enable it, as it won't create more copies than we currently do, and then open a new bug to work on a better solution.

That would be great, indeed. Can you enable that now?

So I offer to enable it, as it won't create more copies than we currently do, and then open a new bug to work on a better solution.

@Dereckson Just asking about the status :-). I read the discussion again, and it looks like it is possible to enable this now. Or not? Does it need some special config? There is also T78167. Thanks in advance.

It would need the ACLs in the squid config for url-downloader.wikimedia.org to be changed. Someone (@csteipp?) would probably need to assess the security risk of such a change.

Stale for nearly a year. Any news about this?

It sounds like someone needs to create a new ticket out of T44473#1198327, assign it to Ops and Security, and add it as a blocker to this bug.

No, an upload-by-url proxy is the wrong way to do it. If we want to copy files within the upload.wm.org realm, then we should use efficient server-side copies (e.g. Swift's X-Copy-From header), not go through the application servers and upload-by-URL proxies.

Moreover, copying files internally seems wrong to me in general. It's probably okay if it's a limited use case, but if it's something that's going to get popular, then some other way of multiple reference to the same file should be found, rather than having the same contents copied over and over in the media storage backends.

So, assuming that @faidon's comment still stands, what is the way forward here?

How about having a config variable that gives a regex which converts URLs to mwstore:// virtual URLs? Then, if MediaWiki sees a URL matching that regex, instead of doing an HTTP request to copy the file, it would do an internal Swift copy.

Part of the problem is that the Upload class is very rigid and difficult to modify. However, I think this is doable.
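That idea could look roughly like the following sketch. The config name, the regex, and the mwstore path layout here are all assumptions for illustration, not the real MediaWiki configuration:

```python
import re

# Hypothetical mapping from public URLs to mwstore:// virtual URLs.
# Keys are regexes; replacements build the internal storage path.
COPY_UPLOAD_REWRITES = {
    r"^https?://upload\.wikimedia\.org/wikipedia/(\w+)/(.+)$":
        r"mwstore://local-swift/\1-local-public/\2",
}

def to_virtual_url(url):
    """Return an mwstore:// URL if a rewrite matches, else None."""
    for pattern, replacement in COPY_UPLOAD_REWRITES.items():
        if re.match(pattern, url):
            return re.sub(pattern, replacement, url)
    return None  # fall back to a normal HTTP fetch through the proxy

virtual = to_virtual_url(
    "https://upload.wikimedia.org/wikipedia/test2/5/53/x.png")
# A match would trigger an internal Swift copy instead of an HTTP request.
```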

So, assuming that @faidon's comment still stands, what is the way forward here?

How about having a config variable that gives a regex which converts URLs to mwstore:// virtual URLs? Then, if MediaWiki sees a URL matching that regex, instead of doing an HTTP request to copy the file, it would do an internal Swift copy.

@faidon: Any opinion on that approach?

See T140462 and T190716; has this problem been solved in another way?

Looking at the description of this ticket, FileImporter cannot be called from scripts right now; it is just a special page.
I don't think there is a ticket for this.

Any progress on this? Upload by URL has literally been around for years, and being unable to import from *.wikimedia.org is a pretty massive oversight.

edit:
I just tried this on testwiki and it works fine. What exactly is the hold up?

In T44473#5804309, @Urbanecm wrote:

We already have Move-Files-To-Commons, can this task be closed?

In T44473#5804309, @Urbanecm wrote:

We already have Move-Files-To-Commons, can this task be closed?

No. FileImporter/FileExporter unfortunately does not cover every situation where files need to be copied cross-wiki. For example, if the current version of a file was copied by another method to Commons, it is often useful to also copy the previous versions and then restore the newest version. This was previously done using an OgreBot tool, but it's now unmaintained and shut down.

So, I went ahead and quickly tested this:

[urbanecm@deploy1002 ~/tmp]$ export http_proxy='http://url-downloader.codfw.wikimedia.org:8080'
[urbanecm@deploy1002 ~/tmp]$ export https_proxy='http://url-downloader.codfw.wikimedia.org:8080'
[urbanecm@deploy1002 ~/tmp]$ wget 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/VaclavLudmilaTkadlik.jpg/525px-VaclavLudmilaTkadlik.jpg'
--2021-09-13 15:33:04--  https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/VaclavLudmilaTkadlik.jpg/525px-VaclavLudmilaTkadlik.jpg
Resolving url-downloader.codfw.wikimedia.org (url-downloader.codfw.wikimedia.org)... 2620:0:860:2:208:80:153:61, 208.80.153.61
Connecting to url-downloader.codfw.wikimedia.org (url-downloader.codfw.wikimedia.org)|2620:0:860:2:208:80:153:61|:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 53821 (53K) [image/jpeg]
Saving to: ‘525px-VaclavLudmilaTkadlik.jpg’

525px-VaclavLudmilaTkadlik.jp 100%[=================================================>]  52.56K  --.-KB/s    in 0.06s

2021-09-13 15:33:04 (826 KB/s) - ‘525px-VaclavLudmilaTkadlik.jpg’ saved [53821/53821]

[urbanecm@deploy1002 ~/tmp]$

It appears it is possible to access Wikimedia-operated URLs via the url-downloader proxy service. That means it should be possible to enable the functionality.

Questions I have now:

  • Where should this be available? By default, the allowlist is only updated for WM Commons.
  • What would be the usecases for this?
  • Do we need an explicit on-wiki discussion about this, or is including this, e.g. in User-notice (if global), enough?

CopyUploads is only enabled on Commons and the testwikis anyway.

'wgAllowCopyUploads' => [
	'default' => false,
	'testwiki' => true,
	'test2wiki' => true,
	'commonswiki' => true,
	'testcommonswiki' => true,
],

Expanding wgAllowCopyUploads to more wikis would need more discussion and consensus and could be useful, but it's out of scope for this task.

The usecases are copying files to Commons in the situations FileImporter/FileExporter doesn't support:

  • Using an automated tool
  • To copy previous versions of existing files
    • because the original transfer was incomplete
    • or because the original (tagged with KeepLocal) was changed after being imported

wgCopyUploadsDomains currently includes upload.wm.o on testwiki. I would suggest moving it to 'default' to maintain that. Having it on the other testwikis won't be a problem.
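That suggestion would amount to something like the following in InitialiseSettings.php. This is a sketch mirroring the wgAllowCopyUploads block above; the real wgCopyUploadsDomains list on these wikis contains other allowed domains, which would be kept alongside this entry:

```php
'wgCopyUploadsDomains' => [
	'default' => [ 'upload.wikimedia.org' ],
],
```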

Hi, wondering if there is any progress of this task. We've got another community requesting this feature, and they said the test performed on testwiki looks great.

Several sites have already enabled upload-by-URL (T294824, T303577) and no issues seem to have been reported, so I believe this task should not be a blocker for sites willing to enable "upload_by_url" (restricted to uploading from "upload.wikimedia.org" only). Commons is the exception, for the following reason.

Personally, I think the request "Add upload.wikimedia.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons" should not be considered a duplicate of this task: "upload.wikimedia.org" also stores non-free media files for several other projects under fair use, so until an ACL is implemented (to prevent uploading fair-use media files to commonswiki), "upload.wikimedia.org" should not be added to wgCopyUploadsDomains there. If there is no objection, I would reopen tasks T64820, T245053, T271633, T290828 (identical names, meh) and add them as parent tasks of this task.

I'll wait several days to collect advice.

Stang closed this task as Resolved.EditedMay 10 2022, 8:21 PM

Boldly closing this task as resolved: the issues mentioned in this task have been fixed, and several sites have already enabled this feature.