
Shared repositories support for Special:WantedFiles
Open, Medium, Public

Description

Author: Eugene.Zelenko

Description:
It would be great to be able to list all missing files (on both the local wiki
and Commons). It could be used for fixing pages that reference such files.

In any case (if I understand correctly) the list of all images is already
constructed for Special:Mostimages, so only a check for file existence needs to
be added.


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:16 PM
bzimport set Reference to bz6220.
bzimport added a subscriber: Unknown Object (MLST).

robchur wrote:

A special page which loaded a list of all images, then checked for file
existence on each, would be too expensive.

A special page which checks for inline inclusion of images which don't appear to
exist won't work with shared image repositories.

It works fine with shared repositories if there's access to the image table of
the repository - which is needed anyway in order to use it, right? SQL mockup:

SELECT page_namespace, page_title, il_to as img_name
FROM imagelinks
JOIN page ON page_id = il_from
WHERE NOT EXISTS( SELECT * FROM image WHERE img_name = il_to )
AND NOT EXISTS( SELECT * FROM commonswiki.image WHERE img_name = il_to )

Using LEFT JOIN instead of NOT EXISTS would be faster for a full list, but
slower if a limit in the hundreds is used.
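
To make the two formulations concrete, here is a minimal sketch that runs both against toy data using Python's sqlite3. Everything in it is an assumption for illustration: the shared repository is simulated by an in-memory database attached under the name commonswiki, the tables carry only the columns the mock-up touches, and the rows are made up. It only shows that the two queries return the same rows; it says nothing about their relative speed on a real MediaWiki database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the shared repository's tables are reachable as schema "commonswiki".
conn.execute("ATTACH DATABASE ':memory:' AS commonswiki")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
    CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT);
    CREATE TABLE image (img_name TEXT PRIMARY KEY);
    CREATE TABLE commonswiki.image (img_name TEXT PRIMARY KEY);
    INSERT INTO page VALUES (1, 0, 'Main_Page'), (2, 0, 'Sandbox');
    INSERT INTO imagelinks VALUES (1, 'Local.png'), (1, 'Shared.png'), (2, 'Missing.png');
    INSERT INTO image VALUES ('Local.png');
    INSERT INTO commonswiki.image VALUES ('Shared.png');
""")

# The NOT EXISTS form from the mock-up above.
not_exists_rows = sorted(conn.execute("""
    SELECT page_namespace, page_title, il_to AS img_name
    FROM imagelinks
    JOIN page ON page_id = il_from
    WHERE NOT EXISTS (SELECT * FROM image WHERE img_name = il_to)
      AND NOT EXISTS (SELECT * FROM commonswiki.image WHERE img_name = il_to)
""").fetchall())

# The LEFT JOIN form: keep only links that matched neither image table.
left_join_rows = sorted(conn.execute("""
    SELECT page_namespace, page_title, il_to AS img_name
    FROM imagelinks
    JOIN page ON page_id = il_from
    LEFT JOIN image AS l ON l.img_name = il_to
    LEFT JOIN commonswiki.image AS c ON c.img_name = il_to
    WHERE l.img_name IS NULL AND c.img_name IS NULL
""").fetchall())

print(not_exists_rows)
print(left_join_rows)
```

Only the link to Missing.png survives in both result sets, since Local.png exists locally and Shared.png exists in the simulated shared repository.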

robchur wrote:

*** Bug 8683 has been marked as a duplicate of this bug. ***

robchur wrote:

*** Bug 9924 has been marked as a duplicate of this bug. ***

Eugene.Zelenko wrote:

*** Bug 13314 has been marked as a duplicate of this bug. ***

Eugene.Zelenko wrote:

*** This bug has been marked as a duplicate of bug 13702 ***

Not a dupe. The patch in bug 13702 also does not take shared repositories into account.

Broken implementation or not, this is still a dupe to 13702 (or it's a dupe to here, but that bug was marked FIXED :)

*** This bug has been marked as a duplicate of bug 13702 ***

Reopening the bug and making it explicit that it requests support for shared repos.

*** Bug 15688 has been marked as a duplicate of this bug. ***

r77725 at least makes images on shared repos show up as struck-out bluelinks instead of redlinks in the output. It does nothing to fix the actual problem, but at least now you can visually tell the false positives apart from the actually missing files.

*** Bug 27107 has been marked as a duplicate of this bug. ***
*** Bug 28580 has been marked as a duplicate of this bug. ***

This is not an enhancement request; the page as it is just doesn't make any sense.
Example: http://meta.wikimedia.org/wiki/Special:WantedFiles

This page as it is lends itself nicely towards amending it to a "List of files used from remote (shared) repositories" one - see bug 28807

(In reply to comment #12)

r77725 at least makes images on shared repos show up as struck-out bluelinks
instead of redlinks in the output. It does nothing to fix the actual problem,
but at least now you can visually tell the false positives apart from the
actually missing files.

Given that this has been achieved, I wonder whether the bug could be closed by simply adding a filter option to hide the struck-out bluelinks. I have no insight into the code, but it seems the filter could be added with very little performance loss, provided we don't expect the precise number of results and the filter automatically switches to a high browsing interval (2000-5000), and adds an explanation like:

"2000/ACTUAL NUMBER files have been found that are not present in the local wiki. Of these, some or many are available in a shared file repository. These are not shown below. As a result, the number of missing files shown is variable."

This may not be ideal, but it is clearly better than the present consistent, but rather useless behavior. Who is likely to browse through 100s of pages of struck-out blue links to find the truly missing red-links? In fact on metawiki nobody seems to be doing this, so many broken links exist...
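
The proposed filter could look roughly like the sketch below. It is not MediaWiki code: the function name, its parameters and the sample file names are all hypothetical, and the shared-repository lookup is reduced to a plain set membership test. It only illustrates the idea of taking a large batch of locally missing files, hiding those the shared repository has, and reporting the batch size rather than an exact count.

```python
def filtered_wanted_files(candidates, exists_in_shared_repo, batch_size=2000):
    """Return (missing, notice) for one page of results.

    candidates: names of files that do not exist on the local wiki (hypothetical input).
    exists_in_shared_repo: callable that tells whether the shared repo has a file.
    """
    batch = candidates[:batch_size]
    # Hide the entries that would be shown as struck-out bluelinks.
    missing = [name for name in batch if not exists_in_shared_repo(name)]
    notice = (
        f"{len(batch)}/{len(candidates)} files have been found that are not "
        "present in the local wiki. Of these, some or many are available in a "
        "shared file repository. These are not shown below."
    )
    return missing, notice


# Made-up data: two files live on the shared repo, two are really missing.
shared = {"Shared1.png", "Shared2.png"}
candidates = ["Shared1.png", "Missing1.png", "Shared2.png", "Missing2.png"]
missing, notice = filtered_wanted_files(candidates, shared.__contains__, batch_size=3)
print(missing)
print(notice)
```

With a batch size of 3, the batch holds two shared files and one truly missing file, so the number of entries shown per page varies with the data, as the proposed notice warns.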

Mark: you changed priority from Highest to Low without arguing - I think it would be better interaction with the community if you could argue or comment why. In some of your changes that may be evident from previous discussion, here I think not. You may well have much more information than Jan Kucera. Please share it.

(In reply to comment #18)

Mark: you changed priority from Highest to Low without arguing - I think it
would be better interaction with the community if you could argue or comment
why. In some of your changes that may be evident from previous discussion, here
I think not. You may well have much more information than Jan Kucera. Please
share it.

It's not a matter of interaction with the community, you probably missed bug 23816.
As a member of the community who voted for this bug, I'd rather mark it lowest priority or LATER, and disable the special page entirely on WMF wikis (see bug 31491).

a) I certainly miss bug 23816 if nobody is referring to it. Thank you for doing so!

b) There are certainly multiple "communities" with different opinions here.

c) I don't see through this at all. Either the bug should be closed, and a new one opened, or ... The largest Wikipedias may have reached a number of broken file links that makes this functionality less likely to be essential, but smaller Wikis can substantially improve their quality by fixing these errors. I believe many who voted for this bug see this as an important function, even if Nemo_bis does not. It is widely agreed that the present implementation is broken. The bluelink-solution is a very good step, but it is still off-putting to potential users (the first pages are usually all clean). I am opening a new Bug 33446 in an attempt to focus on my proposal for a possible solution that makes it more likely that editors are willing to research and fix broken file links.

I am sure I have overlooked many other things :-)

(In reply to comment #20)

b) There are certainly multiple "communities" with different opinions here.

Questionable.

c) I don't see through this at all. Either the bug should be closed, and a new
one opened, or ...

...we could close this and not open any.

The largest Wikipedias may have reached a number of broken
file links that make this functionality less likely to be essential, but
smaller Wikis can substantially improve their quality by fixing these errors.

I don't see any usefulness in this page on any of the (many) small projects I'm active in, now that there's the tracking category.

I
believe many who voted for this bug see this as an important function, even if
Nemo_bis does not.

Not really, those votes are very old and they all came before the tracking category (mine too).

It is widely agreed that the present implementation is
broken. The bluelink-solution is a very good step, but it is still off-putting
to potential users (the first pages are usually all clean). I am opening a new
Bug 33446 in an attempt to focus on my proposal for a possible solution that
makes it more likely that editors are willing to research and fix broken file
links.

:/

(ignoring what is best ignored:) I disagree that Special:WantedPages is redundant.

However, the basic assumption that it is easier to work by page than by file is, in my opinion, erroneous. A missing file often occurs on dozens of pages. Look at Metawiki (there for multilinguality mostly). In other cases it is because repo files are renamed without keeping redirects. Or, out of old habit, deleted and re-uploaded under a different name.

In cases where a file is missing on dozens of pages, I consider an improved Special:WantedPages desirable.

So I have some ideas how to fix this.

Basically, GlobalUsage stores what images that don't exist locally are in use. So I was thinking a query something like:

select '6' as namespace, gil_to as title, count(*) as value
from globalimagelinks
LEFT JOIN image on gil_to = img_name
where img_name is null and gil_wiki = 'jawikinews'
group by gil_to
order by count(*) DESC;

(Using jawikinews as an example, since it's a smallish size wiki (5480 entries in global usage) thus I can easily test these queries on toolserver). 6 == NS_FILE.
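
As a rough check of what this query returns, it can be run against a toy dataset with Python's sqlite3. This is only a sketch: the tables below are stand-ins with just the columns the query touches (the real globalimagelinks table has more), the image table plays the role of the shared repository's image table, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins: only the columns the query touches.
    CREATE TABLE globalimagelinks (gil_wiki TEXT, gil_page INTEGER, gil_to TEXT);
    CREATE TABLE image (img_name TEXT PRIMARY KEY);  -- files present on the shared repo
    INSERT INTO image VALUES ('Exists.jpg');
    INSERT INTO globalimagelinks VALUES
        ('jawikinews', 1, 'Exists.jpg'),
        ('jawikinews', 1, 'Gone.jpg'),
        ('jawikinews', 2, 'Gone.jpg'),
        ('jawikinews', 3, 'Other.jpg'),
        ('enwiki', 7, 'Gone.jpg');
""")

# Same query as in the comment: links from one wiki with no matching image row.
rows = conn.execute("""
    SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
    FROM globalimagelinks
    LEFT JOIN image ON gil_to = img_name
    WHERE img_name IS NULL AND gil_wiki = 'jawikinews'
    GROUP BY gil_to
    ORDER BY COUNT(*) DESC
""").fetchall()

for namespace, title, value in rows:
    print(namespace, title, value)
```

Exists.jpg is dropped because it has an image row, and the enwiki link to Gone.jpg is not counted because of the gil_wiki filter.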

This seemed to work, however with one problem. Image redirects were still included. I'm not sure if that's a globalusage issue (should the links be to the target image) or if it's intentional behaviour. Filtering those out in the sql gives:

select '6' as namespace, gil_to as title, count(*) as value
from globalimagelinks
LEFT JOIN image on gil_to = img_name
LEFT JOIN page on (gil_to = page_title and page_namespace=6)
where img_name is null and gil_wiki = 'jawikinews'
and (page_is_redirect is null or page_is_redirect = 0)
group by gil_to
order by count(*) DESC;

However, that seems to slow down the query by quite a bit (10 seconds went to 2 minutes). OTOH, the query is slow regardless, and it's going to be cached (I'm not sure how slow is too slow). This still would mess up on some edge cases though, such as if the page is a redirect to a nonexistent file (or even to something not in NS_FILE). [And of course it doesn't address the more general problem of files from foreign repos in general. I'm not sure if the general problem is addressable without a schema change.]
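
The effect of the redirect filter can be seen on toy data. The sketch below uses Python's sqlite3 with invented rows and reduced tables, where image and page stand in for the shared repository's tables. It does not reproduce the timing difference described above, only the difference in results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins with only the columns the query touches.
    CREATE TABLE globalimagelinks (gil_wiki TEXT, gil_to TEXT);
    CREATE TABLE image (img_name TEXT PRIMARY KEY);
    CREATE TABLE page (page_namespace INTEGER, page_title TEXT, page_is_redirect INTEGER);
    INSERT INTO image VALUES ('Real.jpg');
    INSERT INTO page VALUES (6, 'Real.jpg', 0), (6, 'Redir.jpg', 1);
    INSERT INTO globalimagelinks VALUES
        ('jawikinews', 'Real.jpg'),
        ('jawikinews', 'Redir.jpg'),
        ('jawikinews', 'Gone.jpg'),
        ('jawikinews', 'Gone.jpg');
""")

BASE = """
    SELECT '6' AS namespace, gil_to AS title, COUNT(*) AS value
    FROM globalimagelinks
    LEFT JOIN image ON gil_to = img_name
    LEFT JOIN page ON (gil_to = page_title AND page_namespace = 6)
    WHERE img_name IS NULL AND gil_wiki = 'jawikinews' {extra}
    GROUP BY gil_to
    ORDER BY COUNT(*) DESC
"""

# Without the extra condition, the file redirect shows up as "wanted".
with_redirects = conn.execute(BASE.format(extra="")).fetchall()
# With it, titles whose page row is a redirect are dropped.
without_redirects = conn.execute(
    BASE.format(extra="AND (page_is_redirect IS NULL OR page_is_redirect = 0)")
).fetchall()

print(with_redirects)
print(without_redirects)
```

As noted above, this filter does not look at the redirect target, so a redirect that points to a missing file is hidden as well.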

So a possible way forward: add to the GlobalUsage extension a new special page that overrides the built-in special:wantedfiles with the new query. Even with the first query I mentioned, it would cut down on false positives significantly.

So it determines that a remote file exists by checking if it is used anywhere according to global usage. That's a smart idea. Although maybe not semantically correct, it should be good in practice.

If there is a link to an image on a local wiki and the image doesn't exist on the local wiki, it's going into global usage.

One problem though: right now the system works in such a way that if a file exists neither locally nor in the repository, globalusage catches it, not the local wiki (meaning, it's added to GlobalUsage as a redlink, not to the local wiki as a redlink). This means four things.

Three good things, which would hold us back from changing this behaviour

  • This is used to fix things: if a file in the repo was deleted and is restored, the usage in globalusage is still there and can be restored if needed.
  • This is used by gadget authors to track global usage. They put a comment in the script with the [[File:]] syntax and a nonexistent file name in it. Requesting global usage for that name will yield the locations of copies of the script. If this behavior were changed to track usage of existing images only, this one could be worked around by uploading a bogus image to the repo.
  • It acts a little bit like a global WantedFiles, files that are wanted by multiple wikis.

One bad thing that can compromise Bawolff's proposal:

  • Being in globalfileusage does not mean the file exists there...

..

  • Being in globalfileusage does not mean the file exists there..., just like an entry in the local *links table doesn't mean the target exists.

On the other hand, if a connection to globalfileusage is possible, perhaps a connection to the actual repository wiki database is possible as well? One could (ahum) "simply" check the commonswiki database.

[mid air collision]

This is used to fix things if a file in the repo was deleted and is restored,
the usage in globalusage is still there and can be restored if needed

I'm not sure I understand. Do you mean if a file at Commons is deleted then
restored? The outer join on image should take care of that (I'm assuming that
global usage is in the same db as Commons is). If you mean the local file was
deleted/restored, I assumed that would re-add/delete the entries in global
usage. Is that incorrect?

This is used by gadget authors to track global usage. They make a comment in
the script with the [[File:]] syntax in it with a nonexistent file name.
Requesting global usage for it will yield locations of copies of the script.
This one can be worked around by uploading a bogus image to the repo, were this
behavior to change and only tracking usage of existing images.

Hmm, that is an interesting hack. At the end of the day, those would still
appear in special:wantedfiles if it was working properly. I don't really think
we should worry too much about that; moving special:wantedfiles in a somewhat
working direction, even with such links, is an improvement over the current
situation.

It acts a little bit like a global WantedFiles, files that are wanted by
multiple wikis.

Well, in my example query I filter by gil_wiki to do only one wiki. But we could
also make a special:globallywantedfiles which gives the most wanted files across
all the wikis.
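
A cross-wiki variant might simply drop the gil_wiki filter. The sketch below is a guess at what such a query could look like, run against toy data with Python's sqlite3. The page special:globallywantedfiles is only suggested in this thread, and the extra per-wiki count column, the reduced tables and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins with only the columns the query touches.
    CREATE TABLE globalimagelinks (gil_wiki TEXT, gil_to TEXT);
    CREATE TABLE image (img_name TEXT PRIMARY KEY);  -- files present on the shared repo
    INSERT INTO image VALUES ('Exists.jpg');
    INSERT INTO globalimagelinks VALUES
        ('jawikinews', 'Gone.jpg'),
        ('enwiki', 'Gone.jpg'),
        ('enwiki', 'Gone.jpg'),
        ('dewiki', 'Other.jpg'),
        ('dewiki', 'Exists.jpg');
""")

# No gil_wiki filter: count uses of each missing file across all wikis.
globally_wanted = conn.execute("""
    SELECT gil_to AS title,
           COUNT(DISTINCT gil_wiki) AS wikis,
           COUNT(*) AS uses
    FROM globalimagelinks
    LEFT JOIN image ON gil_to = img_name
    WHERE img_name IS NULL
    GROUP BY gil_to
    ORDER BY uses DESC, title
""").fetchall()

for title, wikis, uses in globally_wanted:
    print(title, wikis, uses)
```

Here Gone.jpg is wanted three times across two wikis, which puts it ahead of Other.jpg.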

One bad thing that can compromise Bawolff's proposal:

  • Being in globalfileusage does not mean the file exists there...

I'm not sure I know what you mean. My proposal relies on the fact that there
are entries in globalusage for files that don't exist on the commons repo.

Change 143835 had a related patch set uploaded by Brian Wolff:
Make Special:Wantedfiles not include foreign false positives.

https://gerrit.wikimedia.org/r/143835

*** Bug 69391 has been marked as a duplicate of this bug. ***

Change 143835 had a related patch set uploaded (by Aklapper):
Make Special:Wantedfiles not include foreign false positives.

https://gerrit.wikimedia.org/r/143835

Change 143835 had a related patch set uploaded (by Huji; owner: Brian Wolff):
[mediawiki/extensions/GlobalUsage@master] Make Special:Wantedfiles not include foreign false positives.

https://gerrit.wikimedia.org/r/143835

From a plain user's perspective (I am no database guru), I would suggest restructuring the special page 'Wanted files', ideally as two new special pages, in the following two steps:

  • First, split the shared (embedded, used, or similar) part, called e.g. 'Shared files', from the missing part, called e.g. 'Missing files' (if the original special page is replaced, it can also be called 'Wanted files'). It is probably much better to use two special pages for this, e.g. 'Special:SharedFiles' and 'Special:MissingFiles' (or 'Special:WantedFiles').
  • Second, both parts or special pages should be grouped by target (sub)domain (the shared resource, another wiki, or whatever), and by default they should mainly display, in the head region of the page, the element count for every target (sub)domain contained in the full list of entries. The viewer of the special page thus first gets an overview of the counts for every target (sub)domain (i.e. the 'Commons' target domain and the other target domains), with options to query the entries of a selectable target (sub)domain, using a selectable offset for the requested view of entries.

Many other users would probably also appreciate such an implementation of these two new special pages.

Possibly, to reduce server traffic, the required data (list of entries) for shared storage like 'Commons' could better be provided by the shared repository itself and only requested by the wikis using it, which would cache it for e.g. one day. That clearly requires showing the detailed cache date information in all outputs, i.e. all the counts and the views of entries.
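
The caching idea can be sketched in a few lines of Python. None of this is MediaWiki code: the class names, the fetch callable and the injected clock are hypothetical and exist only to show a list being fetched at most once per day and carrying its cache date with it, so that every output can display that date.

```python
import time
from dataclasses import dataclass


@dataclass
class CachedList:
    entries: list
    fetched_at: float  # cache date, to be shown with every count and view


class DailyCache:
    """Keep a fetched list for max_age_seconds before asking the source again."""

    def __init__(self, fetch, max_age_seconds=86400, clock=time.time):
        self._fetch = fetch            # hypothetical call to the shared repository
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the example can fake time
        self._cached = None

    def get(self):
        now = self._clock()
        if self._cached is None or now - self._cached.fetched_at >= self._max_age:
            self._cached = CachedList(self._fetch(), now)
        return self._cached


# Usage with a fake clock and a fake source.
calls = []
now = [0.0]

def fetch_from_shared_repo():
    calls.append(1)
    return ["A.jpg", "B.jpg"]

cache = DailyCache(fetch_from_shared_repo, clock=lambda: now[0])
first = cache.get()
now[0] = 3600.0            # one hour later: still served from the cache
second = cache.get()
now[0] = 86400.0           # one day later: fetched again
third = cache.get()
print(len(calls), second.fetched_at, third.fetched_at)
```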

Dear developers, please implement these two new special pages! ~ Best regards, Sonne7