Flow not storing database link tables (pagelinks, categorylinks, imagelinks, etc.)
Closed, Resolved (Public)

Details

Reference
bz57512

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:14 AM
bzimport set Reference to bz57512.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

The WMF core features team tracks this bug on Mingle card https://mingle.corp.wikimedia.org/projects/flow/cards/520, but people from the community are welcome to contribute here and in Gerrit.

  • Bug 59756 has been marked as a duplicate of this bug.

Not a precise dupe, but same cause, so I'll repeat my description from the other bug:


Flow does not seem to update links tables (tested with external links, and
imagelinks). This means 2 things:
*Pages that link here, image usage, etc won't work
*When someone edits something else in the wiki (upload new version of a photo,
delete a page [changing it to a redlink]), the page won't have its cache purged
(I'm assuming that flow actually caches stuff, and that it hasn't totally
re-implemented its own version of linkstables or something crazy like that).

Example:
https://www.mediawiki.org/w/index.php?title=Talk:Sandbox&workflow=050dacf79dbcc14b3c5090b11c2789df#flow-post-050dacfb4dca37944c8c842b2b782866
includes a picture but the usage is not recorded at
https://www.mediawiki.org/wiki/File:Horses_of_the_Household_Cavalry_Mounted_Division_Exercising_on_the_Beaches_of_North_Norfolk_MOD_45156138.jpg#filelinks

I believe this issue applies to all *links, so I've adjusted the bug summary. Without objection, this bug may turn into a tracking bug.

*When someone edits something else in the wiki (upload new version of a photo,
delete a page [changing it to a redlink]), the page won't have its cache purged
(I'm assuming that flow actually caches stuff, and that it hasn't totally
re-implemented its own version of linkstables or something crazy like that).

I was asked to elaborate on why I think this is a bad thing (Given that "Flow posts are meant to be accurate snapshots of a moment in time").

Failing to update flow posts when templates change (and other things that trigger RefreshLinksUpdates and HTMLCacheUpdate) would be bad because:

*Users expect that this is how the wiki works. Imo this belief is very very strongly ingrained (heck, it's related to where we derive the word "wiki" from), and users would probably not like it if things didn't update. (Obviously this is anecdotal. I have no data to back that assertion up, but I feel it to be true quite strongly)
*In cases of vandalism. People using templates in flow posts while a template is vandalized won't have their edits fixed when the template is fixed
*Things being deleted - Person writes post, person b deletes template a month later, person re-edits their post a year later, suddenly template doesn't work = confusing. This is especially bad when combined with usage lists not working. Also without templates updated, notices that a template is up for deletion wouldn't show up on flow posts where the template is used.
*Redlinks. People expect the redness of links to reflect reality instantly
*Images (The following may change in some mythical future, how we currently handle images is not ideal): Person uses an image. A couple days later someone uploads a new version of the image that has different dimensions. Flow post would have the image squished (As the new image would get fitted to the old dimensions)
*There's probably other things I haven't thought of.
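
To make the image case concrete, here is a minimal Python sketch of the arithmetic. It assumes the stored HTML keeps the width and height computed for the old upload and that the new file is stretched into that box; the function name and the sample dimensions are illustrative and are not taken from Flow or MediaWiki.

```python
def horizontal_stretch(old_size, new_size):
    """Return how much wider than natural the new image appears
    when it is drawn in the box computed for the old image."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    cached_ratio = old_w / old_h   # aspect ratio baked into the stored HTML
    actual_ratio = new_w / new_h   # aspect ratio of the newly uploaded file
    return cached_ratio / actual_ratio


# Old upload was 800x600, new upload is a 600x600 crop.
print(round(horizontal_stretch((800, 600), (600, 600)), 2))  # 1.33
```

A result of 1.0 means no distortion, which is what a pure resize with an unchanged aspect ratio gives; any other value means the cached dimensions no longer match the file.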

  • Bug 58009 has been marked as a duplicate of this bug.

Flow team: this is "High" here and there's no information available in mingle.

What's the status? Brian's comment (comment 5) is pretty clear about why this is bad.

Please give an update on this before the deploy to enwiki.

Some replies to Brian's comments:

*Users expect that this is how the wiki works. Imo this belief is very very
strongly ingrained (heck, it's related to where we derive the word "wiki" from),
and users would probably not like it if things didn't update. (Obviously this
is anecdotal. I have no data to back that assertion up, but I feel it to be
true quite strongly)

Well, Flow isn't how a wiki works ;) There's never been a structured discussion system like this before on any Wikimedia project; it's fundamentally different by design. The "but this isn't how we've always done it!" argument could certainly be valid in instances when there's legitimate user confusion and/or broken workflows -- however, the only way we'll know if this is the case is when actual users begin to use Flow. That's why we're deploying to a few pages where people can test it out and tell us if this is important to fix asap.

*In cases of vandalism. People using templates in flow posts while a template is
vandalized, won't have their edits fixed when the template is fixed

But vandalism works the other way around, too. Take the Meepsheep vandal of.. 2011, I think? The guy who found a bunch of unprotected transcluded templates that were used on thousands of really high-visibility articles and inserted pictures of swastikas and dead babies in them. I'd argue that's a much more common and much more serious form of vandalism that the current mediawiki setup does not adequately address.

*Things being deleted - Person writes post, person b deletes template a month
later, person re-edits their post a year later, suddenly template doesn't work
= confusing. This is especially bad when combined with usage lists not working.
Also without templates updated, notices that a template is up for deletion
wouldn't show up on flow posts where the template is used.

Yes, it's entirely possible that this could be confusing, but again, it's also extremely confusing when a template is used for years, is edited, and suddenly looks completely out of context in places where the older version still lives. For example, Template:Opentask on enwiki was, for a long time, just a list of, well, open tasks. Then it got too long and crazy, so me and a few other Wikipedians pruned it down, changed the formatting, and tested it to see if it was more efficient at getting people to actually do the tasks via the Community portal. But now it looks really weird transcluded onto user pages, like this one: https://en.wikipedia.org/wiki/User:Dgrant.

Do we want to live in a world where we have to consult with every single user (many probably no longer active) before we make changes to widely-used templates, because otherwise we'll break someone's workflow? I'm not sure, but I'd like to propose we try a different way and see if that works better. Again, having some real users try Flow out and see if it actually makes sense to preserve the historical state of templates in discussions or update them would be tremendously valuable, instead of just making assumptions one way or the other.

*Redlinks. People expect the redness of links to reflect reality instantly

That's a good point; I'm not sure how important it will be in talk namespaces, though (see also the comment below).

*Images (The following may change in some mythical future, how we currently
handle images is not ideal): Person uses an image. A couple days later someone
uploads a new version of the image that has different dimensions. Flow post
would have the image squished (As the new image would get fitted to the old
dimensions)

I'd like to see some stats on how often this actually happens in talk namespaces. I can see how this would be an issue in articles even if it was extremely rare, but in discussion?

*There's probably other things I haven't thought of.

Sure, there's probably other things I haven't thought of, too :) The point is that this is one of many instances where we're not entirely sure what the "right" solution is, and it would help tremendously to have the end-users of our software test it out and give us a better sense of what we should do.

*Redlinks. People expect the redness of links to reflect reality instantly

I'd note this hasn't been happening for...quite a while (not sure if it's job queue slowness or something else).

(In reply to comment #8)

however, the only way we'll know if this is the case is when actual users
begin to use Flow. That's why we're deploying to a few pages where people can
test it out and tell us if this is important to fix asap.

Yes, they'll tell you. While coming after you with torches and pitchforks, most likely. Especially so soon after the same thing happened with VE.

I don't think that having Flow do all sorts of non-discussion-related things differently just because "it's Flow!" is a good justification.

But vandalism works the other way around, too. Take the Meepsheep vandal of..
2011, I think? The guy who found a bunch of unprotected transcluded templates
that were used on thousands of really high-visibility articles and inserted
pictures of swastikas and dead babies in them. I'd argue that's a much more
common and much more serious form of vandalism that the current mediawiki setup
does not adequately address.

Having templates in articles work that way would be antithetical to the entire purpose of templates. This very issue has been discussed in depth very recently on enwiki, and various more-well-thought-out proposals than "break templates completely" were discussed and rejected.

And I strongly suspect that having templates weirdly work differently in Flow than in articles is going to be confusing for users.

And if some vandal does manage to do this, will it even be *possible* to find the Flow posts that are affected? Since there are no links table entries, it seems unlikely unless we can get lucky in being able to find them with the search engine without false positives.

but again, it's also extremely confusing when a template is used for years,
is edited, and suddenly looks completely out of context in places where the
older version still lives.

OTOH, with this bug it's impossible for anyone to find how a template is used in discussions in the first place. Or to find the transclusions in order to fix them if a template (or a template redirect, or a shortcut redirect, etc) is repurposed. It also breaks any workflow that depends on finding transclusions of a template in talk pages. And if categorylinks is included in this bug, it'll also break workflows that depend on adding talk pages to maintenance categories.

This reminds me of the problem with bug 12974: bug 529 was easy enough to work around, but fixing it caused problems that are impossible to work around. "Templates that shouldn't change" is easy to work around with subst, but making all templates pseudo-substed in Flow sounds like it'll cause problems that are impossible to work around.

The non-updating of link tables also makes it impossible to search for external links (e.g. to find past discussions of the reliability of a source), or to find discussions that link to some page, or to find discussions that use an image, and so on. Except by hoping that some search engine invocation can find it without too many false positives to wade through.
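
As an illustration of what is lost, here is a toy reverse lookup in Python using an in-memory SQLite table. The table and column names are invented for this sketch and are far simpler than MediaWiki's real link tables; the point is only that a page which never writes its rows cannot be found by this kind of query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified link table: one row per (source page, target).
conn.execute("CREATE TABLE links (from_page TEXT, to_target TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("Main Page", "File:Example.jpg"),
        ("Talk:Sandbox", "File:Example.jpg"),
        ("Talk:Sandbox", "Template:Test1234"),
    ],
)


def what_links_here(target):
    """Return the pages recorded as linking to (or using) the target."""
    rows = conn.execute(
        "SELECT from_page FROM links WHERE to_target = ? ORDER BY from_page",
        (target,),
    )
    return [row[0] for row in rows]


print(what_links_here("File:Example.jpg"))   # ['Main Page', 'Talk:Sandbox']
print(what_links_here("Template:Unlisted"))  # []
```

The second call shows the failure mode discussed here: a target whose uses were never recorded looks unused, even if some discussion actually includes it.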

[mid-air collision]

Well, Flow isn't how a wiki works ;) There's never been a structured discussion
system like this before on any Wikimedia project; it's fundamentally different
by design. The "but this isn't how we've always done it!" argument could
certainly be valid in instances when there's legitimate user confusion and/or
broken workflows -- however, the only way we'll know if this is the case is
when actual users begin to use Flow. That's why we're deploying to a few pages
where people can test it out and tell us if this is important to fix asap.

"Different things are going to be different, or they wouldn't be a different thing" is a fair point. I do believe the social inertia around this would be significant enough to cause backlash; however, when it comes right down to it, that's just my opinion. I do think people expect Flow to behave in the wiki way in a wide variety of situations, even though it is different from a wiki in many ways.

But vandalism works the other way around, too. Take the Meepsheep vandal of..
2011, I think? The guy who found a bunch of unprotected transcluded templates
that were used on thousands of really high-visibility articles and inserted
pictures of swastikas and dead babies in them. I'd argue that's a much more
common and much more serious form of vandalism that the current mediawiki setup
does not adequately address.

Indeed it does; the difference is that it's very easy to undo that type of vandalism: just revert the edits. There's even a nice list of all edits to revert. In the Flow situation the vandalism would be hard to undo, with reverts being ineffectual. There is not even a list of affected pages. Thus while the number of shock images would be much lower, they would last much longer. I believe it is significantly better to have all the templates be replaced with bad images for 30 seconds (yeah yeah, job queue delay makes that probably a bit longer), than it would be for a couple of template instances to be replaced for 6 months without anybody realizing the presence of hidden vandalism until someone stumbles upon it.

Do we want to live in a world where we have to consult with every single user
(many probably no longer active) before we make changes to widely-used
templates, because otherwise we'll break someone's workflow? I'm not sure, but
I'd like to propose we try a different way and see if that works better.

The same thing could be said about code deprecations, etc. When it comes down to it, changing the semantics of public symbols usually causes problems, and there's no way around that, but sometimes it is necessary. I don't think {{opentasks}} is a good example to use here, since without instant updating, that template is entirely useless once the tasks become closed. If people really wanted non-changing templates, they would probably use the already existing subst: keyword. So it all comes down to user expectations, and while I have a guess what these expectations are, I don't have a crystal ball that would tell me if I'm right.

I'd like to see some stats on how often this actually happens in talk
namespaces. I can see how this would be an issue in articles even if it was
extremely rare, but in discussion?

Ask and you shall receive. There have been 456 such image changes to images used in any talk namespace since Jan 1, 2014 (so in 27 days). If you limit that to just the main talk namespace (i.e. namespace 1), it falls to 189 image changes. If you further limit that to main talk on enwiki only, then you have 86 such image changes:

MariaDB [commonswiki_p]> select count( distinct img_name) from image inner join oldimage on oi_name = img_name inner join globalimagelinks on gil_to = img_name and (gil_page_namespace_id & 1) where (oi_width != img_width or oi_height != img_height) and img_timestamp > '20140101000000' and oi_archive_name like '2014%';
+---------------------------+
| count( distinct img_name) |
+---------------------------+
|                       456 |
+---------------------------+
1 row in set (2.26 sec)

MariaDB [commonswiki_p]> select count( distinct img_name) from image inner join oldimage on oi_name = img_name inner join globalimagelinks on gil_to = img_name and gil_page_namespace_id = 1 where (oi_width != img_width or oi_height != img_height) and img_timestamp > '20140101000000' and oi_archive_name like '2014%';
+---------------------------+
| count( distinct img_name) |
+---------------------------+
|                       189 |
+---------------------------+
1 row in set (31.37 sec)

Limiting to enwiki:

MariaDB [commonswiki_p]> select count( distinct img_name) from image inner join oldimage on oi_name = img_name inner join globalimagelinks on gil_to = img_name and gil_page_namespace_id = 1 and gil_wiki = 'enwiki' where (oi_width != img_width or oi_height != img_height) and img_timestamp > '20140101000000' and oi_archive_name like '2014%';
+---------------------------+
| count( distinct img_name) |
+---------------------------+
|                        86 |
+---------------------------+
1 row in set (2.57 sec)
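
The first query selects every talk namespace with `gil_page_namespace_id & 1`, which relies on the MediaWiki convention that talk namespaces have odd ids (1 = Talk, 3 = User talk, and so on). A small Python sketch of the same test follows; the helper name is made up here, and the `> 0` guard is an addition for safety, since a bare `& 1` would also match -1 (Special).

```python
def is_talk_namespace(ns_id: int) -> bool:
    """Odd, positive namespace ids are talk namespaces by convention."""
    return ns_id > 0 and (ns_id & 1) == 1


# A few standard namespace ids and their names.
namespaces = {
    0: "(main)",
    1: "Talk",
    2: "User",
    3: "User talk",
    6: "File",
    7: "File talk",
    10: "Template",
    11: "Template talk",
}

talk = [name for ns_id, name in namespaces.items() if is_talk_namespace(ns_id)]
print(talk)  # ['Talk', 'User talk', 'File talk', 'Template talk']
```

The second and third queries replace the bit test with `gil_page_namespace_id = 1`, which is why their counts cover only the main Talk namespace.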

That's a good point; I'm not sure how important it will be in talk namespaces,
though (see also the comment below).

In modern wikipedia, it seems like red links are primarily used on talk namespaces and less commonly on article namespace.


Additional issue I failed to mention: Not working with globalusage means that commons admins will not know the image is in use, which means that if there's a problem with the image it's much more likely to get deleted. In the ages before globalusage, this sort of thing caused a lot of tension between commons and sister projects.

Sure, there's probably other things I haven't thought of, too :) The point is
that this is one of many instances where we're not entirely sure what the
"right" solution is, and it would help tremendously to have the end-users of
our software test it out and give us a better sense of what we should do.

Yes of course, the future is unknowable, but that doesn't mean we shouldn't address the issues we do know about first.

unless we can get lucky in being able to find them with the
search engine without false positives

You'd have to get really lucky, since from what I understand, the search engine does not index flow posts.

The non-updating of link tables also makes it impossible to search for external
links (e.g. to find past discussions of the reliability of a source), or to
find discussions that link to some page, or to find discussions that use an
image, and so on.

Or to find everywhere the spambots added linkspam to a specific website

(In reply to comment #12)

unless we can get lucky in being able to find them with the
search engine without false positives

You'd have to get really lucky, since from what I understand, the search
engine does not index flow posts.

Sounds like that should be another blocker for bug 60178. Is there a bug for that yet?

So, here's my perspective:

I'm going to go through each table's behaviour for things other than WhatLinksHere, integrating with which would take some substantial engineering and I'd be happy to leave it until later, Maryana willing.

pagelinks: Are used to update red/blue link status. Flow actually uses its own red-linker class to set the red/blue-link status of individual links in posts and headers, so this is moot.

templatelinks: Used to update pages when templates change. This can be accomplished another way – I'm currently looking at how we might work with Parsoid here. In particular, we store fully rendered HTML, and I'm not sure how comfortable I am rewriting history when a template is changed. We may want to do it for headers only and not for posts.

categorylinks: I am happy to say that posts and headers cannot be in categories. Maybe later on we can allow talk pages to be added to categories by adding tags to the header or by another method.

imagelinks: Used to update page HTML when images are updated. However, the HTML only changes if either no size is specified in wikitext and the image is resized, or if the image is deleted. In both cases, I think it's an issue to be addressed soon, but not urgently and not so important to block short testing deployments.

Filed the search thing as bug 60493.

(In reply to comment #14)

imagelinks: Used to update page HTML when images are updated. However, the HTML
only changes if either no size is specified in wikitext and the image is
resized, or if the image is deleted. In both cases, I think it's an issue to be
addressed soon, but not urgently and not so important to block short testing
deployments.

It also changes if the aspect ratio of the source image is changed (regardless of whether a size is specified). It changes if someone moves the image (that's not critical, though).

Just as important as imagelinks is globalimagelinks, as it makes cross wiki image issues difficult to deal with when missing.

(In reply to comment #17)

I'm speccing out potential integration with link tables here:
https://www.mediawiki.org/wiki/Flow/Link_table_spec

Please let me know if my requirements + implementation look sane, Brian / Brad.

Those look reasonable to me.

"Red link / Blue link behaviour works correctly out of the box, because Flow resolves these at display time." might have scaling issues depending on how it's implemented (that is, if it involves doing a db query on every page view), but that's more of a long-term concern.

(In reply to comment #18)

Please let me know if my requirements + implementation look sane [...]

Emulating MediaWiki behavior doesn't seem ideal to me. It feels like reinventing/rewriting MediaWiki as a MediaWiki extension. Ensuring that *links tables are up-to-date and accurate is already difficult as it is.

As I understand it, Flow wikitext is currently not run through the PHP parser. I don't see how you can reasonably escape this step.

Copying Tim and Gabriel in case they'd like to weigh in.

In a way Flow is ahead of us here. We'll have to implement link table updates for HTML-only wikis in the storage backend. For this and other reasons like HTML format updates it probably makes sense to share the same backend infrastructure for this in the longer term. In the short term Flow might have to do its own thing.

It would be helpful if we could coordinate more on things like red links, as that functionality will also be needed for normal page views / VE edits.

(In reply to comment #20)

Emulating MediaWiki behavior doesn't seem ideal to me. It feels like
reinventing/rewriting MediaWiki as a MediaWiki extension.

It's too bad we didn't add the semantic information into the existing PHP parser and make it runnable as a service, instead of trying to rewrite the whole thing in a completely different language. But that ship has already sailed.

(In reply to comment #18)

(In reply to comment #17)

I'm speccing out potential integration with link tables here:
https://www.mediawiki.org/wiki/Flow/Link_table_spec

Please let me know if my requirements + implementation look sane, Brian / Brad.

I don't much like the parallel Flow-specific links tables, but I suppose that's necessary because Flow makes talk pages so complicated that trying to find the usage of the link/file/template/etc from the talk page itself won't work too well.

Don't forget the list of pages using the file at the bottom of File-namespace pages. And there are also a number of API modules that query the existing links tables; there should at least be some way to (1) identify that a page *is* Flow and (2) search these flow-specific links tables to find the actual "thing(s)" doing the linking in the particular page.

Cross-reference: [[Template:Flow-enabled]]

  • Bug 57991 has been marked as a duplicate of this bug.

About the image links:

Free images are deleted at enwiki (and other wikis) and copied to commons, often with new names. With no links, it's not possible to rename the links to the files on Flow pages.

But more important:
Non-free images (fair use) may not be used on talk pages, and it's important to track where the non-free images are used (so they can be removed from talk (Flow) pages) - it's all about copyright...

Christian

Gerrit change 117231 has a comment related to this bug, that could use input from the Flow team.

Change 110090 had a related patch set uploaded by Spage:
Extract wiki and external links, file and template usages from text.

https://gerrit.wikimedia.org/r/110090

Change 110090 merged by jenkins-bot:
Extract wiki and external links, file and template usages from text.

https://gerrit.wikimedia.org/r/110090

There's been a couple (few?) patches reviewed and merged related to this: what's the status?

I think this was fixed by Andrew Garrett's work. Flow detects references in items, stores them in new flow_wiki_ref and flow_ext_ref tables, and appends to the standard MediaWiki link tables. So Special:WhatLinksHere works for pages, links between Flow boards, images, and templates; and the "File usage" section of a File: page shows the Flow boards using the image.

Gerrit 115860 adds a WhatLinksHereProps hook in MW 1.24 so that Flow can add "from the _header_; from a _post_" links to the line in WhatLinksHere.

(I added this to mw:Flow/Architecture#Link_handling, corrections are welcome there.)

Someone should check to make sure all link tables are accounted for, either being updated or documented as to why they are not supported. I'd suggest reopening this bug until that is done so it doesn't get lost, but I'm not going to do that myself (yet, anyway).

I see pagelinks, imagelinks, and templatelinks mentioned; how about categorylinks, externallinks, iwlinks, langlinks, redirect (I suspect this is "it's not possible to have a Flow object be a redirect"), and globalimagelinks? Did I miss any? e.g. do Wikidata or ProofreadPage or other extensions add any links tables?

From testing in this Topic: https://www.mediawiki.org/w/index.php?title=Talk:Sandbox&workflow=rylnpvcpshj9n30s

It appears that the Special:WhatLinksHere table isn't getting updated until I reply to, or edit, a post.
I.e. If I link or transclude [[mw:template:test1234]], then it will not appear in [[mw:Special:WhatLinksHere/Template:Test1234]], until I reply to, or edit, the post that contains it.

(Or, possibly it will automatically update after a delay, but it didn't within my 15 minute pause-test)

(In reply to Quiddity from comment #32)

It appears that the Special:WhatLinksHere table isn't getting updated until
I reply to, or edit, a post.

I filed T70343 for the delay (or whatever it is).

@Spage said:

I think this was fixed by Andrew Garrett's work. Flow detects references in items, stores them in new flow_wiki_ref and flow_ext_ref
tables, and appends to the standard MediaWiki link tables. So Special:WhatLinksHere works for pages, links between Flow boards,
images, and templates; and the "File usage" section of a File: page shows the Flow boards using the image.

So, that means that Flow re-implements the link tables? That's not going to work with tracking tables used by extensions, such as the entity usage tracking table for Wikidata (T49288), right?

If Flow has a ParserOutput or Content object somewhere, it could just execute the associated DataUpdates, and everything should Just Work (tm):

		$updates = $parserOutput->getSecondaryDataUpdates();
		DataUpdate::runUpdates( $updates );

Flow does populate the core tables via SecondaryDataUpdates. The reason Flow has its own tables in addition is so we can calculate just the changes to a particular piece of content and load the existing references for all the unchanged content, instead of running the full process against all the unchanged data.
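
The "calculate just the changes" idea can be sketched as a set difference between the references already stored for a piece of content and those found in its new revision. This Python sketch is only an illustration under that assumption: the `Ref` type, the `diff_references` helper, and the sample targets are hypothetical and do not reflect Flow's actual PHP classes or table layout.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    kind: str    # illustrative labels such as "link", "file", "template"
    target: str


def diff_references(stored, current):
    """Return (to_add, to_remove) so only changed rows need writing."""
    stored, current = set(stored), set(current)
    return current - stored, stored - current


stored = [Ref("link", "Main Page"), Ref("file", "Example.jpg")]
current = [Ref("link", "Main Page"), Ref("template", "Test1234")]

to_add, to_remove = diff_references(stored, current)
print(to_add)     # {Ref(kind='template', target='Test1234')}
print(to_remove)  # {Ref(kind='file', target='Example.jpg')}
```

The unchanged reference ("Main Page") appears in neither set, so nothing is rewritten for it.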

He7d3r set Security to None.
Mattflaschen-WMF claimed this task.

This was fixed a long time ago. T59991 ("Changes in templates are not immediately reflected in Flow posts which transclude those templates") is not the same thing.

If there are related (but distinct) bugs, please file a new task. Edits like this that completely change the definition of a task are not helpful.

Ok, but I fail to see how that detailed description is different from the stub which was (and now is again) in the description of the task.

Mattflaschen-WMF claimed this task.

For the first one, they're both there (the image include, and the normal link). Maybe it wasn't apparent since it's on the same line:

File_link_and_regular_link_on_same_line.png (113×550 px, 14 KB)

It doesn't look like this is affected by the job queue.

For the Labs one, there is a bug that it is not recording red links (links to non-existent pages). Thanks for reporting this. Filed as T96220: Flow does not record red links (links to non-existent pages).

In T59512#1211298, @Mattflaschen wrote:

For the first one, they're both there (the image include, and the normal link). Maybe it wasn't apparent since it's on the same line:

File_link_and_regular_link_on_same_line.png (113×550 px, 14 KB)

I'm 100% sure it was empty when I looked at that list. So something is wrong that makes the list take more than 3 days to update.

Filed as T96810: Apparent lag in picking up links (to file pages).