
Paragraph splits and moves not identified in diffs
Closed, Resolved | Public | Feature

Assigned To
Authored By
bzimport
Feb 24 2006, 2:05 AM
Tokens
"The World Burns" token, awarded by czar. "Love" token, awarded by Liuxinyu970226. "Mountain of Wealth" token, awarded by Gryllida. "Manufacturing Defect?" token, awarded by Zazpot. "Love" token, awarded by Gestrid. "Love" token, awarded by Nux. "Love" token, awarded by Agent007bond.

Description

Author: circeus

Description:
When you break a paragraph, the entire text is marked as deleted and added,
creating an annoying and needless big chunk of red text, while only the added line
breaks should really be indicated.
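
To illustrate the behaviour described above, here is a minimal sketch using Python's standard `difflib` module as a stand-in. It is not the MediaWiki diff engine, and the helper names `changed_items` and `tokenize` are made up for this example. Comparing whole lines reports the entire paragraph as removed and re-added, while comparing word tokens (with each line break kept as its own token) reports only the inserted break.

```python
import difflib
import re


def changed_items(old_items, new_items):
    """Return (removed, added) items according to difflib's opcodes."""
    matcher = difflib.SequenceMatcher(a=old_items, b=new_items, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old_items[i1:i2])
            added.extend(new_items[j1:j2])
    return removed, added


def tokenize(text):
    """Split text into words, keeping each line break as its own token."""
    return re.findall(r"\n|[^\s]+", text)


# Hypothetical sample text: one paragraph split into two.
old_text = "First sentence here. Second sentence here."
new_text = "First sentence here.\nSecond sentence here."

# Line-level comparison: no line is identical, so everything is "changed".
line_removed, line_added = changed_items(old_text.split("\n"), new_text.split("\n"))

# Token-level comparison: only the new line break shows up as a change.
token_removed, token_added = changed_items(tokenize(old_text), tokenize(new_text))
```

In this sketch `line_removed` holds the whole old paragraph and `line_added` holds both new lines, whereas `token_added` contains only the line break.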


Version: unspecified
Severity: enhancement/bug
See also: T15462: Enhance line matching in diffs

Details

Reference
bz5072

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 9:08 PM
bzimport set Reference to bz5072.
bzimport added a subscriber: Unknown Object (MLST).

scmcc wrote:

I'm adding to this bug, since it seems to be part of a larger problem: the article
history diffs fail to track paragraph moves as well as splits. Since diffs are
intended to help editors track changes, this failure represents a minor loss of
function, so I'm upgrading the severity from trivial to minor.

scmcc wrote:

Further clarification:

The essence of the problem is that history/diffs loses track of moved paragraphs, and hence
does not compare within the moved paragraph to identify changes that may have been made
in the same edit as the move.

When a new paragraph is inserted, history/diffs often compares the inserted paragraph
with an old following paragraph, rather than comparing the old paragraph with the
(often unchanged) version that is now in a new location.

What we seem to need is a robust difference detector that can track moves of paragraphs
or even lines, and then identify smaller changes in those moved segments. Such tools must
exist: they've been in word processors for years, and in Wiki's editing-intensive
environment they're long overdue.
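
A rough sketch of that two-step idea, again using Python's `difflib` purely for illustration: first find paragraphs that a paragraph-level diff reports as removed or added, then pair each removed paragraph with the most similar added one and diff the pair word by word. The function name `find_moves` and the `threshold` value are assumptions made for this example, not part of any existing tool.

```python
import difflib


def find_moves(old_paras, new_paras, threshold=0.6):
    """Pair removed and added paragraphs that look like moves.

    Returns a list of (old_index, new_index, word_changes) tuples, where
    word_changes lists the (tag, old_words, new_words) differences found
    inside the moved paragraph. `threshold` is an arbitrary similarity
    cut-off chosen for illustration.
    """
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(range(i1, i2))
            added.extend(range(j1, j2))

    moves = []
    for i in removed:
        old_words = old_paras[i].split()
        best_j, best_ratio, best_matcher = None, threshold, None
        for j in added:
            candidate = difflib.SequenceMatcher(
                a=old_words, b=new_paras[j].split(), autojunk=False
            )
            ratio = candidate.ratio()
            if ratio > best_ratio:
                best_j, best_ratio, best_matcher = j, ratio, candidate
        if best_j is None:
            continue  # nothing similar enough: a genuine deletion
        added.remove(best_j)
        new_words = new_paras[best_j].split()
        word_changes = [
            (tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in best_matcher.get_opcodes()
            if tag != "equal"
        ]
        moves.append((i, best_j, word_changes))
    return moves
```

For example, if the first of three paragraphs is moved to the end and one word in it is changed, `find_moves` returns a single entry pointing from old index 0 to new index 2, with just that one word replacement listed.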

Thanks, Steve

scmcc wrote:

It looks like installing User:Cacycle/wikEdDiff solves this problem. It tracks changes through paragraph breaks and (apparently) catches moved sections of text as well. I wouldn't quite call this bug "fixed" until this is made a standard part of the Wikipedia difference display, but having it available is a big step in the right direction.

Steve

wikEdDiff is available as a gadget now: [[User:Cacycle/wikEd_help#wikEd_control_buttons]]

I think this can be closed.

(In reply to comment #3)

I wouldn't quite call
this bug "fixed" until this is made a standard part of the Wikipedia difference
display, but having it available is a big step in the right direction.

I agree on this.

(In reply to comment #5)

(In reply to comment #3)

I wouldn't quite call
this bug "fixed" until this is made a standard part of the Wikipedia difference
display, but having it available is a big step in the right direction.

I agree on this.

Should we make it possible to replace the built-in server-side wikidiff tool with this client-side one? Or maybe encourage the developer to have his JS output replace the server-side generated output?

The wikEdDiff author seems interested in getting a PHP-based solution to replace this JS-based one, but I'm not sure if that would be better (as far as server impact, which is what Wikimedia ops would be concerned with) than the C++-based one.

Reopening per comment #3 and comment #5 (and I also agree).

And replying to comment #6: From the user perspective, I think either option would be an improvement to the current situation. Should the ops decide that a server-side implementation is unfeasible (without performance degradation), then the client-side gadget is a reasonable (but not as good) replacement. In any case, the client-side gadget would need to be bundled with MediaWiki, so that this bug can IMO be considered fixed.

There's already some code which could be used: see for instance this tool (screenshot at p. 8): http://www.fst.umac.mo/en/staff/documents/robertb/WikiSym2010-PeterRobert-Final.pdf

Code is published at http://sourceforge.net/projects/weha/ . I was told in August 2010 that they were going to release it as an extension once ready, but perhaps someone could already work on it.

Agent007bond updated the task description.
Agent007bond set Security to None.

As discussed here, another example of this bug is at https://en.wikipedia.org/w/index.php?title=MalwareTech&diff=782456060&oldid=782449642 . In that diff, both the left and right columns have hunks that begin "Following his work on the WannaCry ransomware attack in 2017" and that are almost identical (edit distance: 3) but that have been aligned with other hunks instead of with each other, making it very hard to spot what has changed between them. (To spare you searching, it is "he's" to "he has".)

I expect the solution to this bug will involve matching paragraphs according to minimum edit distance, with a fallback algorithm in case two or more paragraphs are equal edit distances away.
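
A minimal sketch of that idea in Python, under stated assumptions: `edit_distance` is the textbook Levenshtein distance, and `match_paragraphs` greedily pairs each old paragraph with the closest unused new paragraph. The tie-break used here (prefer the candidate nearest in position, then the lowest index) is only one hypothetical choice of fallback; the comment above does not specify one. A greedy pass is also not guaranteed to find the globally best pairing.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def match_paragraphs(old_paras, new_paras):
    """Greedily map old paragraph indices to new paragraph indices.

    Each old paragraph takes the unused new paragraph with the smallest
    edit distance. Ties fall back to positional closeness, then to the
    lowest index (an assumed fallback, chosen for illustration).
    """
    unused = set(range(len(new_paras)))
    pairs = {}
    for i, old in enumerate(old_paras):
        if not unused:
            break
        j = min(
            unused,
            key=lambda k: (edit_distance(old, new_paras[k]), abs(k - i), k),
        )
        pairs[i] = j
        unused.discard(j)
    return pairs
```

With this definition, `edit_distance("he's", "he has")` evaluates to 3, consistent with the distance quoted for the MalwareTech example.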

Several people above have suggested that a client-side solution would be acceptable. I disagree.

The diff tool is crucial for checking edits for vandalism, etc, and must be usable by all editors. Not all editors enable JavaScript. Therefore, a client-side-only solution would be inadequate.

This bug really needs fixing on the server side :)

Gryllida rescinded a token.
Gryllida awarded a token.
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:01 AM
Aklapper removed a subscriber: wikibugs-l-list.

I'm wondering if this can be resolved now. The task description is rather vague and lacks an example diff.

As discussed here, another example of this bug is at https://en.wikipedia.org/w/index.php?title=MalwareTech&diff=782456060&oldid=782449642 . In that diff, both the left and right columns have hunks that begin "Following his work on the WannaCry ransomware attack in 2017" and that are almost identical (edit distance: 3) but that have been aligned with other hunks instead of with each other, making it very hard to spot what has changed between them. (To spare you searching, it is "he's" to "he has".)

I expect the solution to this bug will involve matching paragraphs according to minimum edit distance, with a fallback algorithm in case two or more paragraphs are equal edit distances away.

This issue still stands, but can likely be addressed by tuning wikidiff2 parameters (T341753). We also now have the inline style, which we hope makes it easier to read trickier diffs like this example. Perhaps that is enough to consider this task resolved?

MusikAnimal assigned this task to tstarling.

No reply to T7072#9221499. With the Better-Diffs-2023 project now complete, including a new inline style option, I'm going to be bold and close this 17-year-old bug! Assigning to Tim as he implemented the algorithm changes.

As discussed here, another example of this bug is at https://en.wikipedia.org/w/index.php?title=MalwareTech&diff=782456060&oldid=782449642 . In that diff, both the left and right columns have hunks that begin "Following his work on the WannaCry ransomware attack in 2017" and that are almost identical (edit distance: 3) but that have been aligned with other hunks instead of with each other, making it very hard to spot what has changed between them. (To spare you searching, it is "he's" to "he has".)

This is T164795.