
Versioned data in backend
Closed, Declined (Public)

Description

Author: lapo

Description:
The latest version is the one accessed most; older versions are accessed less and
less as they age.
This was the rationale of RCS's programmers, and it worked for years and years for
them, and for CVS too (which in fact uses the RCS format).

Storing only the latest version in full, plus a "diff to latest" for each old revision, is:
a. space-savvy
b. quite efficient, since the page viewed most of the time (the latest one) is
available with no diffing or patching at all
c. space-savvy
d. space-savvy
e. space-savvy
f. compatible with any backend: adding a revision only requires appending a record
with the new text and modifying the previous "latest" record (replacing its full
text with the "to latest" diff); all older records are left untouched (a rough
sketch follows this list)
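
A rough sketch of what this could look like, chaining each old revision to the
next newer one the way RCS does (hypothetical Python using difflib; none of the
names below come from MediaWiki):

```python
import difflib

# Hypothetical reverse-delta store: the newest revision is kept in full; every
# older revision is replaced by the instructions needed to rebuild it from the
# revision that follows it.

def make_delta(newer: str, older: str):
    """Return a compact recipe that rebuilds `older` out of `newer`."""
    ops = difflib.SequenceMatcher(a=newer, b=older).get_opcodes()
    delta = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            delta.append(("copy", i1, i2))          # reuse a slice of the newer text
        else:
            delta.append(("insert", older[j1:j2]))  # text only the older revision has
    return delta

def apply_delta(newer: str, delta) -> str:
    """Rebuild the older revision from the newer text plus its delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(newer[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

# Saving a new revision: append the new text in full and demote the previous
# "latest" record to a delta; every older record stays exactly as it was.
history = ["first draft of the article\n"]          # only revision so far, stored whole
new_text = "first draft of the article\nplus one new line\n"
history[-1] = make_delta(new_text, history[-1])
history.append(new_text)

# Reading: the latest revision needs no patching at all; older revisions are
# rebuilt by walking the deltas backwards from the latest text.
assert apply_delta(history[-1], history[-2]) == "first draft of the article\n"
```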

Once I have finished my university thesis I could help implement it (either here,
if the idea is liked, or in a different wiki engine; I really feel "bad" adding a
single line to a 1000-line article and knowing I have "wasted" 999 lines of space ^_^)


Version: unspecified
Severity: enhancement

Details

Reference
bz1935

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:21 PM
bzimport set Reference to bz1935.
bzimport added a subscriber: Unknown Object (MLST).

rowan.collins wrote:

As I understand it, old versions are no longer stored "whole" anyway. At the
very least, they are now compressed in batches, and I have "heard" discussion of
making the "text" table (now independent of both article and revision metadata)
manageable by a kind of independent back-end with various storage schemes at its
disposal.

But certainly, this idea is one which has been mentioned before as having
potential merit, and I'm sure your effort to implement it would be welcomed.

Quoting Roan in http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583

'''
Wikimedia doesn't technically use delta compression. It concatenates a
couple dozen adjacent revisions of the same page and compresses that
(with gzip?), achieving very good compression ratios because there is
a huge amount of duplication in, say, 20 adjacent revisions of
[[Barack Obama]] (small changes to a large page, probably a few
identical versions due to vandalism reverts, etc.). However,
decompressing it just gets you the raw text, so nothing in this
storage system helps generation of diffs. Diff generation is still
done by shelling out to wikidiff2 (a custom C++ diff implementation
that generates diffs with HTML markup like <ins>/<del>) and caching
the result in memcached.

'''

Seems good enough. Closing bug as works for me.
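
For illustration, a rough sketch of that concatenate-and-compress approach
(hypothetical Python using zlib; the delimiter and the example text are made up,
and this is not actual MediaWiki code):

```python
import zlib

# Bundle a couple dozen adjacent revisions of one page and compress the bundle,
# so the near-duplicate text deflates far better than compressing each revision
# on its own.

base = "".join(f"Paragraph {i} of a large article about [[Barack Obama]].\n"
               for i in range(200))
revisions = [base + f"Small edit number {n}.\n" for n in range(20)]

separator = b"\x00REV\x00"  # made-up delimiter between concatenated revisions
bundle = separator.join(r.encode("utf-8") for r in revisions)

one_by_one = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)
as_bundle = len(zlib.compress(bundle))
print(f"compressed one by one: {one_by_one} bytes, as a bundle: {as_bundle} bytes")

# Decompressing the bundle only gives back plain text, so diffs still have to be
# computed separately (e.g. by wikidiff2) and cached.
```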

(In reply to comment #2)

Quoting Roan in
http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/51583


...and I was wrong, see the replies to that post. We actually DO use delta-based storage, almost exactly in the way you propose.