
Relations table and excessive size
Closed, Resolved, Public

Description

Author: sergey.chernyshev

Description:
The smw_relations table grows very fast. It might be a good idea to consider stripping it of subject_namespace and subject_title, or even merging it with the MediaWiki pagelinks table, which stores all the same information apart from relation_title (since relations are defined through link syntax).

For statistics: one of my test installations, with about 30K (relation-intensive) pages, has approximately 5 million entries in the smw_relations table (taking approximately 1 GB of space) and slightly more entries in the pagelinks table (taking approximately 850 MB of space).
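
To put those figures in per-row terms, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 GB = 10**9 bytes) and exactly 5 million rows in each table; both are approximations of the numbers quoted above, and since pagelinks has slightly more rows, its figure is only an upper bound.

```python
# Rough per-row storage cost implied by the figures in the report.
# Assumptions (not from the report): decimal units, exactly 5 million rows.
ROWS = 5_000_000

relations_bytes = 1 * 10**9      # "approximately 1 GB"
pagelinks_bytes = 850 * 10**6    # "approximately 850 MB"

bytes_per_relation_row = relations_bytes / ROWS
# pagelinks has "a little more" than 5 million rows, so this is an upper bound.
pagelinks_upper_bound = pagelinks_bytes / ROWS

print(f"smw_relations: about {bytes_per_relation_row:.0f} bytes per row")
print(f"pagelinks:     at most about {pagelinks_upper_bound:.0f} bytes per row")
```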

Mailing list discussion thread (my posts only, so far) can be seen here:
http://sourceforge.net/mailarchive/forum.php?thread_name=9984a7a70705281206r58d041f9i7c393a82447f1336%40mail.gmail.com&forum_name=semediawiki-devel


Version: unspecified
Severity: normal
URL: http://sourceforge.net/mailarchive/forum.php?thread_name=9984a7a70705281206r58d041f9i7c393a82447f1336%40mail.gmail.com&forum_name=semediawiki-devel

Details

Reference
bz10087

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 9:53 PM
bzimport set Reference to bz10087.
bzimport added a subscriber: Unknown Object (MLST).

I understand the problem and will consider ways of providing storage-optimised data models in future releases. The additional title and namespace information is relevant for implementing filtering operations more efficiently (i.e. without an additional join with the page table), and hence it was denormalised on purpose.

For relations and objects, ids are not always available and thus cannot replace the title strings in general. We could consider having our own indexing scheme for them, or use ids at least for attributes (which need to have articles before being stored). However, ids in MediaWiki are not as persistent as names: when you move a page, the id associated with the original name changes. Hence all uses of ids require global update operations whenever pages are moved, and we tried to limit this as well.

What you can do to save space (at the expense of performance) is to drop indexes that are currently built for the tables. In particular, the ones over object and subject titles/namespaces are not required for all operations, and deleting them might be feasible.
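
As an illustration only, the effect of dropping such an index can be tried out on a toy table. The sketch below uses SQLite through Python's sqlite3 module rather than the MySQL setup a real wiki would use; the index names, the object_* and subject_id columns, and the sample rows are all assumptions made for the example, not the actual SMW schema.

```python
import sqlite3

# Toy stand-in for smw_relations. Columns other than subject_namespace,
# subject_title and relation_title are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE smw_relations (
        subject_id        INTEGER,
        subject_namespace INTEGER,
        subject_title     TEXT,
        relation_title    TEXT,
        object_namespace  INTEGER,
        object_title      TEXT
    );
    -- Hypothetical index names.
    CREATE INDEX idx_subject ON smw_relations (subject_namespace, subject_title);
    CREATE INDEX idx_object  ON smw_relations (object_namespace, object_title);
""")

rows = [
    (i, 0, f"Page_{i}", "related_to", 0, f"Page_{(i * 7) % 5000}")
    for i in range(5000)
]
conn.executemany("INSERT INTO smw_relations VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()


def used_pages(db):
    """Database pages in use: total pages minus pages on the free list."""
    total = db.execute("PRAGMA page_count").fetchone()[0]
    free = db.execute("PRAGMA freelist_count").fetchone()[0]
    return total - free


def index_names(db):
    """Names of the indexes currently defined on smw_relations."""
    query = (
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'smw_relations'"
    )
    return sorted(row[0] for row in db.execute(query))


before = used_pages(conn)
conn.execute("DROP INDEX idx_object")  # trade lookup speed for space
conn.commit()
after = used_pages(conn)

print("remaining indexes:", index_names(conn))
print("pages in use before/after:", before, after)
```

Queries that filter on the object columns still work after the drop, but they fall back to a full table scan, which is the performance cost mentioned above.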

The new storage engine of SMW uses internal numerical ids for all pages, so that title strings vanish completely from relation tables and from the subject positions of other tables. In exchange, there is a new SMW-specific id management (we cannot use MediaWiki ids since they do not exist for all objects occurring in SMW tables). For relation-heavy wikis, this should greatly reduce storage space. The new storage implementation is in SVN and will soon be released with proper update instructions. To test it now, see the instructions in Bug 13960 (switching back to the old implementation is always possible, but you should run the current stable release, SMW 1.1.1, before trying out SVN).
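
The general idea of such an id scheme can be sketched as follows. This is not the actual SMW implementation: the table names (ids, relations), the helper functions, the namespace numbers, and the sample data are all hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

NS_MAIN = 0        # MediaWiki main namespace
NS_RELATION = 100  # placeholder number for a relation namespace (assumed)


def make_store():
    """Create an in-memory store with an id table and an id-only relation table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ids (
            id        INTEGER PRIMARY KEY,
            namespace INTEGER NOT NULL,
            title     TEXT NOT NULL,
            UNIQUE (namespace, title)
        );
        CREATE TABLE relations (
            subject_id  INTEGER NOT NULL,
            relation_id INTEGER NOT NULL,
            object_id   INTEGER NOT NULL
        );
    """)
    return conn


def get_id(conn, namespace, title):
    """Return the internal id for (namespace, title), allocating one if needed.

    No wiki page has to exist for the title, which is why the ids are
    managed separately from MediaWiki page ids.
    """
    row = conn.execute(
        "SELECT id FROM ids WHERE namespace = ? AND title = ?",
        (namespace, title),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO ids (namespace, title) VALUES (?, ?)",
        (namespace, title),
    )
    return cur.lastrowid


def add_relation(conn, subject, relation, obj):
    """Store one relation using only integer ids."""
    conn.execute(
        "INSERT INTO relations VALUES (?, ?, ?)",
        (
            get_id(conn, NS_MAIN, subject),
            get_id(conn, NS_RELATION, relation),
            get_id(conn, NS_MAIN, obj),
        ),
    )


store = make_store()
add_relation(store, "Berlin", "capital_of", "Germany")
add_relation(store, "Paris", "capital_of", "France")
store.commit()

id_count = store.execute("SELECT COUNT(*) FROM ids").fetchone()[0]
relation_rows = store.execute("SELECT * FROM relations").fetchall()
print(id_count, relation_rows)
```

Each title string is stored once in the id table however many relations mention it, so the relation rows shrink to three integers each.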