
log database in s1-analytics-slave has character set utf8, while every other database has charset binary
Closed, Declined · Public

Description

Inconsistency is causing issues with scripts which now have to treat the log database specially.


Version: unspecified
Severity: normal

Details

Reference
bz47368

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 1:17 AM
bzimport set Reference to bz47368.
bzimport added a subscriber: Unknown Object (MLST).

Related is https://bugzilla.wikimedia.org/show_bug.cgi?id=45718 - if we change the charset for the database, we should also ensure that the tables are changed too.
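For reference, a rough sketch of what that implies (pymysql assumed, connection details and the target charset are just placeholders): ALTER DATABASE only changes the default for tables created later, so the existing tables need an explicit conversion pass as well.

import pymysql

# Sketch only; credentials omitted and the target charset is hypothetical.
conn = pymysql.connect(host='s1-analytics-slave', db='log')
with conn.cursor() as cur:
    cur.execute("ALTER DATABASE log CHARACTER SET binary")
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        # CONVERT TO rewrites the existing column data, not just the table default.
        cur.execute("ALTER TABLE `%s` CONVERT TO CHARACTER SET binary" % table)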

Yuvi, I am not fully persuaded the charset should be consistent with MW's default binary. Other databases used for data analysis on s1 host datasets coming from various sources (squid logs, API, dumps), for which utf8 is the ideal encoding. I typically rely on IDs for my work, so I am probably less sensitive to this issue.

Can you expand on the issues you're referring to in the original report?

On a related note, the situation on s1 is pretty messy:

SHOW CREATE DATABASE staging;
+----------+--------------------------------------------------------------------+
| Database | Create Database                                                    |
+----------+--------------------------------------------------------------------+
| staging  | CREATE DATABASE `staging` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+--------------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE prod;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| prod     | CREATE DATABASE `prod` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE enwiki;
+----------+-------------------------------------------------------------------+
| Database | Create Database                                                   |
+----------+-------------------------------------------------------------------+
| enwiki   | CREATE DATABASE `enwiki` /*!40100 DEFAULT CHARACTER SET binary */ |
+----------+-------------------------------------------------------------------+
1 row in set (0.03 sec)

SHOW CREATE DATABASE log;
+----------+--------------------------------------------------------------+
| Database | Create Database                                              |
+----------+--------------------------------------------------------------+
| log      | CREATE DATABASE `log` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+--------------------------------------------------------------+
1 row in set (0.03 sec)

So the problem is with Python. I'm using a bunch of scripts to manually 'join' data from the production slaves (commons) and eventlogging, and I am having to treat the connections to log and the others differently. This prevents the databases from being interchangeable, and this kind of special casing is always bad.
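Concretely, the special casing looks something like this (a simplified sketch with pymysql and made-up connection details, not the actual scripts): queries against the binary-charset wikis return bytes, while the same query against the utf8 log database returns already-decoded strings, so every call site has to know which kind of connection it holds.

import pymysql

# Simplified sketch; host/database names and credentials are placeholders.
commons = pymysql.connect(host='s1-analytics-slave', db='commonswiki')      # binary charset
log = pymysql.connect(host='s1-analytics-slave', db='log', charset='utf8')  # utf8 charset

def fetch_column(conn, sql):
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()
    # The decode step is only needed for the binary-charset connections --
    # exactly the special casing the scripts have to carry around.
    return [v.decode('utf-8') if isinstance(v, bytes) else v for (v,) in rows]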

Is there any direct benefit to having log be utf8 rather than binary?

As per IRC, I think being internally consistent inside MySQL (all MySQL databases have the same encoding, so you don't need to check it every time) is more important than having *some* MySQL tables be consistent with some outside sources.

https://gerrit.wikimedia.org/r/#/c/59880/ is the use case I am talking about. DB handling code should not have to care about which database it is connecting to...

(In reply to comment #0)

Inconsistency is causing issues with scripts which now have to treat the log
database specially.

This isn't EventLogging exceptionalism, you know: URIs are UTF-8, MediaWiki speaks UTF-8, the Python source is UTF-8, the locale of the system it's running on is UTF-8, the ZeroMQ stream is UTF-8, MongoDB is UTF-8, and it is the most common encoding on Android and iOS.

Binary character encoding is a bit of a misnomer, since it's not actually a character encoding but the absence of one. When all of your interactions with the database are mediated by MediaWiki, then it doesn't matter much how the database is configured as long as MediaWiki knows how to work with it. But EventLogging data has many consumers, working with data in SQL, CSV and XML format in Python and R and direct database client GUIs and who knows what else. Leaving the character encoding unspecified means leaving it up to application code, which in this case is scattered across a number of different codebases. This opens up a vast gulf of indeterminacy and a large potential for encoding issues.
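To make that concrete, a toy illustration (plain Python, not from the actual pipeline): with a binary column, the stored bytes only round-trip correctly if every consumer happens to guess the same encoding.

# The bytes as they would sit in a binary/varbinary column.
raw = u'Citt\u00e0'.encode('utf-8')   # b'Citt\xc3\xa0'

print(raw.decode('utf-8'))    # 'Città'     -- the consumer that guessed right
print(raw.decode('latin-1'))  # 'CittÃ\xa0' -- mojibake from the consumer that guessed wrong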

Getting all these different pieces, written in different languages and running on different systems, to speak with one another was tricky as hell, but it's working now, and we'd need to have a really compelling reason for changing it.

I think this is a WONTFIX, but leaving it open for now in case Yuvi wants to make a thorough case for a migration.

*shrug* I should open a new bug to have the slaves of the production databases set to utf8, but I am unsure how productive that would be.

The reason we don't use the 'utf8' character encoding setting in MySQL is, it turns out, that it stores at most three bytes per character and can therefore only encode BMP characters. Supplementary characters cannot be encoded at all.
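A quick check of that limit, with no MySQL involved (plain Python; the character choices are arbitrary):

bmp = u'\u20ac'         # EURO SIGN, Basic Multilingual Plane
astral = u'\U0001F600'  # GRINNING FACE, supplementary plane

print(len(bmp.encode('utf-8')))     # 3 -- fits within MySQL's 'utf8'
print(len(astral.encode('utf-8')))  # 4 -- rejected or truncated by 'utf8'; needs utf8mb4 or binary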

So I *do* think we need to change the character encoding, but to what?

Our databases run MariaDB 5.5, which supports a newer character encoding called 'utf8mb4' that offers full UTF-8 support (up to four bytes per character). I'd strongly prefer that to 'binary', in the interest of being explicit and mitigating the chance that a client will default to some other value based on the system locale or what have you. But this would still leave Yuvi's issue unresolved: the EventLogging database would still be slightly different from the rest of the cluster.

Thoughts?

[moving from MediaWiki extensions to Analytics product - see bug 61946]