SV: SV: Bug in MySQL code?

Søren Løvborg sorenl at unity3d.com
Mon Jan 4 13:26:22 UTC 2016


> Each table already specifies mysql_charset utf8 ...
> but that is apparently for something else?

That's the encoding used on disk; the charset in the db URL is the
connection encoding. Ideally, these should be the same, otherwise
MySQL will convert all incoming and outgoing string data between the
two encodings, which is just a waste of time.

It is correct that there is a weird limitation to the MySQL "utf8"
encoding – it wouldn't be MySQL if there wasn't ;) – which is why some
people used to recommend storing strings as BINARY values instead, to
avoid MySQL messing up text data. (This is in line with the grand
tradition of using MySQL as a dumb data store and doing all logic and
validation client-side.)

As Mads hints, the issue is that MySQL's "utf-8" encoding is limited
to 16-bit code points (the Basic Multilingual Plane), which is of
course ridiculous (the UTF-8 standard never had this limit). In MySQL
5.5 and later, you can use utf8mb4 to get actual standards-compliant
UTF-8 support; given that 5.5 has been "Generally Available" for just
over 5 years now, that seems like a reasonable default for Kallithea
to use. But note that we must use "ut8mb4" as the table encoding as
well to get any benefit; it's not enough to use it as the connection
encoding.

Ironically, this means that the lack of a connection encoding (and
thus using the default "latin1") ensures full Unicode support, whereas
setting the connection encoding to "utf8" would limit users to BMP. :)

Given this, it seems we should do something like

diff --git a/.travis.yml b/.travis.yml
--- a/.travis.yml
+++ b/.travis.yml
@@ -8,1 +8,1 @@
-  - TEST_DB=mysql://root@127.0.0.1/kallithea_test
+  - TEST_DB=mysql://root@127.0.0.1/kallithea_test?charset=utf8mb4
diff --git a/development.ini b/development.ini
--- a/development.ini
+++ b/development.ini
@@ -496,1 +496,1 @@
-#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea
+#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea?charset=utf8mb4
diff --git a/kallithea/bin/template.ini.mako b/kallithea/bin/template.ini.mako
--- a/kallithea/bin/template.ini.mako
+++ b/kallithea/bin/template.ini.mako
@@ -498,1 +498,1 @@
-sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea
+sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea?charset=utf8mb4
diff --git a/kallithea/config/deployment.ini_tmpl
b/kallithea/config/deployment.ini_tmpl
--- a/kallithea/config/deployment.ini_tmpl
+++ b/kallithea/config/deployment.ini_tmpl
@@ -488,1 +488,1 @@
-#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea
+#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea?charset=utf8mb4
diff --git a/kallithea/model/db.py b/kallithea/model/db.py
--- a/kallithea/model/db.py
+++ b/kallithea/model/db.py
@@ -166,1 +166,1 @@
-         'mysql_charset': 'utf8', 'sqlite_autoincrement': True}
+         'mysql_charset': 'utf8mb4', 'sqlite_autoincrement': True}
diff --git a/kallithea/tests/test.ini b/kallithea/tests/test.ini
--- a/kallithea/tests/test.ini
+++ b/kallithea/tests/test.ini
@@ -498,1 +498,1 @@
-#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea
+#sqlalchemy.db1.url = mysql://user:pass@localhost/kallithea?charset=utf8mb4

This will ensure proper encoding for new installations.

For existing users, database tables would remain 'utf8' (and stay
restricted to the BMP Unicode subset), though any new tables added
would become 'utf8mb4'. This should not cause any problems, as long as
they also stick to the old and incorrect connection encoding.

Existing users wishing to get Unicode "compliant" would have to first
convert all string data using a combination of these statements (for
every table and field, as appropriate):

    ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;
    UPDATE tbl SET field=CONVERT(BINARY field USING utf8mb4);

I guess we could document this if anyone's up for writing the SQL
script and testing that it works...

/Søren

On Thu, Dec 24, 2015 at 5:33 AM, Mads Kiilerich <mads at kiilerich.com> wrote:
> On 11/27/2015 01:33 PM, Dominik Ruf wrote:
>
> Great.
> BTW I made another test and it seems the key thing is charset=utf8.
>
>
> TLDR: Lars is right that a default Kallithea installation on MySQL stores
> utf-8 in the database instead of storing unicode and letting the database
> deal with the encoding. I was also right that it generally works fine
> anyway. ;-)
>
> I also tested (with Fedora, mariadb and mysql-python). I tested by creating
> a new database, changing the admin users name to blåbærgrød, creating a
> blåbærgrød repository, and inspecting database and file system content.
>
> Everything worked flawlessly with the default mysql url. Only with the
> caveat that it stores utf-8 in the database. Sqlalchemy will however encode
> and decode it consistently so everything just works ... but I guess
> collation order and other "details" might be wrong and direct database
> hacking will be tricky - as Lars found out the hard way in the initial post.
>
> I agree that
> sqlalchemy.db1.url =
> mysql://kallithea:foobar@localhost/kallithea?charset=utf8
> seems to be the right "solution". It works and the database content is as
> expected. (Except that this however apparently not is fully unicode
> compliant and it would be better to use utf8mb4 ...)
>
> I don't know the root cause of the weirdness. It might be some (old and
> fixed?) MySQL deficiencies and workarounds in SqlAlchemy ... or something in
> Kallithea that triggers it. I guess it could be the combination of mysql not
> being unicode compliant by default and convert_unicode thus triggering the
> unnecessary utf8 encoding.
> (http://docs.sqlalchemy.org/en/latest/core/engines.html#sqlalchemy.create_engine.params.encoding
> could also seem to play a role ... but probably only relevant for
> understanding.)
>
> I guess we should change the default mysql uri in the .ini files to use
> charset=utf8?
>
> Each table already specifies mysql_charset utf8 ... but that is apparently
> for something else?
>
> We should probably also improve the documentation to give some advice of
> which "DBAPI" to use. Any recommendations?
>
> I guess we also should get rid of all the explicit convert_unicode in db.py
> and .ini and just use Unicode and UnicodeText fields.
>
> Changes in this area could however cause pain for installations that happily
> are using mysql with double encoding.
>
> /Mads
>
>
> _______________________________________________
> kallithea-general mailing list
> kallithea-general at sfconservancy.org
> http://lists.sfconservancy.org/mailman/listinfo/kallithea-general
>


More information about the kallithea-general mailing list