Movable Type, MySQL, Perl, Unicode

Unicode is tough. It’s tough because of bad programmers and their ingrained habits. Everybody loves shortcuts and programmers are no exception. That is why years after the introduction of Unicode, it is still difficult to do any real application programming with Unicode data.

Let us look at two examples.

We had a problem on our Urdu forum which runs on phpbb. Any user who registered with a somewhat longish Urdu user name could not log into the forum again. The maximum length of the user name field was set to 25 characters but people had problems even with user names 14 characters long. At first, we thought it was due to the size of the field in the MySQL database, but increasing that didn’t help. After much effort, I found out that when logging in (but not when registering), the user name was truncated to 25 characters using the PHP function substr. And of course it turned out that substr works only with bytes, not with characters. I hope you understand the difference between characters and bytes unlike a lot of programmers. So I had to replace substr with mb_substr. Yes, PHP has a separate set of string functions for multibyte encodings.

The other example comes from Movable Type which claims:

Movable Type ships with full support for Unicode and international character sets. Official, fully-supported versions with translated documentation are now available in Japanese, French, German, Spanish, and Dutch.

However, take a look at this function in the Movable Type code:

sub no_utf8 {
for (@_) {
next if !defined $_;
$_ = pack 'C0A*', $_;
}
}

So what does this function do? It converts character-based strings to bytes. It is used to truncate the excerpts of incoming trackbacks to 255 bytes. Unicode must be pretty hard if programmers keep tripping over the difference between characters and bytes! This code has been in Movable Type since version 2.6 (or earlier) and is still there in the latest version 3.2.

Now, remember that Movable Type has a Japanese version which obviously would have issues with this. However, the character-to-byte conversion is still done in that version, but some additional processing is used to bring it back to characters. Why? Because Unicode is hard, of course.

That was just a preamble. Let’s now talk about the problem that prompted this post.

First, here are the different versions of the software we’ll talk about that I am using:

Since my webhost recently upgraded to Debian 3.1 and Perl 5.8.4, I thought I could do Unicode better on my MT blog here. Perl 5.8 is really the first version of Perl with Unicode support; Perl 5.6.1 claimed some support but there were lots of issues. I should know, I tried.

Among my ideas was one that I wanted to use actual Unicode characters instead of HTML entities, numeric character references etc. Right now I am using numeric character references for MathML (via the Numeric Entities plugin), Urdu dates (my own localization of MT) and smart quotes (via the Smartypants plugin). The Numeric Entities plugin has a mode where it can output UTF-8 characters while the other two needed to be edited to use Unicode character literals instead of numeric character references (i.e., \x{hhhh} instead of &#xhhhh; where hhhh is a hexadecimal number).

When I made these changes, it garbled up my web pages. Basically, the Unicode characters that I had changed from entities showed up okay but the rest of the non-ASCII Unicode characters on the page were garbled into accented roman characters. Counting these roman characters it looked like there were more of these characters than the actual Unicode characters which they replaced on the web page. Can we say: Why is Unicode hard? Bytes vs characters, sir!

At first, I was stumped. How could the Numeric Entities plugin affect characters that were not even processed by it? Then the harsh truth dawned on me. Perl uses a UTF8 flag to mark Unicode strings. When two strings are concatenated but only one is a Unicode string, then the other must be converted into Unicode before the concatenation. By default, strings in Perl are Latin1 or ISO-8859-1. So what was happening was that stuff wasn’t marked with the UTF8 flag in Movable Type, but the explicitly defined Unicode characters generated by the Numeric Entities plugin were so marked. When these were concatenated to form the web page output, anything other than the characters converted from entities by Numeric Entities plugin was considered to be Latin1 and hence converted byte-by-byte to Unicode. Garbage (no UTF8 flag) in, Garbage (garbled Unicode characters) out! I confirmed this by doing a is_utf8() at different stages in Movable Type.

To recap, this meant that any programmatic insertion of Unicode characters was properly marked as Unicode. However, since none of the Unicode data entered by the user (in the entry or comment fields or even in the templates) was marked as such, the presence of programmatic Unicode characters garbled the rest of the non-ASCII Unicode characters.

The next step was to find out why and where this was happening. The first thing I found was that all the data in my MySQL database was marked Latin1. Why? Legacy issues: When I created that database, my host was running MySQL 4.0 which had only one character set: Latin1. MySQL 4.1 added Unicode (and lots of other character sets) support so that you can now assign character sets and collations not only to databases and tables but also to columns. That is great but if you want databases created in 4.0 or earlier to keep running without any conversions then all such databases need to be assigned the latin1 character set which is what my host did. As to why I was seeing the correct Unicode characters on my website and in the MT interface: The data was stored simply as bytes.

Time to fix the MySQL database character set then. First, I tried the easy way but that didn’t work for some odd reason. So I had to do it the hard way which involved changing the type of each column from CHAR(n), VARCHAR(n), TEXT, MEDIUMTEXT, TINYTEXT or LONGTEXT to its corresponding binary type (BINARY(n), VARBINARY(n), BLOB, MEDIUMBLOB, TINYBLOB or LONGBLOB) and then back along with changing the character to utf8. Here are the statements:

ALTER TABLE t MODIFY column BINARY(n) | VARBINARY(n) | BLOB | MEDIUMBLOB | TINYBLOB | LONGBLOB [ [DEFAULT | NOT] NULL];
ALTER TABLE t MODIFY column CHAR(n) | VARCHAR(n) | TEXT | MEDIUMTEXT | TINYTEXT | LONGTEXT CHARACTER SET utf8 [ [DEFAULT | NOT] NULL];

Replace t by the table name, column by the column name, and n by the length of the field for CHAR and VARCHAR types. Also, choose the corresponding data types for the specific column. One complication (other than doing this individually for all columns in all tables) is that you have to specify any attributes that were originally there for the column, otherwise they get dropped. These would be things like whether the default value for the field is NULL or the field cannot have a NULL value, etc. The easiest way to do this is using phpMyAdmin instead of typing these statements since it lets you alter specific characteristics.

I went through this for the more than 50 columns that needed changing for my Movable Type database. It was good to see the Urdu characters finally appearing in phpMyAdmin as I browsed the database. Then I opened the MT interface and saw that all Urdu characters were appearing as “?”. What the ****? Then I remembered: Unicode is hard.

Suddenly a light went on and I remembered this comment by Asif when he transferred our Urdu Wiki to its present location. Reading up on that I realized that in addition to database, table and column character sets, there were also character sets defined for server, client and connection. And obviously these were set to a default of latin1.

I found out where to put this statement (SET NAMES utf8) in mt/lib/MT/ObjectDriver/DBI/mysql.pm but found a more elegant solution than simply setting it for all cases. So here’s my patch (for Movable Type 3.2) for this issue:

--- lib/MT/ConfigMgr.pm.orig    2005-08-16 19:37:11.000000000 -0700
+++ lib/MT/ConfigMgr.pm 2005-11-01 19:48:26.000000000 -0800
@@ -151,6 +151,7 @@
['OutboundTrackbackLimit', { Default => 'any' }],
['OutboundTrackbackDomains', { Type => 'ARRAY' } ],
['IndexBasename', {Default => 'index'}],
+        ['SQLSetNames', {Default => 0}],
]);
}
--- lib/MT/ObjectDriver/DBI/mysql.pm.orig       2005-07-29 13:41:11.000000000 -0700
+++ lib/MT/ObjectDriver/DBI/mysql.pm    2005-11-01 19:45:32.000000000 -0800
@@ -49,6 +49,11 @@
{ RaiseError => 0, PrintError => 0 })
or return $driver->error(MT->translate("Connection error: [_1]",
$DBI::errstr));
+    if ($cfg->SQLSetNames && (my $c = lc $cfg->PublishCharset)) {
+        my %Charset = ('utf-8' => 'utf8', 'shift_jis' => 'sjis', 'euc-jp' => 'ujis');
+        my $c = $Charset{$c} ? $Charset{$c}  : $c;
+        $driver->{dbh}->do("SET NAMES " . $c);
+    }
$driver;
}

And of course add SQLSetNames 1 in your mt-config.cgi or mt.cfg configuration file.

Elated, I opened up MT and checked. The question marks had disappeared. So far, so good. Let’s check if the original Unicode problem was fixed. Oh no! It isn’t. I guess we are back to square one. Did I say something about Unicode being hard?

What is the reason for the problem now? MySQL has nice UTF-8 data which it passes along to MT’s Perl functions. Why is the data still not marked with the UTF-8 flag? Movable Type uses the Perl modules DBI and DBD::mysql to access the MySQL database. And guess what? They don’t have any Unicode support. In fact, forget marking the UTF-8 flag properly, according to this, DBD::mysql doesn’t even preserve UTF-8 flag when it’s already there.

In the end, I have 3 options:

  1. Wait for Unicode support for DBI/DBD::mysql which might be a long time since nobody is sure if it should be provided by the database-independent interface DBI or by the MySQL driver DBD::mysql or both together in some way.
  2. Use decode_utf8 on every output from the database. This is not very easy to do.
  3. Use a patch which blesses all database data (yes that includes the binary fields) as UTF-8 based on a flag you set when connecting to the database.

None of these options is very appealing and all have side-effects and problems associated with them. My plan is to set up a development subdomain and then test out options 2 and 3 there thoroughly before bringing them online for my weblog.

POSTSCRIPT: For something funny about Perl’s Unicode support, read about the difference between utf8 and UTF-8.

UPDATE: Here’s a DBI modification that does the same thing as option 3 above. See also an amendment.

Published
Categorized as Internet

By Zack

Dad, gadget guy, bookworm, political animal, global nomad, cyclist, hiker, tennis player, photographer

14 comments

  1. I spent several hours today, dumped & drop’d my database, recreated it with charset and collations not default, tore my hair some until I found why index.html (alone) wasn’t being rebuilt (pilot error in template menu box ticking), and found my test data remained … unaesthetic.

    * FreeBSD 6.0-STABLE
    * Perl 5.8.7
    * MySQL 4.1.15
    * Movable Type 3.2
    * DBI Perl module 1.48
    * DBD::mysql Perl module 3.0002

    I didn’t have this when the backend was BerkeleyDB. My next possiblities are mysql 5.0.5, postgresql 7.4 and 8, and going back to BerkeleyDB, and telling my co-authors to use WP.

    Thanks a lot for a thoughtful post. Please let me know if you think of anything.

  2. Интересно, через сколько лет

    …эти люди вырастут и научатся делать это а) сами б) сразу Причем это паттерн. The floggings will continue until morale improves With ruby, though, it isn’t just a case of American programmers forgetting to think about the other 95% of……

  3. Интересно, через сколько лет

    …эти люди вырастут и научатся делать это а) сами б) сразу Причем это паттерн. The floggings will continue until morale improves With ruby, though, it isn’t just a case of American programmers forgetting to think about the other 95% of……

  4. Eric: The Perl DBD driver for postgresql has some, though not good, Unicode support. And I hope you know that there is no Unicode support in PHP without mbstring extension.

    Scott: That is my current semi-solution as well. However, that still doesn’t mark the data as Unicode, though it does get copied from the database correctly.

  5. WordPress和MovableType的MySQL乱码问题解决方法

    WordPress和MovableType是主流的Blog系统,而他们都用的是MySQL数据库,那么在MySQL4.1下,中文的WP和MT就会产生种种的乱码问题。 如前MySQL4.1乱码问题分析的,一个程序( PHP,CGI 等)与MySQL建立连接后,这个程序发送给MySQL的数据采用的是什么字符集,MySQL 是无从得知的。所以解决乱码问题的根本就是我们在程序中告诉MySQL采用的编码是什么,简单的就是在程序中加入这样的一个语句: SET NAMES ‘utf8’。 这个语句的效果等同于同时设定了 SET character_set_client=’utf8’SET character_set_connection=’utf8’SET character_set_results=’utf8’ 为什么这么做? 我们安装MySQL4.1时按照默认配置,那么default-character-set= utf8。在MySQL Command Line Client下查看到的查看系统的字符集和排序方式的设定为: mysql> SHOW VARIABLES LIKE ‘character_set_%’;+————————–+—————————-+| Variable_name            |            Value           |+————————–+—————————-+| character_set_client     | latin1                     || character_set_connection | latin1                     || character_set_database   | utf8                       ||…

  6. Movable Type Unicode Issue

    I have used MySQL as MobableType backend database. After I setup a Chinese blog, I suddenly found the double-byte characters stored in database table were bad encoded. But MT shows correct double-byte encode in the webpage. Even current MySQL and…

  7. MySQL 4.1과 MovableType의 궁합 문제

    왜인지 모르겠지만, MySQL 4.1로 바뀐 이후 접속을 할 때의 세션의 캐릭터셋이 utf8로 설정이 안 되는 것 같다. 그것때문에 고민하다가 MovableType의 경우 이러한 해결 방법을 찾아서 일단 해결. 좀더 근본적인 해결책을 찾는 중이다…….

Comments are closed.