Eid and Diwali Mubarak

Eid Mubarak. Happy Diwali.

Eid Mubarak, everyone. We are celebrating Eid today here in Atlanta.

And Happy Diwali too.

Movable Type, MySQL, Perl, Unicode

Unicode is tough. Programs and languages still have issues with supporting Unicode properly. This is the story of my adventures with phpbb, PHP, Movable Type, MySQL, and Perl wrestling with their Unicode support.

Unicode is tough. It’s tough because of bad programmers and their ingrained habits. Everybody loves shortcuts and programmers are no exception. That is why years after the introduction of Unicode, it is still difficult to do any real application programming with Unicode data.

Let us look at two examples.

We had a problem on our Urdu forum which runs on phpbb. Any user who registered with a somewhat longish Urdu user name could not log into the forum again. The maximum length of the user name field was set to 25 characters but people had problems even with user names 14 characters long. At first, we thought it was due to the size of the field in the MySQL database, but increasing that didn’t help. After much effort, I found out that when logging in (but not when registering), the user name was truncated to 25 characters using the PHP function substr. And of course it turned out that substr works only with bytes, not with characters. I hope you understand the difference between characters and bytes unlike a lot of programmers. So I had to replace substr with mb_substr. Yes, PHP has a separate set of string functions for multibyte encodings.

The other example comes from Movable Type which claims:

Movable Type ships with full support for Unicode and international character sets. Official, fully-supported versions with translated documentation are now available in Japanese, French, German, Spanish, and Dutch.

However, take a look at this function in the Movable Type code:

sub no_utf8 {
for (@_) {
next if !defined $_;
$_ = pack 'C0A*', $_;

So what does this function do? It converts character-based strings to bytes. It is used to truncate the excerpts of incoming trackbacks to 255 bytes. Unicode must be pretty hard if programmers keep tripping over the difference between characters and bytes! This code has been in Movable Type since version 2.6 (or earlier) and is still there in the latest version 3.2.

Now, remember that Movable Type has a Japanese version which obviously would have issues with this. However, the character-to-byte conversion is still done in that version, but some additional processing is used to bring it back to characters. Why? Because Unicode is hard, of course.

That was just a preamble. Let’s now talk about the problem that prompted this post.

First, here are the different versions of the software we’ll talk about that I am using:

Since my webhost recently upgraded to Debian 3.1 and Perl 5.8.4, I thought I could do Unicode better on my MT blog here. Perl 5.8 is really the first version of Perl with Unicode support; Perl 5.6.1 claimed some support but there were lots of issues. I should know, I tried.

Among my ideas was one that I wanted to use actual Unicode characters instead of HTML entities, numeric character references etc. Right now I am using numeric character references for MathML (via the Numeric Entities plugin), Urdu dates (my own localization of MT) and smart quotes (via the Smartypants plugin). The Numeric Entities plugin has a mode where it can output UTF-8 characters while the other two needed to be edited to use Unicode character literals instead of numeric character references (i.e., \x{hhhh} instead of &#xhhhh; where hhhh is a hexadecimal number).

When I made these changes, it garbled up my web pages. Basically, the Unicode characters that I had changed from entities showed up okay but the rest of the non-ASCII Unicode characters on the page were garbled into accented roman characters. Counting these roman characters it looked like there were more of these characters than the actual Unicode characters which they replaced on the web page. Can we say: Why is Unicode hard? Bytes vs characters, sir!

At first, I was stumped. How could the Numeric Entities plugin affect characters that were not even processed by it? Then the harsh truth dawned on me. Perl uses a UTF8 flag to mark Unicode strings. When two strings are concatenated but only one is a Unicode string, then the other must be converted into Unicode before the concatenation. By default, strings in Perl are Latin1 or ISO-8859-1. So what was happening was that stuff wasn’t marked with the UTF8 flag in Movable Type, but the explicitly defined Unicode characters generated by the Numeric Entities plugin were so marked. When these were concatenated to form the web page output, anything other than the characters converted from entities by Numeric Entities plugin was considered to be Latin1 and hence converted byte-by-byte to Unicode. Garbage (no UTF8 flag) in, Garbage (garbled Unicode characters) out! I confirmed this by doing a is_utf8() at different stages in Movable Type.

To recap, this meant that any programmatic insertion of Unicode characters was properly marked as Unicode. However, since none of the Unicode data entered by the user (in the entry or comment fields or even in the templates) was marked as such, the presence of programmatic Unicode characters garbled the rest of the non-ASCII Unicode characters.

The next step was to find out why and where this was happening. The first thing I found was that all the data in my MySQL database was marked Latin1. Why? Legacy issues: When I created that database, my host was running MySQL 4.0 which had only one character set: Latin1. MySQL 4.1 added Unicode (and lots of other character sets) support so that you can now assign character sets and collations not only to databases and tables but also to columns. That is great but if you want databases created in 4.0 or earlier to keep running without any conversions then all such databases need to be assigned the latin1 character set which is what my host did. As to why I was seeing the correct Unicode characters on my website and in the MT interface: The data was stored simply as bytes.

Time to fix the MySQL database character set then. First, I tried the easy way but that didn’t work for some odd reason. So I had to do it the hard way which involved changing the type of each column from CHAR(n), VARCHAR(n), TEXT, MEDIUMTEXT, TINYTEXT or LONGTEXT to its corresponding binary type (BINARY(n), VARBINARY(n), BLOB, MEDIUMBLOB, TINYBLOB or LONGBLOB) and then back along with changing the character to utf8. Here are the statements:


Replace t by the table name, column by the column name, and n by the length of the field for CHAR and VARCHAR types. Also, choose the corresponding data types for the specific column. One complication (other than doing this individually for all columns in all tables) is that you have to specify any attributes that were originally there for the column, otherwise they get dropped. These would be things like whether the default value for the field is NULL or the field cannot have a NULL value, etc. The easiest way to do this is using phpMyAdmin instead of typing these statements since it lets you alter specific characteristics.

I went through this for the more than 50 columns that needed changing for my Movable Type database. It was good to see the Urdu characters finally appearing in phpMyAdmin as I browsed the database. Then I opened the MT interface and saw that all Urdu characters were appearing as “?”. What the ****? Then I remembered: Unicode is hard.

Suddenly a light went on and I remembered this comment by Asif when he transferred our Urdu Wiki to its present location. Reading up on that I realized that in addition to database, table and column character sets, there were also character sets defined for server, client and connection. And obviously these were set to a default of latin1.

I found out where to put this statement (SET NAMES utf8) in mt/lib/MT/ObjectDriver/DBI/mysql.pm but found a more elegant solution than simply setting it for all cases. So here’s my patch (for Movable Type 3.2) for this issue:

--- lib/MT/ConfigMgr.pm.orig    2005-08-16 19:37:11.000000000 -0700
+++ lib/MT/ConfigMgr.pm 2005-11-01 19:48:26.000000000 -0800
@@ -151,6 +151,7 @@
['OutboundTrackbackLimit', { Default => 'any' }],
['OutboundTrackbackDomains', { Type => 'ARRAY' } ],
['IndexBasename', {Default => 'index'}],
+        ['SQLSetNames', {Default => 0}],
--- lib/MT/ObjectDriver/DBI/mysql.pm.orig       2005-07-29 13:41:11.000000000 -0700
+++ lib/MT/ObjectDriver/DBI/mysql.pm    2005-11-01 19:45:32.000000000 -0800
@@ -49,6 +49,11 @@
{ RaiseError => 0, PrintError => 0 })
or return $driver->error(MT->translate("Connection error: [_1]",
+    if ($cfg->SQLSetNames && (my $c = lc $cfg->PublishCharset)) {
+        my %Charset = ('utf-8' => 'utf8', 'shift_jis' => 'sjis', 'euc-jp' => 'ujis');
+        my $c = $Charset{$c} ? $Charset{$c}  : $c;
+        $driver->{dbh}->do("SET NAMES " . $c);
+    }

And of course add SQLSetNames 1 in your mt-config.cgi or mt.cfg configuration file.

Elated, I opened up MT and checked. The question marks had disappeared. So far, so good. Let’s check if the original Unicode problem was fixed. Oh no! It isn’t. I guess we are back to square one. Did I say something about Unicode being hard?

What is the reason for the problem now? MySQL has nice UTF-8 data which it passes along to MT’s Perl functions. Why is the data still not marked with the UTF-8 flag? Movable Type uses the Perl modules DBI and DBD::mysql to access the MySQL database. And guess what? They don’t have any Unicode support. In fact, forget marking the UTF-8 flag properly, according to this, DBD::mysql doesn’t even preserve UTF-8 flag when it’s already there.

In the end, I have 3 options:

  1. Wait for Unicode support for DBI/DBD::mysql which might be a long time since nobody is sure if it should be provided by the database-independent interface DBI or by the MySQL driver DBD::mysql or both together in some way.
  2. Use decode_utf8 on every output from the database. This is not very easy to do.
  3. Use a patch which blesses all database data (yes that includes the binary fields) as UTF-8 based on a flag you set when connecting to the database.

None of these options is very appealing and all have side-effects and problems associated with them. My plan is to set up a development subdomain and then test out options 2 and 3 there thoroughly before bringing them online for my weblog.

POSTSCRIPT: For something funny about Perl’s Unicode support, read about the difference between utf8 and UTF-8.

UPDATE: Here’s a DBI modification that does the same thing as option 3 above. See also an amendment.