Movable Type and Unicode

Running Movable Type natively in Unicode was not as difficult as I had expected, but it still required a number of patches to the code.

I had been trying to get Movable Type to run natively in Unicode for a while. When Movable Type was upgraded to version 3.3, I saw my chance. This version already includes much of the needed encoding and decoding code, which made my job much easier than before.

If you remember my previous travails, the DBD::mysql module lacked UTF-8 support. Almost immediately after my changes, the developer release of DBD::mysql finally included a UTF-8 patch, but that was too late for me. Besides, since DBD::mysql is somewhat complicated, I am going to wait until the patch is included in a regular release.

Instead, I set the UTF-8 flag on everything coming out of the database using a wrapper around the DBI module. I used Pavel Kudinov’s code for that, which is given below.

# UTF8DBI.pm re-implementation by Pavel Kudinov http://search.cpan.org/~kudinov/
# originally from: http://dysphoria.net/code/perl-utf8/
package UTF8DBI    ; use base DBI    ;
package UTF8DBI::db; use base DBI::db;
package UTF8DBI::st; use base DBI::st;
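# Recursively set the UTF-8 flag on $_ (scalar, array ref, or hash ref) and return it.
sub _utf8_() {
use Encode;
if    (ref $_ eq 'ARRAY'){ &_utf8_() foreach        @$_  }   # flag every element
elsif (ref $_ eq 'HASH' ){ &_utf8_() foreach values %$_  }   # flag every value
else                     {         Encode::_utf8_on($_) };   # plain scalar: flag only, no validation
$_;
};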
sub fetch             { return _utf8_ for shift->SUPER::fetch            (@_)  };
sub fetchrow_arrayref { return _utf8_ for shift->SUPER::fetchrow_arrayref(@_)  };
sub fetchrow_hashref  { return _utf8_ for shift->SUPER::fetchrow_hashref (@_)  };
sub fetchall_arrayref { return _utf8_ for shift->SUPER::fetchall_arrayref(@_)  };
sub fetchall_hashref  { return _utf8_ for shift->SUPER::fetchall_hashref (@_)  };
sub fetchcol_arrayref { return _utf8_ for shift->SUPER::fetchcol_arrayref(@_)  };
sub fetchrow_array    {                 @{shift->       fetchrow_arrayref(@_)} };
1;
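
To make the effect of the wrapper concrete, here is a minimal standalone sketch of what flipping the flag does to a string. The sample value is hypothetical and stands in for a column returned by the database. The `Encode::decode` call at the end is a validating alternative shown only for comparison; it is not what UTF8DBI does.

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF-8 bytes, as a driver without UTF-8 support would hand them back.
# "caf\xc3\xa9" is the UTF-8 encoding of "café" (5 bytes, 4 characters).
my $from_db = "caf\xc3\xa9";
my $before  = length($from_db);    # counted as bytes

# The same flag flip the wrapper performs; the bytes are not validated.
Encode::_utf8_on($from_db);
my $after = length($from_db);      # now counted as characters

# Validating alternative (an assumption, not part of UTF8DBI).
my $decoded = Encode::decode('UTF-8', "caf\xc3\xa9");

printf "%d bytes, %d characters, same as decode: %s\n",
    $before, $after, ($decoded eq $from_db ? 'yes' : 'no');
```

Because `_utf8_on` trusts the bytes blindly, this approach is only safe when the database really does return well-formed UTF-8.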

With that code in place, I needed to replace calls to the DBI module with calls to the UTF8DBI module, as shown in the patches below.

--- lib/MT/ObjectDriver/DBI.pm.orig	2006-09-06 19:27:17.000000000 -0700
+++ lib/MT/ObjectDriver/DBI.pm	2006-09-06 19:23:09.000000000 -0700
@@ -7,7 +7,7 @@
package MT::ObjectDriver::DBI;
use strict;
-use DBI;
+use UTF8DBI;
use MT::Util qw( offset_time_list );
use MT::ObjectDriver;
--- lib/MT/ObjectDriver/DBI/mysql.pm.orig	2006-09-06 19:26:55.000000000 -0700
+++ lib/MT/ObjectDriver/DBI/mysql.pm	2006-09-06 19:24:20.000000000 -0700
@@ -93,10 +93,10 @@
$dsn .= ';hostname=' . $cfg->DBHost if $cfg->DBHost;
$dsn .= ';mysql_socket=' . $cfg->DBSocket if $cfg->DBSocket;
$dsn .= ';port=' . $cfg->DBPort if $cfg->DBPort;
-    $driver->{dbh} = DBI->connect($dsn, $cfg->DBUser, $cfg->DBPassword,
+    $driver->{dbh} = UTF8DBI->connect($dsn, $cfg->DBUser, $cfg->DBPassword,
{ RaiseError => 0, PrintError => 0 })
or return $driver->error(MT->translate("Connection error: [_1]",
-             $DBI::errstr));
+             $UTF8DBI::errstr));
$driver;
}

However, that didn’t fix all the problems. The Perl CGI module was still working in Latin-1 (ISO-8859-1) mode. I could have wrapped it in a UTF8CGI module, but newer versions of the CGI module support Unicode, so I just upgraded the version of CGI bundled with Movable Type. I still needed to tell the CGI module that the character set in use was UTF-8. I could either do that every single time the CGI module was called or just change the default character set to UTF-8. Since this copy of the CGI module lives in the Movable Type extlib folder, I decided to modify its default character set.

--- extlib/CGI.pm.orig	2006-09-15 10:39:30.000000000 -0700
+++ extlib/CGI.pm	2006-09-15 10:39:59.000000000 -0700
@@ -517,8 +517,8 @@
$fh = to_filehandle($initializer) if $initializer;
-    # set charset to the safe ISO-8859-1
-    $self->charset('ISO-8859-1');
+    # set charset to utf-8
+    $self->charset('utf-8');
METHOD: {

I also set the utf8 mode for writing the files to disk.

--- lib/MT/FileMgr/Local.pm.orig	2006-09-27 06:56:39.000000000 -0700
+++ lib/MT/FileMgr/Local.pm	2006-09-27 06:57:36.000000000 -0700
@@ -75,6 +75,9 @@
binmode(FH);
binmode($from) if $fmgr->is_handle($from);
}
+    else {
+        binmode(FH, ":utf8");
+    }
## Lock file unless NoLocking specified.
flock FH, LOCK_EX unless $fmgr->{cfg}->NoLocking;
seek FH, 0, 0;
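
As a rough illustration of what the `:utf8` layer does on output, the sketch below writes a character string to a temporary file and compares the character count in memory with the byte count on disk. The file and the sample text are hypothetical; none of this is Movable Type code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "\x{263A}" (WHITE SMILING FACE) is one character but three bytes in UTF-8.
my $text = "smile \x{263A}\n";

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':utf8');              # same layer as in the patch above
print {$fh} $text;
close($fh) or die "close failed: $!";

my $chars_written = length($text);  # characters in memory
my $bytes_on_disk = -s $path;       # bytes after the layer encoded them

print "$chars_written characters, $bytes_on_disk bytes\n";
```

Without the layer, Perl would emit a "Wide character in print" warning for this string, which is the symptom the patch avoids.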

These changes caused problems with file uploads through the Movable Type interface. I expected this, since I had run into the same problem with PHP and mbstring. The following patch fixed the issue.

--- lib/MT/App/CMS.pm.orig	2006-10-08 21:17:11.000000000 -0700
+++ lib/MT/App/CMS.pm	2006-10-08 21:17:37.000000000 -0700
@@ -8334,6 +8334,7 @@
$app->validate_magic() or return;
my $q = $app->param;
+    $q->charset('iso-8859-1');
my($fh, $no_upload);
if ($ENV{MOD_PERL}) {
my $up = $q->upload('file');

Then it was time to comment out the liberally sprinkled code to switch off the utf8 flag in Movable Type.

--- lib/MT/I18N/default.pm.orig	2006-09-16 20:22:22.000000000 -0700
+++ lib/MT/I18N/default.pm	2006-09-16 20:23:26.000000000 -0700
@@ -292,7 +292,7 @@
$text = $class->_conv_to_utf8($text, $enc) if $enc ne 'utf-8';
Encode::_utf8_on($text);
$text = substr($text, $startpos, $length);
-    Encode::_utf8_off($text);
+#    Encode::_utf8_off($text);
$text = $class->_conv_from_utf8($text, $enc) if $enc ne 'utf-8';
$text;
}
@@ -322,7 +322,7 @@
}
}
-    Encode::_utf8_off($text) if $to eq 'utf-8';
+#    Encode::_utf8_off($text) if $to eq 'utf-8';
$text;
}
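
The reason the flag matters around `substr` can be seen in a small standalone sketch. It uses `Encode::decode` and `Encode::encode` rather than the flag-flipping calls in the patch (my substitution, for clarity), and the sample text is made up.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "na\xc3\xafve caf\xc3\xa9";    # UTF-8 bytes for "naïve café"

# Byte semantics: substr can cut a multi-byte character in half.
my $byte_cut = substr($bytes, 0, 3);       # "na" plus the first byte of "ï"

# Character semantics: decode first, then substr counts characters.
my $chars    = Encode::decode('UTF-8', $bytes);
my $char_cut = substr($chars, 0, 3);       # "naï"

# Encode only at the boundary where bytes are really needed.
my $out = Encode::encode('UTF-8', $char_cut);

printf "byte cut: %d bytes, char cut: %d chars, re-encoded: %d bytes\n",
    length($byte_cut), length($char_cut), length($out);
```

Keeping strings as character strings inside the application, as the patch does by leaving the flag on, means every later `substr` or `length` call keeps working on characters.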

Finally, I had to make changes to the MTHash plugin that I use to force comment previews. The Digest::SHA1 module only accepts bytes, so character strings had to be encoded as UTF-8 octets before being passed to any function in the module. Here is my patch:

--- lib/MT/App/Comments.pm.orig	2006-09-16 21:01:21.000000000 -0700
+++ lib/MT/App/Comments.pm	2006-09-16 21:03:08.000000000 -0700
@@ -266,9 +266,10 @@
require Digest::SHA1;
my $sha1 = Digest::SHA1->new;
-     $sha1->add($q->param('text') . $q->param('entry_id') . $app->remote_ip
-                . $q->param('author') . $q->param('email') . $q->param('url')
-                . $q->param('convert_breaks'));
+     my $octets = Encode::encode_utf8($q->param('text') . $q->param('entry_id') . $app->remote_ip
+                                      . $q->param('author') . $q->param('email') . $q->param('url')
+                                      . $q->param('convert_breaks'));
+     $sha1->add($octets);
my $salt_file = MT::ConfigMgr->instance->PluginPath .'/salt.txt';
my $FH;
open($FH, $salt_file) or die "cannot open file <$salt_file> ($!)";
--- plugins/MTHash.pl.orig	2006-09-16 20:29:22.000000000 -0700
+++ plugins/MTHash.pl	2006-09-16 20:57:22.000000000 -0700
@@ -32,7 +32,8 @@
or return $ctx->error($ctx->errstr);
my $sha1 = Digest::SHA1->new;
-  $sha1->add($content);
+  my $octets = Encode::encode_utf8($content);
+  $sha1->add($octets);
my $salt_file = MT::ConfigMgr->instance->PluginPath .'/salt.txt';
open(FH, $salt_file) or die "cannot open file <$salt_file> ($!)";
$sha1->addfile(FH);
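
The failure this patch works around can be reproduced outside Movable Type. The sketch below uses the core Digest::SHA module as a stand-in for Digest::SHA1 (an assumption made so the example runs on a stock Perl; both modules refuse wide characters), and the comment text is invented.

```perl
use strict;
use warnings;
use Encode ();
use Digest::SHA qw(sha1_hex);    # core stand-in for Digest::SHA1 (assumption)

my $comment = "Unicode \x{2713}";    # contains a character above 0xFF

# Hashing the character string directly croaks on the wide character.
my $direct = eval { sha1_hex($comment) };
my $error  = $@;

# Encoding to UTF-8 octets first gives the digest plain bytes to work on.
my $digest = sha1_hex(Encode::encode_utf8($comment));

print "direct call failed: ", ($error ? 'yes' : 'no'), "\n";
print "digest of octets:   $digest\n";
```

Whatever encoding is chosen, the code that generates the hash and the code that checks it must use the same one, which is why both Comments.pm and MTHash.pl are patched.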

One thing that I still need to do is to fix the Serializer and Un-serializer used by Movable Type plugins.

Author: Zack

