Tuesday, November 28, 2006
Movable Type and Unicode
I have been trying to get Movable Type to run Unicode natively for a while. When Movable Type was upgraded to version 3.3, I saw my chance. The new version already includes much of the encoding and decoding code I needed, which made my job much easier than before.
If you remember my previous travails, the DBD::mysql module lacked UTF-8 support. Almost immediately after I made my changes, the developer release of DBD::mysql finally included a UTF-8 patch, but that was too late for me. Besides, DBD::mysql is somewhat complicated, so I am going to wait until the patch is included in a regular release.
Instead, I set the UTF-8 flag on everything coming out of the database using a wrapper around the DBI module. I used Pavel Kudinov’s code for that, which is given below.
# UTF8DBI.pm re-implementation by Pavel Kudinov http://search.cpan.org/~kudinov/
# originally from: http://dysphoria.net/code/perl-utf8/
package UTF8DBI ; use base DBI ;
package UTF8DBI::db; use base DBI::db;
package UTF8DBI::st; use base DBI::st;
sub _utf8_() {
use Encode;
if (ref $_ eq 'ARRAY'){ &_utf8_() foreach @$_ }
elsif (ref $_ eq 'HASH' ){ &_utf8_() foreach values %$_ }
else { Encode::_utf8_on($_) };
$_;
};
sub fetch { return _utf8_ for shift->SUPER::fetch (@_) };
sub fetchrow_arrayref { return _utf8_ for shift->SUPER::fetchrow_arrayref(@_) };
sub fetchrow_hashref { return _utf8_ for shift->SUPER::fetchrow_hashref (@_) };
sub fetchall_arrayref { return _utf8_ for shift->SUPER::fetchall_arrayref(@_) };
sub fetchall_hashref { return _utf8_ for shift->SUPER::fetchall_hashref (@_) };
sub fetchcol_arrayref { return _utf8_ for shift->SUPER::fetchcol_arrayref(@_) };
sub fetchrow_array { @{shift-> fetchrow_arrayref(@_)} };
1;
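As a side note on what the Encode::_utf8_on call in the wrapper does: it converts nothing, it only tells Perl to read the bytes already in the string as UTF-8 encoded characters. The following is a minimal sketch and not part of the wrapper; the sample string and variable names are made up for illustration, and Encode::decode is shown only as a validating alternative for comparison.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: the UTF-8 bytes of the Urdu word for "Urdu"
# (U+0627 U+0631 U+062F U+0648), as a driver without UTF-8 support
# would hand them back.
my $bytes = "\xD8\xA7\xD8\xB1\xD8\xAF\xD9\x88";

# Flip the flag on a copy: same buffer contents, now read as characters.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

# Decoding gives the same characters, but also checks the input.
my $decoded = Encode::decode('UTF-8', $bytes);

printf "bytes: %d, flagged: %d, decoded: %d\n",
    length($bytes), length($flagged), length($decoded);
# prints "bytes: 8, flagged: 4, decoded: 4"
```

Setting the flag is cheap, but it trusts the database to return well-formed UTF-8; decoding is the safer choice when that is in doubt.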
With the UTF8DBI wrapper in place, I needed to replace calls to the DBI module with calls to the UTF8DBI module, as shown in the patches below.
--- lib/MT/ObjectDriver/DBI.pm.orig 2006-09-06 19:27:17.000000000 -0700
+++ lib/MT/ObjectDriver/DBI.pm 2006-09-06 19:23:09.000000000 -0700
@@ -7,7 +7,7 @@
package MT::ObjectDriver::DBI;
use strict;
-use DBI;
+use UTF8DBI;
use MT::Util qw( offset_time_list );
use MT::ObjectDriver;
--- lib/MT/ObjectDriver/DBI/mysql.pm.orig 2006-09-06 19:26:55.000000000 -0700
+++ lib/MT/ObjectDriver/DBI/mysql.pm 2006-09-06 19:24:20.000000000 -0700
@@ -93,10 +93,10 @@
$dsn .= ';hostname=' . $cfg->DBHost if $cfg->DBHost;
$dsn .= ';mysql_socket=' . $cfg->DBSocket if $cfg->DBSocket;
$dsn .= ';port=' . $cfg->DBPort if $cfg->DBPort;
- $driver->{dbh} = DBI->connect($dsn, $cfg->DBUser, $cfg->DBPassword,
+ $driver->{dbh} = UTF8DBI->connect($dsn, $cfg->DBUser, $cfg->DBPassword,
{ RaiseError => 0, PrintError => 0 })
or return $driver->error(MT->translate("Connection error: [_1]",
- $DBI::errstr));
+ $UTF8DBI::errstr));
$driver;
}
However, that didn’t fix all the problems. The Perl CGI module was still working in Latin-1 mode. I could have wrapped it in a UTF8CGI module, but newer versions of the CGI module support Unicode, so I just upgraded the version of CGI bundled with Movable Type. I still needed to tell the CGI module that the character set in use was UTF-8. I could either do that every single time the CGI module was called or set the default character set to UTF-8. Since this CGI module lives in the Movable Type extlib folder, I decided to modify its default character set.
--- extlib/CGI.pm.orig 2006-09-15 10:39:30.000000000 -0700
+++ extlib/CGI.pm 2006-09-15 10:39:59.000000000 -0700
@@ -517,8 +517,8 @@
$fh = to_filehandle($initializer) if $initializer;
- # set charset to the safe ISO-8859-1
- $self->charset('ISO-8859-1');
+ # set charset to utf-8
+ $self->charset('utf-8');
METHOD: {
I also set utf8 mode on the filehandle used for writing files to disk.
--- lib/MT/FileMgr/Local.pm.orig 2006-09-27 06:56:39.000000000 -0700
+++ lib/MT/FileMgr/Local.pm 2006-09-27 06:57:36.000000000 -0700
@@ -75,6 +75,9 @@
binmode(FH);
binmode($from) if $fmgr->is_handle($from);
}
+ else {
+ binmode(FH, ":utf8");
+ }
## Lock file unless NoLocking specified.
flock FH, LOCK_EX unless $fmgr->{cfg}->NoLocking;
seek FH, 0, 0;
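To illustrate what binmode(FH, ":utf8") changes, here is a minimal sketch that writes a character string through that layer and reads the file back as raw bytes. It is not Movable Type code; the temporary file and the sample string are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Hypothetical sample text: four Urdu characters.
my $text = "\x{0627}\x{0631}\x{062F}\x{0648}";

# Write through the same layer the patch sets on FH.
my ($out, $path) = File::Temp::tempfile(UNLINK => 1);
binmode($out, ':utf8');
print {$out} $text;
close($out) or die "cannot close $path: $!";

# Read the file back as raw bytes to see what landed on disk.
open(my $in, '<', $path) or die "cannot open $path: $!";
binmode($in);
my $on_disk = do { local $/; <$in> };
close($in);

printf "%d characters written, %d bytes on disk\n",
    length($text), length($on_disk);
```

Without the layer, printing this string would trigger a "Wide character in print" warning. The stricter ':encoding(UTF-8)' layer is another option when validation matters.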
These changes caused problems with file uploads through the Movable Type interface. I expected this since I have run into this problem with PHP and mbstring as well. The following patch fixed this issue.
--- lib/MT/App/CMS.pm.orig 2006-10-08 21:17:11.000000000 -0700
+++ lib/MT/App/CMS.pm 2006-10-08 21:17:37.000000000 -0700
@@ -8334,6 +8334,7 @@
$app->validate_magic() or return;
my $q = $app->param;
+ $q->charset('iso-8859-1');
my($fh, $no_upload);
if ($ENV{MOD_PERL}) {
my $up = $q->upload('file');
Then it was time to comment out the code, liberally sprinkled through Movable Type, that switches off the UTF-8 flag.
--- lib/MT/I18N/default.pm.orig 2006-09-16 20:22:22.000000000 -0700
+++ lib/MT/I18N/default.pm 2006-09-16 20:23:26.000000000 -0700
@@ -292,7 +292,7 @@
$text = $class->_conv_to_utf8($text, $enc) if $enc ne 'utf-8';
Encode::_utf8_on($text);
$text = substr($text, $startpos, $length);
- Encode::_utf8_off($text);
+# Encode::_utf8_off($text);
$text = $class->_conv_from_utf8($text, $enc) if $enc ne 'utf-8';
$text;
}
@@ -322,7 +322,7 @@
}
}
- Encode::_utf8_off($text) if $to eq 'utf-8';
+# Encode::_utf8_off($text) if $to eq 'utf-8';
$text;
}
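One reason the flag has to stay on once output goes through a utf8 layer: a string whose flag was switched off looks like Latin-1 bytes to Perl and gets encoded a second time. The sketch below shows this under stated assumptions; the helper through_utf8_layer and the sample string are hypothetical, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory buffer and return the resulting bytes.
sub through_utf8_layer {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return $buffer;
}

my $chars  = "\x{0627}\x{0631}\x{062F}\x{0648}";  # 4 characters, flag on
my $octets = $chars;
Encode::_utf8_off($octets);                       # same data, now seen as 8 bytes

printf "flag on:  %d bytes written\n", length(through_utf8_layer($chars));
printf "flag off: %d bytes written\n", length(through_utf8_layer($octets));
# The flagged string comes out as 8 bytes; the unflagged one is
# encoded again and comes out as 16 bytes of mojibake.
```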
Finally, I had to make changes to the MTHash plugin that I use to force comment previews. The Digest::SHA1 module only accepts bytes, so the Unicode characters had to be encoded as UTF-8 bytes before being passed to any function in the module. Here is my patch:
--- lib/MT/App/Comments.pm.orig 2006-09-16 21:01:21.000000000 -0700
+++ lib/MT/App/Comments.pm 2006-09-16 21:03:08.000000000 -0700
@@ -266,9 +266,10 @@
require Digest::SHA1;
my $sha1 = Digest::SHA1->new;
- $sha1->add($q->param('text') . $q->param('entry_id') . $app->remote_ip
- . $q->param('author') . $q->param('email') . $q->param('url')
- . $q->param('convert_breaks'));
+ my $octets = Encode::encode_utf8($q->param('text') . $q->param('entry_id') . $app->remote_ip
+ . $q->param('author') . $q->param('email') . $q->param('url')
+ . $q->param('convert_breaks'));
+ $sha1->add($octets);
my $salt_file = MT::ConfigMgr->instance->PluginPath .'/salt.txt';
my $FH;
open($FH, $salt_file) or die "cannot open file <$salt_file> ($!)";
--- plugins/MTHash.pl.orig 2006-09-16 20:29:22.000000000 -0700
+++ plugins/MTHash.pl 2006-09-16 20:57:22.000000000 -0700
@@ -32,7 +32,8 @@
or return $ctx->error($ctx->errstr);
my $sha1 = Digest::SHA1->new;
- $sha1->add($content);
+ my $octets = Encode::encode_utf8($content);
+ $sha1->add($octets);
my $salt_file = MT::ConfigMgr->instance->PluginPath .'/salt.txt';
open(FH, $salt_file) or die "cannot open file <$salt_file> ($!)";
$sha1->addfile(FH);
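The encode-before-hashing step can be tried in isolation. This is a minimal sketch that uses the core Digest::SHA module as a stand-in for Digest::SHA1 (an assumption made so the example runs without extra installs); the sample content is hypothetical.

```perl
use strict;
use warnings;
use Encode ();
use Digest::SHA ();   # core module, standing in for Digest::SHA1

# Hypothetical comment text containing characters above U+00FF.
my $content = "\x{0627}\x{0631}\x{062F}\x{0648}";

# Feeding characters straight to the digest croaks, because a wide
# character does not fit in a byte.
my $ok = eval { Digest::SHA->new(1)->add($content); 1 };
print "hashing characters directly: ", ($ok ? "ok" : "failed"), "\n";

# Encoding to UTF-8 octets first gives the digest plain bytes.
my $octets = Encode::encode_utf8($content);
my $sha1   = Digest::SHA->new(1);          # 1 selects SHA-1
$sha1->add($octets);
my $digest = $sha1->hexdigest;
print "sha1 of encoded text: $digest\n";
```

Whatever encoding is chosen has to be the same on both sides that compute the hash, or the digests will not match.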
One thing that I still need to do is to fix the Serializer and Un-serializer used by Movable Type plugins.
Tags: movabletype, perl, unicode
Posted by Zack at November 28, 2006 2:09 PM in Internet