Movable Type Upgrade

I have upgraded to Movable Type 3.32 and have made modifications to run it natively with Unicode. I like some of the new features (better Unicode support and tags). If there are any problems due to the upgrade, drop me a line.

I just upgraded to Movable Type 3.32 and am now running Movable Type on UTF-8 natively (that’s my doing, not Six Apart’s), but more on the Unicode issues later.

Here’s what I like about the changes in MT 3.3.

  • Movable Type can now be configured (using the DeleteFilesAtRebuild configuration directive) to delete files made unnecessary by changes made in the administrative interface. Individual archive files are deleted when previously published entries are deleted or unpublished. Category archives are deleted when their corresponding categories are deleted.
  • Pings coming from the same IP address with the same Source URL are now silently discarded. A success value is, however, sent to the pinging server so that it doesn’t keep trying to re-ping. [Not the best approach but better than duplicate pings.]
  • The order of attributes specified in template tags is now observed and respected (e.g. trim_to="10" remove_html="1" is different than remove_html="1" trim_to="10"). In addition, the same attribute can now be processed multiple times if so desired (i.e., regex="abc" regex="def"). [Jacques must like that.]
  • Added textarea resizing controls to the template editing pages.
  • In version 3.2, using certain later versions of MySQL or PostgreSQL, some non-ASCII characters were not returned correctly from the database as originally written. This was caused by a mismatch between the character_set_client and character_set_connection variables. To fix this problem, we’ve added a configuration directive, SQLSetNames, which will inform the database of the character set being used by the client. The database character set must match the PublishCharset used by Movable Type. [I had implemented it already in my system.]
  • Implemented TrackBack transcoding between many character sets via Encode/JCode modules. This allows for correct display of TrackBacks sent in an encoding different than the recipient’s blog character encoding. [It is good that Six Apart is doing this, but I don’t like their implementation and prefer the one by Jacques Distler which I have been using already.]

I also like the inclusion of tags, though it will take me some time to populate my posts with tags and make any use of them.

The search page is barely working right now, but I’ll fix it soon. If there are any other problems with the upgrade (with commenting, trackbacks or anything else), please let me know.

I have also made some template changes. One is the inclusion of a menu bar at the top so that you can find the most common pages easily. Also, I am now including the sidebar in most pages other than individual entries.

Four Years

My weblog is 4 years old today. It has changed a lot as it has grown. It is now more a personal blog than a political one. I don’t know where it’s going but I do plan on continuing writing about the usual mix of topics.

It has been four years since I started blogging. In that time, there have been a total of 916 posts and 5,527 comments on this weblog.

There have been many changes over the life of this blog. I write much less frequently nowadays. The topics I post on are also different. This blog has become much more a personal blog than a political one, though that might change as the midterm elections draw near. I still blog about politics, religion and current events, but I don’t feel the urgency to post as soon as some news breaks. In some ways, I have also seceded from the blogosphere, since I don’t read as many blogs as I used to, nor do I link to or respond to a lot of the blog-chatter.

Because of the lack of urgency in posting about current events, as well as the lack of free time left over from other activities, I have a few dozen unfinished draft entries, some of which have no relevance any more. Who knows, I might still post them in the near future.

According to Google Analytics, I got 22,462 visits and 38,195 page views in the last month. Overall, in these four years, I am nearing a million page views.

Blogspot Blocked in Pakistan

Pakistani bloggers are livid as blogspot has been blocked by the Pakistani government.

The Muhammad cartoon controversy has claimed another victim. This time, the victim is not a person but a whole community of bloggers:

Pakistan telecom authorities have blocked several websites inviting people to draw cartoons of the Prophet Muhammad, it has emerged.

Instructions were issued to internet service providers across Pakistan on 27 February to block about a dozen websites of various origins.

[…] Bloggers in Pakistan first became aware of the ban on 28 February when they were unable to access a popular blog hosting site, Blogspot.

One of the blocked sites is hosted on Blogspot, which led to the blocking of all web journals hosted on the site.

The Pakistan bloggers found their blogs blocked, even though their blogs are not connected with the cartoons.

They say they have still been able to edit and update their blogs, but not able to read them.

BBC Urdu has a copy of the official letter banning about a dozen websites, one of which is hosted on Blogger’s free blog hosting at blogspot.com.

Pakistani bloggers are, of course, not happy. Here is the response Moiz Khan got from his ISP.

I have got the response from our Network administrator I.e All the subdomains of the blogspot hosted on the server 66.102.15.101 are blocked from the ITI on the strict instructions from the FIA because of the unathorized and anti Islamic contents so they have block the IP of the server rather then URL blocking. We are getting many other complains for the said IP and webhost. Our written complaint is already submitted to the higher authorities of ITI.

Teeth Maestro has started a “Don’t block the blog!” campaign, while a Google group called Action Group Against Blogspot Ban in Pakistan has also been started. Nouman has also been active and has found fame (and fortune?) by getting his post published on BBC Urdu. Danial suggests the following actions against the ban:

Via my Dad comes the news that the Supreme Court of Pakistan has gotten in on the act, though on the side of censorship.

The Supreme Court on Thursday directed the government to block internet sites displaying sacrilegious cartoons and called explanation from authorities concerned as to why these sites had not been blocked earlier.

A three-judge bench comprising Chief Justice Iftikhar Mohammad Chaudhry, Justice Faqir Mohammad Khokhar and Justice Mian Shakirullah Jan was hearing a petition of Dr Mohammad Imran Uppal.

It issued notices to Attorney General Makhdoom Ali Khan, Chairmen of the Pakistan Electronic Media and Regulatory Authority (Pemra) and the Pakistan Telecommunication Authority (PTA) for March 13.

The federal government, Ministry of Telecommunication, Pemra, PTA, Yahoo Incorporation USA and I & I Co, the host of blasphemous site, have been cited as respondents in the petition.

[…] Makhdoom Ali Khan was also asked by the court to explore legal ways to block objectionable material on websites.

“We will not accept any excuse or any technical objection on this issue as it concerns sentiments of entire Muslim Ummah,” CJ observed adding all concerned authorities would have to appear in the court on next hearing with report of concrete measures for implementation of court’s order.

It seems the court has little idea of how the Internet works and doesn’t know that blocking a website is not easy, and that lots of other sites often get blocked accidentally too (like all those blogs in this case).

Bypassing the censorship of blogspot in Pakistan is very easy. Here are some tips to defeat censors.

And finally, I don’t understand what the point is of blocking a dozen websites while the cartoons are now available on thousands of sites. If Pakistan wants to block every such site, it might have to block big names like Wikipedia, search engines like Google and Yahoo!, proxy servers and so on. In short, the whole Internet would have to be blocked.

Urdu phpBB

We are making a public release of an Urdu translation of phpBB. It is more than a translation since we made some Unicode-related modifications to the core code and have included an on-screen keyboard for users.

In case you were wondering where I was, among other things I was doing this.

As Nabeel mentioned before, we have been working on releasing a public version of Urdu phpBB.

phpBB, for those of you who do not know it, is described on its website thus:

phpBB is a high powered, fully scalable, and highly customizable Open Source bulletin board package. phpBB has a user-friendly interface, simple and straightforward administration panel, and helpful FAQ. Based on the powerful PHP server language and your choice of MySQL, MS-SQL, PostgreSQL or Access/ODBC database servers, phpBB is the ideal free community solution for all web sites.

phpBB also has internationalization and localization support. Hence, it has been translated into a large number of languages.

Adding Urdu support to phpBB is a lot more than simply adding an Urdu language pack to it. Some changes to the phpBB core are needed to make it work properly as a Unicode Urdu forum. Further, several changes are required in the templates to display Urdu text properly. The integration of Urdu WebPad makes it possible to edit Urdu text without installing Urdu language support or an Urdu keyboard, even on Windows 98 systems. Therefore we have decided to release Urdu phpBB as a pre-modded package. Later, we also plan to package the Urdu language and image files separately for those running multilingual phpBB boards. For those wanting to turn their existing phpBB forums into Urdu forums, we will release a MOD that details the steps needed to perform this conversion.

The version (0.5b2) of Urdu phpBB (based on phpBB version 2.0.19) we are releasing is beta quality software. Therefore, we need beta testers to help us debug it.

Most important of all, we need people who plan to install Urdu phpBB and start a forum. It will be good if they cover a variety of operating systems, database software and PHP versions.

We have also set up a test installation where you can register and help us find bugs. We also need some of those users to take the responsibility for testing moderator and admin functions.

You can download Urdu phpBB here. Information about the package, including requirements and installation instructions, is here.

If you find a bug in Urdu phpBB or want to request a new/improved feature, please open a ticket. Please do check the existing tickets before opening a new one to make sure that someone hasn’t reported the same issue before.

If you have any questions about Urdu phpBB or need any help with it, please visit the Mehfil forum for Urdu phpBB support.

Brass Crescent Awards 2005

The voting for the 2nd Brass Crescent Awards for the Islamic blogosphere has started. I am nominated for best blog. Go and vote.

It is time for the Brass Crescent Awards again.

In recognition of the growing talent and creativity of the Islamic blogosphere, alt.muslim and City of Brass are hosting the Second Annual Brass Crescent Awards. The Brass Crescent Awards are named for the Story of the City of Brass in the Thousand and One Nights.

Last year, I was nominated for too many categories and ended up being a runner-up in three.

Here is how I voted:

  • Best Middle-East/Asian Blogger: Cha’nad Bahraini.
  • Best Group Blog: Hu despite the fact that they don’t post often enough.
  • Most Deserving of Wider Recognition: Akram’s Razor which I discovered recently.
  • Best Thinker: Thebit of towards God is our journey whose posts are always thoughtful and thought-provoking and usually demand a second read.
  • Best Female Blog: Koonj. What can I say, I am addicted to blogs that talk about pregnancy, dissertation, baby names, etc.
  • Best Post or Series: That Terror Thing by Chapati Mystery.
  • Best Non-Muslim Blog: Velveteen Rabbi.
  • Best Blog: um, myself.

So go ahead and vote.

Half A Million Visits

Yesterday, this weblog had its 500,000th visitor. This has happened in about 3 years and 2 months of this blog being public. I take stock by talking about the search queries getting people here, where the visitors are coming from, etc.

Yesterday, this weblog had its 500,000th visitor. This has happened in about 3 years and 2 months of this blog being public.

As my blog has become popular and as blogging has become popular in general, spamming has become very common. Spambots try to leave comments and send trackbacks with all kinds of shopping, porn or other links. Last month, there were attempts to post 1,100 spam comments and 5,900 trackbacks. I say attempts because apart from a few (in single digits), none ever appeared on this site. So I guess I can say that I have effective anti-spam measures installed. For now.

Let’s take a look at some of the search results which lead random web visitors to my weblog. Here is how Google describes these search queries:

Top search query clicks are the top queries to Google that directed traffic to your site (based on the number of clicks to your pages in our search results).

And here are top search query clicks from Google to this blog.

  1. arranged marriages
  2. arranged marriage
  3. harun yahya
  4. level 2 ultrasound
  5. arrange marriages
  6. gays sex
  7. procrastination
  8. indian arranged marriages
  9. am i having a boy or girl
  10. urdu typing
  11. am i having a boy or a girl
  12. freedom at midnight
  13. crvo
  14. arrange marriage
  15. boy or girl quiz
  16. deals gap
  17. gender prediction tests
  18. baby gender quiz
  19. pregnancy pics
  20. kashmir pictures

I guess arranged marriage is quite a popular query for my blog. My post on the topic has consistently been one of the most popular here for a while. The posts relating to our pregnancy and Michelle’s birth are also popular topics.

My traffic is actually down nowadays from its peak in the months of April/May 2005. But I am still getting a decent 500 visits per day.

Here is a world map with the countries I have had visitors from shown in red.

Visitor Countries Map

The list of 167 countries follows:

Continue reading “Half A Million Visits”

Movable Type, MySQL, Perl, Unicode

Unicode is tough. Programs and languages still have issues with supporting Unicode properly. This is the story of my adventures with phpBB, PHP, Movable Type, MySQL, and Perl, wrestling with their Unicode support.

Unicode is tough. It’s tough because of bad programmers and their ingrained habits. Everybody loves shortcuts and programmers are no exception. That is why years after the introduction of Unicode, it is still difficult to do any real application programming with Unicode data.

Let us look at two examples.

We had a problem on our Urdu forum, which runs on phpBB. Any user who registered with a longish Urdu user name could not log into the forum again. The maximum length of the user name field was set to 25 characters, but people had problems even with user names 14 characters long. At first, we thought it was due to the size of the field in the MySQL database, but increasing that didn’t help. After much effort, I found out that when logging in (but not when registering), the user name was truncated to 25 characters using the PHP function substr. And of course it turned out that substr works only with bytes, not with characters. Each Urdu letter takes two bytes in UTF-8, so a 14-letter name is 28 bytes long and was being cut off after byte 25. I hope you understand the difference between characters and bytes, unlike a lot of programmers. So I had to replace substr with mb_substr. Yes, PHP has a separate set of string functions for multibyte encodings.
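To make the byte-versus-character difference concrete, here is a minimal sketch in Python rather than PHP. The two helper functions are only stand-ins for the behaviour of substr and mb_substr described above, and the user name is a made-up 14-letter string, not one from our forum.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Stand-in for a byte-oriented substr: cut the UTF-8 encoding at `limit` bytes."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> str:
    """Stand-in for a character-aware mb_substr: cut at `limit` characters."""
    return text[:limit]


# Hypothetical user name: the word "Urdu" in Urdu script, repeated and cut to 14 letters.
username = ("\u0627\u0631\u062f\u0648" * 4)[:14]

# What was stored at registration (no truncation happened there).
stored = username.encode("utf-8")

# What the login code compared against, with each kind of truncation.
login_bytes = truncate_bytes(username, 25)
login_chars = truncate_chars(username, 25).encode("utf-8")

print(len(username), "characters,", len(stored), "bytes")
print("byte truncation matches stored name:", login_bytes == stored)
print("character truncation matches stored name:", login_chars == stored)
```

With byte truncation the login name no longer equals the registered one, which is the symptom we saw; with character truncation a 14-letter name is well under the 25-character limit and passes through unchanged.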

The other example comes from Movable Type which claims:

Movable Type ships with full support for Unicode and international character sets. Official, fully-supported versions with translated documentation are now available in Japanese, French, German, Spanish, and Dutch.

However, take a look at this function in the Movable Type code:

sub no_utf8 {
    for (@_) {
        next if !defined $_;
        $_ = pack 'C0A*', $_;
    }
}

So what does this function do? It converts character-based strings to bytes. It is used to truncate the excerpts of incoming trackbacks to 255 bytes. Unicode must be pretty hard if programmers keep tripping over the difference between characters and bytes! This code has been in Movable Type since version 2.6 (or earlier) and is still there in the latest version 3.2.
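If a byte limit like this 255-byte one really has to be enforced, the cut can at least be made on a character boundary. The following is a rough Python sketch of one way to do that; it is not what Movable Type does, and the function name and sample excerpt are invented for illustration.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` so its UTF-8 encoding fits in `max_bytes`, never splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # A multi-byte character sliced in half at the end is dropped rather than kept as garbage.
    return cut.decode("utf-8", errors="ignore")


# Made-up excerpt: the word "Urdu" in Urdu script followed by a space, repeated many times.
excerpt = "\u0627\u0631\u062f\u0648 " * 100

short = truncate_utf8(excerpt, 255)
print(len(short), "characters,", len(short.encode("utf-8")), "bytes")
```

The result may come out a byte or two under the limit, which is the price of keeping every remaining character intact.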

Now, remember that Movable Type has a Japanese version which obviously would have issues with this. However, the character-to-byte conversion is still done in that version, but some additional processing is used to bring it back to characters. Why? Because Unicode is hard, of course.

That was just a preamble. Let’s now talk about the problem that prompted this post.

First, here are the different versions of the software we’ll talk about that I am using:

Since my webhost recently upgraded to Debian 3.1 and Perl 5.8.4, I thought I could do Unicode better on my MT blog here. Perl 5.8 is really the first version of Perl with Unicode support; Perl 5.6.1 claimed some support but there were lots of issues. I should know, I tried.

One of my ideas was to use actual Unicode characters instead of HTML entities and numeric character references. Right now I am using numeric character references for MathML (via the Numeric Entities plugin), Urdu dates (my own localization of MT) and smart quotes (via the Smartypants plugin). The Numeric Entities plugin has a mode where it can output UTF-8 characters, while the other two needed to be edited to use Unicode character literals instead of numeric character references (i.e., \x{hhhh} instead of &#xhhhh; where hhhh is a hexadecimal number).

When I made these changes, my web pages got garbled. The Unicode characters that I had changed from entities showed up okay, but the rest of the non-ASCII Unicode characters on the page were garbled into accented roman characters. When I counted them, there were more of these roman characters than the Unicode characters they had replaced on the web page. Can we say: Why is Unicode hard? Bytes vs characters, sir!

At first, I was stumped. How could the Numeric Entities plugin affect characters that were not even processed by it? Then the harsh truth dawned on me. Perl uses a UTF8 flag to mark Unicode strings. When two strings are concatenated but only one is a Unicode string, the other must be converted into Unicode before the concatenation. By default, strings in Perl are Latin1 or ISO-8859-1. So what was happening was that stuff wasn’t marked with the UTF8 flag in Movable Type, but the explicitly defined Unicode characters generated by the Numeric Entities plugin were so marked. When these were concatenated to form the web page output, anything other than the characters converted from entities by the Numeric Entities plugin was considered to be Latin1 and hence converted byte-by-byte to Unicode. Garbage (no UTF8 flag) in, garbage (garbled Unicode characters) out! I confirmed this by calling is_utf8() at different stages in Movable Type.
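The effect can be imitated in a few lines of Python. This is only an analogy for the Perl behaviour described above: Python makes the bytes-to-characters step explicit, so the sketch decodes the unflagged data as Latin1 by hand, which is what Perl did implicitly. The sample strings are made up.

```python
# User-entered text as Movable Type held it: raw UTF-8 bytes with no UTF8 flag.
# (This is the word "Urdu" in Urdu script, four characters.)
page_bytes = "\u0627\u0631\u062f\u0648".encode("utf-8")

# A character generated programmatically by a plugin: a left curly quote, flagged as Unicode.
plugin_text = "\u201c"

# Before concatenating, the unflagged bytes are upgraded as if each byte were a Latin1 character.
upgraded = page_bytes.decode("latin-1")
combined = plugin_text + upgraded

# The page is then written out as UTF-8, which is what the browser decodes.
output = combined.encode("utf-8")
seen_in_browser = output.decode("utf-8")

print(len(page_bytes.decode("utf-8")), "characters intended")
print(len(upgraded), "garbled characters shown")
```

The curly quote survives, while each two-byte Urdu letter turns into two Latin1 characters, which is why there were more garbled characters on the page than original ones.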

To recap, this meant that any programmatic insertion of Unicode characters was properly marked as Unicode. However, since none of the Unicode data entered by the user (in the entry or comment fields or even in the templates) was marked as such, the presence of programmatic Unicode characters garbled the rest of the non-ASCII Unicode characters.

The next step was to find out why and where this was happening. The first thing I found was that all the data in my MySQL database was marked Latin1. Why? Legacy issues: when I created that database, my host was running MySQL 4.0, which had only one character set: Latin1. MySQL 4.1 added support for Unicode (and lots of other character sets), so that you can now assign character sets and collations not only to databases and tables but also to columns. That is great, but if you want databases created in 4.0 or earlier to keep running without any conversion, all such databases need to be assigned the latin1 character set, which is what my host did. As to why I was seeing the correct Unicode characters on my website and in the MT interface: the data was simply stored and returned as bytes, so the UTF-8 sequences passed through untouched even though MySQL labeled them latin1.

Time to fix the MySQL database character set then. First, I tried the easy way, but that didn’t work for some odd reason. So I had to do it the hard way, which involved changing the type of each column from CHAR(n), VARCHAR(n), TEXT, MEDIUMTEXT, TINYTEXT or LONGTEXT to its corresponding binary type (BINARY(n), VARBINARY(n), BLOB, MEDIUMBLOB, TINYBLOB or LONGBLOB) and then back, along with changing the character set to utf8. Here are the statements:

ALTER TABLE t MODIFY column BINARY(n) | VARBINARY(n) | BLOB | MEDIUMBLOB | TINYBLOB | LONGBLOB [ [DEFAULT | NOT] NULL];
ALTER TABLE t MODIFY column CHAR(n) | VARCHAR(n) | TEXT | MEDIUMTEXT | TINYTEXT | LONGTEXT CHARACTER SET utf8 [ [DEFAULT | NOT] NULL];

Replace t by the table name, column by the column name, and n by the length of the field for CHAR and VARCHAR types. Also, choose the corresponding data types for the specific column. One complication (other than doing this individually for all columns in all tables) is that you have to specify any attributes that were originally there for the column, otherwise they get dropped. These would be things like whether the default value for the field is NULL or the field cannot have a NULL value, etc. The easiest way to do this is using phpMyAdmin instead of typing these statements since it lets you alter specific characteristics.
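For anyone who would rather generate these statements than click through phpMyAdmin, here is a small Python sketch that builds the pair of ALTER TABLE statements for one column. It only produces text and does not talk to a database. The function name is my own invention, the table and column in the usage line are purely illustrative, and the attributes string has to be supplied by you from the existing column definition.

```python
import re

# Text column types and the binary types they pass through on the way to utf8.
BINARY_EQUIVALENT = {
    "CHAR": "BINARY",
    "VARCHAR": "VARBINARY",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "TINYTEXT": "TINYBLOB",
    "LONGTEXT": "LONGBLOB",
}


def conversion_statements(table, column, col_type, attributes=""):
    """Return the two ALTER TABLE statements (to binary, then to utf8 text) for one column.

    `col_type` is the current type, e.g. "VARCHAR(255)" or "TEXT".
    `attributes` carries anything that must be preserved, e.g. "NOT NULL".
    """
    match = re.fullmatch(r"([A-Za-z]+)(\(\d+\))?", col_type.strip())
    if not match or match.group(1).upper() not in BINARY_EQUIVALENT:
        raise ValueError(f"not a text column type: {col_type!r}")
    base = match.group(1).upper()
    length = match.group(2) or ""
    suffix = f" {attributes}" if attributes else ""
    to_binary = f"ALTER TABLE {table} MODIFY {column} {BINARY_EQUIVALENT[base]}{length}{suffix};"
    to_utf8 = f"ALTER TABLE {table} MODIFY {column} {base}{length} CHARACTER SET utf8{suffix};"
    return [to_binary, to_utf8]


# Illustrative table and column names; check your own schema for the real definitions.
for statement in conversion_statements("mt_entry", "entry_title", "VARCHAR(255)", "DEFAULT NULL"):
    print(statement)
```

Back up the database before running anything generated this way, and review each statement against the column's actual definition, since any attribute left out of the statement gets dropped.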

I went through this for the more than 50 columns that needed changing for my Movable Type database. It was good to see the Urdu characters finally appearing in phpMyAdmin as I browsed the database. Then I opened the MT interface and saw that all Urdu characters were appearing as “?”. What the ****? Then I remembered: Unicode is hard.

Suddenly a light went on and I remembered this comment by Asif when he transferred our Urdu Wiki to its present location. Reading up on that I realized that in addition to database, table and column character sets, there were also character sets defined for server, client and connection. And obviously these were set to a default of latin1.
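The question marks make sense once you picture the server converting proper utf8 column data for a client it believes wants latin1. The Python sketch below only imitates that conversion with the standard codecs; the two function names are hypothetical and no MySQL is involved.

```python
# The word "Urdu" in Urdu script, as it now sits in a utf8 column.
urdu = "\u0627\u0631\u062f\u0648"


def send_to_latin1_client(text: str) -> bytes:
    """Imitate converting column data for a latin1 connection.

    Characters with no latin1 equivalent are replaced by a question mark.
    """
    return text.encode("latin-1", errors="replace")


def send_to_utf8_client(text: str) -> bytes:
    """Imitate the same data going out over a connection declared as utf8."""
    return text.encode("utf-8")


print(send_to_latin1_client(urdu))
print(send_to_utf8_client(urdu).decode("utf-8") == urdu)
```

Once the connection is declared as utf8, the data round-trips intact, which is the effect the connection character set has.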

The fix is to send SET NAMES utf8 to the server right after connecting. I found where to put this statement in mt/lib/MT/ObjectDriver/DBI/mysql.pm, but came up with a more elegant solution than hard-coding it for all cases. So here’s my patch (for Movable Type 3.2) for this issue:

--- lib/MT/ConfigMgr.pm.orig    2005-08-16 19:37:11.000000000 -0700
+++ lib/MT/ConfigMgr.pm 2005-11-01 19:48:26.000000000 -0800
@@ -151,6 +151,7 @@
['OutboundTrackbackLimit', { Default => 'any' }],
['OutboundTrackbackDomains', { Type => 'ARRAY' } ],
['IndexBasename', {Default => 'index'}],
+        ['SQLSetNames', {Default => 0}],
]);
}
--- lib/MT/ObjectDriver/DBI/mysql.pm.orig       2005-07-29 13:41:11.000000000 -0700
+++ lib/MT/ObjectDriver/DBI/mysql.pm    2005-11-01 19:45:32.000000000 -0800
@@ -49,6 +49,11 @@
{ RaiseError => 0, PrintError => 0 })
or return $driver->error(MT->translate("Connection error: [_1]",
$DBI::errstr));
+    if ($cfg->SQLSetNames && (my $c = lc $cfg->PublishCharset)) {
+        my %Charset = ('utf-8' => 'utf8', 'shift_jis' => 'sjis', 'euc-jp' => 'ujis');
+        my $c = $Charset{$c} ? $Charset{$c}  : $c;
+        $driver->{dbh}->do("SET NAMES " . $c);
+    }
$driver;
}

And of course add SQLSetNames 1 in your mt-config.cgi or mt.cfg configuration file.

Elated, I opened up MT and checked. The question marks had disappeared. So far, so good. Let’s check if the original Unicode problem was fixed. Oh no! It isn’t. I guess we are back to square one. Did I say something about Unicode being hard?

What is the reason for the problem now? MySQL has nice UTF-8 data which it passes along to MT’s Perl functions. Why is the data still not marked with the UTF-8 flag? Movable Type uses the Perl modules DBI and DBD::mysql to access the MySQL database. And guess what? They don’t have any Unicode support. In fact, forget marking the UTF-8 flag properly: according to this, DBD::mysql doesn’t even preserve the UTF-8 flag when it’s already there.

In the end, I have 3 options:

  1. Wait for Unicode support for DBI/DBD::mysql which might be a long time since nobody is sure if it should be provided by the database-independent interface DBI or by the MySQL driver DBD::mysql or both together in some way.
  2. Use decode_utf8 on every output from the database. This is not very easy to do.
  3. Use a patch which blesses all database data (yes that includes the binary fields) as UTF-8 based on a flag you set when connecting to the database.
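To show what option 2 amounts to, and why binary fields matter for option 3, here is a rough Python sketch of decoding each fetched row while leaving binary columns alone. It is an analogy, not Perl or DBI code; the function name, the column names and the sample row are all hypothetical.

```python
def decode_row(row, binary_columns=frozenset()):
    """Decode UTF-8 byte strings in a fetched row, skipping columns known to be binary."""
    decoded = {}
    for name, value in row.items():
        if isinstance(value, bytes) and name not in binary_columns:
            decoded[name] = value.decode("utf-8")
        else:
            decoded[name] = value
    return decoded


# Hypothetical row as a driver without Unicode support might return it: text arrives as bytes.
row = {
    "entry_id": 42,
    "entry_title": "\u0627\u0631\u062f\u0648".encode("utf-8"),
    "entry_thumbnail": b"\x89PNG\r\n",  # binary data that is not valid UTF-8
}

clean = decode_row(row, binary_columns={"entry_thumbnail"})
print(type(clean["entry_title"]).__name__, type(clean["entry_thumbnail"]).__name__)
```

The awkward part is that something has to know which columns are binary; treating everything as UTF-8 either fails or mislabels data like the thumbnail above.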

None of these options is very appealing and all have side-effects and problems associated with them. My plan is to set up a development subdomain and then test out options 2 and 3 there thoroughly before bringing them online for my weblog.

POSTSCRIPT: For something funny about Perl’s Unicode support, read about the difference between utf8 and UTF-8.

UPDATE: Here’s a DBI modification that does the same thing as option 3 above. See also an amendment.

Blog Anniversary

My blog turns three years old today.

Today marks three years since I started this blog on Blogspot. I didn’t post much the first few months and the blog took off only around November 2002. Since then, there have been many changes. I blog a lot less than I used to, from a few posts a day at the peak to one or two posts every week now. I also post less about politics now. It’s not that I am less interested in politics; it’s just that I am frustrated and I don’t want all my political posts to sound like rants.

One reason for my blogging less is that I have become more involved in other stuff. I am a daddy blogger too, i.e. I have a weblog where I post about Michelle’s antics, pictures, etc. In fact, that blog is updated more often than this one. Plus I have gotten involved in a small but growing community of Urdu bloggers. We are trying to set up resources for the use of Urdu on the Internet, especially for blogging. The latest development is the setting up of a new website at Urduweb.org. We are now working to move our different Urdu-related resources to this site.

Treo Blogging

I forgot to mention that the previous entry was written on my new Treo 650. Typing on its small keyboard takes a little getting used to.

I typed the previous post using the standard Movable Type interface in the Blazer web browser. That is not the best idea since the MT interface is not good for the small PDA screen. I also ran into a character limit (2000?) in the text box for entry body in Blazer.

I am exploring other options for blogging on the Treo and would appreciate any suggestions.

Let’s test if I can post a photo I took with my Treo. The “Upload File” thing doesn’t do anything in Blazer, and the JavaScript buttons on top of the Entry Body textarea don’t appear at all. Does this thing have JavaScript?

Postscript: Publishing an entry in Blazer works but the pings are not sent because of the way page refresh is used by MT.

Urdu Domain Poll

We are thinking of getting a separate domain for all the Urdu-related stuff we are doing. Help us decide on a domain name by voting for your choice.

We are thinking of getting a separate domain for all the Urdu-related stuff we are doing. This would include the following:

Help us decide on a domain name by voting in the poll below.


If you don’t like any of the choices available, feel free to suggest any others in the comments.

UPDATE: The poll will close on Friday, May 13.