Spence Green

التكرار يعلم الحمار ("Repetition teaches the donkey")

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘HOWTO’ Category

HOWTO: Working with Python, Unicode, and Arabic

with 11 comments

Working with non-European languages such as Arabic and Chinese requires a practical understanding of Unicode. My research group uses Java for larger applications, and although Java represents all strings in Unicode, writing small Java programs for the many data manipulation tasks that arise while preparing corpora for translation is often cumbersome. Fluency in a dynamically typed scripting language is therefore immensely useful in this domain. I prefer Python for its intuitive Unicode support and minimalist syntax. This article provides sample Python code for several common use cases that require careful attention to string encodings. Perl offers analogous capabilities, but Ruby’s Unicode support is somewhat limited as of Ruby 1.9.
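As a small taste of the kind of task in question, here is a minimal sketch that decodes UTF-8 bytes into a Unicode string and inspects its code points. It assumes Python 3, where `str` is a Unicode type, so it is not taken from the original article. The sample word and the `describe` helper are hypothetical and chosen only for illustration.

```python
import unicodedata

# Hypothetical sample: the Arabic word for "book" (kaf, teh, alef, beh),
# written with escapes so the source file's own encoding does not matter.
word = "\u0643\u062A\u0627\u0628"

# Simulate raw input as it might arrive from a UTF-8 corpus file.
raw = word.encode("utf-8")

# Decode once at the input boundary; everything after this point works
# with code points instead of bytes.
text = raw.decode("utf-8")


def describe(s):
    """Return (code point, Unicode name) pairs for each character in s.

    Hypothetical helper, for illustration only.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]


print(len(raw))   # length in bytes
print(len(text))  # length in code points
for code_point, name in describe(text):
    print(code_point, name)
```

The byte length and the code point length differ because each of these Arabic letters takes two bytes in UTF-8, which is exactly the kind of mismatch that makes it worth decoding early and encoding late.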

Written by Spence

December 19th, 2008 at 4:52 pm

Posted in HOWTO