Spence Green

التكرار يعلم الحمار ("Repetition teaches the donkey")

I am a Ph.D. student in Computer Science at Stanford University. In addition to computers and languages, my interests include travel, running, and scuba diving.

HOWTO: Working with Python, Unicode, and Arabic

When working with non-European languages such as Arabic and Chinese, a practical understanding of Unicode is necessary. My research group uses Java for larger applications, and although Java represents all strings in Unicode, it is often cumbersome to write small Java applications for the various data manipulation tasks that appear while preparing corpora for translation. Therefore, fluency in one of the dynamically-typed scripting languages can be immensely useful in this particular domain. I prefer Python for its intuitive Unicode support and minimalist syntax. This article provides sample Python code for several common use cases that require particular consideration of string encodings. Perl offers analogous capabilities, but Ruby’s Unicode support is somewhat limited as of Ruby 1.9.

(Note that this document is valid for Python 2.5.x only; Python3k introduces numerous incompatibilities.)

Working with Strings
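Python supports two methods of constructing Unicode strings. The unicode() built-in function decodes a byte string using the encoding you name. If you omit the encoding, Python 2 falls back to its default encoding, which is normally ASCII rather than utf-8, so it is safest to pass the encoding explicitly. The u'' shorthand notation produces the same kind of object from a literal: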

best = unicode('?????', encoding='utf-8')			#Preferred method
old = u'?????'									#Deprecated in Python3k
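The difference between a Unicode string and its encoded bytes can be checked directly. The following is a minimal sketch, not part of the original samples: the sample word is an assumption, and it is spelled with \u escapes so that the snippet stays pure ASCII. It uses the decode() method, which in Python 2 does the same job as unicode(bytes, encoding='utf-8'), and it is intended to run under both Python 2 and Python 3.

```python
# Hypothetical sample word: kaf, teh, alef, beh (U+0643 U+062A U+0627 U+0628)
word = u'\u0643\u062a\u0627\u0628'

utf8_bytes = word.encode('utf-8')       # Unicode string -> utf-8 byte string
decoded = utf8_bytes.decode('utf-8')    # utf-8 byte string -> Unicode string

char_count = len(word)                  # length in characters
byte_count = len(utf8_bytes)            # length in bytes (two per letter here)
```

The two lengths differ because each letter in the Arabic block occupies two bytes in utf-8, which is why byte strings and Unicode strings should not be mixed when slicing or counting.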

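To convert a byte string into a Unicode string, use the string method decode(); to convert a Unicode string into a particular encoding, use encode():

ascii_string = 'My anglo-centric string...'
unicode_string = ascii_string.decode('ascii')		#Byte string -> Unicode string

#Concatenation works as expected
unicode_string = unicode_string + unicode('???? ??????', encoding='utf-8')

#print does not require additional parameters; it encodes for the terminal
print ascii_string
print unicode_string

#utf-8 can represent any Unicode string
utf8_bytes = unicode_string.encode('utf-8')

#But this throws a UnicodeEncodeError when the string contains non-ASCII characters
print unicode_string.encode('ascii')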
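When the target encoding cannot represent a character, the optional second argument of encode() controls what happens instead of an exception. The sketch below is an illustration under stated assumptions: the helper name to_ascii and the sample string are hypothetical, while the error handlers 'replace', 'ignore', and 'backslashreplace' are standard ones.

```python
mixed = u'Arabic: \u0637'   # ASCII text followed by one Arabic letter

def to_ascii(text, errors='strict'):
    """Encode text as ASCII; return None instead of raising on failure."""
    try:
        return text.encode('ascii', errors)
    except UnicodeEncodeError:
        return None

strict_result = to_ascii(mixed)                # fails, so None is returned
replaced = to_ascii(mixed, 'replace')          # the letter becomes a question mark
ignored = to_ascii(mixed, 'ignore')            # the letter is dropped
escaped = to_ascii(mixed, 'backslashreplace')  # the letter becomes a \u0637 escape
```

Of the three handlers, only 'backslashreplace' keeps enough information to recover the original character later, so it is usually the better choice for log output.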

Sometimes it is necessary to manipulate individual Unicode characters. This is particularly useful for non-printing characters like the right-to-left override (0x202E):

print unichr(1591)					#0x0637 in the Unicode charts
print ord(u'\u0637')				#1591
print unichr(ord(u'\u0637'))		#unichr() and ord() are inverses
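For characters that cannot be seen on screen, the standard unicodedata module can report what a code point actually is. This is a small sketch rather than part of the original article: the helper name describe is hypothetical, and the fallback to chr() is only there so that the snippet also runs on Python 3, where unichr() no longer exists.

```python
import unicodedata

try:
    unichr              # Python 2
except NameError:
    unichr = chr        # Python 3 folded unichr() into chr()

def describe(code_point):
    """Return (U+XXXX label, official character name, general category)."""
    ch = unichr(code_point)
    label = u'U+%04X' % code_point
    return (label, unicodedata.name(ch, u'<unnamed>'), unicodedata.category(ch))

tah = describe(0x0637)   # an ordinary Arabic letter
rlo = describe(0x202E)   # the right-to-left override, a formatting character
```

The category value distinguishes letters ('Lo') from invisible formatting characters ('Cf'), which is a convenient way to find stray directional marks in a corpus.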

Manipulating Files
The syntax for reading and writing Unicode files does not differ from that used for ASCII files. Opening Unicode files requires the codecs library, however:

import codecs

IN_FILE = codecs.open('test.ar','r', encoding='utf-8')
OUT_FILE = codecs.open('out.ar', 'w', encoding='utf-8')

for line in IN_FILE:
	OUT_FILE.write(line.rstrip() + unicode('\n', encoding='utf-8'))

IN_FILE.close()
OUT_FILE.close()
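The same read-and-rewrite loop can be wrapped in a function and tried on throwaway files. The sketch below assumes a hypothetical helper name, copy_stripped, and writes its sample data to a temporary directory so that nothing in the working directory is touched; the file names simply mirror the ones used above.

```python
import codecs
import os
import shutil
import tempfile

def copy_stripped(in_path, out_path, encoding='utf-8'):
    """Copy a text file, removing trailing whitespace from every line."""
    in_file = codecs.open(in_path, 'r', encoding=encoding)
    out_file = codecs.open(out_path, 'w', encoding=encoding)
    try:
        for line in in_file:
            out_file.write(line.rstrip() + u'\n')
    finally:
        in_file.close()
        out_file.close()

# Demonstration on temporary files
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'test.ar')
dst = os.path.join(tmp_dir, 'out.ar')

f = codecs.open(src, 'w', encoding='utf-8')
f.write(u'\u0643\u062a\u0627\u0628   \n\u0637\t\n')   # lines with trailing blanks
f.close()

copy_stripped(src, dst)

f = codecs.open(dst, 'r', encoding='utf-8')
result = f.read()
f.close()
shutil.rmtree(tmp_dir)                                # remove the temporary files
```

The try/finally block is a design choice rather than a requirement: it guarantees that both files are closed even if a line cannot be decoded.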

Python also supports Unicode file names. Simply pass a Unicode string to open() or codecs.open(). As a sanity check, you can discover the encoding your filesystem uses for file names by calling sys.getfilesystemencoding():

import codecs
import sys

a = sys.getfilesystemencoding()
print "Filesystem encoding: " + str(a)		#May be None on some systems

arabic_filename = unicode('???_????', encoding='utf-8')
IN_FILE = codecs.open(arabic_filename, encoding='utf-8')

for line in IN_FILE:
	print line
IN_FILE.close()
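Whether a Unicode file name will work depends on the filesystem encoding, so it can help to test the name before opening the file. The following sketch is an assumption-laden illustration: the helper can_name_file and the sample file name are hypothetical, and the file is created in a temporary directory and removed afterwards.

```python
import codecs
import os
import shutil
import sys
import tempfile

def can_name_file(name):
    """Return True if the filesystem encoding can represent this name."""
    fs_encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
    try:
        name.encode(fs_encoding)
        return True
    except UnicodeEncodeError:
        return False

arabic_name = u'\u0645\u0644\u0641.txt'   # hypothetical name: meem, lam, feh
tmp_dir = tempfile.mkdtemp()
round_trip = None                         # stays None if the name is unusable

if can_name_file(arabic_name):
    path = os.path.join(tmp_dir, arabic_name)
    out_file = codecs.open(path, 'w', encoding='utf-8')
    out_file.write(u'\u0637\n')
    out_file.close()
    in_file = codecs.open(path, 'r', encoding='utf-8')
    round_trip = in_file.read()
    in_file.close()

shutil.rmtree(tmp_dir)
```

On a system whose filesystem encoding is ASCII, the check fails and the file is never created, which is easier to diagnose than an encoding error raised from inside open().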

Unicode Source Files
When strings exceed several characters, as they often do, it is better to write the Unicode literals directly into the source file. To force the Python interpreter to recognize the source code as utf-8, add a ‘magic comment’ to the first or second line of the source file:

# coding: utf-8
import sys

#Try this without the magic comment above and see what happens....
embedded_unicode = unicode('?????', encoding='utf-8')

print embedded_unicode
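The rule about the first or second line can be made concrete with a few lines of code. The sketch below uses the regular expression given in PEP 263 to look for an encoding declaration; the function name declared_encoding is hypothetical, and the check is a simplification of what the interpreter itself does (for example, it does not consider a byte order mark).

```python
import re

# Pattern for the magic comment, as given in PEP 263
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def declared_encoding(source_lines):
    """Return the encoding named on line 1 or 2 of a source file, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

plain = declared_encoding(['# coding: utf-8', 'import sys'])
emacs_style = declared_encoding(['#!/usr/bin/env python',
                                 '# -*- coding: latin-1 -*-'])
too_late = declared_encoding(['import sys', 'x = 1', '# coding: utf-8'])
```

The last call returns None because a declaration on the third line is ignored, which is the most common reason a magic comment appears to have no effect.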

Regular Expressions
With the ability to embed Unicode literals into the source code, it is possible to write regular expressions for any natural language [2]. Simply pass the pattern to the regex compiler as a Unicode string, and add the re.U flag so that character classes such as \w follow the Unicode character properties:

import codecs
import re

TEST_FILE = codecs.open('test.ar','r', encoding='utf-8')

p = re.compile(unicode('^????', 'utf-8'), re.U)

for line in TEST_FILE:
	match = p.match(line)
	if match:
		print line.rstrip()
		print match.group().rstrip()
TEST_FILE.close()
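Patterns can also be written with code point ranges taken from the Unicode charts instead of literal characters. The following is a minimal sketch under stated assumptions: the pattern names and sample line are hypothetical, and the range U+0600 to U+06FF covers only the basic Arabic block, not the supplementary or presentation-form blocks.

```python
import re

# \u escapes keep the patterns ASCII-only, so no magic comment is needed
ARABIC_RUN = re.compile(u'[\u0600-\u06ff]+', re.U)   # one or more Arabic-block characters
LEADING_WORD = re.compile(u'^\\w+', re.U)            # first word of a line, in any script

line = u'\u0643\u062a\u0627\u0628 means book \u0637'
arabic_runs = ARABIC_RUN.findall(line)               # every Arabic run in the line
first_word = LEADING_WORD.match(line).group()        # the leading Arabic word
```

With re.U set, \w matches Arabic letters as well as Latin ones, so the second pattern works regardless of which script begins the line.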

References
[1] Unicode 5.1.0 book – The authoritative programmer’s reference for Unicode. No need to memorize this one.

[2] Unicode charts – Keep these within reach. The Arabic charts are especially useful for characters that do not appear on the keyboard.

Written by Spence

December 19th, 2008 at 4:52 pm

Posted in HOWTO

4 Responses to 'HOWTO: Working with Python, Unicode, and Arabic'

  1. Spence this is really useful stuff for me. What platform and editor do you favour for this work? I am trying to do something with arabic text processing under Ubuntu and nothing seems to work out of the box.

    markandrew

    7 Oct 09 at 6:43 am

  2. Hi Mark,
    I run Ubuntu with a character mapping for Arabic. For editing Python, I use Stani’s Python editor, which is fabulous. Finally, for working with LaTeX, I write in Winefish LaTeX editor. All of these have exceptional Unicode support.

    Spence

    14 Oct 09 at 8:07 am

  3. I want to know how can I display Arabic letters with Perl programming language?

    SH

    30 Oct 09 at 10:47 pm

  4. [...] don’t know about Unicode support in the popular Python IDE’s. This post suggests that SPE might provide support for Unicode. (If you have information about Python IDEs and [...]
