When working with non-European languages such as Arabic and Chinese, a practical understanding of Unicode is necessary. My research group uses Java for larger applications, and although Java represents all strings in Unicode, it is often cumbersome to write small Java applications for the various data manipulation tasks that appear while preparing corpora for translation. Therefore, fluency in one of the dynamically-typed scripting languages can be immensely useful in this particular domain. I prefer Python for its intuitive Unicode support and minimalist syntax. This article provides sample Python code for several common use cases that require particular consideration of string encodings. Perl offers analogous capabilities, but Ruby’s Unicode support is somewhat limited as of Ruby 1.9.
(Note that this document is valid for Python 2.5.x only; Python3k introduces numerous incompatibilities.)
Working with Strings
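Python supports two methods of constructing Unicode strings. The unicode() built-in function decodes a byte string using the encoding you name; when the encoding is omitted it falls back to the interpreter default, which is ASCII rather than utf-8, so pass utf-8 explicitly. The u'' shorthand notation produces the same kind of object from a literal: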
    best = unicode('?????', encoding='utf-8')   #Preferred method
    old = u'?????'                              #Shorthand; the u prefix was dropped in Python 3.0
To convert a Unicode string into a particular encoding, use the string method encode(); decode() goes the other way, from encoded bytes back to Unicode:

    ascii_string = 'My anglo-centric string...'
    utf8_string = ascii_string.encode('utf-8')

    #Concatenation works as expected (the result is a unicode object)
    utf8_string = utf8_string + unicode('???? ??????', encoding='utf-8')

    #print does not require additional parameters
    print ascii_string
    print utf8_string

    #But this throws a UnicodeEncodeError, since ASCII cannot represent Arabic
    print utf8_string.encode('ascii')
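The difference between a Unicode string and its encoded bytes is easy to lose in the example above, so here is a minimal sketch that keeps the two apart. The helper names to_utf8() and from_utf8() are my own and purely illustrative, and the Arabic word is written with escape sequences so that the source file needs no encoding declaration. It should run unchanged under both Python 2.5 and Python 3.3 or later.

```python
def to_utf8(text):
    """Encode a Unicode string as UTF-8 bytes (hypothetical helper)."""
    return text.encode('utf-8')

def from_utf8(raw):
    """Decode UTF-8 bytes back into a Unicode string (hypothetical helper)."""
    return raw.decode('utf-8')

word = u'\u0637\u0627\u0644\u0628'   # four Arabic letters
raw = to_utf8(word)

print(len(word))   # number of code points
print(len(raw))    # number of bytes; each of these letters needs two in UTF-8

# Encoding to ASCII fails because ASCII has no Arabic letters.
try:
    word.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
```

Keeping the Unicode object for all internal work and encoding only at the point of output tends to avoid the implicit ASCII conversions that cause most surprises.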
Sometimes it is necessary to manipulate individual Unicode characters by code point. This is particularly useful for non-printing characters like the right-to-left override (U+202E):

    print unichr(1591)        #0x0637 in the Unicode charts
    print ord(u'?')
    print unichr(ord(u'?'))   #unichr() and ord() are inverses of each other
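When a character is invisible, printing it tells you very little. One way to inspect it, sketched here with the standard unicodedata module (my choice for illustration, not something the code above depends on), is to look up its official name and general category. The variable names are illustrative, and the snippet should run under both Python 2.5 and Python 3.3 or later.

```python
import unicodedata

tah = u'\u0637'   # the same code point as unichr(1591) above
rlo = u'\u202e'   # right-to-left override, invisible when printed

for ch in (tah, rlo):
    # ord() gives the code point; unicodedata supplies the name and category
    print('U+%04X %s (%s)' % (ord(ch),
                              unicodedata.name(ch),
                              unicodedata.category(ch)))
```

The category Cf (format) marks characters such as the override that affect layout without being displayed themselves.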
The syntax for reading and writing Unicode files does not differ from that used for ASCII files. Opening Unicode files requires the codecs library, however:
    import codecs

    IN_FILE = codecs.open('test.ar', 'r', encoding='utf-8')
    OUT_FILE = codecs.open('out.ar', 'w', encoding='utf-8')

    #Each line arrives as a unicode object; write() encodes it back to utf-8
    for line in IN_FILE:
        OUT_FILE.write(line.rstrip() + unicode('\n', encoding='utf-8'))

    IN_FILE.close()
    OUT_FILE.close()
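A quick way to convince yourself that nothing is lost along the way is a round trip: write a few Unicode lines, read them back, and compare. The sketch below does this in a temporary directory. The helpers write_lines() and read_lines() and the file name sample.ar are hypothetical, and the code should run under both Python 2.5 and Python 3.3 or later.

```python
import codecs
import os
import shutil
import tempfile

def write_lines(path, lines):
    """Write Unicode strings to path as utf-8, one per line."""
    out_file = codecs.open(path, 'w', encoding='utf-8')
    try:
        for line in lines:
            out_file.write(line + u'\n')
    finally:
        out_file.close()

def read_lines(path):
    """Read utf-8 lines from path as Unicode strings, newlines removed."""
    in_file = codecs.open(path, 'r', encoding='utf-8')
    try:
        return [line.rstrip(u'\n') for line in in_file]
    finally:
        in_file.close()

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.ar')   # hypothetical file name

original = [u'\u0637\u0627\u0644\u0628', u'plain ascii']
write_lines(path, original)
round_trip = read_lines(path)
shutil.rmtree(tmp_dir)   # clean up the temporary directory

print(round_trip == original)
```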
Python also supports Unicode file names. Simply pass a Unicode string to the open() function. As a sanity check, you can discover your filesystem's default encoding by calling sys.getfilesystemencoding():
    import codecs
    import sys

    a = sys.getfilesystemencoding()
    print "Filesystem encoding: " + a

    arabic_filename = unicode('???_????', encoding='utf-8')
    IN_FILE = codecs.open(arabic_filename, encoding='utf-8')

    for line in IN_FILE:
        print line

    IN_FILE.close()
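The sanity check can be taken one step further by testing whether a given name is representable in the filesystem encoding before trying to open it. This is only a sketch: fits_filesystem() is a hypothetical helper of my own, the file name is made up, and a name that encodes cleanly can still be rejected by the operating system for other reasons. It should run under both Python 2.5 and Python 3.3 or later.

```python
import sys

def fits_filesystem(name, encoding=None):
    """Return True if name can be encoded in the given encoding.

    Hypothetical helper; defaults to the filesystem encoding.
    """
    # getfilesystemencoding() may return None on some Python 2 Unix systems
    encoding = encoding or sys.getfilesystemencoding() or 'ascii'
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

arabic_name = u'\u0637\u0627\u0644\u0628.txt'   # hypothetical file name

print(fits_filesystem(arabic_name))            # depends on your system
print(fits_filesystem(arabic_name, 'ascii'))
print(fits_filesystem(arabic_name, 'utf-8'))
```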
Unicode Source Files
When strings exceed several characters, as they often do, it is better to write the Unicode literals directly into the source file. To force the Python interpreter to recognize the source code as utf-8, add a 'magic comment' to the first or second line of the source file:
    # coding: utf-8
    import sys

    #Try this without the magic comment above and see what happens....
    embedded_unicode = unicode('?????', encoding='utf-8')
    print embedded_unicode
With the ability to embed Unicode literals into the source code, it is possible to write regular expressions for any natural language. Simply indicate to the regex compiler that the pattern parameter is a Unicode string by passing the re.U flag:
    import codecs
    import re

    TEST_FILE = codecs.open('test.ar', 'r', encoding='utf-8')
    p = re.compile(unicode('^????', 'utf-8'), re.U)

    #Print each line that starts with the pattern, followed by the matched text
    for line in TEST_FILE:
        match = p.match(line)
        if match:
            print line.rstrip()
            print match.group().rstrip()

    TEST_FILE.close()
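The same idea can be tried without an input file. The sketch below matches lines that begin with the Arabic definite article (alef followed by lam) and at least one more word character. The pattern, the name AL_PREFIX, and the sample lines are my own illustrations, written with escape sequences so that no magic comment is needed. It should run under both Python 2.5 and Python 3.3 or later.

```python
import re

# Alef + lam at the start of the line, then one or more word characters.
# re.U makes \w match Unicode letters rather than only ASCII ones.
AL_PREFIX = re.compile(u'^\u0627\u0644\\w+', re.U)

lines = [
    u'\u0627\u0644\u0637\u0627\u0644\u0628 \u0637\u0627\u0644\u0628',
    u'\u0637\u0627\u0644\u0628',
    u'ascii only',
]

matches = []
for line in lines:
    m = AL_PREFIX.match(line)
    if m:
        matches.append(m.group())

print(len(matches))
```

Only the first sample line matches, and the match stops at the space because \w does not cover whitespace.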
Unicode 5.1.0 book – The authoritative programmer’s reference for Unicode. No need to memorize this one.
Unicode charts – Keep these within reach. The Arabic charts are especially useful for characters that do not appear on the keyboard.