Spence Green

التكرار يعلم الحمار

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

HOWTO: Working with Python, Unicode, and Arabic

When working with non-European languages such as Arabic and Chinese, a practical understanding of Unicode is necessary. My research group uses Java for larger applications, and although Java represents all strings in Unicode, it is often cumbersome to write small Java applications for the various data manipulation tasks that arise while preparing corpora for translation. Fluency in a dynamically typed scripting language is therefore immensely useful in this domain. I prefer Python for its intuitive Unicode support and minimalist syntax. This article provides sample Python code for several common use cases that require particular attention to string encodings. Perl offers analogous capabilities, but Ruby’s Unicode support is somewhat limited as of Ruby 1.9.

(Note that this document is valid for Python 2.5.x only; Python3k introduces numerous incompatibilities.)

Working with Strings
Python supports two methods of constructing Unicode strings. The unicode() built-in function decodes a byte string using the encoding you pass to it; if you omit the encoding, Python 2 falls back to ASCII, so name utf-8 explicitly. The u'' shorthand notation is equivalent:

best = unicode('?????', encoding='utf-8')    # Preferred method
old = u'?????'                               # Deprecated in Python3k
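For readers on Python 3, where unicode() no longer exists, a rough equivalent might look like the sketch below. It is an illustration under stated assumptions rather than part of the original code: it targets Python 3 (the article itself targets 2.5), bytes.decode() stands in for unicode(), and the sample word (kitab, "book") was chosen for this example.

```python
# Python 3 sketch: str is already Unicode, so unicode(s, 'utf-8')
# becomes bytes.decode('utf-8').
# The sample word (kitab, "book") is an illustrative choice, not from the article.
raw_bytes = b'\xd9\x83\xd8\xaa\xd8\xa7\xd8\xa8'   # UTF-8 bytes for four Arabic letters

decoded = raw_bytes.decode('utf-8')     # explicit decoding from bytes to text
literal = '\u0643\u062a\u0627\u0628'    # the same word written with escapes

byte_count = len(raw_bytes)   # each of these letters takes two bytes in UTF-8
char_count = len(decoded)     # but the decoded string has one item per letter

print(byte_count, char_count, decoded == literal)
```

The point of the comparison is that the length of the byte string and the length of the decoded text differ, which is why decoding early matters for any per-character processing.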

To convert between one encoding and another, use the string method encode():

ascii_string = 'My anglo-centric string...'
utf8_string = ascii_string.encode('utf-8')

#Concatenation works as expected
utf8_string = utf8_string + unicode('???? ??????', encoding='utf-8')

#print does not require additional parameters
print ascii_string
print utf8_string

#But this throws an exception
print utf8_string.encode('ascii')
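The same round trip can be sketched for Python 3, where the exception is easier to isolate. This is a hedged illustration, not the article's code: it assumes Python 3, and the Arabic sample word appended to the string is an assumption made for the example.

```python
# Python 3 sketch of encoding, decoding, and the ASCII failure case.
# The Arabic sample word appended here is illustrative only.
text = 'My anglo-centric string... ' + '\u0643\u062a\u0627\u0628'

utf8_bytes = text.encode('utf-8')          # str -> bytes
round_trip = utf8_bytes.decode('utf-8')    # bytes -> str

try:
    text.encode('ascii')                   # Arabic letters have no ASCII form
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# errors='replace' substitutes '?' for each unencodable character
# instead of raising an exception.
lossy = text.encode('ascii', errors='replace')

print(round_trip == text, ascii_failed, lossy)
```

Passing errors='replace' (or 'ignore') is a deliberate, lossy choice; it is usually better to let the exception surface while preparing corpora so that encoding mistakes are not silently written to disk.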

Sometimes it is necessary to manipulate individual Unicode characters. This is particularly useful for non-printing characters like the right-to-left override (0x202E):

print unichr(1591)          # 0x0637 in the Unicode charts
print ord(u'?')
print unichr(ord(u'?'))     # unichr() and ord() are inverses
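As a small sketch of the same idea in Python 3 (an assumption, since the article targets 2.5), chr() takes the place of unichr(). The standard unicodedata module, which the article does not use, is added here to look up a character's name and category when the glyph itself is invisible or hard to type.

```python
import unicodedata

# chr() replaces unichr() in Python 3; ord() is unchanged.
tah = chr(0x0637)               # 0x0637 == 1591 decimal
code_point = ord(tah)           # back to the integer code point
tah_name = unicodedata.name(tah)

# Non-printing characters are easier to identify by name than by sight.
rlo = chr(0x202E)
rlo_name = unicodedata.name(rlo)
rlo_category = unicodedata.category(rlo)   # 'Cf' marks a format character

print(code_point, tah_name)
print(hex(ord(rlo)), rlo_name, rlo_category)
```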

Manipulating Files
The syntax for reading and writing Unicode files does not differ from that used for ASCII files. Opening Unicode files requires the codecs library, however:

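import codecs

IN_FILE = codecs.open('test.ar', 'r', encoding='utf-8')
OUT_FILE = codecs.open('out.ar', 'w', encoding='utf-8')

#Strip trailing whitespace and write each line back out as utf-8
for line in IN_FILE:
    OUT_FILE.write(line.rstrip() + unicode('\n', encoding='utf-8'))

IN_FILE.close()
OUT_FILE.close()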
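A self-contained version of this read-strip-write loop is sketched below for Python 3, using io.open() with an explicit encoding in place of codecs.open(). The temporary directory, the file contents, and the variable names are all assumptions made so the example can run on its own; only the file names test.ar and out.ar come from the article.

```python
import io
import os
import tempfile

# Work in a scratch directory so the sketch does not touch real data.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'test.ar')
out_path = os.path.join(workdir, 'out.ar')

# Create a small input file with trailing whitespace (sample words only).
with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write('\u0643\u062a\u0627\u0628   \n\u0628\u0627\u0628\t\n')

# Read decoded text, strip trailing whitespace, and write it back as UTF-8.
with io.open(in_path, 'r', encoding='utf-8') as in_file, \
        io.open(out_path, 'w', encoding='utf-8') as out_file:
    for line in in_file:
        out_file.write(line.rstrip() + '\n')

with io.open(out_path, 'r', encoding='utf-8') as f:
    cleaned = f.read()

print(len(cleaned.splitlines()))
```

The with statement closes both files even if an exception interrupts the loop, which is the main reason to prefer it over explicit close() calls.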


Python also supports Unicode file names. Simply pass a Unicode string to the open() function. As a sanity check, you can discover your filesystem’s default encoding by calling sys.getfilesystemencoding():

import codecs
import sys

a = sys.getfilesystemencoding()
print "Filesystem encoding: " + a

arabic_filename = unicode('???_????', encoding='utf-8')
IN_FILE = codecs.open(arabic_filename, encoding='utf-8')

for line in IN_FILE:
    print line

IN_FILE.close()
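The sketch below shows one way to check that a Unicode file name survives a round trip through the filesystem. It assumes Python 3 and a filesystem encoding that can represent Arabic (UTF-8 on most current systems); the file name and the scratch directory are hypothetical and exist only for the example.

```python
import io
import os
import sys
import tempfile

fs_encoding = sys.getfilesystemencoding()
print('Filesystem encoding: ' + fs_encoding)

# Hypothetical Arabic file name, created inside a scratch directory.
workdir = tempfile.mkdtemp()
arabic_name = '\u0643\u062a\u0627\u0628.txt'
path = os.path.join(workdir, arabic_name)

with io.open(path, 'w', encoding='utf-8') as f:
    f.write('ok\n')

# os.listdir() returns str names when given a str path, so the name
# can be compared directly with the original Unicode string.
listed = os.listdir(workdir)
print(arabic_name in listed)
```

If the reported encoding is ASCII or another single-byte encoding, creating the file is likely to fail, which is exactly the situation the sanity check is meant to reveal.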

Unicode Source Files
When strings exceed several characters, as they often do, it is better to write the Unicode literals directly into the source file than to assemble them from code points. To force the Python interpreter to recognize the source code as utf-8, add a ‘magic comment’ to the first or second line of the source file:

# coding: utf-8
import sys

#Try this without the magic comment above and see what happens....
embedded_unicode = unicode('?????', encoding='utf-8')

print embedded_unicode
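To see what the magic comment actually changes, the sketch below compiles the same bytes twice under two different declared encodings. It is an illustration for Python 3, where compile() accepts source as bytes and honors the declaration; the helper run_source and the latin-1 comparison are assumptions introduced for the demonstration.

```python
# The same two bytes (0xD8 0xB7) are read as one Arabic letter or as two
# Latin-1 characters, depending on the encoding declared in the source.
body = b"text = '\xd8\xb7'\n"


def run_source(source_bytes):
    """Compile and run a tiny module given as bytes; return its 'text' value."""
    namespace = {}
    exec(compile(source_bytes, '<demo>', 'exec'), namespace)
    return namespace['text']


as_utf8 = run_source(b'# coding: utf-8\n' + body)
as_latin1 = run_source(b'# coding: latin-1\n' + body)

print(len(as_utf8), len(as_latin1))
```

The declaration has to match how the editor saved the file; a mismatch does not always raise an error and can instead produce silently wrong strings, as the latin-1 case shows.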

Regular Expressions
With the ability to embed Unicode literals into the source code, it is possible to write regular expressions for any natural language [2]. Simply indicate to the regex compiler that the pattern parameter is a Unicode string:

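import codecs
import re

TEST_FILE = codecs.open('test.ar', 'r', encoding='utf-8')

#re.U makes character classes such as \w follow Unicode rules
p = re.compile(unicode('^????', 'utf-8'), re.U)

for line in TEST_FILE:
    match = p.match(line)
    if match:
        print line.rstrip()
        print match.group().rstrip()

TEST_FILE.close()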
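A runnable variation on this pattern is sketched below for Python 3, where str patterns are Unicode-aware by default and re.U is not needed. The sample words, the in-memory list of lines (standing in for the file), and the `\w*` suffix are assumptions chosen to make the matching behavior visible.

```python
import re

# Sample words, written with escapes so the sketch does not depend on
# the encoding of this file. Both are illustrative choices.
KITAB = '\u0643\u062a\u0627\u0628'   # "book"
BAB = '\u0628\u0627\u0628'           # "door"

# In-memory stand-in for the lines of a UTF-8 file.
lines = [KITAB + ' 2008', BAB, KITAB + '\u0629']

# Match lines that start with KITAB, plus any word characters that follow.
# In Python 3, \w already covers Arabic letters for str patterns.
pattern = re.compile('^' + KITAB + r'\w*')

matches = []
for line in lines:
    match = pattern.match(line)
    if match:
        matches.append(match.group())

print(len(matches))
```

The first and third lines match; the third match includes the trailing letter because `\w` treats Arabic letters as word characters.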

[1] Unicode 5.1.0 book – The authoritative programmer’s reference for Unicode. No need to memorize this one.

[2] Unicode charts – Keep these within reach. The Arabic charts are especially useful for characters that do not appear on the keyboard.

Written by Spence

December 19th, 2008 at 4:52 pm

Posted in HOWTO

Responses to 'HOWTO: Working with Python, Unicode, and Arabic'

  1. Spence this is really useful stuff for me. What platform and editor do you favour for this work? I am trying to do something with arabic text processing under Ubuntu and nothing seems to work out of the box.


    7 Oct 09 at 6:43 am

  2. Hi Mark,
    I run Ubuntu with a character mapping for Arabic. For editing Python, I use Stani’s Python editor, which is fabulous. Finally, for working with LaTeX, I write in Winefish LaTeX editor. All of these have exceptional Unicode support.


    14 Oct 09 at 8:07 am

  3. I want to know how can I display Arabic letters with Perl programming language?


    30 Oct 09 at 10:47 pm

  4. […] don’t know about Unicode support in the popular Python IDE’s. This post suggests that SPE might provide support for Unicode. (If you have information about Python IDEs and […]
