HOWTO: Working with Python, Unicode, and Arabic
When working with non-European languages such as Arabic and Chinese, a practical understanding of Unicode is necessary. My research group uses Java for larger applications, and although Java represents all strings in Unicode, it is often cumbersome to write small Java applications for the various data manipulation tasks that come up while preparing corpora for translation. Fluency in one of the dynamically typed scripting languages is therefore immensely useful in this domain. I prefer Python for its intuitive Unicode support and minimalist syntax. This article provides sample Python code for several common use cases that require particular attention to string encodings. Perl offers analogous capabilities, but Ruby’s Unicode support is somewhat limited as of Ruby 1.9.
(Note that this document is valid for Python 2.5.x only; Python3k introduces numerous incompatibilities.)
Working with Strings
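Python supports two methods of constructing Unicode strings. The unicode() built-in function decodes a byte string using the encoding you name; if you omit the encoding, Python 2 falls back to its default, which is normally ASCII, so pass encoding='utf-8' explicitly. The u'' shorthand notation builds the same kind of object from a literal: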
    best = unicode('?????', encoding='utf-8')  #Preferred method
    old = u'?????'  #Deprecated in Python3k
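The difference between a Unicode string and its encoded bytes is easiest to see by comparing lengths. The following is a minimal sketch rather than part of the original article: the sample word is my own choice, written with \u escapes so that it does not depend on the source file's encoding, and the snippet is written to run on Python 2 as well as Python 3.3+. On Python 2, unicode(utf8_bytes, encoding='utf-8') performs the same decoding as the decode() call shown here.

```python
# Illustrative sample word (an assumption, not the article's original text),
# spelled with \u escapes so the source encoding does not matter.
word = u'\u0639\u0631\u0628\u064a'

utf8_bytes = word.encode('utf-8')      # Unicode string -> UTF-8 bytes
decoded = utf8_bytes.decode('utf-8')   # UTF-8 bytes -> Unicode string

print(len(word))        # 4 code points
print(len(utf8_bytes))  # 8 bytes: each of these letters needs two bytes in UTF-8
print(decoded == word)  # True: the round trip is lossless
```

Keeping text as Unicode inside the program and encoding only at the boundaries (files, terminals, sockets) avoids most surprises of this kind.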
To convert a string to a particular encoding, use the string method encode(). (The decode() method performs the opposite conversion, from encoded bytes back to Unicode.)
    ascii_string = 'My anglo-centric string...'
    utf8_string = ascii_string.encode('utf-8')

    #Concatenation works as expected
    utf8_string = utf8_string + unicode('???? ??????', encoding='utf-8')

    #print does not require additional parameters
    print ascii_string
    print utf8_string

    #But this throws an exception
    print utf8_string.encode('ascii')
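When the target encoding cannot represent a character, encode() raises UnicodeEncodeError unless you pass an error handler. The sketch below is an illustration under my own assumptions (the sample string is hypothetical); it uses only the standard 'replace', 'xmlcharrefreplace', and 'ignore' handlers and runs on Python 2 and Python 3.3+.

```python
# Hypothetical sample: three ASCII letters, a space, and ARABIC LETTER TAH.
text = u'abc \u0637'

# The default ("strict") handler refuses to encode the Arabic letter.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Alternative handlers trade fidelity for robustness.
replaced = text.encode('ascii', 'replace')            # unencodable -> '?'
escaped = text.encode('ascii', 'xmlcharrefreplace')   # unencodable -> '&#NNNN;'
ignored = text.encode('ascii', 'ignore')              # unencodable -> dropped

print(strict_failed)             # True
print(replaced.decode('ascii'))  # abc ?
print(escaped.decode('ascii'))   # abc &#1591;
```

Of the three, only 'xmlcharrefreplace' keeps enough information to recover the original character later, which matters when preparing corpora.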
Sometimes it is necessary to manipulate individual Unicode characters. This is particularly useful for non-printing characters like the right-to-left override (0x202E):
    print unichr(1591)  #0x0637 in the Unicode charts
    print ord(u'?')
    print unichr(ord(u'?'))  #unichr() and ord() are inversely related
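Non-printing characters are hard to identify by eye, so it can help to ask the standard unicodedata module what a code point is. This is a small sketch, not something from the original article; the compatibility shim for unichr() is my addition so that the same snippet runs on Python 2 and Python 3.

```python
import unicodedata

try:
    unichr            # built-in on Python 2
except NameError:
    unichr = chr      # Python 3 folded unichr() into chr()

tah = unichr(0x0637)  # the letter used in the example above
rlo = unichr(0x202E)  # the right-to-left override

print(unicodedata.name(tah))      # ARABIC LETTER TAH
print(unicodedata.name(rlo))      # RIGHT-TO-LEFT OVERRIDE
print(unicodedata.category(rlo))  # Cf (a format character, so it prints nothing)
print(ord(tah) == 1591)           # True
```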
Manipulating Files
The syntax for reading and writing Unicode files does not differ from that used for ASCII files. Opening Unicode files requires the codecs library, however:
    import codecs

    IN_FILE = codecs.open('test.ar', 'r', encoding='utf-8')
    OUT_FILE = codecs.open('out.ar', 'w', encoding='utf-8')

    #Strip trailing whitespace from each line and copy it to the output
    for line in IN_FILE:
        OUT_FILE.write(line.rstrip() + unicode('\n', encoding='utf-8'))

    IN_FILE.close()
    OUT_FILE.close()
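To try the same pattern without an existing test.ar, the following self-contained sketch creates its own input in a temporary directory, copies it with trailing whitespace stripped, and reads the result back. The sample lines and the temporary paths are assumptions made for illustration; the snippet avoids the with statement so that it stays close to the Python 2.5 style used here, and it also runs on Python 3.3+.

```python
import codecs
import os
import shutil
import tempfile

# Hypothetical sample data with trailing whitespace to strip.
sample_lines = [u'\u0645\u0631\u062d\u0628\u0627   \n', u'hello\t\n']

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'test.ar')
out_path = os.path.join(tmp_dir, 'out.ar')

# Write the input file as UTF-8.
in_file = codecs.open(in_path, 'w', encoding='utf-8')
for line in sample_lines:
    in_file.write(line)
in_file.close()

# Copy it line by line, stripping trailing whitespace.
in_file = codecs.open(in_path, 'r', encoding='utf-8')
out_file = codecs.open(out_path, 'w', encoding='utf-8')
for line in in_file:
    out_file.write(line.rstrip() + u'\n')
in_file.close()
out_file.close()

# Read the result back; codecs.open() returns Unicode strings, not bytes.
out_file = codecs.open(out_path, 'r', encoding='utf-8')
result = out_file.read()
out_file.close()
shutil.rmtree(tmp_dir)

print(result == u'\u0645\u0631\u062d\u0628\u0627\nhello\n')  # True
```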
Python also supports Unicode file names. Simply pass a Unicode string to the open() function (or to codecs.open(), as below). As a sanity check, you can discover your filesystem’s default encoding by calling sys.getfilesystemencoding():
    import codecs
    import sys

    a = sys.getfilesystemencoding()
    print "Filesystem encoding: " + a

    arabic_filename = unicode('???_????', encoding='utf-8')
    IN_FILE = codecs.open(arabic_filename, encoding='utf-8')
    for line in IN_FILE:
        print line
    IN_FILE.close()
Unicode Source Files
When strings exceed several characters, as they often do, it is better to write the Unicode literals directly into the source file. To make the Python interpreter read the source code as utf-8, add a ‘magic comment’ to the first or second line of the source file:
    # coding: utf-8
    import sys

    #Try this without the magic comment above and see what happens....
    embedded_unicode = unicode('?????', encoding='utf-8')
    print embedded_unicode
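One way to check that the interpreter really decoded an embedded literal as intended is to compare it with the same text spelled out in \u escapes, which do not depend on the source encoding. This is only a sketch: the sample word is my own, and the file must actually be saved as UTF-8 for the comparison to hold. Python 2 needs the magic comment; Python 3 assumes UTF-8 source by default.

```python
# -*- coding: utf-8 -*-
# Sketch: the sample word is an assumption chosen for illustration.
literal = u'عربي'                       # typed directly into the source file
escaped = u'\u0639\u0631\u0628\u064a'   # the same word as \u escapes

print(literal == escaped)  # True when the file was read as UTF-8
print(len(literal))        # 4
```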
Regular Expressions
With the ability to embed Unicode literals into the source code, it is possible to write regular expressions for any natural language [2]. Simply pass the pattern as a Unicode string and set the re.U flag, which makes character classes such as \w Unicode-aware:
    import codecs
    import re

    TEST_FILE = codecs.open('test.ar', 'r', encoding='utf-8')
    p = re.compile(unicode('^????', 'utf-8'), re.U)

    #Print every line that starts with the pattern, then the matched text
    for line in TEST_FILE:
        match = p.match(line)
        if match:
            print line.rstrip()
            print match.group().rstrip()

    TEST_FILE.close()
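The effect of re.U is easiest to see with \w. The sketch below uses in-memory sample lines of my own (not the article's test.ar) and matches words that begin with the Arabic definite article, alef followed by lam. On Python 2, \w without re.U would match only ASCII word characters; on Python 3, Unicode matching is already the default for text patterns, and the flag is simply accepted.

```python
import re

# Illustrative data (an assumption): the first line begins with the
# definite article "al" (alef + lam); the second line does not.
lines = [
    u'\u0627\u0644\u0643\u062a\u0627\u0628 \u062c\u062f\u064a\u062f',
    u'\u0643\u062a\u0627\u0628 \u062c\u062f\u064a\u062f',
]

# re.UNICODE (re.U) makes \w follow the Unicode character database.
pattern = re.compile(u'^\u0627\u0644\\w+', re.UNICODE)

matches = []
for line in lines:
    match = pattern.match(line)
    if match:
        matches.append(match.group())

print(len(matches))                                        # 1
print(matches[0] == u'\u0627\u0644\u0643\u062a\u0627\u0628')  # True: stops at the space
```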
References
Unicode 5.1.0 book – The authoritative programmer’s reference for Unicode. No need to memorize this one.
Unicode charts – Keep these within reach. The Arabic charts are especially useful for characters that do not appear on the keyboard.
Responses
Spence this is really useful stuff for me. What platform and editor do you favour for this work? I am trying to do something with arabic text processing under Ubuntu and nothing seems to work out of the box.
markandrew
7 Oct 09 at 06:43
Hi Mark,
I run Ubuntu with a character mapping for Arabic. For editing Python, I use Stani’s Python editor, which is fabulous. Finally, for working with LaTeX, I write in Winefish LaTeX editor. All of these have exceptional Unicode support.
Spence
14 Oct 09 at 08:07
I want to know how can I display Arabic letters with Perl programming language?
SH
30 Oct 09 at 22:47