Spence Green

التكرار يعلم الحمار ("Repetition teaches the donkey")

I am a Ph.D. student in Computer Science at Stanford University. In addition to computers and languages, my interests include travel, running, and scuba diving.

HOWTO: Basic Arabic Preprocessing for NLP


Raw Arabic text is difficult to process. Errant diacritics, strange Unicode characters, and haphazard use of whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:

  • Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In ACL.
  • Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL.
  • Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology.

Recently I have had success with a less sophisticated normalization process, which requires these scripts and MADA (freely available from Columbia University). Two scripts are included:
  1. basic_ortho_norm.py — Simple orthographic normalization.
  2. run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).

Both of my 2010 conference papers used these scripts.
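
The contents of basic_ortho_norm.py are not reproduced in this post, so the following is only a minimal sketch of the kind of simple orthographic normalization described above. It assumes four common steps: Unicode compatibility normalization, removal of diacritics and tatweel, collapsing of alef variants, and whitespace cleanup. The function name normalize_arabic and the exact character sets are illustrative assumptions, not taken from the script.

```python
import re
import unicodedata

# Assumed character classes; the real script may use different sets.
# Tanween, short vowels, shadda, sukun (U+064B..U+0652) and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida), used only to stretch words visually.
TATWEEL = re.compile("\u0640")
# Alef with madda, hamza above, or hamza below -> bare alef (U+0627).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Zero-width and directional formatting characters, plus the BOM.
INVISIBLE = re.compile("[\u200B-\u200F\u202A-\u202E\uFEFF]")
WHITESPACE = re.compile(r"\s+")


def normalize_arabic(text: str) -> str:
    """Apply a simple, lossy orthographic normalization to Arabic text."""
    # Map presentation forms and other compatibility characters to base letters.
    text = unicodedata.normalize("NFKC", text)
    text = INVISIBLE.sub("", text)
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    # Collapse runs of whitespace and trim the ends.
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    import sys

    # Normalize stdin line by line, one output line per input line.
    for line in sys.stdin:
        print(normalize_arabic(line))
```

Each step discards information (for example, the hamza on an initial alef), which is usually acceptable for reducing sparsity in statistical systems but should be checked against the conventions of whatever downstream tool consumes the text.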

Written by Spence

January 19th, 2011 at 6:48 am

Posted in Arabic, HOWTO, NLP
