HOWTO: Basic Arabic Preprocessing for NLP
Raw Arabic text is difficult to process. Errant diacritics, strange unicode characters, and haphazard use whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:
- Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In NAACL. [pdf]
- Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL. [pdf]
- Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology. [pdf]
Recently I have had success with a less sophisticated normalization process, which requires these scripts and MADA (freely available from Columbia University). Two scripts are included:
- basic_ortho_norm.py — Simple orthographic normalization.
- run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).
Both of my 2010 conference papers used these scripts.
Leave a Reply
You must be logged in to post a comment.