Archive for the ‘Arabic’ Category
2011 Arabic Linguistics Symposium Programme Posted
The 2011 ALS starts on 4 March. I’ve always wanted to attend this conference, and have fooled myself into believing that I might submit to it one day. Reading through the bound proceedings of ALS counts as one of my earlier grad school memories.
HOWTO: Basic Arabic Preprocessing for NLP
Raw Arabic text is difficult to process. Errant diacritics, strange Unicode characters, and haphazard use of whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:
- Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In ACL. [pdf]
- Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL. [pdf]
- Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology. [pdf]
Two scripts that implement this kind of preprocessing:
- basic_ortho_norm.py — Simple orthographic normalization.
- run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).
Both of my 2010 conference papers used these scripts.
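For readers who want a feel for what simple orthographic normalization involves, here is a minimal sketch. It is not the contents of basic_ortho_norm.py; the function name `normalize` and the particular choices (NFKC folding, dropping format characters, stripping diacritics and tatweel, collapsing alef variants, squeezing whitespace) are my own assumptions about a reasonable baseline.

```python
import re
import unicodedata

# Hypothetical baseline, not the author's script.
# Tanwin through sukun (U+064B-U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile('[\u064B-\u0652\u0670]')
# Tatweel (kashida), used only for visual elongation.
TATWEEL = '\u0640'
# Alef with madda, hamza above, hamza below -> bare alef.
ALEF_VARIANTS = re.compile('[\u0622\u0623\u0625]')
WHITESPACE = re.compile(r'\s+')


def normalize(text: str) -> str:
    """Apply a simple, lossy orthographic normalization to Arabic text."""
    # Fold presentation forms and ligatures into base letters.
    text = unicodedata.normalize('NFKC', text)
    # Drop invisible format characters such as directional marks.
    # Assumption: none of them carry meaning for the task at hand.
    text = ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')
    text = DIACRITICS.sub('', text)
    text = text.replace(TATWEEL, '')
    text = ALEF_VARIANTS.sub('\u0627', text)
    # Collapse runs of whitespace and trim the ends.
    return WHITESPACE.sub(' ', text).strip()
```

Collapsing the alef variants discards information, so whether to do it depends on the downstream task. Cliticization and morphological variation are not touched here; those need a morphological analyzer such as MADA+TOKAN.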
Awesome Arabic Corpus Search Tool
Evidently an Arabic corpus search tool has existed at BYU for some time, but a post on the Arabic LinguistList this morning brought it to my attention:
This is to announce that two new ‘sub’ corpora have been added to newspaper section of arabiCorpus.byu.edu:
- Masri2010: This is the entire year of 2010 worth of the newspaper Al-Masri Al-Yawm. This paper was chosen partly because of its popularity, partly because it contrasts markedly in style from the Ahram, and partly because it is one of the papers that uses the new ‘quoting’ style: they actually write down what people say, even if it is in colloquial Arabic or some mixed form (look up وتعاليمها تخاخل الإنجيل using ‘string’ for a relatively hilarious example quoting Baba Shanouda during last summer’s ‘divorce controversy’). (almost 14 million words)
- ShuruqColumns: This is a large set of columns from the Egyptian newspaper Al-Shuruuq. This paper is reputed to have attracted some of the best editorial writers in Egypt, and many people buy it just for the writers and columns, rather than for the news. This would be a good (small) corpus to use if you wanted samples of what is considered to be ‘fine’ current writing on politics and social life. Writers include Fahmy Huwaidi, Khaled Al-Khamissi (of Taxi fame), Alaa’ Al-Aswaani (of Yaqubian Building fame), and many others. Enjoy. (about 2 million words)