Spence Green

التكرار يعلم الحمار ("Repetition teaches the donkey")

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘Arabic’ Category

2011 Arabic Linguistics Symposium Programme Posted

The 2011 ALS starts on 4 March. I’ve always wanted to attend this conference, and have fooled myself into believing that I might submit to it one day. Reading through the bound proceedings of ALS counts as one of my earlier grad school memories.

Written by Spence

February 23rd, 2011 at 5:01 pm

HOWTO: Basic Arabic Preprocessing for NLP

Raw Arabic text is difficult to process. Errant diacritics, strange Unicode characters, and haphazard use of whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:

  • Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In ACL.
  • Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL.
  • Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology.

Recently I have had success with a less sophisticated normalization process, which requires these scripts and MADA (freely available from Columbia University). Two scripts are included:

  1. basic_ortho_norm.py — Simple orthographic normalization.
  2. run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).

Both of my 2010 conference papers used these scripts.
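
To give a flavor of what simple orthographic normalization can look like, here is a minimal Python sketch. It is not the contents of basic_ortho_norm.py: the function name normalize_arabic and the particular rules (Unicode NFKC normalization, dropping zero-width and directional control characters, stripping diacritics and tatweel, collapsing alef variants, and squeezing whitespace) are assumptions chosen for illustration. Morphological analysis and clitic segmentation are left to MADA and are not attempted here.

```python
import re
import unicodedata

# Illustrative rule set (an assumption, not the rules in basic_ortho_norm.py).
# Zero-width and bidirectional control characters, plus the byte-order mark.
CONTROL_CHARS = re.compile("[\u200B-\u200F\u202A-\u202E\uFEFF]")
# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida), used only to stretch words visually.
TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
WHITESPACE = re.compile(r"\s+")
BARE_ALEF = "\u0627"


def normalize_arabic(text: str) -> str:
    """Apply a few common orthographic normalizations to raw Arabic text."""
    # Fold presentation forms and other compatibility characters.
    text = unicodedata.normalize("NFKC", text)
    text = CONTROL_CHARS.sub("", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    # Map every hamzated or madda alef to a bare alef.
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)
    # Collapse runs of whitespace and trim the ends.
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_arabic("مَكْتَبٌ"))  # prints مكتب
```

Collapsing alef variants reduces sparsity but also discards information, so whether to apply that rule depends on the downstream task.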


Written by Spence

January 19th, 2011 at 6:48 am

Posted in Arabic,HOWTO,NLP

Awesome Arabic Corpus Search Tool

Evidently an Arabic corpus search tool has existed at BYU for some time, but a post on the Arabic LinguistList this morning brought it to my attention:

This is to announce that two new ‘sub’ corpora have been added to the newspaper section of arabiCorpus.byu.edu:
Masri2010:
This is the entire year of 2010 worth of the newspaper Al-Masri Al-Yawm.  This paper was chosen partly because of its popularity, partly because it contrasts markedly in style from the Ahram, and partly because it is one of the papers that uses the new ‘quoting’ style: they actually write down what people say, even if it is in colloquial Arabic or some mixed form (look up وتعاليمها تخاخل الإنجيل using ‘string’ for a relatively hilarious example quoting Baba Shanouda during last summer’s ‘divorce controversy’). (almost 14 million words)
ShuruqColumns:
This is a large set of columns from the Egyptian newspaper Al-Shuruuq.  This paper is reputed to have attracted some of the best editorial writers in Egypt, and many people buy it just for the writers and columns, rather than for the news.  This would be a good (small) corpus to use if you wanted samples of what is considered to be ‘fine’ current writing on politics and social life.  Writers include Fahmy Huwaidi, Khaled Al-Khamissi (of Taxi fame), Alaa’ Al-Aswaani (of Yaqubian Building fame), and many others.  Enjoy.  (about 2 million words)
I cannot contain my excitement. Not only does the search provide full citations, but it also shows frequency distributions of various word forms (e.g., مكتب => مكتبهم, المكتب) and of the tokens appearing immediately before and after the query term. In the past I have used Google search as a corpus tool, but the pollution in the Arabic web due to chat forums subverts the discovery of meaningful linguistic examples. Bravo, BYU.
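
The two kinds of counts mentioned above (word forms matching a query, and the tokens just before and after each match) can be sketched in a few lines of Python. This is only an illustration and says nothing about how arabiCorpus is implemented: the function name concordance_counts is hypothetical, and plain substring matching on whitespace-separated tokens is an assumed, crude stand-in for real morphological matching.

```python
from collections import Counter


def concordance_counts(tokens, query):
    """Count matching word forms and their immediate left/right neighbors.

    A token matches if it contains the query string (a crude assumption;
    a real tool would match on morphology, not raw substrings).
    """
    forms, before, after = Counter(), Counter(), Counter()
    for i, token in enumerate(tokens):
        if query not in token:
            continue
        forms[token] += 1
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return forms, before, after


if __name__ == "__main__":
    # Toy sentence invented for this example.
    sample = "ذهب إلى المكتب ثم عاد إلى مكتبهم".split()
    forms, before, after = concordance_counts(sample, "مكتب")
    print(forms.most_common())
    print(before.most_common())
    print(after.most_common())
```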

Written by Spence

January 11th, 2011 at 11:58 am

Posted in Arabic,Corpora,NLP