New Arabic Word Segmenter
We just released the Arabic word segmenter that was developed last summer at Google. The segmenter is written in Java and has no external dependencies; in particular, it does not require the Standard Arabic Morphological Analyzer (SAMA) from the LDC. You can get it from the Stanford Word Segmenter download page.
The segmenter produces Penn Arabic Treebank (ATB) 3 clitic segmentation. Of course, sometimes you might want finer or coarser levels of segmentation, but I’ve found that this segmentation scheme is very effective for both translation and parsing. Our segmenter is based on a conditional random field (CRF) sequence classifier, so it processes raw text very quickly.
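To make the sequence-classifier idea concrete, here is a minimal sketch of how per-character labels can be turned back into clitic segments. It is not the segmenter's actual code or label set: the two-label scheme ("B" starts a segment, "I" continues one), the function name `segments_from_labels`, and the hand-assigned labels are all assumptions for illustration. The example word is written in Buckwalter transliteration ("wktAbh", "and his book").

```python
def segments_from_labels(word, labels):
    """Split `word` at every character labeled "B" (hypothetical label scheme)."""
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    segments = []
    for ch, label in zip(word, labels):
        # Start a new segment on "B", or on the first character regardless of label.
        if label == "B" or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments


# Labels are assigned by hand here; in a real system a trained CRF would predict them.
word = "wktAbh"
labels = ["B", "B", "I", "I", "I", "B"]
print(segments_from_labels(word, labels))  # ['w', 'ktAb', 'h']
```

Once a model has assigned one label per character, recovering the segments is a single left-to-right pass like this one, which is cheap compared with running a full morphological analyzer.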