Spence Green

التكرار يعلم الحمار (Repetition teaches the donkey)

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

New Arabic Word Segmenter

We just released the Arabic word segmenter that was developed last summer at Google. The segmenter is written in Java and has no external dependencies, such as the Standard Arabic Morphological Analyzer (SAMA) from the LDC. You can get it from the Stanford Word Segmenter download page.

The segmenter produces Penn Arabic Treebank (ATB) 3 clitic segmentation. Of course, sometimes you might want finer or coarser levels of segmentation, but I’ve found that this segmentation scheme is very effective for both translation and parsing. Our segmenter is based on a conditional random fields (CRF) sequence classifier, so it processes raw text very quickly.
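To make the sequence-classification framing concrete, here is a minimal sketch of how per-character labels can be turned back into clitic segments. This is not the released segmenter's code: the two-label scheme (`B` for a character that starts a segment, `I` for one that continues it), the function name `apply_labels`, and the example word are assumptions for illustration. In a real CRF segmenter the labels would be predicted by the model; here they are written by hand.

```python
from typing import List

BEGIN = "B"   # hypothetical label: this character starts a new segment
INSIDE = "I"  # hypothetical label: this character continues the current segment


def apply_labels(word: str, labels: List[str]) -> List[str]:
    """Split `word` into segments using one label per character."""
    if len(word) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    for char, label in zip(word, labels):
        if label not in (BEGIN, INSIDE):
            raise ValueError(f"unknown label: {label!r}")
        # The first character always opens a segment, whatever its label.
        if label == BEGIN or not segments:
            segments.append(char)
        else:
            segments[-1] += char
    return segments


# Illustrative word: conjunction + noun + pronominal suffix ("and their book").
word = "وكتابهم"
labels = list("BBIIIBI")  # hand-written here; a trained model would predict these
segments = apply_labels(word, labels)
print(" ".join(segments))
```

Framing segmentation this way is what lets a sequence classifier do the work: it only has to make one small decision per character, and the segments fall out of a single left-to-right pass.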

Written by Spence

April 23rd, 2012 at 2:29 pm

Posted in Arabic, NLP
