Spence Green

التكرار يعلم الحمار (“Repetition teaches the donkey”)

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘NLP’ Category

New Arabic Word Segmenter

We just released the Arabic word segmenter that was developed last summer at Google. The segmenter is written in Java and has no external dependencies; in particular, it does not require the Standard Arabic Morphological Analyzer (SAMA) from the LDC. You can get it from the Stanford Word Segmenter download page.

The segmenter produces Penn Arabic Treebank (ATB) 3 clitic segmentation. Of course, sometimes you might want finer or coarser levels of segmentation, but I’ve found that this segmentation scheme is very effective for both translation and parsing. Our segmenter is based on a conditional random fields (CRF) sequence classifier, so it processes raw text very quickly.

Written by Spence

April 23rd, 2012 at 2:29 pm

Posted in Arabic,NLP

Entity Clustering Across Languages

I posted the final version of our NAACL 2012 paper on entity clustering across languages. The idea is to identify textual mentions of entities in the world (e.g., people, places, and things) in multiple languages. We’ve tried to stay away from language-specific feature engineering. Moreover, we only used training resources that are abundant for many languages. For example, we make use of the topical structure of Wikipedia, in which a single topic (e.g., “Steve Jobs”) is discussed in many languages. Our techniques should be especially useful for low-resource languages.

We wrote the first few lines of code for this project in June 2010. The code for this paper now exceeds 20k lines, and I shudder to think of the rolling brownouts our experiments have likely caused throughout Maryland. We’ve burned up the COE computing cluster. I am thankful that this work will finally receive a public hearing.

Names are fascinating objects. They tend to originate in one language, usually the language of the culture in which the entity originates. Then the name spreads. Nicknames and aliases develop. New variants arise in other languages and writing systems. We’ve started to think of this phenomenon as a phylogenetic process, much like the proliferation of linguistic cognates or even bird species. Nick and Jason have been developing a model for this process, an initial version of which they recently presented at the NIPS NP Bayes workshop.

Written by Spence

April 9th, 2012 at 5:02 pm

Installing SRILM on Ubuntu 11.10

I recently upgraded to Ubuntu 11.10, which I had avoided because of the Unity fiasco and other reports of general “bugginess.” Ubuntu 10.04 had worked really well for me for years. One of my first tasks was to install SRILM. Previously, installation had proceeded without incident. Not this time. Here is how I got it to install in /usr/share/srilm on a 64-bit architecture:

  1. mkdir /usr/share/srilm
  2. mv srilm.tgz /usr/share/srilm
  3. cd /usr/share/srilm
  4. tar xzf srilm.tgz
  5. sudo apt-get install tcl tcl-dev csh gawk
  6. In Makefile, uncomment the SRILM= parameter and point it to /usr/share/srilm (or your equivalent path)
  7. make NO_TCL=1 MACHINE_TYPE=i686-ubuntu World
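  8. Add the following to your .bashrc (SRILM has to be defined there as well, because the PATH entries expand it)

export SRILM=/usr/share/srilm
export PATH=$PATH:$SRILM/bin:$SRILM/bin/i686-ubuntu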

Now you should be able to run ‘make test’ successfully.
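
If the tests fail, or commands are not found in a new terminal, a quick check that the SRILM binaries are actually visible on your PATH can help. The following is only a minimal sketch: the function name check_srilm_tools is hypothetical (it is not part of SRILM), and it assumes that ngram and ngram-count, two of the executables SRILM builds, are the tools you care about.

```shell
#!/bin/sh
# Hypothetical helper (not part of SRILM): report whether each named
# tool can be found on the current PATH. Returns non-zero if any is missing.
check_srilm_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# ngram and ngram-count are two of the executables that SRILM builds.
check_srilm_tools ngram ngram-count || echo "Check the SRILM and PATH lines in your .bashrc"
```

Run it in a fresh shell, so that .bashrc has been re-read. A "missing" line usually means that the machine-type directory under $SRILM/bin is not on PATH.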


Written by Spence

February 1st, 2012 at 5:42 pm

Posted in HOWTO,NLP,Ubuntu