Spence Green

التكرار يعلم الحمار (“Repetition teaches the donkey”)

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘NLP’ Category

Reading Up On Bayesian Methods


For the next few months I’ve decided to focus on semi-supervised learning in a Bayesian setting. At Johns Hopkins last summer I was introduced to “fancy generative models,” i.e., various flavors of the Dirichlet Process, but I was slow on the uptake. Now I’m trying to catch up. Here are some helpful reading lists:
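The Dirichlet Process mentioned above can be made concrete through its sequential view, the Chinese Restaurant Process. A minimal sketch (the function name and parameters are my own, not from any of the readings):

```python
import random

def crp_table_assignments(n_customers, alpha, seed=0):
    """Sample table assignments from a Chinese Restaurant Process,
    the sequential view of the Dirichlet Process: customer i sits at
    an existing table with probability proportional to its occupancy,
    or opens a new table with probability proportional to alpha."""
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignments = []  # assignments[i] = table index of customer i
    for i in range(n_customers):
        # Existing tables compete with one "new table" pseudo-option.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)  # open a new table
        tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp_table_assignments(100, alpha=1.0)
# The number of occupied tables grows roughly as alpha * log(n).
```

The “rich get richer” dynamic in the weights is what gives DP mixtures their ability to infer the number of clusters from data.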

In addition to a thorough understanding of MCMC, which is relatively simple, it’s also important to have at least an awareness of variational methods, which are relatively hard. Jason Eisner recently wrote a high-level introduction to variational inference that is a soft(er) encounter with the subject than the canonical reference:

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999.
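To illustrate why MCMC counts as “relatively simple,” here is a minimal random-walk Metropolis sampler for a univariate target known only up to a constant; the function name and defaults are mine:

```python
import math
import random

def metropolis(log_target, n_samples, step=1.0, x0=0.0, seed=1):
    """Random-walk Metropolis: propose x' = x + N(0, step^2), accept
    with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Work in log space; clamp at 0 so exp() never overflows.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples

# Target: a standard normal, specified only up to a constant.
samples = metropolis(lambda x: -0.5 * x * x, 20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The whole algorithm is an accept/reject loop; the hard part in practice is designing proposals that mix well, not the sampler itself.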

Where will this lead? It is often argued that the Bayesian framework offers a more appealing cognitive model. That may be. What interests me is the pairing of Bayesian updating with data collection from the web. Philip Resnik recently covered efforts to translate voicemails during the revolution in Egypt as one method of reconnecting that country with the world. This data is clearly useful, but what is unclear is how to use it to retrain standard (e.g., frequentist) probabilistic NLP models. Cache models, at least in principle, offer an alternative.
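The cache-model idea can be sketched in a few lines: interpolate a fixed background distribution with counts from recently observed text, so that new data shifts the model without a full retraining pass. A toy unigram version (class and parameter names are mine, and real cache LMs typically work at the n-gram level):

```python
from collections import Counter

class CacheUnigramModel:
    """Interpolates a fixed background unigram model with a cache of
    recently observed words:
        p(w) = (1 - lam) * p_bg(w) + lam * p_cache(w)
    Illustrative only."""

    def __init__(self, background, lam=0.2):
        self.background = background  # dict: word -> probability
        self.lam = lam
        self.cache = Counter()
        self.cache_total = 0

    def observe(self, word):
        self.cache[word] += 1
        self.cache_total += 1

    def prob(self, word):
        p_bg = self.background.get(word, 1e-6)  # crude OOV floor
        p_cache = (self.cache[word] / self.cache_total
                   if self.cache_total else 0.0)
        return (1 - self.lam) * p_bg + self.lam * p_cache

bg = {"the": 0.05, "revolution": 0.0001}
model = CacheUnigramModel(bg)
before = model.prob("revolution")
for w in ["the", "revolution", "revolution"]:
    model.observe(w)
after = model.prob("revolution")  # boosted by the cache counts
```

A burst of fresh in-domain text, like the translated voicemails, raises the probability of its vocabulary immediately, which is exactly the appeal.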

Written by Spence

February 8th, 2011 at 4:46 pm

Posted in Machine Learning, NLP

NLP Software That People Actually Use


A tired lament in NLP is that people don’t release their code, or that they release incomprehensible code, or that they wrote code in Haskell, or whatever. As models get more complicated, the burden of software engineering increases, making it hard to test new ideas quickly. Building on someone else’s unreleased model is getting to be like investing $100k without reading a prospectus. I’ve been thinking about the good libraries that people do actually use, and why people use them. Here is the list I made (in no particular order):

  1. OpenFST — Finite-state toolkit
  2. SRILM — Language modeling
  3. Charniak / Berkeley / Stanford / Bikel parsers — Statistical constituency parsing
  4. MST / MALT parsers — Dependency parsing
  5. Stanford NER system — Named entity recognition
  6. LingPipe — The kitchen sink
  7. Mallet — A smaller kitchen sink
  8. GIZA++ — Word alignment
  9. Moses — Phrase-based machine translation
  10. Joshua — Hierarchical machine translation

I don’t know the histories of all of these packages. But a few conservative generalizations are:

  • They work.
  • They don’t necessarily provide “best published” performance, but they get very close.
  • Most of them started as someone’s grad school project, or at least had significant student contributions.
  • You can easily name a person associated with all of them.

The end result: a good open-source package helps other people and makes you famous. That sounds like a good bargain.

Written by Spence

February 1st, 2011 at 4:14 pm

Posted in NLP, Software

HOWTO: Basic Arabic Preprocessing for NLP


Raw Arabic text is difficult to process. Errant diacritics, strange Unicode characters, and haphazard use of whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:

  • Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In NAACL. [pdf]
  • Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL. [pdf]
  • Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology. [pdf]

Recently I have had success with a less sophisticated normalization process, which requires these scripts and MADA (freely available from Columbia University). Two scripts are included:
  1. basic_ortho_norm.py — Simple orthographic normalization.
  2. run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).

Both of my 2010 conference papers used these scripts.
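For a flavor of what step 1 involves, here is a rough sketch of common Arabic orthographic normalizations; the actual rules in basic_ortho_norm.py may differ:

```python
import re

# Illustrative only; not the actual basic_ortho_norm.py rules.
TASHKEEL = re.compile(u"[\u064B-\u0652]")  # short vowels and other diacritics
TATWEEL = u"\u0640"                        # kashida (elongation character)
ALEF_VARIANTS = re.compile(u"[\u0622\u0623\u0625]")  # alef madda/hamza forms

def normalize(text):
    text = TASHKEEL.sub(u"", text)             # strip diacritics
    text = text.replace(TATWEEL, u"")          # drop elongation marks
    text = ALEF_VARIANTS.sub(u"\u0627", text)  # map alef variants to bare alef
    text = re.sub(r"\s+", u" ", text).strip()  # collapse haphazard whitespace
    return text
```

Normalizations like these trade a little information (diacritics can be meaningful) for a large reduction in sparsity, which is usually the right trade for statistical systems.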

Written by Spence

January 19th, 2011 at 6:48 am

Posted in Arabic, HOWTO, NLP