Spence Green

التكرار يعلم الحمار (Arabic proverb: "Repetition teaches the donkey")

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘HOWTO’ Category

Installing SRILM on Ubuntu 11.10

I recently upgraded to Ubuntu 11.10, which I had avoided because of the Unity fiasco and other reports of general “bugginess”; 10.04 had worked really well for me for years. One of my first tasks was to install SRILM. Previous installations had proceeded without incident. Not this time. Here is how I got it to install in /usr/share/srilm on a 64-bit architecture:

  1. mkdir /usr/share/srilm
  2. mv srilm.tgz /usr/share/srilm
  3. cd /usr/share/srilm
  4. tar xzf srilm.tgz
  5. sudo apt-get install tcl tcl-dev csh gawk
  6. In Makefile, uncomment the SRILM= parameter and point it to /usr/share/srilm (or your equivalent path)
  7. make NO_TCL=1 MACHINE_TYPE=i686-ubuntu World
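  8. Add the following to your .bashrc (SRILM must be set there as well so that the PATH entries expand; adjust the path if you installed elsewhere):

export SRILM=/usr/share/srilm
export PATH=$PATH:$SRILM/bin:$SRILM/bin/i686-ubuntu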

Now you should be able to run ‘make test’ successfully.
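
If ‘make test’ complains about missing commands, it can help to confirm that the two PATH entries from step 8 actually contain the SRILM binaries. The following is only a minimal sketch of such a check, not part of SRILM itself: the helper names srilm_path_entries and missing_tools are made up for illustration, and the default tool list (ngram and ngram-count) is just an assumption about which binaries you care about.

```python
import os
import shutil


def srilm_path_entries(srilm_root, machine_type):
    """Return the two directories that step 8 appends to PATH."""
    return [
        os.path.join(srilm_root, "bin"),
        os.path.join(srilm_root, "bin", machine_type),
    ]


def missing_tools(srilm_root, machine_type, tools=("ngram", "ngram-count")):
    """List the tools that are not executable in the SRILM bin directories.

    The default tool names are an assumption for this sketch; pass your own.
    """
    search_path = os.pathsep.join(srilm_path_entries(srilm_root, machine_type))
    return [t for t in tools if shutil.which(t, path=search_path) is None]


if __name__ == "__main__":
    # Paths taken from the steps above; change them to match your install.
    print(missing_tools("/usr/share/srilm", "i686-ubuntu"))
```

An empty list means every listed tool was found; anything else names the binaries that the build did not produce or that live under a different machine-type directory.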


Written by Spence

February 1st, 2012 at 5:42 pm

Posted in HOWTO,NLP,Ubuntu

HOWTO: Basic Arabic Preprocessing for NLP

Raw Arabic text is difficult to process. Errant diacritics, strange Unicode characters, and haphazard use of whitespace are all common obstacles for even basic tasks. For statistical systems, cliticization and morphological variation can induce sparsity. As a result, sophisticated preprocessing techniques have been developed, the best of which are described in these three papers:

  • Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In NAACL.
  • Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL.
  • Mona Diab, Kadri Hacioglu and Daniel Jurafsky. 2007. Automatic processing of Modern Standard Arabic text. In Arabic Computational Morphology.

Recently I have had success with a less sophisticated normalization process, which requires these scripts and MADA (freely available from Columbia University). Two scripts are included:

  1. basic_ortho_norm.py — Simple orthographic normalization.
  2. run_mada — Run script for MADA+TOKAN 3.1 that performs morphological analysis and clitic segmentation (like the Penn Arabic Treebank).

Both of my 2010 conference papers used these scripts.
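
To give a flavor of what simple orthographic normalization can look like, here is a minimal sketch. It is not the contents of basic_ortho_norm.py; the specific rules (stripping short-vowel diacritics, removing tatweel, collapsing alef variants to bare alef, and squeezing whitespace) are common choices that I am assuming for illustration, and the function name normalize is hypothetical.

```python
import re

# Fathatan (U+064B) through sukun (U+0652): the common short-vowel marks.
DIACRITICS = re.compile("[\u064B-\u0652]")
# Tatweel (kashida), used only to stretch words visually.
TATWEEL = "\u0640"
# Alef with madda above, hamza above, and hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"


def normalize(text):
    """Apply a few assumed orthographic normalization rules to Arabic text."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)
    # Collapse any run of whitespace to one space and trim the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    # A word written with kasra and fatha marks; the marks are removed.
    print(normalize("\u0643\u0650\u062A\u064E\u0627\u0628"))
```

Whether to collapse alef variants depends on the downstream task, since it trades a little ambiguity for less sparsity; treat each rule as optional.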

Written by Spence

January 19th, 2011 at 6:48 am

Posted in Arabic,HOWTO,NLP

HOWTO: WordPress, 1and1, and MySQL

For several months now I have been unable to automatically update WordPress. Then, this afternoon, I found that I could no longer update the site manually. A quick glance at the 2.9.1 release notes revealed the problem:

Requires MySQL 4.1.2 or greater (old requirement was 4.0).

A few searches revealed that many other 1&1 users have encountered the same issue, which I resolved by making two administrative changes:

  1. Ensure that WordPress is running on PHP 5 by adding a line to .htaccess in the WordPress root directory.
  2. Migrate the WordPress database to MySQL 5.0. I followed this set of instructions exactly.
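
Before migrating, it may be worth checking whether the database server really is below the 4.1.2 minimum quoted above. The sketch below compares a version string (such as the one MySQL reports for SELECT VERSION()) against that minimum. The function names and the sample version strings are hypothetical, and this is only one simple way to do the comparison.

```python
import re


def parse_version(version_string):
    """Return the leading dotted-numeric part of a version string as ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", version_string)
    if match is None:
        raise ValueError(f"no version number in {version_string!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_string, minimum=(4, 1, 2)):
    """True if the version is at least the minimum (default: MySQL 4.1.2)."""
    return parse_version(version_string) >= minimum


if __name__ == "__main__":
    # Sample strings are made up for illustration.
    for version in ("4.0.27", "5.0.96-log"):
        print(version, meets_minimum(version))
```

Comparing tuples of integers avoids the usual trap of comparing version strings as text, where "4.10" would sort before "4.2".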

Written by Spence

January 9th, 2010 at 11:49 pm

Posted in HOWTO