Spence Green

التكرار يعلم الحمار (Repetition teaches the donkey)

I work at Lilt. In addition to computers and languages, my interests include travel, running, and scuba diving.

Archive for the ‘Corpora’ Category

Uses of the Cross-Lingual Link Structure of Wikipedia

We recently became interested in obtaining topically-aligned data from Wikipedia via cross-lingual links. For example, we can use the document tuple (George Bush, جيورج بوش) for all sorts of things, even without sentence alignment. There have been a few recent, related papers, among them:

  1. Kevin Duh. 2011. Providing Cross-Lingual Editing Assistance to Wikipedia Users. In CICLing.
  2. Gerard de Melo and Gerhard Weikum. 2010. Untangling the Cross-Lingual Structure of Wikipedia. In ACL.
  3. Philipp Sorg and Philipp Cimiano. 2008. Enriching the Crosslingual Link Structure of Wikipedia – A Classification-Based Approach. In AAAI.

The consensus seems to be that the cross-lingual links need cleanup prior to information extraction. However, this observation appears to be language-pair specific: for English-Arabic at least, we have not noticed an improvement from applying the algorithms of [2].
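For anyone who wants to collect such document tuples, one possible starting point is the `langlinks` property of the MediaWiki query API. The sketch below is only an illustration under stated assumptions, not necessarily how we gathered our data: the helper names `langlinks_url` and `document_tuples` are made up for this example, and `SAMPLE` is a hand-written stand-in for an API response in the legacy JSON format (where the linked title sits under the `"*"` key), not real output.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint of the MediaWiki Action API.
API = "https://en.wikipedia.org/w/api.php"


def langlinks_url(titles, target_lang="ar"):
    """Build a query URL asking for the cross-lingual links of `titles`."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),
        "lllang": target_lang,  # restrict links to one target language
        "lllimit": "max",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def document_tuples(response):
    """Extract (source title, target title) pairs from a parsed response."""
    pairs = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages without a link in the target language have no "langlinks" key.
        for link in page.get("langlinks", []):
            pairs.append((page["title"], link["*"]))
    return pairs


# Hand-written sample in the shape of a legacy-format response (assumption:
# illustrative only; the page id and titles are not taken from a live query).
SAMPLE = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "title": "George Bush",
                "langlinks": [{"lang": "ar", "*": "جيورج بوش"}],
            },
            "2": {"pageid": 2, "title": "Some page without an Arabic link"},
        }
    }
}

if __name__ == "__main__":
    print(langlinks_url(["George Bush"]))
    print(document_tuples(SAMPLE))
```

Keeping the parsing separate from the HTTP request makes it easy to run the same extraction over a database dump instead of the live API.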


Written by Spence

February 8th, 2011 at 4:56 pm

Posted in Corpora, NLP

Awesome Arabic Corpus Search Tool

Evidently an Arabic corpus search tool has existed at BYU for some time, but a post on the Arabic LinguistList this morning brought it to my attention:

This is to announce that two new ‘sub’ corpora have been added to newspaper section of arabiCorpus.byu.edu:
Masri2010:
This is the entire year of 2010 worth of the newspaper Al-Masri Al-Yawm.  This paper was chosen partly because of its popularity, partly because it contrasts markedly in style from the Ahram, and partly because it is one of the papers that uses the new ‘quoting’ style: they actually write down what people say, even if it is in colloquial Arabic or some mixed form (look up وتعاليمها تخاخل الإنجيل using ‘string’ for a relatively hilarious example quoting Baba Shanouda during last summer's ‘divorce controversy’). (almost 14 million words)
ShuruqColumns:
This is a large set of columns from the Egyptian newspaper Al-Shuruuq.  This paper is reputed to have attracted some of the best editorial writers in Egypt, and many people buy it just for the writers and columns, rather than for the news.  This would be a good (small) corpus to use if you wanted samples of what is considered to be ‘fine’ current writing on politics and social life.  Writers include Fahmy Huwaidi, Khaled Al-Khamissi (of Taxi fame), Alaa’ Al-Aswaani (of Yaqubian Building fame), and many others.  Enjoy.  (about 2 million words)
I cannot contain my excitement. Not only does the search provide full citations, but it also shows frequency distributions of various word forms (e.g., مكتب => مكتبهم, المكتب) and of the tokens appearing before and after the query term. In the past I have used Google search as a corpus tool, but pollution of the Arabic web by chat forums makes it hard to find meaningful linguistic examples. Bravo, BYU.
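The word-form and neighbor counts are easy to approximate on a small local corpus. Here is a minimal sketch, assuming whitespace-tokenized text and a crude substring match to find word forms. This is not how arabiCorpus is implemented; the function names and the toy sentence are invented for illustration.

```python
from collections import Counter


def form_counts(tokens, stem):
    """Count surface forms that contain `stem` (crude substring match)."""
    return Counter(tok for tok in tokens if stem in tok)


def neighbor_counts(tokens, query):
    """Count the tokens immediately before and after each exact match of `query`."""
    before, after = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != query:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


# Toy corpus (made up); a real one would need proper Arabic tokenization.
TOKENS = "في المكتب الكبير زار مكتبهم ثم دخل المكتب".split()

if __name__ == "__main__":
    print(form_counts(TOKENS, "مكتب"))
    before, after = neighbor_counts(TOKENS, "المكتب")
    print(before)
    print(after)
```

Substring matching will overgenerate on real text (any word containing the same letter sequence counts as a form), so a morphological analyzer would be the more careful choice.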

Written by Spence

January 11th, 2011 at 11:58 am

Posted in Arabic, Corpora, NLP