Archive for the ‘Corpora’ Category
Uses of the Cross-Lingual Link Structure of Wikipedia
We recently became interested in obtaining topically-aligned data from Wikipedia via cross-lingual links. For example, we can use the document tuple (George Bush, جيورج بوش) for all sorts of things, even without sentence alignment. There have been a few recent, related papers, among them:
[1] Kevin Duh. 2011. Providing Cross-Lingual Editing Assistance to Wikipedia Users. In CICLing.
[2] Gerard de Melo and Gerhard Weikum. 2010. Untangling the Cross-Lingual Link Structure of Wikipedia. In ACL.
[3] Philipp Sorg and Philipp Cimiano. 2008. Enriching the Crosslingual Link Structure of Wikipedia – A Classification-Based Approach. In AAAI.
The consensus seems to be that cleanup is required prior to information extraction. However, this observation is language-pair specific. For English-Arabic at least, we have not noticed an improvement from applying the algorithms of [2].
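For readers who want to try collecting document tuples like the one above, here is a minimal sketch of one way to do it. It is not the pipeline we used: it assumes the public MediaWiki API (`action=query` with `prop=langlinks`) and its default JSON layout. The helper names `langlinks_url` and `extract_pairs` are made up for illustration, and `SAMPLE` is a hand-written response in that layout, not real API output.

```python
import json
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint of the public MediaWiki API.
API_URL = "https://en.wikipedia.org/w/api.php"


def langlinks_url(title, target_lang):
    """Build a query URL asking for one article's link into target_lang."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target_lang,  # restrict results to a single language
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_pairs(response_text):
    """Return (source_title, target_title) tuples from a langlinks response."""
    data = json.loads(response_text)
    pairs = []
    for page in data.get("query", {}).get("pages", {}).values():
        # Pages without a link in the requested language have no "langlinks" key.
        for link in page.get("langlinks", []):
            pairs.append((page["title"], link["*"]))
    return pairs


# Hand-written sample in the default JSON layout (illustrative only).
SAMPLE = json.dumps({
    "query": {"pages": {"1": {
        "pageid": 1, "ns": 0, "title": "George Bush",
        "langlinks": [{"lang": "ar", "*": "جيورج بوش"}],
    }}}
})

if __name__ == "__main__":
    print(langlinks_url("George Bush", "ar"))
    print(extract_pairs(SAMPLE))
```

The URL from `langlinks_url` can be fetched with any HTTP client and the body passed to `extract_pairs`. Note that this only reads the links as they stand; it does none of the cleanup discussed in the papers above.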
Awesome Arabic Corpus Search Tool
Evidently an Arabic corpus search tool has existed at BYU for some time, but a post on the Arabic LinguistList this morning brought it to my attention:
This is to announce that two new ‘sub’ corpora have been added to the newspaper section of arabiCorpus.byu.edu:
- Masri2010: This is the entire year of 2010 worth of the newspaper Al-Masri Al-Yawm. This paper was chosen partly because of its popularity, partly because it contrasts markedly in style from the Ahram, and partly because it is one of the papers that uses the new ‘quoting’ style: they actually write down what people say, even if it is in colloquial Arabic or some mixed form (look up وتعاليمها تخاخل الإنجيل using ‘string’ for a relatively hilarious example quoting Baba Shanouda during last summer's ‘divorce controversy’). (almost 14 million words)
- ShuruqColumns: This is a large set of columns from the Egyptian newspaper Al-Shuruuq. This paper is reputed to have attracted some of the best editorial writers in Egypt, and many people buy it just for the writers and columns, rather than for the news. This would be a good (small) corpus to use if you wanted samples of what is considered to be ‘fine’ current writing on politics and social life. Writers include Fahmy Huwaidi, Khaled Al-Khamissi (of Taxi fame), Alaa’ Al-Aswaani (of Yaqubian Building fame), and many others. Enjoy. (about 2 million words)