Abstract
This paper presents the design scheme and details of
the first large publicly available corpus of the Urdu
language. It covers the collection and cleaning
techniques used for the first 100k-word derivative of the
larger corpus, and corpus design issues such as size and
genres, along with their ratios. The same design
and techniques are being scaled to develop larger
derivatives of the corpus with 500k, 1000k and 5000k
words. Owing to its public license, the corpus will
contribute significantly to the linguistic and
computational analysis of Urdu.
Saba Urooj, Farah Adeeba, Sarmad Hussain, Farhat Jabeen, Rahila Parveen (2012). CLE Urdu Digest Corpus. Conference on Language and Technology 2012.