Center for Language Engineering
Publisher
Pakistan
Country
Lahore
City
09-11-2012
Start date
10-11-2012
End date


Abstract
This paper presents the design scheme and details of the first large, publicly available corpus of the Urdu language. It covers the collection and cleaning techniques used for the first 100k-word derivative of the larger corpus, as well as corpus design issues such as size and the genres included, along with their ratios. The same design and techniques are being scaled up to develop larger derivatives of the corpus with 500k, 1000k and 5000k words. Because of its public license, the corpus will contribute significantly to both linguistic and computational analysis of Urdu.

Saba Urooj, Farah Adeeba, Sarmad Hussain, Farhat Jabeen, Rahila Parveen (2012). CLE Urdu Digest Corpus. Conference on Language and Technology 2012.