Abstract
Word segmentation is the process of dividing a
sentence into its constituent words. For Natural
Language Processing, word segmentation is an initial
and obligatory step. Word segmentation has been
studied for many languages, including English,
Dutch, Chinese, Norwegian, and Swedish, but this
research focuses on the Urdu language. Unlike
English, words in Urdu are not always separated by
spaces, and spaces are not used consistently, which
gives rise to both space omission and space insertion
errors. These errors are the major challenge for the
segmentation task. This paper discusses
the problems of Urdu word segmentation and
suggests a solution to the space omission and space
insertion problems. First, the clustered words are
segmented, and then each clustered word is divided
into valid words. We use a dictionary for marking word
boundaries and for validating that each word is
segmented correctly. This technique can be used for
any application of Urdu text. This work has been tested
on words collected from the Geo, Jang, and BBC news
sites and other online documents available on the
internet. The proposed solution was tested on 11,995
words, and the result is around 97.2%.
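
The dictionary-based step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the objective the dynamic program optimizes, so the sketch assumes the goal is to split a cluster into the fewest dictionary words. The function name `segment` and the toy dictionary (English placeholders standing in for Urdu entries) are hypothetical.

```python
from typing import List, Optional, Set


def segment(cluster: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Split `cluster` into the fewest dictionary words.

    Returns None when no full segmentation exists.
    Assumption: "fewest words" is the objective; the paper may use another.
    """
    n = len(cluster)
    max_len = max((len(w) for w in dictionary), default=0)

    # best[i] = fewest words covering cluster[:i], or None if unreachable.
    best: List[Optional[int]] = [None] * (n + 1)
    # back[i] = start index of the last word in the best split of cluster[:i].
    back: List[int] = [-1] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        # Only look back as far as the longest dictionary entry.
        for start in range(max(0, end - max_len), end):
            if best[start] is None or cluster[start:end] not in dictionary:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] is None:
        return None

    # Follow the back-pointers from the end to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        words.append(cluster[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Hypothetical toy dictionary; a real system would load an Urdu word list.
toy_dictionary = {"space", "omission", "error", "spa", "ce"}
print(segment("spaceomissionerror", toy_dictionary))
```

A `None` result marks a cluster that the dictionary cannot fully cover, which corresponds to using the dictionary to validate that a segmentation is correct.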
Rabiya Rashid, Seemab Latif. (2012) A Dictionary Based Urdu Word Segmentation Using Dynamic Programming for Space Omission Problem, Conference on Language and Technology 2012.