Abstract
While the World Wide Web is an attractive resource, few researchers can access or manage a Web-scale corpus. Instead, they use search-hit counts as a substitute for direct measurements on a Web corpus. In contrast, one can download a small, high-quality corpus like Wikipedia and carry out exact measurements. Through extensive experiments with multiple word-association measures and several public datasets, we show that for exploring document-level co-occurrence-based word associations, Wikipedia, despite being three orders of magnitude smaller, is a reasonable alternative to a Web corpus that can only be accessed through search engines. Further, with Wikipedia, one can carry out measurements at a granularity finer than the document: instead of document-level co-occurrence, one can consider a word-pair occurrence significant only if the two words occur within a certain threshold distance of each other. In general, such fine-grained information cannot be obtained from search engines. Our experiments show that word-level co-occurrence measures perform better than document-level measures. This indicates another practical advantage of Wikipedia, or any other downloadable corpus, over a Web corpus that can only be accessed through search engines.
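
The distinction between document-level and word-level co-occurrence can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the toy documents, and the choice of pointwise mutual information (PMI) as the association measure are all assumptions made for illustration, since the abstract does not name the measures used.

```python
import math

def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return text.lower().split()

def doc_cooccurrence(docs, w1, w2):
    """Count documents that contain both words anywhere in the text."""
    count = 0
    for doc in docs:
        tokens = set(tokenize(doc))
        if w1 in tokens and w2 in tokens:
            count += 1
    return count

def window_cooccurrence(docs, w1, w2, max_dist):
    """Count documents where some occurrence of w1 and some occurrence
    of w2 are at most max_dist token positions apart."""
    count = 0
    for doc in docs:
        tokens = tokenize(doc)
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        if any(abs(i - j) <= max_dist for i in pos1 for j in pos2):
            count += 1
    return count

def doc_pmi(docs, w1, w2):
    """Document-level PMI, used here as one illustrative association
    measure (an assumption, not taken from the paper).
    Returns None when any of the required counts is zero."""
    n = len(docs)
    f1 = sum(1 for d in docs if w1 in set(tokenize(d)))
    f2 = sum(1 for d in docs if w2 in set(tokenize(d)))
    f12 = doc_cooccurrence(docs, w1, w2)
    if n == 0 or f1 == 0 or f2 == 0 or f12 == 0:
        return None
    return math.log((n * f12) / (f1 * f2))

# Hypothetical toy corpus; a real experiment would use a Wikipedia dump.
docs = [
    "the doctor spoke to the nurse about the patient",
    "the doctor left early and much later in the long day a nurse arrived",
    "the nurse checked the chart",
    "a river bank flooded",
]

doc_count = doc_cooccurrence(docs, "doctor", "nurse")
near_count = window_cooccurrence(docs, "doctor", "nurse", max_dist=5)
```

In this toy corpus the document-level count treats the first two documents alike, while the windowed count with a threshold of 5 tokens keeps only the first, where the two words are close together. A search engine that returns hit counts can supply the former kind of number but generally not the latter.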

Om P. Damani, Pankhil Chedda, Dipak Chaudhari. (2012) Wikipedia is a Practical Alternative to the Web for measuring Co-occurrence based Word Association, Conference on Language and Technology 2012.
Publisher: Center for Language Engineering
Country: Pakistan
City: Lahore
From: 09-11-2012
To: 10-11-2012