An unsupervised method for discovering lexical variations in Roman Urdu informal text

Abdul Rafae, Abdul Qayyum, Muhammad Moeenuddin, Asim Karim, Hassan Sajjad, Faisal Kamiran

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

We present an unsupervised method for finding lexical variations in Roman Urdu informal text. Our method comprises a phonetic algorithm, UrduPhone; a feature-based similarity function; and a clustering algorithm, Lex-C. UrduPhone encodes Roman Urdu strings into their phonetic equivalent representations, which produces an initial grouping of the different spelling variations of a word. The similarity function incorporates word features and their context. Lex-C is a variant of the k-medoids clustering algorithm that groups lexical variations. It incorporates a similarity threshold to balance the number of clusters and their maximum similarity. We test our system on two datasets of SMS and blogs and show an F-measure gain of up to 12% over baseline systems.
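The pipeline outlined in the abstract (phonetic grouping, then similarity-based clustering with a threshold) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_phonetic_key` is a generic Soundex-like encoding and not the UrduPhone mapping, `similarity` uses plain string similarity and omits the word and context features, and `threshold_medoid_clusters` is a simple threshold-driven medoid procedure and not the Lex-C algorithm. The example words are illustrative and not taken from the paper's datasets.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical consonant classes (Soundex-like); NOT the UrduPhone mapping.
CODES = {}
for letters, digit in [("bfpvw", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def toy_phonetic_key(word, length=6):
    """Keep the first letter, encode later consonants, drop vowels, pad with zeros."""
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:  # skip vowels and repeated adjacent codes
            key += code
        prev = code
    return key[:length].ljust(length, "0")


def similarity(a, b):
    """Stand-in similarity: string overlap only (no word or context features)."""
    return SequenceMatcher(None, a, b).ratio()


def threshold_medoid_clusters(words, threshold=0.7, max_iter=10):
    """Assign each word to its most similar medoid, or open a new cluster
    when no medoid reaches the threshold; then recompute the medoids."""
    medoids, clusters = [], []
    for _ in range(max_iter):
        clusters = [[] for _ in medoids]
        for w in words:
            best = max(range(len(medoids)),
                       key=lambda i: similarity(w, medoids[i]), default=None)
            if best is None or similarity(w, medoids[best]) < threshold:
                medoids.append(w)       # w becomes the medoid of a new cluster
                clusters.append([w])
            else:
                clusters[best].append(w)
        clusters = [c for c in clusters if c]
        # New medoid: the member with the highest total similarity to its cluster.
        new_medoids = [max(c, key=lambda m: sum(similarity(m, x) for x in c))
                       for c in clusters]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


def discover_variants(words, threshold=0.7):
    """Group words by phonetic key, then cluster within each group."""
    groups = defaultdict(list)
    for w in dict.fromkeys(words):  # de-duplicate while keeping order
        groups[toy_phonetic_key(w)].append(w)
    result = []
    for group in groups.values():
        result.extend(threshold_medoid_clusters(group, threshold))
    return result


if __name__ == "__main__":
    for cluster in discover_variants(["zindagi", "zindgi", "zndagi", "kitab", "kitaab"]):
        print(cluster)
```

In this sketch, raising `threshold` yields more, tighter clusters and lowering it yields fewer, looser ones, which is the kind of trade-off the abstract attributes to the similarity threshold.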

Original language: English
Title of host publication: Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics (ACL)
Pages: 823-828
Number of pages: 6
ISBN (Print): 9781941643327
Publication status: Published - 2015
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Lisbon, Portugal
Duration: 17 Sep 2015 to 21 Sep 2015

Other

Other: Conference on Empirical Methods in Natural Language Processing, EMNLP 2015
Country: Portugal
City: Lisbon
Period: 17/9/15 to 21/9/15

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Rafae, A., Qayyum, A., Moeenuddin, M., Karim, A., Sajjad, H., & Kamiran, F. (2015). An unsupervised method for discovering lexical variations in Roman Urdu informal text. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing (pp. 823-828). Association for Computational Linguistics (ACL).