An unsupervised method for discovering lexical variations in Roman Urdu informal text

Abdul Rafae, Abdul Qayyum, Muhammad Moeenuddin, Asim Karim, Hassan Sajjad, Faisal Kamiran

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

We present an unsupervised method to find lexical variations in Roman Urdu informal text. Our method includes a phonetic algorithm, UrduPhone; a feature-based similarity function; and a clustering algorithm, Lex-C. UrduPhone encodes Roman Urdu strings to their phonetic equivalent representations. This produces an initial grouping of different spelling variations of a word. The similarity function incorporates word features and their context. Lex-C is a variant of the k-medoids clustering algorithm that groups lexical variations. It incorporates a similarity threshold to balance the number of clusters and their maximum similarity. We test our system on two datasets of SMS and blogs and show an f-measure gain of up to 12% over baseline systems.
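The three stages named in the abstract (phonetic encoding, similarity scoring, threshold-based medoid clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the UrduPhone encoding rules, the features used by the similarity function, or the Lex-C update steps. The sketch therefore substitutes a toy vowel-dropping key for UrduPhone, the standard-library `difflib` string ratio for the feature-based similarity, and a simple assign-then-update medoid loop for Lex-C. All function names (`phonetic_key`, `similarity`, `cluster_with_threshold`, `find_variants`) and the default threshold are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def phonetic_key(word: str) -> str:
    """Toy phonetic key (an assumption, NOT UrduPhone): keep the first
    letter, drop later vowels and 'y', collapse repeated consonants."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: plain string ratio in [0, 1]. The paper's
    function also uses word features and context, omitted here."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_with_threshold(words, threshold, max_iter=10):
    """Medoid clustering where a word joins its most similar medoid only
    if the similarity reaches `threshold`; otherwise it starts a new
    cluster. A simplified stand-in for Lex-C, not the published algorithm."""
    words = list(dict.fromkeys(words))  # de-duplicate, keep order
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {}
        for w in words:
            best = max(medoids, key=lambda m: similarity(w, m), default=None)
            if best is None or similarity(w, best) < threshold:
                medoids.append(w)  # w becomes the medoid of a new cluster
                best = w
            clusters.setdefault(best, []).append(w)
        # Update step: the new medoid is the member most similar to the rest.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    return list(clusters.values())


def find_variants(words, threshold=0.7):
    """Group words by phonetic key, then refine each group by clustering."""
    groups = defaultdict(list)
    for w in dict.fromkeys(words):
        groups[phonetic_key(w)].append(w)
    result = []
    for members in groups.values():
        result.extend(cluster_with_threshold(members, threshold))
    return result


if __name__ == "__main__":
    sample = ["mohabbat", "muhabbat", "mohabat", "kitab", "kitaab"]
    for cluster in find_variants(sample):
        print(cluster)
```

In this sketch, raising `threshold` yields more, tighter clusters and lowering it yields fewer, looser ones, which is the trade-off between cluster count and within-cluster similarity that the abstract attributes to the similarity threshold.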

Original language: English
Title of host publication: Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics (ACL)
Pages: 823-828
Number of pages: 6
ISBN (Print): 9781941643327
Publication status: Published - 2015
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2015 - Lisbon, Portugal
Duration: 17 Sep 2015 to 21 Sep 2015

Other

Other: Conference on Empirical Methods in Natural Language Processing, EMNLP 2015
Country: Portugal
City: Lisbon
Period: 17/9/15 to 21/9/15

Fingerprint

Speech analysis
Clustering algorithms
Blogs

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Rafae, A., Qayyum, A., Moeenuddin, M., Karim, A., Sajjad, H., & Kamiran, F. (2015). An unsupervised method for discovering lexical variations in Roman Urdu informal text. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing (pp. 823-828). Association for Computational Linguistics (ACL).

@inproceedings{4ef9d03e01ae4482a759985fe0b086e0,
title = "An unsupervised method for discovering lexical variations in Roman Urdu informal text",
abstract = "We present an unsupervised method to find lexical variations in Roman Urdu informal text. Our method includes a phonetic algorithm, UrduPhone; a feature-based similarity function; and a clustering algorithm, Lex-C. UrduPhone encodes Roman Urdu strings to their phonetic equivalent representations. This produces an initial grouping of different spelling variations of a word. The similarity function incorporates word features and their context. Lex-C is a variant of the k-medoids clustering algorithm that groups lexical variations. It incorporates a similarity threshold to balance the number of clusters and their maximum similarity. We test our system on two datasets of SMS and blogs and show an f-measure gain of up to 12{\%} over baseline systems.",
author = "Abdul Rafae and Abdul Qayyum and Muhammad Moeenuddin and Asim Karim and Hassan Sajjad and Faisal Kamiran",
year = "2015",
language = "English",
isbn = "9781941643327",
pages = "823--828",
booktitle = "Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing",
publisher = "Association for Computational Linguistics (ACL)",

}

TY - GEN

T1 - An unsupervised method for discovering lexical variations in Roman Urdu informal text

AU - Rafae, Abdul

AU - Qayyum, Abdul

AU - Moeenuddin, Muhammad

AU - Karim, Asim

AU - Sajjad, Hassan

AU - Kamiran, Faisal

PY - 2015

Y1 - 2015

AB - We present an unsupervised method to find lexical variations in Roman Urdu informal text. Our method includes a phonetic algorithm, UrduPhone; a feature-based similarity function; and a clustering algorithm, Lex-C. UrduPhone encodes Roman Urdu strings to their phonetic equivalent representations. This produces an initial grouping of different spelling variations of a word. The similarity function incorporates word features and their context. Lex-C is a variant of the k-medoids clustering algorithm that groups lexical variations. It incorporates a similarity threshold to balance the number of clusters and their maximum similarity. We test our system on two datasets of SMS and blogs and show an f-measure gain of up to 12% over baseline systems.

UR - http://www.scopus.com/inward/record.url?scp=84959922739&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959922739&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781941643327

SP - 823

EP - 828

BT - Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing

PB - Association for Computational Linguistics (ACL)

ER -