Typesetting for improved readability using lexical and syntactic information

Ahmed Salama, Kemal Oflazer, Susan Hagan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of fixed width, using an otherwise standard dynamic programming line-breaking algorithm to minimize raggedness. In addition to a rule-based baseline segmenter, we use a modestly sized text, manually annotated with break positions, to train a maximum entropy classifier that relies on an extensive set of lexical and syntactic features and predicts whether or not to break after a given word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features that optimizes F1, arriving at a feature set that delivers 89.2% Precision, 90.2% Recall (89.7% F1) on a test set, improving on the rule-based baseline by about 11 points and on the classifier trained on all features by about 1 point in F1.
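As an illustration of the typesetting step described in the abstract, the following is a minimal sketch of dynamic programming line breaking over unbreakable units. It is not the authors' implementation: the function name `break_lines`, the squared-slack cost, the unpenalized last line, and the hand-written units in the usage example are all assumptions made for illustration. In the paper the units come from the rule-based segmenter or the classifier, not from a hard-coded list.

```python
from typing import List, Sequence


def break_lines(units: Sequence[str], width: int) -> List[str]:
    """Lay out unbreakable units in lines of at most `width` characters.

    Hypothetical sketch: minimizes the sum of squared trailing slack over
    all lines except the last (an assumed raggedness measure).
    """
    n = len(units)
    inf = float("inf")
    # best[i]: minimal cost of typesetting units[i:]
    # nxt[i]:  index of the first unit on the line after the one starting at i
    best = [inf] * (n + 1)
    nxt = [n] * (n + 1)
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the separator space counted for the first unit
        for j in range(i, n):
            length += 1 + len(units[j])
            if length > width and j > i:
                break  # units[i..j] no longer fit on one line
            # A single unit wider than the region still gets its own line.
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2  # last line is free
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1

    # Follow the stored break points to build the lines.
    lines = []
    i = 0
    while i < n:
        lines.append(" ".join(units[i:nxt[i]]))
        i = nxt[i]
    return lines


# Usage: each string is one unit that must never be split across lines.
units = ["the quick", "brown fox", "jumps over", "the lazy dog"]
for line in break_lines(units, width=20):
    print(line)
```

Because breaks are only considered between units, a phrase such as "the lazy dog" always stays on one line, at the price of a more ragged right edge than word-level breaking would give.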

Original language: English
Title of host publication: ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 719-724
Number of pages: 6
Volume: 2
ISBN (Print): 9781937284510
Publication status: Published - 2013
Event: 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013 - Sofia
Duration: 4 Aug 2013 to 9 Aug 2013

Other

Other: 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013
City: Sofia
Period: 4/8/13 to 9/8/13

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Salama, A., Oflazer, K., & Hagan, S. (2013). Typesetting for improved readability using lexical and syntactic information. In ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Vol. 2, pp. 719-724). Association for Computational Linguistics (ACL).

@inproceedings{7996e9cbd8334e22b34df52b3e8435b5,
title = "Typesetting for improved readability using lexical and syntactic information",
abstract = "We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of fixed width, using an otherwise standard dynamic programming line-breaking algorithm to minimize raggedness. In addition to a rule-based baseline segmenter, we use a modestly sized text, manually annotated with break positions, to train a maximum entropy classifier that relies on an extensive set of lexical and syntactic features and predicts whether or not to break after a given word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features that optimizes F1, arriving at a feature set that delivers 89.2{\%} Precision, 90.2{\%} Recall (89.7{\%} F1) on a test set, improving on the rule-based baseline by about 11 points and on the classifier trained on all features by about 1 point in F1.",
author = "Ahmed Salama and Kemal Oflazer and Susan Hagan",
year = "2013",
language = "English",
isbn = "9781937284510",
volume = "2",
pages = "719--724",
booktitle = "ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference",
publisher = "Association for Computational Linguistics (ACL)",
}

TY - GEN

T1 - Typesetting for improved readability using lexical and syntactic information

AU - Salama, Ahmed

AU - Oflazer, Kemal

AU - Hagan, Susan

PY - 2013

Y1 - 2013

N2 - We present results from our study, which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of fixed width, using an otherwise standard dynamic programming line-breaking algorithm to minimize raggedness. In addition to a rule-based baseline segmenter, we use a modestly sized text, manually annotated with break positions, to train a maximum entropy classifier that relies on an extensive set of lexical and syntactic features and predicts whether or not to break after a given word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features that optimizes F1, arriving at a feature set that delivers 89.2% Precision, 90.2% Recall (89.7% F1) on a test set, improving on the rule-based baseline by about 11 points and on the classifier trained on all features by about 1 point in F1.

UR - http://www.scopus.com/inward/record.url?scp=84907321012&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84907321012&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781937284510

VL - 2

SP - 719

EP - 724

BT - ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

PB - Association for Computational Linguistics (ACL)

ER -