Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora

Irina Temnikova, William A. Baumgartner, Negacy D. Hailu, Ivelina Nikolova, Tony McEnery, Adam Kilgarriff, Galia Angelova, K. Bretonnel Cohen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed - English (Germanic), and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

Original language: English
Title of host publication: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Publisher: European Language Resources Association (ELRA)
Pages: 1714-1718
Number of pages: 5
ISBN (Electronic): 9782951740884
Publication status: Published - 1 Jan 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: 26 May 2014 to 31 May 2014

Fingerprint

deviant behavior
language
scientific journal
license
patent
semantics
Toolkit
Corpus Analysis
Closure
Representativeness
Deviance
software
Open Source
Language Families
Syntax
Lexical Semantics

Keywords

  • Corpus linguistics
  • Sublanguage characterisation
  • Sublanguage recognition

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Education
  • Language and Linguistics

Cite this

Temnikova, I., Baumgartner, W. A., Hailu, N. D., Nikolova, I., McEnery, T., Kilgarriff, A., ... Cohen, K. B. (2014). Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014 (pp. 1714-1718). European Language Resources Association (ELRA).

@inproceedings{c59dc552475448b9a177ab0b6851c650,
title = "Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora",
abstract = "Sublanguages are varieties of language that form {"}subsets{"} of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed - English (Germanic), and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.",
keywords = "Corpus linguistics, Sublanguage characterisation, Sublanguage recognition",
author = "Irina Temnikova and Baumgartner, {William A.} and Hailu, {Negacy D.} and Ivelina Nikolova and Tony McEnery and Adam Kilgarriff and Galia Angelova and Cohen, {K. Bretonnel}",
year = "2014",
month = "1",
day = "1",
language = "English",
pages = "1714--1718",
booktitle = "Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014",
publisher = "European Language Resources Association (ELRA)",

}

TY - GEN
T1 - Sublanguage Corpus Analysis Toolkit
T2 - A tool for assessing the representativeness and sublanguage characteristics of corpora
AU - Temnikova, Irina
AU - Baumgartner, William A.
AU - Hailu, Negacy D.
AU - Nikolova, Ivelina
AU - McEnery, Tony
AU - Kilgarriff, Adam
AU - Angelova, Galia
AU - Cohen, K. Bretonnel
PY - 2014/1/1
Y1 - 2014/1/1
N2 - Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed - English (Germanic), and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
KW - Corpus linguistics
KW - Sublanguage characterisation
KW - Sublanguage recognition
UR - http://www.scopus.com/inward/record.url?scp=85015717253&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85015717253&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85015717253
SP - 1714
EP - 1718
BT - Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
PB - European Language Resources Association (ELRA)
ER -