Detecting semantic uncertainty by learning hedge cues in sentences using an HMM

Xiujun Li, Wei Gao, Jude W. Shavlik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Detecting speculative assertions is essential to distinguishing semantically uncertain information from factual information in text. This is critical to the trustworthiness of many intelligent systems built on information retrieval and natural language processing techniques, such as question answering or information extraction. We empirically explore three fundamental issues in uncertainty detection: (1) the predictive ability of different learning methods on this task; (2) whether using unlabeled data can lead to a more accurate model; and (3) whether closed-domain or cross-domain training is better. For these purposes, we adopt two statistical learning approaches to this problem: the commonly used bag-of-words model based on Naive Bayes, and a sequence-labeling approach using a Hidden Markov Model (HMM). We empirically compare our two approaches with each other, and externally against prior results on the CoNLL-2010 Shared Task 1. Overall, our results are promising: (1) on the Wikipedia and biomedical datasets, the HMM model improves over Naive Bayes by up to 17.4% and 29.0%, respectively, in absolute F score; (2) compared to the CoNLL-2010 systems, our best HMM model achieves a 62.9% F score with MLE parameter estimation and 64.0% with EM parameter estimation on the Wikipedia dataset, both outperforming the best CoNLL-2010 result (60.2%), though our results on the biomedical dataset are less impressive; (3) when a model's expressive power is limited (e.g., Naive Bayes), cross-domain training is helpful, but when a model is powerful (e.g., an HMM), cross-domain training may produce biased parameters; and (4) under Maximum Likelihood Estimation, combining unlabeled examples with labeled ones helps.
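The abstract names the two learners but not their implementation. As a rough sketch of the bag-of-words Naive Bayes baseline it describes, the Python below trains add-alpha-smoothed class priors and per-class word likelihoods, then labels a sentence by the maximum-posterior class. The toy sentences, the label names, and the smoothing constant are illustrative assumptions, not the paper's setup.

```python
# A minimal bag-of-words Naive Bayes sketch for sentence-level uncertainty
# detection. Training data, label names, and alpha are illustrative
# assumptions, not the authors' implementation.
import math
from collections import Counter, defaultdict

def train_nb(sentences, labels, alpha=1.0):
    """Estimate log-priors and per-class word log-likelihoods with
    add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)          # label -> word -> count
    vocab = set()
    for words, label in zip(sentences, labels):
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) /
                                  (total + alpha * len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(words, log_prior, log_lik, vocab):
    """Return the label maximizing log P(c) + sum over words of log P(w|c);
    out-of-vocabulary words are simply skipped."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy usage with hypothetical hedge-bearing sentences.
train_x = [["this", "may", "suggest", "a", "link"],
           ["the", "protein", "binds", "the", "receptor"]]
train_y = ["uncertain", "certain"]
model = train_nb(train_x, train_y)
print(classify(["results", "may", "indicate"], *model))   # -> "uncertain"
```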
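For the sequence-labeling alternative, here is a minimal HMM sketch: transition and emission probabilities are MLE counts over per-token tags, decoding is standard Viterbi, and a sentence is flagged uncertain if any token receives a cue tag. The two-tag scheme (CUE vs. O), the add-alpha smoothing, and the unknown-word fallback are assumptions for illustration; the paper's EM variant would re-estimate the same tables from unlabeled sentences via forward-backward (Baum-Welch), which is omitted here.

```python
# An HMM hedge-cue tagger sketch: MLE (count-based) parameter estimation
# plus Viterbi decoding. Tag set and smoothing are illustrative assumptions.
import math
from collections import Counter, defaultdict

STATES = ["O", "CUE"]

def mle_estimate(tagged_sents, alpha=0.1):
    """Count transitions and emissions from sentences of (word, tag) pairs;
    logp applies add-alpha smoothing when a probability is queried."""
    trans = defaultdict(Counter)   # previous tag -> next tag -> count
    emit = defaultdict(Counter)    # tag -> word -> count
    vocab = set()
    for sent in tagged_sents:
        prev = "<s>"               # sentence-start symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    def logp(counter, key, size):
        total = sum(counter.values())
        return math.log((counter[key] + alpha) / (total + alpha * size))
    return trans, emit, vocab, logp

def viterbi(words, trans, emit, vocab, logp):
    """Standard Viterbi decoding; unseen words fall back to an <unk> slot."""
    if not words:
        return []
    n_v = len(vocab) + 1           # +1 for the unseen-word slot
    def e(tag, w):
        return logp(emit[tag], w if w in vocab else "<unk>", n_v)
    scores = [{s: logp(trans["<s>"], s, len(STATES)) + e(s, words[0])
               for s in STATES}]
    back = [{}]
    for w in words[1:]:
        prev, cur, bp = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES,
                       key=lambda p: prev[p] + logp(trans[p], s, len(STATES)))
            cur[s] = prev[best] + logp(trans[best], s, len(STATES)) + e(s, w)
            bp[s] = best
        scores.append(cur)
        back.append(bp)
    # Backtrace the best tag sequence from the last position.
    tags = [max(STATES, key=lambda s: scores[-1][s])]
    for bp in reversed(back[1:]):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))

# Toy training data with hypothetical per-token cue annotations.
train = [[("this", "O"), ("may", "CUE"), ("suggest", "CUE"),
          ("a", "O"), ("link", "O")],
         [("the", "O"), ("protein", "O"), ("binds", "O"), ("it", "O")]]
trans, emit, vocab, logp = mle_estimate(train)
tags = viterbi(["results", "may", "indicate"], trans, emit, vocab, logp)
print(tags, "-> uncertain" if "CUE" in tags else "-> certain")
```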

Original language: English
Title of host publication: CEUR Workshop Proceedings
Publisher: CEUR-WS
Pages: 30-37
Number of pages: 8
Volume: 1204
Publication status: Published - 2014
Event: Workshop on Semantic Matching in Information Retrieval, SMIR 2014 - Gold Coast, Australia
Duration: 11 Jul 2014 → …

Other

Other: Workshop on Semantic Matching in Information Retrieval, SMIR 2014
Country: Australia
City: Gold Coast
Period: 11/7/14 → …

Keywords

  • Cross-domain training
  • Hedge cues
  • HMM
  • Naive Bayes
  • Uncertainty detection

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Li, X., Gao, W., & Shavlik, J. W. (2014). Detecting semantic uncertainty by learning hedge cues in sentences using an HMM. In CEUR Workshop Proceedings (Vol. 1204, pp. 30-37). CEUR-WS.
