Sibyl, a factoid question-answering system for spoken documents

Pere R. Comas, Jordi Turmo, Lluis Marques

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

Original language: English
Article number: 19
Journal: ACM Transactions on Information Systems
Volume: 30
Issue number: 3
DOIs: 10.1145/2328967.2328972
Publication status: Published - 1 Aug 2012
Externally published: Yes

Fingerprint

Syntactics
Speech recognition
Speech analysis
Information retrieval
Linguistics
Learning systems
Question answering
Experiments

Keywords

  • Question answering
  • Spoken document retrieval

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting (all)
  • Computer Science Applications

Cite this

Sibyl, a factoid question-answering system for spoken documents. / Comas, Pere R.; Turmo, Jordi; Marques, Lluis.

In: ACM Transactions on Information Systems, Vol. 30, No. 3, 19, 01.08.2012.

@article{598c64387da4483981a9d53fffbfb200,
title = "Sibyl, a factoid question-answering system for spoken documents",
abstract = "In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.",
keywords = "Question answering, Spoken document retrieval",
author = "Comas, Pere R. and Turmo, Jordi and Marques, Lluis",
year = "2012",
month = "8",
day = "1",
doi = "10.1145/2328967.2328972",
language = "English",
volume = "30",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "3",

}

TY - JOUR

T1 - Sibyl, a factoid question-answering system for spoken documents

AU - Comas, Pere R.

AU - Turmo, Jordi

AU - Marques, Lluis

PY - 2012/8/1

Y1 - 2012/8/1

N2 - In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

AB - In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, comparing manual with automatic transcripts obtained by three different automatic speech recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts, unless the ASR quality is very low. At the same time, our experiments on coreference resolution reveal that the state-of-the-art technology is not mature enough to be effectively exploited for QA with spoken documents. Overall, the performance of Sibyl is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

KW - Question answering

KW - Spoken document retrieval

UR - http://www.scopus.com/inward/record.url?scp=84865235825&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84865235825&partnerID=8YFLogxK

U2 - 10.1145/2328967.2328972

DO - 10.1145/2328967.2328972

M3 - Article

VL - 30

JO - ACM Transactions on Information Systems

JF - ACM Transactions on Information Systems

SN - 1046-8188

IS - 3

M1 - 19

ER -