Arabic community question answering

Research output: Contribution to journal › Article

Abstract

We analyze resources and models for Arabic community Question Answering (cQA). In particular, we focus on CQA-MD, our cQA corpus for Arabic in the domain of medical forums. We describe the corpus and the main challenges it poses due to its mix of informal and formal language, and of different Arabic dialects, as well as due to its medical nature. We further present a shared task on cQA at SemEval, the International Workshop on Semantic Evaluation, based on this corpus. We discuss the features and the machine learning approaches used by the teams who participated in the task, with focus on the models that exploit syntactic information using convolutional tree kernels and neural word embeddings. We further analyze and extend the outcome of the SemEval challenge by training a meta-classifier combining the output of several systems. This allows us to compare different features and different learning algorithms in an indirect way. Finally, we analyze the most frequent errors common to all approaches, categorizing them into prototypical cases, and zooming into the way syntactic information in tree kernel approaches can help solve some of the most difficult cases. We believe that our analysis and the lessons learned from the process of corpus creation as well as from the shared task analysis will be helpful for future research on Arabic cQA.

Original language: English
Pages (from-to): 5-41
Number of pages: 37
Journal: Natural Language Engineering
Volume: 25
Issue number: 1
DOIs: 10.1017/S1351324918000426
Publication status: Published - 1 Jan 2019

Fingerprint

Syntactics
Formal languages
Learning algorithms
community
Learning systems
Classifiers
Semantics
dialect
learning
semantics
Question Answering
evaluation
resources
Kernel
Syntax

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence

Cite this

Arabic community question answering. / Nakov, Preslav; Marques, Lluis; Moschitti, Alessandro; Mubarak, Hamdy.

In: Natural Language Engineering, Vol. 25, No. 1, 01.01.2019, p. 5-41.


@article{06abb73c96ac4ef7bbc8112975d89059,
title = "Arabic community question answering",
abstract = "We analyze resources and models for Arabic community Question Answering (cQA). In particular, we focus on CQA-MD, our cQA corpus for Arabic in the domain of medical forums. We describe the corpus and the main challenges it poses due to its mix of informal and formal language, and of different Arabic dialects, as well as due to its medical nature. We further present a shared task on cQA at SemEval, the International Workshop on Semantic Evaluation, based on this corpus. We discuss the features and the machine learning approaches used by the teams who participated in the task, with focus on the models that exploit syntactic information using convolutional tree kernels and neural word embeddings. We further analyze and extend the outcome of the SemEval challenge by training a meta-classifier combining the output of several systems. This allows us to compare different features and different learning algorithms in an indirect way. Finally, we analyze the most frequent errors common to all approaches, categorizing them into prototypical cases, and zooming into the way syntactic information in tree kernel approaches can help solve some of the most difficult cases. We believe that our analysis and the lessons learned from the process of corpus creation as well as from the shared task analysis will be helpful for future research on Arabic cQA.",
author = "Preslav Nakov and Lluis Marques and Alessandro Moschitti and Hamdy Mubarak",
year = "2019",
month = "1",
day = "1",
doi = "10.1017/S1351324918000426",
language = "English",
volume = "25",
pages = "5--41",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "1",
}

TY - JOUR

T1 - Arabic community question answering

AU - Nakov, Preslav

AU - Marques, Lluis

AU - Moschitti, Alessandro

AU - Mubarak, Hamdy

PY - 2019/1/1

Y1 - 2019/1/1

N2 - We analyze resources and models for Arabic community Question Answering (cQA). In particular, we focus on CQA-MD, our cQA corpus for Arabic in the domain of medical forums. We describe the corpus and the main challenges it poses due to its mix of informal and formal language, and of different Arabic dialects, as well as due to its medical nature. We further present a shared task on cQA at SemEval, the International Workshop on Semantic Evaluation, based on this corpus. We discuss the features and the machine learning approaches used by the teams who participated in the task, with focus on the models that exploit syntactic information using convolutional tree kernels and neural word embeddings. We further analyze and extend the outcome of the SemEval challenge by training a meta-classifier combining the output of several systems. This allows us to compare different features and different learning algorithms in an indirect way. Finally, we analyze the most frequent errors common to all approaches, categorizing them into prototypical cases, and zooming into the way syntactic information in tree kernel approaches can help solve some of the most difficult cases. We believe that our analysis and the lessons learned from the process of corpus creation as well as from the shared task analysis will be helpful for future research on Arabic cQA.

AB - We analyze resources and models for Arabic community Question Answering (cQA). In particular, we focus on CQA-MD, our cQA corpus for Arabic in the domain of medical forums. We describe the corpus and the main challenges it poses due to its mix of informal and formal language, and of different Arabic dialects, as well as due to its medical nature. We further present a shared task on cQA at SemEval, the International Workshop on Semantic Evaluation, based on this corpus. We discuss the features and the machine learning approaches used by the teams who participated in the task, with focus on the models that exploit syntactic information using convolutional tree kernels and neural word embeddings. We further analyze and extend the outcome of the SemEval challenge by training a meta-classifier combining the output of several systems. This allows us to compare different features and different learning algorithms in an indirect way. Finally, we analyze the most frequent errors common to all approaches, categorizing them into prototypical cases, and zooming into the way syntactic information in tree kernel approaches can help solve some of the most difficult cases. We believe that our analysis and the lessons learned from the process of corpus creation as well as from the shared task analysis will be helpful for future research on Arabic cQA.

UR - http://www.scopus.com/inward/record.url?scp=85059918340&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059918340&partnerID=8YFLogxK

U2 - 10.1017/S1351324918000426

DO - 10.1017/S1351324918000426

M3 - Article

VL - 25

SP - 5

EP - 41

JO - Natural Language Engineering

JF - Natural Language Engineering

SN - 1351-3249

IS - 1

ER -