Altruistic Crowdsourcing for Arabic Speech Corpus Annotation

Soumia Bougrine, Hadda Cherroun, Ahmed Abdelali

Research output: Contribution to journal › Conference article

1 Citation (Scopus)

Abstract

Crowdsourcing is an emerging collaborative approach that can be used to annotate linguistic resources effectively. Crowdsourcing comes in several genres: paid-for, games with a purpose, and altruistic (volunteer-based) approaches. In this paper, we investigate the use of altruistic crowdsourcing for speech corpus annotation by narrating our experience of validating a semi-automatic task for dialect annotation of Kalam'DZ, a corpus dedicated to Algerian Arabic dialectal varieties. We begin by describing the whole process of designing an altruistic crowdsourcing project. Using the unpaid Crowdcrafting platform, we performed experiments on a sample of 10% of the Kalam'DZ corpus, totaling more than 10 hours of speech from 1,012 speakers. We evaluate this crowdsourcing job by comparing its output with a gold-standard annotation produced by experts, which shows a high inter-annotator agreement of 81%. Our results confirm that altruistic crowdsourcing is an effective approach for speech dialect annotation. In addition, we present a set of best practices for altruistic crowdsourcing for corpus annotation.
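As a rough illustration of the evaluation step, the sketch below computes simple percent agreement between majority-voted crowd labels and a gold-standard expert annotation. It is not the authors' code; the dialect labels, variable names, and helper function are hypothetical.

  # Minimal sketch (assumed workflow, not the paper's implementation):
  # percent agreement between crowdsourced dialect labels and a
  # gold-standard expert annotation of the same audio segments.

  def percent_agreement(crowd_labels, gold_labels):
      """Fraction of segments whose majority crowd label matches the expert label."""
      assert len(crowd_labels) == len(gold_labels)
      matches = sum(c == g for c, g in zip(crowd_labels, gold_labels))
      return matches / len(gold_labels)

  # Hypothetical per-segment majority votes vs. expert labels
  crowd = ["Algiers", "Oran", "Constantine", "Oran"]
  gold  = ["Algiers", "Oran", "Algiers", "Oran"]
  print(f"Agreement: {percent_agreement(crowd, gold):.0%}")  # prints: Agreement: 75%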

Original language: English
Pages (from-to): 137-144
Number of pages: 8
Journal: Procedia Computer Science
Volume: 117
DOIs
Publication status: Published - 1 Jan 2017


Keywords

  • Altruistic Crowdsourcing
  • Arabic Algerian Dialects
  • Best Practices
  • Crowdcrafting
  • Dialect Annotation

ASJC Scopus subject areas

  • Computer Science(all)
