Studying the history of the Arabic language: language technology and a large-scale historical corpus

Yonatan Belinkov, Alexander Magidow, Alberto Barron, Avi Shmidman, Maxim Romanov

Research output: Contribution to journal › Article

Abstract

Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Original language: English
Journal: Language Resources and Evaluation
DOI: 10.1007/s10579-019-09460-w
Publication status: Published - 1 Jan 2019

Fingerprint

periodization
history
language
written language
spoken language
Arabic Language
Natural Language Processing
Classical Arabic

Keywords

  • Arabic
  • Corpus
  • Historical linguistics
  • Periodization
  • Text reuse

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences

Cite this

Studying the history of the Arabic language: language technology and a large-scale historical corpus. / Belinkov, Yonatan; Magidow, Alexander; Barron, Alberto; Shmidman, Avi; Romanov, Maxim.

In: Language Resources and Evaluation, 01.01.2019.

@article{5729f30dec5e485aad68d7ca18868571,
title = "Studying the history of the Arabic language: language technology and a large-scale historical corpus",
abstract = "Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.",
keywords = "Arabic, Corpus, Historical linguistics, Periodization, Text reuse",
author = "Yonatan Belinkov and Alexander Magidow and Alberto Barron and Avi Shmidman and Maxim Romanov",
year = "2019",
month = "1",
day = "1",
doi = "10.1007/s10579-019-09460-w",
language = "English",
journal = "Language Resources and Evaluation",
issn = "1574-020X",
publisher = "Springer Netherlands",
}

TY - JOUR

T1 - Studying the history of the Arabic language

T2 - language technology and a large-scale historical corpus

AU - Belinkov, Yonatan

AU - Magidow, Alexander

AU - Barron, Alberto

AU - Shmidman, Avi

AU - Romanov, Maxim

PY - 2019/1/1

Y1 - 2019/1/1

N2 - Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

KW - Arabic

KW - Corpus

KW - Historical linguistics

KW - Periodization

KW - Text reuse

UR - http://www.scopus.com/inward/record.url?scp=85064354076&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064354076&partnerID=8YFLogxK

U2 - 10.1007/s10579-019-09460-w

DO - 10.1007/s10579-019-09460-w

M3 - Article

AN - SCOPUS:85064354076

JO - Language Resources and Evaluation

JF - Language Resources and Evaluation

SN - 1574-020X

ER -