Arabic Corpora

Below are some free and useful Arabic corpora that I have created for researchers working on Arabic Natural Language Processing, Corpus Linguistics, and Computational Linguistics.

1- Habibi - a Multi-Dialect Multi-National Arabic Song Lyrics Corpus

Habibi is the first Arabic song lyrics corpus. It comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects, by singers from 18 different Arab countries. The lyrics are segmented into more than 500,000 sentences (song verses) containing more than 3.5 million words. I provide the corpus in both comma-separated values (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats.
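For readers who want to reproduce a csv-to-json conversion themselves, here is a minimal sketch using only the Python standard library. The column names (song_id, dialect, verse) and the sample rows are hypothetical and used purely for illustration; the actual Habibi csv header may differ, and this is not necessarily how the distributed json files were produced.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # ensure_ascii=False keeps Arabic text readable instead of \uXXXX escapes.
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical sample; the real corpus columns may be named differently.
sample = "song_id,dialect,verse\n1,Egyptian,يا حبيبي\n2,Gulf,يا ليل\n"
converted = csv_to_json(sample)
print(converted)
```

When working with the real file, read it with open(path, encoding="utf-8", newline="") and pass the file object to csv.DictReader directly.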

Click to download Habibi Corpus and word embeddings
Habibi Corpus LREC 2020 Paper

2- KALIMAT a Multipurpose Arabic Corpus

KALIMAT is an Arabic natural language resource that consists of:
1) 20,291 Arabic articles collected from the Omani newspaper Alwatan by Abbas et al. (2011).
2) 20,291 Extractive Single-document system summaries.
3) 2,057 Extractive Multi-document system summaries.
4) 20,291 Named Entity Recognised articles.
5) 20,291 Part of Speech Tagged articles.
6) 20,291 Morphologically Analysed articles.

The data collection articles fall into six categories:
culture, economy, local-news, international-news, religion, and sports.

Click here to download the corpus


3- Essex Arabic Summaries Corpus (EASC)

Click the link below for a copy of EASC Corpus:
Download 2013 Corpus
About EASC:
EASC is an Arabic natural language resource. It contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. The summaries were generated using Mechanical Turk (http://www.mturk.com/). Major features of EASC include:
  • File names and extensions are formatted to be compatible with current evaluation systems such as ROUGE and AutoSummENG.
  • The corpus is available in two encodings: UTF-8 and ISO-8859-6 (Arabic).
The Essex Arabic Summaries Corpus (EASC) uses copyright material. Users of the corpus are responsible for ensuring that they comply with the terms of the copyrights that apply to the source material and the derived works (summaries) and the terms of relevant copyright law.
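Since EASC is distributed in both UTF-8 and ISO-8859-6, it can be handy to convert between the two when a tool expects only one of them. The following is a minimal sketch, not part of the corpus tooling; the function name and the sample string are illustrative assumptions.

```python
def transcode(data: bytes, source: str = "iso-8859-6", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


# Illustrative sample: an Arabic word encoded in the legacy single-byte charset.
text = "ملخص"
legacy = text.encode("iso-8859-6")
utf8 = transcode(legacy)
print(len(legacy), len(utf8))
```

ISO-8859-6 stores each Arabic letter in one byte, while UTF-8 uses two bytes per Arabic letter, so the byte lengths differ even though the text is identical.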

4- Multi-document Summaries Corpora (2011 and 2013)

I contributed to co-ordinating and creating the Arabic dataset during my time at Essex and Lancaster Universities.
The dataset is derived from publicly available WikiNews (http://www.wikinews.org/) English texts.

The source texts were under CC Attribution Licence V2.5 (cf. http://creativecommons.org/licenses/by/2.5/). Texts in other languages have been translated by native speakers of each language.
(1) Direct Link to MultiLing 2011 dataset.

(2) Direct Link to MultiLing 2013 dataset.

The documents hold no meta-data or tags: they are plain text files encoded in UTF-8 (without a Byte Order Mark - BOM).
Tables and formatting have been removed.
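As a small illustration of what "UTF-8 without a BOM" means in practice, the sketch below checks raw bytes for a UTF-8 BOM and decodes them safely either way. The helper names are hypothetical; only the standard library codecs module and the "utf-8-sig" codec are relied upon.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 Byte Order Mark."""
    return raw.startswith(codecs.BOM_UTF8)


def read_plain_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # "utf-8-sig" behaves like "utf-8" but strips a leading BOM when decoding.
    return raw.decode("utf-8-sig")


without_bom = "خبر".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom
print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
```

Decoding with "utf-8-sig" is a defensive choice: it reads BOM-free files such as these unchanged, and also tolerates copies that an editor has re-saved with a BOM.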
Languages:
- Arabic - Czech - English - French - Greek - Hebrew - Hindi - Chinese (2013) - Romanian (2013) - Spanish (2013)
MultiLing URL:
http://multiling.iit.demokritos.gr/


References:
[1]:TAC 2011 MultiLing Pilot Overview
[2]:Multi-document multilingual summarization corpus preparation, Part 1: Arabic, English, Greek, Chinese, Romanian
[3]:ACL 2013 MultiLing Workshop

5- Arabic in Business and Management Corpora (ABMC) 2016

    1,200 Arabic articles, provided as plain text and also tagged using the Stanford Arabic Part of Speech Tagger.

    The corpora are distributed as follows:

  • 400 statements by Arab companies' chairmen and chief executives
  • 400 Arabic Economic News articles
  • 400 Arabic Stock Market news articles

Direct download from SourceForge:

https://sourceforge.net/projects/arabic-business-copora/

6- Arabic Dialects Dataset

    A dataset covering the Gulf, Egyptian, Levantine, and Tunisian Arabic dialects, in addition to Modern Standard Arabic (MSA).

    The dataset is distributed as follows:

  • Dialects Full Text (Attached)
  • Bivalency Removed (dialect - MSA)
  • Dialect's MSA
  • Dialects Tokens with Frequency Count
  • Dialects Tokens without Frequency Count
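To show how token lists with and without frequency counts relate to each other, here is a minimal sketch. It assumes simple whitespace tokenisation, which may not match how the distributed token files were actually built, and the sample sentence is invented for illustration.

```python
from collections import Counter


def token_frequencies(text: str) -> Counter:
    """Count tokens using plain whitespace splitting (an assumed tokenisation)."""
    return Counter(text.split())


# Invented sample text, not taken from the dataset.
sample = "شلونك اليوم شلونك"
freq = token_frequencies(sample)

# Tokens with frequency counts, most frequent first.
with_counts = freq.most_common()
# Tokens without frequency counts: just the unique token list.
unique_tokens = sorted(freq)
print(with_counts)
```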

Direct download:

Arabic Dialects Dataset