Below are some free and useful Arabic corpora that I have created for
researchers working on Arabic Natural Language Processing, Corpus and
Computational Linguistics.
1- KALIMAT: a Multipurpose Arabic Corpus
We are pleased to announce the immediate availability of KALIMAT 1.0.
KALIMAT is an Arabic natural language resource that consists of:
1) 20,291 Arabic articles collected from the Omani newspaper Alwatan by
Abbas et al. (2011).
2) 20,291 Extractive Single-document system summaries.
3) 2,057 Extractive Multi-document system summaries.
4) 20,291 Named Entity Recognised articles.
5) 20,291 Part of Speech Tagged articles.
6) 20,291 Morphologically Analysed articles.
The articles in the collection fall into six categories:
culture, economy, local-news, international-news, religion, and sports.
2- Essex Arabic Summaries Corpus (EASC)
Click the link below for a copy of the EASC corpus:
Download 2013 Corpus
The EASC is an Arabic natural language resource. It contains 153 Arabic
articles and 765 human-generated extractive summaries of those articles.
These summaries were generated using Mechanical Turk (http://www.mturk.com/).
Among the major features of EASC are:
1) Names and extensions are formatted to be compatible with current
evaluation systems such as ROUGE and AutoSummENG.
2) Available in two encoding formats: UTF-8 and ISO-8859-6 (Arabic).
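Because the corpus is available in two encodings, a script that reads the files needs to know which one it has been given. The following is a minimal Python sketch and not part of EASC itself: the helper name decode_arabic and the sample word are illustrative assumptions. It tries UTF-8 first and falls back to ISO-8859-6 if the bytes are not valid UTF-8.

```python
def decode_arabic(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, falling back to ISO-8859-6.

    Hypothetical helper for illustration; it is not shipped with the corpus.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes were not valid UTF-8, so assume the ISO-8859-6 (Arabic) variant.
        return raw.decode("iso-8859-6")


# Sample Arabic word ("summary"), encoded both ways for demonstration.
sample = "ملخص"
utf8_bytes = sample.encode("utf-8")
iso_bytes = sample.encode("iso-8859-6")

print(decode_arabic(utf8_bytes) == sample)
print(decode_arabic(iso_bytes) == sample)
```

To use it on a file, read the file in binary mode (for example with open(path, "rb")) and pass the bytes to the helper. The UTF-8-first order is a design choice: Arabic text encoded as ISO-8859-6 is very unlikely to be valid UTF-8 by accident, so the fallback should only trigger for the single-byte files.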
The Essex Arabic Summaries Corpus (EASC) uses copyright material. Users of
the corpus are responsible for ensuring that they comply with the terms of
the copyrights that apply to the source material and the derived works
(summaries) and the terms of relevant copyright law.
4- Small Corpora of Arabic in Business and Management 2016
1,200 Arabic articles, distributed as follows:
400 statements by chairmen and chief executives of Arab companies.
400 Arabic economic news articles.
400 Arabic stock market news articles.