The EMILLE Corpus

incorporating the CIIL Corpora





General Editors

Paul Baker

Andrew Hardie



EMILLE Project Principal Investigator

Tony McEnery



Robert Gaizauskas

Funded by

The Engineering and Physical Sciences Research Council


Participating institutions:

University of Lancaster

University of Sheffield

in collaboration with

The Central Institute of Indian Languages, Mysore, India


The CIIL Corpora

Developed by

The Central Institute of Indian Languages (Mysore)

including Western Regional Language Centre (Pune), India

in collaboration with

The Indian Institute of Technology (Delhi)

The Institute of Applied Language Sciences (Bhubaneshwar)

Aligarh Muslim University (Aligarh)



General Editors

Udaya Narayana Singh

B.D. Jayaram



D.P. Pattanayak

Francis Ekka

M. Ganesan

Usha Nair

K.S. Rajyashree

K.P. Lekhwani

S.N. Maheswari

S. Imtiaz Hasnain



EMILLE Corpus Principal Transcribers

Sutapa Ghosh

Vijay Vyas

Hrishikesh Rajhans

Raheela Iqbal

Winfocus PVT Ltd. (Director: Atul Jain)



EMILLE Corpus post-edited and validated by

Richard Xiao



Sinhala Corpus compiled by

Vincent Halahakone



Other EMILLE written corpora compiled by

Celia Worth

Paul Baker

Andrew Hardie


Urdu Morphosyntactic Annotation by

Andrew Hardie


Hindi Anaphora Annotation by

Srija Sinha



Special Thanks To:


Sameena Ali Khan and Hisam Mukaddam at the BBC Asian Network

Rizwan Ahmad of the University of Michigan

The editors of the many South Asian language websites who generously allowed us to use their data

The ministries and other UK government bodies who gave us permission to make use of their information leaflets




For further information on new CIIL Corpora:

B. Mallikarjun (General Corpora)

Rekha Sharma (Speech Corpora)



For further information on the EMILLE Project:

See the EMILLE website: