Lancaster University is part of a multi-institution project that will capture past and present use of the Welsh language and inform its future use.
Dr Paul Rayson, of the School of Computing and Communications, will play a key role in developing the first ever large-scale corpus of the Welsh language, compiling an initial data set of 10 million Welsh words.
The interdisciplinary, collaborative project, entitled The National Corpus of Contemporary Welsh, or Corpws Cenedlaethol Cymraeg Cyfoes (CorCenCC), has secured £1.8 million in funding from the Economic and Social Research Council (ESRC) and the Arts and Humanities Research Council (AHRC).
The corpus – a large collection of texts, or a body of written or spoken material for linguistic analysis – will represent Welsh language use across all communication types. This will include spoken, written and digital language, encompassing different genres, language varieties (regional and social) and contexts.
Contributors will be drawn from the 562,000 Welsh speakers in the UK, who will take part through digital crowdsourcing technologies and community collaboration.
Dr Paul Rayson said: “We are excited to be part of this important project to create the largest corpus of contemporary Welsh. The novel crowdsourcing techniques in the research will allow us to connect with the Welsh language teaching and learning community to inform the development of the project.”
Commencing in March 2016, the project will run for three and a half years. Led by Cardiff University, it will also draw on expertise from Swansea and Bangor universities, and break new ground as both a language resource and a model of corpus construction.
Further detail on the project’s construction and the ways in which users will be able to participate will be shared once the project is live in 2016.