LER-BIML
The indigenous minority languages of the British Isles and Ireland (or "BIMLs") – Cornish,
(Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh – are
becoming increasingly widely used in both public and private life.
Speech and language technology applications for these languages are therefore
urgently needed, not only for monolingual content management but also to aid
translators and interpreters, since the BIMLs are nearly always used in
bilingual contexts alongside English. To develop such
applications, basic language resources (such as corpora of machine-readable
texts, machine-readable dictionaries, speech databases and so on) are required.
Two recent EPSRC-funded projects at Lancaster (MILLE and EMILLE) have provided a great service to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. However, no such consolidated survey or examination of issues has yet been undertaken for the BIMLs. The present project thus has three broad goals: first, to survey the existing resources and tools for the various BIMLs; second, to obtain information about end-user needs and wants in these areas; and, third, to investigate some of the technical and practical issues that the BIMLs raise, primarily in the collection, transcription and annotation of spoken corpus material. The third goal will involve collecting and annotating a small sample corpus of spoken Welsh and Gaelic.
1. Survey the existing language engineering (LE) resources and tools that
are available for the BIMLs.
2. Survey end-users regarding the LE resources and tools which are needed
and/or wanted for these languages.
3. Investigate the practical and technical issues that the BIMLs raise for LE resource development, especially in terms
of spoken corpus collection, transcription and annotation.
4. Build a very small (ca. 80,000-word) spoken corpus of Welsh and Gaelic as a testbed for (3).
5. [...] (Ullans) and Welsh.
6. Manually annotate the Welsh or Gaelic corpus with the tagset developed in (5).