....L E R - B I M L.....L E R - B I M L.....L E R - B I M L.....L E R - B I M L.....L E R - B I M L.....L E R - B I M L.....

The Project

Project outline:

The indigenous minority languages of the British Isles and Ireland (or "BIMLs") – Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh – are becoming increasingly widely used in both public and private life. Thus, speech and language technology applications for these languages are now becoming an urgent need. These are needed not only for monolingual content management, but also to aid translators and interpreters since the BIMLs are nearly always used in bilingual contexts alongside English. To develop such applications, basic language resources (such as corpora of machine-readable texts, machine-readable dictionaries, speech databases and so on) are required.

Two recent EPSRC-funded projects at Lancaster (MILLE and EMILLE) have provided a great service to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. However, no such consolidated survey or examination of issues has yet been undertaken for the BIMLs. The present project thus has three broad goals: first, to survey the existing resources and tools for the various BIMLs; second, to obtain information about end-user needs and wants in these areas; and, third, to investigate some of the technical and practical issues that the BIMLs raise, primarily for the collection, transcription and annotation of spoken corpus material. This third goal will involve collecting and annotating a small sample corpus of spoken Welsh and Gaelic.

Project objectives:

1. Survey the existing language engineering (LE) resources and tools that are available for the BIMLs.

2. Survey end-users regarding the LE resources and tools which are needed and/or wanted for these languages.

3. Investigate the practical and technical issues that the BIMLs raise for LE resource development, especially in terms of spoken corpus collection, transcription and annotation.

4. Build a very small (ca. 80,000-word) spoken corpus of Welsh and Gaelic as a testbed for (3).

5. Develop EAGLES-conformant part-of-speech tagsets for Cornish, Scottish Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh.

6. Manually annotate the Welsh or Gaelic corpus with the tagset developed in (5).
