L E R - B I M L


 

LER-BIML WORKING PAPER 2 

Surveying End-User Needs for the

Indigenous Minority Languages of the British Isles and Ireland

 

The Department of Linguistics at Lancaster University has been engaged with two recent EPSRC-funded projects drawing attention to the non-indigenous minority language communities in the UK by locating existing resources, investigating end-user needs and wants, examining basic technical issues and beginning to generate appropriate resources. This work identified a corresponding gap in the market for the indigenous minority languages of the British Isles and Ireland, or BIMLs: Cornish, (Scottish) Gaelic, Irish, Manx, Scots, Ulster Scots (Ullans) and Welsh. These languages are becoming increasingly widely used in both public and private life, and speech and language technology applications for them are becoming an increasingly urgent need. To develop such applications, basic language resources are required.

The LER-BIML project has three primary aims: 

    i)   to survey the existing language engineering resources and tools for the BIMLs in question

    ii)   to obtain information regarding end-user needs and demands in these areas

    iii)  to investigate some of the particular technical issues that these BIMLs raise, principally with a view to spoken corpus collection and annotation

This workpackage concentrates on the second of these objectives.

 

METHODOLOGY 

A web questionnaire posted on the project website was judged the most effective way of capturing as wide a range as possible of potential end-user needs for BIML resources. Notice of this questionnaire was emailed to over fifteen Internet bulletin boards and mailing lists, including HUMANIST, CORPORA, TERMCELT and CELTLING. This ensured that the questionnaire was disseminated to all the BIML linguistic regions, and also outside the British Isles and Ireland to groups working with the BIMLs as non-indigenous minority languages. The questionnaire focuses on how different groups of users respond to language engineering resources and corpus construction for the BIMLs.

 

RESULTS 

There were 128 responses. Of these respondents, 57 were interested in receiving feedback from the survey, which would be provided by emailing them a copy of the report.

Scottish Gaelic had the highest demand for corpus resources; there was no demand at all for Ulster Scots[1]. There was strong interest in seeing more availability of resources for Breton and Shetlandic, with individual requests for the Channel Island languages and Romany amongst others. 

A bilingual corpus was the most favoured corpus type, specifically one that would:

    i)   contain English alongside the BIMLs in question

    ii)  contain sentence-aligned translations of the same texts in each language
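
As an illustration only, a sentence-aligned bilingual corpus of the kind respondents favoured could be represented as pairs of sentences that share an identifier. The sketch below is written in Python. The names AlignedSentence and align_sentences, the one-to-one positional pairing, and the Welsh and English sample phrases are all assumptions introduced here for illustration; none of them comes from the survey or from any LER-BIML resource.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignedSentence:
    """One BIML sentence paired with its English translation (hypothetical layout)."""
    sentence_id: int
    biml_text: str
    english_text: str


def align_sentences(biml: List[str], english: List[str]) -> List[AlignedSentence]:
    """Pair sentences one-to-one by position.

    Assumes both texts have already been split into the same number of
    sentences; real alignment usually has to handle merges and splits.
    """
    if len(biml) != len(english):
        raise ValueError("texts must contain the same number of sentences")
    return [
        AlignedSentence(i, b, e)
        for i, (b, e) in enumerate(zip(biml, english), start=1)
    ]


# Illustrative sample only: two common Welsh phrases with English equivalents.
welsh = ["Bore da.", "Diolch yn fawr."]
english = ["Good morning.", "Thank you very much."]
corpus = align_sentences(welsh, english)
```

Keeping each pair in a single record means that a search on either language returns its translation directly, which is the main practical benefit of sentence alignment over two separate monolingual files.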

Most respondents wanted to see an equal balance of written and spoken data built for the BIMLs, and wanted this done within general balanced corpora rather than in genre-specific corpora. Among genre-specific corpora, news, history and fiction proved the most popular areas of interest. Respondents thought it important that, ideally, all types of genre should be made available for the individual languages. Suggestions other than those proposed on the questionnaire included arts and music, youth culture, environment, travel, technology, oral literature, folklore, food and drink, media and advertising, and religion and ethics.

As regards linguistic annotation of the data, most respondents would prefer plain text alone. Of the annotation methods on offer, part-of-speech annotation was the next most popular. Respondents would be happy with anything that could be made available, although there was special mention of IPA and of metaphorical and dialect annotation. The question of textual mark-up returned the highest number of nil responses, but amongst those who were interested in seeing mark-up, HTML was the favourite.

The Internet was the favourite medium for receiving corpus data, with CD a close second.

On the issue of the listed features and their perceived importance within a corpus, the most common response overall was ‘no opinion’. The only features which received a majority rating of ‘essential’ were the header elements ‘author’, ‘source of data’ and ‘language use’. The spoken data features were the only ones to attract ‘not wanted’ responses.

The majority of respondents were linguists rather than language engineers. Applications which the language engineers envisaged using the BIML data to build included frequency tables, speech synthesis and recognition, spelling, style, syntax and grammar checkers, bilingual dictionaries and lexica and pedagogical tools. Questions the linguists wanted to explore with the data included effects of linguistic shift and borrowings; frequencies and variations of syntactic structures, dialect, registers and discourse; patterns of code switching across genres; reported versus actual usage; patterns of growth and decline; reception by young people. Suggested support tools included concordances, checkers, search and recognition tools, glossaries, taggers, text aligners and audio and video files. 
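
Two of the outputs mentioned above, frequency tables and concordance lines, can be sketched briefly. The Python below is a minimal illustration rather than a description of any tool built or planned by the project: the function names, the naive tokeniser and the English sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens.

    Deliberately naive: apostrophes and hyphens split words, which would
    need more careful handling for real BIML text.
    """
    return re.findall(r"\w+", text.lower())


def frequency_table(tokens: List[str]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens: List[str], keyword: str, width: int = 2) -> List[str]:
    """Return keyword-in-context lines with `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, in English for readability.
sample = "the corpus is small but the corpus is balanced"
tokens = tokenize(sample)
table = frequency_table(tokens)
kwic = concordance(tokens, "corpus")
```

Because Python's `\w` matches Unicode letters by default, accented characters such as those found in Welsh, Irish and Gaelic orthography are kept within a single token.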

Encouragingly, an overwhelming majority of respondents said that they were very likely to be working with the BIMLs in the future.

 

CONCLUSION 

The encouraging number of responses to the survey indicates that current activity and work in progress on the BIMLs are healthy and positive. The higher demand for Scottish Gaelic, Irish and Welsh most likely reflects the fact that the more specialised mailing lists catering specifically for these languages are in much wider use. A future survey of this type should ask respondents to indicate from which mailing list they received details of the questionnaire, to give a better overview of how the respondents are distributed. It would also be beneficial to determine the professional status of the respondents, to indicate from what type of linguistic background they are approaching the data. These factors could help explain why the results might appear to go somewhat against expectation in light of the results of the previous survey of existing resources.

 

APPENDIX:  STATISTICS

 

 

 



[1] This zero return could be explained by the presumption that the respondents identified more with the category Scots than with Ulster Scots or, from a more negative point of view, by a simple lack of interest in this area.