Integrated Resources Working Group: Invitation Draft

Document LE-EAGLES-WP4-1A.1

Geoffrey Leech

17 November 1997

------------------------------------------------------------------------

1. Introduction

1.1. This draft

The purposes of this document are

(a) To present a partial survey of current and developing work in the areas of research covered by WP4 (Integrated Spoken and Written Language Resources).

(b) To present the tentative basis for a set of recommendations for standards to be developed in these areas (to be developed further in the next 12 months).

(c) To indicate areas where further information and further discussion are needed, with the help of the Working Group, to achieve purposes (a) and (b) above.

1.2 The subject of this document: What is meant by 'Integrated Resources'?

In the 1980s, the speech community and the natural language community were effectively two research communities working on a common subject matter -- human language -- but otherwise having little communication with one another. In the 1990s this situation has changed, simply because many of the applications of language engineering (LE) involve both the domains of 'speech' and of 'natural language'. In 1997-8 it is more evident than ever before that these communities have to pool their specialist knowledge and to strive to become a single research community. The NL community has in the past concentrated (a) on written language processing, and (b) on the processing of language at higher levels of analysis (e.g. the syntactic and lexical levels) which apply both to written and spoken language, and where the distinction between the two channels is relatively unimportant. The speech community, on the other hand, has in the past tended to concentrate on 'lower' levels of analysis which relate fairly directly to the speech signal.

However, it has already become clear that this division of interest can no longer be maintained: many of the most forward-looking and challenging application goals of LE today (e.g. high-quality speech synthesis, large-vocabulary speech recognition, speech-to-speech translation, dialogue systems) involve both low-level and high-level processing. A parser, for example, is needed for processing both spoken and written language data. In fact, current research is working towards integrated spoken language systems undertaking all levels of speech understanding and speech synthesis, such as are needed for the appropriate understanding and production of speech in dialogue.

1.3 Limitations of the task to be undertaken in WP4

Hence 'integrated resources for spoken and written language' refers to LE resources which are to be shared by both speech and NL processing research. They include corpora, lexicons, grammars and tools. What can be achieved within the scope of this Work Package, however, is limited in several ways.

Limitation 1: In WP4 we restrict our attention primarily to (a) corpora, because this is the area in which the need for standardization arises most compellingly. Lack of 'resources' (this time in the sense of 'funds') prevents us from considering (b) lexicons and (c) grammars. On the other hand, (d) tools cannot be ignored in the current project, since the transcription and annotation of spoken corpora is in part constrained by what tools can be used or developed to facilitate these tasks.

A corpus in this context is simply a body of spoken language data which has been recorded, has been transcribed (in part or in toto) for use in the development of LE systems, and in principle at least, is available for use by more than one research team in the community. The need for standards, or rather guidelines, for the representation and annotation of spoken language data arises primarily because of the need to ensure interchange of data between different sites and language communities in a multilingual community such as the EC, so that progress in the provision of resources can be shared and can provide a springboard for further collaboration and advances in the future.

Limitation 2: Apart from the focus on corpora, there is an additional restriction on the scope of this WP, again necessitated by lack of funding. This is the decision to limit the work of WP4 to dialogue corpora. For the present purposes we define a dialogue as a discourse in which two or more participants interact communicatively, and where at least one of the participants is human. This covers cases of human-machine as well as human-human dialogue.

The focus on dialogue is timely, in view of the recent emergence of dialogue as an area ripe for rapid development, and the consequent demand for empirical research on dialogue corpora. In the words of Walker and Moore (1997: 1):

In the past, research in this area focused on specifying the mechanisms underlying particular discourse phenomena; the models proposed were often motivated by a few constructed examples... Recently however the field has turned to issues of robustness and the coverage of theories... This new empirical focus is supported by several recent advances: an increasing theoretical consensus on discourse models; a large amount of on-line dialogue and textual corpora available; and improvements in component technologies and tools for building and testing discourse and dialogue testbeds. This means that it is now possible to determine how representative particular discourse phenomena are, how frequently they occur, whether they are related to other phenomena, what percentage of the cases a particular model covers, the inherent difficulty of the problem, and how well an algorithm for processing or generating the phenomena should perform to be considered a good model.

Research in this field can be either close to or distant from practical commercial or industrial applications. Less applications-oriented studies may concentrate on certain modules or levels of analysis to the exclusion of others. All such studies can, however, be valuable in leading to richer and more accurate predictive models of human dialogue behaviour.

Limitation 3: A third limitation on our study of 'integrated resources' is that we focus attention primarily on applications-oriented task-driven dialogue, bearing in mind that the objective of EAGLES is to promote the setting of standards in language engineering, rather than more generally in such fields of linguistics or social science as dialectology, sociolinguistics, discourse analysis or conversational analysis. In recent years, corpora of spoken dialogue have been compiled for a wide variety of reasons. For example, one well-developed initiative is the CHILDES database (MacWhinney 1991) which sets standards for the interchange of data between researchers in the area of child language acquisition. Another instance of incipient standardization is the spoken subcorpus of the BNC (British National Corpus) (see Burnard 1995), which contains c.10 million words of spoken English, all transcribed and marked up in accordance with the guidelines of the TEI (Text Encoding Initiative) - see Johansson (1995). The need for a standard in this case had to be reconciled with the requirement of a corpus large enough to be usable for dictionary compilation and other wide-ranging fields of linguistic research. Other examples could be added: there can be many reasons for introducing standards or guidelines for the representation of dialogue, apart from those which are most salient to the LE community. While it is instructive to take note of these other initiatives, especially where they come to conclusions of value to LE specialists, they should not necessarily be treated as a model to be followed in this WP.

Limitation 4: Finally, yet another limitation of this task is the following. We have restricted attention to certain levels or tiers of representation/annotation where there is felt to be a particular need to propose guidelines. The levels of transcription on which a representation of dialogue can be provided are many: see Gibbon et al. (1997: Section 5.1.2) for a reasonably complete list. However, for the present purpose we will ignore phonetic/phonemic and physical levels of transcription, on which considerable standardizing work has been done already (see Wells et al., 1992), and confine our attention to the following levels:

orthographic (Section 3.1) - verbatim record + macro-features of the dialogue

morphosyntactic (Section 3.2) - part-of-speech or word-class tagging

syntactic (Section 3.3) - treebanks (either partially or fully parsed)

pragmatic (Section 3.4) - functional units or speech acts in dialogue

prosodic (Section 3.5) - representation of stress and intonation

(The prosodic level can only be considered impressionistically in the present context.)

At the same time, we assume that the different levels of annotation above all need to be integrated in a multi-layer structure, and linked through time alignment to the sound recording.
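By way of illustration, the following sketch (in Python) shows one way in which such a multi-layer, time-aligned structure might be represented. All class and field names here are illustrative assumptions, not part of any existing or proposed standard:

    # A minimal sketch of a multi-layer, time-aligned annotation
    # structure. All names are illustrative assumptions; they do not
    # correspond to any existing standard or tool.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AnnotationUnit:
        start: float   # offset into the sound recording, in seconds
        end: float
        label: str     # e.g. an orthographic word, a POS tag, a dialogue act

    @dataclass
    class Layer:
        name: str      # 'orthographic', 'morphosyntactic', 'pragmatic', ...
        units: List[AnnotationUnit] = field(default_factory=list)

    @dataclass
    class AnnotatedDialogue:
        recording: str                          # identifier of the sound file
        layers: List[Layer] = field(default_factory=list)

        def aligned_units(self, layer_name, start, end):
            """Return the units of one layer overlapping a given time
            span, so that e.g. the POS tags co-occurring with a
            prosodic unit can be recovered via the shared time line."""
            for layer in self.layers:
                if layer.name == layer_name:
                    return [u for u in layer.units
                            if u.start < end and u.end > start]
            return []

The essential design point is that no one level of annotation contains any other: each layer is independently anchored to the recording, and correspondences between layers are recovered through the shared time line.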

It has to be admitted that these levels (particularly the orthographic, pragmatic and prosodic) do not yet show a highly developed trend towards standardization. Consequently, the work of WP4 should concentrate heavily on surveying current practices, and should avoid imposing a standard where the conditions for consensus do not yet exist. On the whole, recommendations

(a) will be tentative,

(b) will be directed to matters of general principle rather than of detailed practice,

(c) will offer options where (as is usually the case) no one solution or 'standard' is likely to be suitable for all purposes.

We remember, too, that these recommendations will be simply the beginning of a process of consultation and - yes - 'dialogue', involving specialists outside the Working Group as well as members of the LE research and user communities.

2. A provisional typology of dialogue corpora

Before we turn to the different levels of representation or annotation, it is worth considering the various types of dialogue which might be investigated or modelled for LE purposes. In principle, we need a typology of dialogues geared to the foreseen needs of LE. In practice, judging by present research, there is likely to be a heavy concentration on certain rather constrained and simple kinds of dialogue: those with the features marked * below.

1. NUMBER OF PARTICIPANTS

1.1. TWO PARTICIPANTS *

1.2. MORE THAN TWO PARTICIPANTS

Most dialogues in LE research have two participants only (at any one stage). More than two participants would greatly complicate the task of modelling all levels of analysis/synthesis. On the other hand, large spoken corpora such as the demographic component of the BNC contain conversational dialogues with many participants.

2. APPLICATIONS ORIENTATION

2.1 TASK-DRIVEN *

2.1.1 APPLICATIONS-ORIENTED *

2.1.2 NON-APPLICATIONS-ORIENTED

2.2 NON-TASK-DRIVEN

Most dialogues in LE research are task-driven. That is, there is a specific task (or possibly more than one task) which at least one participant aims to accomplish with the aid of the other(s). The Edinburgh Map Task Corpus (Carletta et al. 1997) is an example; another is the TRAINS corpus (Allen et al. 1996), in which speakers develop plans to move trains and cargo from one city to another. The Map Task may be given as an example of a non-applications-oriented dialogue type. In contrast, dialogues which have a clear application to useful human-machine interfaces, such as those dealing with airline or hotel reservations, are applications-oriented.

3. DOMAIN

3.1 RESTRICTED DOMAIN *

3.2 UNRESTRICTED DOMAIN

Again, most dialogues in LE are restricted to a relatively tightly-defined domain of subject-matter. The examples cited in (2.) above also apply here.

4. ACTIVITY TYPE

4.1 SERVICE ENCOUNTER (One participant provides a service for the other, in the form of information, directions, etc.) *

4.2 INTERVIEW (One or more participants, the INTERVIEWER(S), play a minimal but talk-guiding role, with the goal of inducing another participant, the INTERVIEWEE to reveal information, beliefs, opinions, etc.)

4.3 GAME DIALOGUE (One participant aims to accomplish a challenging task, either in collaboration with or in competition with other participants)

4.4 (etc.) ...

Alongside domain, the activity type (Levinson 1979) to which the dialogue belongs is another variable defining the nature of a dialogue, particularly in terms of the constraints on the dialogue roles adopted by participants. A service encounter, for example, is a kind of dialogue activity in which one participant aims to elicit from another participant a useful form of behaviour known as a 'service'. The service, for example, may be a matter of supplying information, or carrying out a set of actions.

Relations between variables (2.) 'applications orientation' and (4.) 'activity type' are obvious. On the whole, applications-oriented dialogue corpora will be characterized as service encounters, since the human-machine interface application would be one in which the role of service-provider would be assumed by a computer. There are also, however, service encounter dialogue corpora which are not applications-oriented. One example is the Pixi corpus (Aston 1988) which consists of dialogues (in Italian and in English) between customers and service-providers in a bookshop.

Similarly, constraints on domain (3.) and activity type (4.) are clearly interrelated variables. In combination, they help to specify scenario (see 5. below). However, they may have to be considered independently: the Switchboard Corpus, for example, has dialogues in which the speakers share a pre-determined topic or domain of discourse, but their roles are not constrained in any specific way.

5. SCENARIO

5.1 Arranging appointments (VERBMOBIL)

5.2 Dealing with airline / travel inquiries

5.3 Developing plans for moving trains and cargo (TRAINS)

5.4 Furnishing rooms (COCONUT)

5.5 Giving directions to find a route on a map (Map Task)

5.6 Giving directions on how to find places and services in the vicinity of MIT (SUMMIT)

5.7. (etc.) ...

The number of scenarios in which dialogue takes place is very large. Also, the amount of detail which may be specified to define the scenario for a particular dialogue is open-ended. Hence no closed list of 'scenarios' can be specified. As an example, consider the following as a succinct description of the Map Task scenario (Thompson et al. 1995: 168):

Each participant has a schematic map in front of them, not visible to the other. Each map is comprised of an outline and roughly a dozen labelled features (e.g. 'white cottage', 'Green Bay', 'oak forest'). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route.

The Map Task is an example of a 'laboratory' activity type, with no useful application outside research, rather than a 'service encounter'. However, as already noted, service encounters are likely to be more popular activities for LE dialogue corpora, since they accord with the applications which may become commercially or industrially exploitable. That is, if the service provider's role can be computerized, the dialogue corpus can provide a model for the human-machine dialogue.

6. HUMAN/MACHINE PARTICIPATION

6.1 HUMAN--HUMAN DIALOGUE

6.1.1 MACHINE-MEDIATED (cf. VERBMOBIL, ATR)

6.1.2 NON-MACHINE-MEDIATED *

6.2 HUMAN--MACHINE DIALOGUE

6.2.1 SIMULATED (WIZARD OF OZ) *

6.2.2 NON-SIMULATED

In corpus-driven methodology, however, there is always a problem of matching the naturally-collected spontaneously-produced data to the future development needs of an artificial system. One problem of dialogue research where this shows up strongly is in our lack of knowledge of how human beings will behave when conversing with computer dialogue systems. How far will they adapt, when talking to a machine, so that their dialogic behaviour is 'unnatural' by the standards of human--human dialogue? To answer this question, Wizard of Oz experiments (see Gibbon et al. 1997: 9.5) have been set up to simulate the behaviour of a machine in dialogue with a human being, and to record both the behaviour of the machine and the behaviour of a human being who believes he or she is interacting with the machine.
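To summarize this section, the six classificatory variables above can be brought together as a single record describing any one dialogue corpus. The following sketch (in Python) encodes the typology, using the Map Task as described above; the field names and example values are illustrative assumptions, not a proposed classification scheme:

    # A sketch encoding the six classificatory variables of Section 2
    # as a record describing one dialogue corpus. Field names and the
    # example values are illustrative, not a proposed standard.
    from dataclasses import dataclass

    @dataclass
    class DialogueCorpusProfile:
        participants: int            # 1.  two, or more than two
        task_driven: bool            # 2.  task-driven vs non-task-driven
        applications_oriented: bool  # 2.1 only meaningful if task_driven
        restricted_domain: bool      # 3.  restricted vs unrestricted domain
        activity_type: str           # 4.  'service encounter', 'interview', ...
        scenario: str                # 5.  free-text scenario description
        human_machine: str           # 6.  'human-human', 'human-machine (WOz)', ...

    # E.g. the Map Task, as characterized above:
    map_task = DialogueCorpusProfile(
        participants=2,
        task_driven=True,
        applications_oriented=False,
        restricted_domain=True,
        activity_type='game dialogue',
        scenario='giving directions to find a route on a map',
        human_machine='human-human',
    )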

3. Levels of representation or annotation

3.1 Orthographic transcription

The aim here is to represent the macro-features of the dialogue, including a verbatim record of what was said. A 'verbatim record' is a useful abstraction for many purposes, so long as it is not mistaken for the actual speech event.

[Document LE-EAGLES-WP4-1B.1 by Andrew Wilson deals with the topic of this section.]

3.2 Morphosyntactic annotation

Morphosyntactic annotation is more familiarly known as part-of-speech or word-class tagging. It may be argued that this is not a problem area for integrated spoken and written resources, since the same word-class categories appear in both spoken and written texts. (Even 'ums' and 'ers' occur occasionally in fictional dialogue.) Moreover, EAGLES recommendations have already been made in this area (Leech and Wilson 1994).

However, most tagsets have been devised primarily for written language, and it is curious to note that early speech synthesis software (MITALK) made use of the tagged Brown Corpus (a corpus of edited written English) as a basis for a word-class model for spoken language! There are, however, two aspects of morphosyntactic tagging which need to be considered in adapting a tagset from written to spoken language:

(a) Disfluency phenomena:

(i) How to tag hesitation fillers (um, er, etc.);

(ii) How to tag word fragments (i.e. where a speaker discontinues speech in the middle of a word).

In the EAGLES guidelines (Leech and Wilson 1994) there is a 'catch-all' peripheral part-of-speech tag U (or 'unassigned') which can be used for these semi-word-like phenomena. The guidelines also allow for the subdivision of this U category into subcategories such as Ux 'hesitation filler' and Uy 'word partial' (where x and y are digits). Alternatively, the guidelines would allow the I (interjection) category to be extended to include hesitation fillers. Hence in this respect, the existing morphosyntactic annotation guidelines are sufficient. Optional extensions such as the devising of new subcategories are already allowed for, and we recommend that these be introduced if required. On the other hand, a third solution is not to assign a morphosyntactic tag to these items at all, but to regard them as non-word vocalizations comparable to laughs and snorts (see Wilson, Document LE-EAGLES-WP4-1B).
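As an illustration of the first of these solutions, the following sketch (in Python) shows how the U subcategories might be assigned mechanically before ordinary tagging. The digit choices U1 and U2 are arbitrary within what the guidelines allow, and the convention that a word fragment ends in a hyphen is an assumed transcription convention, not part of the guidelines:

    # A sketch of the first solution above: assigning EAGLES-style U
    # subtags to disfluent items before ordinary tagging. The digits
    # (U1, U2) are illustrative; the guidelines leave them open.
    from typing import Optional

    FILLERS = {'um', 'er', 'erm', 'uh'}

    def disfluency_tag(token: str) -> Optional[str]:
        """Return a U subtag for a disfluent token, or None if the
        token should be passed to the ordinary tagger."""
        if token.lower() in FILLERS:
            return 'U1'              # hesitation filler
        if token.endswith('-'):      # assumed fragment convention
            return 'U2'              # word fragment, e.g. 'hesi-'
        return None

    # e.g. [disfluency_tag(t) for t in ['well', 'um', 'hesi-', 'yes']]
    # gives [None, 'U1', 'U2', None]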

(b) Word-classes which are characteristic of speech, but not of writing

Tagsets may need to be augmented to deal with conversational language phenomena such as pragmatic particles (e.g. German doch, ja), discourse markers (e.g. English well, right), and various kinds of adverbs (especially stance adverbs and linking adverbs) which are strongly associated with spoken language. All these forms might in a very general sense be termed 'adverbial' on the grounds that they are peripheral to the clause/sentence, are detachable from it, and may often occur in varying positions, particularly initial or final, in relation to any larger grammatical structures of which they are a part.

The existing EAGLES guidelines (Leech and Wilson 1994) deal with adverbs cursorily, simply allowing that (besides morphological variants) various syntactico-semantic functions of adverbs may be recognized in the morphosyntactic tagset. On the whole, tagsets have avoided subcategorising adverbs, on the following grounds. Adverbs constitute a loosely-organized word class, and well-known subcategories (such as time, place, manner, degree, stance) are notoriously difficult to distinguish by hard-and-fast criteria, and certainly difficult to recognize and tag automatically. It is worth noting, however, that two tagsets for English which were devised with spoken corpora in mind subcategorize adverbs in considerable detail. These are the London-Lund tagset (Svartvik and Eeg-Olofsson 1982) and the International Corpus of English (ICE) tagset (Greenbaum and Ni 1996). The following extracts from the London-Lund tagset will give an impression of the detail of adverbial classification provided:

TAG     CAT     SUBCAT           SUBSUB or ITEM   EXAMPLE

AApro   adverb  adjunct          process          correctly
AAspa   adverb  adjunct          space            outdoors
AAtim   adverb  adjunct          time             how
...     ...     ...              ...              ...
AQapp   adverb  discourse item   appositional     I'm sorry
AQexp   adverb  discourse item   expletive        fuck off
AQgre   adverb  discourse item   greeting         goodbye
AQhes   adverb  discourse item   hesitator        now
AQneg   adverb  discourse item   negative         no
AQord   adverb  discourse item   order            give over
AQpol   adverb  discourse item   politeness       please
AQpos   adverb  discourse item   positive         yes, [mm]
AQres   adverb  discourse item   response         I see
...     ...     ...              ...              ...
ASemp   adverb  subjunct         emphasiser       actually
ASfoc   adverb  subjunct         focusing         mainly
ASint   adverb  subjunct         intensifier      a bit
...     ...     ...              ...              ...

While this partial list is not intended as a model to be recommended, it does illustrate something of the diversity and importance of adverbial components in speech, and the need to consider carefully the addition of subcategories to the tagset before undertaking a morphosyntactic tagging of spoken data. Also illustrated here is another phenomenon of spoken language: the tendency for multiword expressions (such as I see, I'm sorry, thank you, sort of) to occur with greater density than in written language. The question here, for morphosyntactic tagging, is whether these expressions should be decomposed into their individual words for tagging purposes, or should be assigned a single tag labelling the whole expression, as in the list above. We do not provide a detailed answer to this question here, but suggest again that careful consideration be given to this issue, and appropriate guidelines drawn up, before the tagging of a dialogue corpus is completed.
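To make the 'single tag' option concrete, the following sketch (in Python) groups listed multiword expressions into single tokens by greedy longest match before tagging. The small lexicon, the tag assignments and the matching strategy are all illustrative assumptions rather than recommendations:

    # A sketch of the 'single tag' option: merge listed multiword
    # expressions into one token before tagging. The lexicon and the
    # greedy longest-match strategy are illustrative only.
    MWE_LEXICON = {
        ('i', 'see'): 'AQres',          # response item (cf. the list above)
        ('i', "'m", 'sorry'): 'AQapp',  # appositional item
        ('sort', 'of'): 'ASint',        # intensifier
        ('thank', 'you'): 'AQpol',      # politeness item
    }
    MAX_LEN = max(len(k) for k in MWE_LEXICON)

    def group_mwes(tokens):
        """Greedily merge longest-matching multiword expressions,
        yielding (token_or_expression, tag_or_None) pairs."""
        i, out = 0, []
        while i < len(tokens):
            for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
                key = tuple(t.lower() for t in tokens[i:i + n])
                if key in MWE_LEXICON:
                    out.append((' '.join(tokens[i:i + n]), MWE_LEXICON[key]))
                    i += n
                    break
            else:
                out.append((tokens[i], None))  # left for the ordinary tagger
                i += 1
        return out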

3.3 Syntactic annotation

Syntactic annotation has up to now taken the form of developing treebanks, or corpora in which each sentence is assigned a tree structure, usually on the basis of some kind of phrase structure model: but dependency grammar models have also been employed. Until very recently, little spoken data has been syntactically annotated. There is an EAGLES document proposing some provisional guidelines for syntactic annotation (Leech et al. 1996).

With syntactic annotation, as with tagsets, the inventory of annotation symbols has generally been drawn up with written language in mind. As with morphosyntactic annotation (3.2), we note that in early development of syntactic annotation in the 1980s (IBM-Lancaster treebank, 1987-91) there seemed to be nothing peculiar in the use of skeleton-parsed written texts on a large scale as a training corpus for speech recognition applications. Again as in 3.2, we need to consider what changes to syntactic annotation need to be made if an annotation scheme is to be adapted to spoken language data.

(a) Disfluency phenomena

Again as with morphosyntactic annotation, this adaptation is most needed to deal with disfluency. The main phenomena requiring special treatment are the following (a detection sketch for phenomenon (ii) is given after the list):

(i) Incomplete parse trees: where the speaker fails to complete an utterance. This may be due to self-correction, to interruption, or to some other disruption of the production process.

(ii) Unplanned repetition: where the speaker displays hesitation by repeating the same word, or the same sequence of words, before going on to complete the utterance.

(iii) Syntactic blends: where in the course of an utterance, a speaker changes direction, failing to complete the syntactic construction with which he/she began, and instead substituting the completion of an alternative construction. E.g. the switch to a non-matching tag question in: And there's an accident up by the Flying Fox, is it? (examples from the BNC).
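Of these, phenomenon (ii) is the most mechanically detectable. The following sketch (in Python) flags immediate repetitions of a word or short word sequence, as a first-pass aid to a human annotator; it is a heuristic offered for illustration, not an annotation practice drawn from any of the schemes discussed here:

    # A sketch for phenomenon (ii): flag immediate repetitions of a
    # word or word sequence in an utterance. A heuristic only.
    def find_repetitions(tokens, max_ngram=3):
        """Yield (start, length) spans where a sequence of up to
        max_ngram words is immediately repeated."""
        for i in range(len(tokens)):
            for n in range(max_ngram, 0, -1):
                if (i + 2 * n <= len(tokens) and
                        tokens[i:i + n] == tokens[i + n:i + 2 * n]):
                    yield (i, n)
                    break

    # e.g. list(find_repetitions('I I was was going to to say'.split()))
    # gives [(0, 1), (2, 1), (5, 1)]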

(b) Unintelligible speech

Another problem related to that of syntactic incompleteness arises in dialogue when the circumstances of the recording or of speech production leave passages of speech unintelligible or unclear.

(c) Segmentation difficulties

The syntax of spoken dialogue may seem fragmentary or disorderly for reasons other than disfluency or unintelligibility. Some reasons are these:

(i) The canonical sentence of written language, as a structure containing a finite verb, is far from being a satisfactory basis for the segmentation of speech into independent syntagms. According to one count, c. 36% of the independent syntagms of conversational dialogue have no finite verb: many are single-word utterances, and many others consist of a single verbless phrase. (A counting sketch for this proportion is given after the list.)

(ii) Another reason is that the criteria for what counts as a syntactically independent segment in speech are difficult to determine, and may rely on prosodic separation.

(iii) There are utterance turns in dialogue where one speaker completes a syntagm initiated by another speaker.
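As an illustration of the kind of count mentioned in (i), the following sketch (in Python) computes the proportion of verbless independent syntagms in a tagged corpus. The representation of a segment as a list of (word, tag) pairs and the convention that finite-verb tags begin with 'VVF' are both assumptions made for the purpose of the example:

    # A sketch of how a figure like the 'c. 36% verbless' count in (i)
    # might be checked. Segments are assumed to be lists of
    # (word, tag) pairs; the 'VVF' tag prefix is hypothetical.
    def verbless_proportion(segments, finite_prefix='VVF'):
        """Proportion of independent syntagms with no finite verb."""
        if not segments:
            return 0.0
        verbless = sum(
            1 for seg in segments
            if not any(tag.startswith(finite_prefix) for _, tag in seg)
        )
        return verbless / len(segments)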

The EAGLES guidelines (Leech et al. 1996) acknowledge the need for extensions of the guidelines to deal with typical features of spoken language, but no solutions are offered. Recently, the development of treebanks of spoken language has confronted a number of research groups with the same problem of adapting syntactic annotation practices to spontaneous spoken language. Four attempts to specify guidelines to deal with the problems of spoken English parsing are those of:

AUTHOR(S)     CORPUS                    UNIVERSITY

Eyes, E.      British National Corpus   Lancaster

Marcus, M.    The Penn Treebank         Pennsylvania

Fang, A. C.   The ICE-GB Corpus         London

Sampson, G.   CHRISTINE                 Sussex

[Further information and discussion to be added]

3.4 Pragmatic Annotation: Dialogue Acts

[Document LE-EAGLES-WP4-1C.1 by Martin Weisser deals with the topic of this section.]

3.5 Prosodic Annotation

[To be added.]

References

Allen, J.F., Miller, B.W., Ringger, E.K. and Sikorski, T. (1996), 'A robust system for natural spoken dialogue'. In Proceedings of the Annual Meeting, Association for Computational Linguistics, pp.62-70.

Aston, G. (ed.) (1988), Negotiating service: Studies in the discourse of bookshop encounters: the Pixi project. Bologna: CLUEB.

Burnard, L. (ed.) (1995), Users' reference guide for the British National Corpus version 1.0. Oxford: Oxford University Computing Services.

Carletta, J., Isard, A., Isard, S., Kowtko, J., Newlands, A., Doherty-Sneddon, G. and Anderson, A. (1995), HCRC Dialogue Structure Coding Manual. HCRC, 2 Buccleuch Place, Edinburgh EH8 8LW, Scotland.

Gibbon, D., Moore, R. and Winski, R. (1997), Handbook of standards and resources for spoken language systems. Berlin: Mouton de Gruyter.

Greenbaum, S. and Ni, Y. (1996) 'About the ICE tagset', in Greenbaum, S. (ed.) English Worldwide: the International Corpus of English, Oxford: Clarendon Press, pp.92-109.

Jekat, S., Klein, A., Maier, E., Maleck, I., Mast, M. and Quantz, J. (1995), Dialogue Acts in VERBMOBIL. VM-Report 65, DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken.

Johansson, S. (1995), 'The approach of the Text Encoding Initiative to the encoding of spoken discourse', in Leech, G., Myers, G. and Thomas, J. (eds.) (1995), Spoken English on computer: Transcription, mark-up and application. London and New York: Longman, pp.82-98.

Leech, G. and Wilson, A. (1994), EAGLES Morphosyntactic annotation. EAGLES Report EAGCSG/IR-T3.1. Pisa: Istituto di Linguistica Computazionale.

Leech, G., Barnett, R. and Kahrel, P. (1996), Guidelines for the standardization of syntactic annotation of corpora. EAGLES Document EAG-TCWG-SASG/1.8.

Levinson, S. (1979), 'Activity types and language', Linguistics, 17.5/6, 356-99.

MacWhinney, B. (1991), The CHILDES project: tools for analyzing talk. Hillsdale, NJ: Lawrence Erlbaum.

Svartvik, J. and Eeg-Olofsson, M. (1982), 'Tagging the London-Lund Corpus of Spoken English'. In Johansson, S. (ed.), Computer corpora in English language research, Bergen: Norwegian Computing Centre for the Humanities, pp.85-109.

Thompson, H., Anderson, A. and Bader, M. (1995), 'Publishing a spoken and written corpus on CD-ROM: the HCRC Map Task experience', in Leech, G., Myers, G. and Thomas, J. (eds.), Spoken English on computer: Transcription, mark-up and application. London and New York: Longman, pp.168-80.

Wells, J.C., Barry, W., Grice, M., Fourcin, A., and Gibbon, D. (1992), Standard Computer-compatible transcription. Esprit project 2589 (SAM), Doc. no. SAM-UCL-037. London: Phonetics and Linguistics Dept., UCL.