Andrew Wilson
17 Nov 97
These recommendations take account of those made by Llisterri (1996) and by Gibbon et al. (1997) within the EAGLES framework and of those made by Johansson et al. (1991) for the TEI, now largely codified in P3 (Sperberg-McQueen and Burnard 1994). The corpus survey on which the recommendations are based comes partly from the document of Johansson et al. (1991) and partly from a fresh extension of it which pays particular reference both to corpora produced for dialogue projects and to corpora in European languages other than English.
We try in particular to address the issue of integrating spoken and written resources -- e.g., making representations of spoken corpora accessible to the language engineering (not just the speech technology) community. For this reason, we sometimes focus on processibility of texts (e.g., by stochastic or rule-based taggers and parsers) as an issue.
We make no strong recommendations as to the means of representation, so that, e.g., whilst we may use examples based on the TEI, we do not necessarily aim to push people into TEI conformance. Rather, we concentrate on the features that should be represented. However, some forms of representation naturally capture certain phenomena more easily than others: for instance, the start and end tags used in SGML/TEI are particularly useful for indicating the duration of a simultaneous phenomenon such as a non-verbal noise. It is also recommended that, in choosing a representation scheme, individual symbols that might be confused with other markup be avoided: for example, the @ character used by VERBMOBIL to mark overlapping speech could possibly be confused with the SAMPA representation of the schwa character. The use of tags with whole-word representations (e.g., the Spanish <simultáneo>) would minimize this kind of confusion. Furthermore, one should also consider whether coding representations might be standardized on a single language, regardless of the language of the corpus data. This would make interchangeability more straightforward than if there were a dozen translations of every single tag or feature.
The issue of obligatory vs. recommended vs. optional levels (cf. the recommendations on morphosyntax [Leech and Wilson 1996]) is one that should also be addressed: obviously, some applications will require more detailed transcription and analysis than others.
There are two primary ways of documenting information about texts:
As a bare minimum, the header should contain an identifier for the specific text and basic information on the speakers. By the latter, we do not necessarily mean personal details but rather a list of the codes used to identify every contributor to the dialogue. Additional optional information may include:
The most common text units in dialogue corpora are the text (i.e., a self-contained dialogue or dialogue sample with a natural or editorially created beginning and end) and the turn (or contribution). Tone units are also sometimes marked. Some researchers additionally posit a functionally-defined unit of utterance, which is not synonymous with the turn. Orthographic sentences are also often present, but these should probably be viewed as artefacts of transcription, rather than as intentional text units per se.
We recommend that the text and turn should be the basic text units in transcription. We do not recommend the use of tone units in orthographic transcription, as these are difficult to identify reliably (see Knowles 1991): any marking of tone units should be left to the interpretative stage of prosodic markup (Llisterri's [1996] S3 level). The notion of turn is itself not wholly unproblematic, since interruptions and overlap can occur, but there are methods for representing these aspects (see, e.g., 6 below). Sentences should be used in transcription for greater intelligibility and processibility (e.g., by taggers that assume the sentence as the basic processing unit), but it should be emphasized that the turn is the basic unit of spoken text.
A reference system -- i.e., a set of codes that allow reference to be made to specific texts and locations in texts -- may be absent from transcribed spoken corpora. This is partly because multiple versions of spoken corpora often exist, with a basic transcription being stored as one file and a time-aligned version as another. A time-aligned file already has, in essence, a reference system, in that the time points can be used to refer to specific locations in the dialogue. Nevertheless, we feel that it is both useful and straightforward to introduce a basic reference system into ordinary orthographic transcriptions as well. The references may be encoded either as a separate field, as in the TRAINS corpora:
58.3 : load the tanker
58.4 : then go back

or merged with speaker codes as in VERBMOBIL:
TIS019: gut , bin mit einverstanden , dann ist das klar .
HAH020: danke sch"on <A> .
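Either reference style can be handled mechanically. The following is a minimal sketch, assuming only the two line shapes illustrated above (a TRAINS-style numeric reference field, and a VERBMOBIL-style three-letter speaker code merged with a three-digit turn number); real corpora would need project-specific patterns.

```python
import re

# TRAINS-style:    "58.3 : load the tanker"   (reference field, no speaker code)
# VERBMOBIL-style: "TIS019: gut , bin ..."    (speaker code merged with turn number)
TRAINS_LINE = re.compile(r'^(\d+\.\d+)\s*:\s*(.*)$')
VERBMOBIL_LINE = re.compile(r'^([A-Z]{3})(\d{3}):\s*(.*)$')

def parse_line(line):
    """Split a transcription line into reference, speaker and text fields."""
    m = TRAINS_LINE.match(line)
    if m:
        return {'ref': m.group(1), 'speaker': None, 'text': m.group(2)}
    m = VERBMOBIL_LINE.match(line)
    if m:
        return {'ref': m.group(1) + m.group(2),
                'speaker': m.group(1), 'turn': int(m.group(2)),
                'text': m.group(3)}
    return None
```

The point of the sketch is simply that both conventions yield machine-recoverable references, which is what matters for the recommendation.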
Speaker attribution is most often indicated by a letter code at the left-hand margin. The code may or may not be enclosed in some kind of markup. Also, a speaker's turn may or may not be closed by an end tag. Sometimes, the code may be longer than a single letter; in VERBMOBIL, it also includes digits to indicate the turn number -- see 4 above. Some examples are:--
FROM TRAINS:

57.1 M: puts the OJs in the tanker
58.1 S: +southern route+

FROM THE SPANISH ORAL REFERENCE CORPUS:

<H2> Bueno.
<H1> Epi que no... Lisar...

BASED ON THE TEI RECOMMENDATIONS:

<u who=A> Have you heard that Hildegard is back?</u>
<u who=B> No.</u>

We make no strong recommendation about the form of the speaker attribution, other than to say that one should be included. Any codes used should relate to information already given in the text header.
Cases where there is more than one speaker, or where the transcriber is unsure who is speaking, should be indicated. The TEI recommends the following practices:--
Speaker overlap, i.e., synchronous speech by more than one participant in the dialogue, is one of the most important issues in dialogue transcription. An examination of existing corpora demonstrates that the most common method of indicating overlapping speech is by `bracketing' the relevant segments of both interlocutors' speech, although the choice of bracketing characters varies considerably (e.g., round brackets in VERBMOBIL, plus signs in TRAINS, SGML tags in the Spanish oral reference corpus). Sometimes, the speech of only one of the two or more overlapping interlocutors is bracketed: we recommend that all overlapping stretches of speech should be explicitly marked up.
Three other methods of handling overlap may also be encountered:--
<timeLine>
 <when id=P1 synch='A1 B1 C1'>
 <when id=P2 synch='A2 C2'>
</timeLine>
...
<u who=A>this is <anchor id=A1> my <anchor id=A2> turn</u>
<u who=B id=B1>balderdash</u>
<u who=C id=C1> no <anchor id=C2> it's mine</u>
Our recommendation is to continue with the practice of bracketing the overlapping speech. Because of the occurrence of multiple overlaps, we also feel that some kind of numerical indexing of the brackets is highly desirable. The extent to which this is done at present varies. Typically, it is omitted. Sometimes, universal `pointer' characters are used instead of numbers to show interdependence of overlapping segments; for instance, in VERBMOBIL, pairs of @) and (@ are used:--
HAH008: nein, da mu"s ich zu einem (Besuch@) nach Leipzig .
TIS009: (@m , aha) .

However, true numerical indexing reduces substantially the risk of confusing what overlaps with what. An example of this might be:--
HAH008: nein, da mu"s ich zu einem (Besuch 1) nach (Leipzig 2) .
TIS009: (1 m) , (2 aha) .

which would, perhaps, be clearer than the following, especially in a more complex context:--
HAH008: nein, da mu"s ich zu einem (Besuch@) nach (Leipzig@) .
TIS009: (@m) , (@aha) .
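The gain from numerical indexing can be made concrete: with indexed brackets, the overlapping stretches from two turns can be paired mechanically. The sketch below assumes the `(text N)` / `(N text)` shapes of the numerically indexed example; the names and regular expressions are illustrative only.

```python
import re

# Opening turn:  "(Besuch 1)"  -- text followed by an index number.
# Replying turn: "(1 m)"       -- index number followed by text.
OPEN = re.compile(r'\(([^()]*?)\s+(\d+)\)')
CLOSE = re.compile(r'\((\d+)\s+([^()]*?)\)')

def overlaps(line_a, line_b):
    """Pair the overlapping stretches of two turns by their shared index."""
    pairs = {}
    for text, idx in OPEN.findall(line_a):
        pairs[idx] = [text.strip(), None]
    for idx, text in CLOSE.findall(line_b):
        if idx in pairs:
            pairs[idx][1] = text.strip()
    return {k: tuple(v) for k, v in pairs.items()}
```

With the pointer-character (`@`) notation, by contrast, the pairing of segments in a multiply-overlapping passage can only be guessed at.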
We also suggest that overlap bracketing should not cross turns. In the Spanish oral reference corpus, for example, a single overlap tag encloses the stretch of overlapping speech across speaker boundaries:--
<H1> <simultáneo>Sí, sí.
<H2> ...había</simultáneo> sido mucho más compleja la posición

We think it would be clearer if the overlap markup were nested within the turns, thus:--
<H1> <simultáneo n=1>Sí, sí.</simultáneo>
<H2> <simultáneo n=1>...había</simultáneo> sido mucho más compleja la posición

where `n=1' has also been added to instantiate our recommendation on index numbers.
Most corpora transcribe speech using the standard (or dictionary) forms of words, regardless of their actual pronunciation. The use of standard word forms has a huge advantage, in that annotation and retrieval tools, for example, may be applied relatively unproblematically to speech as well as to writing.
Furthermore, everything (including numbers) is typically written out in full. This is important to distinguish, for example, different ways of saying the same string of numerals: for instance, 1980 can be said as `nineteen eighty' (the year) or as `one nine eight oh' (a telephone number) or as `one thousand nine hundred and eighty' (an ordinary number). Similarly, units of time, currency, percentages, degrees, and so on should normally be transcribed in full to capture their pronunciations -- e.g., two hundred dollars and fifty cents rather than $200.50; or ten to twelve rather than 11.50. However, in some cases, it may be more straightforward to transcribe numbers simply in arabic numerals: for example, in a restricted domain such as airline travel dialogues, the majority of numerical expressions may be flight numbers, which will conform to a uniform system of pronunciation.
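In practice, adherence to the write-everything-out convention can be checked automatically: any raw digit string remaining in a transcript is a candidate for manual expansion. A small illustrative sketch (the pattern is an assumption, not a standard):

```python
import re

# Flag raw digit strings (including decimal, time and list punctuation)
# that the transcriber may have left unexpanded. In a restricted domain
# such as airline travel, flight numbers might be deliberately exempted.
DIGITS = re.compile(r'\b\d[\d.,:]*\b')

def unexpanded_numerals(line):
    return DIGITS.findall(line)
```

A transcription workflow could run such a check before release, leaving the decision on each flagged token to the transcriber.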
Common contractions and merges that are also encountered in written texts (e.g., can't, gonna) are usually allowed, but otherwise dictionary forms are used, with special pronunciations indicated instead by editorial comments (see section 13 below). It is recommended that a supplementary list be drawn up of those common allowable contractions, &c., that are not included in the standard dictionary.
Pseudo-phonetic/modified orthographic transcription tends to be reserved for oddities such as non-words or neologisms that have no true dictionary form. Letters of the alphabet that are pronounced individually should be indicated, to distinguish, for example, the two different pronunciations of VIP -- /vIp/ vs. /vi: aI pi:/. It is sufficient to separate these with spaces (e.g., V I P), but sometimes additional markup is encountered, as, e.g., in VERBMOBIL: $V $I $P.
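Since the space-separated convention and the VERBMOBIL `$`-notation encode the same information, conversion between them is trivial. A sketch, assuming that a run of two or more space-separated capitals marks individually pronounced letters:

```python
import re

# A run of single capitals separated by spaces, e.g. "V I P".
LETTERS = re.compile(r'\b(?:[A-Z] )+[A-Z]\b')

def dollar_notation(text):
    """Rewrite space-separated spelled letters in VERBMOBIL-style $-notation."""
    return LETTERS.sub(
        lambda m: ' '.join('$' + c for c in m.group(0).split()), text)
```

This is one reason the plain space-separated form is sufficient: richer notations can be derived from it when needed.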
We recommend that dictionary-form transcription, and the other practices mentioned above, should continue to be the norm.
It has been suggested that a standard dictionary should be employed for each language in order to arrive at these dictionary forms, in the same way that the Duden has already been used for German in VERBMOBIL. However, this may be a little too idealistic. Often, dictionaries present more than one possible spelling of a word -- e.g., analyze vs. analyse. Also, it is difficult to conceive of transcribers checking spellings in a standard dictionary, when they feel confident of how to spell something. It may be that a style guide, such as Hart's Rules for English, would help with restricting common variant spellings. Alternatively, the standard dictionary could be that used by the spell checker of a specific word processor. Even here, however, different projects will use different software, and the dictionaries will thus also vary. Perhaps the best that can be hoped for, except where very detailed and extensive checking is feasible, is a fuzzy consensus notion of what constitutes, e.g., correctly spelled English. For languages with less spelling variation and/or with one standard `academy' dictionary, the situation may be somewhat more straightforward.
An interesting aspect of the guidelines used by the TRAINS project is that an interpretation (or expansion to full form) of word partials is added where possible. This has both advantages and disadvantages: where a partial is not part of a repeated sequence that includes a full form, it enables more content to be extracted for language understanding and so on, but, on the other hand, it may be argued that to interpret such partials -- even when they are unambiguous -- is to read additional (and perhaps unwarranted) information into the transcript beyond what needs to be represented.
<reg>Bert</reg>

Obviously, in these circumstances, the orig feature, which normally encodes the original form of words, cannot be used.
By `speech management' we understand phenomena such as quasi-lexical vocalizations, pauses, false starts, restarts, and so on.
<vocal type=quasi-lexical desc=uh-huh>

These representations could also be linked to the phonetic level in a multi-level corpus. However, the above approach may be found to be too verbose and cumbersome. It may be better simply to use a standard list of orthographic forms for these phenomena, without any additional markup, and this approach is also sanctioned by the TEI. Whichever approach is adopted, a universal list of these standardized forms should be drawn up for each language.
<del type=truncation>s</del>see
<del type=repetition>you you</del>you know
<del type=falseStart>it's</del>he's crazy
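One benefit of marking speech-management phenomena with explicit tags rather than free comment is processibility (see the aims stated at the outset): a tagger or parser can be given a fluent text by stripping the <del> stretches. A minimal sketch, assuming <del> elements of the shapes illustrated above:

```python
import re

# Remove <del ...>...</del> stretches (truncations, repetitions, false
# starts), together with any trailing whitespace, leaving fluent text.
DEL = re.compile(r'<del\b[^>]*>.*?</del>\s*')

def strip_disfluencies(text):
    return DEL.sub('', text)
```

The inverse is equally mechanical: the original disfluent form is always recoverable from the marked-up transcription, which a free-comment representation cannot guarantee.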
By `paralinguistic features' we mean those concomitant aspects of voice such as laughter, tempo, loudness, and so on. We exclude features that do not accompany speech but rather occur in isolation (e.g., laughter not superimposed on speech), for which see section 10 below.
We recommend that these should be encoded using, as far as possible, a finite set of standard features. This is sometimes the case already, but sometimes also free comment is allowed. A standard list of codes will enable features to be retrieved and counted in concordancing software, &c. Unconstrained comment tags should be avoided as much as possible. The TEI has already produced a basic list of paralinguistic features, which can be used or amended for EAGLES purposes; these are reproduced in Appendix A of this document.
The use of balanced start and end tags will enable the duration of a paralinguistic phenomenon to be encoded more clearly.
Non-verbal sounds are typically transcribed as a form of comment. Sometimes a standard set of codes is defined in place of free comment and this is to be recommended. However, if using a fixed set of specific sounds, it may be advisable for at least one more general feature to be retained (e.g., noise), to allow for unattributable sounds or those for some reason omitted from a standard list. It may be possible, following the practice of the Spanish oral reference corpus, to combine standard features and free comment, so that additional information is available as well as a basic indication of broadly what kind of noise has occurred.
We recommend that, minimally, a five-point typology of non-verbal sounds be encoded:--
Again, as with paralinguistic features, the use of start and end tags will allow a continuous noise to be represented.
These comprise what is, in informal speech, termed `body language' -- e.g., eye contact, gesture, and so on. Few corpora represent these features, since transcription is typically from audio tape rather than from video tape or live performance. Kinesic features are also of dubious relevance to work in natural language and speech processing, but may become more important as multimedia research progresses. Thus, at present, we do not recommend the development of a feature set or practices for their transcription. If desired, however, they could be included as editorial comments or using the TEI's <kinesic> tag, which has attributes to indicate the `actor', a description of the action, and whether or not it is a repeated action. (It might, in future, be advisable to adopt a different nomenclature for this tag, since the term kinesic may be confused with the phonological term kinetic.)
Basic information about the context of a dialogue (e.g., the participants, location, &c.) tends to be included in the text header. More `short-term' information, such as the arrival or departure of a participant, is normally introduced as editorial comment. We recommend that these practices should continue. The TEI suggests a special comment tag (<event>) for these features, with the same attribute set as <kinesic>.
Editorial comment comprises a number of cases where an interpretative comment needs to be added over and above the transcription of the feature types described above:
<reg orig='booer'>butter</reg>

If more than one standard orthographic word is included in a variant pronunciation, VERBMOBIL also adds a number indicating how many of the standardly transcribed words are represented by a given pronunciation. This feature is not part of the TEI syntax for <reg>, but might be an optional addition. It would be less important in a TEI representation than in VERBMOBIL, since, unlike VERBMOBIL, the TEI uses start and end tags to bracket the stretch of speech. If using a number, gonna, for example -- though this is a fairly standard form that would probably, in fact, not be normalized in transcription -- might be represented with something like:--
<reg words=2 orig='gonna'>going to</reg>

In view of the development of the SAMPA conventions for encoding phonetic (IPA) transcriptions in 7-bit ASCII, it might be recommended that alternative pronunciations be represented in SAMPA format rather than in an idiosyncratic modified orthography:--
<reg orig='bU?@'>butter</reg>

Since many computers still use a 7-bit character set, it is probably advisable, for the time being, to stick with SAMPA rather than attempting to use forms of encoding such as UNICODE.
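A further advantage of the <reg> representation is that the pronunciation variants are directly retrievable. The sketch below pulls the SAMPA (or modified-orthography) form and the normalized dictionary form from each <reg> element; it assumes single-quoted attribute values as in the examples above.

```python
import re

# Match <reg ... orig='...'>...</reg>, tolerating other attributes
# (e.g. words=2) before the orig attribute.
REG = re.compile(r"<reg\b[^>]*?orig='([^']*)'[^>]*>(.*?)</reg>")

def variants(text):
    """Return (original pronunciation, normalized form) pairs."""
    return REG.findall(text)
```

A list of such pairs, collected corpus-wide, would also be a natural starting point for the supplementary list of allowable contractions recommended earlier.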
That is what <note comment="Which one?">Geoff</note> said.