LE-EAGLES--WP4--1B.1

ORTHOGRAPHIC TRANSCRIPTION OF DIALOGUES

Andrew Wilson

17 Nov 1997

1 Background

These recommendations take account of those made by Llisterri (1996) and by Gibbon et al. (1997) within the EAGLES framework and of those made by Johansson et al. (1991) for the TEI, now largely codified in P3 (Sperberg-McQueen and Burnard 1994). The corpus survey on which the recommendations are based comes partly from the document of Johansson et al. (1991) and partly from a fresh extension of it which pays particular attention both to corpora produced for dialogue projects and to corpora in European languages other than English.

We try in particular to address the issue of integrating spoken and written resources -- e.g., making representations of spoken corpora accessible to the language engineering (not just the speech technology) community. For this reason, we sometimes focus on processibility of texts (e.g., by stochastic or rule-based taggers and parsers) as an issue.

We make no strong recommendations as to the means of representation, so that, e.g., whilst we may use examples based on the TEI, we do not necessarily aim to push people into TEI conformance. Rather, we concentrate on the features that should be represented. However, some forms of representation naturally capture certain phenomena more easily than others: for instance, the start and end tags used in SGML/TEI are particularly useful for indicating the duration of a simultaneous phenomenon such as a non-verbal noise. It is also recommended that, in choosing a representation scheme, individual symbols that might be confused with other markup be avoided: for example, the @ character used by VERBMOBIL to mark overlapping speech could possibly be confused with the SAMPA representation of the schwa character. The use of tags with whole-word representations (e.g., the Spanish <simultáneo>) would minimize this kind of confusion. Furthermore, one should also consider whether coding representations might be standardized on a single language, regardless of the language of the corpus data. This would make interchangeability more straightforward than if there were a dozen translations of every single tag or feature.

The issue of obligatory vs. recommended vs. optional levels (cf. the recommendations on morphosyntax [Leech and Wilson 1996]) is one that should also be addressed: obviously, some applications will require more detailed transcription and analysis than others.

2 Documentation on texts

There are two primary ways of documenting information about texts:

1. a separate set of documentation -- e.g., a manual

2. a header within the text itself, which may be

   (a) structured -- e.g., a TEI header

   (b) relatively unstructured -- e.g., a few lines of COCOA references.

We recommend that a header be the minimum requirement for text documentation. An in-text header -- as opposed to external documentation -- makes it harder to confuse texts; it can be used as part of an automatic analysis, to output background information; and it enables quick reference, especially when a manual is for some reason not to hand.

As a bare minimum, the header should contain an identifier for the specific text and basic information on the speakers. By the latter, we do not necessarily mean personal details but rather a list of the codes used to identify every contributor to the dialogue. Additional optional information may include:

3 Basic text units

The most common text units in dialogue corpora are the text (i.e., a self-contained dialogue or dialogue sample with a natural or editorially created beginning and end) and the turn (or contribution). Tone units are also sometimes marked. Some researchers additionally posit a functionally-defined unit of utterance, which is not synonymous with the turn. Orthographic sentences are also often present, but these should probably be viewed as artefacts of transcription, rather than as intentional text units per se.

We recommend that the text and turn should be the basic text units in transcription. We do not recommend the use of tone units in orthographic transcription, as these are difficult to identify reliably (see Knowles 1991): any marking of tone units should be left to the interpretative stage of prosodic markup (Llisterri's [1996] S3 level). The notion of turn is itself not wholly unproblematic, since interruptions and overlap can occur, but there are methods for representing these aspects (see, e.g., 6 below). Sentences should be used in transcription for greater intelligibility and processibility (e.g., by taggers that assume the sentence as the basic processing unit), but it should be emphasized that the turn is the basic unit of spoken text.

4 Reference system

A reference system -- i.e., a set of codes that allow reference to be made to specific texts and locations in texts -- may be absent from transcribed spoken corpora. This is partly due to the fact that multiple versions of spoken corpora often exist, with a basic transcription being stored as one file and a time-aligned version being stored as a different file. A time-aligned file already has, in essence, a reference system, in that the time points can be used to refer to specific locations in the dialogue. Nevertheless, we feel that it is both useful and straightforward to introduce a basic reference system into ordinary orthographic transcriptions as well. The references may be encoded either as a separate field, as in the TRAINS corpora:

58.3  : load the tanker
58.4  : then go back

or merged with speaker codes, as in VERBMOBIL:

TIS019: gut , bin mit einverstanden , dann ist das klar .
HAH020: danke sch"on <A> .
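One practical benefit of such references is that they are machine-processable. The following sketch (in Python; the function names and regular expressions are our own illustrations derived from the examples above, not part of either project's specification) parses both styles of reference line:

```python
import re

# TRAINS-style line:    "58.3  : load the tanker"  (turn.utterance : text)
# VERBMOBIL-style line: "TIS019: gut , ..."        (speaker code + turn number : text)
TRAINS_LINE = re.compile(r"^(\d+)\.(\d+)\s*:\s*(.*)$")
VERBMOBIL_LINE = re.compile(r"^([A-Z]+)(\d+):\s*(.*)$")

def parse_trains(line):
    """Return (turn, utterance, text), or None if the line has no reference."""
    m = TRAINS_LINE.match(line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

def parse_verbmobil(line):
    """Return (speaker, turn_number, text), or None if the line has no reference."""
    m = VERBMOBIL_LINE.match(line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), m.group(3)

print(parse_trains("58.3  : load the tanker"))      # (58, 3, 'load the tanker')
print(parse_verbmobil("TIS019: gut , bin mit einverstanden , dann ist das klar ."))
```

A retrieval or concordancing tool could use such references to cite exact dialogue locations without recourse to the time-aligned file.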

5 Speaker attribution

Speaker attribution is most often indicated by a letter code at the left-hand margin. The code may or may not be enclosed in some kind of markup. Also, a speaker's turn may or may not be closed by an end tag. Sometimes, the code may be longer than a single letter; in VERBMOBIL, it also includes digits to indicate the turn number -- see 4 above. Some examples are:--

FROM TRAINS:

57.1 M: puts the OJs  in the   tanker
58.1 S:       +southern route+

FROM THE SPANISH ORAL REFERENCE CORPUS:

<H2> Bueno.
<H1> Epi que no... Lisar...

BASED ON THE TEI RECOMMENDATIONS:

<u who=A> Have you heard that Hildegard is back?</u>
<u who=B> No.</u>

We make no strong recommendation about the form of the speaker attribution, other than to say that one should be included. Any codes used should relate to information already given in the text header.
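Whichever form is chosen, speaker codes should be trivially extractable by software. As an illustration only (the pattern below assumes the TEI-style example given above, with unquoted attribute values, and is not a general SGML parser), a sketch in Python:

```python
import re

# Pull (speaker, text) pairs from TEI-style turns: <u who=X> ... </u>
U_TAG = re.compile(r"<u\s+who=(\w+)>(.*?)</u>")

def turns(text):
    """Return a list of (speaker_code, turn_text) tuples."""
    return [(who, content.strip()) for who, content in U_TAG.findall(text)]

sample = """<u who=A> Have you heard that Hildegard is back?</u>
<u who=B> No.</u>"""

for who, what in turns(sample):
    print(who, "->", what)
```

The speaker codes recovered in this way can then be checked against the list of contributors declared in the text header.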

Cases where there is more than one speaker, or where the transcriber is unsure who is speaking, should be indicated. The TEI recommends the following practices:--

The same features can be marked with slightly different conventions in non-TEI markup schemes.

6 Speaker overlap

Speaker overlap, i.e., synchronous speech by more than one participant in the dialogue, is one of the most important issues in dialogue transcription. An examination of existing corpora demonstrates that the most common method of indicating overlapping speech is by `bracketing' the relevant segments of both interlocutors' speech, although the choice of bracketing characters varies considerably (e.g., round brackets in VERBMOBIL, plus signs in TRAINS, SGML tags in the Spanish oral reference corpus). Sometimes, the speech of only one of the two or more overlapping interlocutors is bracketed: we recommend that all overlapping stretches of speech should be explicitly marked up.

Three other methods of handling overlap may also be encountered:--

1. Vertical alignment of overlapping segments (widely used in conversation analysis, &c.).

2. Reorganization of overlaps into separate turns, without representing where overlaps occur (as used, e.g., in the Czech National Corpus).

3. The TEI practice of using time-line pointers, for example:--
<timeLine>
     <when id=P1 synch='A1 B1 C1'>
     <when id=P2 synch='A2 C2'>
</timeLine>
     ...
<u who=A>this is <anchor id=A1> my <anchor id=A2> turn</u>
<u who=B id=B1>balderdash</u>
<u who=C id=C1> no <anchor id=C2> it's mine</u>

We strongly discourage the use of the first two alternatives. The first is technically problematic: it often fails to delimit the overlapping stretches of speech with markup, marking only the start of an overlap, so this information can easily be lost, especially when different display or print fonts are used that alter the visible alignment. The second is simply an idealization: it falsifies what is happening and obliterates any evidence of overlap in favour of neat, drama-like turns. The third (TEI) option is less objectionable, and has the advantage of dealing very well with multiple overlaps, but it is perhaps a little too cumbersome.

Our recommendation is to continue with the practice of bracketing the overlapping speech. Because of the occurrence of multiple overlaps, we also feel that some kind of numerical indexing of the brackets is highly desirable. The extent to which this is done at present varies. Typically, it is omitted. Sometimes, universal `pointer' characters are used instead of numbers to show interdependence of overlapping segments; for instance, in VERBMOBIL, pairs of @) and (@ are used:--

HAH008: nein, da mu"s ich zu einem (Besuch@) nach Leipzig .
TIS009: (@m , aha) .

However, true numerical indexing substantially reduces the risk of confusing what overlaps with what. An example of this might be:--

HAH008: nein, da mu"s ich zu einem (Besuch 1) nach (Leipzig 2) .
TIS009: (1 m)  , (2 aha) .

which would, perhaps, be clearer than the following, especially in a more complex context:--

HAH008: nein, da mu"s ich zu einem (Besuch@) nach (Leipzig@) .
TIS009: (@m)  , (@aha) .
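A further argument for numerical indexing is that the pairing of overlap segments can then be resolved mechanically. The sketch below (Python; the bracket syntax is the illustrative one used in the indexed example above, and the function name is our own) pairs up indexed segments across two turns:

```python
import re

# "(word 1)" in one turn is paired with "(1 word)" in the other.
OPEN = re.compile(r"\(([^()]+?)\s+(\d+)\)")   # e.g. (Besuch 1)
CLOSE = re.compile(r"\((\d+)\s+([^()]+?)\)")  # e.g. (1 m)

def overlaps(turn_a, turn_b):
    """Map each overlap index to the pair of overlapping segments."""
    a = {int(n): seg for seg, n in OPEN.findall(turn_a)}
    b = {int(n): seg for n, seg in CLOSE.findall(turn_b)}
    return {n: (a.get(n), b.get(n)) for n in sorted(set(a) | set(b))}

a = 'nein, da mu"s ich zu einem (Besuch 1) nach (Leipzig 2) .'
b = "(1 m)  , (2 aha) ."
print(overlaps(a, b))  # {1: ('Besuch', 'm'), 2: ('Leipzig', 'aha')}
```

With the unindexed @-notation, such a pairing would have to rely on the order of occurrence alone, which breaks down precisely in the complex multiple-overlap cases where the information matters most.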

We also suggest that overlap bracketing should not cross turns. In the Spanish oral reference corpus, for example, a single overlap tag encloses the stretch of overlapping speech across speaker boundaries:--

<H1> <simultáneo>Sí, sí.
<H2> ...había</simultáneo> sido mucho más compleja la posición

We think it would be clearer if the overlap markup were nested within the turns, thus:--

<H1> <simultáneo n=1>Sí, sí.</simultáneo>
<H2> <simultáneo n=1>...había</simultáneo> sido mucho más compleja la posición

where `n=1' has also been added to instantiate our recommendation on index numbers.

7 Word form

Most corpora transcribe speech using the standard (or dictionary) forms of words, regardless of their actual pronunciation. The use of standard word forms has a huge advantage, in that annotation and retrieval tools, for example, may be applied relatively unproblematically to speech as well as to writing.

Furthermore, everything (including numbers) is typically written out in full. This is important to distinguish, for example, different ways of saying the same string of numerals: for instance, 1980 can be said as `nineteen eighty' (the year) or as `one nine eight oh' (a telephone number) or as `one thousand nine hundred and eighty' (an ordinary number). Similarly, units of time, currency, percentages, degrees, and so on should normally be transcribed in full to capture their pronunciations -- e.g., two hundred dollars and fifty cents rather than $200.50; or ten to twelve rather than 11.50. However, in some cases, it may be more straightforward to transcribe numbers simply in arabic numerals: for example, in a restricted domain such as airline travel dialogues, the majority of numerical expressions may be flight numbers, which will conform to a uniform system of pronunciation.

Common contractions and merges that are also encountered in written texts (e.g., can't, gonna) are usually allowed, but otherwise dictionary forms are used, with special pronunciations indicated instead by editorial comments (see section 13 below). It is recommended that a supplementary list be drawn up of those common allowable contractions, &c., that are not included in the standard dictionary.

Pseudo-phonetic/modified orthographic transcription tends to be reserved for oddities such as non-words or neologisms that have no true dictionary form. Letters of the alphabet that are pronounced individually should be indicated, to distinguish, for example, the two different pronunciations of VIP -- /vIp/ vs. /vi: aI pi:/. It is sufficient to separate these with spaces (e.g., V I P), but sometimes additional markup is encountered, as, e.g., in VERBMOBIL: $V $I $P.

We recommend that dictionary-form transcription, and the other practices mentioned above, should continue to be the norm.

It has been suggested that a standard dictionary should be employed for each language in order to arrive at these dictionary forms, in the same way that the Duden has already been used for German in VERBMOBIL. However, this may be a little too idealistic. Often, dictionaries present more than one possible spelling of a word -- e.g., analyze vs. analyse. Also, it is difficult to conceive of transcribers checking spellings in a standard dictionary, when they feel confident of how to spell something. It may be that a style guide, such as Hart's Rules for English, would help with restricting common variant spellings. Alternatively, the standard dictionary could be that used by the spell checker of a specific word processor. Even here, however, different projects will use different software, and the dictionaries will thus also vary. Perhaps the best that can be hoped for, except where very detailed and extensive checking is feasible, is a fuzzy consensus notion of what constitutes, e.g., correctly spelled English. For languages with less spelling variation and/or with one standard `academy' dictionary, the situation may be somewhat more straightforward.

Word partials.

Word partials are typically transcribed as follows: as much of the word as is pronounced is transcribed, followed by a `break-off' character -- for instance a dash or an asterisk. Sometimes a tag is used instead of a special character, e.g., <palabra cortada> in the Spanish oral reference corpus. Some guidelines (e.g., the Gothenburg corpus of spoken Swedish) also allow for word-end partials, in which case the `word partial' character may occur at the beginning rather than the end of a string. Most transcriptions of word partials use standard or modified orthography, but this can be confusing in cases like the English sequence po-, which may represent either the diphthong of poll or the simple vowel of pot. It may thus be better to use some form of phonetic representation, such as SAMPA, for word partials.

An interesting aspect of the guidelines used by the TRAINS project is that an interpretation (or expansion to full form) of word partials is added where possible. This has both advantages and disadvantages: where a partial is not part of a repeated sequence that includes a full form, it enables more content to be extracted for language understanding and so on, but, on the other hand, it may be argued that to interpret such partials -- even when they are unambiguous -- is to read additional (and perhaps unwarranted) information into the transcript beyond what needs to be represented.

Orthography.

As to the more general form of transcription, the use of a basic subset of the standard orthography is both normal and desirable. Sentence-initial capitals may be omitted, but, otherwise, normal capitalization and at least full stops should be used. This improves readability for the human user and improves processibility for taggers, parsers, and so on. Obviously, it is understood that such standard orthography is, to an extent, interpretative when applied to speech, but its advantages outweigh its disadvantages. The use of punctuation characters other than full stops is an open question. Whatever punctuation scheme is adopted, the general rule must be to explain it in the text header: for example, if impressionistic punctuation has been used, this should be explicitly stated.

Unintelligible speech.

Normally a single code is used -- e.g., <inintelligible> in the Spanish oral reference corpus or <%> in VERBMOBIL. Sometimes a form of bracketing is employed instead, with the number of unintelligible syllables given. We recommend that a straightforward single code be used, since guessing how many syllables are missing is of dubious accuracy and may be beyond the ability of the transcribers.

Uncertain transcription.

Normally, uncertain transcriptions are bracketed, but with different conventions from those used for truly unintelligible speech (e.g., double instead of single brackets). Sometimes, a special character is added to a word instead -- e.g., % in VERBMOBIL. Where a transcription is possible, but not completely certain, we recommend the following practices:--

1. Uncertain syllables or sounds should be bracketed within the word, as is the practice of the Spanish oral reference corpus -- e.g., burri<(t)>o.

2. Uncertain words and phrases should be placed inside a set of start and end tags, e.g., <unclear>burrito</unclear>. The TEI tag shown here also has an optional attribute reason.

Substitutions.

Also to be considered under this heading are those cases where words -- normally proper nouns -- are to be replaced for confidentiality or other reasons. We recommend that these be marked with codes, since this makes it clearer where an original text word has been replaced. The practice of simply substituting an alternative name without comment should be avoided, but a replacement may be used if it is commented, e.g., by the use of a TEI regularization tag:--

<reg>Bert</reg>

Obviously, in these circumstances, the orig feature, which normally encodes the original form of words, cannot be used.

8 Speech management

By `speech management' we understand phenomena such as quasi-lexical vocalizations, pauses, false starts, restarts, and so on.

Pauses.

Unfilled pauses (by which we mean perceived pauses, rather than silence in the speech signal) are typically marked with suspense dots (...) or some other special punctuation such as an oblique slash. The Gothenburg Swedish corpus uses various numbers of slashes (/, //, or ///) to give an impression of the length of a pause. Sometimes a tag is used instead of punctuation -- e.g., <P> in VERBMOBIL. Both methods may allow additional comments to be added as to the length of a pause. We suggest that tag markup is clearer than the use of punctuation characters, which could be confused with genuine punctuation; it also makes the addition of length information a little more systematic. We suggest, however, that length information be an optional feature: its value in simple orthographic transcription is questionable.

Quasi-lexical vocalizations.

Most corpora make some attempt to standardize the transcription of quasi-lexical vocalizations, interjections, and filled pauses, such as um, uh-huh, oi, ooh and ah. In contrast, the Spanish oral reference corpus avoids the use of invented/idealized word forms and instead uses markup to indicate where quasi-lexical vocalizations occur. The down side of this, however, is that the features used by the Spanish corpus confuse transcription with speech-act annotation: they require an interpretation of the function of a vocalization (e.g., agreement, negation). A possible compromise, mentioned as a possibility in the TEI guidelines, would be to merge the two systems, so that quasi-lexical vocalizations have standardized forms but occur within markup indicating that standardization has taken place. For example:--

<vocal type=quasi-lexical desc=uh-huh>

These representations could also be linked to the phonetic level in a multi-level corpus. However, the above approach may be found to be too verbose and cumbersome. It may be better simply to use a standard list of orthographic forms for these phenomena, without any additional markup, and this approach is also sanctioned by the TEI. Whichever approach is adopted, a universal list of these standardized forms should be drawn up for each language.

Other phenomena.

Many corpora do not identify repetitions, false starts, &c. However, for the purpose of activities such as part-of-speech tagging, it may be important to do so, so that, for example, repetitions do not disturb the working or training of a Markov model of category transitions. If repetitions and so on are identified in the transcription, we suggest that one full-word transcription should be retained in the main running text and the rest marked up with some kind of bracketing. The TEI's <del> tag is one possible way of representing this and allows the various types of phenomenon to be noted (but see section 7 above for a preferred method of transcribing truncations [phonetic representation rather than orthographic characters]):--
<del type=truncation>s</del>see
<del type=repetition>you you</del>you know
<del type=falseStart>it's</del>he's crazy
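The practical pay-off of such markup is that disfluencies can be removed mechanically before tagging or training. A sketch of such a clean-up pass (Python; the regular expression assumes the unquoted-attribute <del> style shown above and the function name is our own illustration):

```python
import re

# Remove TEI-style <del type=...>...</del> spans so that repetitions,
# truncations, and false starts do not disturb a tagger or Markov model.
DEL_SPAN = re.compile(r"<del\s+type=\w+>.*?</del>\s*")

def strip_deletions(text):
    """Return the running text with all <del> spans removed."""
    return DEL_SPAN.sub("", text)

print(strip_deletions("<del type=repetition>you you</del>you know"))  # you know
print(strip_deletions("<del type=falseStart>it's</del>he's crazy"))   # he's crazy
```

The inverse is equally possible: a tool interested in disfluency phenomena can retrieve exactly the <del> spans, by type, that this pass discards.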

9 Paralinguistic features

By `paralinguistic features' we mean those concomitant aspects of voice such as laughter, tempo, loudness, and so on. We exclude features that do not accompany speech but rather occur in isolation (e.g., laughter not superimposed on speech), for which see section 10 below.

We recommend that these should be encoded using, as far as possible, a finite set of standard features. This is sometimes the case already, but sometimes also free comment is allowed. A standard list of codes will enable features to be retrieved and counted in concordancing software, &c. Unconstrained comment tags should be avoided as much as possible. The TEI has already produced a basic list of paralinguistic features, which can be used or amended for EAGLES purposes; these are reproduced in Appendix A of this document.

The use of balanced start and end tags will enable the duration of a paralinguistic phenomenon to be encoded more clearly.

10 Non-verbal sounds

Non-verbal sounds are typically transcribed as a form of comment. Sometimes a standard set of codes is defined in place of free comment and this is to be recommended. However, if using a fixed set of specific sounds, it may be advisable for at least one more general feature to be retained (e.g., noise), to allow for unattributable sounds or those for some reason omitted from a standard list. It may be possible, following the practice of the Spanish oral reference corpus, to combine standard features and free comment, so that additional information is available as well as a basic indication of broadly what kind of noise has occurred.

We recommend that, minimally, a five-point typology of non-verbal sounds be encoded:--

1. non-verbal but vocal utterances attributable to the speaker (e.g., laugh, snort)

2. non-verbal but vocal utterances not attributable to the speaker

3. non-vocal noises attributable to the speaker (e.g., snapping fingers)

4. non-vocal noises not attributable to the speaker (e.g., doorbell ringing)

5. noises that are not humanly produced or communicative (e.g., dog barking)

Again, as with paralinguistic features, the use of start and end tags will allow a continuous noise to be represented.

11 Non-verbal gestures

These comprise what is, in informal speech, termed `body language' -- e.g., eye contact, gesture, and so on. Few corpora represent these features, since transcription is typically from audio tape rather than from video tape or live performance. Kinesic features are also of dubious relevance to work in natural language and speech processing, but may become more important as multimedia research progresses. Thus, at present, we do not recommend the development of a feature set or practices for their transcription. If desired, however, they could be included as editorial comments or using the TEI's <kinesic> tag, which has attributes to indicate the `actor', a description of the action, and whether or not it is a repeated action. (It might, in future, be advisable to adopt a different nomenclature for this tag, since the term kinesic may be confused with the phonological term kinetic.)

12 Situational features

Basic information about the context of a dialogue (e.g., the participants, location, &c.) tends to be included in the text header. More `short-term' information, such as the arrival or departure of a participant, is normally introduced as editorial comment. We recommend that these practices should continue. The TEI suggests a special comment tag (<event>) for these features, with the same attribute set as <kinesic>.

13 Editorial comment

Editorial comment comprises a number of cases where an interpretative comment needs to be added over and above the transcription of the feature types described above:

Alternative transcriptions.

Pseudo-phonetic or modified orthographic transcription is largely avoided as a general rule. However, in at least some cases, it may be desirable to indicate, separately from a full phonetic/phonemic transcription, how a word or phrase was pronounced, e.g., because it is a dialect form. We would prefer not to see modified orthography in the transcription itself: this may cause difficulty in concordancing or processing the text and may, in any case, be misleading -- e.g., for non-native speakers using the corpus. We recommend an approach similar to that adopted by VERBMOBIL, namely that alternative transcriptions should be enclosed within markup brackets. A similar approach is recommended by the TEI using the <reg> tag:--
<reg orig='booer'>butter</reg>

If more than one standard orthographic word is included in a variant pronunciation, VERBMOBIL also adds a number indicating how many of the standardly transcribed words are represented by a given pronunciation. This feature is not part of the TEI syntax for <reg>, but might be an optional addition. It would be less important in a TEI representation than in VERBMOBIL, since VERBMOBIL does not use start and end tags to bracket the stretch of speech. If using a number, gonna, for example -- though this is a fairly standard form that would probably, in fact, not be normalized in transcription -- might be represented with something like:--

<reg words=2 orig='gonna'>going to</reg>

In view of the development of the SAMPA conventions for encoding phonetic (IPA) transcriptions in 7-bit ASCII, it might be recommended that alternative pronunciations be represented in SAMPA format rather than in an idiosyncratic modified orthography:--

<reg orig='bU?@'>butter</reg>

Since many computers still use a 7-bit character set, it is probably advisable, for the time being, to stick with SAMPA rather than attempting to use forms of encoding such as UNICODE.
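A further advantage of the <reg> approach is that the (original, normalized) pairs can be harvested automatically, e.g., to build a pronunciation-variant lexicon. As an illustration (Python; the pattern assumes the single-quoted orig attribute and optional words attribute shown in the examples above, and is not a general SGML parser):

```python
import re

# Recover (original_form, normalized_form) pairs from TEI-style <reg> tags.
REG = re.compile(r"<reg\s+(?:words=\d+\s+)?orig='([^']*)'>(.*?)</reg>")

def regularizations(text):
    """Return a list of (orig, normalized) tuples found in the text."""
    return REG.findall(text)

sample = "<reg orig='bU?@'>butter</reg> and <reg words=2 orig='gonna'>going to</reg>"
print(regularizations(sample))  # [('bU?@', 'butter'), ('gonna', 'going to')]
```

Note that pairs whose orig value is in SAMPA (as in the first example) could be passed straight to a phonetic processing component, which is a further argument for the SAMPA recommendation above.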

General comments.

General comments are typically introduced within some form of distinctive bracketing. The Gothenburg corpus of spoken Swedish encloses the stretch of text to which the comment refers as well as the comment itself. Comments in this scheme can also be numbered. We feel that enclosing the text commented on may make the comments more transparent. Numbers are probably not essential (in the Gothenburg corpus, comments occur on a different line to the transcribed text, which is why they are used there). In an SGML (but non-TEI-conformant) representation, this might look something like the following:--
That is what <note comment="Which one?">Geoff</note> said.

14 Summary of recommendations

Obligatory

Recommended

Optional

References

Gibbon, D., Moore, R. and Winski, R. (1997). Handbook of standards and resources for spoken language systems. Berlin: Mouton de Gruyter.

Johansson, S. et al. (1991). Text Encoding Initiative, Spoken Text Work Group: Working paper on spoken texts (October 1991). Manuscript.

Knowles, G. (1991). Prosodic labelling: the problem of tone group boundaries. In: S. Johansson and A.-B. Stenström (eds), English computer corpora: selected papers and research guide. Berlin: Mouton de Gruyter, pp. 149-63.

Leech, G.N. and Wilson, A. (1996). EAGLES recommendations for the morphosyntactic annotation of corpora. EAGLES document EAG-TCWG-MAC/R.

Llisterri, J. (1996). EAGLES preliminary recommendations on spoken texts. EAGLES document EAG-TCWG-SPT/P.

Sperberg-McQueen, C.M. and Burnard, L. (1994). Guidelines for text encoding and interchange (TEI P3). Chicago and Oxford: ACH-ACL-ALLC Text Encoding Initiative.

Appendix A. TEI paralinguistic features

Tempo

Loudness

Pitch range

Tension

Rhythm

Voice quality
