The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)

The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. The corpus is composed of 1,002,151 words of dialogues and monologues, both spontaneous and scripted, in 73,976 sentences and 49,670 utterance units (paragraphs). LLSCC has seven subcorpora, which are described below.

Conversations: 6 transcripts of face-to-face conversation, totalling 60,806 words;

Telephone Calls: 120 transcripts of telephone conversations between overseas Chinese and their families in China, totalling 295,026 words;

Play & Movie Transcripts: 12 transcripts of actual performances of TV plays, operas and movies, totalling 80,446 words;

TV Talk Show Transcripts: 20 transcripts of the CCTV talk show Shi Hua Shi Shuo (Tell It Like It Is), totalling 118,588 words;

Debate Transcripts: 9 transcripts of university student debates (1993-2002), totalling 77,909 words;

Oral Narratives: 49 narratives of native Beijing residents, totalling 102,262 words;

Edited Oral Narratives: 100 Chinese profiles (Beijing Ren edited by Zhang Xinxin & Sang Ye), totalling 267,114 words.
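
As a quick consistency check, the seven subcorpus word counts listed above add up to the reported corpus size of 1,002,151 words. The following is a minimal Python sketch that uses only the figures given in this description; the variable and function names are illustrative and not part of the corpus distribution.

```python
# Word counts per subcorpus, copied from the list above.
SUBCORPUS_WORDS = {
    "Conversations": 60_806,
    "Telephone Calls": 295_026,
    "Play & Movie Transcripts": 80_446,
    "TV Talk Show Transcripts": 118_588,
    "Debate Transcripts": 77_909,
    "Oral Narratives": 102_262,
    "Edited Oral Narratives": 267_114,
}

# Total corpus size as stated in the opening paragraph.
REPORTED_TOTAL = 1_002_151


def total_words(counts):
    """Sum the word counts of all subcorpora."""
    return sum(counts.values())


def shares(counts):
    """Return each subcorpus's share of the corpus as a percentage."""
    total = total_words(counts)
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    print(total_words(SUBCORPUS_WORDS) == REPORTED_TOTAL)
    for name, pct in shares(SUBCORPUS_WORDS).items():
        print(f"{name}: {pct}%")
```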

The corpus is XML-compliant. Each corpus file is composed of a corpus header and a text body. The header gives general information about the corpus file. In the body, utterance units (or paragraphs), sentences and tokens are marked up, and each token is also annotated for part of speech.
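
To illustrate the kind of markup described above, here is a minimal Python sketch that reads a file with a header and a body, then counts utterance units, sentences and part-of-speech-tagged tokens. The element and attribute names (file, header, body, u, s, w, pos) and the tag values in the sample are assumptions made for illustration only; the actual LLSCC schema is not specified in this description and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. All element names, attribute names and POS tag
# values here are assumptions, not the documented LLSCC markup.
SAMPLE = """<file>
  <header><title>sample</title></header>
  <body>
    <u n="1">
      <s n="1"><w pos="r">我</w><w pos="v">去</w><w pos="ns">北京</w></s>
    </u>
    <u n="2">
      <s n="1"><w pos="a">好</w></s>
      <s n="2"><w pos="r">你</w><w pos="v">来</w></s>
    </u>
  </body>
</file>"""


def summarise(xml_text):
    """Return (unit count, sentence count, list of (token, pos) pairs)."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    units = body.findall("u")          # utterance units (paragraphs)
    sentences = body.findall(".//s")   # sentences at any depth in the body
    tokens = [(w.text, w.get("pos")) for w in body.iter("w")]
    return len(units), len(sentences), tokens


if __name__ == "__main__":
    n_units, n_sentences, tokens = summarise(SAMPLE)
    print(n_units, n_sentences, len(tokens))
    for word, pos in tokens:
        print(f"{word}\t{pos}")
```

The same traversal would apply to a real corpus file once the sample names are replaced with those used in the actual markup.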

The corpus is a joint project undertaken by Dr. Richard Xiao (UCREL, Lancaster University) and Professor Hongyin Tao (University of California, Los Angeles). Richard Xiao is indebted to the UK ESRC for supporting the project Contrasting English and Chinese (ESRC Award Reference RES-000-23-0553). We would also like to thank Dr. Jiajin Xu for providing part of the debate data. Regrettably, this corpus cannot be released to the public for the time being because of copyright restrictions.


Richard Xiao