LCMC: Basic information

1. Aims

The LCMC corpus was built in response to the general lack of publicly available balanced corpora of Chinese. While some corpus resources exist, most of them, for example the PH Corpus and the PFR People’s Daily Corpus released by the Institute of Computational Linguistics, Peking University, are composed exclusively of newswire texts and are thus not balanced. Nor are the corpora released by the LDC balanced: they contain only newswire texts or official documents. The only publicly available balanced corpus of Mandarin Chinese is the Sinica Corpus, which was produced by Academia Sinica, Taiwan. As Taiwan has been separated from Mainland China for decades, the language used in Taiwan is not exactly the same as that used on the mainland. As such, the Sinica Corpus does not represent modern Mandarin Chinese as written on the mainland of China. The balanced Chinese corpus built in China, as reported in Zhou & Yu (1997), is not publicly available. Since the corpus approach has increasingly been recognized as a useful tool for linguistic investigation, the research community has felt a growing need for appropriate corpus resources for this major world language. The LCMC addresses this need, as we have made the corpus publicly available worldwide for academic research. In addition to monolingual studies of the Chinese language, LCMC, in combination with FLOB, is also a sound basis for contrastive studies of Chinese and English, whether one wishes to compare the two languages as a whole or by text type.

2. Sampling frame and text collection

In the LCMC corpus, the FLOB sampling frame is followed strictly except for two minor variations. The first variation relates to the text categories: we replaced western and adventure fiction (category N) with martial arts fiction. There are three reasons for this decision. Firstly, there is simply no western fiction in China; secondly, martial arts fiction is broadly a type of adventure fiction, and as a very popular and important fiction type in China it should be represented; thirdly, the language used in martial arts fiction is distinctive and is therefore worth sampling in its own right. Most stories of this type, even those published recently, are under the influence of vernacular Chinese, i.e. modern Chinese styled to appear like classical Chinese. While the inclusion of this text type has made the tasks of POS tagging and post-editing more difficult, it may also make it possible to compare representations of vernacular Chinese and modern Chinese. The second variation was caused by problems in matching the sampling period. Because texts of some categories were not readily available in sufficient numbers for 1991, we modified the FLOB sampling period slightly by also including samples produced within two years of 1991. As a result, of the 500 samples included in the LCMC corpus, around two thirds were produced in 1991, while the remaining third were produced within two years of 1991. We assume that a language does not change significantly over such a short time span.

The LCMC corpus has been constructed using written Mandarin Chinese texts published in Mainland China to ensure some degree of textual homogeneity. It should be noted that plain written texts alone have been transcribed, with tables, illustrations, pictures, formulae and special symbols omitted and replaced with a gap element marked by the wording ‘omission’. Long citations from translated texts or texts produced outside the sampling period were also omitted so that the effect of translationese could be excluded and L1 quality guaranteed.

While a small number of samples that conformed to our sampling frame were collected from the Internet, most samples were provided by the SSReader Digital Library in China. As each page of the electronic books in the library comes in PDG format, these pages were converted into text files using an OCR program provided by the digital library. This conversion process resulted in an error rate of 1-3%, depending on the quality of the page images. Each electronic text file was proofread and corrected independently by two native speakers of Mandarin Chinese so as to keep the transcribed raw texts as accurate as possible.

While the digital library has a very large collection of books, it does not provide newspapers. The only sources of newswire texts in the library are a dozen collections of news reports that won awards at various levels. These collections, however, represent newswire texts from more than eighty newspapers and television or radio stations. The samples from these sources account for around two thirds of the texts in the press categories (A-C). The remaining third were sampled from Xinhua News Agency newswire texts (excerpted from the PH Corpus). Considering that Xinhua is the most important and representative news provider in China, we believe that this proportion is justified.

Unlike western languages such as English, in which words are typically separated by white space and can thus be counted relatively easily, Chinese is written as a running string of characters. Consequently, while it is easy to count characters, it is not possible to count words in raw text. As the proofreading of raw electronic texts is time-consuming and expensive, it would have been uneconomical to proofread an excessively large sample and then use only around 2,000 words of it. Based on a pilot study of the ratio of words to characters, we decided to adopt a ratio of 1:1.6, which means that we needed a running text of 3,200 characters for a 2,000-word sample. When a text was shorter than the required length, texts of similar quality were combined into one sample. For longer texts, e.g. those from books, we adopted a random procedure so that beginning, middle and ending samples are included in all categories. Once the texts had been segmented and it was possible to count exact word numbers, they were automatically cut to around 2,000 words while keeping the final sentence complete. However, while the ratio that we decided on worked for most texts, a small number of texts yielded slightly fewer than 2,000 words. In these cases, the whole processed text was included. While some individual samples contain slightly fewer than 2,000 words and some slightly more, the total number of words for each text type roughly conforms to our sampling frame.
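The length estimate and the trimming step described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the corpus: the function names, the punctuation sets and the decision not to count punctuation tokens as words are all assumptions made for illustration.

```python
# Illustrative sketch only; names and punctuation sets are assumptions.
WORDS_PER_SAMPLE = 2000
CHARS_PER_WORD = 1.6  # words-to-characters ratio of 1:1.6 from the pilot study

# Assumed punctuation inventory (a real segmenter would supply its own).
PUNCTUATION = set("，。！？；：、")
SENTENCE_END = set("。！？")


def required_characters(target_words=WORDS_PER_SAMPLE, chars_per_word=CHARS_PER_WORD):
    """Estimate how many running characters are needed for a sample."""
    return round(target_words * chars_per_word)


def trim_sample(tokens, target_words=WORDS_PER_SAMPLE):
    """Cut a segmented text at the first sentence end reached once the
    target number of words has been counted, so that the final sentence
    stays complete. Shorter texts are returned whole."""
    words = 0
    for i, token in enumerate(tokens):
        if token not in PUNCTUATION:
            words += 1  # assumption: punctuation does not count as a word
        if words >= target_words and token in SENTENCE_END:
            return list(tokens[: i + 1])
    return list(tokens)


print(required_characters())  # characters needed for a 2,000-word sample
```

With a toy input such as `["我", "喜欢", "书", "。", "他", "也", "喜欢", "。"]` and a target of two words, the cut falls after the first full stop rather than in mid-sentence, which is why trimmed samples can slightly exceed the target.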

3. Encoding and markup conventions

Unlike western languages like English, whose characters can each be encoded in a single byte, Chinese has traditionally required two bytes for each character. Currently there are three main encoding systems for Chinese characters: GB2312 for simplified Chinese, Big5 for traditional Chinese, and Unicode. While the original texts were encoded in GB2312, we decided to convert them to Unicode (UTF-8) for two reasons: (1) to ensure compatibility between non-Chinese operating systems and Chinese characters; and (2) to take advantage of the latest Unicode-compatible concordancers such as Xara version 1.0 and WordSmith Tools version 4.0.
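For readers who need to carry out a similar conversion on their own GB2312 texts, the following Python sketch shows one way of doing it with the standard library. It is not the tool used in building the corpus, and the function names and file paths are hypothetical.

```python
# Illustrative sketch only; function names and paths are hypothetical.

def gb2312_to_utf8(data: bytes) -> bytes:
    """Decode GB2312 bytes and re-encode the same characters as UTF-8."""
    return data.decode("gb2312").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a GB2312 text file and write a UTF-8 copy of it."""
    with open(src_path, "r", encoding="gb2312") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


sample = "语料库"  # "corpus"
gb_bytes = sample.encode("gb2312")
utf8_bytes = gb2312_to_utf8(gb_bytes)
print(len(gb_bytes), len(utf8_bytes))  # byte lengths under each encoding
```

Note that the byte count changes with the encoding: GB2312 stores each of these characters in two bytes, whereas UTF-8 uses three bytes for each of them, so character counts should be taken from the decoded text rather than from file sizes.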

To make our data more convenient for users whose operating system is earlier than Windows 2000 and lacks a language support pack, we have produced a Pinyin version of the LCMC corpus in addition to the standard version containing characters. While also encoded in UTF-8, the Pinyin version is more compatible with older operating systems and concordancers. It is also of assistance to users who can read Romanised Chinese but not Chinese characters.

Both versions of the corpus come in fifteen files. The corpus is XML conformant. Each file has two parts: a corpus header and the text. The header gives general information about the corpus. The text part is annotated at five levels of detail: (1) text category, (2) file identifier, (3) paragraph, (4) sentence and (5) word, punctuation/symbol and elements omitted in transcription (see List of codes). These levels of annotation can be exploited by XML-aware tools, of which Xara version 1.0 is presently one. With this tool, users can either search the whole corpus or define a subcorpus containing a certain text type or a specific file. The POS tags allow users to search for a certain class of words and, in combination with tokens, to extract a specific word that belongs to a certain class.
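The same kind of query can also be run with a general-purpose XML parser. The Python sketch below is illustrative only: the element and attribute names (text, file, p, s, w, c, type, id, pos), the POS values and the two-sentence sample are placeholders invented for this example, and the actual names should be taken from the List of codes.

```python
# Illustrative sketch only; tag names, attributes and sample data are assumed.
import xml.etree.ElementTree as ET

SAMPLE = """<corpus>
  <text type="A">
    <file id="A01">
      <p><s n="1"><w pos="n">记者</w><w pos="v">报道</w><c pos="ew">。</c></s></p>
    </file>
  </text>
  <text type="N">
    <file id="N01">
      <p><s n="1"><w pos="n">剑客</w><w pos="v">出手</w><c pos="ew">。</c></s></p>
    </file>
  </text>
</corpus>"""


def find_words(root, pos, category=None):
    """Return (file id, word) pairs for words carrying the given POS tag,
    optionally restricted to one text category (a simple subcorpus)."""
    hits = []
    for text in root.iter("text"):
        if category is not None and text.get("type") != category:
            continue
        for file_el in text.iter("file"):
            for word in file_el.iter("w"):
                if word.get("pos") == pos:
                    hits.append((file_el.get("id"), word.text))
    return hits


root = ET.fromstring(SAMPLE)
print(find_words(root, "n"))                # nouns in the whole sample
print(find_words(root, "v", category="N"))  # verbs in category N only
```

Restricting by category or file identifier before matching on the POS attribute mirrors the subcorpus-then-search workflow described above; adding a comparison on the word text would narrow the search to one specific word in one word class.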