LCMC: Basic information
The LCMC corpus was built in response to the general lack of publicly available balanced corpora of Chinese. While some corpus resources exist, most of them, for example the PH Corpus and the PFR People’s Daily Corpus released by the Institute of Computational Linguistics, Peking University, are composed exclusively of newswire texts and are thus not balanced. The corpora released by the LDC are not balanced either: they contain only newswire texts or official documents. The only publicly available balanced corpus of Mandarin Chinese is the Sinica Corpus, which was produced by Academia Sinica, Taiwan. As Taiwan has been separated from Mainland China for decades, the language used in Taiwan is not exactly the same as that used on the mainland. As such, the Sinica Corpus does not represent modern Mandarin Chinese as written on the mainland of China. The balanced Chinese corpus built in China, as reported in Zhou & Yu (1997), is not publicly available. Since the corpus approach has increasingly been recognized as a useful tool for linguistic investigation, the research community has felt a growing need for appropriate corpus resources for this major world language. The LCMC addresses this need, because we have made the corpus publicly available worldwide for academic research. In addition to monolingual studies of the Chinese language, LCMC, in combination with FLOB, also provides a sound basis for contrastive studies of Chinese and English, whether one wishes to compare the two languages as a whole or by text type.
2. Sampling frame and text collection
In the LCMC corpus, the FLOB sampling frame is followed strictly except for two minor variations. The first variation relates to the sampling frame: we replaced western and adventure fiction (category N) with martial arts fiction. There are three reasons for this decision. Firstly, there is simply no western fiction in China; secondly, martial arts fiction is broadly a type of adventure fiction, and it is a very popular and important fiction type in China that should therefore be represented; thirdly, the language used in martial arts fiction is distinctive and is worth sampling for that reason as well. Most stories of this type, even though they were published recently, are under the influence of vernacular Chinese, i.e. modern Chinese styled to appear like classical Chinese. While the inclusion of this text type has made the tasks of POS tagging and post-editing more difficult, it may also make it possible to compare representations of vernacular Chinese and modern Chinese. The second variation was caused by problems matching the sampling period. Given the limited availability of texts in some categories, we decided to modify the FLOB sampling period slightly by also including samples published within two years of 1991 (i.e. 1989-1993) when not enough samples were readily available for 1991 itself. As a result, of the 500 samples included in the LCMC corpus, around two thirds were produced in 1991, while the remaining third was produced within two years of 1991. We assume that a time span of this length does not affect the language significantly.
The LCMC corpus has been constructed using written Mandarin Chinese texts published in Mainland China to ensure some degree of textual homogeneity. It should be noted that plain written texts alone have been transcribed, with tables, illustrations, pictures, formulae and special symbols omitted and replaced with a gap element marked by the wording ‘omission’. Long citations from translated texts or texts produced outside the sampling period were also omitted so that the effect of translationese could be excluded and L1 quality guaranteed.
While a small number of samples,
if they were conformant with our sampling frame, were collected from the
Internet, most samples were provided by the SSReader
Digital Library in China. As each page of electronic books in the library
comes in PDG format, these pages were transferred into text files using an OCR
program provided by the digital library. This scanning process resulted in a
1-3% error rate, depending on the quality of the picture files. Each electronic
text file was proofread and corrected independently by two native speakers of
Mandarin Chinese so as to keep the transcribed raw texts as accurate as
possible.
While the digital library has a
very large collection of books, it does not provide newspapers. The only sources
of newswire texts from the library are a dozen collections of news reports that
won awards at various levels. These collections, however, represent newswire texts from more
than eighty newspapers and television or broadcasting stations. The samples from
these sources account for around two thirds of texts for the press categories
(A-C). The remaining third is sampled from newswire texts from Xinhua News
Agency (excerpted from the PH Corpus). Considering that this is the most
important and representative news provider in China, we believe that this
proportion is justified.
3. Encoding and markup conventions
Unlike Western languages such as English, whose characters fit in a single
byte, Chinese needs two bytes for each character in encodings such as GB2312 and Big5.
Currently there are three encoding systems for Chinese characters: GB2312 for
simplified Chinese, Big5 for traditional Chinese, and Unicode. While the
original texts were encoded in GB2312, we decided to convert them to
Unicode (UTF-8) for two reasons: (1) to ensure that Chinese characters
display correctly on non-Chinese operating systems; and (2) to
take advantage of the latest Unicode-compatible concordancers such as Xara version
1.0 and WordSmith Tools version 4.0.
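The conversion from GB2312 to UTF-8 described above can be illustrated with a minimal Python sketch. This is not the procedure actually used to build the corpus; the function name gb2312_to_utf8 and the sample string are hypothetical and chosen only for illustration. It also shows that a Chinese character occupies two bytes in GB2312 but three bytes in UTF-8.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode the text as UTF-8.

    Hypothetical helper for illustration; not part of the LCMC toolchain.
    """
    text = raw.decode("gb2312")   # bytes -> Unicode string
    return text.encode("utf-8")   # Unicode string -> UTF-8 bytes


# Illustrative sample: the three-character word for "corpus".
sample = "语料库"
gb_bytes = sample.encode("gb2312")
utf8_bytes = gb2312_to_utf8(gb_bytes)

# Two bytes per character in GB2312, three per character in UTF-8.
print(len(gb_bytes), len(utf8_bytes))
```

The round trip is lossless for any character covered by GB2312, so the converted text decodes back to the same string.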
In order to make it more
convenient for users with an operating system earlier than Windows 2000 and
without a language support pack to use our data, we have produced a Pinyin
version of the LCMC corpus in addition to the standard version containing
characters. While also encoded in UTF-8, the Pinyin version is more
compatible with older operating systems and concordancers. It is also of
assistance to users who can read Romanised Chinese but not Chinese characters.