1. Corpus design

The ZJU Corpus of Translational Chinese (ZCTC) was created with the explicit aim of studying the features of translated Chinese in relation to non-translated native Chinese. It is modelled on the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus designed to represent written Mandarin Chinese.

Both the LCMC and ZCTC corpora sample five hundred 2,000-word text chunks from fifteen written text categories published in China, so that each corpus amounts to roughly one million words. The text categories covered in the two corpora, together with their respective proportions, are given below:

Genre label  Genre                                         Number of samples  Proportion
A            Press: Reportage                              44                 8.8%
B            Press: Editorial                              27                 5.4%
C            Press: Review                                 17                 3.4%
D            Religious writing                             17                 3.4%
E            Skill / trade / hobby                         38                 7.6%
F            Popular lore                                  44                 8.8%
G            Biography and essay                           77                 15.4%
H            Miscellaneous (report and official document)  30                 6.0%
J            Science (academic prose)                      80                 16.0%
K            General fiction                               29                 5.8%
L            Mystery and detective fiction                 24                 4.8%
M            Science fiction                               6                  1.2%
N            Adventure fiction                             29                 5.8%
P            Romantic fiction                              29                 5.8%
R            Humour                                        9                  1.8%
Total                                                      500                100.0%
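
The arithmetic behind the sampling frame can be sketched in a few lines of Python. This is only an illustration: the sample counts are copied from the table above, while the names SAMPLES, WORDS_PER_SAMPLE and sampling_frame are hypothetical and not part of any corpus tool.

```python
# Number of 2,000-word samples per genre, copied from the table above.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}
WORDS_PER_SAMPLE = 2000  # nominal chunk size used by both corpora


def sampling_frame(samples, words_per_sample=WORDS_PER_SAMPLE):
    """Return the proportion (%) and nominal word target for each genre."""
    total = sum(samples.values())
    return {
        genre: {
            "samples": n,
            "proportion": round(100 * n / total, 1),
            "target_words": n * words_per_sample,
        }
        for genre, n in samples.items()
    }


frame = sampling_frame(SAMPLES)
for genre, row in frame.items():
    print(genre, row["samples"], f'{row["proportion"]}%', row["target_words"])
```

Summing the target_words values gives the nominal one million words per corpus (500 samples of 2,000 words each).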

Since the LCMC corpus was designed as a Chinese match for the FLOB / Frown corpora of British / American English, with the specific aim of comparing and contrasting English and Chinese, it also followed the sampling period of FLOB / Frown and sampled written Mandarin Chinese within three years around 1991. While it was relatively easy to find native Chinese texts published in this sampling period, it would have been much more difficult to obtain translated Chinese texts of some categories published in this time frame, especially in electronic format. This practical constraint on data collection forced us to modify the LCMC model slightly when we built the ZJU Corpus of Translational Chinese, extending the sampling period by a decade, i.e. to 2001. The extension has been particularly useful because the popularisation of the Internet and online publication in the 1990s has made it much easier to access large amounts of digitised text. Readers should bear this modification in mind when interpreting results based on a comparison of the LCMC and ZCTC corpora. Those who are interested in potential change in Mandarin Chinese during this decade are advised to use the UCLA Written Chinese Corpus, which models LCMC but samples texts one decade apart.

While English is the source language of the vast majority of the text samples included in the ZCTC corpus, we have also included a small number of texts translated from other languages, to mirror the reality of the world of translations in China.

As Chinese is written as running strings of characters without white spaces delimiting words, the number of tokens in a text can only be known once the text has been tokenised (see corpus annotation). The text chunks were therefore collected at the initial stage using our best estimate of the ratio between the number of words and the number of characters (1:1.67, i.e. roughly 1.67 characters per word), based on our previous experience. Only textual data was included, with graphs and tables in the original texts replaced by placeholders. A text chunk included in the corpus can be a sample from a large text (e.g. an article or a book chapter) or an assembly of several small texts (e.g. for the press categories). Where parts of large texts were selected, an attempt was made to achieve a balance between initial, medial and final samples. Once the texts had been tokenised, a computer program was used to cut large texts to approximately 2,000 tokens while keeping the final sentence complete. As a result, while some text samples may be slightly longer than others, they are typically around 2,000 words. The table below compares the actual numbers of tokens in different genres as well as their corresponding percentages in the ZCTC and LCMC corpora.*

Genre label  ZCTC tokens  ZCTC %  LCMC tokens  LCMC %
A            88186        8.63    89201        8.74
B            54171        5.30    54432        5.33
C            34100        3.34    34354        3.36
D            35139        3.44    35199        3.45
E            76681        7.51    77484        7.59
F            89675        8.78    89823        8.80
G            155601       15.23   156433       15.32
H            60352        5.91    60983        5.97
J            168736       16.52   162856       15.95
K            60540        5.93    60183        5.89
L            48924        4.79    49244        4.82
M            12267        1.20    12367        1.21
N            59042        5.78    60197        5.90
P            59033        5.78    59665        5.84
R            19072        1.87    18643        1.83
Total        1021449      100.00  1021064      100.00

*Note: The numbers of tokens given here for the Lancaster Corpus of Mandarin Chinese (LCMC) may differ from those in earlier releases, because this edition of LCMC has been retagged using ICTCLAS2008, the same tool that was used to tag the ZCTC corpus.
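
The two-stage sizing procedure described in this section (a character-based estimate before tokenisation, then a cut at roughly 2,000 tokens that keeps the final sentence complete) might look something like the Python sketch below. It is not the program actually used to build the corpus. The function names, the set of sentence-final punctuation marks, and the choice to extend forward to the next sentence boundary rather than cut back to the previous one are all assumptions made for illustration.

```python
# Assumption: these marks end a sentence; the real program may use other rules.
SENTENCE_END = {"。", "！", "？"}
CHARS_PER_WORD = 1.67  # the 1:1.67 word-to-character estimate given above


def estimate_chars(target_words=2000, chars_per_word=CHARS_PER_WORD):
    """Estimate how many characters to collect before tokenisation."""
    return round(target_words * chars_per_word)


def trim_to_target(tokens, target=2000, sentence_end=SENTENCE_END):
    """Keep about `target` tokens, extending to the end of the current sentence.

    Assumption: the cut moves forward to the next sentence boundary, so a
    sample can end up slightly longer than `target`.
    """
    if len(tokens) <= target:
        return list(tokens)
    end = target
    # Advance until the last kept token closes a sentence (or the text ends).
    while end < len(tokens) and tokens[end - 1] not in sentence_end:
        end += 1
    return list(tokens[:end])


# Toy example with an already tokenised text and a tiny target.
tokens = ["我", "喜欢", "猫", "。", "他", "喜欢", "狗", "。", "完"]
print(estimate_chars())                   # characters to collect for 2,000 words
print(trim_to_target(tokens, target=5))   # cut lands mid-sentence, so it extends
```

In the toy example the cut at five tokens falls inside the second sentence, so the sample is extended to the following full stop and contains eight tokens. This mirrors why some samples in the corpora are slightly longer than 2,000 words.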