The Babel English-Chinese Parallel Corpus

The Babel English-Chinese Parallel Corpus, which was created as part of our research project Contrasting English and Chinese (ESRC Award Reference RES-000-23-0553), consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English tokens plus 135,493 Chinese tokens) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English tokens plus 151,969 Chinese tokens) were collected from Time between September 2000 and January 2001. The corpus contains a total of 541,095 tokens (253,633 English tokens and 287,462 Chinese tokens). Here is a list of the titles of the articles included in the corpus.
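The per-source token counts quoted above can be summed as a quick consistency check. The following is a minimal Python sketch that uses only the figures given in the paragraph; the dictionary keys and variable names are illustrative labels, not part of the corpus distribution.

```python
# Token counts per source, copied from the description above.
# The keys are illustrative labels chosen for this sketch.
counts = {
    "World of English": {"texts": 115, "english": 121_493, "chinese": 135_493},
    "Time": {"texts": 212, "english": 132_140, "chinese": 151_969},
}

# Sum each column across the two sources.
total_texts = sum(c["texts"] for c in counts.values())
english_total = sum(c["english"] for c in counts.values())
chinese_total = sum(c["chinese"] for c in counts.values())
grand_total = english_total + chinese_total

print(f"texts:   {total_texts}")
print(f"English: {english_total:,}")
print(f"Chinese: {chinese_total:,}")
print(f"total:   {grand_total:,}")
```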

The corpus is tagged for part of speech and aligned at the sentence level. The English texts were tagged using the CLAWS C7 tagset, while the Chinese texts were tagged using the Peking University tagset. Sentence alignment was done automatically and then corrected by hand. The corpus is also marked up for paragraphs and sentences, but different markup schemes were adopted for the two subcorpora. In the World of English component, sentences are numbered consecutively throughout, whereas in the Time component, sentences are numbered within each paragraph.
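The difference between the two sentence-numbering schemes can be illustrated with a small Python sketch. It is not the corpus's actual markup: the function name `number_sentences`, the tuple layout, and the choice to start numbering at 1 are all assumptions made for illustration only.

```python
from typing import List, Tuple


def number_sentences(
    paragraphs: List[List[str]], restart_per_paragraph: bool
) -> List[Tuple[int, int, str]]:
    """Return (paragraph_no, sentence_no, text) triples.

    Hypothetical helper: numbering from 1 is an assumption of this sketch,
    not a documented property of the corpus markup.
    """
    numbered = []
    running = 0  # sentence counter that never resets
    for p_no, sentences in enumerate(paragraphs, start=1):
        for s_idx, sentence in enumerate(sentences, start=1):
            running += 1
            # Either restart at 1 in each paragraph or keep counting on.
            s_no = s_idx if restart_per_paragraph else running
            numbered.append((p_no, s_no, sentence))
    return numbered


# Toy text: two paragraphs, three sentences in all.
text = [["A.", "B."], ["C."]]

# Scheme described for the World of English component.
consecutive = number_sentences(text, restart_per_paragraph=False)

# Scheme described for the Time component.
per_paragraph = number_sentences(text, restart_per_paragraph=True)

print(consecutive)
print(per_paragraph)
```

Under the per-paragraph scheme a sentence is identified only by the pair of paragraph and sentence numbers, so any tool that looks up aligned sentences across both subcorpora would need to take the scheme into account.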

We give no warranties that the Babel parallel corpus will be suitable for any particular purpose, and we accept no responsibility for any technical limitations of the corpus or software.


Created and maintained by Richard Xiao 2005-2013