The PDC2000 Corpus

The PDC2000 Corpus of Chinese News Text

The PDC2000 Corpus of Chinese News Text is built from one year of data (the year 2000) provided by the People's Daily Press, Beijing. The corpus contains approximately 15 million tokens. PDC2000 is encoded in Unicode (UTF-8) and marked up in XML. There are 366 files in the corpus, one for each day of the year (2000 was a leap year), and each file is marked up for the month and the date.

Each corpus file consists of a corpus header and the corpus text proper. The corpus header follows the ELDA (Evaluations and Language Resources Distribution Agency) Metadata Scheme version 1.40. The corpus text is marked up for paragraphs, sentences and tokens. Sentences are numbered consecutively within each file, and tokens are annotated for part of speech using the Peking University tagset.
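
As an illustration of how markup of this kind might be read, the following is a minimal Python sketch. It is not based on the actual PDC2000 schema: the element names (text, p, s, w), the attribute names (n, POS) and the sample sentences are all assumptions made for the example, so they should be adjusted after inspecting a real corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names (text, p, s, w) and the attribute
# names (n, POS) are assumptions, not taken from the PDC2000 documentation.
SAMPLE = """<text>
  <p>
    <s n="1"><w POS="n">记者</w><w POS="v">报道</w><w POS="w">。</w></s>
    <s n="2"><w POS="r">我们</w><w POS="v">学习</w><w POS="w">。</w></s>
  </p>
</text>"""


def read_sentences(xml_string, sent_tag="s", token_tag="w", pos_attr="POS"):
    """Return a list of (sentence_number, [(token, tag), ...]) pairs."""
    root = ET.fromstring(xml_string)
    sentences = []
    for sent in root.iter(sent_tag):
        # Collect every token element inside this sentence with its tag.
        tokens = [(w.text or "", w.get(pos_attr)) for w in sent.iter(token_tag)]
        sentences.append((sent.get("n"), tokens))
    return sentences


sentences = read_sentences(SAMPLE)
for number, tokens in sentences:
    # Print each sentence in the common token/tag format.
    print(number, " ".join(f"{tok}/{tag}" for tok, tag in tokens))
```

The tag and attribute names are passed as parameters so that the same function can be pointed at the real element names once they are known.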

PDC2000 was created as part of the project Contrasting English and Chinese, supported by the UK Economic and Social Research Council (Award Reference RES-000-23-0553). We thank Dr. Jiajin Xu for providing us with the text data and the People's Daily Press for permitting us to use the data. The corpus is distributed free of charge for use in non-profit-making research. For licensing information, please refer to the LCMC licence.

The corpus was previously accessible through our web-based concordancer; that online service is no longer available.

Created and maintained by Richard Xiao 2005-2008