2. Corpus annotation

The ZCTC corpus is annotated using ICTCLAS2008, the latest release of the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. This annotation tool, which relies on a large lexicon and a Hierarchical Hidden Markov Model, integrates word tokenisation, named entity identification, unknown word recognition and part-of-speech tagging. ICTCLAS2008 has been reported to achieve a precision rate of 98.54% for word tokenisation. The latest open tests have also given encouraging results, with precision rates of 98.13% for tokenisation and 94.63% for part-of-speech tagging. The application programming interface (API) of ICTCLAS2008 is publicly available at www.ictclas.org, while a compiled program is available at www.corpus4u.org.
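As a rough illustration of what a tokenisation precision figure measures, the following Python sketch compares a predicted segmentation with a gold-standard one by character span. It is not the evaluation procedure used in the reported tests, which is not described here; the function names and the example sentence are hypothetical.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def tokenisation_precision(predicted, gold):
    """Share of predicted tokens whose spans also occur in the gold segmentation."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("the two segmentations do not cover the same text")
    pred_spans = spans(predicted)
    if not pred_spans:
        return 0.0
    return len(pred_spans & spans(gold)) / len(pred_spans)


# Hypothetical example: the tagger splits one gold token into two.
gold = ["我们", "喜欢", "语料库"]
predicted = ["我们", "喜欢", "语料", "库"]
precision = tokenisation_precision(predicted, gold)
```

In this example two of the four predicted tokens match gold tokens, so the precision is 0.5.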

To ensure maximum comparability, a new release of the LCMC corpus (version 2.0) has been produced, retagged using the same tool. The part-of-speech tagset applied to the ZCTC and the new release of LCMC is as follows.

a         adjective

ad       adverbial use of adjective

ag       adjectival morpheme

an        nominal use of adjective

al         adjectival formulaic expression

b          modifier (non-predicate noun modifier)

bg        noun modifier morpheme

bl         noun modifying formulaic expression

c          conjunction

cc        coordinating conjunction

d         adverb

dg       adverbial morpheme

dl        adverbial formulaic expression

e         interjection

ew      sentence-final punctuation (full stop, semi-colon, question mark, exclamation mark)

f         space word

h        prefix

k        suffix

m       numeral and quantifier

mg     numeral and quantifier morpheme

mq     numeral-classifier

n        noun

ng      nominal morpheme

nl       nominal formulaic expression

nr       person name

nr1     Chinese surname

nr2     Chinese first name

nrf      transliterated foreign person name

nrj      Japanese name

ns      place name

nsf     transliterated foreign place name

nt       organisation name

nz      other proper noun

o        onomatopoeia

p        preposition

pba    preposition ba

pbei   preposition bei

q        classifier

qt       temporal classifier

qv       verbal classifier

r         pronoun

rg       pronominal morpheme

rr       personal pronoun

ry       interrogative pronoun

rys     place interrogative pronoun

ryt      temporal interrogative pronoun

ryv      verbal interrogative pronoun

rz       deictic pronoun

rzs      place pronoun

rzt       temporal pronoun

rzv      verbal pronoun

s         place word

t         time word

tg       time word morpheme

u        auxiliary

ude1   的

ude2   地

ude3   得

udeng 等

udh      的话

uguo   过

ule      了

ulian   连

uls      来说、来讲、而言、说来

usuo   所

uyy     一样、一般、似的、般

uzhe   着

uzhi    之

v        verb

vd      adverbial use of verb

vf       directional verb

vg      verbal morpheme

vi       intransitive verb

vl       verbal formulaic expression

vn      nominal use of verb

vshi   是

vx      pro-verb

vyou  有

w       symbols and punctuation

wb     percentage and per mille signs: ％ and ‰ of full length; % of half length

wd     full or half-length comma: ， of full length; , of half length

wj      full stop of full length: 。

wky   closing brackets: ） 〕 ］ ｝ 》 】 〗 〉 of full length; ) ] } > of half length

wkz   opening brackets: （ 〔 ［ ｛ 《 【 〖 〈 of full length; ( [ { < of half length

wn     full-length enumeration mark: 、

wp    dash: —— －－ ——－ of full length; --- ---- of half length

ws    full-length ellipsis: ……  …

wt     full or half-length exclamation mark: ！ of full length; ! of half length

wyy  full-length single or double closing quote: ” ’ 』

wyz  full-length single or double opening quote: “ ‘ 『

x      non-word character string

y      particle

z      descriptive word
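Many of the tags listed above extend a shorter tag (for example nr1 extends nr, which extends n), so fine-grained tags can be collapsed into their broad categories by prefix. The Python sketch below illustrates this under several assumptions that are not stated in this manual: that tagged text is available as space-separated word/tag pairs, that prefix sharing reflects a genuine hierarchy, and that ew belongs with the punctuation tags rather than with e (interjection). All names in the code are hypothetical, and only a subset of the tagset is included.

```python
# Subset of the tagset listed above (tag -> gloss); extend as needed.
TAGSET = {
    "n": "noun", "nr": "person name", "nr1": "Chinese surname",
    "r": "pronoun", "rr": "personal pronoun",
    "v": "verb", "vshi": "是",
    "u": "auxiliary", "ude1": "的",
    "e": "interjection", "ew": "sentence-final punctuation",
    "w": "symbols and punctuation", "wj": "full stop of full length",
}

# Assumption: "ew" starts with "e" but is glossed as punctuation,
# so it is attached to "w" by hand instead of by prefix.
PARENT_OVERRIDES = {"ew": "w"}


def parent_tag(tag):
    """Return the longest proper prefix of `tag` that is itself a tag, or None."""
    if tag in PARENT_OVERRIDES:
        return PARENT_OVERRIDES[tag]
    for end in range(len(tag) - 1, 0, -1):
        if tag[:end] in TAGSET:
            return tag[:end]
    return None


def top_level_tag(tag):
    """Follow parent links until a tag with no parent is reached."""
    parent = parent_tag(tag)
    while parent is not None:
        tag = parent
        parent = parent_tag(tag)
    return tag


def parse_tagged(line):
    """Split 'word/tag word/tag ...' into (word, tag) pairs (assumed format)."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a word containing "/" stays intact.
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"missing tag in {item!r}")
        pairs.append((word, tag))
    return pairs


# Hypothetical tagged sentence.
pairs = parse_tagged("他/rr 是/vshi 老师/n 。/wj")
coarse = [(word, top_level_tag(tag)) for word, tag in pairs]
```

Here `coarse` pairs each word with its broad category (r, v, n and w respectively), which can be convenient when comparing tag frequencies across corpora at a coarser level of granularity.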