Lancaster University Department of Linguistics and Modern English Language

Introduction to part-of-speech annotation and the BNC Sampler

 

Part-of-speech (POS) tags are generally codes of a few letters and numbers, in which the first letter has a basic part-of-speech meaning:

  • N... typically indicates a noun
  • V... typically indicates a verb
  • J... typically indicates an adjective

The next letters often add further meaning:

  • NP... often means a proper noun
  • NN... often means an ordinary (common) noun
  • VB... often means part of the verb BE
  • VH... often means part of the verb HAVE
  • VV... often means part of a lexical verb (e.g. play, run)

And at the end of a tag:

  • ...1 often means singular noun
  • ...2 often means plural noun
  • ...0 often means base form verb
  • ...I sometimes means infinitive of verb
  • ...Z often means 3rd person singular verb
  • ...D often means past tense verb
  • ...G often means present participle of verb
  • ...N often means past participle of verb
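
The rules of thumb above can be sketched as a small lookup in Python. This is only an illustration of the "first letter, next letter, final character" pattern; the function name guess_tag and the dictionary names are made up for this example, and the output is a guess rather than the official tagset definition.

```python
# Rule-of-thumb decoder built only from the hints listed above.
# These are heuristics ("typically", "often"), not the official tag definitions.

FIRST_LETTER = {"N": "noun", "V": "verb", "J": "adjective"}

TWO_LETTERS = {
    "NP": "proper noun",
    "NN": "common noun",
    "VB": "form of BE",
    "VH": "form of HAVE",
    "VV": "lexical verb",
}

NOUN_ENDINGS = {"1": "singular", "2": "plural"}

VERB_ENDINGS = {
    "0": "base form",
    "I": "infinitive",
    "Z": "3rd person singular",
    "D": "past tense",
    "G": "present participle",
    "N": "past participle",
}


def guess_tag(tag: str) -> str:
    """Return a best guess at the meaning of a tag, or 'unknown'."""
    tag = tag.upper()
    category = FIRST_LETTER.get(tag[:1])
    if category is None:
        return "unknown"
    # Prefer the more specific two-letter meaning when one is listed
    base = TWO_LETTERS.get(tag[:2], category)
    if category == "noun":
        endings = NOUN_ENDINGS
    elif category == "verb":
        endings = VERB_ENDINGS
    else:
        endings = {}
    # Only read the last character as an ending if the tag goes beyond its two-letter prefix
    ending = endings.get(tag[-1]) if len(tag) > 2 else None
    return f"{ending} {base}" if ending else base


for example in ["NN1", "VVD", "VHZ"]:
    print(example, "->", guess_tag(example))
```

For instance, guess_tag("NN1") gives "singular common noun" and guess_tag("VVD") gives "past tense lexical verb". Tags whose first letter is not covered by the hints come back as "unknown", which is a reminder that the real tagset is the authority.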

So you might like to try guessing the meaning of the following tags which are found in today's corpus:

NN2
VVZ
VBZ
VVI
VHN
NP2
JJR

N.B. A full list of tags is called a "tagset". Check the tagset here to see if you guessed correctly.

The BNC Sampler Corpus

The corpus we will be using today is a 2-million-word corpus taken from the British National Corpus (BNC). This smaller corpus is known as the BNC Sampler, and it is publicly available on CD-ROM. Its key features are:

  • it contains written and spoken data in almost equal portions
  • it has been POS-tagged (by the CLAWS program) and the POS tags have been hand-corrected, so theoretically there should be no mistakes

The tagset used for the BNC Sampler (and LOB, FLOB etc.) is known locally as the CLAWS "C7" tagset. The following links to Lancaster sites provide more material on this tagset:

A manual for the POS-tagging of the Sampler corpus, describing in detail how each POS-tag is used

A full list of all tags in the C7 tagset.


The table below outlines the structure of the BNC Sampler:

Broad text category | WordSmith folder (under BNCsamp) | Text category and description | Number of words | Closest equivalent in Brown, Frown, LOB, FLOB
Written | inform | "informative" writing | 781,801 | Sections A-J
Written | imag | "imaginative" writing | 231,173 | Sections K-R
Spoken | demog | informal conversation which has been demographically sampled across the population of the UK | 498,404 | none
Spoken | cg | speech recorded at specific locations for specific events, such as business meetings, public talks ("context-governed") | 499,998 | none
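
To see how close the written and spoken portions are, the word counts in the table can be added up. The short Python sketch below does only that arithmetic; the variable and function names are invented for the example, and the figures are copied from the table.

```python
# Word counts copied from the table above: (broad category, folder) -> words
WORD_COUNTS = {
    ("Written", "inform"): 781_801,
    ("Written", "imag"): 231_173,
    ("Spoken", "demog"): 498_404,
    ("Spoken", "cg"): 499_998,
}


def totals_by_category(counts):
    """Sum the word counts of all folders within each broad text category."""
    totals = {}
    for (category, _folder), words in counts.items():
        totals[category] = totals.get(category, 0) + words
    return totals


totals = totals_by_category(WORD_COUNTS)
grand_total = sum(totals.values())

for category, words in totals.items():
    print(f"{category}: {words:,} words ({words / grand_total:.1%})")
print(f"Total: {grand_total:,} words")
```

This gives 1,012,974 written words (50.4%) and 998,402 spoken words (49.6%), 2,011,376 words in all, which matches the description of a 2-million-word corpus with written and spoken data in almost equal portions.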