Introduction to part-of-speech annotation and the BNC Sampler
  
 
Part-of-speech (POS) tags are generally codes of a few letters and numbers, in which the 
   first letter has a basic part-of-speech meaning: 
   - N...   typically indicates a noun
 
   - V...   typically indicates a verb
 
   - J...    typically indicates an adjective
 
 
The next letters often add further meaning: 
   - NP...  often means a proper noun
 
   - NN...  often means an ordinary (common) noun
 
   - VB...  often means part of the verb BE
 
   - VH...  often means part of the verb HAVE
 
   - VV...  often means part of a lexical verb (e.g. play, run)
 
 
And at the end of a tag: 
   
   - ...1   often means singular noun

   - ...2   often means plural noun

   - ...0   often means base form verb

   - ...I   sometimes means infinitive of verb

   - ...Z   often means 3rd person singular verb

   - ...D   often means past tense verb

   - ...G   often means present participle of verb

   - ...N   often means past participle of verb
     
So you might like to try guessing the meaning of the following tags which are found in
   today's corpus: 
NN2 VVZ VBZ VVI VHN NP2 JJR 
N.B. A full list of tags is called a "tagset". 
   Check the tagset to see whether you guessed correctly. 
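The letter-by-letter conventions above can be sketched as a small lookup program. This is only an illustration of the rules of thumb listed in this section, not the CLAWS tagger or the full tagset: the function name `guess_tag` and the dictionaries are hypothetical, and every result is a guess in the same "often means" sense as the lists above.

```python
# Rules of thumb from this section only; NOT the full tagset.
FIRST_LETTER = {"N": "noun", "V": "verb", "J": "adjective"}

TWO_LETTERS = {
    "NP": "proper noun",
    "NN": "common noun",
    "VB": "part of the verb BE",
    "VH": "part of the verb HAVE",
    "VV": "part of a lexical verb",
}

NOUN_ENDINGS = {"1": "singular", "2": "plural"}

VERB_ENDINGS = {
    "0": "base form",
    "I": "infinitive",
    "Z": "3rd person singular",
    "D": "past tense",
    "G": "present participle",
    "N": "past participle",
}


def guess_tag(tag):
    """Return a list of guesses about a tag (hypothetical helper).

    An empty list means none of the rules of thumb applied.
    """
    tag = tag.upper()
    guesses = []

    # Prefer the more specific two-letter meaning, else fall back
    # to the basic meaning of the first letter.
    if tag[:2] in TWO_LETTERS:
        guesses.append(TWO_LETTERS[tag[:2]])
    elif tag[:1] in FIRST_LETTER:
        guesses.append(FIRST_LETTER[tag[:1]])

    # Only look at the last character when the tag is longer than
    # its two-letter prefix; choose the ending table by word class.
    if len(tag) > 2:
        if tag.startswith("N"):
            endings = NOUN_ENDINGS
        elif tag.startswith("V"):
            endings = VERB_ENDINGS
        else:
            endings = {}
        if tag[-1] in endings:
            guesses.append(endings[tag[-1]])

    return guesses


print(guess_tag("NN1"))
print(guess_tag("VVG"))
```

Try your own guesses first, then compare them with what the sketch returns. Where the sketch returns only a basic word class, the rest of the tag is not covered by the rules above, and the tagset remains the authority.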
The BNC Sampler Corpus
The corpus we will be using today is a 2-million-word corpus taken from the British National 
   Corpus (BNC). This smaller corpus is known as the BNC Sampler, and it is publicly available on CD-ROM. Its key features 
   are: 
   - it contains written and spoken data in almost equal portions
 
   - it has been POS-tagged (by the CLAWS program) and the POS tags have been hand-corrected, so theoretically 
       there should be no mistakes
 
 
The tagset used for the BNC Sampler (and LOB, FLOB, etc.) is known locally as the CLAWS "C7" 
   tagset. The following links to Lancaster sites provide more material on this tagset: 
   - A manual for the POS-tagging of the Sampler corpus, describing in detail how each POS-tag is used
 
   - A full list of all tags in the C7 tagset.
 
 
The table below outlines the structure of the BNC Sampler: 
   
| Broad text category | WordSmith folder (under BNCsamp) | Text category and description | Number of words | Closest equivalent in Brown, Frown, LOB, FLOB |
|---|---|---|---|---|
| Written | inform | "informative" writing | 781,801 | Sections A-J |
| Written | imag | "imaginative" writing | 231,173 | Sections K-R |
| Spoken | demog | informal conversation which has been demographically sampled across the population of the UK | 498,404 | none |
| Spoken | cg | speech recorded at specific locations for specific events, such as business meetings, public talks ("context-governed") | 499,998 | none |
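
To check the claim that written and spoken data appear in almost equal portions, you can add up the word counts in the table above. The short sketch below does this arithmetic; the variable and function names are illustrative only, and the figures are copied straight from the table.

```python
# Word counts copied from the table, keyed by
# (broad text category, WordSmith folder).
word_counts = {
    ("written", "inform"): 781_801,
    ("written", "imag"): 231_173,
    ("spoken", "demog"): 498_404,
    ("spoken", "cg"): 499_998,
}


def totals_by_mode(counts):
    """Sum the word counts for each broad text category."""
    totals = {}
    for (mode, _folder), n in counts.items():
        totals[mode] = totals.get(mode, 0) + n
    return totals


totals = totals_by_mode(word_counts)
grand_total = sum(totals.values())

# Report each broad category with its share of the whole corpus.
for mode, n in totals.items():
    print(f"{mode}: {n:,} words ({n / grand_total:.1%})")
print(f"total: {grand_total:,} words")
```

The written folders add up to 1,012,974 words and the spoken folders to 998,402, giving 2,011,376 words in all, so each half is very close to 50% of the corpus.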
    
 
  
      