Lancaster University Department of Linguistics and Modern English Language

Using CLAWS over the Web

 

In order to answer some research questions, a plain corpus may not be enough and you may need to annotate your data with POS tags. This will enable you to refine your searches through your corpus. For example, you may be interested in how the demonstratives this and that are distributed in your data. For this, you will need to distinguish between that used as a subordinating conjunction (e.g. I think that...) and that used as a demonstrative (e.g. That problem was...). The part of speech tags will help you to do this.

Previously, we looked at CLAWS, the Part of Speech tagger developed at Lancaster University. You can get limited access to the tagger on the Internet. You can submit only one file at a time and the file cannot exceed 10 thousand words.
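
Because of the 10 thousand word limit, it can be worth checking the size of a file before you submit it. The following is a minimal sketch in Python, assuming that a simple count of whitespace-separated tokens is a good enough approximation; how CLAWS itself counts words is not stated here, so treat the result as a rough guide. The names count_words and fits_limit are made up for this illustration.

```python
WORD_LIMIT = 10_000  # the limit stated above for the web interface


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an approximation of the word count)."""
    return len(text.split())


def fits_limit(text: str, limit: int = WORD_LIMIT) -> bool:
    """Return True if the text has no more than `limit` tokens."""
    return count_words(text) <= limit


# A short made-up sample; in practice you would read your own text file instead.
sample = "That problem was solved. I think that it was easy."
print(count_words(sample), fits_limit(sample))
```

If the file turns out to be too long, you can split it into several smaller files and submit them one at a time.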

Before you use CLAWS you need to prepare your file. Let's assume your corpus consists of recent news articles from the major British newspapers. You can access news reports published in the most popular British newspapers through Newsbank, which is available from the Library's website. To retrieve a file, follow the steps below:

1. Go to http://libweb.lancs.ac.uk

2. Click on Databases & e-journals, then on Newsbank, and then on Start Search.

3. Type any phrase that you want your article to contain and choose the newspapers you are interested in.

4. Submit your query.

5. Choose an article from the list of results (preferably a longer one, containing at least a few paragraphs) and save it in your directory on the hard disk as a text file. Click here if you don't remember how to do this.

Now you have a file to run through CLAWS. Follow the steps below.

1. Go to http://lingo.lancs.ac.uk and choose the option CLAWS & TT. You will get a screen like the one below:

2. Enter your email address, select your options (preferably C7 horizontal format) and browse the disk for your file.

3. Submit the query and wait while CLAWS is tagging the data (this can take a couple of minutes if the file is long).

4. Click on the link to the plain text file (the upper one) and save the result as a text file. Use the ".tag" extension so that you will not confuse it with the untagged file later.

Now you can search your file for the demonstratives this and that. Follow the steps below.

1. Open the Concord tool in WordSmith.
2. Choose the tagged file.
3. Search for this and note its frequency.
4. Do the same for that.
5. Now refine your search by typing that_DD1 in the search box. DD1 is the C7 tag for singular determiners, which include the demonstratives this and that. Note down the frequency.

Which demonstrative is used more frequently?
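
If you want to double-check the frequencies outside WordSmith, the same counts can be sketched in a few lines of Python. This is only an illustration, assuming the horizontal output consists of whitespace-separated word_TAG tokens; the sample line below is hand-written for the example rather than real CLAWS output, and the function names are made up.

```python
def parse_tagged(text):
    """Split horizontal-format text into (lowercased word, tag) pairs.

    Assumes each token looks like word_TAG; tokens without "_" are skipped.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep and word:
            pairs.append((word.lower(), tag))
    return pairs


def count_word(pairs, word, tag=None):
    """Count occurrences of `word`, optionally restricted to one tag."""
    return sum(1 for w, t in pairs if w == word and (tag is None or t == tag))


# Hand-written sample in the assumed format (not real tagger output).
sample = ("I_PPIS1 think_VV0 that_CST that_DD1 problem_NN1 was_VBDZ hard_JJ ._. "
          "This_DD1 is_VBZ it_PPH1 ._.")
pairs = parse_tagged(sample)

print("that (any tag):", count_word(pairs, "that"))
print("that_DD1:", count_word(pairs, "that", "DD1"))
print("this_DD1:", count_word(pairs, "this", "DD1"))
```

The difference between the first two counts is the number of times that was tagged as something other than a determiner, for example as a conjunction.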

Additional exercises

  • Look for cases in the text of an adjective followed by a noun. (You can use * as a wildcard for any word, so *_JJ matches any adjective.)
  • Does to occur more often as an infinitive marker to_TO or a preposition to_II?
  • What's the most frequent singular proper noun (tagged as NP1) in the corpus?
  • How accurate is CLAWS? Edit a file to include some nonsense sentences, ungrammatical sentences, or sentences with spelling mistakes or nonsense words, and see how CLAWS handles them. Are grammatical tagging decisions always clear-cut? If not, how should ambiguous cases be resolved?
  • Part-of-speech tagging with CLAWS isn't the only kind of annotation available for corpora. Think about other types of linguistic annotation that are possible.
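
The first three exercises can also be tried programmatically. The sketch below is a rough illustration in Python, assuming horizontal output made of whitespace-separated word_TAG tokens and treating any tag beginning with NN as a common noun; the sample text is hand-written for the example, and all function names are made up.

```python
from collections import Counter


def parse_tagged(text):
    """Split horizontal-format text into (word, tag) pairs, keeping case."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep and word:
            pairs.append((word, tag))
    return pairs


def adjective_noun_pairs(pairs):
    """Find a JJ-tagged word directly followed by a word whose tag starts with NN."""
    return [(w1, w2)
            for (w1, t1), (w2, t2) in zip(pairs, pairs[1:])
            if t1 == "JJ" and t2.startswith("NN")]


def tags_for_word(pairs, word):
    """Count how often `word` (case-insensitive) receives each tag."""
    return Counter(t for w, t in pairs if w.lower() == word)


def most_frequent_with_tag(pairs, tag):
    """Return the most frequent (word, count) for a tag, or None if absent."""
    counts = Counter(w for w, t in pairs if t == tag)
    return counts.most_common(1)[0] if counts else None


# Hand-written sample in the assumed format (not real tagger output).
sample = ("London_NP1 has_VHZ a_AT1 large_JJ port_NN1 ._. "
          "I_PPIS1 want_VV0 to_TO go_VVI to_II London_NP1 from_II Leeds_NP1 ._.")
pairs = parse_tagged(sample)

print(adjective_noun_pairs(pairs))
print(tags_for_word(pairs, "to"))
print(most_frequent_with_tag(pairs, "NP1"))
```

On a real tagged file you would read the text from disk and pass it to parse_tagged; comparing the results with what Concord reports is a useful check on both.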