Lancaster University Department of Linguistics and Modern English Language

Making Wordlists

 

A. Making a wordlist from a single corpus text

Our example file is a newsletter from Queen's Park Baptist Church in Glasgow. Before you create the wordlist, you might like to think about what words it's likely to contain...

Now follow these steps to make the wordlist:

  1. Start WordSmith and open "x:\wsmith\texts\religion\baptist\"
  2. Select the file "bapt-a.txt" and press [OK]
  3. Click on [Tools] -- [Wordlist]
  4. Click on the green button and then on [Make a wordlist now]. This will create three new windows.
  5. To save all 3 windows together, click on the [Save As] button and save as "cc1.lst" under your network directory (H: drive).

You should now have something that looks like this:

[Screenshot: one window of a wordlist]

There are three different wordlist windows. Look at each window one by one. To move between them click on the word Window at the top and choose 1, 2 or 3.

Wordlist (F): frequency list, where words are listed with the most frequent coming first, descending to the least frequent. (Scroll down and see.)

Wordlist (A): alphabetical list. The same list as above but with words in alphabetical order.

Wordlist (S): statistics file
Some important terms in the (S) window:

  • Tokens: the total number of running words in the text. Our text has 4,107 tokens.
  • Types: the number of unique word forms in the text, however often each one occurs. Our text has 1,206 types.
  • Type/Token Ratio (TTR): the number of types divided by the number of tokens, expressed as a percentage. This tells you how rich or "lexically varied" the vocabulary in the text is.
  • In our example, the Type/Token Ratio is:

    1206 (types) ÷ 4107 (tokens) × 100 = 29.36 %

  • If a writer uses the same words (= word types) over and over again, the TTR is low, i.e. the text is not very lexically rich.
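
If you want to check the arithmetic outside WordSmith, the following is a minimal Python sketch of the same counts. It is not how WordSmith itself works: the tokenize function and its rule for what counts as a word (lowercased runs of letters, with optional internal apostrophes) are assumptions made for illustration, so the figures it gives for a real text may differ slightly from those in the (S) window.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a word is a run of letters, optionally with internal
    # apostrophes (e.g. "queen's"); case is ignored. WordSmith's own
    # settings may split words differently.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def type_token_ratio(tokens):
    # Types divided by tokens, expressed as a percentage.
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens) * 100


sample = "The church meets on Sunday. The church hall is open on Sunday."
tokens = tokenize(sample)
freq = Counter(tokens)  # maps each type to its number of tokens

# tokens, types, TTR
print(len(tokens), len(freq), round(type_token_ratio(tokens), 2))
# prints: 12 8 66.67
```

The very high TTR of this sample is a result of its being only 12 words long. Applying the same formula to the figures for our text, round(1206 / 4107 * 100, 2), gives 29.36.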

Look at the screenshot below. Each word in green is a type. The Freq. column gives the number of tokens of that type, i.e. how many times the word occurs in the text.

[Screenshot: types and tokens]
  • Standard Type/Token ratio:
    It is difficult to compare the TTR of a short text with that of a longer one: as a text gets longer, fewer and fewer of the words counted are new types, so the TTR falls. To remedy this, WordSmith can calculate the TTR separately for every 1,000 words of running text and then average the results. When you compare texts of different lengths, use this figure rather than the simple TTR.
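
The averaging idea can be sketched in Python as follows. This is an illustration rather than WordSmith's own procedure: the function name standardised_ttr is hypothetical, and the decision to ignore any words left over after the last full chunk is an assumption.

```python
def standardised_ttr(tokens, chunk_size=1000):
    # Work out the TTR (as a percentage) for each consecutive chunk of
    # chunk_size tokens, then average the chunk TTRs.
    # Assumption: words left over after the last full chunk are ignored.
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size * 100)
    if not ratios:
        return None  # the text is shorter than one chunk
    return sum(ratios) / len(ratios)


# Tiny demonstration with chunks of 4 words instead of 1,000:
words = ["a", "b", "a", "b", "c", "c", "c", "c", "x"]
print(standardised_ttr(words, chunk_size=4))
# prints: 37.5
```

Here the first chunk has 2 types in 4 tokens (50%) and the second has 1 type in 4 tokens (25%), so the average is 37.5%. The final "x" does not fill a chunk and is left out.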

B. Making a wordlist from a set of corpus texts

File bapt-a.txt is one of 9 similar Baptist texts. To get a bigger picture of the vocabulary of the newsletter, it will be useful to create one wordlist for all 9 texts.

Make a wordlist of the entire Baptist church corpus (files "bapt-a" through to "bapt-i") and save it as "bapt.lst" in the same location as before. We will use this file in the next exercise.
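
For comparison, here is a rough Python sketch of what a combined wordlist involves: the word counts from each text are simply added together, and the result can then be sorted by frequency or alphabetically, like the (F) and (A) windows. The function names, the tokenizing rule, the file encoding and the folder in the commented-out usage lines are all assumptions for illustration, and the sketch does not produce a WordSmith .lst file.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    # Assumption: lowercased runs of letters, with optional internal
    # apostrophes, count as words.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def combined_wordlist(texts):
    # Add up the word counts of several texts into one wordlist.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def read_texts(paths, encoding="utf-8"):
    # Assumption: the files are UTF-8; older corpus files may need
    # a different encoding such as "latin-1".
    return [Path(p).read_text(encoding=encoding) for p in paths]


def frequency_order(counts):
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetical_order(counts):
    return sorted(counts.items())


# Hypothetical usage, assuming the nine texts sit in a folder "baptist":
# paths = sorted(Path("baptist").glob("bapt-*.txt"))
# counts = combined_wordlist(read_texts(paths))
# print(frequency_order(counts)[:20])
```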