Lancaster University Department of Linguistics and Modern English Language

Keyword Analysis:
Comparing two lists


What is a keyword?
Keywords are words whose frequency in your corpus is unusually high in comparison with some norm.

How can we identify keywords?
Keywords are worked out by first making a wordlist for your corpus and a wordlist for a "reference" corpus, then comparing the frequency of each word in the two lists. A reference corpus is any corpus chosen as a standard of comparison with your corpus. For the keywords to be meaningful, the reference corpus usually needs to be quite large and of a suitable type.

If a word occurs, say, 5% of the time in the small wordlist (your corpus) and 6% of the time in the reference wordlist, it will not turn out to be "key"; but if the figures are 25% and 6%, the word would be very "key".
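The comparison described above can be sketched outside WordSmith in a few lines of Python. This is only an illustration, not how WordSmith itself is implemented: the two toy texts are invented, and the helper names make_wordlist and percentage are hypothetical. It builds a wordlist for each text and reports what percentage of all tokens each word accounts for.

```python
from collections import Counter
import re

def make_wordlist(text):
    """Count lower-cased word tokens; a rough stand-in for a wordlist."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def percentage(wordlist, word):
    """Share of all tokens in the wordlist taken up by `word`, as a percentage."""
    total = sum(wordlist.values())
    return 100.0 * wordlist[word] / total if total else 0.0

# Toy texts, invented for illustration only.
study_text = "we pray and we sing and we meet on sunday"
reference_text = "the cat sat on the mat and we watched the rain"

study = make_wordlist(study_text)
reference = make_wordlist(reference_text)

# Print each word with its percentage in the study text and in the reference text.
for word in ("we", "and", "on"):
    print(word, round(percentage(study, word), 1), round(percentage(reference, word), 1))
```

Percentages rather than raw counts are compared because the two corpora are normally very different in size.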

How keywords work

In our keyword analysis, "your corpus" will be the Baptist newsletters; the reference corpus will be FLOB, because it is general British English from the same period as the newsletters, and it is quite large.

  • Which words would you expect to occur in the Baptist church corpus but not in FLOB?

Follow these steps to get the Baptist keywords:

  1. Make a wordlist for FLOB, just as you did for bapt.lst. Save the new one as flob.lst.
  2. Extract keywords from bapt.lst:
    1. In the WordSmith Tools Controller, choose [Tools] -- [Keywords].
    2. Press the start button (the green one) and click on [Find the keywords in a text].
    3. In the [Choose Wordlists] window, choose "bapt.lst" in the left-hand list and "flob.lst" in the right-hand list (as in the picture), and press [OK].
    4. Save the result on the H: drive as "baptflob.kws".
The [Choose Wordlists] window

You should get a window that looks something like this:

Key vocabulary in the Baptist files
  • After the word column, the next two columns give the word's frequency in the Baptist corpus and its percentage of all words in the Baptist corpus. The two columns after that give its frequency in FLOB and its percentage of all words in FLOB.
  • "Keyness" and "P" together tell you how distinctive the word is: when "keyness" is very high, and "P" (the probability of the keyness being accidental) is very low, the word can fairly safely be called a keyword.
  • Ignore these "words": BQUO, EQUO, MDASH. They are in fact punctuation (open quotation, close quotation, long dash).
  • "We" appears to be the most key item. Can you think of any possible reason? You can check the examples of "we" by starting Concord (you might need to choose the Baptist texts again).
  • Scroll down the list. If there's any word you want to look at in detail, use the Concord tool.
  • The words in red at the bottom of the list are unusually infrequent in bapt.lst compared with flob.lst.
    Can you see any similarity between these words, and suggest why they are rarer in the Baptist newsletters?
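To make the "Keyness" and "P" columns more concrete, here is a minimal Python sketch. It rests on assumptions: the page does not say which statistic lies behind the keyness column, so the sketch uses a log-likelihood comparison, one common choice for keyword analysis, with P taken from a chi-square distribution with one degree of freedom. The counts and corpus sizes are invented, and the function name keyness is hypothetical. Words that are proportionally rarer in the study corpus are flagged as negative, like the red words at the bottom of the list.

```python
import math

# Punctuation codes that the page says to ignore in the keyword list.
MARKUP_TOKENS = {"BQUO", "EQUO", "MDASH"}

def keyness(freq_study, size_study, freq_ref, size_ref):
    """Return (log-likelihood score, p-value, True if over-used in the study corpus)."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 = max(0.0, 2 * g2)  # guard against tiny negative rounding errors
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(g2 / 2))
    positive = freq_study / size_study >= freq_ref / size_ref
    return g2, p, positive

# Invented counts (study, reference) and corpus sizes, for illustration only.
STUDY_SIZE, REF_SIZE = 20_000, 1_000_000
counts = {
    "we": (600, 3_000),
    "church": (150, 200),
    "BQUO": (300, 9_000),
    "of": (500, 30_000),
    "her": (10, 4_000),
}

rows = []
for word, (f_study, f_ref) in counts.items():
    if word in MARKUP_TOKENS:
        continue  # skip punctuation codes
    g2, p, positive = keyness(f_study, STUDY_SIZE, f_ref, REF_SIZE)
    rows.append((word, g2, p, positive))

# Positive keywords first (highest score on top), negative ones at the bottom.
rows.sort(key=lambda r: (not r[3], -r[1]))
for word, g2, p, positive in rows:
    label = "positive" if positive else "negative"
    print(f"{word:8s} keyness={g2:10.1f} p={p:.3g} ({label})")
```

With the figures from the earlier example, 5% against 6% gives a low score and a P well above 0.05, while 25% against 6% gives a very high score and a tiny P.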