Comparing two lists
What is a keyword?
Keywords are those words whose frequency is unusually high in comparison with some norm.
How can we identify keywords?
The keywords are worked out by first making a wordlist for your corpus, and a wordlist for a "reference"
corpus, then comparing the frequency of each word in the two lists. A reference corpus is any corpus chosen as a standard
of comparison with your corpus. The reference corpus usually has to be quite large and of a suitable type for the keywords to be meaningful.
If a word occurs, say, 5% of the time in the small wordlist and 6% of the time in the
reference corpus, it will not turn out to be "key", but if the scores are 25% and 6%, the word would very likely be "key".
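The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how WordSmith works internally: the two toy texts, the simple tokenising rule, and the function names `make_wordlist` and `percentage` are all assumptions made up for this example.

```python
from collections import Counter
import re


def make_wordlist(text):
    """Count how often each word occurs (case-insensitive)."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def percentage(wordlist, word):
    """Percentage of all running words accounted for by `word`."""
    total = sum(wordlist.values())
    return 100.0 * wordlist[word] / total if total else 0.0


# Toy texts standing in for "your corpus" and the reference corpus
# (invented for illustration; real corpora are far larger).
study_text = "we pray and we sing and we meet on sunday"
reference_text = "the cat sat on the mat and we left the house"

study = make_wordlist(study_text)
reference = make_wordlist(reference_text)

# Print each word with its percentage in the two wordlists.
for word in ("we", "and", "on"):
    print(word,
          round(percentage(study, word), 1),
          round(percentage(reference, word), 1))
```

A word whose percentage is much higher in the first list than in the second is a candidate keyword; whether the difference is large enough is a matter for a statistical test, which this sketch does not attempt.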
In our keywords analysis, "your corpus" will be the Baptist newsletters; the
reference corpus will be FLOB, because it is general British English taken from the same period as the newsletters,
and it's quite large.
- Which words would you expect to occur in the Baptist church corpus but not in FLOB?
Follow these steps to get the Baptist keywords:
- Make a wordlist for FLOB, just as you did for bapt.lst. Save the new one as flob.lst.
- Extract keywords from bapt.lst:
  - In the WordSmith Tools Controller, choose [Tools] -- [Keywords].
  - Press the start button (the green one) and click on [Find the keywords in a text].
  - In the [Choose Wordlists] window, choose "bapt.lst" in the left pane and "flob.lst" in the right pane (as in the picture), and press [OK].
  - Save the result on the H: drive as "baptflob.kws".
You should get a window that looks something like this:
- After the word column, the next two columns give the word's frequency in the Baptist corpus and its percentage of all words
in the Baptist corpus. The two columns after that give its frequency in FLOB and its percentage of all words in FLOB.
- "Keyness" and "P" together tell you how distinctive the word is: when "keyness" is
very high, and "P" (the probability of the keyness being accidental) is very low, the word can fairly
safely be called a keyword.
- Ignore these "words": BQUO, EQUO, MDASH. They are in fact punctuation marks (opening quotation mark, closing quotation mark, and dash).
- "We" appears to be the most key item. Can you think of any possible reason? You can check the examples of
"we" by starting Concord... (you might need to choose the Baptist texts again).
- Scroll down the list. If there's any word you want to look at in detail, use the Concord tool.
- The words at the bottom of the list in red are unusually infrequent or rare in bapt.lst.
Can you see any similarity between these words, and suggest why they are rarer in the Baptist newsletters?
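To make the "Keyness" and "P" columns more concrete, here is a minimal Python sketch of one common way to score a word: a simplified two-term log-likelihood statistic, with a p-value taken from the chi-square distribution with one degree of freedom. This handout does not say which statistic or settings WordSmith uses, so treat the choice of statistic as an assumption. The function name `keyness` and the corpus sizes in the example are hypothetical. The score is given a negative sign when the word is relatively rarer in your corpus than in the reference corpus, which corresponds to the kind of word shown in red at the bottom of the list.

```python
import math


def keyness(freq_study, size_study, freq_ref, size_ref):
    """Return (signed log-likelihood score, p-value) for one word.

    Assumption: simplified two-term log-likelihood; WordSmith's own
    calculation may differ.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total

    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    score = max(2.0 * score, 0.0)  # guard against tiny negative rounding

    # Chi-square survival function for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(score / 2.0))

    # Negative sign marks words that are relatively rarer in the study corpus.
    if freq_study / size_study < freq_ref / size_ref:
        score = -score
    return score, p_value


# Hypothetical sizes: a 1,000-word study corpus, a 10,000-word reference.
print(keyness(50, 1000, 600, 10000))   # 5% against 6%
print(keyness(250, 1000, 600, 10000))  # 25% against 6%
```

With these made-up sizes, the 5% against 6% case gives a small score with a large P, while the 25% against 6% case gives a large score with a very small P, which matches the rule of thumb above: high keyness together with low P suggests a keyword.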