Testing for Significance:
It is always possible that the differences you have found between fiction and non-fiction
are just a random, chance happening. Statisticians deal with this using the idea of "significance". If results are
significant, we can be reasonably confident (usually at the 95% level, sometimes at the 99% level) that they are not
due to chance. Very often, you will need to test whether your results are significant.
So far we have obtained frequencies for various searches, and made the results comparable
by converting them to percentages or to frequencies per million words. This is called normalising the frequencies. But
normalised scores do not show whether a difference is significant.
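As a reminder of what normalising involves, the calculation can be sketched in Python as follows. The function names and the figures in the example are made up for illustration and do not come from the corpora used in this session.

```python
def per_million(frequency, corpus_size):
    """Normalised frequency: occurrences per million words."""
    return frequency * 1_000_000 / corpus_size


def per_cent(frequency, corpus_size):
    """Normalised frequency: occurrences as a percentage of all words."""
    return frequency * 100 / corpus_size


if __name__ == "__main__":
    # Hypothetical figures: a word found 50 times in a 500,000-word corpus.
    print(per_million(50, 500_000))
    print(per_cent(50, 500_000))
```

Normalising like this lets you compare corpora of different sizes, but it says nothing about whether a difference could have arisen by chance.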
Two common tests of significance are chi-square and log likelihood. We will use log
likelihood in this session.
The only information needed to do the log likelihood test is:
- frequency in corpus 1
- frequency in corpus 2
- total number of words in corpus 1
- total number of words in corpus 2.
The mathematics behind log likelihood is quite complicated, but fortunately you don't have to
do it yourself! A Web-based log-likelihood wizard is available, provided by Paul Rayson (Computing Department,
University of Lancaster).
Click here to launch the calculator.
Enter the numbers in the boxes ... and click Calculate LL.
Make a note of the results in the table.
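If you are curious about what the calculator does with the four numbers, the sketch below follows the usual two-corpus log likelihood formula: work out the frequency you would expect in each corpus if the word were spread evenly, then compare the observed frequencies with those expected values. This is an illustration under that assumption, not the wizard's own code; the function name and the example figures are hypothetical.

```python
import math


def log_likelihood(freq1, freq2, total1, total2):
    """Log likelihood for one word's frequency in two corpora."""
    combined_freq = freq1 + freq2
    combined_total = total1 + total2

    # Expected frequencies if the word were equally common in both corpora.
    expected1 = total1 * combined_freq / combined_total
    expected2 = total2 * combined_freq / combined_total

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        # A frequency of zero contributes nothing (and log(0) is undefined).
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


if __name__ == "__main__":
    # Hypothetical figures: 120 hits in one corpus and 80 in another,
    # each corpus containing 1,000,000 words.
    print(round(log_likelihood(120, 80, 1_000_000, 1_000_000), 2))
```

The result is 0 when the word is equally frequent in both corpora and grows as the two frequencies move apart.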
Interpreting log likelihood
If the log likelihood for your result is 6.63 or more, the probability of the result
(i.e. the difference between the two corpora) happening by chance is less than 1%. So we can be 99% certain that the
result actually means something. This is usually expressed as p < 0.01.
If the log likelihood is 3.84 or more, the probability of it happening by chance is less than
5%. So we can be 95% certain of the result. This is expressed as p < 0.05.
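The two thresholds can be applied mechanically, as in this short sketch. The function name is hypothetical, and treating a score that falls exactly on a threshold as reaching it is an assumption made here for simplicity.

```python
def significance_level(ll):
    """Translate a log likelihood score into the levels described above."""
    if ll >= 6.63:
        return "p < 0.01"
    if ll >= 3.84:
        return "p < 0.05"
    return "not significant at p < 0.05"


if __name__ == "__main__":
    # Hypothetical scores, chosen to fall in each of the three bands.
    for score in (1.2, 4.5, 15.0):
        print(score, significance_level(score))
```

Check the stricter level first: any score of 6.63 or more also passes the 3.84 threshold, so the order of the tests matters.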
- What do the log likelihood results tell you about "start" and "begin"?
Does this fit with what you expected?