Lancaster University Department of Linguistics and Modern English Language

Testing for Significance: Log Likelihood


It is always possible that the differences you have found between fiction and non-fiction are just the result of random chance. Statistics deals with this through the notion of "significance". If results are significant, we are reasonably confident (usually at the 95% level, sometimes at the 99% level) that they are not due to chance. Very often, you will need to test whether your results are significant.

So far we have obtained frequencies for various searches, and made the results comparable by expressing them as percentages or as frequencies per million words. This is called normalising the frequencies. But normalised frequencies do not in themselves show that a difference is significant.
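As a reminder of what normalising involves, here is a minimal Python sketch. The function name and the counts are made up for illustration; they are not taken from BNCweb.

```python
def per_million(frequency: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words."""
    return frequency * 1_000_000 / corpus_size


# Hypothetical counts, for illustration only:
# a word found 250 times in a corpus of 5 million words.
print(per_million(250, 5_000_000))
```

Normalising lets you compare corpora of different sizes, but the result says nothing about whether a difference could have arisen by chance.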

Two common tests of significance are chi-square and log likelihood. We will use log likelihood in this session.

The only information needed to do the log likelihood test is:

  • frequency in corpus 1
  • frequency in corpus 2
  • total number of words in corpus 1
  • total number of words in corpus 2.

The mathematics behind log likelihood is quite complicated, but fortunately you don't have to do it yourself! A Web-based log likelihood wizard is available, provided by Paul Rayson (Computing Department, University of Lancaster). Launch the calculator in your browser.
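If you are curious about what such a calculation involves, the following Python sketch shows the form of the log likelihood statistic commonly used for comparing a frequency across two corpora. It uses the four values listed above. The function and variable names are illustrative, and it is an assumption that the wizard works in exactly this way; treat this as a sketch rather than a description of the wizard itself.

```python
import math


def log_likelihood(freq1: int, freq2: int, size1: int, size2: int) -> float:
    """Two-corpus log likelihood from observed frequencies and corpus sizes."""
    total_freq = freq1 + freq2
    total_size = size1 + size2

    # Expected frequencies if the word were equally common in both corpora.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        # A zero observed frequency contributes nothing to the sum.
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts, for illustration only.
print(round(log_likelihood(120, 80, 1_000_000, 1_000_000), 2))
```

When the word is equally frequent (relative to corpus size) in both corpora, the score is 0; the larger the difference, the higher the score.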

Enter the numbers in the boxes ... and click Calculate LL.

Make a note of the results in the table.

Interpreting log likelihood

If the log likelihood for your result is greater than 6.63, the probability of the result (i.e. the difference between the two corpora) happening by chance is less than 1%. So we can be 99% certain that the result reflects a real difference. This is usually expressed as p < 0.01.

If the log likelihood is greater than 3.84, the probability of it happening by chance is less than 5%. So we are 95% certain of the result. This is expressed as p < 0.05.
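The two thresholds can be summarised in a short Python sketch. The function name is illustrative. Because 6.63 and 3.84 are rounded values, a score that falls exactly on a threshold is best treated as borderline.

```python
def significance_level(ll: float) -> str:
    """Map a log likelihood score to the significance levels used here."""
    if ll > 6.63:
        return "p < 0.01"
    if ll > 3.84:
        return "p < 0.05"
    return "not significant at p < 0.05"


# Hypothetical scores, for illustration only.
for score in (1.2, 4.5, 15.1):
    print(score, significance_level(score))
```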

  • What do the log likelihood results tell you about start and begin? Does this fit with what you expected?