Finite Size

The term "corpus" also implies a body of text of finite size, for example, 1,000,000 words. This is not universally so - for example, at Birmingham University, John Sinclair's COBUILD team have been engaged in the construction and analysis of a monitor corpus. This "collection of texts", as Sinclair's team prefer to call it, is an open-ended entity - texts are constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words, or for changing meanings of old words.
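The idea of trawling a stream of new texts for new words can be illustrated with a minimal Python sketch. It is not a description of how the COBUILD team actually works: the class name MonitorCorpus, the add_text method and the simple regular-expression tokenizer are all hypothetical, chosen only to show an open-ended collection that reports previously unseen words as each text arrives.

```python
import re
from collections import Counter

# Assumption: a "word" is a lowercase alphabetic run, optionally with
# internal apostrophes. Real corpus projects use far richer tokenization.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


class MonitorCorpus:
    """Hypothetical open-ended collection: texts keep arriving, size is never fixed."""

    def __init__(self):
        self.counts = Counter()  # word -> frequency across all texts so far
        self.size = 0            # running words collected so far

    def add_text(self, text):
        """Add one text and return, sorted, the words not seen in any earlier text."""
        tokens = tokenize(text)
        new_words = sorted({t for t in tokens if t not in self.counts})
        self.counts.update(tokens)
        self.size += len(tokens)
        return new_words


corpus = MonitorCorpus()
corpus.add_text("The cat sat on the mat.")
print(corpus.add_text("The cat sent a tweet."))  # ['a', 'sent', 'tweet']
print(corpus.size)                               # 11
```

Spotting changing meanings of old words would need more than a vocabulary check, for instance comparing the contexts in which a word appears over time, which this sketch does not attempt.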

Monitor corpora aside, a corpus more often consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of text. Unlike a monitor corpus, once a finite corpus reaches its target total of words, collection stops and the corpus is not increased in size. (An exception is the London-Lund corpus, which was increased in the mid-1970s to cover a wider variety of genres.)
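The contrast with a fixed-size corpus can be sketched in the same way. The following Python example is illustrative only: FiniteCorpus, target_words and add_text are hypothetical names, and counting words by splitting on whitespace is a simplifying assumption. The target is set when the corpus is created, and once it has been reached no further texts are accepted.

```python
class FiniteCorpus:
    """Hypothetical fixed-size collection: the target is set before collection starts."""

    def __init__(self, target_words):
        self.target_words = target_words  # decided at the start of the project
        self.texts = []
        self.size = 0                     # running words collected so far

    @property
    def complete(self):
        """True once the running word count has reached the target."""
        return self.size >= self.target_words

    def add_text(self, text):
        """Add a text unless the target has been reached; return True if it was added."""
        if self.complete:
            return False
        self.texts.append(text)
        # Assumption: whitespace-separated tokens count as running words.
        self.size += len(text.split())
        return True


corpus = FiniteCorpus(target_words=10)
samples = [
    "one two three four",
    "five six seven eight",
    "nine ten eleven",
    "twelve thirteen",
]
print([corpus.add_text(s) for s in samples])  # [True, True, True, False]
print(corpus.size)                            # 11
```

In this sketch the last accepted text can push the total slightly past the target; a project that needs an exact figure would trim or select its samples to fit.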


If you haven't already done so, you can go on to read about other characteristics of the modern corpus:

Sampling and representativeness | Machine-readable form | Standard reference