Finite Size
The term "corpus" also implies a body of text of finite size, for example, 1,000,000 words. This is not universally so - for example, at Birmingham University, John Sinclair's COBUILD team have been engaged in the construction and analysis of a monitor corpus. This "collection of texts", as Sinclair's team prefer to call it, is an open-ended entity - texts are constantly being added to it, so it keeps growing. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words or for changing meanings of old words. Their main advantages are:
- They are not static - new texts can always be added, unlike the synchronic "snapshot" provided by finite corpora.
- Their scope - they provide a large and broad sample of language.
Their main disadvantage is:
- They are not such a reliable source of quantitative data (as opposed to qualitative data) because they are constantly changing in size and are less rigorously sampled than finite corpora.
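The size problem can be illustrated with a minimal sketch. The counts and corpus sizes below are invented for illustration, and `per_million` is a hypothetical helper name, not part of any corpus toolkit. Because the corpus keeps growing, a raw count taken at one moment cannot be compared directly with a raw count taken later; dividing by the corpus size at that moment gives a comparable rate. Note that this only addresses the change in size, not the looser sampling.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Hypothetical figures: the same word counted in a monitor corpus
# at two different moments, as the corpus grows.
snapshots = [
    ("earlier", 120, 2_000_000),  # (label, raw count, corpus size in words)
    ("later", 150, 5_000_000),
]

for label, count, size in snapshots:
    rate = per_million(count, size)
    print(f"{label}: raw count = {count}, per million words = {rate}")
```

In this invented case the raw count rises from 120 to 150 while the rate per million words falls from 60 to 30, so the two measures would suggest opposite conclusions about the word's frequency.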
Monitor corpora aside, a corpus more often consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of text. Unlike a monitor corpus, once a finite corpus reaches its target total of words, collection stops and the corpus is not increased in size. (An exception is the London-Lund corpus, which was increased in the mid-1970s to cover a wider variety of genres.)
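The idea of a target fixed in advance can be sketched as follows. This is an illustration only, not a description of how the Brown Corpus or any other real corpus was compiled: the function name `build_finite_corpus`, the whitespace tokenisation, and the decision to truncate the last text at the target are all assumptions made for the example.

```python
def build_finite_corpus(texts, target_words):
    """Collect texts until a word total fixed in advance is reached.

    Assumptions for this sketch: words are split on whitespace, and the
    final text is truncated so the total never exceeds the target.
    """
    corpus = []
    total = 0
    for text in texts:
        if total >= target_words:
            break  # target reached: collection stops
        words = text.split()
        remaining = target_words - total
        sample = words[:remaining]
        corpus.append(sample)
        total += len(sample)
    return corpus, total


# Toy input with a target of 5 words; the third text is never collected.
corpus, total = build_finite_corpus(["a b c", "d e f g", "h i"], 5)
print(corpus, total)
```

A monitor corpus, by contrast, would have no `target_words` at all: new texts would simply continue to be appended.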
If you haven't already done so you can go on to read about other characteristics of the modern corpus:
Sampling and representativeness | Machine-readable form | Standard reference