Machine-readable form

Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case: in the past, the word "corpus" was used only in reference to printed text.

Today few corpora are available in book form. One that is available this way is "A Corpus of English Conversation" (Svartvik and Quirk 1980), which represents the "original" London-Lund corpus. Corpus data (including context-free frequency lists) is occasionally available in other media. For example, a complete key-word-in-context concordance of the LOB corpus is available on microfiche. With spoken corpora, copies of the actual recordings are sometimes available: this is the case with the Lancaster/IBM Spoken English Corpus but not with the London-Lund corpus.
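To make the two data forms mentioned above concrete, the following is a minimal sketch of how a context-free frequency list and a key-word-in-context (KWIC) concordance might be produced from machine-readable text. It is illustrative only: the function names (tokenize, frequency_list, kwic), the crude tokenizer and the sample sentence are all assumptions made for this example, and none of it reflects how the LOB concordance or any published corpus tool was actually built.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Context-free frequency list: (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions above.
sample = "The corpus is machine-readable. A corpus in book form is rare."
tokens = tokenize(sample)

# Print each occurrence of the keyword aligned in a central column.
for left, key, right in kwic(tokens, "corpus"):
    print(f"{left:>25} [{key}] {right}")
```

A frequency list discards context entirely, whereas a concordance keeps a fixed window of words on either side of each occurrence. Both are trivial to regenerate from machine-readable text, while a printed or microfiche version is fixed once produced.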

Machine-readable corpora possess the following advantages over written or spoken formats:
