Sampling and Representativeness
Often in linguistics we are interested not merely in an individual text or author, but in a whole variety of language. In such cases we have two options for data collection:
- We could analyse every single utterance in that variety. However, this option is impracticable except in a few cases, for example with a dead language which survives in only a few texts. In most cases, analysing every utterance would be an unending and impossible task.
- We could construct a smaller sample of that variety. This is a more realistic option.
As discussed in lecture 1, one of Chomsky's criticisms of the corpus approach was that language is infinite - therefore, any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded simply by chance, and some extremely rare utterances might be included several times. Although modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms must still be taken seriously. This does not mean that we should abandon corpus linguistics, but instead that we should try to establish ways in which a much less biased and more representative corpus may be constructed.
We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions. What we are looking for is a broad range of authors and genres which, when taken together, may be considered to "average out" and provide a reasonably accurate picture of the entire language population in which we are interested.
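One way to picture a sample that keeps the proportions of the population is proportional stratified sampling by genre. The following is only a minimal sketch in Python, not a description of how any particular corpus was built: the function name `stratified_sample`, the genre labels and the 60/30/10 split are all hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter


def stratified_sample(texts, sample_size, seed=0):
    """Draw a sample whose genre proportions approximate the population's.

    texts: list of (genre, text_id) pairs (a hypothetical representation).
    Because each quota is rounded, the total may differ slightly from
    sample_size when the proportions do not divide evenly.
    """
    rng = random.Random(seed)

    # Group the text identifiers by genre.
    by_genre = {}
    for genre, text_id in texts:
        by_genre.setdefault(genre, []).append(text_id)

    total = len(texts)
    sample = []
    for genre, ids in sorted(by_genre.items()):
        # Quota proportional to the genre's share of the population,
        # with at least one text per genre and never more than exist.
        quota = max(1, round(sample_size * len(ids) / total))
        quota = min(quota, len(ids))
        sample.extend((genre, t) for t in rng.sample(ids, quota))
    return sample


# Hypothetical population: 60% press, 30% fiction, 10% academic writing.
population = (
    [("press", f"p{i}") for i in range(600)]
    + [("fiction", f"f{i}") for i in range(300)]
    + [("academic", f"a{i}") for i in range(100)]
)

sample = stratified_sample(population, sample_size=50)
counts = Counter(genre for genre, _ in sample)
print(counts)
```

With these made-up figures the 50-text sample contains 30 press, 15 fiction and 5 academic texts, mirroring the proportions of the population. Which texts are chosen within each genre is still left to chance, so a real corpus design would also need to decide which categories to stratify by in the first place.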
If you haven't already done so you can go on to read about other characteristics of the modern corpus:
Finite size | Machine-readable form | Standard reference