Corpora and Sociolinguistics

Although sociolinguistics is an empirical field of research, it has hitherto relied primarily upon the collection of research-specific data, which are often not intended for quantitative study and are thus not often rigorously sampled. Sometimes the data are also elicited rather than naturalistic. A corpus can provide what these kinds of data cannot: a representative sample of naturalistic data which can be quantified. Although corpora have not as yet been used to a great extent in sociolinguistics, there is evidence that this is a growing field.

The majority of corpus-based studies in this area have been lexical studies of language and gender. Kjellmer (1986), for example, used the Brown and LOB corpora to examine the masculine bias in American and British English. He looked at the occurrence of masculine and feminine pronouns, and at the occurrence of the items man/men and woman/women. As one would expect, the frequencies of the female items were much lower than those of the male items in both corpora. Interestingly, however, the female items were more common in British English than in American English. Another of Kjellmer's hypotheses was not supported by the corpora: that women would be less "active", that is, would more frequently be the objects rather than the subjects of verbs. In fact, men and women had similar subject/object ratios.
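The kind of frequency comparison described above can be illustrated with a minimal Python sketch. It is not Kjellmer's procedure: the item lists, the function name gendered_counts and the simple letters-only tokenisation are all assumptions made for illustration, and a real study would work from the tokenised corpus files themselves. Counts are normalised per million tokens so that texts of different sizes can be compared.

```python
import re
from collections import Counter

# Illustrative item lists (an assumption; the exact inventory used in the
# original study is not given in the text above).
MASCULINE = {"he", "him", "his", "himself", "man", "men"}
FEMININE = {"she", "her", "hers", "herself", "woman", "women"}


def gendered_counts(text):
    """Count masculine and feminine items and normalise per million tokens."""
    # Crude tokenisation: lower-case runs of letters only, so "man's"
    # yields the tokens "man" and "s".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)

    masculine = sum(counts[w] for w in MASCULINE)
    feminine = sum(counts[w] for w in FEMININE)

    def per_million(n):
        # Guard against an empty text.
        return n * 1_000_000 / total if total else 0.0

    return {
        "tokens": total,
        "masculine": masculine,
        "feminine": feminine,
        "masculine_pm": per_million(masculine),
        "feminine_pm": per_million(feminine),
    }


sample = "He saw the woman. The men greeted her and she thanked him."
result = gendered_counts(sample)
print(result["masculine"], result["feminine"], result["tokens"])
```

Running the same function over two corpora and comparing the normalised figures gives the sort of cross-variety comparison reported above, although raw counts of this kind take no account of context.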

Holmes (1994) makes two important points about the methodology of these kinds of study, which are worth bearing in mind. First, when classifying and counting occurrences, the context of the lexical item should be considered. For instance, whilst there is a non-gender-marked alternative for policeman/policewoman, namely police officer, there is no such alternative for the -ess form in Duchess of York. The latter form should therefore be excluded from counts of "sexist" suffixes when looking at gender bias in writing. Second, Holmes points out the difficulty of classifying a form when it is actively undergoing semantic change. She argues that the word man can either refer to a single male (as in the phrase A 35-year-old man was killed) or have a generic meaning which refers to mankind (as in Man has engaged in warfare for centuries). In phrases such as we need the right man for the job, it is difficult to decide whether man is gender-specific or could be replaced by person. These simple points should prompt a more critical approach to data classification in further sociolinguistic work using corpora, both within and outside the area of gender studies.
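Holmes's first point, that context should decide whether an occurrence is counted, can be sketched in Python as follows. This is only an illustration under stated assumptions: the word lists, the function name classify_ess_forms and the rule "a title form followed by of plus a capitalised word is excluded" are hypothetical simplifications, not a procedure taken from Holmes.

```python
import re

# Hypothetical, illustrative lists; a real study would need a fuller inventory.
ESS_FORMS = {"actress", "waitress", "duchess", "hostess", "stewardess"}
# Forms assumed to have no neutral alternative when used as part of a title.
TITLE_FORMS = {"duchess"}


def classify_ess_forms(text):
    """Split -ess occurrences into those counted and those excluded as titles."""
    counted, excluded = [], []
    for match in re.finditer(r"\b[A-Za-z]+\b", text):
        word = match.group(0)
        if word.lower() not in ESS_FORMS:
            continue
        # Look at the right-hand context: "of" followed by a capitalised
        # word is treated here as a sign of a title such as "Duchess of York".
        following = text[match.end():]
        is_title = (
            word.lower() in TITLE_FORMS
            and re.match(r"\s+of\s+[A-Z]", following) is not None
        )
        if is_title:
            excluded.append(word)
        else:
            counted.append(word)
    return counted, excluded


counted, excluded = classify_ess_forms(
    "The Duchess of York met an actress and a waitress."
)
print(counted, excluded)
```

A rule of this kind handles only the clear cases. Occurrences such as man in we need the right man for the job cannot be resolved by a surface pattern and would still need manual classification.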