Corpus-based grammar in contrast

This project explores an application of novel corpus-based methods to a set of issues in grammatical analysis, in the context of a language, Nepali, for which corpus linguistics is in its infancy. It will also extend the analysis to a cross-linguistic comparison bringing in English and Russian.

Research questions

Previous work on Nepali grammar has catalogued the possible combinations of grammatical and lexical elements. For example, Acharya (1991:78, 153, 157) lists 13 combinations of nouns and case-marking postpositions, and 360 different inflected forms of the Nepali lexical verb. Schmidt et al. (1993:xxi-xxvi) give similar catalogues of possibilities. However, to date little or no work on this topic, or on Nepali grammar in general, has been based on the large-scale analysis of grammar in usage that corpus-based methods afford.

The grammatical categories of case (on nouns) and tense, aspect and mood (on verbs) are realised in Nepali as partially-bound elements which typically occur in close proximity to the nouns and verbs they relate to. Case, as well as the plural-collective marker, is indicated by post-nominal elements described variously as suffixes, clitics and postpositions. Tense, aspect and mood are largely marked by compounded auxiliary verbs, which can, however, also occur independently.

The semi-independence of these grammatical markers implies a degree of variety in their possible positions in the sentence structure. This raises the possibility of studying these markers, and the grammatical patterns in whose formation they participate, via quantitative analysis of their co-occurrence patterns in textual data. As outlined below, this may be accomplished by searching a corpus for statistically significant collocations. Collocation-based methods have been applied to the grammar of English, but not widely in a cross-linguistic context.

The questions to be addressed are in summary:

  • What is the behaviour of Nepali grammatical categories, seen from a corpus-based quantitative perspective using analysis of co-occurrence and collocation, in real, naturally produced text?
  • How does this add to (or amend) our knowledge of Nepali grammar based on earlier, non-corpus-based analyses?
  • What cross-linguistic correspondences exist between the patterns of co-occurrence behaviour of these elements, and those of equivalent elements in two other languages, English and Russian?

Research methods

The methodology that the project will employ is a novel empirical approach to grammatical categories and the quantitative patterns in which they are distributed in textual data.

At the core of the methodology are co-occurrence statistics derived from text corpora (primarily written corpora, because spoken corpora of sufficient size are not consistently available). These statistics will take two primary forms: raw co-occurrence counts of grammatical categories, and collocation lists.

Co-occurrence counts allow a quantitative profile to be developed for various complex morphosyntactic structures, for example, in Nepali, forms with more than one case-marking postposition, or verbs compounded with multiple auxiliary verb forms. Such profiles show which of the possible combinations actually occur in usage, and how often, which extends our knowledge of how these structures operate.
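As a rough illustration of what a raw co-occurrence count might look like in practice, the following minimal Python sketch tallies the chains of markers that directly follow each head word in a POS-tagged text. It is not the project's actual tooling: the function name `count_marker_chains`, the tag labels `N` and `POSTP`, and the placeholder tokens are all assumptions made for the example, and a real Nepali tagset would differ.

```python
from collections import Counter

def count_marker_chains(tagged_tokens, head_tag="N", marker_tag="POSTP"):
    """Count the marker sequences that immediately follow each head token.

    tagged_tokens is a list of (token, tag) pairs. The tag labels are
    hypothetical placeholders, not those of any real tagset.
    """
    counts = Counter()
    i, n = 0, len(tagged_tokens)
    while i < n:
        if tagged_tokens[i][1] == head_tag:
            chain = []
            j = i + 1
            # Collect every marker directly following the head.
            while j < n and tagged_tokens[j][1] == marker_tag:
                chain.append(tagged_tokens[j][0])
                j += 1
            counts[tuple(chain)] += 1  # the empty tuple means "no marker"
            i = j
        else:
            i += 1
    return counts

# Placeholder data: n1..n3 stand for nouns, pA/pB for postpositions.
sample = [("n1", "N"), ("pA", "POSTP"), ("pB", "POSTP"), ("v1", "V"),
          ("n2", "N"), ("pA", "POSTP"), ("n3", "N"), ("v2", "V")]
profile = count_marker_chains(sample)
for chain, freq in profile.most_common():
    print(chain, freq)
```

The same function could in principle be pointed at verbs and auxiliary forms by changing the two tag arguments, which is why the tags are parameters rather than constants.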

Collocation lists present another angle of analysis: they identify word forms (or, alternatively, lemmata or POS tags) which occur in proximity to the form under investigation with (statistically) significantly greater frequency than elsewhere in the corpus. Such lists permit an analysis of both morphosyntactic and semantic preferences of that form. It should be noted that for both types of quantitative data, a subsequent qualitative analysis is necessary to interpret the results and contextualise them in our knowledge of Nepali grammar. That is, lists of high-frequency, statistically significant collocates are examined by the analyst for patterns of morphological, morphosyntactic or semantic similarities.
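The text above does not name a particular significance statistic. Purely as an illustration, the sketch below assumes the log-likelihood ratio (G2) computed over a 2x2 contingency table, one common choice in collocation work, with a window of a few tokens either side of the node. The function names, the `span` and `threshold` parameters, and the default cut-off of 3.84 (the chi-squared critical value for p < 0.05 with one degree of freedom) are assumptions of this example, not the project's stated method.

```python
import math
from collections import Counter

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 table of observed counts."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

def collocates(tokens, node, span=2, threshold=3.84):
    """List words found near `node` significantly more often than expected."""
    n = len(tokens)
    node_positions = {i for i, t in enumerate(tokens) if t == node}
    # Every position within `span` of a node occurrence, counted once
    # even where windows overlap, and excluding the node itself.
    window = set()
    for i in node_positions:
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j not in node_positions:
                window.add(j)
    in_window = Counter(tokens[j] for j in window)
    overall = Counter(tokens)
    r1 = len(window)
    results = []
    for word, o11 in in_window.items():
        c1 = overall[word]
        o12 = r1 - o11           # other words inside the windows
        o21 = c1 - o11           # this word outside the windows
        o22 = n - r1 - o21       # other words outside the windows
        expected = r1 * c1 / n
        score = log_likelihood(o11, o12, o21, o22)
        # Keep only collocates that are attracted, not repelled.
        if o11 > expected and score >= threshold:
            results.append((word, o11, round(score, 2)))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Passing a list of lemmata or POS tags instead of word forms would give the alternative collocation lists mentioned above. The output is only the starting point: the ranked list still has to be read by the analyst for morphological, morphosyntactic or semantic patterns.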

These methods will be applied by the CORGRAM research team to corpora of three languages (Nepali, English and Russian) in order to carry out a contrastive quantitative-distributional analysis of grammatical categories associated with nouns and verbs.


The CORGRAM project is still in its initial phases. Watch this space!

CORGRAM is funded by the Arts and Humanities Research Council.



