Corpus Linguistics & NLP

Researchers in Corpus Linguistics and Natural Language Processing (NLP) have developed an array of methods for studying both the linguistic form and the content of large collections of texts, or corpora. These corpora range from the very small (tens of thousands of words) to the very large (hundreds of millions or billions of words).

In their most basic form, corpus analyses provide frequency counts of the items encountered in a text. These counts enable the researcher not only to search for and spot key words and phrases, but also to examine their concordances (i.e. listings of each occurrence together with the words around it). Other analytic techniques, such as collocation analysis, enable the researcher to identify and extract the terms in a corpus that are associated with (or, in other words, that collocate with) a particular word. This allows one to examine how words are used in context.
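As a rough illustration of frequency counting and concordancing, the following minimal sketch uses only the Python standard library. The sample sentence, the simple regular-expression tokeniser and the function names are assumptions made for this example, not the tools used in the project.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, window=3):
    """Return keyword-in-context lines: up to `window` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = ("The lake was calm. We walked round the lake at dusk, "
          "and the lake shone under the moon.")

tokens = tokenize(sample)
freq = Counter(tokens)            # frequency count of every token
print(freq.most_common(3))
for line in concordance(tokens, "lake"):
    print(line)
```

Dedicated corpus tools add much more (lemmatisation, sorting of concordance lines, statistical keyness measures), but the underlying operations are counting tokens and displaying each hit in its context.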

Other commonly used techniques include:

  • Part-of-speech annotation – grammatical labelling of the words in a corpus;
  • Semantic tagging – automatic grouping of words into categories based on meaning;
  • Named-entity recognition – the process of automatically locating, classifying and annotating named elements, such as people, organisations or places, in running texts.

All of these methods are fundamentally quantitative, since the outputs they generate are based on the statistical processing of corpus data. Quantitative and qualitative approaches are, however, closely intertwined in corpus linguistics: the analyst interprets quantitative results qualitatively, and qualitative statements are always formulated in light of the available quantitative data.

The application of methodologies from corpus linguistics and NLP has led to dramatic advances in fields such as lexicography, descriptive grammar, language teaching and literary stylistics. But to date, relatively little work has sought to add a spatial dimension to corpus analysis, despite the clear coherence of the corpus-based approach with the ideas underlying the field of Geographical Information Systems (GIS). In this project, we are working to bridge that gap.

In particular, we see three techniques regularly employed in corpus linguistics and NLP as key to a successful integration of corpus data into GIS analysis:

First, named-entity recognition allows all occurrences of place-names in a corpus to be identified. The resulting data, when geo-referenced, provides the basis of a GIS – allowing the underlying geography of the corpus to be visualised.
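The idea of locating place-names and geo-referencing them can be sketched as follows. This is a deliberately simplified gazetteer lookup rather than a full named-entity recogniser, and the gazetteer contents, the rough coordinates, the sample sentence and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical gazetteer: place-name -> (latitude, longitude).
# The coordinates are rough, illustrative values only.
GAZETTEER = {
    "Keswick": (54.60, -3.13),
    "Windermere": (54.38, -2.91),
    "Grasmere": (54.46, -3.02),
}

def find_places(text, gazetteer):
    """Return one record per place-name occurrence, with offset and coordinates."""
    # Try longer names first so that a multi-word name wins over a shorter one.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    records = []
    for match in pattern.finditer(text):
        name = match.group(1)
        lat, lon = gazetteer[name]
        records.append({"name": name, "offset": match.start(),
                        "lat": lat, "lon": lon})
    return records

# Invented sample sentence, for illustration only.
text = "From Keswick we rode to Grasmere, and returned to Keswick by night."
places = find_places(text, GAZETTEER)
for record in places:
    print(record)
```

Each record pairs a position in the text with a point on the map, which is the kind of table a GIS can load as a point layer. A lookup like this cannot resolve ambiguous names or spot places missing from the gazetteer, which is where proper named-entity recognition and manual checking come in.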

Second, collocation analysis allows us to undertake large-scale examinations of what words and topics are being discussed in relation to different place-names in a corpus.
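A minimal sketch of this step, assuming a small hypothetical list of place-names, an ad hoc stopword list and an invented sample text, counts the words that fall within a fixed window around each place-name. It uses raw co-occurrence counts for simplicity; collocation analysis in practice usually ranks collocates with an association measure such as mutual information or log-likelihood.

```python
import re
from collections import Counter, defaultdict

# Hypothetical inputs for illustration.
PLACES = {"keswick", "grasmere"}
STOPWORDS = {"the", "a", "of", "and", "to", "in", "at", "was", "is"}

def place_collocates(text, places, window=4):
    """Count non-stopword tokens within `window` tokens of each place-name."""
    tokens = re.findall(r"[a-z']+", text.lower())
    collocates = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok in places:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            collocates[tok].update(
                w for w in span if w not in STOPWORDS and w not in places
            )
    return collocates

# Invented sample text, for illustration only.
text = ("The fever spread in Keswick. Keswick was struck by fever again. "
        "The market at Grasmere was busy, and the Grasmere market sold wool.")

result = place_collocates(text, PLACES)
for place, counts in result.items():
    print(place, counts.most_common(3))
```

Note that the window here ignores sentence boundaries, so a word in a neighbouring sentence can be counted as a collocate; whether that is desirable depends on the research question.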

Third, semantic tagging allows us to perform collocation analysis at a higher level of generality. Instead of just looking at the individual words that collocate with a place-name, we can specify a topic category such as ‘war’ or ‘disease’ and identify all places discussed in relation to any word tagged as belonging to that topic.
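The difference from word-level collocation can be sketched as follows: tokens are first mapped to semantic categories, and the search around each place-name then matches on the category rather than on the word. The tiny lexicon, the place list, the sample text and the function name are hypothetical stand-ins for a real semantic tagger and gazetteer.

```python
import re
from collections import defaultdict

# Hypothetical semantic lexicon: word -> topic category.
SEMANTIC_LEXICON = {
    "fever": "disease", "plague": "disease", "cholera": "disease",
    "battle": "war", "siege": "war", "army": "war",
}
# Hypothetical list of place-names (lowercased).
PLACES = {"keswick", "carlisle", "grasmere"}

def places_by_topic(text, topic, window=4):
    """Map each place-name to the number of nearby tokens tagged with `topic`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tags = [SEMANTIC_LEXICON.get(tok) for tok in tokens]  # None if untagged
    hits = defaultdict(int)
    for i, tok in enumerate(tokens):
        if tok in PLACES:
            lo, hi = max(0, i - window), i + 1 + window
            hits[tok] += sum(1 for tag in tags[lo:hi] if tag == topic)
    # Keep only the places with at least one topic word nearby.
    return {place: n for place, n in hits.items() if n > 0}

# Invented sample text, for illustration only.
text = ("Grasmere remained quiet all year. Cholera reached Keswick in spring, "
        "and fever followed. Far to the north, an army laid siege to Carlisle.")

print(places_by_topic(text, "disease"))
print(places_by_topic(text, "war"))
```

Because ‘cholera’ and ‘fever’ share a tag, both count towards the same topic for Keswick, even though neither word alone would stand out in a word-level collocate list.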

© Spatial Humanities: Texts, GIS & Places
