{"id":40,"date":"2013-12-18T14:32:55","date_gmt":"2013-12-18T14:32:55","guid":{"rendered":"http:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/?page_id=40"},"modified":"2014-01-21T12:25:29","modified_gmt":"2014-01-21T12:25:29","slug":"corpus-linguistics-collocation","status":"publish","type":"page","link":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/?page_id=40","title":{"rendered":"Corpus Linguistics &#038; NLP"},"content":{"rendered":"<p><strong>Researchers in the fields of Corpus Linguistics and Natural Language Processing (NLP)<\/strong> <strong>have developed an array of methods for\u00a0studying both the linguistic form and the content of large collections of texts&#8211;or corpora&#8211;ranging from the very small (tens of thousands of words) to the very large (hundreds of millions or billions of words).<\/strong><\/p>\n<p>In their most basic form, corpus analyses provide frequency counts of items encountered in\u00a0a text. Performing these counts enables the researcher not only to search for and spot key words and phrases, but also to examine their <i><strong>concordances<\/strong><\/i> (i.e.\u00a0the words that occur around them). Other analytic techniques, such as\u00a0<strong><i>collocation analysis<\/i><\/strong>,\u00a0enable the researcher to\u00a0identify and extract terms within a corpus that are associated (or, in other words, that <strong><em>collocate<\/em><\/strong>) with any other particular word. 
This allows\u00a0one to examine how words are used in context.<\/p>\n<p>Other commonly used techniques include:<\/p>\n<ul>\n<li><strong>Part-of-speech annotation<\/strong>&#8211;grammatical labelling of the words in a corpus;<\/li>\n<li><strong>Semantic tagging<\/strong>&#8211;automatic grouping of words into categories based on meaning;<\/li>\n<li><strong>Named-entity recognition<\/strong>&#8211;the process of automatically\u00a0locating, classifying and annotating named elements, such as\u00a0people, organisations or places, in running texts.<\/li>\n<\/ul>\n<p>All of these methods are fundamentally quantitative, since the outputs they generate are based on the statistical processing of corpus data. Quantitative and qualitative approaches are, however, radically intertwined in corpus linguistics because quantitative results are interpreted in a qualitative fashion by the analyst and qualitative statements are always formulated in light of the available quantitative data.<\/p>\n<p>The application of methodologies from corpus linguistics and\u00a0NLP has led to dramatic advances in fields such as lexicography, descriptive grammar, language teaching and literary stylistics. But to date, relatively little work has sought to add a spatial dimension to corpus analysis&#8211;despite the clear coherence of the corpus-based approach with the ideas underlying the field of Geographical Information Systems (GIS). 
In this project, we are working to bridge that gap.<\/p>\n<p>In particular, we see three techniques regularly employed in corpus linguistics and NLP as key to a successful integration of corpus data into GIS analysis:<\/p>\n<p><strong>First<\/strong>, named-entity recognition allows all occurrences of place-names in a corpus to be identified.\u00a0The resulting data, when geo-referenced, provides the basis of a GIS &#8211; allowing the\u00a0underlying geography\u00a0of the corpus to be visualised.<\/p>\n<p><strong>Second<\/strong>, collocation analysis allows us to undertake large-scale examinations of what words and topics are being discussed in relation to different\u00a0place-names in a corpus.<\/p>\n<p><strong>Third<\/strong>, semantic tagging allows us to\u00a0perform collocation analysis at a higher level of generality. Instead of just looking at the words that collocate with a place-name, we can specify a topic-category such as, say,\u00a0&#8216;war&#8217; or &#8216;disease&#8217; and identify all places discussed in relation to any word tagged as relating to that topic.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-112\" alt=\"cropped-bg1_021.jpg\" src=\"http:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/wp-content\/uploads\/2014\/01\/cropped-bg1_021.jpg\" width=\"1600\" height=\"230\" srcset=\"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/wp-content\/uploads\/2014\/01\/cropped-bg1_021.jpg 1600w, https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/wp-content\/uploads\/2014\/01\/cropped-bg1_021-300x43.jpg 300w, https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/wp-content\/uploads\/2014\/01\/cropped-bg1_021-1024x147.jpg 1024w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/>\u00a9 Spatial Humanities: Texts, GIS &amp; Places<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Researchers in the fields of Corpus Linguistics and Natural 
Language Processing (NLP) have developed an array of methods for\u00a0studying both the linguistic form and the content of large collections of texts&#8211;or corpora&#8211;ranging from the very small (tens of thousands of words) to the very large (hundreds of millions or billions of words). In their most &hellip; <a href=\"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/?page_id=40\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Corpus Linguistics &#038; NLP<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":5,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"open","template":"","meta":{"footnotes":""},"class_list":["post-40","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/pages\/40","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=40"}],"version-history":[{"count":23,"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/pages\/40\/revisions"}],"predecessor-version":[{"id":627,"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=\/wp\/v2\/pages\/40\/revisions\/627"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/fass\/projects\/spatialhum.wordpress\/index.php?rest_route=%2Fw
p%2Fv2%2Fmedia&parent=40"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}