Lancaster University Graduate School
Faculty of Arts and Social Sciences

UCREL CRS: Uncovering Complex Facts in Natural-language Documents

Date: 6 June 2013 Time: 2:00-3:00 pm

Venue: FASS Meeting Room 3

UCREL Corpus Research Seminar

Uncovering Complex Facts in Natural-language Documents

Colleen E. Crangle

This talk is about text data mining, specifically about automated methods of discovering complex facts in documents. Applications include biomedical research, where articles appear in the scientific literature faster than the scientific community can compile their findings. The number of articles on cellular processes alone, for example, has grown over the past ten years from approximately 9,000 to more than 20,000 a year. Other applications lie in the analysis of large amounts of unstructured text data, such as email exchanges or collections of news stories, to find evidence of a topic or idea not previously noticed. What is needed is not simply a tool to find snippets of data in a document, but a tool that can unearth complex facts synthesized from several places within and across documents.

The starting point for this work is the idea that natural-language documents can be treated as time-series data, that is, modeled as sets of stochastic processes, in order to uncover important semantic information in them. Time-series data consist of sequences of data points for which there is a natural one-way ordering; well-known examples are acoustic data, economic data, hydrology data, and time-varying biological data from sensors such as electrocardiograms and electroencephalograms. Two defining characteristics of time-series data apply directly to natural-language data. First, data points close together in time are more closely related than those further apart; in language, words close together constitute the syntactic structures that give form and meaning to language, and meaningful passages are made up of contiguous sentences and paragraphs. Second, in time-series data the values of data points for a given period depend in some way on the values of earlier data points. The intrinsic temporal order of natural language means that complex facts are built up from the words and sentences, paragraphs and sections as they follow one another in text.
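As a rough illustration of reading a document as a time series, the sketch below turns a token sequence into a 0/1 series marking where one term occurs, then measures how strongly that series depends on its own earlier values. The function names, the sample sentence, and the choice of sample autocorrelation as the dependence measure are assumptions made for this illustration; the abstract does not say which stochastic models the talk uses.

```python
def indicator_series(tokens, term):
    """Return a 0/1 series: 1 at each position where `term` occurs."""
    return [1 if tok == term else 0 for tok in tokens]


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag.

    Returns 0.0 when the series is constant or the lag is too long.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean)
        for i in range(n - lag)
    )
    return cov / var


# Hypothetical token stream in which "protein" recurs every third word.
tokens = ("protein binds receptor " * 3).split()
series = indicator_series(tokens, "protein")

lag1 = autocorrelation(series, 1)
lag3 = autocorrelation(series, 3)
print(series)
print(lag3 > lag1)  # the series correlates with itself more strongly at lag 3
```

In this toy case the term's position three words back is a better predictor of its recurrence than the position one word back, which is the kind of ordering information a pure frequency count cannot express.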

In text data mining, a document is typically represented as a "bag of words." All occurrences of a word in a document are counted, and the collection of frequency counts for the words of interest represents the document. With this approach, all temporal structure in the occurrences of the words is lost. Document representations are needed that better capture how significant concepts are distributed throughout the document. This talk presents new methods of document representation of this kind and shows how they lead to the identification of complex facts in document collections. Examples are drawn from biomedical articles and news stories.
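The loss of temporal structure in a bag of words can be seen in a small sketch. Two toy documents below contain the same words with the same frequencies, so their bags of words are identical, yet counting a term in consecutive windows shows that it is distributed differently in each. The windowed count is only one simple, assumed way to keep positional information; it is not presented as the representation proposed in the talk, and the documents and function names are invented for illustration.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count every occurrence of each word; word order is discarded."""
    return Counter(tokens)


def windowed_counts(tokens, term, window):
    """Count occurrences of `term` in consecutive, non-overlapping windows."""
    return [
        tokens[i:i + window].count(term)
        for i in range(0, len(tokens), window)
    ]


# Hypothetical documents: same words, different ordering.
doc_a = "gene gene gene cell cell cell".split()
doc_b = "gene cell gene cell gene cell".split()

print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
print(windowed_counts(doc_a, "gene", 3))  # [3, 0]
print(windowed_counts(doc_b, "gene", 3))  # [2, 1]
```

The bag of words cannot tell the two documents apart, while the windowed series shows that "gene" is concentrated at the start of the first document and spread across the second.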

Colleen E. Crangle is currently the Fulbright-Lancaster STEM Science and Technology Scholar at Lancaster University, Lancaster, United Kingdom (January to July 2013). Her home institution is CONVERSPEECH, a small business in Palo Alto, California, that provides computational analyses of text, spoken language, and neurolinguistic data. She is an affiliated scholar with the Center for the Study of Language and Information at Stanford University. She has a PhD in Logic, Philosophy of Language and Science from Stanford, and MSc and BSc degrees in computer science and mathematics.


Who can attend: Anyone


Further information

Organising departments and research centres: Computing and Communications, Linguistics and English Language, University Centre for Computer Corpus Research on Language (UCREL)


Graduate School, Faculty of Arts and Social Sciences, Lancaster University, Lancaster LA1 4YD, UK
Tel: +44 (0) 1524 510880