Corpus-based insights into spoken L2 English: Introducing eight projects that use the Trinity Lancaster Corpus

In November 2016, we announced the Early Data Grant Scheme in which researchers could apply for access to the Trinity Lancaster Corpus (TLC) before its official release in 2018.  The Early Data subset of the corpus contains 2.83 million words from 1,244 L2 speakers.

The Trinity Lancaster Corpus project is a product of an ongoing collaboration between The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, and Trinity College London, a major international examination board. The Trinity Lancaster Corpus contains several features (rich metadata, a range of proficiency levels, L1s and age groups) that make it an important resource for studying L2 English. Soon after we started working on the corpus development in 2013, we realised the great potential of the dataset for researchers in language learning and language testing. We were very excited to receive a number of outstanding applications from around the world (Belgium, China, Germany, Italy, Spain, UK and US).  The selected projects cover a wide range of topics focusing on different aspects of learner language use. In the rest of this blog post we introduce the successful projects and their authors.

  1. Listener response in examiner-EFL examinee interactions

Erik Castello and Sara Gesuato, University of Padua

The term listener response is used to denote (non-)verbal behaviour produced in reaction to an interlocutor’s talk and sharing a non-turn status, e.g. short verbalisations, sentence completion, requests for clarifications, restatements, shakes, frowns (Xudong 2009). Listener response is a form of confluence-oriented behaviour (McCarthy 2006) which contributes to the construction and smooth handling of conversation (Krauss et al. 1982). Response practices can vary within the same language/culture in terms of placement and function in the turn sequence and the roles played by the same listener response types (Schiffrin 1987; Gardner 2007). They can also vary across cultures/groups (Cutrone 2005; Tottie 1991) and between the sexes (Makri-Tsilipakou 1994; Rühlemann 2010). Therefore, interlocutors from different linguistic/cultural backgrounds may experience communication breakdown, social friction and the emergence of negative attitudes (Wieland 1991; Li 2006), including participants in examiner-EFL examinee interactions (Götz 2013) and in EFL peer-to-peer interactions (Castello 2013). This paper explores the listener response behaviour of EFL examinees in the Trinity Lancaster Corpus (Gablasova et al. 2015), which may display interference from the examinees’ L1s and affect the examiners’ impression of their fluency. It aims to: identify forms of verbal listener responses in examinee turns and classify them in terms of conventions of form (mainly following Clancy et al. 1996) and conventions of function (mainly following Maynard 1997); identify strategies for co-constructing turn-taking, if any (Clancy/McCarthy 2015); and determine the frequencies of occurrence of the above phenomena across types of interaction, examinees’ perceived proficiency levels and between the sexes.

Erik Castello is Assistant Professor of English Language and Translation at the University of Padua, Italy. His research interests include (learner) corpus linguistics, discourse analysis, language testing, academic English and SFL. He has co-edited two volumes and published two books and several articles on these topics.

Sara Gesuato is Associate Professor of English language at the University of Padua, Italy. Her research interests include pragmatics, genre analysis, verbal aspect, and corpus linguistics. She has co-edited two volumes on pragmatic issues in language teaching, and is currently investigating sociopragmatic aspects of L2 written speech acts.

  2. Formulaic expressions in learner speech: New insights from the Trinity Lancaster Corpus

Francesca Coccetta, Ca’ Foscari University of Venice

This study investigates the use of formulaic expressions in the dialogic component of the Trinity Lancaster Corpus. Formulaic expressions are multi-word units serving pragmatic or discourse structuring functions (e.g. discourse markers, indirect forms performing speech acts, and hedges), and their mastery is essential for language learners to sound more native-like. The study explores the extent to which the Trinity exam candidates use formulaic expressions at the various proficiency levels (B1, B2 and C1/C2), and the differences in their use between successful and less successful candidates. In addition, it investigates how the exam candidates compare with native speakers in the use of formulaic expressions. To do this, recurrent multi-word units consisting of two to five words will be automatically extracted from the corpus using Sketch Engine; then, the data will be manually filtered to eliminate unintentional repetitions, phrase and clause fragments (e.g. in the, it and, of the), and the multi-word units that do not perform any pragmatic or discourse function. The high-frequency formulaic expressions of each proficiency level will be provided and compared with each other and with the ones identified in previous studies on native speech. The results will offer new insights into learners’ use of prefabricated expressions in spoken language, particularly in an exam setting.

Francesca Coccetta is a tenured Assistant Professor at Ca’ Foscari University of Venice. She holds a doctorate in English Linguistics from Padua University where she specialised in multimodal corpus linguistics. Her research interests include multimodal discourse analysis, learner corpus research, and the use of e-learning in language learning and teaching. 

  3. The development of high-frequency verbs in spoken EFL and ESL

Gaëtanelle Gilquin, Université catholique de Louvain

This project aims to contribute to the recent effort to bridge the paradigm gap between second language acquisition research and corpus linguistics. While most such studies have relied on written corpus data to compare English as a Foreign Language (EFL) and English as a Second Language (ESL), the present study will take advantage of a new resource, the Trinity Lancaster Corpus, to compare speech in an EFL variety (Chinese English) and in an ESL variety (Indian English). The focus will be on high-frequency verbs and how their use develops across proficiency levels in the two varieties, as indicated by the CEFR scores provided in the corpus. Various aspects of language will be considered, taking high-frequency verbs as a starting point, among which grammatical complexity (e.g. through the use of infinitival constructions of the causative type), idiomaticity (e.g. through the degree of typicality of object nouns) and fluency (e.g. through the presence of filled pauses in the immediate environment). The assumption is that, given the different acquisitional contexts of EFL and ESL, one and the same score in EFL and ESL may correspond to different linguistic realities, and that similar developments in scores (e.g. from B1 to B2) may correspond to different developments in language usage. More particularly, it is hypothesised that EFL speakers will progress more rapidly in aspects that can benefit from instruction (e.g. those involving grammatical rules), whereas ESL speakers will progress more rapidly in aspects that can benefit from exposure to naturalistic language (like phraseology).

Gaëtanelle Gilquin is a Lecturer in English Language and Linguistics at the University of Louvain. She is the coordinator of LINDSEI and one of the editors of The Cambridge Handbook of Learner Corpus Research. Her research interests include spoken learner English, the link between EFL and ESL, and applied construction grammar.

  4. Describing fluency across proficiency levels: From ‘can-do-statements’ towards learner-corpus-informed descriptions of proficiency

Sandra Götz, Justus Liebig University Giessen

While it has been noted that current assessment scales (e.g. the Common European Framework of Reference; CEF; Council of Europe 2009) describing learners’ proficiency levels in ‘can-do-statements’ are often formulated somewhat vaguely (e.g. North 2014), researchers and CEF-developers have pointed out the benefits of including more specific linguistic descriptors emerging from learner corpus analyses (e.g. McCarthy 2013; Park 2014). In this project, I will test whether and how descriptions of fluency in learner language, such as those in the CEF, can benefit from analyzing learner data at different proficiency levels in the Trinity Lancaster Corpus. More specifically, I will test whether the learners’ proficiency levels can serve as robust predictors of their use of core fluency variables, such as filled and unfilled pauses (e.g. er, erm, eh, ehm), discourse markers (e.g. you know, like, well), or small words (e.g. sort of, kind of). I will also test whether learners show similar or different paths in their developmental stages of fluency from the B1 to the C2 level, regardless of (or dependent on) their L1. Through the meta-information available on the learners in the Trinity Lancaster Corpus, sociolinguistic and learning context variables (such as the learners’ age, gender or the task type) will also be taken into consideration in developing data-driven descriptor scales on fluency at different proficiency levels. Thus, it will be possible to differentiate between L1-specific and universal learner features in fluency development.

Sandra Götz obtained her PhD from Justus Liebig University Giessen and Macquarie University Sydney in 2011. Since then, she has been working as a Senior Lecturer in English Linguistics at University of Giessen. Her main research interests include (learner) corpus linguistics and its application to language teaching and testing, applied linguistics and World Englishes.

  5. Self-repetition in the spoken English of L2 English learners: The effects of task type and proficiency levels

Lalita Murty, University of York

Self-repetition (SR), where the speaker repeats a word or phrase, is a much-observed phenomenon in spoken discourse. SR serves a range of distinct communicative and interactive functions, such as expressing agreement or disagreement or adding emphasis to what the speaker wants to say, as the following example shows: ‘Yes, I know I know and I certainly think that limits are…’ (to express agreement with the previous speaker) (Gablasova et al., 2015). Self-repetitions also help in creating coherence (Bublitz, 1989 as cited in Fung, 2007: 224), enhancing the clarity of the message (Kaur, 2012), keeping the floor, maintaining smooth flow of conversation, linking speaker’s ideas to previous speaker’s ideas (Tannen, 1989), and initiating self and other repairs (Bjorkman, 2011; Robinson and Kevoe-Feldman, 2010). This paper will use Sketch Engine to extract instances of single content word self-repetitions in the Trinity Lancaster Corpus data to examine the effect of (i) L2 proficiency levels and (ii) task types on the frequency and functions of different types of self-repetitions made by speakers at varying proficiency levels in the different tasks. A quantitative and qualitative analysis of the data thus extracted will be conducted using a combination of Norrick’s (1987) framework and CA approaches.

Lalita Murty is a Lecturer at the Norwegian Study Centre, University of York.  Her previous research focused on spoken word recognition and call centre language. Currently she is working on Reduplication and Iconicity in Telugu, a South Indian language.

  6. Certainty adverbs in learner language: The role of tasks and proficiency

Pascual Pérez-Paredes, University of Cambridge and María Belén Díez-Bedmar, University of Jaén

In comparisons of native and non-native use of stance adverbs, the effect of task has been largely ignored. An exception is Gablasova et al. (2015). The authors researched the effect of different speaking tasks on L2 speakers’ use of epistemic stance markers and concluded that there was a significant difference between the monologic prepared tasks and every other task and between the dialogic general topic and the dialogic pre-selected topic (p < .05). This study suggests that the type of speaking task conditions speakers’ repertoire of markers, including certainty markers. Pérez-Paredes & Bueno (forthcoming) looked at how certainty stance adverbs were employed during the picture description task in the LINDSEI and the extended LOCNEC (Aguado et al., 2012). In particular, the authors discussed the contexts of use of obviously, really and actually by native and non-native speakers across the same speaking task in the four datasets when expressing the range of meanings associated with certainty. The authors found that different groups of speakers used these adverbs differently, both quantitatively and qualitatively. Our research seeks to expand the findings in Gablasova et al. (2015) and Pérez-Paredes & Bueno (forthcoming) and examine the uses of certainty adverbs across the L1s, proficiency and tasks represented in the Trinity Lancaster Corpus. We believe that the use of this corpus, together with the findings from the LINDSEI, will help us reach a better understanding of the uses of certainty adverbs in spoken learner language.

Pascual Pérez-Paredes is a Lecturer in Research in Second Language Education at the Faculty of Education, University of Cambridge. His main research interests are learner language variation, the use of corpora in language education and corpus-assisted discourse analysis.

María Belén Díez-Bedmar is Associate Professor at the University of Jaén (Spain). Her main research interests include Learner Corpus Research, error-tagging, the learning of English as a Foreign Language, language testing and assessment, the CEFR and CMC.  She is currently involved in national and international corpus-based projects.

  7. Emerging verb constructions in spoken learner English

Ute Römer and James Garner, Georgia State University

Recent research in first language (L1) and second language (L2) acquisition has demonstrated that we learn language by learning constructions, defined as conventionalized form-meaning pairings. While studies in L2 English acquisition have begun to examine construction development in learner production data, these studies have been based on rather small corpora. Using a larger set of data from the Trinity Lancaster Corpus (TLC), this study investigates how verb-argument constructions (VACs; e.g. ‘V about n’) emerge in the spoken English of L2 learners at different proficiency levels. We will systematically and exhaustively extract a small set of VACs (‘V about n’, ‘V for n’, ‘V in n’, ‘V like n’, and ‘V with n’) from the L1 Italian and L1 Spanish subsets of the TLC, separately for three CEFR proficiency levels. For each VAC and L1-proficiency combination (e.g. Italian-B1), we will create frequency-sorted verb lists, allowing us to determine how learners’ verb-construction knowledge develops with increasing proficiency. We will also examine in what ways VAC emergence in the TLC data is influenced by VAC usage as captured in a large native-speaker reference corpus (the BNC). We will use chi-square tests to compare VAC type and token frequencies across L1 subsets and proficiency levels. We will use path analysis (a type of structural equation modeling) including the predictor variables L1 status, proficiency level, and BNC usage information to gain insights into how learner characteristics and variables concerning L1 construction usage affect the emergence of the target VACs in spoken L2 learner English.

Ute Römer is currently Assistant Professor in the Department of Applied Linguistics and ESL at Georgia State University. Her research interests include corpus linguistics, phraseology, second language acquisition, discourse analysis, and the application of corpora in language teaching. She serves on a range of editorial and advisory boards of professional journals and organizations, and is general editor of the Studies in Corpus Linguistics book series.

James Garner is currently a PhD student in the Department of Applied Linguistics and ESL at Georgia State University. His current research interests include learner corpus research, phraseology, usage-based second language acquisition, and data-driven learning.

  8. Verb-argument constructions in Chinese EFL learners’ spoken English production

Jiajin Xu and Yang Liu, Beijing Foreign Studies University

The widespread recognition of the usage-based approach to constructions has made Corpus Linguistics a most viable methodology to scrutinise such frequent morpho-syntactic patterns as verb-argument constructions (VACs) in learner language. The present study attempts to examine the use of VACs in Chinese EFL learners’ spoken English. Our focus will be on the semantics of the verbal constructions in light of collostructional statistics (Stefanowitsch & Gries, 2003) as well as the comparisons across learners’ proficiency levels and task types. Twenty VACs were collected from COBUILD Grammar Patterns 1: Verbs (Francis, Hunston & Manning, 1996). On the basis of the retrieved VAC concordances from the Trinity Lancaster Corpus, the semantic prototypicality of the VACs will be analysed according to the collocational strength of verbs with their host constructions. Chinese EFL learners will be compared with native speakers, and comparisons will also be made across task types. It is hoped that our findings will shed light on Chinese EFL learners’ knowledge of VACs and the crosslinguistic influence that impacts verb semantics of learners’ spoken English. Meanwhile, we also consider language proficiency and task type as potential factors that may account for the differences across CEFR groups based on the comparisons within Chinese EFL learners.

Jiajin Xu is Professor of Linguistics at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University as well as secretary general and a founding member of the Corpus Linguistics Society of China. His research interests include discourse studies, second language acquisition, contrastive linguistics and translation studies, and corpus linguistics.

Yang Liu is currently a PhD candidate at Beijing Foreign Studies University. His research focus is on the corpus-based study of construction acquisition of Chinese EFL learners.

Introducing a new project with the British Library

Since 2012 the BBC has been working with the British Library to build a collection of intimate conversations from across the UK in the BBC Listening Project. Through its network of local radio stations, and with the help of a travelling recording booth, the BBC has captured, in high-quality audio, many conversations between people who are well known to one another on a range of topics.

For the past two years we have been discussing with the BBC and the British Library the possibility of using these recordings as the basis of a large-scale extension of our spoken BNC corpus. The Spoken BNC2014 has been built so far to reflect language in intimate settings, with recordings made in the home. This has led to a large and very useful collection of data but, without the resources of an organization such as the BBC, we were not able to roam the country with a sound recording booth to sample language from John o’Groats to Land’s End! By teaming up with the BBC and British Library we can supplement this very useful corpus of data, which is strongly focused on a ‘hard to capture’ context (intimate conversations in the home), with another type of data: intimate conversations in a public situation, sampled from across the UK.

Another way in which the Listening data should prove helpful to linguists is that the data itself was captured in a recording studio as high quality audio recordings. Our hope is that a corpus based on this material will be of direct interest and use to phoneticians.

We have recently concluded our discussions with the British Library, which is archiving this material, and signed an agreement which will see CASS undertake orthographic transcription of the data. Our goal is to provide a high-quality transcription that will be of use both to linguists and to members of the public who may wish to browse the collection. In doing this we will be building on our experience of producing the Trinity Lancaster Corpus of Spoken Learner English and the Spoken BNC2014.

We take our first delivery of recordings at the beginning of March and are very excited at the prospect of lifting the veil a little further on the fascinating topic of everyday conversation and language use. The plan is to transcribe up to 1000 of the recordings archived at the British Library. We will also be working to time-align the transcriptions with the sound recordings, and we are working closely with our strong phonetics team in the Department of Linguistics and English Language at Lancaster University to begin to assess the extent to which this new dataset could facilitate new work, for example on the accents of the British Isles.

Our partners in the British Library are just as excited as we are. Jonnie Robinson, Lead Curator for Spoken English at the British Library, says: ‘The British Library is delighted to enable Lancaster to make such innovative use of the Listening Project conversations and we look forward to working with them to make the collection more accessible and to enhance its potential to support linguistic and other research enquiries’.

Keep an eye on the CASS website and Twitter feed over the next couple of years for further updates on this new project!

Analysing Corporate Communications

Detecting the structure of annual financial reports and extracting their contents for further corpus analysis has never been easier. The UCREL Corporate Financial Information Environment (CFIE) project and CASS’ Corporate Communications sub-project have now released the CFIE-FRSE Final Report Structure Extractor: a desktop application to detect the structure of UK Annual Reports and extract the reports’ contents on a section level. This extraction step is vital for the analysis of UK reports, which adopt a much more flexible structure than the US equivalent 10-Ks. The CFIE-FRSE tool works as a desktop version of our CFIE-FRSE Web tool https://cfie.lancaster.ac.uk:8443/.

The tool provides batch extraction and analysis of PDF annual report content for English. Crucially, our approach preserves the structure of the underlying report (as represented by the document table of contents) and therefore offers clear delineation between the narrative and financial statement components, as well as facilitating analysis of the narrative component on a schedule-by-schedule basis.

The tool was trained using more than 10,000 UK annual reports. Extraction accuracy exceeds 95% against manual validations, and large-sample tests confirm that extracted content varies predictably with economic and regulatory factors.

Accessing the tool:

The tool is available for direct download from the GitHub link below:

https://drelhaj.github.io/CFIE-FRSE/

GitHub Repository:

https://github.com/drelhaj/CFIE-FRSE

The CFIE-FRSE tool:

  • Detects the structure of UK Annual Reports by identifying the key sections and their start and end pages, and extracts the contents.
  • Extracts the text of each section in a plain text file format.
  • Splits the text of each section into sentences using Stanford Sentence Splitter.
  • Provides a Section Classification mechanism to detect the type of the extracted section.
  • Each extracted section will be annotated with a number between 0 and 8 as follows:
    Header Type   Header
    1             Chairman’s Statement
    2             CEO Review
    3             Corporate Governance Report
    4             Directors Remuneration Report
    5             Business Review
    6             Financial Review
    7             Operating Review
    8             Highlights
    0             A section that is none of the above
  • The tool uses Levenshtein distance, other similarity metrics and synonyms for section classification. For example, ‘Chairman’s letter’ and ‘letter to shareholders’ can still be detected as a Type 1 section (Chairman’s Statement).
  • The analysis results of the uploaded files or reports can be found in a subdirectory that follows the pattern of “FileName_Analysis”
    • For example, if you are uploading a file called XYZCompany.pdf, the results will be in subdirectory called XYZCompany_Analysis
    • Analysis outputs are saved in Comma Separated Value (CSV) file format which can be opened using any spreadsheet editor.
  • The tool provides more fields in the Sections_Frequencies.csv file, which can be found in the Analysis subdirectory. The new fields are:
    • Start and end pages of each section.
    • Readability of each extracted section and of the whole report, using the Fog and Flesch readability metrics.
    • Keyword frequencies, using a preloaded set of keywords for Forward Looking, Positivity, Negativity and Uncertainty.
    • Report Year: this will only work if the year is part of the file name, e.g. “XYZCompany_2015.pdf”.
    • Performance Flag: shows 1 if a section is a performance section, 0 otherwise.
    • Strategy Flag: shows 1 if a section is a strategic section, 0 otherwise.
    • Booklet Flag: shows 1, 2 or 3 if a header is a booklet layout, 0 otherwise. Our tool is unable to process booklet annual reports (those reports where two pages are combined into one PDF page). The numbers 1-3 indicate how confident the system is: 1 means a booklet layout is suspected, 3 means it is definitely a booklet layout.
  • The keyword lists (Forward Looking, Uncertainty, Positivity and Negativity) have been updated to eliminate duplicates and encoding errors.
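
As a rough illustration of the section classification described in the list above, the following Python sketch matches a header against a few candidate names using normalised Levenshtein distance. It is not the tool's actual implementation: the synonym lists, the 0.25 threshold and the function names are assumptions made for this example, and only the type numbers are taken from the list above.

```python
from typing import Dict, List

# Hypothetical synonym lists for a few section types; the tool's real
# lists are not given in this post.
SECTION_TYPES: Dict[int, List[str]] = {
    1: ["chairman's statement", "chairman's letter", "letter to shareholders"],
    2: ["ceo review", "chief executive's review"],
    6: ["financial review"],
    8: ["highlights"],
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify_header(header: str, max_ratio: float = 0.25) -> int:
    """Return the closest section type, or 0 if no name is close enough.

    max_ratio is an assumed cut-off on distance divided by the longer length.
    """
    text = header.strip().lower().replace("’", "'")
    best_type, best_ratio = 0, max_ratio
    for section_type, names in SECTION_TYPES.items():
        for name in names:
            ratio = levenshtein(text, name) / max(len(text), len(name), 1)
            if ratio <= best_ratio:
                best_type, best_ratio = section_type, ratio
    return best_type
```

Under these assumptions, classify_header("Chairmans Statement") and classify_header("Letter to Shareholders") both return 1, while an unrelated header such as "Independent Auditor's Report" returns 0.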

How to run the software:

  • [MS Windows]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run (double-click) the runnable.bat file.
  • [Linux/Ubuntu]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run runnable.sh. To do so, cd to the directory where runnable.sh is located and type the following command: ./runnable.sh
  • [Unix/Mac]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run runnable.sh. To do so, cd to the directory where runnable.sh is located and type the following command: sh runnable.sh or bash runnable.sh
  • The analysis output (one directory for each PDF file) will be found in the pdfs directory.
  • Please do not delete any of the files or directories or change their structure.
  • You can add or delete PDF files from the pdfs directory. You can also edit userKeywords.txt to include your own keyword list: simply empty the file and insert one keyword (or keyphrase) on each line.
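
The naming conventions described above (one FileName_Analysis output directory per report, and a report year taken from the file name) can be sketched in Python as follows. This is an illustration only: the helper names and the regular expression are assumptions, and the tool's own year detection may work differently. It also assumes the output directories sit alongside the PDF files in the pdfs directory.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed pattern: a stand-alone four-digit year starting with 19 or 20.
YEAR_PATTERN = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def analysis_dir_for(pdf_path: Path) -> Path:
    """Expected output directory for one report: <FileName>_Analysis beside the PDF."""
    return pdf_path.with_name(pdf_path.stem + "_Analysis")


def report_year(pdf_path: Path) -> Optional[int]:
    """Four-digit year found in the file name, or None if there is none."""
    match = YEAR_PATTERN.search(pdf_path.stem)
    return int(match.group(1)) if match else None


def summarise(pdf_dir: Path) -> None:
    """Print each PDF's inferred year and whether its output directory exists."""
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out_dir = analysis_dir_for(pdf)
        status = "found" if out_dir.is_dir() else "missing"
        print(f"{pdf.name}: year={report_year(pdf)}, output {status} at {out_dir}")
```

For example, calling summarise(Path("pdfs")) from the cloned repository after a run would list each report with its inferred year and whether its analysis directory has been written.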

Related Papers:

  • El-Haj, Mahmoud, Rayson, Paul, Young, Steven, and Walker, Martin. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In The 9th edition of the Language Resources and Evaluation Conference, 26-31 May 2014, Reykjavik, Iceland.
    Available at: http://ucrel.lancs.ac.uk/cfie/El-HajEtAl_lrec14.pdf

License:

The tool is available under the GNU General Public License.

More about the CFIE research:

For more information about the projects’ output, web-tools, resources and contact information, please visit our page below:

http://ucrel.lancs.ac.uk/cfie/

 

Recent Research into CEO Compensation

On Wednesday 18th January, the CFA Society United Kingdom (CFA UK) hosted a breakfast meeting at Innholders’ Court (London, EC4R 2RH) to discuss findings of a recently completed CFA UK-funded research project examining CEO compensation across the FTSE-350 from 2003 to 2015. CFA UK represents the interests of around 12,000 investment professionals in the UK and the report received widespread press coverage over the Christmas period including coverage from the BBC, The Times, The Guardian, and Financial Times.

The report (co-authored with Dr Weijia Li, Lancaster University Management School and available to download at: https://www.cfauk.org/media-centre/cfa-uk-executive-remuneration-report-2016) contributes to the executive remuneration debate by providing independent statistical evidence highlighting a limited association between economic value creation and executive pay.

Among other findings, the research suggests that despite relentless pressure from regulators and governance reformers over the last two decades to ensure closer alignment between executive pay and performance, the association between CEO pay and fundamental value creation in the UK remains weak at best.

At the heart of the problem is the disconnect between the performance measures that are widely employed in executive remuneration contracts, such as earnings per share (EPS) growth and total shareholder return (TSR), and the extent to which these metrics provide reliable information on periodic value creation. Economic theory clearly demonstrates that EPS growth and TSR provide poor proxies for value creation; and this insight is confirmed in the data, with correlations below 30% documented between these measures and more sophisticated value-based performance metrics such as residual income and economic profit that include an explicit charge for invested capital.

The work also reveals that mandatory pay-related annual report disclosures designed to enhance the transparency of executive remuneration arrangements have become increasingly complicated and hard to read (measured by the Fog index), to the extent that even relatively sophisticated consumers of firms’ published reports struggle to identify basic information such as total compensation paid to the CEO during the reporting period.

Attendees at the event comprised representatives from a range of City institutions including CFA UK, The Investment Association, SVM Asset Management, RPMI Railpen, Schroders, PIRC, Aberdeen Asset Management, JP Morgan Asset Management, Kepler Cheuvreux, Legal & General Investment Management, Fidelity International, Willis Towers Watson, and the Pensions and Lifetime Savings Association.

Will Goodhart (Chief Executive, CFA UK) welcomed attendees and Natalie Winterfrost (Aberdeen Asset Management) provided context for the research. After a brief summary of the research purpose, methodology and main findings, plus follow-up comments from steering committee members Prof Brian Main (Edinburgh University), James Cooke (SVM Asset Management), and Alasdair Wood (Willis Towers Watson), attendees engaged in a lively discussion concerning the report’s conclusions and their implications for executive compensation policy in the UK. The discussions will help CFA UK to formulate its engagement strategy with companies and institutional investors to improve the degree of alignment between pay and value generation.

Registration now open for Lancaster Summer Schools in Corpus Linguistics and other Digital Methods!

We are pleased to announce that we will be running our hugely popular summer schools again in 2017! We will be running six free training events that cover the techniques of corpus linguistics, computational analysis of language and geographical information systems. The schools include both lectures and practical sessions that introduce the latest developments in the field and practical applications of cutting-edge analytical techniques. The summer schools are taught by leading experts in the field both from CASS and other departments and institutions (CASS Challenge Panel).

The summer schools running in 2017 are:

  • Corpus linguistics for Language studies
  • Corpus linguistics for Social Science
  • Corpus linguistics for the Humanities
  • Statistics for Corpus linguistics
  • Geographical information systems for the Digital Humanities
  • Corpus-based NLP

The summer schools will take place over 4 days (27th – 30th June 2017) and are free to attend. Click here for more information and to register.

Introducing the CASS Guided Reading Project (Part 2)

In the first blog entry, we noted that there is substantial variability in our understanding of how guided reading is thought to foster specific literacy skills. Further, we explained how our use of corpus methods will enable us to identify a wide range of teacher strategies used in guided reading. That is a crucial first step in providing a more refined understanding of the range of teacher strategies that are used. A natural ‘next step’ is to determine which strategies are most effective: that is, which strategies result in positive outcomes. Again, corpus search tools provide an efficient and accurate method for achieving this.

Which language and literacy skills are targeted by guided reading?

Guided reading has the potential to develop a range of essential reading skills. These skills are numerous, so teachers may choose to target different skills according to the group’s reading ability. For example, 4-year-olds are only just beginning to develop their ability to read words on a page, so teachers are more likely to focus on ways of improving their accurate translation of print into word meanings (i.e., decoding skills and vocabulary). Conversely, older children are able to read words relatively well, so teachers are more likely to target improving an understanding of the language that has been accessed from the printed word (i.e., reading comprehension skills).

How can we measure potential outcomes from guided reading?

Compared to normal ‘control’ reading sessions, children who undertake a series of guided reading sessions typically display greater improvements in standardised assessments of reading skills (see Burkins & Croft, 2010; Ford, 2015).

However, such longitudinal assessments do not provide a measure of the effect that specific teacher strategies have on the quality of the responses by the child. Some recent studies of shared reading (which apply similar scaffolding strategies to guided reading, but involve the sharing of an enlarged book rather than providing a copy to each child) have attempted to investigate this by parsing through the children’s responses and coding for features of interest. For example, Justice and colleagues (2013) used this method to report that children responded to rich teacher input by providing more multi-clause utterances themselves (e.g., coordinated clauses: He read the book and watched TV; subordinated clauses: He read the book because he enjoys reading). However, as noted in the first blog, this means of coding is arduous and time consuming. Instead, we can use corpus search methods to uncover a wider range of language features more reliably and speedily. That enables us to analyse a larger number of child-teacher interactions and to study these interactions across a range of contexts and in relation to a number of different factors such as (i) age, (ii) reading ability, (iii) socio-economic status, (iv) gender, (v) reading motivation, and (vi) teacher experience. These will be discussed separately in a future blog.

Other research into shared reading has used some simple corpus search methods to measure the quality of response. Those studies measured whether the average length of an utterance is one word or multiword (e.g., Zucker and colleagues, 2010). However, such a measure is limited in richness of information, and only applies to very young children (up to around 5 years of age). Our work at CASS will draw on the work from shared reading and extend that knowledge base by providing more advanced corpus search queries that enable a fine-grained analysis of the quality of children’s responses. For example, we can analyse the quality of responses in terms of grammatical features, vocabulary diversity, and syntactic structure, rather than just on length of utterance.
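To make the contrast between a length-only measure and a richer one concrete, here is a minimal Python sketch. It is not one of the corpus queries used in the CASS project: the function names, the simple regular-expression tokeniser and the three example utterances are all assumptions made up for illustration. It computes mean utterance length and the share of multiword utterances (the kind of length-based measure described above) alongside a type-token ratio, one common and fairly crude indicator of vocabulary diversity.

```python
import re


def tokenize(utterance):
    """Lowercase an utterance and split it into simple word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


def response_measures(utterances):
    """Summarise a list of child utterances (hypothetical helper).

    Returns mean length in words, the share of utterances longer than
    one word, and the type-token ratio across all utterances.
    """
    token_lists = [tokenize(u) for u in utterances]
    all_tokens = [tok for toks in token_lists for tok in toks]
    n_utts = len(token_lists)
    return {
        "mean_length": len(all_tokens) / n_utts if n_utts else 0.0,
        "multiword_share": (
            sum(len(toks) > 1 for toks in token_lists) / n_utts if n_utts else 0.0
        ),
        "type_token_ratio": (
            len(set(all_tokens)) / len(all_tokens) if all_tokens else 0.0
        ),
    }


# Invented example utterances, not taken from the project's recordings.
example = [
    "yes",
    "he read the book",
    "he read the book because he enjoys reading",
]
for name, value in response_measures(example).items():
    print(f"{name}: {value:.2f}")
```

Type-token ratio is sensitive to how much text is sampled, so a real analysis would need to control for the amount of speech being compared.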

In an upcoming blog, we will provide a closer insight into the specific corpus search measurements that we are using to identify teacher strategies (as introduced in the first blog), and their effect on the quality of responses by children (as introduced in the current blog).

An update on data collection

The CASS guided reading project aims to create a large corpus made up of a total of 100 guided reading sessions that each last between 15 and 35 minutes. So far, we have recorded around 80% of our target number of sessions, and the corpus is projected to reach between 400,000 and 500,000 words!

All recordings have been at primary schools in the UK with children aged between 4 and 10 (Y1 to Y6). Recordings are made in a naturalistic manner such that they are non-invasive to the normal proceedings of a lesson. A voice recorder is set up, as well as a video camera so that we can identify individual speakers if the audio is unclear.

A big shout out to all the wonderful schools and teachers who have helped us so far: Ryelands CE, Mereside, Baines Endowed CE, Dolphinholme CE, Ellel St John’s CE, Halton St Wilfrid’s CE, Pilling St John’s CE, and Kirkland and Catterall Saint Helen’s CE. These schools have been so welcoming and their contribution to the research is invaluable! Also, a big thanks to our ‘Queen of transcription’, Ruth Avon, who has worked tirelessly to keep the transcribing of the recordings well on track for a complete analysis in early 2017.

References

Burkins, J. & Croft, M. M. (2010). Preventing misguided reading: new strategies for guided reading teachers. Thousand Oaks CA: Corwin.

Ford, M. P. (2015). Guided Reading: What’s New, and What’s Next? North Mankato, MN: Capstone.

Justice, L. M., McGinty, A. S., Zucker, T., Cabell, S. Q., & Piasta, S. B. (2013). Bi-directional dynamics underlie the complexity of talk in teacher–child play-based conversations in classrooms serving at-risk pupils. Early Childhood Research Quarterly, 28, 496–508.

Zucker, T. A., Justice, L. M., Piasta, S. B., & Kaderavek, J. N. (2010). Preschool teachers’ literal and inferential questions and children’s responses during whole-class shared reading. Early Childhood Research Quarterly, 25, 65–83.

Introducing the CASS Guided Reading Project (Part 1)

In collaboration with the Department of Psychology, CASS is investigating the critical features of guided reading that can benefit the language and literacy skills of typically developing children.

What is guided reading?

Guided reading is a technique used by teachers to support literacy development. The teacher works with a small group of children, typically not more than 6, who are grouped according to ability and who work together on the same text. This ability-grouping enables the teacher to focus on the specific needs of those children, and to provide opportunities for them to develop their understanding of what they read through discussion, as well as their reading fluency. In this project we are investigating the features of effective guided reading, with a particular emphasis on reading comprehension.

Features of guided reading

Teachers aim to bridge the gap between children’s current and potential ability. Research indicates that this is best achieved by using methods that facilitate interaction, rather than by providing explicit instruction alone (e.g., Pianta et al., 2007).

The strategies that teachers can use to support and develop understanding of the text are best described as lying on a continuum, from low challenge strategies – for example, asking children simple yes/no or closed-answer questions – to high challenge strategies that might require children to explain a character’s motivation and evaluate the text. Low challenge strategies pose more limited constraints on possible answers: they may simply require children to repeat back part of the text or provide a one-word response, such as a character’s name. High challenge strategies provide greater opportunity for children to express their own interpretation of the text.

Low challenge questions can be used by the teacher to assess children’s basic level of understanding and are also a good way to encourage children to participate in the session. High challenge questions assess a deeper understanding and more sophisticated comprehension skills. Skilled teachers will adapt questions and their challenge depending on the group and individual children’s level of understanding and responsiveness, with the intent of gradually increasing the responsibility for the children to take turns in leading the discussion. This technique is used to scaffold the discussion.

Our investigation: How is guided reading effective?

Previous studies observing guided reading highlight substantial variability in what teachers do and, therefore, in our understanding of how guided reading can be used to best foster language and literacy skills. A more fine-grained and detailed examination of teacher input and its relation to children’s responses is needed to determine the teacher strategies that are most effective in achieving specific positive outcomes (see Burkins & Croft, 2010; Ford, 2015).

Previous research on this topic has typically taken the form of observational studies, in which researchers have had to laboriously parse and hand-code transcriptions of the teacher-children interactions (a corpus) to identify teacher strategies of interest. Because this is a long and painstaking process, it limits the size of the corpus to one that can be coded within a realistic time window. In this project, we aim to maximise interpretation of these naturalistic classroom interactions using powerful corpus search tools. These enable precise computer searches for a wide range of language features, and are much faster and more reliable than hand-coding. This enables us to create and explore a much larger corpus of guided reading sessions than in previous studies, making a fine-grained analysis possible. For an introduction to corpus search methods, check out this CASS document.

Future blogs will provide more detail about the specific corpus search measurements that CASS are using to identify what makes for effective guided reading. The next (upcoming) blog, however, will explain the motivation for using corpus methods to investigate the effective outcomes of guided reading.

Meet the Author of this blog: Liam Blything

Since July 2016, I have been working as a Senior Research Associate on the CASS guided reading project. My Psychology PhD, awarded by Lancaster University, focused on language acquisition. It is a great privilege to be working on such an exciting project that answers psychological questions with all these exciting and advanced corpus linguistics methods. I look forward to providing future updates!

References

Burkins, J. & Croft, M. M. (2010). Preventing misguided reading: new strategies for guided reading teachers. Thousand Oaks CA: Corwin.

Pianta, R. C., Belsky, J., Houts, R., Morrison, F., & the National Institute of Child Health and Human Development Early Child Care Research Network. (2007). Opportunities to learn in America’s elementary classrooms. Science, 315, 1795–1796.

Ford, M. P. (2015). Guided Reading: What’s New, and What’s Next? North Mankato, MN: Capstone.

Controlling the scale and pace of immigration: changes in UK press coverage about migration

The issue of immigration prominently featured in debates leading up to the June 2016 EU Referendum vote. It was argued that too many people were entering the UK, largely from other EU member states. Politicians and media also talked about ‘taking back control’—notably in the contexts of deciding who can enter Britain and enforcing borders. But, as our new Migration Observatory report ‘A Decade of Immigration in the British Press’ reveals through corpus linguistic methods, such language wasn’t necessarily new: in fact, under the coalition government from 2010-2015, the press was increasingly casting migration in terms of its scale or pace. And, the relative importance of ‘limiting’ or ‘controlling’ migration rose over this period, too.

Our report aimed to understand how British press coverage of immigration had changed in the decade leading up to the May 2015 General Election. We built upon previous research done at Lancaster University (headed by CASS Deputy Director Paul Baker) into portrayals of migrant groups. Our corpus of 171,401 items comes from all 19 national UK newspapers (including Sunday versions) that continuously published between January 2006 and May 2015. Using the Sketch Engine, we identified the kinds of modifiers (adjectives) and actions (verbs) associated with the terms ‘immigration’ and ‘migration’.

The modifiers that were most frequently associated with either of these terms included ‘mass’ (making up 15.7% of all modifiers appearing with either word), ‘net’ (15.6%), and ‘illegal’ (11.9%). Closer examination of the top 50 modifiers revealed a group of words related to the scale or pace of migration: in addition to ‘mass’ and ‘net’, these included terms such as ‘uncontrolled’, ‘large-scale’, ‘high’, and ‘unlimited’. Grouping these terms together and tracking their proportion of all modifiers, compared to those related to illegality (another prominent way of referring to immigrants), reveals how these terms made up an increasingly large share of modifiers under both the Labour and coalition governments since 2006. Figure 1 shows how these words made up nearly 40% of all modifiers in 2006, but over 60% in the first five months of 2015. Meanwhile, the share of modifiers referring to legal aspects of immigration (‘illegal’, ‘legal’, ‘unlawful’, or ‘irregular’) declined from 22% in 2006 to less than 10% in January-May 2015.

Figure 1.
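For readers curious about how a grouping like this turns into percentages, the following is a minimal Python sketch and not the procedure used in the report. The scale-or-pace set contains only the example terms named in this post (the report's full list is not given here), the legality set uses the four terms mentioned above, and the counts are invented toy numbers rather than figures from our corpus.

```python
from collections import Counter

# Only the example terms named in this post; an assumption, not the
# report's complete category lists.
SCALE_OR_PACE = {"mass", "net", "uncontrolled", "large-scale", "high", "unlimited"}
LEGALITY = {"illegal", "legal", "unlawful", "irregular"}


def category_shares(modifier_counts):
    """Return each category's share of all modifier occurrences."""
    total = sum(modifier_counts.values())
    if total == 0:
        return {"scale_or_pace": 0.0, "legality": 0.0}
    scale = sum(n for mod, n in modifier_counts.items() if mod in SCALE_OR_PACE)
    legal = sum(n for mod, n in modifier_counts.items() if mod in LEGALITY)
    return {"scale_or_pace": scale / total, "legality": legal / total}


# Invented counts for a single hypothetical year.
toy_counts = Counter({"mass": 30, "net": 20, "illegal": 25, "european": 15, "high": 10})
shares = category_shares(toy_counts)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

Running the same calculation on the counts for each year in turn would give one point per year for each category, which is the kind of series a chart like Figure 1 plots.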

Another way of examining this dimension of ‘scale’ or ‘pace’ is to look at the kinds of actions (verbs) done to ‘immigration’ or ‘migration’. For example, in the sentences ‘the government is reducing migration’ and ‘we should encourage more highly-skilled immigration’, the verbs ‘reduce’ and ‘encourage’ signal some kind of action being done to ‘immigration’ and ‘migration’. In a similar way to Figure 1, we looked at the most frequent verbs associated with either term. A category of words expressing efforts to limit or control movement—what we call ‘limit’ verbs in the report—emerged from the top 50 verbs. These included examples such as ‘control’, ‘tackle’, ‘reduce’, and ‘cap’.

Figure 2 shows how the overall frequency of these limit verbs, indicated by the solid line, rose by about five times between 2006 and the high point in 2014—most notably from 2013. But, as a share of all verbs expressing some action towards ‘immigration’ or ‘migration’, this category was consistently making up 30-40% from 2010 onwards. This suggests that, although these kinds of words weren’t that frequent in absolute terms until 2014, the press had already started moving towards using them from 2010.

Figure 2.
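The difference between absolute frequency and share can be illustrated with a small Python sketch. The function names and the numbers are hypothetical and invented for illustration; they are not taken from the report. The point is simply that a category can become several times more frequent while its share of all verbs barely moves, if the total number of verbs rises alongside it.

```python
def limit_verb_summary(counts_by_period):
    """Map each period to (limit-verb count, share of all verbs)."""
    summary = {}
    for period, (limit_count, all_verbs) in counts_by_period.items():
        share = limit_count / all_verbs if all_verbs else 0.0
        summary[period] = (limit_count, share)
    return summary


def fold_change(earlier_count, later_count):
    """How many times larger the later count is than the earlier one."""
    if earlier_count == 0:
        raise ValueError("earlier_count must be non-zero")
    return later_count / earlier_count


# Invented toy counts: (limit verbs, all verbs acting on the search terms).
toy = {"earlier period": (10, 30), "later period": (50, 140)}
summary = limit_verb_summary(toy)
for period, (count, share) in summary.items():
    print(f"{period}: {count} limit verbs, {share:.1%} of all verbs")
print("fold change:", fold_change(10, 50))
```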

These results show how the kind of language around immigration has changed since 2006. Corpus methods allow us to look at a large amount of text—in this case, over a significant period of time in British politics—in order to put recent rhetoric in its longer context. By doing so, researchers contribute concrete evidence about how the British press has actually talked about migrants and migration. Such evidence opens timely and important debates about the role of the press in public discussion (how does information presented through media impact public opinion?) and the extent to which press outputs should be scrutinised.

About the author: William Allen is a Research Officer with The Migration Observatory and the Centre on Migration, Policy, and Society (COMPAS), both based at the University of Oxford. His research focuses on the ways that media, public opinion, and policymaking on migration interact. He is also interested in the ways that migration statistics and research evidence are used in non-academic settings, especially through data visualisations.

New CASS PhD student!

CASS is delighted to welcome new PhD student Andressa Gomide to the centre, where she will be working on data visualization in corpus linguistics. Continue reading to find out more about Andressa!


I am in the first year of my PhD in Linguistics, which is focused on data visualizations for corpus tools. Being a research student at CASS, I am looking forward to gaining a better understanding of how different fields of study use corpus tools in their research.

I’ve been involved with corpus linguistics since 2011, when I started my undergraduate research program on learner corpora. Since then, I have developed a strong interest in corpus studies, which led me to devote my BA and my MA to this theme. I completed both my BA and my MA at the Universidade Federal de Minas Gerais in Brazil.

Aside from my interest in linguistics, I also enjoy outdoor activities such as cycling and hiking.

CASS goes to the Wellcome Trust!

Earlier this month I represented CASS in a workshop, hosted by the Wellcome Trust, which was designed to explore the language surrounding patient data. The remit of this workshop was to report back to the Trust on what might be the best ways to communicate to patients about their data, their rights regarding their data, and issues surrounding privacy and anonymity. The workshop comprised nine participants who all communicated with the public as part of their jobs, including journalists, bloggers, a speech writer, a poet, and a linguist (no prizes for guessing who the last of these was…). On a personal note, I had prepared for this event from the perspective of a researcher of health communication. However, the backgrounds of the other participants meant that I realised very quickly that my role in this event would not be so specific, so niche, but was instead much broader, as “the linguist” or even “the academic”.

Our remit was to come up with a vocabulary for communication about patient data that would be easier for patients to understand. As it turned out, this wasn’t too difficult, since most of the language surrounding patient data is waffly at its best, and overly technical and incomprehensible at its worst. One of the most notable recommendations we made concerned the phrase ‘patient data’ itself, which we thought might carry connotations of science and research, and perhaps disengage the public. We therefore suggested that the phrase ‘patient health information’ might sound less technical and more transparent. We undertook a series of tasks which ranged from sticking post-it notes on whiteboards and windows, to role play exercises and editing official documents and newspaper articles. What struck me, and what the diversity of these tasks demonstrated particularly well, was how the suitability of our suggested terms could only really be assessed once we took the words off the post-it notes and inserted them into real-life communicative situations, such as medical consultations, patient information leaflets, newspaper articles, and even talk shows.

The most powerful message I took away from the workshop was that close consideration of linguistic choices in the rhetoric surrounding health is vital for health care providers to improve the ways that they communicate with the public. To this end, as a collection of methods that facilitate the analysis of large amounts of authentic language data in and across a variety of texts and contexts, corpus linguistics has an important role to play in providing such knowledge in the future. Corpus linguistic studies of health-related communication are currently small in number, but continue to grow apace. Although the health-related research projects being undertaken within CASS, such as Beyond the Checkbox and Metaphor in End of Life Care, go some way to showcasing the rich fruits that corpus-based studies of health communication can bear, there is still a long way to go. In particular, future projects in this area should strive to engage consumers of health research not only in terms of our findings, but also the (corpus) methods that we have used to get there.