Environmental DH Seminar: Jim Clifford & Jacob Polay, "Solving OCR: Using olmOCR to Follow Commodities across the British World"

Wednesday 11 March 2026, 3:00pm to 4:00pm

Venue

Online, Lancaster, United Kingdom, LA1 4YD

Open to

External Organisations, Postgraduates, Public, Staff, Undergraduates

Registration

Free to attend - registration required

Registration Info

https://www.eventbrite.co.uk/e/solving-ocr-using-olmocr-to-follow-commodities-across-the-british-world-tickets-1856518660289

Event Details

Join the Lancaster-Manchester Environmental DH Seminar for a presentation by Jim Clifford and Jacob Polay (University of Saskatchewan) on how recent advances in open-source Optical Character Recognition (OCR) technologies are transforming the possibilities of digital history.

Abstract

OCR has long posed a challenge for historians working with digitised archives. Irregular typefaces, the long s, uneven print quality, and low-resolution scans, particularly those derived from microfilm, have severely constrained the use of text mining and computational analysis in historical research. In this talk, Clifford discusses his work with olmOCR, an energy-efficient, low-cost, and open-source OCR system developed by Allen AI, which surpasses the performance of expensive multimodal large language models.

Clifford and his collaborators have developed a pipeline capable of downloading and processing hundreds of thousands of pages from the Internet Archive. In partnership with Canadiana.org, they are now reprocessing extensive collections to generate high-quality OCR outputs at scale. With cleaner text data, the team is constructing named entity recognition pipelines to identify and interlink people, places, and commodities, ultimately producing Linked Open Data and knowledge graphs that trace the development of extractivist commodity economies across the British World System from the 1650s to the 1960s.

This presentation will introduce the technical foundations of the project, share early findings, and reflect on how improved OCR and data infrastructures can support large-scale, open, and reproducible research in environmental and economic history.

About the Speakers

Jacob Polay is a second-year Master's student in the Department of History at the University of Saskatchewan. His thesis research focuses on adapting small open-weight language models for text mining pipelines designed to work with early modern English texts. He is the developer of Early Modern NER, an open-source named entity recognition tool for early modern English (github.com/polayj/earlymodernner).

Jim Clifford is an Associate Professor of History at the University of Saskatchewan, specializing in environmental history and digital history. His research examines the extractivism of the long nineteenth century that supplied the raw materials fuelling urban industrial growth in Britain and across the British World. He employs digital methods, including GIS, knowledge graphs, and large language models, to trace the transnational environmental transformations driven by industrialization. He is a co-editor of Historical Methods.

About the Environmental Digital Humanities Seminar (EDHS)

The Environmental Digital Humanities Seminar (EDHS) brings together scholars from across the humanities who use digital methods to understand environments past, present, and future. EDHS is inclusive of urban, rural, and suburban spaces and places, and while we explore environments globally, we also showcase local work from and about the North of England.

EDHS is supported by the N8, the Lancaster Data Science Institute, the Digital Humanities Centre at Lancaster, the Centre for Digital Humanities, Cultures, and Media at the University of Manchester, and the MCGIS research group at Manchester.

Organisers

Giulia Grisot (Manchester), Katherine McDonough (Lancaster), Luca Scholz (Manchester), Joanna Taylor (Manchester)

Contact Details

Name: Katherine McDonough
Email: k.mcdonough@lancaster.ac.uk
