The texts in the corpus were all typed from photocopy by transcribers in the Department of Linguistics, Lancaster University. Although labour-intensive and thus expensive, this procedure was necessary because the quality of the images was too low to allow the text to be captured using OCR. This site provides an example of the chaos that results when an OCR capture is taken of one page's image, along with the image from which it was acquired.
The text was inputted in a basic SGML-compatible format that has much in common with HTML. This was for two reasons: firstly, to simplify the encoding for the transcribers, and secondly, to permit automatic mapping to a TEI-conformant SGML/XML encoding or a fully Web-compatible HTML format.
Spelling in the seventeenth century was not fixed, and varied between and even within different individuals' practice. The use of regularised spellings in the corpus was a controversial but necessary step. If corpus analysis packages such as concordancers are to function at an acceptable level of performance, they must not make such elementary mistakes as treating, for instance, Cromwel and Cromwell (two popular spellings of this leader's name) as different words. To avoid this, it is necessary to regularise spelling.
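The effect on a frequency count can be illustrated with a minimal Python sketch. The token list and the one-entry `REGULAR_FORMS` table are invented for illustration and are not drawn from the corpus; only the Cromwel/Cromwell pair comes from the text above.

```python
from collections import Counter

# Invented token sequence containing both spellings of the same name.
tokens = ["Cromwel", "marched", "and", "Cromwell", "spake", "Cromwell"]

# A naive count treats each spelling as a separate word type.
raw_counts = Counter(tokens)

# Hypothetical regularisation table: variant spelling -> standard form.
REGULAR_FORMS = {"Cromwel": "Cromwell"}

# Counting over regularised tokens merges the variants into one type.
regular_counts = Counter(REGULAR_FORMS.get(token, token) for token in tokens)

print(raw_counts["Cromwel"], raw_counts["Cromwell"])  # counts split across spellings
print(regular_counts["Cromwell"])                     # single merged count
```

Without regularisation the occurrences of the name are split across two entries; with it they fall under a single word form, which is what a concordancer needs.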
However, to maximise their utility to the historian, the electronic versions of the newsbooks should remain as close to the originals as possible; the original spellings should therefore not simply be jettisoned. Instead, a solution was adopted that preserves both the original spelling and a standardised, 20th/21st-century Standard English form in the markup. This was accomplished using the <reg> element (short for regularise). From an original text with <reg> elements, it is possible to derive by a very simple algorithm both a regularised version and an original-spelling version, as is demonstrated on these pages by means of the sample data.
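The derivation of the two versions can be sketched in a few lines of Python. This is an illustration only, not the project's own tool: it assumes that each <reg> element carries the original spelling in an `orig` attribute and the regularised form as its content, which the text above does not specify. The sample sentence and the function names are likewise invented.

```python
import re

# Assumed markup shape (not confirmed by the text above):
#   <reg orig="ORIGINAL SPELLING">REGULARISED SPELLING</reg>
REG_PATTERN = re.compile(r'<reg\s+orig="([^"]*)">(.*?)</reg>', re.DOTALL)


def regularised_version(text: str) -> str:
    """Keep each element's content (the regularised form) and drop the tags."""
    return REG_PATTERN.sub(lambda match: match.group(2), text)


def original_version(text: str) -> str:
    """Replace each element with the value of its orig attribute."""
    return REG_PATTERN.sub(lambda match: match.group(1), text)


# Invented sample sentence, not taken from the corpus.
sample = (
    'General <reg orig="Cromwel">Cromwell</reg> hath '
    '<reg orig="beene">been</reg> at Westminster.'
)

print(regularised_version(sample))
print(original_version(sample))
```

Because both forms are stored in the same element, neither version has to be maintained separately: each is produced from the single marked-up text by one substitution pass.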
Thus, the markup scheme as developed allows us to have the best of both worlds, and to gain maximum utility from the corpus.