Data Encoding Scheme



The texts in the corpus were all typed from photocopy by transcribers in the Department of Linguistics, Lancaster University. Although labour-intensive and thus expensive, this procedure was necessary because the quality of the images was too low for the text to be captured using OCR. The project pages link to an example of the chaos that results when OCR capture is attempted on one page's image, and to the image from which it was acquired.

The text was entered in a basic SGML-compatible format that has much in common with HTML. There were two reasons for this: firstly, to simplify the encoding for the transcribers, and secondly, to permit automatic mapping to a TEI-conformant SGML/XML encoding or a fully Web-compatible HTML format.

Basic text elements

The <p> element is used to indicate the extent of paragraphs; headings are encoded at three levels, indicated by <h1>, <h2> and <h3>. The three main typeface variants (roman, italic and gothic) are also encoded: roman is the unmarked default, while italic and gothic are marked with <i> and <go> respectively. The <em> element is used to indicate the contrastive italicisation used in this period for significant words in the sentence, for instance, proper nouns.

Other, less common arrangements of text were handled using the <table> and <poem> elements and their associated subordinate elements (<tr>, <td>; <stanza>, <line>). <hr> is used to indicate lines across the page; <img> is used to indicate the presence of - and link to graphical scans of - illustrations in the text. Page breaks, comments from the transcriber, and unclear text are also indicated in the markup.
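
The mapping to Web-compatible HTML mentioned above can be illustrated with a minimal sketch. It assumes that only the elements with no HTML counterpart (<go>, <poem>, <stanza>, <line>) need rewriting; the target elements, the class names and the function name to_html are hypothetical choices made for this illustration, not the project's actual conversion.

```python
import re

# Hypothetical mapping: corpus-specific element -> (HTML element, class name).
# Elements such as <p>, <h1>, <i>, <em>, <table> and <hr> already exist in HTML
# and are passed through unchanged.
HTML_EQUIVALENTS = {
    "go": ("span", "gothic"),
    "poem": ("div", "poem"),
    "stanza": ("div", "stanza"),
    "line": ("div", "line"),
}

# Matches an opening or closing tag and captures its name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*)?)>")


def to_html(text):
    """Rewrite corpus-specific tags as HTML elements carrying a class."""

    def swap(match):
        closing, name, _attrs = match.groups()
        target = HTML_EQUIVALENTS.get(name.lower())
        if target is None:
            return match.group(0)  # not a corpus-specific tag: leave as is
        element, css_class = target
        if closing:
            return f"</{element}>"
        # Attributes on mapped elements are dropped in this sketch.
        return f'<{element} class="{css_class}">'

    return TAG.sub(swap, text)
```

For example, to_html('<p>The <go>Parliament</go> met.</p>') keeps the <p> element and turns the gothic span into a classed <span>, which a stylesheet could then render in a blackletter face.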

Page from Mercurius Politicus, showing different typefaces including Gothic

Spelling regularisation

Spelling in the seventeenth century was not fixed, and varied between and even within different individuals' practice. The use of regularised spellings in the corpus was a controversial but necessary step. If corpus analysis packages such as concordancers are to function at an acceptable level of performance, they must not make such elementary mistakes as treating, for instance, Cromwel and Cromwell (two popular spellings of this leader's name) as different words. To avoid this, it is necessary to regularise spelling.

However, to maximise their utility to the historian, the electronic versions of the newsbooks should remain as close to the originals as possible; the original spellings should therefore not simply be jettisoned. Instead, a solution was adopted that preserves both the original spelling and a standardised form in present-day Standard English in the markup. This was accomplished using the <reg> element (short for regularise). From an original text with <reg> elements, a very simple algorithm can derive both a regularised version and an original-spelling version, as is demonstrated on these pages by means of the sample data.
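
The derivation of the two versions can be sketched as follows. The text above does not specify the exact form of the <reg> element, so this sketch assumes that the regularised spelling is the element's content and the original spelling is held in an attribute named orig, as in <reg orig="Cromwel">Cromwell</reg>; the function names are likewise hypothetical.

```python
import re

# Assumed form (not confirmed by the text above):
#   <reg orig="Cromwel">Cromwell</reg>
# group 1 = original spelling, group 2 = regularised spelling.
REG = re.compile(r'<reg\s+orig="([^"]*)"\s*>(.*?)</reg>', re.DOTALL)


def regularised(text):
    """Replace each <reg> element with its regularised content."""
    return REG.sub(lambda m: m.group(2), text)


def original(text):
    """Replace each <reg> element with its original spelling."""
    return REG.sub(lambda m: m.group(1), text)


sample = 'General <reg orig="Cromwel">Cromwell</reg> marched North.'
print(regularised(sample))  # General Cromwell marched North.
print(original(sample))     # General Cromwel marched North.
```

Because both versions are produced from a single marked-up source, neither spelling has to be maintained separately, and a concordancer can be given the regularised output while the historian reads the original.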

Thus, the markup scheme as developed allows us to have the best of both worlds, and to gain maximum utility from the corpus.

