LCCWP TRANSCRIPTION PRINCIPLES

Transcription of LCCWP Projects

NOTE: This page is still in draft form!

The markup tags used in the LCCWP

SGML tags	denotes
<P>....</P>	A "chunk" of text. Roughly corresponds to the notion of paragraph but also includes headings, captions, list items and chunks of text of indeterminate function
<REG>...</REG>	Regularized spelling. In its current form, the corpus does not indicate the original spellings of regularized words, but we intend to add this information in the future.
<SIC>...</SIC>	Cases where there may be a degree of doubt about the accuracy of the words transcribed.
<GAP>	Material omitted from the transcription, typically textual or visual material imported into the project from an external source
<GAP desc="figure">	Graphic element (picture, drawing, photograph etc.) produced by the child
<TABLE>....</TABLE>	Any clear use of a table containing rows and cells
<ROW>...</ROW>	Row in a table
<CELL>...</CELL>	Cell in a table
<NAME key="...">	Anonymized name of a child in the corpus sample
<CHPB desc="...">	Child's page number
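
As a rough illustration of how the tag set above might be inspected, the following Python sketch tallies the opening tags in a transcribed fragment. The function name tally_tags and the sample fragment are invented for this example and are not part of the corpus or its tools; a regular expression is used only as a lightweight stand-in for a proper SGML parser.

```python
import re
from collections import Counter

# Matches opening tags only; closing tags such as </P> are skipped
# because "/" cannot start the captured tag name.
OPENING_TAG = re.compile(r'<\s*([A-Za-z]+)[^>]*>')

def tally_tags(text):
    """Count opening tags by name, treating tag names as case-insensitive."""
    return Counter(name.upper() for name in OPENING_TAG.findall(text))

# Hypothetical fragment, invented for illustration.
fragment = '<P>My <REG>favourite</REG> animal</P><GAP desc="figure">'
print(tally_tags(fragment))
```

A tally of this kind gives a quick picture of how often each feature (regularized spellings, omitted material, tables and so on) occurs in a project.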

Notes on tag usage

Page numbering. Originally we used the child's own page numbers. This caused problems: children leave many pages blank (and it is generally not clear which blanks are intentional), and their projects sometimes treat the cover page as page 1, sometimes the inside cover, and sometimes the first page of 'text proper'. To be consistent, we decided to make the inside cover page 1 in all instances and, where a cover page exists, to number it page 0.
The child's original page numbering is still included for reference, but it is not the primary label in the project index.
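
The numbering convention described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name corpus_page_numbers is hypothetical, and it assumes that pages are counted in physical order (blank pages included) and that a project without a cover begins at the inside cover.

```python
def corpus_page_numbers(n_pages, has_cover):
    """Return the corpus page number of each physical page, in order.

    Assumption for this sketch: every physical page is counted,
    including blank ones, and a project with no cover page starts
    at the inside cover.
    """
    first = 0 if has_cover else 1  # cover is page 0; inside cover is page 1
    return list(range(first, first + n_pages))

print(corpus_page_numbers(4, has_cover=True))   # cover, inside cover, two more pages
print(corpus_page_numbers(3, has_cover=False))  # inside cover and two more pages
```

Under this scheme the inside cover always receives the same number, whether or not the project has a cover page.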

Anonymisation tags. For example, <name key="KH"> refers to the child with the pseudonym Kyrah Hollingsworth. It is thus still possible to cross-reference a name in one project with a name in another project, but the child's real-life identity is not disclosed.
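
The cross-referencing that the key attribute makes possible might be sketched as follows in Python. The function name index_name_keys, the project identifiers and the sample text are all invented for illustration, and the sketch assumes that the name tag appears as a standalone tag with a double-quoted key attribute, as in the example above.

```python
import re
from collections import defaultdict

# Captures the key attribute of a name tag. Tag names are matched
# case-insensitively, so <NAME ...> and <name ...> are treated alike.
NAME_TAG = re.compile(r'<name\s+key="([^"]*)"\s*>', re.IGNORECASE)

def index_name_keys(projects):
    """Map each pseudonym key to the sorted ids of projects that mention it."""
    index = defaultdict(set)
    for project_id, text in projects.items():
        for key in NAME_TAG.findall(text):
            index[key].add(project_id)
    return {key: sorted(ids) for key, ids in index.items()}

# Hypothetical transcriptions, invented for illustration.
sample = {
    "project_a": '<P>I worked with <NAME key="KH"> on this page.</P>',
    "project_b": '<P><name key="KH"> and <NAME key="XX"> drew the map.</P>',
}
print(index_name_keys(sample))
```

Because only the key is stored, an index like this links mentions of the same child across projects without exposing any real name.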

Criteria underlying the Transcription

Our main concerns in developing transcription guidelines for LCCWP have been:

Fidelity to the original
that is, to keep the transcription as faithful to the original as possible
Consistency
that is, to code like features in a like manner throughout the corpus
Research goals and deadlines
Some of our broader aims are to enable the corpus to be used for the investigation of children's lexis, grammar and discourse. We are not primarily concerned with letter (glyph) formation or spelling variation, so the corpus will be of limited value in studying these phenomena. The external deadlines of project funding have contributed to the corpus being encoded at a simpler level of detail than we would have liked.
Standards in text encoding
In line with emergent standards in the corpus-building world, we have opted for an HTML/SGML hybrid, which is both standardised and flexible. As well as standard tags, such as <p>, there is scope for project-specific tags, such as...

The transcription scheme is discussed in more detail in the paper: Smith, N., McEnery, A. and Ivanic, R. (1998) Issues in Transcribing a Corpus of Children's Handwritten Projects. Literary and Linguistic Computing, Vol. 13, No. 4. Oxford: OUP.

Updated: 04 April 2001.