NOTE: This page is still in draft form!
SGML tags | denotes |
<P>....</P> | A "chunk" of text. Roughly corresponds to the notion of paragraph but also includes headings, captions, list items and chunks of text of indeterminate function |
<REG>...</REG> | Regularized spelling. In its current form, the corpus does not indicate the original spellings of regularized words, but we intend to add this information in the future. |
<SIC>...</SIC> | Cases where there may be a degree of doubt about the accuracy of the words transcribed. |
<GAP> | Material omitted from the transcription, typically textual or visual material imported into the project from an external source |
<GAP desc="figure"> | Graphic element (picture, drawing, photograph etc.) produced by the child |
<TABLE>....</TABLE> | Any clear use of a table containing rows and cells |
<ROW>...</ROW> | Row in a table |
<CELL>...</CELL> | Cell in a table |
<NAME key="..."> | Anonymized name of a child in the corpus sample |
<CHPB desc="..."> | Child's page number |
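As a rough illustration of how the tags in the table might be processed, here is a minimal Python sketch. The sample fragment, the pattern `TAG` and the function names `tag_counts` and `plain_text` are all invented for this example; none of them is part of the corpus or its tools.

```python
import re
from collections import Counter

# Hypothetical fragment, invented for illustration; not taken from the corpus.
SAMPLE = (
    '<P>My project is about <REG>dinosaurs</REG>.</P>\n'
    '<GAP desc="figure">\n'
    '<TABLE><ROW><CELL>big</CELL><CELL>small</CELL></ROW></TABLE>'
)

# Matches an opening or closing tag; group 1 is the tag name.
TAG = re.compile(r'</?([A-Za-z]+)(?:\s[^<>]*)?>')


def tag_counts(text):
    """Count opening tags by upper-cased name (closing tags are skipped)."""
    return Counter(
        m.group(1).upper()
        for m in TAG.finditer(text)
        if not m.group(0).startswith('</')
    )


def plain_text(text):
    """Remove every tag, leaving only the transcribed words."""
    return TAG.sub('', text)


counts = tag_counts(SAMPLE)
words_only = plain_text(SAMPLE)
```

A regular expression is enough for a sketch like this because the tag set is small and flat; a real SGML parser would be a safer choice for anything beyond quick counts.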
Notes on tag usage
Page numbering. Originally we used the child's own page
numbers. This caused problems: children leave many blank pages (and it
is not generally clear which of these are intentional), and their
projects sometimes treat the cover page as page 1, sometimes the inside
cover, and sometimes the first page of 'text proper'. To be consistent,
we decided to make page 1 the inside cover page in all instances and,
if it exists, to number the cover page as page 0.
The child's original page numbering is still included for reference, but
it is not the primary label in the project index.
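The numbering convention can be sketched in a few lines of Python. The function name `corpus_page_labels` and its parameters are hypothetical, and the sketch assumes that every page after the inside cover, blank or not, is simply numbered consecutively.

```python
def corpus_page_labels(n_pages, has_cover):
    """Return corpus page labels for a project's pages in physical order.

    n_pages counts every page, including the cover if there is one.
    Assumption: pages after the inside cover are numbered consecutively,
    whether or not the child wrote on them.
    """
    start = 0 if has_cover else 1  # the cover, if present, is page 0
    return list(range(start, start + n_pages))


with_cover = corpus_page_labels(4, has_cover=True)
without_cover = corpus_page_labels(3, has_cover=False)
```

Under these assumptions, a project with a cover starts at page 0 and one without starts at page 1, so the inside cover is page 1 in both cases.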
Anonymisation tags. For example, <name key="KH"> refers to the child with the pseudonym Kyrah Hollingsworth. It is therefore still possible to cross-reference a name in one project with the same name in another project, but the real-life identity of that child is not disclosed.
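The cross-referencing that the key attribute allows could look something like the following Python sketch. The project ids, the sample text, the key "TB" and the function name `index_names` are all invented for illustration; only the <name key="..."> form comes from this page.

```python
import re
from collections import defaultdict

# Case-insensitive, since this page writes the tag both as <NAME ...> and <name ...>.
NAME_TAG = re.compile(r'<name\s+key="([^"]+)"\s*>', re.IGNORECASE)


def index_names(projects):
    """Map each pseudonym key to the sorted list of project ids that mention it."""
    index = defaultdict(set)
    for project_id, text in projects.items():
        for key in NAME_TAG.findall(text):
            index[key].add(project_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hypothetical project ids and text, invented for illustration.
projects = {
    "proj_a": '<P>I worked with <name key="KH">.</P>',
    "proj_b": '<P><NAME key="KH"> and <NAME key="TB"> helped me.</P>',
}

name_index = index_names(projects)
```

Because only the key is stored in the text, an index like this links mentions of the same child across projects without ever touching the real name.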
Our main concerns in developing transcription guidelines for LCCWP have been:
to keep the transcription as faithful to the original as possible;
to code like features in a like manner throughout the corpus.
Some of our broader aims are to enable the corpus to be used for the investigation of children's lexis, grammar and discourse. We are not primarily concerned with letter (glyph) formation or spelling variation, so the corpus will be of limited value in studying these phenomena. The external deadlines of project funding have also meant that the corpus is encoded at a simpler level of detail than we would have liked.
In view of emergent standards in the corpus-building world, we have opted for an HTML/SGML hybrid, which is both standardised and flexible. As well as standard tags, such as <p>, there is scope for project-specific tags, such as...
The transcription scheme is discussed in more detail in the paper: Smith, N., McEnery, A. and Ivanic, R. (1998) Issues in Transcribing a Corpus of Children's Handwritten Projects. Literary and Linguistic Computing, Vol. 13, No. 4. Oxford: OUP.
Updated: 04 April 2001.