Lancaster University Department of Linguistics and Modern English Language

Glossary of Useful Terms

 

MS Word file
 
A file in the format created by the program Microsoft Word. These files usually have names that end in the suffix .doc (or .docx in newer versions of Word). They cannot be used in corpus analysis programs without first being converted to text only format.

 

Text only file
 
A file containing letters, numbers and punctuation but no proprietary formatting codes for things such as bold, italic, etc. Most text only files have the ending .txt. They are also known as "Plain Text" or "ASCII" files. Note that HTML and SGML files are essentially "text only" files: their formatting is marked by tags in angled brackets < >, so no special character sets or proprietary codes need to be added to the file.
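As a rough illustration of the difference, the Python sketch below checks whether a file's contents are "text only" in the strict ASCII sense. The function name looks_like_text_only is invented for this example, and the check is deliberately narrow: a text only file saved in another encoding (such as UTF-8 with accented letters) would fail it even though it contains no proprietary codes.

```python
def looks_like_text_only(data: bytes) -> bool:
    """Return True if every byte is a plain ASCII character."""
    return data.isascii()


# The first eight bytes of the older binary .doc container format.
WORD_DOC_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

print(looks_like_text_only(b"A small corpus sample.\n"))  # True
print(looks_like_text_only(WORD_DOC_SIGNATURE))           # False
```

To test a real file, read it in binary mode first, for example open(path, "rb").read(), and pass the result to the function.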

 

SGML file
 
A kind of text only file that contains "mark-up tags" which show where formatting or other features appear in the file, such as <P> to mark a new paragraph or <pause> to mark a pause. You can also put a lot of information in a "header" at the top of an SGML file, such as the date the text was created, the author, the number of words, etc. WordSmith lets you switch SGML tags on and off (go to Settings - Adjust Settings - Tags, and clear the button next to Activated). SGML files have the ending .sgm or .sgml.
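Outside WordSmith, the effect of switching tags off can be approximated with a few lines of Python. This is only a minimal sketch: the function name strip_tags is invented here, and the pattern simply treats anything between angled brackets as a tag, which is not how a full SGML parser works.

```python
import re

# Matches anything between angled brackets, e.g. <P> or <pause>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove mark-up tags and collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


sample = "<P>Well <pause> I think so.</P>"
print(strip_tags(sample))  # Well I think so.
```

Replacing each tag with a space, rather than with nothing, stops words on either side of a tag from being joined together.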

 

HTML
 
The current language of web pages on the Internet. Similar to SGML, in that tags like <P> and <TABLE> are common. The tags are specific to displaying pages in a web browser, and you cannot make up your own tags in HTML. HTML files have the ending .htm or .html.

 

XML
 
A simplified form of SGML, often loosely described as a new version of HTML. Unlike HTML, it lets you make up your own tags. It was designed for use on the Internet and is already widely used in corpus linguistics. It is a kind of compromise between the flexibility of SGML and the simplicity of HTML.
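Because XML allows tags of your own choosing, a corpus text can carry a header and body that a program can read directly. The sketch below uses Python's standard xml.etree.ElementTree module. The tag names (text, header, author, date, body, p, pause) and the sample values are invented for illustration and do not follow any particular corpus standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus text: all tag names and values are made up.
SAMPLE = """<text>
  <header>
    <author>A. Student</author>
    <date>2004-03-01</date>
  </header>
  <body>
    <p>Well <pause/> I think so.</p>
  </body>
</text>"""

root = ET.fromstring(SAMPLE)

# Collect the header fields into a dictionary keyed by tag name.
header = {child.tag: child.text for child in root.find("header")}

# Join all the text inside <body>, ignoring the tags themselves.
body_text = " ".join("".join(root.find("body").itertext()).split())

print(header)
print(body_text)
```

Note that XML, unlike SGML or HTML, requires every tag to be closed, which is why the pause is written as <pause/>.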