DLF. Standards and practices. TEI Text Encoding in Libraries

TEI Text Encoding in Libraries

Guidelines for Best Encoding Practices

Version 1.0 (July 30, 1999)

Current version: Version 2.1 (2006)

Comments to Perry Willett, Indiana University (email: pwillett@indiana.edu)

Introduction
Participants
Recommendations
General Recommendations
Encoding Levels
Attribute Values

I. Introduction

At the TEI and XML in Digital Libraries Workshop held at the Library of Congress on June 30-July 1, 1998, three working groups were formed. Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. Representatives from six libraries met at the Library of Congress on November 12-13, 1998. The Task Force met again at ALA Midwinter (January 1999) to incorporate comments and finalize the draft. The revised recommendations were circulated to the conference working group in May 1999 and presented at the joint annual meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing in June 1999. Version 1.0 was circulated for comments in August 1999.

II. Participants:

LeeEllen Friedland, Library of Congress
Nancy Kushigian, University of California, Davis
Christina Powell, University of Michigan
David Seaman, University of Virginia
Natalia Smith, University of North Carolina at Chapel Hill
Perry Willett, Indiana University

III. Background

Our recommendations are for libraries using the TEILite DTD v1.6. Library text digitization projects vary widely and serve different purposes. With this in mind, the Task Force has attempted to make these recommendations as inclusive as possible by developing a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding to encoding that requires expert content knowledge, analysis, and editing.

Encoding levels 1-4 require no expert knowledge of content. Level 5, in contrast, requires scholarly analysis. Levels 1-4 allow the conversion and encoding of texts to be performed without the assistance of content experts and can be enriched with more markup at any time. Recommendations for Levels 1-4 are intended for projects wishing to create encoded electronic text with structural markup, but minimal semantic or content markup. Also, the encoding levels are cumulative: encoding requirements at each level incorporate the requirements of lower levels.

These recommendations are concerned with the text portion of a TEI-encoded document. While there are modest requirements for including certain information about encoding level in the TEI Header, a separate set of recommendations has been developed to address the relationship of TEI Header contents to MARC-format bibliographic data (see the TEI/MARC Best Practices Document from Working Group 1).

IV. General Recommendations

The encoding level (as described in this document) should be recorded in the <editorialDecl>, along with an explanation of any deviation from the recommendations.
Electronic text at all levels of encoding should begin with the transcription of the first word on the first leaf of the original work. It may be impractical or undesirable to transcribe and encode certain features of the text, such as publisher's advertisements or indexes, but if at all possible, they should be included as links to page images. Any omissions of material found in the original work should be noted in the <editorialDecl> in the TEI Header.
File naming should follow ISO 9660 conventions: 8-character filenames, 3-character extensions, using A-Z, a-z, 0-9, underscores and hyphens.
Numbered <DIV>s present advantages to search and indexing software by explicitly communicating the hierarchical level of the section described. One anomaly of the TEI Guidelines is that <DIV0> is not available in <FRONT> or <BACK> matter. Therefore, we recommend the use of numbered <DIV>s throughout the electronic text, always beginning with <DIV1>. Texts at all levels should include at least one <DIV1>.
Page break elements (<PB>) should be placed at the top of each page and always inside a DIV, never between DIVs.
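
The file naming recommendation above lends itself to an automated check. The following is a minimal Python sketch, not part of the recommendations themselves: the function name `follows_naming_rule` is hypothetical, and reading "8-character" and "3-character" as upper bounds rather than exact lengths is an assumption.

```python
import re

# Pattern for the naming rule as stated above: a base name of up to 8
# characters and an extension of up to 3, drawn from A-Z, a-z, 0-9,
# underscore and hyphen. Treating 8 and 3 as upper bounds (rather than
# exact lengths) is an assumption of this sketch.
FILENAME_RULE = re.compile(r"[A-Za-z0-9_-]{1,8}\.[A-Za-z0-9_-]{1,3}")


def follows_naming_rule(filename: str) -> bool:
    """Return True if filename fits the 8.3 rule sketched above."""
    return FILENAME_RULE.fullmatch(filename) is not None


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the check.
    for name in ["ab1234.sgm", "moby-dick.sgm", "text.sgml"]:
        print(name, follows_naming_rule(name))
```

A project could run such a check over a directory listing before files are loaded into a delivery system.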

V. Encoding Levels

V.1. LEVEL 1: Fully Automated Conversion and Encoding

Purpose: To create electronic text with the primary purpose of keyword searching and linking to page images. The primary advantage in using the TEILite DTD at this level is that a TEI Header is attached to the text file.

Rationale: The text is subordinate to the page image and is not intended to stand alone as an electronic text (without page images).

Texts at Level 1 can be created and encoded by fully automated means, using uncorrected OCR of page images ("dirty OCR"), or exporting from existing electronic text files. Only those tags that are necessary to divide the text from the header and facilitate linking to page images are used. Encoding is performed automatically based on artifacts of the OCR or other document creation process (page breaks, for example) and metadata collected during the imaging or preparation process. This encoding is both minimal and reliable, and does not typically require extensive review of each page of each text.

Level 1 texts are not intended to be adequate for textual analysis; they are more likely to be suited to the goals of a preservation unit or mass digitization initiative. Though their encoding is minimal, Level 1 texts are fully valid SGML texts. In addition to taking advantage of the TEI Header, using the TEILite DTD allows Level 1 texts to be compatible with more richly encoded TEILite texts for searching, for example. Further encoding based on document structures or content analysis can be added to a Level 1 text at any time.

Level 1 is most suitable for projects with the following characteristics:

a large volume of material is to be made available online quickly
a digital image of each page is desired
no manual intervention will be performed in the text creation process
the material is of interest to a large community of users who wish to read texts that allow keyword searching
sophisticated search and display capabilities based on the structure of the text are not necessary
extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added at a later date

<DIV1>	Type="section" is the default attribute value.
<P>	One "container" element per DIV is required.
<PB>	This is required in Level 1. Page images can be linked to the text using ID/IDREF or ENTITYREF attributes. Using ENTITYREF has advantages for maintaining large numbers of image files, but would require modifying the TEILite DTD.
<FIGURE>	This element is optional at Level 1. The advantage of using <FIGURE> is the ability to record metadata using <FIGDESC>.
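
As an illustration of how little markup Level 1 involves, the Python sketch below wraps a list of OCR page texts in one <DIV1>, one container <P>, and a <PB> per page. It is a sketch under stated assumptions: the function name `level1_body` and the ID scheme (`p0001`, `p0002`, ...) are hypothetical, the TEI Header is omitted, and the output has not been validated against the TEILite DTD.

```python
from xml.sax.saxutils import escape


def level1_body(pages):
    """Wrap uncorrected OCR pages in a minimal Level 1 structure.

    pages is a list of plain-text strings, one per page image. The result
    is one <DIV1 TYPE="section"> holding a single container <P>, with a
    <PB> at the top of each page.
    """
    parts = ["<TEXT>", "<BODY>", '<DIV1 TYPE="section">', "<P>"]
    for number, page_text in enumerate(pages, start=1):
        # The ID scheme is hypothetical; a project would use whatever
        # identifier ties the page break to its page image.
        parts.append(f'<PB N="{number}" ID="p{number:04d}">')
        # Escape &, < and > so that stray OCR characters cannot be
        # mistaken for markup.
        parts.append(escape(page_text))
    parts += ["</P>", "</DIV1>", "</BODY>", "</TEXT>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(level1_body(["Text of the first page.", "Text of the second page."]))
```

Because every decision here is mechanical, the same structure can be produced for thousands of volumes without page-by-page review, which is the point of Level 1.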

V.2. LEVEL 2: Minimal Encoding

Purpose: To create electronic text for keyword searching, linking to page images, and identifying simple structural hierarchy to improve navigation.

Rationale: The text is subordinate to the page image, though navigational markers (textual divisions, heads) are captured. The text could stand alone as electronic text (without page images) if the accuracy of its contents is suitable to its intended use and it is not necessary to display low-level typographic or structural information. Level 2 requires a set of elements more granular than those of Level 1, including bibliographic or structural information below the monographic or volume level, but identifying these elements still does not require a specialist.

Texts at Level 2 can be created and encoded by automated means, based on the typographic elements in the electronic file (for example, bold centered text at the top of the page surrounded by whitespace indicates a new chapter head, and thus a new division), but such automation is not likely to be fully reliable across a large body of material. Level 2 encoding therefore requires some human intervention to identify each textual division and heading. Level 2 texts do not require any specialist knowledge or manual intervention below the section level.

Level 2 texts can be displayed separately from their page images. Even when displayed with page images, Level 2 encoding of sections and heads provides greater navigational possibilities than Level 1 encoding, and enables searching to be restricted within particular textual divisions (for example, searching for two phrases within the same chapter).

Level 2 is most suitable for projects with the following characteristics:

a large volume of material is to be made available online quickly
a digital image of each page is desired
the material is of interest to a large community of users who wish to read texts that allow keyword searching
rudimentary search and display capabilities based on the large structures of the text are desired
each text will be checked to ensure that divisions and headers are properly identified
extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added at a later date

All elements specified in Level 1 plus the following:

<FRONT>, <BACK>	Optional
<HEAD>	Required if present
<DIV1>	Type="section" is the default attribute value. It is recommended that the N attribute be included to record the div sequence.
<P>	One "container" element per DIV is required.

V.3. LEVEL 3: Simple Analysis

Purpose: To create text that can stand alone as electronic text and identifies hierarchy and typography without content analysis being of primary importance.

Rationale: Level 3 texts can be created from scratch or by the relatively easy conversion of existing HTML or word-processing documents. Encoding offers the advantage of the TEI Header, interoperability with other TEI collections, and extensibility to higher levels of encoding. Level 3 generally requires some human editing, but the features to be encoded are determined by the appearance of the text and not specialized content analysis.

Level 3 texts identify front and back matter, and all paragraph breaks. The finer granularity of tagging these features, as well as figures, notes, and all changes of typography, allows a range of options for display, delivery, and searching. For example, one has the option of identifying and, therefore, specifying the display characteristics of different typographic styles, and regularizing the display and placement of note text.

Level 3 texts can stand alone as text without page images and, therefore, can be uploaded, downloaded and delivered quickly, and require less storage space than digital collections with page images. However, the simple level of structural analysis and absence of specialized content analysis reflected in Level 3 tagging may make it desirable for some, depending on project priorities, to include page images in order to provide users with a fuller set of resources.

Level 3 is most suitable for projects with the following characteristics:

the material is of interest to a large community of users who wish to read texts that allow keyword searching
some sophistication of display, delivery, and searching based on structure of the text is desired
each text will be checked to ensure that tagging decisions have been made appropriately
the users of the texts may have limited storage or display capabilities
the creator of the texts has limited or no ability to provide content specialists to analyze, tag, or review texts
extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added at a later date

All elements specified in Levels 1 and 2, plus the following:

<FRONT>, <BACK>	Required if present.
<P>	Required for paragraph breaks in prose; may also be used for stanzas, with <LB> marking line breaks in verse.
<FIGURE>	Required to indicate figures other than page images.
<HI>	Required to indicate changes in typeface. REND attribute is optional.
<NOTE>	All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one <NOTE> element. Marginal notes, without reference, should occur at the beginning of the paragraph to which they refer, with the value of the PLACE attribute as "margin".

NOTE ON <NOTE>:

For processing reasons, it may be desirable to move footnotes from their original location in the text. If left at the bottom of a page, a note may become included in another paragraph or section of the encoded text, and thus separated from its reference. There are options for placement of footnotes if they are moved:

Inline. The note is inserted at the point of reference. The N attribute records the number or symbol of the note. No <REF> element is needed with this option.
End-of-Paragraph. <REF> with TARGET attribute occurs at point of reference. <NOTE> with ID attribute occurs within, but at the end of, the paragraph in which the reference occurs.
End-of-Div. Notes are moved to the end of the <DIV>.

V.4. LEVEL 4: Basic Content Analysis
Purpose: To create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.

Rationale: Greater description of function and content allows for:
- flexibility of display and delivery
- sophisticated searching within specified textual and structural elements
- combining the broadest range of uses and audiences
Texts encoded at Level 4 are able to stand alone as part of a library collection, and do not require images in order for them to be read by students, scholars and general readers. This level of TEI encoding allows them to be displayed or printed in a variety of ways suitable for classroom or scholarly use.
Level 4 texts contain tags and attributes that describe content. For example, lines of verse are tagged with <L>; the <P> tag is reserved for true paragraphs. Attributes of the text that contribute to meaning are preserved, such as indentation of lines of verse and typography. These are textual features that are not encoded at lower levels and that allow the text to be used and understood fully independent of images.
The ability to stand alone as text means that Level 4 texts can be uploaded, downloaded, and delivered quickly, and require less storage space than collections with page images.
Finally, functionally accurate tagging in Level 4 texts allows them to be searched or displayed in sophisticated ways. For example, a searcher could limit his or her search in a dramatic text to stage directions or to the speeches of a particular character. In a volume of poetry published by subscription, a search could be confined to names that appear in lists, thus limiting a search to names of people who subscribed to a particular volume. This ability to limit searches becomes more significant as textbases become larger, and thus is of great importance to the library community as it attempts to build into the initial design and implementation of textbases features needed to enhance interoperability.
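
The kind of restricted search described above can be sketched in a few lines of Python. This is an illustration only: the sample scene is invented, the helper names `search_within` and `speeches_by` are hypothetical, and the fragment is written as well-formed lowercase XML (rather than SGML) so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# An invented dramatic fragment using the Level 4 drama elements.
SAMPLE = """
<div1 type="scene">
  <stage>Enter two clerks.</stage>
  <sp><speaker>First Clerk</speaker><p>Is the ledger closed?</p></sp>
  <sp><speaker>Second Clerk</speaker><p>Not yet; the ledger is open.</p></sp>
  <stage>They open the ledger.</stage>
</div1>
"""


def search_within(root, tag, word):
    """Return the text of every <tag> element that contains word."""
    hits = []
    for element in root.iter(tag):
        text = "".join(element.itertext())
        if word.lower() in text.lower():
            # Collapse runs of whitespace for display.
            hits.append(" ".join(text.split()))
    return hits


def speeches_by(root, speaker_name):
    """Return the <sp> elements whose <speaker> matches speaker_name."""
    return [sp for sp in root.iter("sp")
            if (sp.findtext("speaker") or "").strip() == speaker_name]


root = ET.fromstring(SAMPLE.strip())

if __name__ == "__main__":
    print(search_within(root, "stage", "ledger"))
    print(len(speeches_by(root, "First Clerk")))
```

The same word ("ledger") occurs in speeches and in a stage direction; only the functional tagging lets a search tell them apart.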
Level 4 is most suitable for projects with the following characteristics:
- the users of the texts may have limited storage or display capabilities
- sophisticated search and display capabilities are desired
- the collection is of interest to a currently existing, well defined community of users, such as one that would constitute a market for any published text or collection of texts
- the collection is rare and not available to users in print or other electronic formats
- the texts will be used for pedagogical or scholarly purposes, not just as reading copies
- extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added by the scholarly community at a later date
In considering a Level 4 TEI digitization project, an academic library should consult with faculty members and collection bibliographers, and ask the following question: Is this collection of texts one that the library would purchase if it were available commercially? If so, the benefits of a Level 4 project are many, for the result is a freely available collection of texts owned and administered by the library community, thus free of licensing restrictions and on-going access charges.
General Level 4 Recommendations:
- V.4.1. Emphasized text should be encoded as <FOREIGN>, <TITLE>, <EMPH>, as appropriate. Any ambiguous emphasized text should be encoded as <HI>.
- V.4.2. It is recommended that the <SIC> element be used to indicate typographic errors, with corrections noted as the value of the CORR attribute.
- V.4.3. <TITLEPAGE> should include the verso if present, divided by <PB N="verso">. Tables of contents, errata, subscription lists, and "other titles by the same author" should each be included in a separate numbered DIV, as a <LIST> with <ITEM>s. Frontispieces should be encoded as a <FIGURE>, within a separate numbered <DIV> and <P>.
Level 4 Prose:
- V.4.4. Letters that occur within the text body provide some challenges. It is recommended that quoted letters that occur as part of a text (and not collections of letters themselves) be encoded within <q><text><body><div1 type="letter">, with <opener>, <dateline>, <salute>, <signed>, <closer> included as appropriate.
- V.4.5. Quotations that do not occur inline, but are set off typographically in some way, should be encoded as <q>.
- V.4.6. Notes are to be encoded as described in Level 3.
- V.4.7. Use <ARGUMENT>, <OPENER>, <EPIGRAPH>, <CLOSER>, <TRAILER>, <ADD>, <DEL>, and <UNCLEAR> as appropriate.
Level 4 Drama:
- V.4.8. Cast lists should be encoded as <LIST>s, with <ITEM>s.
- V.4.9. Speeches are encoded as <SP>, with speakers identified within <SPEAKER> elements.
Level 4 Verse:
- V.4.10. All verse, even poems without separate stanzas or verse paragraphs, should be contained within a line group element <LG>. This will assist with automated processing and retrieval.
- V.4.11. It is common to see informal divisions within poems, marked by a string of asterisks or periods. These should be encoded as <MILESTONE>s with UNIT="typography" and with N recording the character used and the number of times it occurs, for example <MILESTONE UNIT="typography" N="******">.
- V.4.12. It is strongly recommended that the indentation of each <L> be recorded using the REND attribute.
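
The informal divisions described in V.4.11 can often be detected mechanically. The Python sketch below turns a line consisting only of asterisks or periods into the corresponding <MILESTONE> tag. The function name `mark_divider` is hypothetical, and both the three-mark minimum and the choice to drop spaces between marks are assumptions of this sketch, not part of the recommendation.

```python
import re
from xml.sax.saxutils import quoteattr

# A divider line is taken to be nothing but repetitions of one mark, an
# asterisk or a period, optionally separated by spaces (for example
# "* * * * *" or "......"). Requiring at least three marks is an
# assumption of this sketch.
DIVIDER = re.compile(r"^\s*([*.])(?:\s*\1){2,}\s*$")


def mark_divider(line):
    """Return a <MILESTONE> tag for a divider line, or None otherwise."""
    if not DIVIDER.match(line):
        return None
    # Drop the spacing but keep the characters used and their count.
    marks = "".join(line.split())
    return f'<MILESTONE UNIT="typography" N={quoteattr(marks)}>'


if __name__ == "__main__":
    for line in ["******", "  * * * * *  ", "And then the night fell."]:
        print(repr(line), "->", mark_divider(line))
```

Lines that return None are ordinary verse lines and would be tagged with <L> as usual.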
Level 4 Front and Back Matter:
- V.4.13. It is recommended that all prefaces, tables of contents, afterwords, appendices, endnotes and apparatus be encoded. For publisher's advertisements, indexes, glossaries, and other front or back matter not considered of primary importance to the text, there are three options:
  1. Fully transcribe and encode
  2. Link to page images (may include an unencoded transcription)
  3. Omit, with the omission noted in <editorialDecl>
V.5. LEVEL 5: Scholarly Encoding Projects
Level 5 texts are those that require subject knowledge, and encode semantic, linguistic, prosodic or other elements beyond a basic structural level.

VI. Attribute Values

TYPE
Constructing a list of acceptable attribute values for TYPE that could find wide agreement is impossible. Instead, it is recommended that projects describe the TYPE attribute values used in their texts in the project documentation and that this list be made available to people using the texts. See ABC for Book Collectors by John Carter (7th edition, New Castle, DE: Oak Knoll Books, 1995) for a list of standard names and definitions of bibliographic features of printed books. For those elements where TYPE is not required, such as <HEAD> and <TITLE>, use the attribute values for subtitles and additional titles, but not main titles.
REND
REND becomes difficult to use when more than one rendition feature needs to be recorded. With this in mind, it is recommended that projects employ the following adaptation of "rendition ladders", a concept developed at the Brown Women Writers Project <http://www.wwp.brown.edu>. This concept allows strings of rendition features to be included as one REND value. Rendition ladders consist of categories of renditions, with further defined values included in parentheses.
REND should only be used to override a default value. For instance, if all text encoded as <HI> is defined as being rendered in italics, there is no reason to encode text as

<HI REND="font(italics)">

Combining rendition features in one REND value would result in a tag such as this:

<L REND="font(italics)align(right)">
FONT
italics, bold, fsc (full and smallcaps), smallcap, underlined, gothic
ALIGN
right, left, center, block
INDENT
Values in parentheses should indicate the number of tabstops to be indented, e.g., <L REND="indent(1)">
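
Software that displays or indexes these texts needs to take a rendition ladder apart again. The following Python sketch shows one way to do so; it is illustrative only. The function name `parse_rend` is hypothetical, and the decision to reject a REND value that is not a pure sequence of `category(value)` pairs is an assumption of this sketch.

```python
import re

# One rung of a ladder: a category name followed by its value in parentheses.
RUNG = re.compile(r"([A-Za-z]+)\(([^()]*)\)")


def parse_rend(rend):
    """Split a ladder such as 'font(italics)align(right)' into a dict."""
    rungs = {}
    position = 0
    for match in RUNG.finditer(rend):
        # Anything other than whitespace between rungs is malformed.
        if rend[position:match.start()].strip():
            raise ValueError(f"unexpected text in REND value: {rend!r}")
        rungs[match.group(1).lower()] = match.group(2).strip()
        position = match.end()
    # Leftover text after the last rung is also malformed.
    if rend[position:].strip():
        raise ValueError(f"unexpected text in REND value: {rend!r}")
    return rungs


if __name__ == "__main__":
    print(parse_rend("font(italics)align(right)"))
    print(parse_rend("indent(1)"))
```

Keeping each category in its own key lets a stylesheet or indexer act on, say, `indent` without having to know which other features were recorded alongside it.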
LANG
Use ISO 639-2 three-character language codes.
