TEI Text Encoding in Libraries
Guidelines for Best Encoding Practices
Version 1.0 (July 30, 1999)
Comments to Perry Willett, Indiana University
(email: pwillett@indiana.edu)
- Introduction
- Participants
- Recommendations
- General Recommendations
- Encoding Levels
  - Level 1: Fully Automated Conversion and Encoding
  - Level 2: Minimal Encoding
  - Level 3: Simple Analysis
  - Level 4: Basic Content Analysis
  - Level 5: Scholarly Encoding Projects
- Attribute Values
Introduction
At the TEI and XML in Digital Libraries Workshop held at the
Library of Congress on June 30-July 1, 1998, three working groups
were formed. Group 2 was charged with developing a set of
recommendations for libraries using the TEI Guidelines in
electronic text encoding. Representatives from six libraries met
at the Library of Congress on November 12-13, 1998. The Task
Force met again at the ALA Midwinter Meeting (January 1999) to
incorporate comments and finalize the draft. The revised
recommendations were circulated to the conference working group
in May 1999 and presented at the joint annual meeting of the
Association for Computers and the Humanities and the Association
for Literary and Linguistic Computing in June 1999. Version 1.0
was circulated for comments in August 1999.
Participants
- LeeEllen Friedland, Library of Congress
- Nancy Kushigian, University of California, Davis
- Christina Powell, University of Michigan
- David Seaman, University of Virginia
- Natalia Smith, University of North Carolina at Chapel Hill
- Perry Willett, Indiana University
Recommendations
Our recommendations are for libraries using the TEILite DTD
v1.6. There are many different library text digitization
projects, for different purposes. With this in mind, the Task
Force has attempted to make these recommendations as inclusive as
possible by developing a series of encoding levels. These levels
are meant to allow for a range of practice, from wholly automated
text creation and encoding, to encoding that requires expert
content knowledge, analysis, and editing.
Encoding levels 1-4 require no expert knowledge of content.
Level 5, in contrast, requires scholarly analysis. Levels 1-4
allow the conversion and encoding of texts to be performed
without the assistance of content experts and can be enriched
with more markup at any time. Recommendations for Levels 1-4 are
intended for projects wishing to create encoded electronic text
with structural markup, but minimal semantic or content markup.
Also, the encoding levels are cumulative: encoding requirements
at each level incorporate the requirements of lower levels.
These recommendations are concerned with the text portion of a
TEI-encoded document. While there are modest requirements for
recording information about the encoding level in the TEI
Header, a separate set of recommendations has been developed to
address the relationship between TEI Header contents and
MARC-format bibliographic data (see the TEI/MARC Best Practices
Document from Working Group 1).
General Recommendations
- The encoding level (as described in this document) should be
recorded in the <editorialDecl>, along with an explanation
of any deviation from the recommendations.
- Electronic text at all levels of encoding should begin with
the transcription of the first word on the first leaf of the
original work. It may be impractical or undesirable to transcribe
and encode certain features of the text, such as publisher's
advertisements or indexes, but if at all possible, they should be
included as links to page images. Any omissions of material found
in the original work should be noted in the <editorialDecl>
in the TEI Header.
- File naming should follow ISO 9660 conventions: 8-character
filenames, 3-character extensions, using A-Z, a-z, 0-9,
underscores and hyphens.
- Numbered <DIV>s present advantages to search and
indexing software by explicitly communicating the hierarchical
level of the section described. One anomaly of the TEI Guidelines
is that <DIV0> is not available in <FRONT> or
<BACK> matter. Therefore, we recommend the use of numbered
<DIV>s throughout the electronic text, always beginning
with <DIV1>. Texts at all levels should include at least
one <DIV1>.
- Page breaks (<PB>) should be recorded at the top of the page
they begin, and each <PB> should fall inside a DIV rather than
between DIVs.
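As a hedged illustration of the general recommendations above, the fragment below sketches how the encoding level and an omission might be recorded in the header, and how numbered divisions and page breaks might be arranged in the text. The wording inside <editorialDecl>, the TYPE values, and the page numbers are invented for the example. Element names are written in lowercase, which the TEI SGML declaration treats as equivalent to the uppercase forms used in this document.

```sgml
<!-- In the TEI Header: record the level and any omissions.
     The prose here is hypothetical sample wording. -->
<encodingDesc>
  <editorialDecl>
    <p>Encoded at Level 2 of the TEI Text Encoding in Libraries
       guidelines, version 1.0.</p>
    <p>The publisher's advertisements at the end of the volume
       were not transcribed.</p>
  </editorialDecl>
</encodingDesc>

<!-- In the text: numbered divisions start at div1 everywhere,
     including front matter, and each pb sits inside a division. -->
<text>
  <front>
    <div1 type="section">
      <pb n="iii">
      <p>Text of the preface.</p>
    </div1>
  </front>
  <body>
    <div1 type="section">
      <pb n="1">
      <p>Text of the first chapter.</p>
    </div1>
  </body>
</text>
```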
Encoding Levels
- V.1. LEVEL 1: Fully Automated Conversion and Encoding
Purpose: To create electronic text with the primary
purpose of keyword searching and linking to page images. The
primary advantage in using the TEILite DTD at this level is that
a TEI Header is attached to the text file.
Rationale: The text is subordinate to the page image,
and is not intended to stand alone as an electronic text (without
page images).
Texts at Level 1 can be created and encoded by fully automated
means, using uncorrected OCR of page images ("dirty OCR"), or
exporting from existing electronic text files. Only those tags
that are necessary to divide the text from the header and
facilitate linking to page images are used. Encoding is performed
automatically based on artifacts of the OCR or other document
creation process (page breaks, for example) and metadata
collected during the imaging or preparation process. This
encoding is both minimal and reliable, and does not typically
require extensive review of each page of each text.
Level 1 texts are not intended to be adequate for textual
analysis; they are more likely to be suited to the goals of a
preservation unit or mass digitization initiative. Though their
encoding is minimal, Level 1 texts are fully valid SGML texts. In
addition to taking advantage of the TEI Header, using the TEILite
DTD allows Level 1 texts to be compatible with more richly
encoded TEILite texts for searching, for example. Further
encoding based on document structures or content analysis can be
added to a Level 1 text at any time.
Level 1 is most suitable for projects with the following
characteristics:
- a large volume of material is to be made available online
quickly
- a digital image of each page is desired
- no manual intervention will be performed in the text creation
process
- the material is of interest to a large community of users who
wish to read texts that allow keyword searching
- sophisticated search and display capabilities based on the
structure of the text are not necessary
- extensibility is desired; that is, one desires to keep open
the option for a higher level of tagging to be added at a later
date
Element | Recommendation
<DIV1> | TYPE="section" is the default attribute value.
<P> | One "container" element per DIV is required.
<PB> | Required at Level 1. Page images can be linked to the text using ID/IDREF or ENTITYREF attributes. Using ENTITYREF has advantages for maintaining large numbers of image files, but would require modifying the TEILite DTD.
<FIGURE> | Optional at Level 1. The advantage of using <FIGURE> is the ability to record metadata using <FIGDESC>.
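A minimal sketch of a Level 1 text body follows, assuming page images are tied to the text through ID values on <PB>. The ID scheme and the placeholder OCR text are illustrative only; these guidelines do not prescribe them, and how an image is bound to each ID depends on the delivery system.

```sgml
<text>
  <body>
    <!-- One division and one container paragraph hold the
         whole text; only page breaks are marked. -->
    <div1 type="section">
      <p>
        <pb n="1" id="p0001">
        Uncorrected OCR text of the first page.
        <pb n="2" id="p0002">
        Uncorrected OCR text of the second page.
      </p>
    </div1>
  </body>
</text>
```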
- V.2. LEVEL 2: Minimal Encoding
Purpose: To create electronic text for keyword
searching, linking to page images, and identifying simple
structural hierarchy to improve navigation.
Rationale: The text is subordinate to the page image,
though navigational markers (textual divisions, heads) are
captured. The text could stand alone as electronic text (without
page images) if the accuracy of its contents is suitable to its
intended use and it is not necessary to display low-level
typographic or structural information. Level 2 requires a set of
elements more granular than those of Level 1, including
bibliographic or structural information below the monographic or
volume level, but still does not require a specialist to
identify.
Texts at Level 2 can be created and encoded by automated means,
based on the typographic elements in the electronic file (for
example, bold centered text at the top of the page surrounded by
whitespace indicates a new chapter head, and thus a new
division). Such automated encoding is unlikely to be fully
reliable across a large body of material, so Level 2 encoding
requires some human intervention to identify each textual
division and heading. Level 2 texts do not require any specialist
knowledge or manual intervention below the section level.
Level 2 texts can be displayed separately from their page
images. Even when displayed with page images, Level 2 encoding of
sections and heads provides greater navigational possibilities
than Level 1 encoding, and enables searching to be restricted
within particular textual divisions (for example, searching for
two phrases within the same chapter).
Level 2 is most suitable for projects with the following
characteristics:
- a large volume of material is to be made available online
quickly
- a digital image of each page is desired
- the material is of interest to a large community of users who
wish to read texts that allow keyword searching
- rudimentary search and display capabilities based on the
large structures of the text are desired
- each text will be checked to ensure that divisions and
headers are properly identified
- extensibility is desired; that is, one desires to keep open
the option for a higher level of tagging to be added at a later
date
All elements specified in Level 1 plus the following:
Element | Recommendation
<FRONT>, <BACK> | Optional.
<HEAD> | Required if present.
<DIV1> | TYPE="section" is the default attribute value. It is recommended that the N attribute be included to record the div sequence.
<P> | One "container" element per DIV is required.
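The Level 2 additions can be sketched as follows: each division carries a heading where the source has one, and an N attribute recording its place in the sequence. The headings, N values, and placeholder text are invented for illustration.

```sgml
<text>
  <body>
    <div1 type="section" n="1">
      <pb n="1">
      <head>Chapter I</head>
      <!-- Below the division level the text stays in a
           single container paragraph, as at Level 1. -->
      <p>Text of the first chapter.
         <pb n="2">
         Text continues on the second page.</p>
    </div1>
    <div1 type="section" n="2">
      <pb n="15">
      <head>Chapter II</head>
      <p>Text of the second chapter.</p>
    </div1>
  </body>
</text>
```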
- V.3. LEVEL 3: Simple Analysis
Purpose: To create text that can stand alone as
electronic text and identifies hierarchy and typography without
content analysis being of primary importance.
Rationale: Level 3 texts can be created from scratch or
by the relatively easy conversion of existing HTML or
word-processing documents. Encoding offers the advantage of the
TEI Header, interoperability with other TEI collections, and
extensibility to higher levels of encoding. Level 3 generally
requires some human editing, but the features to be encoded are
determined by the appearance of the text and not specialized
content analysis.
Level 3 texts identify front and back matter, and all
paragraph breaks. The finer granularity of tagging these
features, as well as figures, notes, and all changes of
typography, allows a range of options for display, delivery, and
searching. For example, one has the option of identifying and,
therefore, specifying the display characteristics of different
typographic styles, and regularizing the display and placement of
note text.
Level 3 texts can stand alone as text without page images and,
therefore, can be uploaded, downloaded and delivered quickly, and
require less storage space than digital collections with page
images. However, the simple level of structural analysis and
absence of specialized content analysis reflected in Level 3
tagging may make it desirable for some, depending on project
priorities, to include page images in order to provide users with
a fuller set of resources.
Level 3 is most suitable for projects with the following
characteristics:
- the material is of interest to a large community of users who
wish to read texts that allow keyword searching
- some sophistication of display, delivery, and searching based
on structure of the text is desired
- each text will be checked to ensure that tagging decisions
have been made appropriately
- the users of the texts may have limited storage or display
capabilities
- the creator of the texts has limited or no ability to provide
content specialists to analyze, tag, or review texts
- extensibility is desired; that is, one desires to keep open
the option for a higher level of tagging to be added at a later
date
All elements specified in Levels 1 and 2, plus the following:
Element | Recommendation
<FRONT>, <BACK> | Required if present.
<P> | Required for paragraph breaks in prose; may be used for stanzas using <LB> for line breaks in verse.
<FIGURE> | Required to indicate figures other than page images.
<HI> | Required to indicate changes in typeface. REND attribute is optional.
<NOTE> | All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one <NOTE> element. Marginal notes, without reference, should occur at the beginning of the paragraph to which they refer, with the value of the PLACE attribute as "margin".
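As a rough sketch of the Level 3 elements listed above, the fragment below marks true paragraphs, a change of typeface, a marginal note, and a figure. The sample text is invented, and the ENTITY value assumes a matching entity declaration for the image file in the document's DTD subset.

```sgml
<div1 type="section" n="1">
  <head>Chapter I</head>
  <!-- Marginal note placed at the start of its paragraph. -->
  <p><note place="margin">The voyage begins.</note>We left
     the harbour aboard the <hi>Swallow</hi> at first
     light.</p>
  <p>
    <figure entity="fig001">
      <figDesc>Engraving of a ship leaving harbour.</figDesc>
    </figure>
  </p>
  <p>By noon the coast was out of sight.</p>
</div1>
```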
NOTE ON <NOTE>:
For processing reasons, it may be desirable to move footnotes
from their original location in the text. If left at the bottom
of a page, a note may become included in another paragraph or
section of the encoded text, and thus separated from its
reference. There are options for placement of footnotes if they
are moved:
- Inline. The note is inserted at the point of reference, and its
N attribute records the note number or symbol. No <REF> element
is needed with this option.
- End-of-Paragraph. A <REF> with a TARGET attribute occurs at the
point of reference. The <NOTE>, with a matching ID attribute,
occurs within, but at the end of, the paragraph in which the
reference occurs.
- End-of-Div. The notes are moved to the end of the <DIV>.
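The first two placement options might look like this in practice. The sentences, the ID value, and the use of PLACE="foot" are illustrative assumptions, not requirements stated in these guidelines.

```sgml
<!-- Inline: the note sits at the point of reference and
     its N attribute carries the note number. -->
<p>The bridge was rebuilt in stone.<note place="foot" n="1">The
   earlier bridge was of timber.</note> Trade soon
   recovered.</p>

<!-- End-of-paragraph: a ref at the point of reference
     points to a note with a matching ID at the end of
     the same paragraph. -->
<p>The bridge was rebuilt in stone.<ref target="n1">1</ref>
   Trade soon recovered.
   <note place="foot" id="n1" n="1">The earlier bridge was of
   timber.</note></p>
```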
- V.4. LEVEL 4: Basic Content Analysis
Purpose: To create text that can stand alone as
electronic text, identifies hierarchy and typography, specifies
function of textual and structural elements, and describes the
nature of the content and not merely its appearance. This level
is not meant to encode or identify all structural, semantic or
bibliographic features of the text.
Rationale: Greater description of function and content
allows for:
- flexibility of display and delivery
- sophisticated searching within specified textual and
structural elements
- combining the broadest range of uses and audiences
Texts encoded at Level 4 are able to stand alone as part of a
library collection, and do not require images in order for them
to be read by students, scholars and general readers. This level
of TEI encoding allows them to be displayed or printed in a
variety of ways suitable for classroom or scholarly use.
Level 4 texts contain tags and attributes that describe
content. For example, lines of verse are tagged with <L>;
the <P> tag is reserved for true paragraphs. Attributes of
the text that contribute to meaning are preserved, such as
indentation of lines of verse and typography. These are textual
features that are not encoded at lower levels and that allow the
text to be used and understood fully independent of images.
The ability to stand alone as text means that Level 4 texts
can be uploaded, downloaded, and delivered quickly, and require
less storage space than collections with page images.
Finally, functionally accurate tagging in Level 4 texts allows
them to be searched or displayed in sophisticated ways. For
example, a searcher could limit his or her search in a dramatic
text to stage directions or to the speeches of a particular
character. In a volume of poetry published by subscription, a
search could be confined to names that appear in lists, thus
limiting a search to names of people who subscribed to a
particular volume. This ability to limit searches becomes more
significant as textbases become larger, and thus is of great
importance to the library community as it attempts to build into
the initial design and implementation of textbases features
needed to enhance interoperability.
Level 4 is most suitable for projects with the following
characteristics:
- the users of the texts may have limited storage or display
capabilities
- sophisticated search and display capabilities are
desired
- the collection is of interest to an existing, well-defined
community of users, such as one that would constitute a
market for any published text or collection of texts
- the collection is rare and not available to users in print or
other electronic formats
- the texts will be used for pedagogical or scholarly purposes,
not just as reading copies
- extensibility is desired; that is, one desires to keep open
the option for a higher level of tagging to be added by the
scholarly community at a later date
In considering a Level 4 TEI digitization project, an
academic library should consult with faculty members and
collection bibliographers, and ask the following question: Is
this collection of texts one that the library would purchase if
it were available commercially? If so, the benefits of a Level 4
project are many, for the result is a freely available collection
of texts owned and administered by the library community, thus
free of licensing restrictions and ongoing access charges.
- General Level 4 Recommendations:
- V.4.1. Emphasized text should be encoded as <FOREIGN>,
<TITLE>, <EMPH>, as appropriate. Any ambiguous
emphasized text should be encoded as <HI>.
- V.4.2. It is recommended that the <SIC> element be used
to indicate typographic errors, with corrections noted as the
value of the CORR attribute.
- V.4.3. <TITLEPAGE> should include the verso if present,
divided by <PB N="verso">. Tables of contents, errata,
subscription lists, "other titles by the same author" should be
included in a separate numbered DIV, as a <LIST> with
<ITEM>s. Frontispieces should be encoded as a
<FIGURE>, within a separate numbered <DIV> and
<P>.
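A hedged sketch of the general Level 4 recommendations: emphasized text tagged by function, a typographic error recorded with <SIC>, and a title page with its verso followed by a table of contents. All titles and sample text are invented for the example.

```sgml
<!-- Emphasis by function; hi is kept for the ambiguous case. -->
<p>She called <title>The Wanderer</title> a true
   <foreign>tour de force</foreign>, and <emph>insisted</emph>
   that we read to <sic corr="the">teh</sic> end.
   <hi>Finis.</hi></p>

<!-- Title page with verso, then a contents list in its
     own numbered division. -->
<front>
  <titlePage>
    <docTitle><titlePart>Poems on Several Occasions</titlePart></docTitle>
    <pb n="verso">
    <docImprint>Printed for the author.</docImprint>
  </titlePage>
  <div1 type="contents">
    <head>Contents</head>
    <list>
      <item>To the Reader</item>
      <item>Evening</item>
    </list>
  </div1>
</front>
```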
- Level 4 Prose:
- V.4.4. Letters that occur within the text body provide some
challenges. It is recommended that quoted letters that occur as
part of a text (and not collections of letters themselves) be
encoded within <q><text><body><div1 type="letter">, with
<opener>, <dateline>, <salute>, <signed>, <closer> included as
appropriate.
- V.4.5. Quotations that do not occur inline, but are set off
typographically in some way, should be encoded as <q>.
- V.4.6. Notes are to be encoded as described in Level 3.
- V.4.7. Use <argument>, <opener>, <epigraph>, <closer>,
<trailer>, <add>, <del>, and <unclear> as appropriate.
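The quoted-letter recommendation can be sketched as below. The nesting of <dateline> and <salute> inside <opener>, and of <salute> and <signed> inside <closer>, follows common TEI practice; the letter itself is invented.

```sgml
<p>She handed me the letter, which ran as follows:</p>
<q>
  <text>
    <body>
      <div1 type="letter">
        <opener>
          <dateline>London, 3 May</dateline>
          <salute>Dear Sir,</salute>
        </opener>
        <p>I have received your parcel and thank you
           for it.</p>
        <closer>
          <salute>Your obedient servant,</salute>
          <signed>A. B.</signed>
        </closer>
      </div1>
    </body>
  </text>
</q>
```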
- Level 4 Drama:
- V.4.8. Cast lists should be encoded as <LIST>s, with
<ITEM>s.
- V.4.9. Speeches are encoded as <SP>, with speakers
identified within <SPEAKER> elements.
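For drama, a minimal sketch of a cast list and two speeches might look like this. The character names, the TYPE values, and the stage direction are illustrative assumptions.

```sgml
<front>
  <div1 type="castlist">
    <head>Dramatis Personae</head>
    <list>
      <item>Anna, a merchant's daughter</item>
      <item>Bruno, her servant</item>
    </list>
  </div1>
</front>
<body>
  <div1 type="act" n="1">
    <head>Act I</head>
    <stage>Enter Anna and Bruno.</stage>
    <sp>
      <speaker>Anna</speaker>
      <p>Is the carriage ready?</p>
    </sp>
    <sp>
      <speaker>Bruno</speaker>
      <p>It waits at the gate.</p>
    </sp>
  </div1>
</body>
```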
- Level 4 Verse:
- V.4.10. All verse, even poems without separate stanzas or
verse paragraphs, should be contained within a line group element
<LG>. This will assist with automated processing and
retrieval.
- V.4.11. It is common to see informal divisions within poems,
noted by a string of asterisks or periods. These should be
encoded as <MILESTONE>s with attribute values of
UNIT="typography" and N= indicating the character used and its
occurrence, for example <MILESTONE UNIT="typography" N="******">.
- V.4.12. For <L>, it is strongly recommended that indentation
be recorded using the REND attribute.
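Taken together, the verse recommendations could be applied as in the sketch below. The poem, the TYPE values, and the indentation are invented; the MILESTONE follows the form given in V.4.11.

```sgml
<div1 type="poem" n="1">
  <head>Evening</head>
  <lg type="stanza">
    <l>The light withdraws across the hill,</l>
    <l rend="indent(1)">And all the fields are still.</l>
  </lg>
  <!-- Informal division printed as a row of asterisks. -->
  <milestone unit="typography" n="******">
  <lg type="stanza">
    <l>The lamps are lit along the lane,</l>
    <l rend="indent(1)">And night comes on again.</l>
  </lg>
</div1>
```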
- Level 4 Front and Back Matter:
- V.4.13 It is recommended that all prefaces, tables of
contents, afterwords, appendices, endnotes and apparatus be
encoded. For publisher's advertisements, indexes, and glossaries
or other front or back matter that isn't considered of primary
importance to the text, there are three options:
- Fully transcribe and encode
- Link to page images (may include an unencoded
transcription)
- Omit, with the omission noted in <editorialDecl>
- V.5. LEVEL 5: Scholarly Encoding Projects
Level 5 texts are those that require subject knowledge, and
encode semantic, linguistic, prosodic or other elements beyond a
basic structural level.
Attribute Values
- TYPE
Constructing a list of acceptable attribute values for TYPE that
could find wide agreement is impossible. Instead, it is
recommended that projects describe the TYPE attribute values used
in their texts in the project documentation and that this list be
made available to people using the texts. See ABC for Book
Collectors by John Carter (7th edition, New Castle, DE: Oak Knoll
Books, 1995) for a list of standard names and definitions of
bibliographic features of printed books. For those elements where
TYPE is not required, such as <HEAD> and <TITLE>, use TYPE values
only for subtitles and additional titles, not for main titles.
- REND
The REND attribute becomes difficult to use when it is desirable
to record more than one rendition feature. With this in mind, it
is recommended that projects employ the following adaptation of
"rendition ladders", a concept developed at the Brown Women
Writers Project <http://www.wwp.brown.edu>. This concept allows
strings of rendition features to be included as one REND value.
Rendition ladders consist of categories of renditions, with
further defined values included in parentheses.
REND should only be used to override a default value. For
instance, if all text encoded as <HI> is defined as being
rendered in italics, there is no reason to encode text as
<HI REND="font(italics)">
Combining attributes would result in a tag with attributes such
as this:
<L REND="font(italics)align(right)">
- FONT
italics, bold, fsc (full and smallcaps), smallcap, underlined, gothic
- ALIGN
right, left, center, block
- INDENT
Values in parentheses should indicate the number of tabstops to
be indented, e.g., <L REND="indent(1)">
- LANG
Use ISO 639-2 three-character language codes.