TEI Text Encoding in Libraries
Guidelines for Best Encoding Practices
Version 1.0 (July 30, 1999)
Comments to Perry Willett, Indiana University (email: pwillett@indiana.edu)
- Introduction
- Participants
- Recommendations
- General Recommendations
- Encoding Levels
- Level 1: Fully Automated Conversion and Encoding
- Level 2: Minimal Encoding
- Level 3: Simple Analysis
- Level 4: Basic Content Analysis
- Level 5: Scholarly Encoding Projects
- Attribute Values
At the TEI and XML
in Digital Libraries Workshop held at the Library of Congress on June
30-July 1, 1998, three working groups were formed. Group 2 was charged with
developing a set of recommendations for libraries using the TEI Guidelines in
electronic text encoding. Representatives from six libraries met at the Library
of Congress on November 12-13, 1998. The Task Force met again at ALA mid-winter
(January 1999) to incorporate comments and finalize the draft. The revised
recommendations were circulated to the conference working group in May 1999 and
presented at the joint annual meeting of the Association for Computers and the
Humanities and the Association for Literary and Linguistic Computing in June 1999.
Version 1.0 was circulated for comments in August 1999.
- LeeEllen Friedland, Library of Congress
- Nancy Kushigian, University of California, Davis
- Christina Powell, University of Michigan
- David Seaman, University of Virginia
- Natalia Smith, University of North Carolina at Chapel Hill
- Perry Willett, Indiana University
Our recommendations are for libraries using the TEILite DTD v1.6. Library text
digitization projects are numerous and serve different purposes. With
this in mind, the Task Force has attempted to make these recommendations as
inclusive as possible by developing a series of encoding levels. These levels
are meant to allow for a range of practice, from wholly automated text creation
and encoding, to encoding that requires expert content knowledge, analysis, and
editing.
Encoding levels 1-4 require no expert knowledge of content. Level 5, in
contrast, requires scholarly analysis. Texts at Levels 1-4 can be converted and
encoded without the assistance of content experts, and they can be enriched
with more markup at any time. Recommendations for Levels 1-4 are
intended for projects wishing to create encoded electronic text with structural
markup, but minimal semantic or content markup. Also, the encoding levels are
cumulative: encoding requirements at each level incorporate the requirements of
lower levels.
These recommendations are concerned with the text portion of a TEI-encoded
document. While there are modest requirements for including certain information
about encoding level in the TEI Header, a separate set of recommendations has
been developed to address the relationship of TEI Header contents to MARC-format
bibliographic data (see the TEI/MARC Best Practices Document from Working Group 1).
- The encoding level (as described in this document) should be recorded in
the <editorialDecl>, along with an explanation of any deviation from the
recommendations.
- Electronic text at all levels of encoding should begin with the
transcription of the first word on the first leaf of the original work. It may
be impractical or undesirable to transcribe and encode certain features of the
text, such as publisher's advertisements or indexes, but if at all possible,
they should be included as links to page images. Any omissions of material
found in the original work should be noted in the <editorialDecl> in the
TEI Header.
- File naming should follow ISO 9660 conventions: 8-character filenames,
3-character extensions, using A-Z, a-z, 0-9, underscores and hyphens.
- Numbered <DIV>s present advantages to search and indexing software
by explicitly communicating the hierarchical level of the section described.
One anomaly of the TEI Guidelines is that <DIV0> is not available in
<FRONT> or <BACK> matter. Therefore, we recommend the use of
numbered <DIV>s throughout the electronic text, always beginning with
<DIV1>. Texts at all levels should include at least one <DIV1>.
- Page breaks <PB> should occur at the top of the page, and entirely
within any DIV.
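The file naming recommendation above can be checked mechanically. The following is a minimal sketch, assuming that "8-character" and "3-character" are upper limits rather than exact lengths; the function name `is_valid_filename` and the sample file names are hypothetical and not part of these recommendations.

```python
import re

# Base name of 1-8 characters and extension of 1-3 characters, drawn from
# A-Z, a-z, 0-9, underscore and hyphen, as listed in the recommendation.
# Assumption: names and extensions shorter than the maximum are acceptable.
FILENAME_RE = re.compile(r"[A-Za-z0-9_-]{1,8}\.[A-Za-z0-9_-]{1,3}")


def is_valid_filename(name: str) -> bool:
    """Return True if name fits the 8.3 naming convention described above."""
    return FILENAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("carter01.sgm", "longfilename.sgml", "my file.sgm"):
        print(candidate, is_valid_filename(candidate))
```

A check of this kind could be run over a directory listing before files are added to a collection.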
- V.1. LEVEL 1: Fully Automated Conversion and Encoding
Purpose: To create electronic text with the primary purpose of
keyword searching and linking to page images. The primary advantage in using
the TEILite DTD at this level is that a TEI Header is attached to the text
file.
Rationale: The text is subordinate to the page image, and is not
intended to stand alone as an electronic text (without page images).
Texts at Level 1 can be created and encoded by fully automated means, using
uncorrected OCR of page images ("dirty OCR"), or exporting from existing
electronic text files. Only those tags that are necessary to divide the text
from the header and facilitate linking to page images are used. Encoding is
performed automatically based on artifacts of the OCR or other document
creation process (page breaks, for example) and metadata collected during the
imaging or preparation process. This encoding is both minimal and reliable,
and does not typically require extensive review of each page of each text.
Level 1 texts are not intended to be adequate for textual analysis; they
are more likely to be suited to the goals of a preservation unit or mass
digitization initiative. Though their encoding is minimal, Level 1 texts are
fully valid SGML texts. In addition to taking advantage of the TEI Header,
using the TEILite DTD allows Level 1 texts to be compatible with more richly
encoded TEILite texts for searching, for example. Further encoding based on
document structures or content analysis can be added to a Level 1 text at any
time.
Level 1 is most suitable for projects with the following
characteristics:
- a large volume of material is to be made available online quickly
- a digital image of each page is desired
- no manual intervention will be performed in the text creation process
- the material is of interest to a large community of users who wish to
read texts that allow keyword searching
- sophisticated search and display capabilities based on the structure of
the text are not necessary
- extensibility is desired; that is, one desires to keep open the option
for a higher level of tagging to be added at a later date
Element | Recommendation
<DIV1> | Type="section" is the default attribute value.
<P> | One "container" element per DIV is required.
<PB> | This is required in Level 1. Page images can be linked to the text using ID/IDREF or ENTITYREF attributes. Using ENTITYREF has advantages for maintaining large numbers of image files, but would require modifying the TEILite DTD.
<FIGURE> | This element is optional at Level 1. The advantage of using <FIGURE> is the ability to record metadata using <FIGDESC>.
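To illustrate how little markup Level 1 involves, the following is a rough sketch of a script that wraps uncorrected OCR output in the required elements. It is not a prescribed workflow. The function name `level1_text`, the `pgNNNN` ID scheme, and the input format (a list of page label and OCR text pairs) are assumptions made for this example. Only the text portion is produced; the TEI Header and document type declaration are omitted.

```python
from xml.sax.saxutils import escape


def level1_text(pages):
    """Build the <TEXT> portion of a Level 1 file.

    pages is a list of (page_label, ocr_text) pairs in reading order.
    Assumptions: page labels contain no quotation marks or markup
    characters, and the entities &amp;, &lt; and &gt; are declared for
    the DTD in use.
    """
    parts = ["<TEXT>", "<BODY>", '<DIV1 TYPE="section">', "<P>"]
    for seq, (label, ocr_text) in enumerate(pages, start=1):
        # One <PB> at the top of each page, inside the single DIV1 and P.
        # The ID (hypothetical "pgNNNN" scheme) gives page images a link target.
        parts.append('<PB N="%s" ID="pg%04d">' % (label, seq))
        # Escape markup characters that may occur in raw OCR output.
        parts.append(escape(ocr_text))
    parts += ["</P>", "</DIV1>", "</BODY>", "</TEXT>"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(level1_text([("i", "PREFACE."), ("1", "CHAPTER I. Tom & Jerry")]))
```

The single <DIV1> and single <P> reflect the minimum stated in the table above; a higher encoding level would replace them with real divisions and paragraphs.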
- V.2. LEVEL 2: Minimal Encoding
Purpose: To create electronic text for keyword searching, linking to
page images, and identifying simple structural hierarchy to improve
navigation.
Rationale: The text is subordinate to the page image, though
navigational markers (textual divisions, heads) are captured. The text could
stand alone as electronic text (without page images) if the accuracy of its
contents is suitable to its intended use and it is not necessary to display
low-level typographic or structural information. Level 2 requires a set of
elements more granular than those of Level 1, including bibliographic or
structural information below the monographic or volume level, but still does
not require a specialist to identify.
Though texts at Level 2 can be created and encoded by automated means,
based on the typographic elements in the electronic file (for example, bold
centered text at the top of the page surrounded by whitespace indicates a new
chapter head, and thus a new division), automated encoding is unlikely to be
fully reliable across a large body of material. Level 2 encoding therefore
requires some human intervention to identify each textual division and heading.
Level 2 texts do not require any specialist knowledge or manual intervention
below the section level.
Level 2 texts can be displayed separately from their page images. Even when
displayed with page images, Level 2 encoding of sections and heads provides
greater navigational possibilities than Level 1 encoding, and enables
searching to be restricted within particular textual divisions (for example,
searching for two phrases within the same chapter).
Level 2 is most suitable for projects with the following
characteristics:
- a large volume of material is to be made available online quickly
- a digital image of each page is desired
- the material is of interest to a large community of users who wish to
read texts that allow keyword searching
- rudimentary search and display capabilities based on the large
structures of the text are desired
- each text will be checked to ensure that divisions and headers are
properly identified
- extensibility is desired; that is, one desires to keep open the option
for a higher level of tagging to be added at a later date
All elements specified in Level 1 plus the following:
Element | Recommendation
<FRONT>, <BACK> | Optional
<HEAD> | Required if present
<DIV1> | Type="section" is the default attribute value. It is recommended that the N attribute be included to record the div sequence.
<P> | One "container" element per DIV is required.
- V.3. LEVEL 3: Simple Analysis
Purpose: To create text that can stand alone as electronic text and
identifies hierarchy and typography without content analysis being of primary
importance.
Rationale: Level 3 texts can be created from scratch or by the
relatively easy conversion of existing HTML or word-processing documents.
Encoding offers the advantage of the TEI Header, interoperability with other
TEI collections, and extensibility to higher levels of encoding. Level 3
generally requires some human editing, but the features to be encoded are
determined by the appearance of the text and not specialized content
analysis.
Level 3 texts identify front and back matter, and all paragraph breaks. The
finer granularity of tagging these features, as well as figures, notes, and
all changes of typography, allows a range of options for display, delivery,
and searching. For example, one has the option of identifying and, therefore,
specifying the display characteristics of different typographic styles, and
regularizing the display and placement of note text.
Level 3 texts can stand alone as text without page images and, therefore,
can be uploaded, downloaded and delivered quickly, and require less storage
space than digital collections with page images. However, the simple level of
structural analysis and absence of specialized content analysis reflected in
Level 3 tagging may make it desirable for some, depending on project
priorities, to include page images in order to provide users with a fuller set
of resources.
Level 3 is most suitable for projects with the following
characteristics:
- the material is of interest to a large community of users who wish to
read texts that allow keyword searching
- some sophistication of display, delivery, and searching based on
structure of the text is desired
- each text will be checked to ensure that tagging decisions have been
made appropriately
- the users of the texts may have limited storage or display capabilities
- the creator of the texts has limited or no ability to provide content
specialists to analyze, tag, or review texts
- extensibility is desired; that is, one desires to keep open the option
for a higher level of tagging to be added at a later date
All elements specified in Levels 1 and 2, plus the following:
Element | Recommendation
<FRONT>, <BACK> | Required if present.
<P> | Required for paragraph breaks in prose; may be used for stanzas using <LB> for line breaks in verse.
<FIGURE> | Required to indicate figures other than page images.
<HI> | Required to indicate changes in typeface. REND attribute is optional.
<NOTE> | All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one <NOTE> element. Marginal notes, without reference, should occur at the beginning of the paragraph to which they refer, with the value of the PLACE attribute as "margin".
NOTE ON <NOTE>:
For processing reasons, it may be desirable to move footnotes from their
original location in the text. If left at the bottom of a page, a note may
become included in another paragraph or section of the encoded text, and thus
separated from its reference. There are three options for placement of
footnotes if they are moved:
- Inline. The note is inserted at the point of reference. The N attribute
records the note's number or symbol. No <REF> element is needed with this option.
- End-of-Paragraph. A <REF> with a TARGET attribute occurs at the point of
reference. The <NOTE>, with an ID attribute, occurs within, but at the end of,
the paragraph in which the reference occurs.
- End-of-Div. Notes are moved to the end of the <DIV>.
- V.4. LEVEL 4: Basic Content Analysis
Purpose: To create text that can stand alone as electronic text,
identifies hierarchy and typography, specifies function of textual and
structural elements, and describes the nature of the content and not merely
its appearance. This level is not meant to encode or identify all structural,
semantic or bibliographic features of the text.
Rationale: Greater description of function and content allows
for:
- flexibility of display and delivery
- sophisticated searching within specified textual and structural elements
- combining the broadest range of uses and audiences
Texts encoded at Level 4 are able to stand alone as part of a library
collection, and do not require images in order for them to be read by
students, scholars and general readers. This level of TEI encoding allows them
to be displayed or printed in a variety of ways suitable for classroom or
scholarly use.
Level 4 texts contain tags and attributes that describe content. For
example, lines of verse are tagged with <L>; the <P> tag is
reserved for true paragraphs. Attributes of the text that contribute to
meaning are preserved, such as indentation of lines of verse and typography.
These are textual features that are not encoded at lower levels and that allow
the text to be used and understood fully independent of images.
The ability to stand alone as text means that Level 4 texts can be
uploaded, downloaded, and delivered quickly, and require less storage space
than collections with page images.
Finally, functionally accurate tagging in Level 4 texts allows them to be
searched or displayed in sophisticated ways. For example, a searcher could
limit his or her search in a dramatic text to stage directions or to the
speeches of a particular character. In a volume of poetry published by
subscription, a search could be confined to names that appear in lists, thus
limiting a search to names of people who subscribed to a particular volume.
This ability to limit searches becomes more significant as textbases become
larger, and thus is of great importance to the library community as it
attempts to build into the initial design and implementation of textbases
features needed to enhance interoperability.
Level 4 is most suitable for projects with the following
characteristics:
- the users of the texts may have limited storage or display capabilities
- sophisticated search and display capabilities are desired
- the collection is of interest to an existing, well-defined community of
users, such as one that would constitute a market for any published text or
collection of texts
- the collection is rare and not available to users in print or other
electronic formats
- the texts will be used for pedagogical or scholarly purposes, not just
as reading copies
- extensibility is desired; that is, one desires to keep open the option
for a higher level of tagging to be added by the scholarly community at a
later date
In considering a Level 4 TEI digitization project, an academic library
should consult with faculty members and collection bibliographers, and ask the
following question: Is this collection of texts one that the library would
purchase if it were available commercially? If so, the benefits of a Level 4
project are many, for the result is a freely available collection of texts
owned and administered by the library community, thus free of licensing
restrictions and on-going access charges.
- General Level 4 Recommendations:
- V.4.1. Emphasized text should be encoded as <FOREIGN>, <TITLE>, or
<EMPH>, as appropriate. Any ambiguous emphasized text should be encoded as
<HI>.
- V.4.2. It is recommended that the <SIC> element be used to
indicate typographic errors, with corrections noted as the value of the CORR
attribute.
- V.4.3. <TITLEPAGE> should include the verso if present, divided
by <PB N="verso">. Tables of contents, errata, subscription
lists, "other titles by the same author" should be included in a separate
numbered DIV, as a <LIST> with <ITEM>s. Frontispieces should be
encoded as a <FIGURE>, within a separate numbered <DIV> and
<P>.
- Level 4 Prose:
- V.4.4. Letters that occur within the text body provide some challenges.
It is recommended that quoted letters that occur as part of a text (and not
collections of letters themselves) be encoded within
<q><text><body><div1 type="letter">, with
<opener>, <dateline>, <salute>, <signed>,
<closer> included as appropriate.
- V.4.5. Quotations that do not occur inline, but are set off
typographically in some way, should be encoded as <q>.
- V.4.6. Notes are to be encoded as described in Level 3.
- V.4.7. Use <Argument>, <Opener>, <epigraph>,
<closer>, <trailer>, <add>, <del>, <unclear>
as appropriate.
- Level 4 Drama:
- V.4.8. Cast lists should be encoded as <LIST>s, with
<ITEM>s.
- V.4.9. Speeches are encoded as <SP>, with speakers identified
within <SPEAKER> elements.
- Level 4 Verse:
- V.4.10 All verse, even poems without separate stanzas or verse
paragraphs, should be contained within a line group element <LG>. This
will assist with automated processing and retrieval.
- V.4.11 It is common to see informal divisions within poems, noted by a
string of asterisks or periods. These should be encoded as
<MILESTONE>s with the attribute value UNIT="typography" and an N value
indicating the character used and its occurrence, for example
<MILESTONE UNIT="typography" N="******">.
- V.4.12 For <L>, it is strongly recommended that indentation be recorded
using the REND attribute.
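The milestone recommendation in V.4.11 lends itself to automation. The sketch below shows one possible way to detect a divider line and emit the corresponding tag. The function name `milestone_for` is hypothetical, and the choice to drop whitespace between the characters when building the N value is an assumption rather than part of the recommendation.

```python
import re

# A line consisting only of repeated asterisks or repeated periods,
# optionally separated by whitespace. At least two characters are required
# so that a stray period is not mistaken for a divider.
DIVIDER_RE = re.compile(r"\s*([*.])(?:\s*\1)+\s*")


def milestone_for(line):
    """Return a <MILESTONE> tag for a divider line, or None for other lines."""
    if DIVIDER_RE.fullmatch(line) is None:
        return None
    # Assumption: whitespace between the characters is not recorded in N.
    marks = "".join(line.split())
    return '<MILESTONE UNIT="typography" N="%s">' % marks


if __name__ == "__main__":
    for text in ("******", " * * * ", "And then the sea."):
        print(repr(text), "->", milestone_for(text))
```

Lines that mix asterisks and periods are left untouched by this sketch and would need manual review.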
- Level 4 Front and Back Matter:
- V.4.13 It is recommended that all prefaces, tables of contents,
afterwords, appendices, endnotes and apparatus be encoded. For publisher's
advertisements, indexes, and glossaries or other front or back matter that
isn't considered of primary importance to the text, there are three options:
- Fully transcribe and encode
- Link to page images (may include an unencoded transcription)
- Omit, noted in <editorialDecl>
- V.5. LEVEL 5: Scholarly Encoding Projects
Level 5 texts are those that require subject knowledge, and encode
semantic, linguistic, prosodic or other elements beyond a basic structural
level.
- TYPE
Constructing a list of acceptable attribute values for TYPE that
could find wide agreement is impossible. Instead, it is recommended that
projects describe the TYPE attribute values used in their texts in the project
documentation and that this list be made available to people using the texts.
See ABC for Book Collectors by John Carter (7th edition, New
Castle, DE: Oak Knoll Books, 1995) for a list of standard names and
definitions of bibliographic features of printed books. For those elements
where TYPE is not required, such as <HEAD> and <TITLE>, use the
attribute values for subtitles and additional titles, but not main titles.
- REND
Difficulty using REND attributes occurs when it is desirable to
record more than one rendition feature. With this in mind, it is recommended
that projects employ the following adaptation of "rendition ladders", a
concept developed at the Brown Women
Writers Project <http://www.wwp.brown.edu>. This concept allows for
strings of rendition features to be included as one REND value. Rendition
ladders consist of categories of renditions, with further defined values
included in parentheses. REND should only be used to override a default
value. For instance, if all text encoded as <HI> is defined as being
rendered in italics, there is no reason to encode text as
<HI REND="font(italics)">.
Combining attributes would result in a tag with attributes such as this:
<L REND="font(italics)align(right)">
- FONT
italics, bold, fsc (full and smallcaps), smallcap, underlined,
gothic
- ALIGN
right, left, center, block
- INDENT
Values in parentheses should indicate the number of tabstops to
be indented, e.g., <L REND="indent(1)">
- LANG
Use ISO 639-2 three-character language codes.
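Because a rendition ladder packs several features into one REND value, software that displays or indexes the text has to take the string apart again. The following is a minimal sketch of parsing and building such values, using the examples given under REND above. The function names `parse_rend` and `build_rend` are hypothetical, and the sketch assumes that category names are alphabetic and that values contain no nested parentheses.

```python
import re

# One rung of the ladder: a category name followed by a value in parentheses.
RUNG_RE = re.compile(r"([A-Za-z]+)\(([^()]*)\)")
# A whole ladder: one or more rungs with nothing in between.
LADDER_RE = re.compile(r"(?:[A-Za-z]+\([^()]*\))+")


def parse_rend(value):
    """Split a REND value such as 'font(italics)align(right)' into a dict."""
    if LADDER_RE.fullmatch(value) is None:
        raise ValueError("not a rendition ladder: %r" % value)
    return dict(RUNG_RE.findall(value))


def build_rend(features):
    """Join a mapping of category -> value back into a single REND string."""
    return "".join("%s(%s)" % (name, val) for name, val in features.items())


if __name__ == "__main__":
    print(parse_rend("font(italics)align(right)"))
    print(build_rend({"indent": "1"}))
```

Rejecting strings that are not well-formed ladders, rather than guessing at their meaning, makes encoding errors visible early.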
Please send comments or suggestions.