Caroline Arms and Carl Fleischhauer, Office of Strategic
Initiatives, Library of Congress.
The Library of Congress is drafting a planning
framework that identifies and documents digital content formats that are
promising (and unpromising) for long-term sustainability. The resulting reference resource is intended
to serve staff who evaluate born digital content for
selection for the Library's collections and make provisions to sustain that
The term format is used broadly in this context,
formats at the level indicated by Windows file extensions or Internet media type (aka MIME type)
or subtypes of these that develop through time or are tailored to narrow,
of related formats, whose familial characteristics are important
that must be distinguished in terms of their underlying bitstream
formats that bind together the files or bitstreams
comprising a single digital work
The initial investigation has outlined two sets
of high-level factors that may be used when choosing formats:
(a) conceptual factors that
affect the sustainability of any digital format
(b) factors that relate to
quality or functionality (beyond normal rendering) that might be desired for
certain categories of content
The shorthand names for the sustainability
factors are disclosure, adoption, transparency, self-documentation, external
dependencies, and technical protection mechanisms. Quality and functionality factors have been sketched
for sound, still images, text, and video -- content categories with which
Library staff have experience in the digital realm.
Additional content categories are being added as
the investigation continues. The
activity is also developing summary descriptions for digital formats, intended
to be synergistic with the proposed Digital Format Registry.
Stephen L. Abrams, Digital Library Program Manager, Harvard University Library
The concept of representation format, or type,
permeates all technical areas of digital repositories. Policy and processing decisions regarding
object ingest, storage, access, and preservation are frequently conditioned on
a per-format basis. In order to achieve
necessary operational efficiencies, repositories need to be able to automate
these procedures to the fullest extent possible.
JSTOR and the Harvard University Library are
collaborating on a project to develop an extensible framework for format
validation: JHOVE (pronounced "jove"), the
JSTOR/Harvard Object Validation Environment.
JHOVE provides functions to identify, validate,
and characterize digital objects:
Format identification is the process of determining the format to which a digital object conforms: "I have a digital object; what format is it?"
Format validation is the process of determining the level of compliance of a digital object to the specification for its purported format: "I have an object purportedly of format F; is it?"
Format characterization is the process of determining the format-specific significant properties of an object of a given format: "I have an object of format F; what are its salient properties?"
JHOVE is a stand-alone, command-line
oriented Java application with an extensible plug-in architecture. In its initial release, JHOVE includes
modules for recognizing and validating ASCII and UTF-8 encoded text, TIFF
(including popular public profiles, such as TIFF/EP, TIFF/IT, and GeoTIFF), and PDF (including profiles such as PDF/X-1, -1a,
-2, and -3).
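The three functions described above can be sketched in a few lines of Python. This is only an illustration of the identify / validate / characterize split under a plug-in design; it is not JHOVE's API (JHOVE itself is a Java application), and the class names, the `MODULES` list, and the reported properties are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Outcome of checking one byte string against one format module."""
    format_name: str
    valid: bool
    properties: dict = field(default_factory=dict)


class AsciiModule:
    """Hypothetical module: accepts byte strings where every byte is < 0x80."""
    name = "ASCII"

    def check(self, data: bytes) -> Result:
        valid = all(b < 0x80 for b in data)
        props = {"line_count": data.count(b"\n") + 1} if valid else {}
        return Result(self.name, valid, props)


class Utf8Module:
    """Hypothetical module: accepts byte strings that decode as UTF-8."""
    name = "UTF-8"

    def check(self, data: bytes) -> Result:
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            return Result(self.name, False)
        return Result(self.name, True, {"character_count": len(text)})


# Most specific module first, so plain ASCII is not reported as UTF-8.
MODULES = [AsciiModule(), Utf8Module()]


def identify(data: bytes):
    """Identification: return the first format the object conforms to, or None."""
    for module in MODULES:
        if module.check(data).valid:
            return module.name
    return None


def validate(data: bytes, purported: str) -> bool:
    """Validation: does the object comply with its purported format?"""
    module = next(m for m in MODULES if m.name == purported)
    return module.check(data).valid


def characterize(data: bytes, format_name: str) -> dict:
    """Characterization: report format-specific properties of the object."""
    module = next(m for m in MODULES if m.name == format_name)
    return module.check(data).properties
```

Adding a format in this sketch means appending one more module object to `MODULES`, which is the sense in which a plug-in architecture is extensible.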
2:00-3:30: BREAKOUT SESSION 2: RESOURCE MANAGEMENT. Alvarado D.
Patrick Hochstenbach, Herbert Van de Sompel. Los Alamos National Laboratory, Research Library
Various XML-based approaches aimed
at representing so-called complex digital objects have emerged over the last
few years. The MPEG-21 Digital Item
Declaration Language (DIDL) is an XML-packaging specification that, so far, has
received little attention in the digital library community. The first part of this presentation will
highlight major characteristics of DIDL, and report on research conducted at
the LANL Research Library to determine the applicability of DIDL for the
representation of complex objects in the LANL repository. The second part of the presentation will
discuss a repository architecture under development at
LANL, in which DIDL-conformant documents are the unit of storage. The architecture builds on the OAI-PMH, the
forthcoming NISO OpenURL Framework Standard, and
concepts from the MPEG-21 Digital Item Processing specification to make stored
content accessible. While the discussion will be framed in the context of
ongoing work at LANL, it is hoped that it will reveal the relevance of some of
the presented concepts to other Digital Library efforts.
Thanks to a generous grant from the Andrew W.
Mellon Foundation, the University of Virginia Library and Cornell Information
Science have now made available an open-source version of the Flexible
Extensible Digital Object Repository Architecture (Fedora). This presentation is an update on the Fedora
project since its initial release on May
Fedora is an open source digital object
repository system that supports both management and delivery of heterogeneous
digital content. The system can be used
as the foundation for a variety of information management solutions including
institutional repositories, preservation management systems, digital asset
management systems, content management systems (CMS), and digital
As a simple use case, Fedora can be used to
manage and deliver digital objects that aggregate one or more content streams
into a single digital object. For
example, a Fedora object can be an aggregation of different resolutions of the
same digital image. More interestingly,
Fedora provides the building blocks for managing and delivering more complex
objects which have services associated with them. These objects can define relationships among
content streams and can be configured to deliver one or more specialized views
or "representations" of the content streams within the object.
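The aggregation idea in this abstract can be pictured with a small data structure. The sketch below is not Fedora's object model or API; `DigitalObject`, `add_stream`, `add_representation`, and `deliver` are hypothetical names chosen only to show one object holding several content streams and exposing named views computed from them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Streams = Dict[str, bytes]


@dataclass
class DigitalObject:
    """Hypothetical object that aggregates named content streams."""
    identifier: str
    streams: Streams = field(default_factory=dict)
    # Each representation is a function that builds a view from the streams.
    representations: Dict[str, Callable[[Streams], bytes]] = field(default_factory=dict)

    def add_stream(self, name: str, content: bytes) -> None:
        self.streams[name] = content

    def add_representation(self, name: str, builder: Callable[[Streams], bytes]) -> None:
        self.representations[name] = builder

    def deliver(self, representation: str) -> bytes:
        """Return the requested view, computed from the stored streams."""
        return self.representations[representation](self.streams)


def smallest_stream(streams: Streams) -> bytes:
    """Example view: pick the smallest stream, e.g. the lowest resolution."""
    return min(streams.values(), key=len)


# One object aggregating two resolutions of the same (placeholder) image.
image = DigitalObject("demo:1")
image.add_stream("thumbnail", b"\x00" * 16)
image.add_stream("full", b"\x00" * 1024)
image.add_representation("preview", smallest_stream)
```

Here `image.deliver("preview")` returns the 16-byte placeholder stream; a real system would attach services rather than plain functions.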
Fedora provides web service interfaces to
support digital object management and access. New features of the system include content versioning, object change
tracking, time-stamped access requests (i.e., to obtain former views of
objects), and a new graphical object editor.
This presentation will offer a brief review of
the system features, followed by a demo of the latest software release (Fedora
1.2, available October 2003). It will
also report on the progress of institutions that have installed Fedora, software
download statistics, and future development plans. Fedora is currently available for download
: BREAKOUT SESSION 4: PRESERVATION. Alvarado Room D
LOCKSS Implementation: technology, collections, and access.
Tom Robertson, Technical Manager, LOCKSS Program,
Stanford University; Perry Willett, Head, Library Electronic Text Resource
Service, Indiana University; Martin Halbert, Director for Library Systems, Emory University.
An unintended consequence of the web is that
libraries cannot easily collect and preserve e-collections. They lease access to paid content; they
merely access freely available content. Libraries are unable to own
collections; they are able to offer only very limited services around these
collections. Libraries in the web
environment are unable to fulfill their traditional role as society's memory
organizations. One of the biggest risks to libraries fulfilling their memory role, or
for any institution wishing to take responsibility for digital preservation, is
a budget cut. The LOCKSS approach tries
to prevent content being lost through budget cuts by dispersing all costs and
responsibilities across many institutions. The system's robustness depends upon
redundancy of hardware, software, content, and administration.
The panel will present a technical overview and
real-world implementation activities: Collection Development activities in the
Humanities; Providing seamless end user access to locally stored content.
Tom Robertson, Technical Manager, LOCKSS Program, Stanford University: Tom will outline the
LOCKSS technology, the Program's status, and next steps.
Perry Willett, Head, Library Electronic Text Resource
Service, Indiana University: Perry co-chairs a group selecting American
and British literature e-journals for LOCKSS preservation. He will discuss the importance of preserving
born-digital humanities materials and the process of obtaining publisher
Martin Halbert, Director for Library Systems, Emory University: Emory is prototyping
various methods of configuring LOCKSS caches within institutional networks so
readers can access "lockss-cached" materials when the publisher's
site is unavailable. Martin will outline
experiences and methods to integrate LOCKSS using PAC files, EZproxy, and
Squid and describe future work with OpenURLs.
: Reception. The Franciscan Ballroom, Sheraton Old Town.
Although data has long been an important element
of social science research and instruction, the nature of data needs within the
social sciences has changed dramatically in recent years. A major trend is the dramatic increase in
demand for numeric data by undergraduates who use and manipulate it in their
own research. The number of courses that
include data intensive assignments has also increased. As technology has developed, researchers are
increasingly looking for efficient ways to share data electronically with local
and distant colleagues. Finally,
researchers and librarians alike are recognizing the need to create electronic
archives of available data.
The Data Extraction Web Interface (DEWI) System
is a suite of tools for the processing, preservation, and delivery of
Stanford's social science numeric data collection. DEWI provides an integrated point of service
for data users, by allowing users to browse lists of variables, search for
variables of interest, and create custom sub-sets of data which can be
downloaded to personal computers in a variety of formats compatible with
popular statistical software. In
addition, supporting documentation in the form of codebooks, external links,
and locally developed guides to the data are available for most data sets.
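The variable search and custom subsetting described above can be illustrated with a short sketch. It does not reflect how DEWI is implemented; the function names, the codebook dictionary, and the sample data in the usage note are assumptions for illustration only.

```python
import csv
import io


def search_variables(codebook, term):
    """Return variable names whose label contains the term (case-insensitive)."""
    term = term.lower()
    return [name for name, label in codebook.items() if term in label.lower()]


def extract_subset(source_csv, variables, delimiter=","):
    """Return delimited text holding only the requested columns, in the order given."""
    reader = csv.DictReader(io.StringIO(source_csv))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=variables,
        delimiter=delimiter,
        extrasaction="ignore",  # silently drop columns that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()
```

For example, with a made-up data set `"id,age,income\n1,34,52000\n2,51,61000\n"`, calling `extract_subset(data, ["id", "income"])` keeps only those two columns, and passing `delimiter="\t"` produces a tab-separated file of the kind many statistical packages can import.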
DEWI can also be used to restrict access to
datasets to selected users. This allows us to ingest and serve current data
collected by faculty, so that research teams can use DEWI to access, control,
and archive their data before releasing it for public use. This feature of DEWI encourages early, and
therefore more accurate and complete, creation of metadata and documentation.
In this presentation, we will describe the
development of DEWI, discuss how DEWI has been used within the Stanford
research and teaching community, and discuss some of the directions--including
broader collaborative efforts--we are exploring in the future development of
DEWI. We will also discuss how DEWI represents one approach in the search for
solutions to the kinds of challenges identified at the 1999 Digital Library
Federation Workshop on Social Science Data Archives.
Collections with Greenstone Digital Library Software.
The University of Chicago Library recently launched a new
digital collection of musical scores, Chopin Early Editions, using emerging
standards for building digital collections from reformatted materials.
Descriptive metadata were extracted from the OPAC, cross-walked to MODS, and
combined with structural metadata to create METS objects. Using XSLT, METS
records were converted into the Greenstone Archival Format and loaded into
Greenstone Digital Library Software.
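As a rough picture of what such a conversion has to read, the sketch below pulls a MODS title and a page order out of a METS document. It uses Python's standard `xml.etree.ElementTree` rather than XSLT, and it stops at a plain dictionary instead of producing the Greenstone Archival Format; the sample record, the `TYPE="page"` convention, and the `summarize` function are assumptions, not the project's actual data or stylesheet.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for METS and MODS version 3.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Made-up record: one MODS title plus a two-page structural map.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="dmd1"><mets:mdWrap MDTYPE="MODS"><mets:xmlData>
    <mods:mods><mods:titleInfo><mods:title>Sample score</mods:title></mods:titleInfo></mods:mods>
  </mets:xmlData></mets:mdWrap></mets:dmdSec>
  <mets:structMap><mets:div TYPE="score">
    <mets:div TYPE="page" ORDER="2" LABEL="Page 2"/>
    <mets:div TYPE="page" ORDER="1" LABEL="Title page"/>
  </mets:div></mets:structMap>
</mets:mets>"""


def summarize(mets_xml):
    """Extract the descriptive title and the ordered page labels."""
    root = ET.fromstring(mets_xml)
    title = root.findtext(
        ".//mods:mods/mods:titleInfo/mods:title", default="", namespaces=NS
    )
    pages = [
        (int(div.get("ORDER")), div.get("LABEL", ""))
        for div in root.iterfind(".//mets:structMap//mets:div[@TYPE='page']", NS)
    ]
    # Sort by the ORDER attribute so pages come out in reading order.
    return {"title": title, "pages": [label for _, label in sorted(pages)]}
```

The descriptive section supplies what is searched and browsed, while the structural map supplies the page sequence needed for page-turning navigation.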
Greenstone is a well-established, configurable,
and customizable system for creating digital libraries. It accepts documents in a number of common
formats and its own internal format. With no customization, Greenstone provides
full text searching, metadata-based searching and browsing, support for custom
metadata, and hierarchical or page-turning document navigation. Because of Greenstone's great flexibility,
and with support from an active and knowledgeable user community, most of the
desired interface features for this collection have been implemented.
We have added custom metadata, modified browse
and search results displays, and implemented custom document displays and
intra-document navigation features. Modifications were accomplished through
editing configuration files, modifying display macros, and custom generation of
the internal Greenstone document format. More recently, Greenstone has added a tool to aid the creation and
administration of collections, including direct editing of document metadata.
SESSION 6: METADATA & PRODUCTION. Alvarado
Metadata Tradeoffs in High-Production
Nancy J. Hoebelheinrich, Metadata Coordinator, Stanford University Libraries / Academic Information Resources (SUL/AIR)
A common myth exists from the Library point of
view that the more metadata that can be captured for a digital object the
better -- for purposes of resource identification, selection, rendering, and
reconstruction. But, how much metadata
is really necessary and practicable to capture and/or create in high-production
digitization environments for those purposes? What are economically feasible, realistic workflow scenarios that will
scale to a faster pace and enable enough metadata to be captured and created to
accomplish the goals of a digital repository? What are the tradeoffs between creating more-and-better metadata, and
creating a greater number of high quality digital objects? In this
presentation, metadata creation and capture will be described and evaluated for
two different high production environments at Stanford University Libraries.
SUL/AIR has developed two high-production digitization
channels in recent years:
onsite capture of the GATT Archive at the World Trade Organization
DL1 Laboratory on Stanford campus
For each of these channels, part of the
challenge associated with the capture process has been to decide the kind, extent,
and feasibility of creating descriptive, technical, and administrative metadata
in a collaboration of collection, scanning, cataloging and metadata, IT, and
preservation staff. For this
presentation, the rationale behind the metadata models used will be
discussed. Also discussed will be the
roles played by metadata standards, and the tradeoffs made between sound data
structure and information capture. Finally, an assessment will be offered of some areas of research and
further testing that would prove useful to develop better answers to the
The Union Catalog of Art Images
(UCAI): Aggregating and Standardizing
Diverse Legacy Metadata.
Esme Cowles (Database Developer) and Linda Barnhart (Project Coordinator), Union Catalog of
The Union Catalog of Art Images (UCAI) project,
funded by The Andrew W. Mellon Foundation and centered at UCSD, has built a
prototype union catalog from three large and extremely diverse sets of legacy
metadata for art images. UCAI is a
research and development project that is building the technical infrastructure
for a shared cataloging resource for the visual resources community to promote
the copy cataloging of images. The
platform for the prototype is the open source native XML database Xindice, with an
open source front end search engine, Lucene. The aggregated
dataset, at 675,000 records, may be one of the larger implementations of a
native XML database in the digital library community, and includes thumbnail
images for approximately 25% of the records.
The speakers will describe the problems
uncovered and the software tools developed to map, standardize, and ingest the
metadata to build the UCAI prototype. The challenges of clustering and merging redundant metadata will also be
described. The speakers will also
address the drawbacks and benefits of our experience with this native XML
database, and the next steps for project development.
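One simple way to picture the clustering and merging of redundant records is to group them by a normalized signature. The sketch below is an assumption-laden illustration, not the UCAI tools: the field names, the normalization rules, and the "keep the longest value" merge policy are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def cluster(records, keys=("creator", "title")):
    """Group records whose normalized key fields are identical."""
    groups = defaultdict(list)
    for rec in records:
        signature = tuple(normalize(rec.get(k, "")) for k in keys)
        groups[signature].append(rec)
    return list(groups.values())


def merge(group):
    """Merge a cluster by keeping, per field, the longest non-empty value."""
    merged = {}
    for rec in group:
        for key, value in rec.items():
            if len(value) > len(merged.get(key, "")):
                merged[key] = value
    return merged
```

Exact-match signatures like this only catch trivially different spellings; records that differ more substantially would need fuzzier comparison and human review.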
This session will provide an overview of what
NYU has learned so far in trying to establish best practices with regards to
archiving digital video, including a brief technical overview of relevant
characteristics of digital video, a discussion of the abstract requirements for
preservation-worthy digital video, and some discussion of costs for creating
and maintaining a large scale digital video archive.
Library's Digital Preservation Program.
Patricia Cruse, Director, Digital Preservation Program, California Digital Library
In partnership with the University of California libraries, the
California Digital Library established a digital preservation program to focus
on the persistent management of digital information. Since its inception last year, CDL's program
has developed so that it leverages the infrastructure available to the
California Digital Library. The program
is active in three primary areas:
methods to preserve and persistently manage e-journals
a preservation repository for content created or managed by the UC libraries
methods for gathering and persistently managing web-based materials
Findings from CDL's web-archiving activities,
which were funded by The Andrew W. Mellon Foundation, will be presented.
: BREAKOUT SESSION 8: USER PERSPECTIVE AND
ASSESSMENT. Alvarado Room D.
Digital Scholarship in the Academy: What
Ann Lally, Head, Digital Initiatives, University of Washington Libraries
In March 2003, the University of Washington Libraries held a retreat to
discuss digital scholarship – that is, academic work that was not possible
prior to the development of digital technology.
This retreat, titled "From Vision to
Transformation: New Models of Academic Support for Digital Scholarship",
brought together sixty-two scholars, librarians, archivists, museum curators,
academic leaders and technologists. Discussions with retreat participants centered on two questions:
What are scholars’ needs and wants regarding digital scholarship, collections, and
What strategies should the University of Washington and the UW Libraries
take to advance such scholarship and learning?
The retreat included presentations, facilitated
small group discussions, and opportunities for social interactions. Overwhelmingly the retreat attendees pointed
to the Library as the logical place to turn for assistance. A wealth of
information was collected at this retreat; the issues raised (and the
interesting omissions) and the models proposed by faculty will be discussed and
will include potential implications for libraries.
The Evolution of an Interface from the User
Perspective: From End User Testing to a Usage Log Analysis.
“Find Articles/Find Databases/Find e-Journals”
is the new system the Cornell University Library (CUL) has implemented to
replace the old “e-Reference Collection” system.
Find Articles/Find Databases/Find e-Journals is
built using Endeavor’s ENCompass product,
Oracle and XSL, whereas the e-Reference Collection was based on a MySQL
database and Perl. E-Reference allowed
for searching the metadata about both proprietary and unrestricted resources
and then connecting directly to each database for searching in its native
environment. The new system builds on
the existing database search capability with “Find Databases,” adds federated
searching at the article level with “Find Articles,” and provides title-level
access to the 20,000 or so electronic journals available through the CUL with
“Find e-Journals.” In addition, the
system adds reference linking with “Find it at Cornell,” which utilizes
Endeavor’s LinkFinder Plus.
CUL’s ENCompass project team began
working on the new system in earnest in the summer of 2002 and brought the
system up live in May of 2003. Beginning
in fall 2002, the team undertook substantial XSL and HTML customization of the
web interface and integrated a new authentication system using EZproxy. Through the course of faculty and student
focus groups, reference staff feedback, and, most recently, a usage log
analysis, the ENCompass
team has been able to measure the effectiveness of the new system and consider or
implement improvements. The presentation will focus on the evolution of the
interface and the implications the usage log study raises for maximizing system
performance in the future.
: PLENARY. Alvarado Room ABC
David Seaman, DLF
A rose is a rose by any other name; what's a
a) Elizabeth A. S. Beaudin (Yale): UNICODE: The Right Tools, but how to use
The proposed presentation will review the
technological decisions made at the outset of an open source electronic union
list of Middle Eastern serials. For
example, it might be argued that a Microsoft solution was and is available to
address the aspects of the project's linguistic needs. Why not use it? Several factors informed the decision, among
these: server configuration and experimentation, hardware costs for Middle
Eastern partners, and international software accessibility, expertise and
popularity. Further, how is Unicode being
exploited in our system prototype? How
has the tool been put to use to resolve search, retrieval, and display
requirements? What challenges remain to
solve, for example, when XML is introduced in later years of the project?
b) John Kunze (CDL): Persistent Identifiers: What's left to be done? Alvarado
To solicit ideas, needs, and
visions that could be used to inform an analysis of how to move library
community discussion forward on the creation and maintenance of durable
links. As a possible point of departure,
attendees are encouraged to read a recent reformulation of the problem that can
be found here (PDF, 9 pages):
: BREAKOUT SESSION 10: RESOURCE MANAGEMENT. Alvarado D.
LibData: a library web
Paul F. Bramscher, Shane A. Nackerud, and John T. Butler, University of Minnesota, Twin Cities
The goal of publishing dynamically-generated
library web pages through a system that integrates easy-to-use web authoring
tools with a large database of information resources led the University of Minnesota Libraries to build LibData. Like many large research libraries, Minnesota provides access to
thousands of resources, both licensed and freely available. Access to these
resources is provided in numerous ways, including through its web site, subject
pathfinders, and course-customized library web pages.
LibData's master database -- which contains
records for resources, services, library locations, staff, and more -- was
designed to allow for easier management of these resources and their rapid
retrieval and incorporation into the variety of web presentations that
librarians create for users. LibData currently offers three distinct page
authoring tools useful to both novice and expert librarian users:
These tools are tightly integrated with the main
database, making resource management easier to control, and assuring that
library users receive predictable and up-to-date information. LibData also
features a robust staff management system, user and page statistics, and
complete customizability and extensibility. This presentation will discuss the LibData system, its underlying database architecture, the
administration interface, and efforts to create an open source distribution of
the system. Also discussed will be the extensibility of LibData
including work to connect it with enterprise systems such as the campus portal
and course management software.
NAND: A New Tool for an Old Problem.
Charles Blair, Elisabeth Long, and Keith Waclena, The Digital Library Development Center, The University
of Chicago Library
The Digital Library Development Center (DLDC) at
the University of Chicago Library has developed a
lightweight, versatile tool for searching and browsing collections of data,
including but not limited to bibliographic data, via the World Wide Web. This
tool, NAND: A Non-Relational Database, has allowed the DLDC both to meet the
current needs of its user base and also to anticipate demand in some areas.
NAND has been used on a variety of projects
ranging from an e-journals list to an integrated search interface for digital
projects to management information for Unix computers. It is a generic tool
that addresses a class of data-indexing and presentation needs while allowing
customization by relatively non-technical staff to meet individual project
NAND is a portable single-file executable which
combines a back-end indexing component and a CGI-based web interface for
searching and browsing. A project can be completely set up with as few as three
lines of configuration, but advanced users have recourse to a complete
object-oriented programming language within the configuration file. The system
supports several types of input data (including delimited fields, CSV, refer,
HTML, and electronic mail). The CGI interface supports customizable HTML and
record layout; multiple sorts; multi-page hierarchical browsing; and Boolean
searches across any number and combination of fields with phrase searching,
wildcards and regular expressions.
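The kind of fielded Boolean searching described above can be sketched as follows. This is not NAND's code or configuration language; the `|` delimiter, the `/regex/` convention, and the `all_of` / `any_of` / `none_of` parameters are assumptions chosen to keep the example small.

```python
import csv
import io
import re
from fnmatch import fnmatchcase


def load_records(text, delimiter="|"):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def matches(value, pattern):
    """Match one field value against a phrase, wildcard, or /regex/ pattern."""
    value = value.lower()
    if len(pattern) > 1 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], value, re.IGNORECASE) is not None
    if any(ch in pattern for ch in "*?"):
        # Wildcards are applied word by word.
        return any(fnmatchcase(word, pattern.lower()) for word in value.split())
    return pattern.lower() in value  # phrase search as a substring test


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of.

    Each clause is a (field, pattern) pair.
    """
    hits = []
    for rec in records:
        def test(clause, rec=rec):
            field_name, pattern = clause
            return matches(rec.get(field_name) or "", pattern)

        if not all(test(c) for c in all_of):
            continue
        if any_of and not any(test(c) for c in any_of):
            continue
        if any(test(c) for c in none_of):
            continue
        hits.append(rec)
    return hits
```

A call such as `search(records, all_of=[("publisher", "acme")], none_of=[("year", "/^200[0-1]$/")])` combines a phrase clause on one field with a negated regular expression on another.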
: Break: Foyer
to Alvarado Room.
: BREAKOUT SESSION 11: NATIONAL SCIENCE
DIGITAL LIBRARY. Alvarado Room ABC.
NSDL Projects Update.
Martin Halbert, Director for Library Systems, Emory University
Two DLF institutions, Emory University and the CDL, have
independently received NSDL grants for similar and potentially complementary
work on interoperability with the NSDL.
"The OCKHAM Library Network." Martin Halbert, Emory University, with a draft of the OCKHAM reference model, as well as results from an early survey.
"Adding Value to NSDL: A Business Proposition and Service Enhancement." Laine Farley,
Director, Digital Library Services, California Digital Library
: BREAKOUT SESSION 12. Alvarado Room D.
Data Mining Library Collection Silos: An Opportunity for
Cooperative Collection Management of Print and Electronic Books.
The OCLC Online Computer Library Center WorldCat
database is used to identify print books (p-books) that have an electronic book
(e-book) edition and the libraries that hold these materials. An analysis of
the bibliographic characteristics of and the geographic holdings for these
materials provides empirical data for library decision-making.
Libraries are installing compact shelving,
moving lesser-used and older collections to remote storage locations and,
increasingly, are digitizing their materials. With digital collections come new
challenges, such as usage and cost comparisons of print and electronic
resources, digitization and preservation processes, organization, retrieval
systems, services, and collection management. By analyzing collection data
across institutions and within collections, library decision-makers are able to
make collection decisions based on empirical data. An aggregated database of
library holdings is required for such an analysis.
This research draws on the OCLC Online Computer
Library Center WorldCat database, containing more
than 50 million records. WorldCat has not only served
as an aggregator of bibliographic data for thirty years, but also identifies
almost a billion holding locations for library resources. WorldCat
can be used to describe collections bibliographically, as well as
geographically. The researchers use WorldCat to identify paper books (p-books) that have an
electronic book (e-book) edition. Holding patterns are analyzed by type of
library, publisher, date, and subject areas (using the North American Title
Count) for all p-books and e-books. A comparison of the characteristics of
p-books and e-books documents the development and growth of the transition from
the paper library to the digital library. The findings from this research will
not only increase our understanding of the current e-book/p-book scenario, but
could also be useful in seeking outside funding for a range of library
operational issues, such as preservation and digitization of materials and cooperative
and individual library collection development and management decisions.
From aggregation to commerce: the next
phase for the RLG Cultural Materials Alliance.
Beginning with a brief review of where RLG and the
institutions in the Cultural Materials Alliance are today in aggregating their
digitized special collections to make them easily accessible for teaching and
research, the majority of the presentation will focus on the next phase of the
initiative, which has as its goals reaching new audiences, providing broader
awareness of and access to the institutions' special collections, and testing
the waters of commercial licensing. This will include the results of RLG's investigation of image stock-houses, the plans for
open web access via internet search engines, and the results of the
deliberations of the Alliance
policy advisory group.
remarks: David Seaman, DLF. Alvarado