DLF evaluation of the Open Archives Initiative
The DLF is supporting the development of a small number of
Internet gateways through which users will access distributed
digital library holdings as if they were part of a single uniform
collection. The gateways will be built using a technique known as
metadata harvesting. That technique is documented in a technical
framework developed by the Open Archives Initiative (OAI). As such, the
gateways developed by the DLF will contribute to a practical
evaluation of the OAI's harvesting technique and its application
within libraries. These pages describe this development work and
provide an up-to-date account of its progress.
Contents
Background
The evaluation project
Provisional metadata sources
Harvesting services

Background
In May 2000, with funding from The Andrew W. Mellon
Foundation, two meetings were held at Harvard University to
explore various technical and organizational issues involved in
the development of metadata harvesting services in digital
libraries. The meetings' aims are set out in a funding proposal that also served to brief
invited participants and to focus their discussions.
Participants quickly concluded that there were numerous,
potentially very valuable harvesting applications in the digital
library. Rather than re-invent a harvesting protocol, however,
participants agreed to concentrate on desirable revisions to the
protocol that had been developed recently by the OAI and
documented in the Santa Fe Convention. The meetings produced
three outcomes: a vision statement describing how harvesting
services could be developed to the advantage of libraries and
their patrons; a set of recommended changes that were formally
put to the OAI; and a road map for the development of harvesting
services that would help the library community evaluate the OAI's
technical framework in particular and the potential value of
metadata harvesting in general.
The vision statement was produced in
printed and electronic forms, and circulated widely to a broad
cross-section of the library community. The statement reflected
on libraries' persistent concern to pool the records they had
developed in order to document their respective holdings. It
evaluated the various mechanisms that had been used to achieve
this shared aim (e.g. union catalogs and distributed search
services) and the difficulties that those mechanisms encountered
in trying to integrate records pertaining to digital as well as
non-digital information. Metadata harvesting, the statement
suggested, promised to overcome some of these limitations. In
addition, the vision statement demonstrated how harvesting could
support the construction of Internet portals or gateways; that
is, websites that organize access to a rich variety of
information resources (potentially in any format) to meet very
specific user needs. Thus, the vision statement mused about
harvesting services that organized access to information relevant
to those interested in a particular field of study (e.g. American
history, biomedical ethics), a particular kind of information
(electronic books, digital images, maps and cartography), or
information available in a certain region (e.g. the southwestern
United States). It also envisaged Internet search services,
equivalent to those offered commercially by, for example, Alta
Vista and Lycos, but focusing more exclusively on scholarly
information, including information that exists in databases and is
therefore hidden from the commercial search engines' view. In this
regard, the statement suggested that the harvesting technique
could be used to build what members of the Association of
Research Libraries were at that time beginning to refer to as
"the scholarly commons".
The second result of the Harvard meetings was a set of three
recommendations that were taken formally to the OAI. These were
intended to help the OAI generalize the framework so that it
could be applied beyond the e-print community where it had
originated. The first recommendation was technical and urged
adoption of unqualified Dublin Core as the protocol's common
metadata element set. The OAI had originally proposed an Open
Archives Metadata Set that was smaller and more prescribed than
the Dublin Core, and more closely tailored to the needs of the
e-print community. The second recommendation called for greater
organizational stability for the OAI, including a steering
committee, an official home for the OAI web site, and a clear
locus of responsibility for maintenance of the protocol. The aim
here was to stabilize the framework long enough to encourage
institutions to invest in its practical application. The third
recommendation sought to generalize the initiative by focusing it
on technical rather than operational issues. Hitherto, the
technical framework had been developed to support a particular
application that aimed at making electronic pre-print
publications more widely accessible and without cost to
end-users. Participants in the Harvard meetings envisaged (and
did not want to constrain development of) applications that
reflected very different organizational and business objectives.
These three recommendations were among those discussed by the
Open Archives Initiative at its second public meeting in San
Antonio, Texas, in June 2000, and helped to encourage
developments that are reported elsewhere in these pages.
The road map for developing harvesting services involved
progress on two closely related fronts: developing a pool of
harvestable metadata focusing principally on metadata available
from library systems; and building a small number of online
services with metadata harvested from the pool. The work is being
undertaken in close collaboration between the DLF (whose 25
member libraries share an interest in integrating access to their
distributed collections) and The Andrew W. Mellon Foundation. An
account of its progress is set out below.

The evaluation project
In June 2000 the DLF began construction of a simple database
to list its members' nearly 300 public domain online digital
collections. The database, available from http://www.hti.umich.edu/cgi/d/dlfcoll/dlfcoll-idx,
was developed in part to identify sources of harvestable metadata
upon which the prototype harvesting services might rely.
Also in June, the Andrew W. Mellon Foundation invited the DLF
to locate institutions interested in contributing to evaluation
projects either by contributing metadata or by harvesting
metadata and building services. A call for
expressions of interest issued by the DLF in July produced 13
responses. Responses described 9 or 10 potential harvesting
services. They also offered metadata from nearly 50 digital
library collections representing well over a million
unique information objects.
In October 2000, a meeting of interested project participants
was convened by The Andrew W. Mellon Foundation to explore
technical, organizational, and resource issues and to identify
possible next steps. The meeting is reported fully elsewhere. Briefly, participants
identified at least four service types with particular
promise for libraries:
- services capable of supporting inquiry about a particular
subject ("Americana" was emphasized, in part because of the
availability of relevant digital holdings);
- services that provided access to information in a particular
format (particular interest was expressed in services for Encoded
Archival Descriptions, TEI encoded electronic texts, and visual
resources);
- services developed by a single library or library consortium
and customized to meet its patrons' specific needs; and
- services supporting a simple Lycos-style search across
available metadata irrespective of the subject matter, format, or
location of the information objects to which the metadata
referred.
Finally, there was interest in using harvesting services to
integrate information about digital as well as non-digital
objects and, in this way, to capitalize on the substantial
scholarly wealth represented, for example, in union bibliographic
databases and online archival finding aids.
At present, those who have offered metadata are building
OAI-conformant services that will allow them to make the metadata
available for harvesting. A list of metadata collections
provisionally to be made available for harvesting is supplied below. In the meantime, discussions
with potential service developers continue.
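To make the harvesting idea concrete, the following is a minimal sketch of what a gateway might do with harvested records. It rests on several assumptions that are not part of this account: the response layout and namespaces follow the later OAI-PMH 2.0 specification with unqualified Dublin Core (`oai_dc`), the two provider names and their records are invented, and the responses are embedded as strings rather than fetched over HTTP. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces published for OAI-PMH 2.0 and unqualified Dublin Core.
# Assumption: the services described here may have used an earlier
# protocol version with a different response layout.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented ListRecords responses from two hypothetical metadata providers.
PROVIDER_A = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:library-a.example:1</identifier>
        <datestamp>2001-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample digitized monograph</dc:title>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

PROVIDER_B = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:library-b.example:7</identifier>
        <datestamp>2001-06-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample historical photograph</dc:title>
          <dc:type>Image</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text, source):
    """Extract identifier, titles and types from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        header = rec.find("oai:header", NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if header is None or dc is None:
            continue  # skip records that carry no Dublin Core metadata
        records.append({
            "source": source,
            "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "titles": [(e.text or "") for e in dc.findall("dc:title", NS)],
            "types": [(e.text or "") for e in dc.findall("dc:type", NS)],
        })
    return records


def build_gateway_index(responses):
    """Pool records from several providers into one list, sorted by identifier."""
    index = []
    for source, xml_text in responses.items():
        index.extend(parse_list_records(xml_text, source))
    return sorted(index, key=lambda r: r["identifier"])


def search_titles(index, term):
    """Case-insensitive title search across the pooled records."""
    needle = term.lower()
    return [r for r in index if any(needle in t.lower() for t in r["titles"])]


if __name__ == "__main__":
    pooled = build_gateway_index({"Library A": PROVIDER_A, "Library B": PROVIDER_B})
    for hit in search_titles(pooled, "sample"):
        print(hit["source"], hit["identifier"], hit["titles"])
```

Because every provider exposes the same minimal element set, the gateway can pool and search records without knowing anything about the systems that produced them. Richer formats such as EAD or TEI headers would need their own handling.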

Provisional metadata sources
Proposing institution | Collections | Metadata format
Committee on Institutional Cooperation (CIC) | 1,200 encoded archival descriptions for digital sound recordings from the National Gallery of the Spoken Word | Encoded Archival Descriptions
" | 121 TEI-encoded files of nineteenth-century Sunday School books from LC/Ameritech | Text Encoding Initiative Headers
" | Metadata for the 2,832 titles in the Lyle Wright bibliography of American Fiction 1851-1875 being digitized by CIC institutions | Text Encoding Initiative Headers
" | 600 encoded archival descriptions for digitized oral arguments before the US Supreme Court (from the Oyez project) | Encoded Archival Descriptions
" | 100 encoded archival descriptions for digitized speeches and recorded conversations by important historical figures (from History and Politics Out Loud) | Encoded Archival Descriptions
" | Encoded archival descriptions for hundreds of scanned images, manuscripts, and other materials on bridges, railways, and transportation infrastructure (from Making of America II) | Encoded Archival Descriptions
" | 400 records for digitized local and regional history image collections from Global Cultural Memory | Dublin Core
" | Encoded archival descriptions for 10,000 items from collections such as the archives of the ALA, prominent individuals, and war memorabilia | Encoded Archival Descriptions
" | 200 records for social hygiene posters 1922-45 (from Social Welfare Hygiene Posters) | Visual Resources Association Core Categories
" | 1,600 records for scanned historic scenery backdrops in common use in the US 1890-1920 | Visual Resources Association Core Categories and Dublin Core
" | 400 records for images from the history of computing | Visual Resources Association Core Categories and Dublin Core
Cornell University | Records for digitized books and images from the Making of America collection, including the 100,000-plus articles in nineteenth-century serial publications and the 267 volumes of nineteenth- and twentieth-century monographs | Text Encoding Initiative Headers
" | 2,000 records for nineteenth- and twentieth-century agricultural texts | MARC and Dublin Core records
" | Records for 571 digitized pre-1914 math books | MARC records
" | Records pertaining to geospatial data for New York State | Federal Geographic Data Committee
" | Records for about 3,000 electronic resources available through the Cornell Library gateway | Dublin Core
Emory University | Finding aids for nearly 45 special collections and item-level descriptions for 8,000 texts and photographs | Encoded Archival Descriptions
" | Records for the full SGML and XML encoded texts in the Emory women writers project | Text Encoding Initiative Headers
" | Records for the full SGML and XML encoded poetry | Text Encoding Initiative Headers
" | Web resources created by Emory faculty for research purposes | NA
" | Social science data produced at Emory University | NA
Harvard University | Metadata from VIA, an online index of visual resources in the arts, architecture, material culture, and history from Harvard libraries, archives, museums, etc., including 15,000 records linked to digital images | Visual Resources Association Core Categories and Dublin Core
Harvard University Virtual Data Center | Thousands of metadata records for data in the Harvard-MIT Data Center and the Virtual Data Center | Data Documentation Initiative
Indiana University | Metadata records from the nearly 185 searchable works by nineteenth-century Victorian women writers | Text Encoding Initiative Headers
" | Metadata for the 2,832 titles in the Lyle Wright bibliography of American Fiction 1851-1875 being digitized by CIC institutions | Text Encoding Initiative Headers
" | Hoagy Carmichael Collection: metadata for thousands of items in libraries and archives | Encoded Archival Descriptions, MARC, Text Encoding Initiative Headers
Library of Congress | Selected records from American Memory | Dublin Core
Research Libraries Group (RLG) | Metadata records from the visual resources available from the Cultural Material Alliance | NA
OCLC | Records from WorldCat pertaining to theses and dissertations | Dublin Core and MARC records
University of California at San Diego | 300,000 metadata records for slides in the UCSD Art and Architecture Library Slide collection | MARC
University of Illinois at Urbana-Champaign | 25,000 metadata files from the American Institute of Physics and the American Physical Society subset of full-text journals | Dublin Core
" | 2,000 metadata files from Historic Aerial Photos Imagebase | Federal Geographic Data Committee
" | 400 metadata files from digitized local and regional history image collections | Dublin Core
" | Metadata for 7,000 items in the Kolb-Proust archive | Text Encoding Initiative Headers
" | Metadata describing 10,000 items in Illinois special and archival collections | Encoded Archival Descriptions
University of Michigan | 9,000 titles in the Making of America I and IV collections | MARC, Text Encoding Initiative Headers, and Dublin Core
" | 5,000 technical reports from the University of Michigan Engineering School | NA
" | Metadata records from Chadwyck-Healey literary collections | NA
" | Metadata records from the Corpus of Middle English texts | NA
" | Metadata for 200,000-plus art and historical images | NA
University of Pennsylvania | Metadata for the digital texts and images produced for the Furness Shakespeare Library, including 70 digitized documents | MARC records
" | Metadata for over 11,000 freely available online books from the Online Books Page | Dublin Core
" | Metadata for 150 freely available online books published by A Celebration of Women Writers | Dublin Core
Universities of Tennessee at Knoxville and Georgia | Metadata for full TEI-encoded texts of Southeastern Native American documents | Text Encoding Initiative Headers
University of Virginia | Records for the complete set of electronic texts, finding aids, and digital images that can be made available publicly without restriction and without regard to thematic concentration | Text Encoding Initiative Headers, Encoded Archival Descriptions, Dublin Core
Yale University | Metadata records (initially 5,000 but growing to 30,000) for visual resources from various collections | Visual Resources Association Core Categories

Harvesting services
In June 2001, The Andrew W. Mellon Foundation funded seven
projects that proposed construction of various online services
employing the OAI metadata harvesting protocol. The following
services were funded at DLF member institutions:
- At Emory University, two grant projects (AmericanSouth.org and MetaArchive.org) have
been combined and are being carried forward in
cooperation with partner institutions SOLINET and ASERL. The services
will integrate access to digital collections dealing with the
American South, and will integrate finding aids for archives of the
papers of major American political figures, records of theological
institutions, and Africana.
- The University of Illinois is creating a web portal for searching
materials focusing on cultural heritage and coming from a variety
of institutions including library special collections, museums,
historical societies, and public libraries.
- The University of Michigan is building a service (OAISTER) that will
integrate access to digitally reformatted materials irrespective
of their subject, whether art or zoology, and format, whether
text or image.
- The University of Virginia is working to integrate access to
digital Americana in all formats.
Many other DLF members are contributing metadata to these
harvesting services.