DLF evaluation of the Open Archives Initiative

The DLF is supporting the development of a small number of Internet gateways through which users will access distributed digital library holdings as if they were part of a single uniform collection. The gateways will be built using a technique known as metadata harvesting. That technique is documented in a technical framework developed by the Open Archives Initiative (OAI). As such, the gateways developed by the DLF will contribute to a practical evaluation of the OAI's harvesting technique and its application within libraries. These pages describe this development work and provide an up-to-date account of its progress.

Background
The evaluation projects
Provisional metadata sources
Harvesting services

Background

In May 2000, with funding from The Andrew W. Mellon Foundation, two meetings were held at Harvard University to explore various technical and organizational issues involved in the development of metadata harvesting services in digital libraries. The meetings' aims are set out in a funding proposal that also acted to brief invited participants and to focus their discussions.

Participants quickly concluded that there were numerous, potentially very valuable harvesting applications in the digital library. Rather than re-invent a harvesting protocol, however, participants agreed to concentrate on desirable revisions to the protocol that had been developed recently by the OAI and documented in the Santa Fe Convention. The meetings produced three outcomes: a vision statement describing how harvesting services could be developed to the advantage of libraries and their patrons; a set of recommended changes that were formally put to the OAI; and a road map for the development of harvesting services that would help the library community evaluate the OAI's technical framework in particular and the potential value of metadata harvesting in general.

The vision statement was produced in printed and electronic forms, and circulated widely to a broad cross-section of the library community. The statement reflected on libraries' persistent concern to pool the records they had developed in order to document their respective holdings. It evaluated the various mechanisms that had been used to achieve this shared aim (e.g. union catalogs and distributed search services) and the difficulties that those mechanisms encountered in trying to integrate records pertaining to digital as well as non-digital information. Metadata harvesting, the statement suggested, promised to overcome some of these limitations. In addition, the vision statement demonstrated how harvesting could support the construction of Internet portals or gateways; that is, websites that organize access to a rich variety of information resources (potentially in any format) to meet very specific user needs. Thus, the vision statement mused about harvesting services that organized access to information relevant to those interested in a particular field of study (e.g. American history, biomedical ethics), a particular kind of information (e.g. electronic books, digital images, maps and cartography), or information available in a certain region (e.g. the southwestern United States). It also envisaged Internet search services equivalent to those offered commercially by, for example, Alta Vista and Lycos, but focusing more exclusively on scholarly information, including information that exists in databases and is therefore hidden from the commercial search engines' view. In this regard, the statement suggested that the harvesting technique could be used to build what members of the Association of Research Libraries were at that time beginning to refer to as "the scholarly commons".

The second result of the Harvard meetings was a set of three recommendations that were taken formally to the OAI. These were intended to help the OAI generalize the framework so that it could be applied beyond the e-print community where it had originated. The first recommendation was technical and urged adoption of unqualified Dublin Core as the protocol's common metadata element set. The OAI had originally proposed an Open Archives Metadata Set that was smaller and more prescribed than the Dublin Core, and more closely tailored to the needs of the e-print community. The second recommendation called for greater organizational stability for the OAI, including a steering committee, an official home for the OAI web site, and a clear locus of responsibility for maintenance of the protocol. The aim here was to stabilize the framework long enough to encourage institutions to invest in its practical application. The third recommendation sought to generalize the initiative by focusing it on technical rather than operational issues. Hitherto, the technical framework had been developed to support a particular application that aimed at making electronic pre-prints more widely accessible and without cost to end-users. Participants in the Harvard meetings envisaged (and did not want to constrain development of) applications that reflected very different organizational and business objectives. These three recommendations were among those discussed by the Open Archives Initiative at its second public meeting in San Antonio, Texas, in June 2000, and helped to encourage developments that are reported elsewhere in these pages.

The evaluation projects

The road map for developing harvesting services involved progress on two closely related fronts: developing a pool of harvestable metadata focusing principally on metadata available from library systems; and building a small number of online services with metadata harvested from the pool. The work is being undertaken in close collaboration between the DLF (whose 25 member libraries share an interest in integrating access to their distributed collections) and The Andrew W. Mellon Foundation. An account of its progress is set out below.

In June 2000 the DLF began construction of a simple database to list its members' nearly 300 public domain online digital collections. The database, available from http://www.hti.umich.edu/cgi/d/dlfcoll/dlfcoll-idx, was developed in part to identify sources of harvestable metadata upon which the prototype harvesting services might rely.

Also in June, The Andrew W. Mellon Foundation invited the DLF to locate institutions interested in contributing to evaluation projects either by contributing metadata or by harvesting metadata and building services. A call for expressions of interest issued by the DLF in July produced 13 responses. Responses described 9 or 10 potential harvesting services. They also offered metadata from nearly 50 digital library collections representing well over a million unique information objects.

In October 2000, a meeting of interested project participants was convened by The Andrew W. Mellon Foundation to explore technical, organizational, and resource issues and to identify possible next steps. The meeting is reported fully elsewhere. Briefly, participants identified at least four service types with particular possibilities for libraries, including:

services capable of supporting inquiry about a particular subject ("Americana" was emphasized, in part because of the availability of relevant digital holdings);
services that provided access to information in a particular format (particular interest was expressed in services for Encoded Archival Descriptions, TEI encoded electronic texts, and visual resources);
services developed by a single library or library consortium and customized to meet its patrons' specific needs; and
services supporting a simple Lycos-style search across available metadata irrespective of the subject matter, format, or location of the information objects to which the metadata referred.

Finally, there was interest in using harvesting services to integrate information about digital as well as non-digital objects and, in this way, to capitalize on the substantial scholarly wealth represented, for example, in union bibliographic databases and online archival finding aids.

At present, those who have offered metadata are building OAI-conformant services that will allow them to make the metadata available for harvesting. A provisional list of the metadata collections to be made available for harvesting is supplied below. In the meantime, discussions with potential service developers continue.
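To make the harvesting step concrete, the following is a minimal sketch of how a service developer might read records returned by a metadata provider. It assumes the response layout and namespace URIs of the later OAI-PMH 2.0 specification with unqualified Dublin Core; the framework under evaluation at the time of writing may have differed in detail. The sample response, the `oai:example.org` identifiers, and the `parse_records` helper are all invented for illustration and do not describe any participant's actual service.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as published for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response of the kind a provider might return to a
# ListRecords request for Dublin Core records. Invented for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2001-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample finding aid</dc:title>
          <dc:type>Text</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:item-2</identifier>
        <datestamp>2001-06-02</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample photograph</dc:title>
          <dc:subject>Bridges</dc:subject>
          <dc:subject>Railways</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return one dict per record: identifier, datestamp, and Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        entry = {
            "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", default="", namespaces=NS),
            "fields": {},
        }
        # A record may carry no metadata block, so guard against None.
        dc_root = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:
            for el in dc_root:
                # Strip the "{namespace}" prefix to get the bare element name.
                name = el.tag.split("}", 1)[-1]
                # Dublin Core elements are repeatable, so collect values in lists.
                entry["fields"].setdefault(name, []).append((el.text or "").strip())
        records.append(entry)
    return records


if __name__ == "__main__":
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["identifier"], record["fields"].get("title", []))
```

Storing every Dublin Core element as a list reflects the fact that unqualified Dublin Core elements are optional and repeatable, which is one reason the element set can serve as a common denominator across sources as varied as those listed below.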

Provisional metadata sources

Proposing institution	Collections	Metadata format
Committee on Institutional Cooperation (CIC)	1200 encoded archival descriptions for digital sound recordings from National Gallery of the Spoken Word	Encoded Archival Descriptions
"	121 TEI-encoded files of nineteenth-century Sunday School books from LC/Ameritech	Text Encoding Initiative Headers
"	Metadata for the 2832 titles in the Lyle Wright bibliography of American Fiction 1851-1875 being digitized by CIC institutions	Text Encoding Initiative Headers
"	600 encoded archival descriptions for digitized oral arguments before the US Supreme Court (from Oyez project)	Encoded Archival Descriptions
"	100 encoded archival descriptions for digitized speeches and recorded conversations by important historical figures (from History and Politics out loud)	Encoded Archival Descriptions
"	Encoded archival descriptions for hundreds of scanned images, manuscripts, and other materials on bridges railways and transportation infrastructure (from Making of America II)	Encoded Archival Descriptions
"	400 records for digitized local and regional history image collections from Global Cultural Memory	Dublin Core
"	Encoded archival descriptions for 10,000 items from collections such as archives of ALA, prominent individuals, war memorabilia	Encoded Archival Descriptions
"	200 records for social hygiene posters 1922-45 (from Social Welfare Hygiene Posters)	Visual Resources Association Core Categories
"	1600 records for scanned historic scenery backdrops in common use in US 1890-1920	Visual Resources Association Core Categories and Dublin Core
"	400 records for images from the history of computing	Visual Resources Association Core Categories and Dublin Core
Cornell University	Records for digitized books and images from the Making of America collection including the 100,000 plus articles in nineteenth century serial publications and the 267 volumes of nineteenth and twentieth century monographs	Text Encoding Initiative Headers
"	2,000 records for nineteenth and twentieth century agricultural texts	MARC and Dublin Core records
"	Records for 571 digitized pre-1914 math books	MARC records
"	Records pertaining to geospatial data for New York State	Federal Geographic Data Committee
"	Records for c3,000 electronic resources available through the Cornell Library gateway	Dublin Core
Emory University	Finding aids for nearly 45 special collections and item level descriptions for 8,000 texts and photographs	Encoded Archival Descriptions
"	Records for the full SGML and XML encoded texts in the Emory women writers project	Text Encoding Initiative Headers
"	Records for the full SGML and XML encoded poetry	Text Encoding Initiative Headers
"	Web resources created by Emory faculty for research purposes	NA
"	Social science data produced at Emory University	NA
Harvard University	Metadata from VIA - an online index of visual resources in the arts, architecture, material culture, and history from Harvard libraries, archives, museums, etc., including 15,000 records linked to digital images	Visual Resources Association Core Categories and Dublin Core
Harvard University Virtual Data Center	1000s of metadata records for data in the Harvard-MIT Data Center and the Virtual Data Center	Data Documentation Initiative
Indiana University	Metadata records from the nearly 185 searchable works by nineteenth-century Victorian women writers	Text Encoding Initiative Headers
"	Metadata for the 2832 titles in the Lyle Wright bibliography of American Fiction 1851-1875 being digitized by CIC institutions	Text Encoding Initiative Headers
"	Hoagy Carmichael Collection - metadata for thousands of items in libraries and archives	Encoded Archival Descriptions, MARC, Text Encoding Initiative Headers
Library of Congress	Selected records from American Memory	Dublin Core
Research Libraries Group (RLG)	Metadata records from the visual resources available from the Cultural Material Alliance	NA
OCLC	Records from Worldcat pertaining to theses and dissertations	Dublin Core and MARC records
University of California at San Diego	300,000 metadata records for slides in the UCSD Art and Architecture Library Slide collection	MARC
University of Illinois at Urbana-Champaign	25,000 metadata files from the American Institute of Physics and the American Physical Society subset of full-text journals	Dublin Core
"	2,000 metadata files from Historic Aerial Photos Imagebase	Federal Geographic Data Committee
"	400 metadata files from digitized local and regional history image collections	Dublin Core
"	Metadata for 7,000 items in Kolb-Proust archive	Text Encoding Initiative Headers
"	Metadata describing 10,000 items in Illinois special and archival collections	Encoded Archival Descriptions
University of Michigan	9,000 titles in the Making of America I and IV collections	MARC, Text Encoding Initiative Headers, and Dublin Core
"	5,000 technical reports from the University of Michigan Engineering School	NA
"	Metadata records from Chadwyck-Healey literary collections	NA
"	Metadata records from the Corpus of Middle English texts	NA
"	Metadata for 200,000 plus art and historical images	NA
University of Pennsylvania	Metadata for the digital texts and images produced for the Furness Shakespeare Library including 70 digitized documents	MARC records
"	Metadata for over 11,000 freely available online books (from the Online Books Page)	Dublin Core
"	Metadata for 150 freely available online books published by A Celebration of Women Writers	Dublin Core
Universities of Tennessee at Knoxville and Georgia	Metadata for full TEI-encoded texts of Southeastern Native American documents	Text Encoding Initiative Headers
University of Virginia	Records for the complete set of electronic texts, finding aids, and digital images that can be made available publicly without restriction and without regard to thematic concentration	Text Encoding Initiative Headers, Encoded Archival Descriptions, Dublin Core
Yale University	Metadata records (initially 5,000 but growing to 30,000) for visual resources from various collections	Visual Resources Association Core Categories

Harvesting services

In June 2001, The Andrew W. Mellon Foundation funded seven projects that proposed construction of various online services employing the OAI metadata harvesting protocol. The following services were funded at DLF member institutions:

At Emory University, two grant projects, AmericanSouth.org and MetaArchive.org, have been combined and are being carried forward in cooperation with partner institutions SOLINET and ASERL. The services will integrate access to digital collections dealing with the American South, and will integrate finding aids for archives of papers of major American political figures, records of theological institutions, and Africana.
The University of Illinois is creating a web portal for searching cultural heritage materials from a variety of institutions, including library special collections, museums, historical societies, and public libraries.
The University of Michigan is building a service (OAISTER) that will integrate access to digitally reformatted materials irrespective of their subject, whether art or zoology, and format, whether text or image.
The University of Virginia is working to integrate access to digital Americana in all formats.

Many other DLF members are contributing metadata to these harvesting services.
