Metadata harvesting testbed. Call for expressions of interest

The Digital Library Federation
29 June 2000

Please note, responses to this call are no longer being sought by the DLF.

I. Executive summary

The Andrew W. Mellon Foundation has asked the DLF to propose a coherent set of projects that together can demonstrate how and to what extent emerging network technologies may make scholarly information resources more readily accessible via online gateway or portal services.

I am therefore writing to solicit brief expressions of interest from DLF member institutions that are interested in participating in such projects. Collaborative projects are especially welcome and may certainly involve institutions that are not DLF members, including other universities, colleges, museums, archives, publishers, professional societies, authors, etc.

II. Introduction

With a planning grant from the Andrew W. Mellon Foundation, the DLF has developed a framework for disclosing the collective wealth of our educational and cultural collections. Built upon technical work conducted by the Open Archives Initiative, the framework promises to support the emergence of online portal or gateway services that integrate access to the item-level metadata managed by libraries, museums, cultural organizations, publishers, authors, scholarly societies, etc. Significantly, it may help the DLF to take a substantial step forward in realizing its mission "to bring together -- from across the nation and beyond -- digitized materials that will be made accessible to students, scholars, and citizens everywhere, and that document the building and dynamics of America's heritage and cultures."

A next step is evaluating the framework through some co-ordinated testbed activities to be supported by the Mellon Foundation, the DLF, and potentially other agencies.

I am accordingly looking for expressions of interest from members willing to participate in such activities by contributing metadata for some of their libraries' digital or paper-based collections. I am also looking for members who may be interested in building prototype gateway or portal services based on harvested subsets of the contributed metadata. It is hoped that a range of different harvesting services (organized, for example, by subject domain, information class, region, etc.) will be proposed that can help to evaluate different aspects of the framework.

Those interested in participating in any testbed should, by 4 August 2000, send a brief expression of interest (400-600 words maximum) addressing the points raised in the section on the format for expressions of interest below.

III. The framework

The framework accommodates both data providers and service providers. A data provider agrees to support a simple harvesting protocol and to provide extracts of item-level metadata in a common, minimal-level format in response to harvest requests from trusted service providers. The data provider then records information about its metadata collections in a shared registry. A service provider uses this registry to locate potential data providers, and uses the harvest protocol to collect metadata from them, possibly after reaching some kind of formal agreement on terms and conditions of use. The service provider is then able to build intellectually useful services, such as catalogs and portals to materials distributed across multiple sites. The framework applies to a wide range of information resources of academic and scholarly interest including printed and electronic texts, science and social science data sets, visual materials, archival collections, geographic information system (GIS) data, sound and music, video, and any other type of resource for which metadata is typically created.
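The division of roles described above can be illustrated with a minimal sketch. It is not the Santa Fe Convention or the Dienst protocol itself: the class names, the `trusted` set, and the dictionary-based records are all assumptions made for illustration, and a real harvest would take place over the network rather than through in-process calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProvider:
    """Hypothetical site exposing item-level metadata in a minimal format."""
    name: str
    records: list[dict]
    # Names of service providers this site has agreed to answer (assumption).
    trusted: set[str] = field(default_factory=set)

    def harvest(self, requester: str) -> list[dict]:
        # Only answer harvest requests from trusted service providers.
        if requester not in self.trusted:
            raise PermissionError(f"{requester} is not trusted by {self.name}")
        # Return copies tagged with their source, leaving local records untouched.
        return [{**record, "source": self.name} for record in self.records]


@dataclass
class Registry:
    """Hypothetical shared registry of data providers."""
    entries: dict[str, DataProvider] = field(default_factory=dict)

    def register(self, provider: DataProvider) -> None:
        self.entries[provider.name] = provider

    def providers(self) -> list[DataProvider]:
        return list(self.entries.values())


@dataclass
class ServiceProvider:
    """Hypothetical portal that harvests from every provider it may use."""
    name: str
    catalog: list[dict] = field(default_factory=list)

    def harvest_all(self, registry: Registry) -> None:
        for provider in registry.providers():
            try:
                self.catalog.extend(provider.harvest(self.name))
            except PermissionError:
                # No agreement with this provider yet, so skip it.
                continue
```

In this sketch a service provider locates data providers only through the registry, and a data provider that has not agreed terms with the requester is simply skipped, which mirrors the "trusted service providers" condition in the text.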

Theoretically, the framework could be used to expose the metadata held in thousands of individual systems worldwide for central collection. For example, comprehensive collections of Americana or GIS data could be developed. This would make local repositories more generally known, and more generally useful, because researchers could search across previously unconnected materials. It would also illuminate the "dark matter" of the Internet: material that is hard or impossible to find if the user does not already know where it exists.

Most important, the academic community could begin to ensure that services are developed that express the values of that community: services that center on materials of research and educational interest, that provide honest and transparent ranking and retrieval, and that improve search quality by intelligently integrating metadata.

The services that could be developed under the framework are limited only by need and ingenuity. The following list includes some that have been discussed but is not by any means exhaustive:

A portal to digital Americana. Many universities, archives, historical societies, cultural institutions, and other organizations are creating Web-accessible collections of Americana, often with grant funds. Currently, these materials remain largely invisible to educators and scholars. A service focusing on harvested metadata for Americana might combine access to archival visual and textual collections, such as those included in American Memory, with citations to electronic journal articles from JSTOR, early American fiction from the Universities of Virginia and Indiana, H. H. Richardson architectural drawings from Harvard, the Hoagy Carmichael collection at Indiana University, Hawaiian language newspapers from the University of Hawaii, and audio, visual, textual, and multimedia materials from hundreds of relevant sites.
A portal to environmental information. Environmental information is collected by hundreds of international, federal, state, and private agencies, and described using dozens of metadata formats. This information is used intensively by government and university researchers, despite the difficulty of finding data scattered among such a vast number of sites. A portal built upon harvested metadata could combine access to land, air, and space data from key government agencies with access to white papers, treaties, policy documents, journals, newsletters, and other relevant sources of environmental information. An even more ambitious service might combine search access to environmental information with geographic information resources such as those indexed by the Illinois Natural Resources Geospatial Data Clearinghouse, the University of Nevada Geospatial Data Clearinghouse, the New York State Spatial Data Clearinghouse, and other regional clearinghouses of geospatial information.
The academic engine. Despite the availability of library catalogs, online journal search services, and departmental databases, many university students and researchers turn first to the major commercial Internet search engines for resource discovery. A comprehensive Internet search service oriented toward academic and research resources would be a more productive alternative. Such a service might include all the information covered in more specialized portals (e.g., Americana, environmental information, GIS), as well as metadata from academic catalogs and databases, Web pages in the ".edu" domain, and commercial resources aimed at the research community.
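What the portals listed above have in common is a single search over metadata harvested from many sites. The following is a rough sketch of that idea only, assuming harvested records are plain dictionaries carrying a `source` field; the function names and the simple all-terms matching rule are illustrative choices, not part of the framework.

```python
from __future__ import annotations

import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    # Lowercase the text and split it into alphanumeric words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(records: list[dict]) -> dict[str, set[int]]:
    # Map each word to the positions of the records that contain it.
    index: dict[str, set[int]] = defaultdict(set)
    for position, record in enumerate(records):
        for value in record.values():
            for token in tokenize(str(value)):
                index[token].add(position)
    return index


def search(query: str, records: list[dict], index: dict[str, set[int]]) -> list[dict]:
    # Return the records that contain every word of the query.
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [records[i] for i in sorted(hits)]
```

Because the index is built over the merged records rather than per site, one query reaches materials that were previously unconnected, whichever data provider contributed them.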

Technically, the framework adopts the Santa Fe Convention as developed by the Open Archives Initiative, specifically the DTD that it recommends for item-level metadata and a subset of the Dienst protocol as the common protocol used to harvest metadata records. A registry will be established as part of the testbed using simple web forms for data entry. Documentation will be provided, and we anticipate that through the work undertaken by testbed projects we will accumulate a library of sharable routines for functions such as converting from common metadata formats to the minimal DTD. The DLF will be working closely with the Open Archives Initiative to ensure that the technical framework maintains the stability it will need to support the testbed activity presented in this document.
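A sharable conversion routine of the kind anticipated above might look roughly like the sketch below. It assumes a much-simplified MARC record, represented as a dictionary from field tag to a list of strings, and it writes to hypothetical element names (`record`, `title`, `creator`, `subject`); these are placeholders and not the elements of the DTD recommended by the Santa Fe Convention.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical mapping from MARC field tags to minimal-record element names.
# 245 (title statement), 100 (main entry, personal name), and 650 (topical
# subject) are MARC tags; the target names are placeholders.
MARC_TO_MINIMAL = {"245": "title", "100": "creator", "650": "subject"}


def marc_to_minimal(marc: dict[str, list[str]]) -> ET.Element:
    """Build a minimal XML record from a simplified MARC-like dictionary."""
    record = ET.Element("record")
    for tag, element_name in MARC_TO_MINIMAL.items():
        # Repeated fields (for example, several subjects) become repeated elements.
        for value in marc.get(tag, []):
            ET.SubElement(record, element_name).text = value.strip()
    # Fields without a mapping (for example, 300, physical description) are dropped.
    return record
```

Keeping the mapping in a single table means that a contributor with a different source format would, under these assumptions, only need to supply a different table rather than a new routine.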

Economically, we would use the testbed to explore the relationships between metadata suppliers, service suppliers, and those responsible for the management of necessary standards (e.g. the DTD and Dienst protocol) and tools (e.g. the registry). It is hoped that registries pointing to metadata for high-quality educational information will attract and support a number of harvesting services, each of which serves specific educational and other needs. Harvesting services, meanwhile, will theoretically encourage institutions to expose item-level metadata, for example, as a means of enhancing use of locally managed collections. However, we need to understand what kinds of terms and conditions are necessary for data providers to entrust information to service providers and for service providers to sustain their services.

IV. The testbed

The testbed seeks to address a range of related technical, functional, organizational and economic questions by:

exposing a pool of item-level metadata via some light-weight registry service;
supporting the development of some harvesting services built upon the metadata disclosed via the registry;
supporting appropriate research and development work in tools and other infrastructure, including legal and economic agreements, that may be commonly required by those interested in exposing item-level metadata or developing harvesting services.

Some key issues that work on the testbed may illuminate are indicated below:

Technically, the testbed will seek to evaluate the metadata DTD and the harvest protocol; appropriate search-engine and related technologies; generic tools appropriate for translating local metadata databases into a form suitable for exposure; and mechanisms for managing item-level metadata that offer different levels and types of access to the information objects to which they refer.

Functionally, it will determine whether harvest services meet scholars' information-seeking needs and how they can be developed or exploited by digital libraries, for example, in integrating access to distributed local collections.

Organizationally and economically, the testbed will identify how harvesting services might support themselves, and whether there is a natural market for them and business models attractive to both service suppliers and service users. Equally, it will identify incentives for encouraging institutions to expose item-level metadata.

V. Format for expressions of interest

Brief expressions of interest (400-600 words maximum) should address one or both of the following points:

1. What metadata would you contribute to the testbed? It is assumed that every participating institution will contribute at least some metadata.

Describe the source format (MARC, Dublin Core, etc.) and coverage (number, format, and subject, chronological or other scope of underlying information objects) of metadata databases your institution would be prepared to disclose to such a testbed. Brief descriptions (even aggregated ones, where numerous metadata databases are concerned) are appropriate at this stage.

There are no a priori assumptions about what metadata will be suitable to a harvesting service. Metadata comprised in OPACs, finding aids, and other indices as well as in databases of digital or digitized content are all undoubtedly of interest. Metadata records do not need to hotlink to some underlying digital information object, nor do metadata records need necessarily to refer to information objects that can be made publicly available without restriction. The only requirement is that participating institutions expose only those metadata which they have a right to expose and make publicly accessible.

2. What harvesting service would you be interested in developing? Not all participating institutions would be expected to develop a harvesting service. Nor is it likely that available funding will support more than a small number of harvesting services.

2.1. Describe the harvesting service you would be interested in developing and the communities or organizations that would benefit most from its use.

2.2. What item-level metadata would the service require for development through a prototype phase? (Your answer to this question may help to match service needs with metadata offers.)

2.3. What staff, expertise, or other resources would the institution be able to commit towards its participation in the metadata harvesting testbed? What specific expertise, infrastructure or related experience would the institution draw upon in contributing to the testbed?

VI. Next steps

Expressions of interest will identify the harvesting services that may be developed to the prototype stage, given access to appropriate item-level metadata.

In the interest of gaining experience with a range of distinctive service possibilities, members interested in developing similar harvesting services may be encouraged to align their efforts.

Depending on the level of interest in the testbed and the funding that becomes available for it, it may be necessary to go forward with only a subset of the metadata and harvesting services that are offered.

Selectivity, where required, will be governed by considerations including:

level and nature of appropriate metadata likely to be available to the harvesting service;
aspects of the framework that the harvesting service promises to illuminate;
costs involved in exposing appropriate metadata and developing the prototype harvesting service;
possibility that the harvesting service may be able (a) to sustain itself after an initial grant-assisted period and (b) contribute common knowledge, tools, etc. to educational and library communities;
specific interests of the sponsoring funding agency or agencies.
