Metadata harvesting testbed. Call for expressions of interest

The Digital Library Federation
29 June 2000

Please note, responses to this call are no longer being sought by the DLF.

I. Executive summary

The Andrew W. Mellon Foundation has asked the DLF to propose a coherent set of projects that together can demonstrate how and to what extent emerging network technologies may make scholarly information resources more readily accessible via online gateway or portal services.

I am therefore writing to solicit brief expressions of interest from DLF member institutions that wish to participate in such projects. Collaborative projects are especially welcome and may certainly involve institutions that are not DLF members, including other universities, colleges, museums, archives, publishers, professional societies, authors, etc.

II. Introduction

With a planning grant from the Andrew W. Mellon Foundation, the DLF has developed a framework for disclosing the collective wealth of our educational and cultural collections. Built upon technical work conducted by the Open Archives Initiative, the framework promises to support the emergence of online portal or gateway services that integrate access to the item-level metadata managed by libraries, museums, cultural organizations, publishers, authors, scholarly societies, etc. Significantly, it may help the DLF to take a substantial step forward in realizing its mission "to bring together -- from across the nation and beyond -- digitized materials that will be made accessible to students, scholars, and citizens everywhere, and that document the building and dynamics of America's heritage and cultures."

A next step is evaluating the framework through some co-ordinated testbed activities to be supported by the Mellon Foundation, the DLF, and potentially other agencies.

I am accordingly looking for expressions of interest from members willing to participate in such activities by contributing metadata for some of their library's digital or paper-based collections. I am also looking for members who may be interested in building prototype gateway or portal services based on harvested subsets of the contributed metadata. It is hoped that a range of different harvesting services (organized, for example, by subject domain, information class, region, etc.) will be proposed that can help to evaluate different aspects of the framework.

Those interested in participating in any testbed should, by 4 August 2000, send a brief expression of interest (400-600 words) addressing the points raised in Section IV ("Format for expressions of interest") below.

III. The framework

The framework accommodates both data providers and service providers. A data provider agrees to support a simple harvesting protocol and to provide extracts of item-level metadata in a common, minimal-level format in response to harvest requests from trusted service providers. The data provider then records information about its metadata collections in a shared registry. A service provider uses this registry to locate potential data providers and uses the harvest protocol to collect metadata from them, possibly after reaching some kind of formal agreement on terms and conditions of use. The service provider is then able to build intellectually useful services, such as catalogs of and portals to materials distributed across multiple sites. The framework applies to a wide range of information resources of academic and scholarly interest, including printed and electronic texts, science and social science data sets, visual materials, archival collections, geographic information system (GIS) data, sound and music, video, and any other type of resource for which metadata is typically created.
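The division of roles described above can be illustrated with a small in-memory sketch. It is not the harvest protocol itself: the class names, the "trusted" set standing in for a formal agreement, and the record fields are all hypothetical and chosen only to show how a registry, data providers, and a service provider might relate to one another.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class DataProvider:
    """Hypothetical data provider holding minimal item-level records."""
    name: str
    records: List[Dict[str, str]]
    # Service providers with which terms of use have been agreed (assumption).
    trusted: Set[str] = field(default_factory=set)

    def harvest(self, requester: str, subject: Optional[str] = None) -> List[Dict[str, str]]:
        """Return matching records, but only to a trusted service provider."""
        if requester not in self.trusted:
            raise PermissionError(f"{requester} has no agreement with {self.name}")
        return [r for r in self.records
                if subject is None or r.get("subject") == subject]


@dataclass
class Registry:
    """Hypothetical shared registry in which data providers record themselves."""
    entries: Dict[str, DataProvider] = field(default_factory=dict)

    def register(self, provider: DataProvider) -> None:
        self.entries[provider.name] = provider

    def providers(self) -> List[DataProvider]:
        return list(self.entries.values())


def build_catalog(service_name: str, registry: Registry,
                  subject: Optional[str] = None) -> List[Dict[str, str]]:
    """Service provider: locate providers in the registry and harvest from each."""
    catalog = []
    for provider in registry.providers():
        try:
            harvested = provider.harvest(service_name, subject)
        except PermissionError:
            continue  # no agreement with this provider, so skip it
        for record in harvested:
            # Remember where each record came from.
            catalog.append({**record, "source": provider.name})
    return catalog


registry = Registry()
registry.register(DataProvider(
    "Library A", [{"title": "Map of Ohio", "subject": "GIS"}], trusted={"gis-portal"}))
registry.register(DataProvider(
    "Museum B", [{"title": "Portrait", "subject": "Art"}]))
catalog = build_catalog("gis-portal", registry, subject="GIS")
```

In this sketch the service provider never needs to know in advance which sites exist; it learns of them from the registry and keeps only what it is entitled to harvest.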

Theoretically, the framework could be used to open the metadata held in thousands of individual systems worldwide to central collection. For example, comprehensive collections of Americana or GIS data could be developed. This would make local repositories more generally known, and more generally useful, because researchers could search across previously unconnected materials. It would also illuminate the "dark matter" of the Internet: material that is hard or impossible to find if the user does not already know where it exists.

Most important, the academic community could begin to ensure that services would be developed that express the values of that community: services that center on materials of research and educational interest, that provide honest and transparent ranking and retrieval, and that improve search quality by intelligently integrating metadata.

The services that could be developed under the framework are limited only by need and ingenuity. The following list includes some that have been discussed but is not by any means exhaustive:

Technically, the framework adopts the Santa Fe Convention as developed by the Open Archives Initiative, specifically the DTD that it recommends for item-level metadata and a subset of the Dienst protocol as the common protocol used to harvest metadata records. A registry will be established as part of the testbed using simple web forms for data entry. Documentation will be provided, and we anticipate that through the work undertaken by testbed projects we will accumulate a library of sharable routines for functions such as converting from common metadata formats to the minimal DTD. The DLF will be working closely with the Open Archives Initiative to ensure that the technical framework maintains the stability it will need to support the testbed activity presented in this document.
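A sharable conversion routine of the kind anticipated above might look something like the following sketch. It assumes a greatly simplified MARC-like input (a mapping from field tag to a list of strings, with subfields ignored) and a hypothetical set of target element names; the actual elements are those defined by the DTD recommended in the Santa Fe Convention, which this example does not attempt to reproduce.

```python
from typing import Dict, List

# MARC tag -> hypothetical minimal element name (illustrative only; the
# real target elements are defined by the recommended DTD).
FIELD_MAP = {
    "245": "title",      # title statement
    "100": "creator",    # main entry, personal name
    "650": "subject",    # subject added entry, topical term
    "260": "publisher",  # publication, distribution, etc.
}


def marc_to_minimal(marc_fields: Dict[str, List[str]],
                    identifier: str) -> Dict[str, List[str]]:
    """Reduce a simplified MARC-like record to a minimal metadata record.

    Tags without an entry in FIELD_MAP are dropped, and empty values
    are skipped.
    """
    record: Dict[str, List[str]] = {"identifier": [identifier]}
    for tag, element in FIELD_MAP.items():
        values = [v.strip() for v in marc_fields.get(tag, []) if v.strip()]
        if values:
            record.setdefault(element, []).extend(values)
    return record


minimal = marc_to_minimal(
    {"245": ["Leaves of grass "], "100": ["Whitman, Walt"], "999": ["local note"]},
    identifier="rec-1",
)
```

Keeping the mapping in a single table means that an institution with a different source format would, under these assumptions, need to supply only its own table rather than a new routine.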

Economically, we would use the testbed to explore economic relationships between metadata suppliers, service suppliers, and those responsible for the management of necessary standards (e.g. the DTD and Dienst protocol) and tools (e.g. the registry). It is hoped that registries pointing to metadata for high-quality educational information will attract and support a number of harvesting services, each of which serves specific educational and other needs. Harvesting services, meanwhile, will theoretically encourage institutions to expose item-level metadata, for example, as a means of enhancing use of locally managed collections. However, we need to understand what kinds of terms and conditions are necessary for data providers to entrust information to service providers and for service providers to sustain their services.

IV. The testbed

The testbed seeks to address a range of related technical, functional, organizational and economic questions by:

Some key issues that work on the testbed may illuminate are indicated below:

Technically, the testbed will seek to evaluate the metadata DTD and the harvest protocol; appropriate search-engine and related technologies; generic tools appropriate for translating local metadata databases into a form suitable for exposure; and mechanisms for managing item-level metadata that offer different levels and types of access to the information objects to which they refer.

Functionally, it will determine whether harvest services meet scholars' information-seeking needs and how they can be developed or exploited by digital libraries, for example, in integrating access to distributed local collections.

Organizationally and economically, the testbed will identify how harvesting services might support themselves: whether there is a natural market for them and whether there are business models attractive to both service suppliers and service users. Equally, it will identify incentives for encouraging institutions to expose item-level metadata.

IV. Format for expressions of interest

Brief expressions of interest (400-600 words) should address one or both of the following points:

1. What metadata would you contribute to the testbed? It is assumed that every participating institution will contribute at least some metadata.

Describe the source format (MARC, Dublin Core, etc.) and coverage (number, format, and subject, chronological or other scope of the underlying information objects) of the metadata databases your institution would be prepared to disclose to such a testbed. Brief descriptions (even aggregated ones, where numerous metadata databases are concerned) are appropriate at this stage.

There are no a priori assumptions about what metadata will be suitable for a harvesting service. Metadata contained in OPACs, finding aids, and other indices, as well as in databases of digital or digitized content, are all undoubtedly of interest. Metadata records do not need to hotlink to some underlying digital information object, nor do they necessarily need to refer to information objects that can be made publicly available without restriction. The only requirement is that participating institutions expose only those metadata which they have a right to expose and make publicly accessible.

2. What harvesting service would you be interested in developing? Not all participating institutions would be expected to develop a harvesting service. Nor is it likely that available funding will support more than a small number of harvesting services.

2.1. Describe the harvesting service you would be interested in developing and the communities or organizations that would benefit most from its use.

2.2. What item-level metadata would the service require for development through a prototype phase? (Your answer to this question may help to match service needs with metadata offers.)

2.3. What staff, expertise, or other resources would the institution be able to commit towards its participation in the metadata harvesting testbed? What specific expertise, infrastructure or related experience would the institution draw upon in contributing to the testbed?

V. Next steps

Expressions of interest will identify the harvesting services that may be developed to the prototype stage, given access to appropriate item-level metadata.

In the interest of gaining experience with a range of distinctive service possibilities, members interested in developing similar harvesting services may be encouraged to align their efforts.

Depending on the level of interest in the testbed and the funding that becomes available for it, it may be necessary to go forward with only a subset of the metadata and harvesting services that are offered.

Selectivity, where required, will be governed by considerations including: