Harvesting research metadata. Aims, objectives, planning process

Research Metadata Harvest Project
Draft Request to the Mellon Foundation for Funds to Support a Planning Process

April 2000

Harvard University, in conjunction with the Digital Library Federation, proposes a planning process to define how research institutions can make available metadata for their collections for re-use in various finding systems, thus making those collections more visible to Internet users. The planning process, expected to take 6 months or less, will consist of a 2-day meeting at Harvard, plus follow-up work via e-mail and telephone. The result will be a report to the Mellon Foundation and the research community defining proposed data formats and protocols, suggested intellectual property arrangements, and one or more plans for test projects. We are requesting funding from the Andrew W. Mellon Foundation to support the meeting and the preparation of the final written report.

The Internet has indisputably emerged as the pre-eminent shop-window onto society's cultural and scholarly wealth. Through their development of online catalogues and finding aids, libraries, museums, archives, and other cultural organizations are aggressively extending access to their holdings, and users are increasingly able to view those holdings online as digitized texts, images, even sound and film. The Internet also supplies better access to the data that underpins so much scholarly investigation, for example, data derived from government departments, statistical surveys, scientific experimentation and other primary research. Educational institutions are also contributing substantially to this rapidly increasing national and international wealth through their development of online learning materials, pre-print services, and other new forms of scholarly communication.

The educational and cultural possibilities are undoubtedly enormous, yet so is the frustration experienced by those who seek to exploit this vast wealth of information. Hiding behind a myriad of URLs that remain unorganized and un-indexed, these materials remain accessible only to those who know of their existence or who stumble upon them by some merciful act of serendipity. Even where locations are known, users are unable to search across collections. Looking for information bearing on the life and times of Charles Dickens, the anatomical differences between marsupials and placental mammals, or the changing nature of fortification during the Bronze Age, the user is required to issue the same set of queries against every relevant online collection.

Obstacles to so-called "cross-searching" also impede the fullest return on any investment that is made in the development of online indices and digitized collections. If set in a very different context, some of the digital images in a database designed to support the study of the Greek town of Ephesus can make a significant contribution to another collection - one, for example, that charts architectural development in the classical world. The costs involved in producing high-quality and richly described online information demand that we pay at least some attention to the promise of re-purposing. Yet we cannot do so as long as individual information objects remain bound to a particular online product. Further, so long as it remains difficult to identify online educational and cultural information and to re-purpose selected information, presenting it in new ways and as part of new collections, there is a substantial risk of redundant effort on the part of libraries, museums, archives and others who create digitized collections.

A promising and as yet undeveloped path involves institutions disclosing item-level metadata held in local databases in some agreed harvestable form, while entering some very minimal information about the local databases into a common registry service. Together these facilities would support the development of numerous and very different views of the scholarly record. Thus, one can imagine the development of online services that provide access to information that supports investigation of particular subject areas (colonial American history); themes (biodiversity); regional holdings (cultural and educational holdings in the Pacific Northwest); even resource types (e.g. archival finding aids).
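By way of illustration only, the arrangement described above might be sketched as follows. The sketch is in Python and rests entirely on assumptions: the record fields, the Registry class, the supplier names, and the example.org link are hypothetical and stand in for data formats and protocols that the planning process itself is meant to define. It shows suppliers entering a minimal registration in a common registry, a harvester collecting what each supplier chooses to disclose, and two different service views built from the same harvested records.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """One item-level metadata record. All field names are hypothetical."""
    identifier: str
    title: str
    subjects: Tuple[str, ...]
    resource_type: str
    link: str = ""  # optional pointer to the resource the record describes


# A supplier is modelled as a function returning the records it chooses to disclose.
Supplier = Callable[[], Iterable[Record]]


@dataclass
class Registry:
    """Common registry holding minimal information about each local database."""
    suppliers: Dict[str, Supplier] = field(default_factory=dict)

    def register(self, name: str, disclose: Supplier) -> None:
        self.suppliers[name] = disclose

    def harvest(self) -> List[Record]:
        """Collect disclosed records from every registered supplier, in name order."""
        harvested: List[Record] = []
        for name in sorted(self.suppliers):
            harvested.extend(self.suppliers[name]())
        return harvested


def build_view(records: Iterable[Record], keep: Callable[[Record], bool]) -> List[Record]:
    """Build one service-side view by selecting the records a service cares about."""
    return [record for record in records if keep(record)]


def archive_records() -> List[Record]:
    # Illustrative data only; not drawn from any real collection.
    return [
        Record("archive:1", "Colonial land grant papers",
               ("colonial American history",), "finding aid"),
        Record("archive:2", "Field notebooks on regional flora",
               ("biodiversity",), "text"),
    ]


def museum_records() -> List[Record]:
    # Illustrative data only; the link is a placeholder address.
    return [
        Record("museum:1", "Photograph of the theatre at Ephesus",
               ("classical architecture",), "image",
               link="https://example.org/images/ephesus-1"),
        Record("museum:2", "Guide to the excavation archive",
               ("classical architecture",), "finding aid"),
    ]


registry = Registry()
registry.register("museum", museum_records)
registry.register("archive", archive_records)

harvested = registry.harvest()

# Two different views over the same harvested metadata.
history_view = build_view(harvested, lambda r: "colonial American history" in r.subjects)
finding_aids = build_view(harvested, lambda r: r.resource_type == "finding aid")
```

In this sketch the registry knows nothing about what each supplier holds beyond a name and a means of obtaining disclosed records, and each view is simply a selection over the harvested records, so new services can be added without any change on the supply side.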

The underlying philosophy in such a metadata harvesting scheme is to let a thousand flowers blossom on the service side and in this regard to encourage libraries and other organizations to support their users' very different needs. The same philosophy is applied to the supply side. That is, potential suppliers of item-level metadata would choose what to disclose to a wider public. The only a priori restriction would be on metadata that can be legally distributed. Suppliers might also choose to link metadata records to the information resources to which they refer. In this respect, one can envisage metadata that includes links, for example, to digital images, electronic texts, or spatially referenced databases that are in this way made available to potential users in their native (or other) formats.

The direction holds further promise for libraries and other cultural organizations which grapple with the problem of integrating their own users' access to a growing array of distributed and deeply heterogeneous online information, including locally subscribed electronic journals and abstract and indexing services, digitized collections, research databases, and online teaching materials. Indeed, this institutional interest may supply considerable momentum as potential participants in any broader metadata harvesting initiative may simultaneously address themselves to local as well as global needs.

The funds requested in this proposal are for a planning process that will chart the development of a technical, organizational and business framework within which such metadata harvesting can take place and fully develop proposals for one or several prototype metadata harvesting projects. The process will involve a two-day planning meeting, a period of community consultation and review, and the design and launch of at least one prototype project with clearly established milestones and deliverables. The process is intended to:

The two-day planning meeting will assemble a small group of people who bring relevant high-level expertise to bear on the process. In several cases they are drawn from institutions that express an interest in becoming involved in any prototype, both as metadata suppliers and as online service developers.

Results of the meeting will be fully documented and that documentation, along with draft agreements, will be circulated widely within the library community for review and comment. They will also be brought to bear in any test-bed activity that results from the planning process.

The planning process may commence as early as May 1, 2000. It would be completed within six months of the start date with deposit at the Mellon Foundation of a full report including the deliverables outlined above.

Harvard University Library will be responsible for successful completion of the planning process. Documentation will be produced, and the Digital Library Federation will take responsibility for ensuring widespread community review and consultation.