Harvesting research metadata. Aims, objectives, planning process

Research Metadata Harvest Project
Draft Request to the Mellon Foundation for Funds to Support a Planning Process

April 2000

Harvard University, in conjunction with the Digital Library Federation, proposes a planning process to define how research institutions can make available metadata for their collections for re-use in various finding systems, thus making those collections more visible to Internet users. The planning process, expected to take six months or less, will consist of a two-day meeting at Harvard, plus follow-up work via e-mail and telephone. The result will be a report to the Mellon Foundation and the research community defining proposed data formats and protocols, suggested intellectual property arrangements, and one or more plans for test projects. We are requesting funding from the Andrew W. Mellon Foundation to support the meeting and the preparation of the final written report.

The Internet has indisputably emerged as the pre-eminent shop-window onto society's cultural and scholarly wealth. Through their development of online catalogues and finding aids, libraries, museums, archives, and other cultural organizations are aggressively extending access to their holdings, and users are able increasingly to view those holdings online as digitized texts, images, even sound and film. The Internet also supplies better access to the data that underpins so much scholarly investigation, for example, as derived from government departments, statistical surveys, scientific experimentation and other primary research. Educational institutions are also contributing substantially to this rapidly increasing national and international wealth through their development of online learning materials, pre-print services, and other new forms of scholarly communications.

The educational and cultural possibilities are undoubtedly enormous, yet so is the frustration experienced by those who seek to exploit this vast wealth of information. Hiding behind a myriad of URLs that remain unorganized and un-indexed, these materials remain accessible only to those who know of their existence or who stumble upon them by some merciful act of serendipity. Even where locations are known, users are unable to search across collections. Looking for information bearing on the life and times of Charles Dickens, the anatomical differences between marsupials and placental mammals, or the changing nature of fortification during the Bronze Age, the user is required to issue the same set of queries against every relevant online collection.

Obstacles to so-called "cross-searching" also impede the fullest return on any investment that is made in the development of online indices and digitized collections. If set in a very different context, some of the digital images in a database designed to support the study of the Greek town of Ephesus can make a significant contribution to another collection - one, for example, that charts architectural development in the classical world. The costs involved in producing high quality and richly described online information demand that we pay at least some attention to the promise of re-purposing. Yet we cannot do so as long as individual information objects remain bound to a particular online product. Further, so long as it remains difficult to identify online educational and cultural information and to re-purpose selected information, presenting it in new ways and as part of new collections, there is a substantial risk of redundant effort on the part of libraries, museums, archives and others who create digitized collections.

Internet search engines offer a partial and ineffective solution: they access only a small portion of the information content that is available via the World Wide Web; do not distinguish quality in the content they do access; and have no access to the metadata that are stored in databases and in fact typically characterize the educational and research collections referenced above.
So-called subject gateways that supply organized, catalogued, searchable, and web-accessible references to scholarly and educational collections have emerged in part to fill the need but are costly to maintain. They also fail, as the Internet search engines do, to get at the individual or item-level information that makes up educational and cultural collections.
Aggregating services, too, have their limitations, for example, in their scope (catalogue records for books, art historical images, archival finding aids), and in the particular view they impose on their materials.

A promising and as yet undeveloped path involves institutions disclosing item-level metadata held in local databases in some agreed harvestable form, while entering some very minimal information about the local databases into a common registry service. Together these facilities would support the development of numerous and very different views of the scholarly record. Thus, one can imagine the development of online services that provide access to information that supports investigation of particular subject areas (colonial American history); themes (biodiversity); regional holdings (cultural and educational holdings in the Pacific Northwest); even resource types (e.g. archival finding aids).
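The arrangement described above can be illustrated with a minimal sketch. It is not a proposed design: the class names, the record fields, and the `harvest` callable (an in-memory stand-in for whatever harvesting protocol is eventually agreed) are all assumptions made for illustration, and the institutions and records are invented examples. The sketch shows only the general shape: a registry holding minimal information about each local database, and different services building different views from the same harvested records.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Record:
    """One item-level metadata record disclosed by an institution (hypothetical fields)."""
    identifier: str
    title: str
    subjects: List[str]
    resource_type: str


@dataclass
class RegistryEntry:
    """Minimal information about a local database, as entered in the common registry."""
    institution: str
    region: str
    # Hypothetical stand-in for a network harvest: returns the disclosed records.
    harvest: Callable[[], List[Record]]


@dataclass
class Registry:
    entries: List[RegistryEntry] = field(default_factory=list)

    def register(self, entry: RegistryEntry) -> None:
        self.entries.append(entry)


def build_view(registry: Registry,
               keep: Callable[[RegistryEntry, Record], bool]) -> List[Record]:
    """Harvest every registered database and keep the records one service wants."""
    view = []
    for entry in registry.entries:
        for record in entry.harvest():
            if keep(entry, record):
                view.append(record)
    return view


# Invented example suppliers and records.
registry = Registry()
registry.register(RegistryEntry(
    institution="Example University Library",
    region="Pacific Northwest",
    harvest=lambda: [
        Record("ex-1", "Letters of a colonial merchant",
               ["colonial American history"], "archival finding aid"),
        Record("ex-2", "Survey of coastal wetlands", ["biodiversity"], "dataset"),
    ],
))
registry.register(RegistryEntry(
    institution="Example Museum",
    region="New England",
    harvest=lambda: [
        Record("ex-3", "Field notes on amphibians", ["biodiversity"], "manuscript"),
    ],
))

# Three different services, three different views of the same disclosed metadata.
biodiversity = build_view(registry, lambda e, r: "biodiversity" in r.subjects)
northwest = build_view(registry, lambda e, r: e.region == "Pacific Northwest")
finding_aids = build_view(registry, lambda e, r: r.resource_type == "archival finding aid")
```

The thematic, regional, and resource-type views correspond to the kinds of service mentioned above; each is simply a different filter applied to the same harvested records, which is why the supply side need not anticipate the services built upon it.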

The underlying philosophy in such a metadata harvesting scheme is to let a thousand flowers blossom on the service side and in this regard to encourage libraries and other organizations to support their users' very different needs. The same philosophy is applied to the supply side. That is, potential suppliers of item level metadata would choose what to disclose to a wider public. The only a priori restriction would be on metadata that can be legally distributed. Suppliers might also choose to link metadata records to the information resources to which they refer. In this respect, one can envisage metadata that includes links, for example, to digital images, electronic texts, or spatially referenced databases that are in this way made available to potential users in their native (or other) formats.

The direction holds further promise for libraries and other cultural organizations which grapple with the problem of integrating their own users' access to a growing array of distributed and deeply heterogeneous online information, including locally subscribed electronic journals and abstract and indexing services, digitized collections, research databases, and online teaching materials. Indeed, this institutional interest may supply considerable momentum as potential participants in any broader metadata harvesting initiative may simultaneously address themselves to local as well as global needs.

The funds requested in this proposal are for a planning process that will chart the development of a technical, organizational and business framework within which such metadata harvesting can take place and fully develop proposals for one or several prototype metadata harvesting projects. The process will involve a two-day planning meeting, a period of community consultation and review, and the design and launch of at least one prototype project with clearly established milestones and deliverables. The process is intended to:

agree a common format for disclosing item level metadata, giving preference to the adoption of existing standards (Dublin Core, XML) over the development of proprietary schemes;
agree the minimum level information that may need to be supplied about any metadata collection to some common registry service (again with reference to existing approaches to collection level descriptions and registry services) and the organizational and business issues involved in the establishment of such a service;
agree protocols for registry access and metadata harvesting;
identify promising search strategies and any tools that may need to be developed in order to support them;
identify promising service-side scenarios through which harvested metadata may be presented to specific user communities and the incentives and business models that may apply in each of those scenarios;
identify organizational and business issues that would need to be addressed by any pilot project or projects and rules of engagement that would need to be agreed by participants in those projects to encourage their orderly progress;
outline an implementation process and supply milestones for any pilot project or projects, addressing their funding requirements;
nurture the development of pilot project(s) and other implementation efforts.
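To give a sense of what a common disclosure format built on existing standards might look like, the following sketch serializes one item-level record as XML using Dublin Core element names and reads it back as a harvester might. It is offered only as an illustration under stated assumptions: the `record` wrapper element, the function names, and the sample values (including the example.org link) are hypothetical, and the actual format is precisely what the planning process is meant to agree.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(fields):
    """Serialize (element, value) pairs as a simple XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


def from_dublin_core(xml_text):
    """Parse a record back into (element, value) pairs, as a harvester might."""
    record = ET.fromstring(xml_text)
    prefix = f"{{{DC_NS}}}"
    return [
        (child.tag[len(prefix):], child.text)
        for child in record
        if child.tag.startswith(prefix)
    ]


# Invented sample record; the identifier links the metadata to the resource itself.
fields = [
    ("title", "View of the theatre at Ephesus"),
    ("type", "Image"),
    ("subject", "Classical architecture"),
    ("identifier", "http://example.org/images/ephesus-001"),
]
xml_text = to_dublin_core(fields)
round_trip = from_dublin_core(xml_text)
```

Because the record carries only widely understood elements, a service charting classical architecture and a service devoted to Ephesus could both harvest and present it without knowledge of the supplier's local database.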

The two-day planning meeting will assemble a small group of people who bring relevant high-level expertise to bear on the process. In several cases they are drawn from institutions that express an interest in becoming involved in any prototype, both as metadata suppliers and as online service developers.

Results of the meeting will be fully documented and that documentation, along with draft agreements, will be circulated widely within the library community for review and comment. They will also be brought to bear in any test-bed activity that results from the planning process.

The planning process may commence as early as May 1, 2000. It would be completed within six months of the start date with deposit at the Mellon Foundation of a full report including the deliverables outlined above.

Harvard University Library will be responsible for successful completion of the planning process. Documentation will be produced, and the Digital Library Federation will take responsibility for ensuring widespread community review and consultation.
