Research Metadata Harvest Project
Draft Request to the Mellon Foundation for Funds to Support a
Planning Process
April 2000
Harvard University, in conjunction with the Digital Library
Federation, proposes a planning process to define how research
institutions can make available metadata for their collections
for re-use in various finding systems, thus making those
collections more visible to Internet users. The planning process,
expected to take 6 months or less, will consist of a 2-day
meeting at Harvard, plus follow-up work via e-mail and telephone.
The result will be a report to the Mellon Foundation and the
research community defining proposed data formats and protocols,
suggested intellectual property arrangements, and one or more
plans for test projects. We are requesting funding from the
Andrew W. Mellon Foundation to support the meeting and the
preparation of the final written report.
The Internet has indisputably emerged as the pre-eminent
shop-window onto society's cultural and scholarly wealth. Through
their development of online catalogues and finding aids,
libraries, museums, archives, and other cultural organizations
are aggressively extending access to their holdings, and users
are able increasingly to view those holdings online as digitized
texts, images, even sound and film. The Internet also supplies
better access to the data that underpins so much scholarly
investigation, for example, as derived from government
departments, statistical surveys, scientific experimentation and
other primary research. Educational institutions are also
contributing substantially to this rapidly increasing national
and international wealth through their development of online
learning materials, pre-print services, and other new forms of
scholarly communications.
The educational and cultural possibilities are undoubtedly
enormous, yet so is the frustration experienced by those who seek
to exploit this vast wealth of information. Hiding behind a
myriad of URLs that remain unorganized and un-indexed, these
materials remain accessible only to those who know of their
existence or who stumble upon them by some merciful act of
serendipity. Even where locations are known, users are unable to
search across collections. Looking for information bearing on the
life and times of Charles Dickens, the anatomical differences
between marsupials and placental mammals, or the changing nature of
fortification during the Bronze Age, the user is required to
issue the same set of queries against every relevant online
collection.
Obstacles to so-called "cross-searching" also impede the
fullest return on any investment that is made in the development
of online indices and digitized collections. If set in a very
different context, some of the digital images in a database
designed to support the study of the Greek town of Ephesus can
make a significant contribution to another collection - one, for
example, that charts architectural development in the classical
world. The costs involved in producing high quality and richly
described online information demand that we pay at least some
attention to the promise of re-purposing. Yet we cannot so long
as individual information objects remain bound to a particular
online product. Further, so long as it remains difficult to
identify online educational and cultural information and to
re-purpose selected information, presenting it in new ways and as
part of new collections, there is a substantial risk of redundant
effort on the part of libraries, museums, archives and others who
create digitized collections.
- Internet search engines offer a partial and ineffective
solution: they access only a small portion of the information
content that is available via the World Wide Web; do not
distinguish quality in the content they do access; and have no
access to the metadata that are stored in databases and in fact
typically characterize the educational and research collections
referenced above.
- So-called subject gateways that supply organized, catalogued,
searchable, and web-accessible references to scholarly and
educational collections have emerged in part to fill the need but
are costly to maintain. They also fail, as the Internet search
engines do, to get at the individual or item-level information
that makes up educational and cultural collections.
- Aggregating services, too, have their limitations, for
example, in their scope (catalogue records for books, art
historical images, archival finding aids), and in the particular
view they impose on their materials.
A promising and as yet undeveloped path involves institutions
disclosing item-level metadata held in local databases in some
agreed harvestable form, while entering some very minimal
information about the local databases into a common registry
service. Together these facilities would support the development
of numerous and very different views of the scholarly record.
Thus, one can imagine the development of online services that
provide access to information that supports investigation of
particular subject areas (colonial American history); themes
(biodiversity); regional holdings (cultural and educational
holdings in the Pacific Northwest); even resource types (e.g.
archival finding aids).
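The arrangement described above can be illustrated with a minimal sketch. It is not a proposed design: the names Record, RegistryEntry, harvest and build_view, the sample institutions, and the sample records are all hypothetical, and an in-memory list stands in for whatever harvestable form and registry service the planning process might agree. The sketch only shows how a single pool of disclosed item-level metadata could feed several different views (by subject, by region, by resource type).

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One item-level metadata record disclosed by an institution (hypothetical shape)."""
    identifier: str
    title: str
    subjects: list = field(default_factory=list)
    resource_type: str = ""


@dataclass
class RegistryEntry:
    """Minimal collection-level information held in the common registry (hypothetical shape)."""
    name: str
    region: str
    records: list  # stands in for a harvestable local database


def harvest(registry):
    """Yield every disclosed record, paired with the registry entry it came from."""
    for entry in registry:
        for record in entry.records:
            yield entry, record


def build_view(registry, predicate):
    """Build one service-side view: titles of harvested records that satisfy the predicate."""
    return [rec.title for entry, rec in harvest(registry) if predicate(entry, rec)]


# Invented sample data, for illustration only.
registry = [
    RegistryEntry("Library A", "Pacific Northwest", [
        Record("a1", "Colonial town records", ["colonial American history"], "finding aid"),
        Record("a2", "Salmon survey data", ["biodiversity"], "dataset"),
    ]),
    RegistryEntry("Museum B", "New England", [
        Record("b1", "Letters of a colonial governor", ["colonial American history"], "text"),
    ]),
]

# Three different views built over the same harvested metadata.
by_subject = build_view(registry, lambda e, r: "colonial American history" in r.subjects)
by_region = build_view(registry, lambda e, r: e.region == "Pacific Northwest")
by_type = build_view(registry, lambda e, r: r.resource_type == "finding aid")
```

The point of the sketch is that suppliers disclose their metadata once, while any number of services can select from it according to their own users' needs.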
The underlying philosophy in such a metadata harvesting scheme is
to let a thousand flowers blossom on the service side and in this
regard to encourage libraries and other organizations to support
their users' very different needs. The same philosophy is applied
to the supply side. That is, potential suppliers of item level
metadata would choose what to disclose to a wider public. The
only a priori restriction would be on metadata that can be
legally distributed. Suppliers might also choose to link metadata
records to the information resources to which they refer. In this
respect, one can envisage metadata that includes links, for
example, to digital images, electronic texts, or spatially
referenced databases that are in this way made available to
potential users in their native (or other) formats.
The direction holds further promise for libraries and other
cultural organizations which grapple with the problem of
integrating their own users' access to a growing array of
distributed and deeply heterogeneous online information,
including locally subscribed electronic journals and abstract and
indexing services, digitized collections, research databases, and
online teaching materials. Indeed, this institutional interest
may supply considerable momentum as potential participants in any
broader metadata harvesting initiative may simultaneously address
themselves to local as well as global needs.
The funds requested in this proposal are for a planning
process that will chart the development of a technical,
organizational and business framework within which such metadata
harvesting can take place and fully develop proposals for one or
several prototype metadata harvesting projects. The process will
involve a two-day planning meeting, a period of community
consultation and review, and the design and launch of at least
one prototype project with clearly established milestones and
deliverables. The process is intended to:
- agree a common format for disclosing item level metadata,
giving preference to the adoption of existing standards (Dublin
Core, XML) over the development of proprietary schemes;
- agree the minimum level information that may need to be
supplied about any metadata collection to some common registry
service (again with reference to existing approaches to
collection level descriptions and registry services) and the
organizational and business issues involved in the establishment
of such a service;
- agree protocols for registry access and metadata
harvesting;
- identify promising search strategies and any tools that may
need to be developed in order to support them;
- identify promising service-side scenarios through which harvested
metadata may be presented to specific user communities and the
incentives and business models that may apply in each of those
scenarios;
- identify organizational and business issues that would need
to be addressed by any pilot project or projects and rules of
engagement that would need to be agreed by participants in those
projects to encourage their orderly progress;
- outline an implementation process and supply milestones for
any pilot project or projects, addressing their funding
requirements;
- nurture the development of pilot project(s) and other
implementation efforts.
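As an illustration of the first item above, the following sketch shows what an item-level record might look like if Dublin Core elements were carried in XML. It is an assumption made for discussion, not an agreed format: the "record" wrapper element, the function names, and the sample values (including the example.org identifier) are hypothetical. Only the Dublin Core element names and namespace are taken from the existing standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core_xml(fields):
    """Serialize (element name, value) pairs as Dublin Core elements.

    The <record> wrapper is a hypothetical container, not part of Dublin Core.
    """
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


def from_dublin_core_xml(xml_text):
    """Parse a record produced by to_dublin_core_xml back into (name, value) pairs."""
    root = ET.fromstring(xml_text)
    # Tags arrive as "{namespace}name"; keep only the local element name.
    return [(child.tag.split("}", 1)[1], child.text) for child in root]


# Invented sample record, for illustration only.
fields = [
    ("title", "View of the theatre at Ephesus"),
    ("type", "Image"),
    ("identifier", "http://example.org/images/ephesus-001"),
]
xml_text = to_dublin_core_xml(fields)
round_tripped = from_dublin_core_xml(xml_text)
```

Here the identifier element doubles as the link from the metadata record to the resource it describes, which is one way a supplier might choose to expose the underlying digital object.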
The two-day planning meeting will assemble a small group of
people who bring relevant high-level expertise to bear on the
process. In several cases they are drawn from institutions that
express an interest in becoming involved in any prototype, both
as metadata suppliers and as online service developers.
Results of the meeting will be fully documented and that
documentation, along with draft agreements, will be circulated
widely within the library community for review and comment. They
will also be brought to bear in any test-bed activity that
results from the planning process.
The planning process may commence as early as May 1, 2000. It
would be completed within six months of the start date with
deposit at the Mellon Foundation of a full report including the
deliverables outlined above.
Harvard University Library will be responsible for successful
completion of the planning process. Documentation will be
produced, and the Digital Library Federation will take
responsibility for ensuring widespread community review and
consultation.