Metadata harvesting testbed. Call for expressions of
interest
The Digital Library Federation
29 June 2000
Please note, responses to this call are no longer being
sought by the DLF.
I. Executive summary
The Andrew W. Mellon Foundation has asked the DLF to propose a
coherent set of projects that together can demonstrate how and to
what extent emerging network technologies may make scholarly
information resources more readily accessible via online gateway
or portal services.
I am therefore writing to solicit brief expressions of
interest from DLF member institutions that wish to
participate in such projects. Collaborative projects are
especially welcome and may certainly involve institutions that
are not DLF members, including other universities, colleges,
museums, archives, publishers, professional societies, authors,
etc.
II. Introduction
With a planning grant from the Andrew W. Mellon Foundation,
the DLF has developed a framework for disclosing the collective
wealth of our educational and cultural collections. Built upon
technical work conducted by the Open Archives Initiative, the
framework promises to support the emergence of online portal or
gateway services that integrate access to the item-level metadata
managed by libraries, museums, cultural organizations, publishers,
authors, scholarly societies, etc. Significantly, it may help the
DLF to take a substantial step forward in realizing its mission
"to bring together -- from across the nation and beyond --
digitized materials that will be made accessible to students,
scholars, and citizens everywhere, and that document the building
and dynamics of America's heritage and cultures."
A next step is evaluating the framework through some
co-ordinated testbed activities to be supported by the Mellon
Foundation, the DLF, and potentially other agencies.
I am accordingly looking for expressions of interest from
members willing to participate in such activities by contributing
metadata for some of their library's digital or paper-based
collections. I am also looking for members who may be interested
in building prototype gateway or portal services based on
harvested subsets of the metadata that are contributed. It is
hoped that a range of different harvesting services (for example,
as organized by subject domain, information class, region, etc.)
will be proposed that can help to evaluate different aspects of
the framework.
Those interested in participating in any testbed should, by 4
August 2000, send a brief expression of interest (400-600 words)
addressing the points raised in the section "Format for
expressions of interest" below.
III. The framework
The framework accommodates both data providers and service
providers. A data provider agrees to support a simple harvesting
protocol and to provide extracts of item-level metadata in a
common, minimal-level format in response to harvest requests from
trusted service providers. The data provider then records
information about its metadata collections in a shared registry.
A service provider uses this registry to locate potential data
providers, and uses the harvest protocol to collect metadata from
them, possibly after
reaching some kind of formal agreement on terms and conditions of
use. The service provider is then able to build intellectually
useful services, such as catalogs and portals to materials
distributed across multiple sites. The framework applies to a
wide range of information resources of academic and scholarly
interest including printed and electronic texts, science and
social science data sets, visual materials, archival collections,
geographic information system (GIS) data, sound and music, video,
and any other type of resource for which metadata is typically
created.
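The division of roles described above can be sketched in a few
lines of Python. This is an illustrative model only, not the
harvest protocol itself: the class names, the "trusted" set, the
keyword lookup in the registry, and the sample records are all
assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataProvider:
    """An institution exposing minimal item-level metadata (hypothetical model)."""
    name: str
    records: list  # each record is a dict in some common minimal format
    trusted: set = field(default_factory=set)  # service providers allowed to harvest

    def harvest(self, requester):
        # Only trusted service providers receive metadata extracts.
        if requester not in self.trusted:
            raise PermissionError(f"{requester} is not trusted by {self.name}")
        # Tag each record with its source so a portal can link back to it.
        return [dict(record, source=self.name) for record in self.records]


class Registry:
    """Shared registry in which data providers describe their collections."""

    def __init__(self):
        self._entries = {}

    def register(self, provider, description):
        self._entries[provider.name] = (provider, description)

    def find(self, keyword):
        keyword = keyword.lower()
        return [p for p, d in self._entries.values() if keyword in d.lower()]


def build_index(service_name, registry, keyword):
    """Locate providers via the registry, harvest them, and merge the results."""
    index = []
    for provider in registry.find(keyword):
        try:
            index.extend(provider.harvest(service_name))
        except PermissionError:
            continue  # no agreement with this provider yet; skip it
    return index


# Invented sample data: two providers, only one of which trusts the portal.
registry = Registry()
library_a = DataProvider("Library A", [{"title": "Map of Illinois"}],
                         trusted={"gis-portal"})
archive_b = DataProvider("Archive B", [{"title": "Nevada survey data"}])
registry.register(library_a, "GIS data and maps")
registry.register(archive_b, "GIS data sets")
index = build_index("gis-portal", registry, "gis")
```

In this sketch the portal ends up holding only the records from
the provider that trusts it, which mirrors the point that
harvesting may depend on a prior agreement on terms and
conditions of use.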
Theoretically, the framework could be used to expose the
metadata in thousands of individual systems worldwide to central
collection. For example, comprehensive collections of Americana
or GIS data could be developed. This would make local
repositories more generally known, and more generally useful,
because researchers could search across previously unconnected
materials. It would also illuminate the "dark matter" of the
Internet -- material that is hard or impossible to find if the user
does not already know where it exists.
Most important, the academic community could begin to ensure
that services would be developed that express the values of that
community -- services that center on materials of research and
educational interest, that provide honest and transparent ranking
and retrieval, and that improve search quality by intelligently
integrating metadata.
The services that could be developed under the framework are
limited only by need and ingenuity. The following list includes
some that have been discussed but is not by any means
exhaustive:
- A portal to digital Americana. Many universities, archives,
historical societies, cultural institutions, and other
organizations are creating Web-accessible collections of
Americana, often with grant funds. Currently, these materials
remain largely invisible to educators and scholars. A service
focusing on harvested metadata for Americana might combine access
to archival visual and textual collections, such as those
included in American Memory, with citations to electronic journal
articles from JSTOR, early American fiction from the Universities
of Virginia and Indiana, H. H. Richardson architectural drawings
from Harvard, the Hoagy Carmichael collection at Indiana
University, Hawaiian language newspapers from the University of
Hawaii, and audio, visual, textual, and multimedia materials from
hundreds of relevant sites.
- A portal to environmental information. Environmental
information is collected by hundreds of international, federal,
state, and private agencies, and described using dozens of
metadata formats. This information is used intensively by
government and university researchers, despite the difficulty of
finding data scattered among such a vast number of sites. A
portal built upon harvested metadata could combine access to
land, air, and space data from key government agencies with
access to white papers, treaties, policy documents, journals,
newsletters, and other relevant sources of environmental
information. An even more ambitious service might combine search
access to environmental information with geographic information
resources such as those indexed by the Illinois Natural Resources
Geospatial Data Clearinghouse, the University of Nevada
Geospatial Data Clearinghouse, the New York State Spatial Data
Clearinghouse, and other regional clearinghouses of geospatial
information.
- The academic engine. Despite the availability of library
catalogs, online journal search services, and departmental
databases, many university students and researchers turn first to
the major commercial Internet search engines for resource
discovery. A comprehensive Internet search service oriented
toward academic and research resources would be a more productive
alternative. Such a service might include all the information
covered in more specialized portals (e.g., Americana,
environmental information, GIS), as well as metadata from
academic catalogs and databases, Web pages in the ".edu" domain,
and commercial resources aimed at the research community.
Technically, the framework adopts the Santa Fe Convention as
developed by the Open Archives Initiative, specifically the DTD
that it recommends for item-level metadata and a subset of the
Dienst protocol as the common protocol used to harvest metadata
records. A registry will be established as part of the testbed
using simple web forms for data entry. Documentation will be
provided, and we anticipate that through the work undertaken by
testbed projects we will accumulate a library of sharable
routines for functions such as converting from common metadata
formats to the minimal DTD. The DLF will be working closely with
the Open Archives Initiative to ensure that the technical
framework maintains the stability it will need to support the
testbed activity presented in this document.
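As a rough illustration of the kind of sharable conversion
routine mentioned above, the following Python sketch maps a
simplified MARC-like record to a minimal XML record. The output
element names ("record", "title", "author", "subject") are
placeholders invented for this example and are not taken from
the DTD recommended by the Santa Fe Convention; the input
structure (a dictionary of tag to list of values) is likewise an
assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from MARC tags to minimal element names.
# The element names are placeholders, NOT those of the actual DTD.
MARC_TO_MINIMAL = {
    "245": "title",    # MARC 245: title statement
    "100": "author",   # MARC 100: main entry, personal name
    "650": "subject",  # MARC 650: subject added entry, topical term
}


def to_minimal_record(marc_fields, identifier):
    """Convert a simplified {tag: [values]} dict to a minimal XML string."""
    record = ET.Element("record", {"id": identifier})
    for tag, element_name in MARC_TO_MINIMAL.items():
        for value in marc_fields.get(tag, []):
            ET.SubElement(record, element_name).text = value
    # Fields with no mapping are dropped from the minimal record.
    return ET.tostring(record, encoding="unicode")


# Invented sample record for demonstration.
sample = {
    "245": ["Walden"],
    "100": ["Thoreau, Henry David"],
    "650": ["Nature"],
    "300": ["357 p."],  # physical description: not mapped, so omitted
}
minimal_xml = to_minimal_record(sample, "x1")
```

A table-driven mapping of this kind keeps the format-specific
knowledge in one place, so a routine for another source format
would in principle need only a different mapping table.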
Economically, we would use the testbed to explore the
relationships between metadata suppliers, service suppliers, and
those responsible for the management of necessary standards (e.g.
the DTD and Dienst protocol) and tools (e.g. the registry). It is
hoped that registries pointing to metadata for high-quality
educational information will attract and support a number of
harvesting services each of which serves specific educational and
other needs. Harvesting services, meanwhile, will theoretically
encourage institutions to expose item-level metadata, for
example, as a means of enhancing use of locally managed
collections. However, we need to understand what kinds of terms
and conditions are necessary for data providers to entrust
information to service providers and for service providers to
sustain their services.
IV. The testbed
The testbed seeks to address a range of related technical,
functional, organizational and economic questions by:
- exposing a pool of item-level metadata via some lightweight
registry service;
- supporting the development of some harvesting services built
upon the metadata disclosed via the registry;
- supporting appropriate research and development work in tools
and other infrastructure, including legal and economic
agreements, that may be commonly required by those interested in
exposing item-level metadata or developing harvesting
services.
Some key issues that work on the testbed may illuminate are
indicated below:
Technically, the testbed will seek to evaluate the metadata DTD
and the harvest protocol; appropriate search-engine and related
technologies; generic tools appropriate for translating local
metadata databases into a form suitable for exposure; and
mechanisms for managing item level metadata that offer different
levels and types of access to the information objects to which
they refer.
Functionally, it will determine whether harvest services meet
scholars' information-seeking needs and how they can be developed
or exploited by digital libraries, for example, in integrating
access to distributed local collections.
Organizationally and economically, the testbed will identify
how harvesting services might support themselves; whether there
is a natural market for them and business models attractive to
both service suppliers and service users. Equally, it will
identify incentives for encouraging institutions to expose
item-level metadata.
V. Format for expressions of interest
Brief expressions of interest (400-600 words)
should address one or both of the following points:
1. What metadata would you contribute to the testbed? It
is assumed that every participating institution will contribute
at least some metadata.
Describe the source format (MARC, Dublin Core, etc.) and
coverage (number, format, and subject, chronological or other
scope of underlying information objects) of metadata databases
your institution would be prepared to disclose to such a testbed.
Brief descriptions (even aggregated ones, where numerous metadata
databases are concerned) are appropriate at this stage.
There are no a priori assumptions about what metadata
will be suitable to a harvesting service. Metadata comprised in
OPACs, finding aids, and other indices as well as in databases of
digital or digitized content are all undoubtedly of interest.
Metadata records do not need to hotlink to some underlying
digital information object, nor do metadata records need
necessarily to refer to information objects that can be made
publicly available without restriction. The only requirement is
that participating institutions expose only those metadata which
they have a right to expose and make publicly accessible.
2. What harvesting service would you be interested in
developing? Not all participating institutions would be
expected to develop a harvesting service. Nor is it likely that
available funding will support more than a small number of
harvesting services.
2.1. Describe the harvesting service you would be interested
in developing and the communities or organizations that would
benefit most from its use.
2.2. What item-level metadata would the service require for
development through a prototype phase? (Your answer to this
question may help to match service needs with metadata
offers.)
2.3. What staff, expertise, or other resources would the
institution be able to commit towards its participation in the
metadata harvesting testbed? What specific expertise,
infrastructure or related experience would the institution draw
upon in contributing to the testbed?
VI. Next steps
Expressions of interest will identify the harvesting services
that may be developed to the prototype stage, given access to
appropriate item-level metadata.
In the interest of gaining experience with a range of
distinctive service possibilities, members interested in
developing similar harvesting services may be encouraged to align
their efforts.
Depending on the level of interest in the testbed and the
funding that becomes available for it, it may be necessary to go
forward with only a subset of the metadata and harvesting
services that are offered.
Selectivity, where required, will be governed by
considerations including:
- level and nature of appropriate metadata likely to be
available to the harvesting service;
- aspects of the framework that the harvesting service promises
to illuminate;
- costs involved in exposing appropriate metadata and
developing the prototype harvesting service;
- possibility that the harvesting service may be able (a) to
sustain itself after an initial grant-assisted period and (b) to
contribute common knowledge, tools, etc. to educational and
library communities;
- specific interests of the sponsoring funding agency or
agencies.