A New Approach to Finding Research
Materials on the Web
Widespread network access to digital resources has created a
paradox for the academic and research community. Despite enormous
institutional investment in the creation and description of
materials of serious interest to research and education, these
resources exist in isolated pockets. They are difficult to find
and impossible to search across. Meanwhile, students and faculty
are tempted into over-reliance on commercial Internet search
engines, despite their limitations and the uneven quality of the
materials they include. In terms of providing convenient network
access to information resources of value, neither the traditional
library approach nor the emergent Internet approach is serving
the academic community well.
The traditional library approach has relied upon the creation
of descriptive metadata to give users search access to materials.
To make researchers aware of the contents of dispersed and
uncoordinated collections, union catalogs that reflect the
holdings of many libraries have been created. These systems,
however, require participants to actively contribute records to
the union database. Moreover, union catalogs are designed around
a single type of metadata, which limits the catalogs to materials
that can be described by such metadata. Commercial abstracting
and indexing services have developed separate systems for access
to the journal literature, leading to a disjuncture between the
search for monographs and the search for journal articles. The
emergence in the 1990s of new metadata formats for everything
from archival finding aids to social science data sets has led to
an increasing balkanization of search-and-retrieval systems.
Some libraries have taken an alternative approach, using a
standard search-and-retrieval protocol, Z39.50, to perform
broadcast searching of physically separate databases. While this
approach has the advantage of offering virtual union search
capability across repositories with differing underlying data
formats, it presents problems of its own. The data providers must
support complex Z39.50 server software, and considerable
coordination is required to set up workable profiles. Z39.50
search also works best across a limited number of services; it
does not scale to the thousands of potential sources of digital
content.
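To make the contrast with harvesting concrete, the sketch below shows the broadcast-search pattern in schematic form: one query is fanned out to several independent servers and the results are merged for the user. This is not actual Z39.50, which involves session-oriented connections and carefully negotiated profiles; the server list and the search_one placeholder are hypothetical, and serve only to show why every search is hostage to its slowest or least reliable target as the number of sources grows.

    # Schematic broadcast search: fan one query out to several catalog
    # servers and merge whatever comes back. A hypothetical stand-in for
    # Z39.50 broadcast searching; the servers and search_one are illustrative.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    SERVERS = ["catalog.example.edu", "archive.example.org", "data.example.gov"]

    def search_one(server: str, query: str) -> list[dict]:
        # A real client would open a Z39.50 session, negotiate a profile,
        # and translate the query into the server's supported syntax.
        raise NotImplementedError("placeholder for a per-server client")

    def broadcast_search(query: str) -> list[dict]:
        hits: list[dict] = []
        with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
            futures = [pool.submit(search_one, s, query) for s in SERVERS]
            for future in as_completed(futures):
                try:
                    hits.extend(future.result())
                except Exception:
                    # One slow or unreachable server degrades every search.
                    continue
        return hits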
Internet search engines, in contrast, have generally relied
upon the automatic indexing of HTML text, as opposed to the
creation of metadata, and upon the automatic harvesting of Web
sites, rather than on the active contributions from data
providers. Consequently, these services scale in a way that
traditional library approaches do not, and they allow for the
creation of massive indexes that dwarf the largest library union
catalog. However, Internet search engines cover academic and
scholarly materials poorly, burying them in quantities of less
reliable resources provided by the commercial sector or by
unknown and uncredentialed individuals and organizations.
Further, the search engines cover only a portion of "Web space"
(as little as 17 percent according to a recent study) and
frequently favor retrieval of resources based on their own
business considerations rather than on the needs of searchers.
They are susceptible to page-jacking, index spamming, and other
dubious practices. Moreover, an enormous percentage of scholarly
materials, from digitized slides to survey data, is described not in
static Web pages but in myriad databases; consequently, such
materials are largely invisible to the search engines.
A recent and promising approach attempts to combine the best
of library and Internet techniques into a wholly new model for
accessing scholarly resources. This model has its genesis in the
October 1999 meeting of the Open Archives Initiative in Santa Fe,
New Mexico, under the sponsorship of CLIR, the Digital Library
Federation (DLF), the Scholarly Publishing and Academic Resources
Coalition, the Association of Research Libraries, and the
Research Library of the Los Alamos National Laboratory. The group
discussed the interoperation of "e-print archives" (collections
of electronic journal articles and preprints), focusing on how
e-print repositories could most easily share metadata about their
holdings. The group decided to pursue an approach adapted from
the harvesting technique employed by the Internet search engines.
In this approach, there are data providers and service providers.
A data provider agrees to support a simple harvesting protocol
and to provide extracts of its metadata in a common,
minimal-level format in response to harvest requests. It then
records information about its collection in a shared registry. A
service provider uses this registry to locate participating data
providers, and uses the harvest protocol to collect metadata from
them. The service provider is then able to build intellectually
useful services, such as catalogs and portals to materials
distributed across multiple e-print sites. Ongoing commitment by
the principals of the Open Archives Initiative is expected to yield
refined conventions and testbed implementations of this model.
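As a concrete illustration of this division of labor, the sketch below shows the service-provider side: reading a registry of participating repositories, issuing a simple HTTP harvest request to each, and pooling the returned records into a single list from which a catalog or portal could be built. The registry file, the request parameters, and the JSON record format are assumptions made for the sketch, not the Santa Fe conventions themselves, which were still being refined when this was written.

    # Illustrative metadata harvester for the data-provider / service-provider
    # model. The registry format, the ListRecords-style request, and the JSON
    # response are assumptions of this sketch, not the Santa Fe conventions.
    import json
    import urllib.parse
    import urllib.request

    def read_registry(path: str) -> list[str]:
        # Assumes a JSON file listing the base URLs of participating data
        # providers, e.g. ["https://eprints.example.edu/harvest", ...]
        with open(path) as f:
            return json.load(f)

    def harvest(base_url: str) -> list[dict]:
        # Ask one data provider for its metadata in a common, minimal format.
        # Assumes an HTTP GET answered with a JSON list of records carrying a
        # few Dublin Core-like elements (title, creator, date, identifier).
        url = base_url + "?" + urllib.parse.urlencode({"verb": "ListRecords"})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    def build_index(registry_path: str) -> list[dict]:
        # Service-provider side: pool the harvested records into one list,
        # from which a catalog or portal can then be built.
        index: list[dict] = []
        for provider in read_registry(registry_path):
            try:
                index.extend(harvest(provider))
            except OSError:
                continue  # an unreachable provider is simply skipped this run
        return index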
Under the aegis of the DLF and with support from the Andrew W.
Mellon Foundation, a related group has been building on this
promising foundation by discussing how to generalize the concepts
developed in Santa Fe into a universal model for research
metadata harvesting. This model would apply to a wider range of
digital resources of academic and scholarly interest. In addition
to e-prints and electronic texts, such resources include science
and social science data sets, visual materials, archival
collections, geographic information system (GIS) data, sound and
music, video, and any other type of resource for which metadata
is typically created.
This approach appears to have many virtues. The effort to
enable a data provider to support metadata harvesting is
reasonable compared with that required to contribute to a union
catalog or provide Z39.50 search access. After the repository is
set up, its metadata can be harvested with relatively little
intervention, providing the scalability that library approaches
have lacked. The development of both inclusive and specialized
search services is possible; in fact, the model encourages the
development of many search services competing in terms of
functionality, audience, and business models, thereby enriching
the entire research environment.
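The claim of scalability rests largely on the fact that, once a repository is exposed, harvesting can be re-run unattended and need only pick up records added or changed since the previous visit. A minimal sketch of that incremental pattern, again assuming a datestamp parameter rather than any finalized convention, follows.

    # Incremental harvest: request only records changed since the last run.
    # The "from" datestamp parameter is an assumption of this sketch.
    import datetime
    import json
    import urllib.parse
    import urllib.request

    def harvest_since(base_url: str, last_run: datetime.date) -> list[dict]:
        params = {"verb": "ListRecords", "from": last_run.isoformat()}
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # A nightly job can then merge the handful of new or revised records into
    # the service provider's index with no human intervention.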
This model could be used to expose the metadata in thousands
of individual systems worldwide to central collection. For
example, comprehensive collections of Americana or GIS data could
be developed. This would make local repositories more generally
known, and more generally useful, because researchers could
search across previously unconnected materials. It would also
illuminate the "dark matter" of the Internet: material that is
hard or impossible to find if the user does not already know
where it exists.
Most important, the academic community could begin to ensure the
development of services that express the values of that
community: services that center on materials of research and
educational interest, that provide honest and transparent ranking
and retrieval, and that improve search quality by intelligently
integrating metadata.
The services that could be developed under the expanded Santa
Fe framework are limited only by need and ingenuity. The
following are among those discussed:
- A portal to digital Americana. Many universities, archives,
historical societies, cultural institutions, and other
organizations are creating Web-accessible collections of
Americana, often with grant funds. Currently, these materials
remain largely invisible to educators and scholars. A service
focusing on harvested metadata for Americana might combine access
to archival visual and textual collections, such as those
included in American Memory, with citations to electronic journal
articles from JSTOR, early American fiction from the University
of Virginia, H. H. Richardson architectural drawings from
Harvard, the Hoagy Carmichael collection at Indiana University,
Hawaiian language newspapers from the University of Hawaii, and
audio, visual, textual, and multimedia materials from hundreds of
relevant sites.
- A portal to environmental information. Environmental
information is collected by hundreds of international, federal,
state, and private agencies, and described using dozens of
metadata formats. This information is used intensively by
government and university researchers, despite the difficulty of
finding data scattered among such a vast number of sites. A
portal built upon harvested metadata could combine access to
land, air, and space data from key government agencies with
access to white papers, treaties, policy documents, journals,
newsletters, and other relevant sources of environmental
information. An even more ambitious service might combine search
access to environmental information with geographic information
resources such as those indexed by the Illinois Natural Resources
Geospatial Data Clearinghouse, the University of Nevada
Geospatial Data Clearinghouse, the NYS Spatial Data
Clearinghouse, and other regional clearinghouses of geospatial
information.
- The academic engine. Despite the availability of library
catalogs, online journal search services, and departmental
databases, many university students and researchers turn first to
the major commercial Internet search engines for resource
discovery. A comprehensive Internet search service oriented
toward academic and research resources would be a more productive
alternative. Such a service might include all the information
covered in more specialized portals (e.g., Americana,
environmental information, GIS), as well as metadata from
academic catalogs and databases, Web pages in the ".edu" domain,
and commercial resources aimed at the research community.
It may not be a huge undertaking to move this vision to
reality. The next steps are to formalize the framework, to
establish mechanisms that encourage research collections to make
their metadata available, and to encourage service providers to
build useful tools based upon the harvesting of this metadata.
The following areas must be addressed:
- extending the general framework to encompass tools, business
models, and project coordination;
- formalizing the governance structures for maintaining,
documenting, and promoting the technical framework;
- creating a registry of high-quality research sites with
harvestable metadata; and
- defining a set of demonstration projects that build a few
catalogs, portals, and other services of interest to the research
community, in order to test the validity of the model.
Recent statements from the library community have urged
research libraries to step up to the creation of a scholarly
information commons: a networked space where users readily and
seamlessly traverse the collected wealth of our disparate
educational and cultural collections. A workable model for
research metadata harvesting may provide the infrastructure
needed for approaching this task. To gain community input to this
initiative, the Digital Library Federation and other involved
parties will issue additional discussion papers and sponsor
meetings exploring these ideas over the next several months. For
further information, contact the DLF at dlf@clir.org.