A New Approach to Finding Research Materials on the Web

Widespread network access to digital resources has created a paradox for the academic and research community. Despite enormous institutional investment in the creation and description of materials of serious interest to research and education, these resources exist in isolated pockets. They are difficult to find and impossible to search across. Meanwhile, students and faculty are tempted into over-reliance on commercial Internet search engines, despite their limitations and the uneven quality of the materials they include. In terms of providing convenient network access to information resources of value, neither the traditional library approach nor the emergent Internet approach is serving the academic community well.

The traditional library approach has relied upon the creation of descriptive metadata to give users search access to materials. To make researchers aware of the contents of dispersed and uncoordinated collections, union catalogs that reflect the holdings of many libraries have been created. These systems, however, require participants to actively contribute records to the union database. Moreover, union catalogs are designed around a single type of metadata, which limits the catalogs to materials that can be described by such metadata. Commercial abstracting and indexing services have developed separate systems for access to the journal literature, leading to a disjuncture between the search for monographs and the search for journal articles. The emergence in the 1990s of new metadata formats for everything from archival finding aids to social science data sets has led to an increasing balkanization of search-and-retrieval systems.

Some libraries have taken an alternative approach, using a standard search-and-retrieval protocol, Z39.50, to perform broadcast searching of physically separate databases. While this approach has the advantage of offering virtual union search capability across repositories with differing underlying data formats, it presents problems of its own. The data providers must support complex Z39.50 server software, and considerable coordination is required to set up workable profiles. Z39.50 search also works best across a limited number of services; it does not scale to the thousands of potential sources of digital content.
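The broadcast pattern can be sketched in a few lines. This is an illustration only, not Z39.50 itself: the Target class, the broadcast_search function, and the library names are hypothetical stand-ins for remote databases. It is meant to show that every query has to reach every target at search time, so the work per query, and the exposure to an unavailable target, grow with the number of targets.

```python
class Target:
    """One remote database queried at search time (hypothetical stand-in)."""

    def __init__(self, name, titles, available=True):
        self.name = name
        self.titles = titles
        self.available = available

    def search(self, term):
        # Simulate a remote server that may be unreachable when queried.
        if not self.available:
            raise ConnectionError(f"{self.name} is not responding")
        return [t for t in self.titles if term.lower() in t.lower()]


def broadcast_search(targets, term):
    """Send the same query to every target and merge whatever comes back."""
    hits, failed = [], []
    for target in targets:
        try:
            hits.extend((target.name, title) for title in target.search(term))
        except ConnectionError:
            # A target that is down at query time simply contributes nothing.
            failed.append(target.name)
    return hits, failed


targets = [
    Target("library-a", ["Maps of Hawaii", "Boston Architecture"]),
    Target("library-b", ["Hawaii Survey Data"], available=False),
    Target("library-c", ["Hawaiian Newspapers"]),
]
hits, failed = broadcast_search(targets, "hawaii")
```

In this toy run, the record held by library-b is missed because that target happens to be down when the query is sent.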

Internet search engines, in contrast, have generally relied upon the automatic indexing of HTML text, as opposed to the creation of metadata, and upon the automatic harvesting of Web sites, rather than on active contributions from data providers. Consequently, these services scale in a way that traditional library approaches do not, and they allow for the creation of massive indexes that dwarf the largest library union catalog. However, Internet search engines cover academic and scholarly materials poorly, burying them in quantities of less reliable resources provided by the commercial sector or by unknown and uncredentialed individuals and organizations. Further, the search engines cover only a portion of "Web space" (as little as 17 percent according to a recent study) and frequently favor retrieval of resources based on their own business considerations rather than on the needs of searchers. They are susceptible to page-jacking, index spamming, and other dubious practices. Moreover, an enormous percentage of scholarly materials, from digitized slides to survey data, are described not by static Web pages but in myriad databases; consequently, such materials are largely invisible to the search engines.

A recent and promising approach attempts to combine the best of library and Internet techniques into a wholly new model for accessing scholarly resources. This model has its genesis in the October 1999 meeting of the Open Archives Initiative in Santa Fe, New Mexico, under the sponsorship of CLIR, the Digital Library Federation (DLF), the Scholarly Publishing and Academic Resources Coalition, the Association of Research Libraries, and the Research Library of the Los Alamos National Laboratory. The group discussed the interoperation of "e-print archives" (collections of electronic journal articles and preprints), focusing on how e-print repositories could most easily share metadata about their holdings. The group decided to pursue an approach adapted from the harvesting technique employed by the Internet search engines. In this approach, there are data providers and service providers. A data provider agrees to support a simple harvesting protocol and to provide extracts of its metadata in a common, minimal-level format in response to harvest requests. It then records information about its collection in a shared registry. A service provider uses this registry to locate participating data providers, and uses the harvest protocol to collect metadata from them. The service provider is then able to build intellectually useful services, such as catalogs and portals to materials distributed across multiple e-print sites. Ongoing commitment by the principals of the Open Archives Initiative will produce a refinement of the conventions and testbed implementations of this model.
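The division of labor among data providers, the registry, and service providers can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the protocol agreed at Santa Fe: the class names, the harvest method, the field names ("title", "creator", "source"), and the sample archives are all hypothetical, and a real harvest would run over the network rather than through direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class DataProvider:
    """A repository that answers harvest requests (hypothetical interface)."""
    name: str
    native_records: list  # records in the provider's own, richer format

    def harvest(self):
        """Return metadata extracts in a common, minimal-level format."""
        return [
            {"title": r["title"], "creator": r.get("author", ""), "source": self.name}
            for r in self.native_records
        ]


@dataclass
class Registry:
    """Shared registry in which data providers record their collections."""
    providers: list = field(default_factory=list)

    def register(self, provider):
        self.providers.append(provider)


class ServiceProvider:
    """Finds providers through the registry, harvests them, builds a catalog."""

    def __init__(self, registry):
        self.registry = registry
        self.catalog = []

    def harvest_all(self):
        # Collect metadata ahead of time, so searches never contact providers.
        self.catalog = [
            record
            for provider in self.registry.providers
            for record in provider.harvest()
        ]

    def search(self, term):
        term = term.lower()
        return [r for r in self.catalog if term in r["title"].lower()]


registry = Registry()
registry.register(DataProvider(
    "physics-eprints", [{"title": "Quantum Gravity Notes", "author": "A. Example"}]))
registry.register(DataProvider(
    "econ-eprints", [{"title": "Notes on Market Design"}]))

service = ServiceProvider(registry)
service.harvest_all()
hits = service.search("notes")
```

The design point the sketch tries to capture is that harvesting happens before any search is run: the service provider answers queries from its own catalog, so a data provider only has to respond to occasional harvest requests rather than to every user query.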

Under the aegis of the DLF and with support from the Andrew W. Mellon Foundation, a related group has been building on this promising foundation by discussing how to generalize the concepts developed in Santa Fe into a universal model for research metadata harvesting. This model would apply to a wider range of digital resources of academic and scholarly interest. In addition to e-prints and electronic texts, such resources include science and social science data sets, visual materials, archival collections, geographic information system (GIS) data, sound and music, video, and any other type of resource for which metadata is typically created.

This approach appears to have many virtues. The effort to enable a data provider to support metadata harvesting is reasonable compared with that required to contribute to a union catalog or provide Z39.50 search access. After the repository is set up, its metadata can be harvested with relatively little intervention, providing the scalability that library approaches have lacked. The development of both inclusive and specialized search services is possible; in fact, the model encourages the development of many search services competing in terms of functionality, audience, and business models, thereby enriching the entire research environment.

This model could be used to expose the metadata in thousands of individual systems worldwide for central collection. For example, comprehensive collections of Americana or GIS data could be developed. This would make local repositories more widely known and more generally useful, because researchers could search across previously unconnected materials. It would also illuminate the "dark matter" of the Internet: material that is hard or impossible to find if the user does not already know where it exists.

Most important, the academic community could begin to ensure that services are developed that express the values of that community: services that center on materials of research and educational interest, that provide honest and transparent ranking and retrieval, and that improve search quality by intelligently integrating metadata.

The services that could be developed under the expanded Santa Fe framework are limited only by need and ingenuity. The following are among those discussed:

A portal to digital Americana. Many universities, archives, historical societies, cultural institutions, and other organizations are creating Web-accessible collections of Americana, often with grant funds. Currently, these materials remain largely invisible to educators and scholars. A service focusing on harvested metadata for Americana might combine access to archival visual and textual collections, such as those included in American Memory, with citations to electronic journal articles from JSTOR, early American fiction from the University of Virginia, H. H. Richardson architectural drawings from Harvard, the Hoagy Carmichael collection at Indiana University, Hawaiian language newspapers from the University of Hawaii, and audio, visual, textual, and multimedia materials from hundreds of relevant sites.

A portal to environmental information. Environmental information is collected by hundreds of international, federal, state, and private agencies, and described using dozens of metadata formats. This information is used intensively by government and university researchers, despite the difficulty of finding data scattered among such a vast number of sites. A portal built upon harvested metadata could combine access to land, air, and space data from key government agencies with access to white papers, treaties, policy documents, journals, newsletters, and other relevant sources of environmental information. An even more ambitious service might combine search access to environmental information with geographic information resources such as those indexed by the Illinois Natural Resources Geospatial Data Clearinghouse, the University of Nevada Geospatial Data Clearinghouse, the NYS Spatial Data Clearinghouse, and other regional clearinghouses of geospatial information.

The academic engine. Despite the availability of library catalogs, online journal search services, and departmental databases, many university students and researchers turn first to the major commercial Internet search engines for resource discovery. A comprehensive Internet search service oriented toward academic and research resources would be a more productive alternative. Such a service might include all the information covered in more specialized portals (e.g., Americana, environmental information, GIS), as well as metadata from academic catalogs and databases, Web pages in the ".edu" domain, and commercial resources aimed at the research community.

It may not be a huge undertaking to move this vision to reality. The next steps are to formalize the framework, to establish mechanisms to encourage research collections to make their metadata available, and to encourage service providers to build useful tools based upon the harvesting of this metadata. The following areas must be addressed:

- extending the general framework to encompass tools, business models, and project coordination;
- formalizing the governance structures for maintaining, documenting, and promoting the technical framework;
- creating a registry of high-quality research sites with harvestable metadata; and
- defining a set of demonstration projects that build a few catalogs, portals, and other services of interest to the research community in order to test the validity of the model.
  
Recent statements from the library community have urged research libraries to step up to the creation of a scholarly information commons, a networked space where users readily and seamlessly traverse the collected wealth of our disparate educational and cultural collections. A workable model for research metadata harvesting may provide the infrastructure needed for approaching this task. To gain community input to this initiative, the Digital Library Federation and other involved parties will issue additional discussion papers and sponsor meetings exploring these ideas over the next several months. For further information, contact the DLF at dlf@clir.org.
  