PART III: CONTRIBUTIONS
3.0 Next Generation OAI
As hoped, OAI's flexibility and relatively low common denominator of required elements have helped to foster adoption by a wide range of domains and institutions. The specification for OAI Static Repositories and Gateways, released in October 2004, fosters growth among smaller, resource-challenged repositories in cases where OAI implementation would otherwise have proven beyond their capacity (Habing 2005). The specification provides a simple means for small collections (fewer than 5,000 records) that do not change frequently (less often than monthly) to expose metadata in a single XML file (10-20 MB) for harvesting through intermediation of an OAI Static Gateway. In February 2006, JISC (UK) announced the STARGATE project, Static Repository Gateway Toolkit: Enabling small publishers to participate in OAI-PMH-based services. According to the press release:
The project is implementing a series of static repositories of publisher metadata, and will demonstrate the interoperability of the exposed metadata through harvesting and cross-searching via a static repository gateway, and conduct a critical evaluation of the static repository approach with publishers and service providers.
Initially the project will concentrate on four library and information science journals and PerX, Pilot Engineering Repository Xsearch (see 2.1.2 and 4.2.8).
Meanwhile, OAI-PMH's simplicity may translate into myriad problems for harvesters, especially in cases where data providers do not implement some of the "optional features" that are most helpful to building aggregations. Consequently, service providers are often confronted by inconsistent, insufficient, or incompatible data that limit their ability to build meaningful aggregations for end-users. DLF members and affiliated partners have been at the forefront in articulating these problems and promoting solutions (Cole and Shreeves 2004, Tennant 2004a, Hagedorn 2005b, Shreeves et al. 2005, Lagoze et al. 2006a,b). Their efforts attempt to strike a balance between the demands placed on data providers and the expectations of service providers. To the extent practicable, they seek ways to automate procedures through machine-to-machine interaction, while recognizing that some degree of expert human intervention will always be required.
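To make the point about optional features concrete, a harvester can inspect a repository's Identify response before harvesting. The following is a minimal sketch, not a definitive implementation. The text does not say which optional features are meant, so the sketch assumes two examples: deleted-record tracking and seconds-level datestamp granularity. The sample response, the example.org address, and the function name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical Identify response; the repository and its URL are made up.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-05-01T12:00:00Z</responseDate>
  <request verb="Identify">http://repository.example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>http://repository.example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2001-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def summarize_identify(xml_text):
    """Report which of the assumed optional features a repository declares."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)

    def text(tag):
        element = identify.find("oai:" + tag, OAI_NS)
        return element.text.strip() if element is not None and element.text else None

    return {
        "name": text("repositoryName"),
        # "transient" or "persistent" means deletions are reported to harvesters.
        "tracks_deletions": text("deletedRecord") in ("transient", "persistent"),
        # Seconds-level granularity permits finer incremental harvests.
        "fine_granularity": text("granularity") == "YYYY-MM-DDThh:mm:ssZ",
    }


if __name__ == "__main__":
    print(summarize_identify(SAMPLE_IDENTIFY))
```

A service provider could run such a check across its sources to decide which repositories need full re-harvesting rather than incremental updates.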
3.1 Building the Distributed Library
While continuing to apply the lessons learned from adopting the protocol, the DLF received a two-year grant from the Institute of Museum and Library Services (IMLS) in October 2004 to help achieve its vision of "the distributed library," using OAI for digital library aggregation. More concretely, the grant addresses challenges identified by early OAI adopters through promulgating best practices and facilitating communication among data and service providers about such issues as metadata variation, metadata formats, and implementation practices (Shreeves et al. 2005). The proposal states:
The Open Archives Initiative (OAI) has proven itself as a protocol that allows basic metadata records to be created by many providers and then gathered up by harvesters who use those records to create library services (e.g. http://www.oaister.org/). In the act of using it over several years in library settings, however, a range of issues have come to light that need research and development if OAI is going to mature into its full potential: collections as well as item records need further development, and we need richer mechanisms of creating dialog between harvesters and providers; the hurdles to adoption need careful study, particularly how to embed the very idea of creating public, harvestable metadata as a routine step in our digitizing workflows, and how to speed up the feedback loop from a harvester to a community of providers such as exists in the library world, who typically respond positively to such "good practice" guidance.
(DLF http://www.diglib.org/architectures/oai/imls2004/, emphasis added)
This is a multi-faceted endeavor, enabling DLF to solidify best practices for the creation of metadata about its dispersed collections, which then inform the development of new DLF services, such as the following:
- 1. Promotion of the University of Illinois OAI-PMH Data Provider Registry, a comprehensive technical resource intended principally for use by builders of OAI services. With nearly 1,050 active repositories, the registry is browsable via a Web interface or as XML. Described more fully in section 4.1.1, the registry is available from http://gita.grainger.uiuc.edu/registry.
- 2. A DLF Portal that allows users to access all items from DLF-member institutions that are publicized through the Open Archives Initiative. Described more fully in section 4.1.8, as of May 2006, the portal contained more than one million records: http://hti.umich.edu/i/imls/.
- 3. A DLF MODS Portal that represents a subset of the full Portal, gathering together those records that have the richer MODS metadata that support much better subject, date, and geographic navigation. Currently comprising more than 250,000 records, the MODS Portal is also described in section 4.1.8: http://www.hti.umich.edu/m/mods.
- 4. A new DLF Collections Registry that describes nearly 800 publicly accessible digital collections from which the item-level records in the Portal are derived. Described more fully in section 4.4.3, the Registry is available from: http://gita.grainger.uiuc.edu/dlfcollectionsregistry/browse/.
Overall, these efforts are informed by advice from DLF's Scholars' Advisory Panel and Panel of Technical Experts. As a result, Seaman notes that they reflect some of the following improvements:
- A simpler (Google-like) initial interface;
- The inclusion of thumbnail images of graphical collections into the metadata;
- The closer linking of an item to its immediate collection (rather than to its institution);
- More fields for limiting searches;
- A book bag for saving and emailing records;
- Inclusion in A9.com to facilitate simultaneous searching with Amazon, Wikipedia, RedLightGreen, The British Library catalog, and many other OpenSearch services; and
- Persistent URLs.
3.1.1 Components of DLF's Grant-related Work
The DLF grant's principal partners at Emory University, the University of Michigan and the University of Illinois at Urbana-Champaign (UIUC) are focusing on three broad areas of activity:
- 1. Understanding and improving workflow practices and training so that the creation of item-level metadata is integrated into daily workflow routines of DLF-member institutions. This will increase and regularize the creation and exposure of metadata for harvesting, which serves as the foundation of a reliable and well-populated finding system for DLF's distributed content. Over time, this addresses a chief concern identified in the 2003 survey, namely making the creation and exposure of item-level metadata a priority so more meaningful content is readily accessible to the end-user.
- 2. Developing more nuanced and prescriptive "Best Practices" for the creation of metadata with provisions for richer metadata than the unqualified Dublin Core mandated by OAI. Agreed upon best practices will help to overcome the inordinate amount of time that harvesters (OAI service providers) need to spend normalizing and completing records before they can build new services. This is essential if the proposed finding system is going to scale and flourish. It also addresses concerns of granularity and context raised in the 2003 survey.
- 3. Coordinating information exchange between service and metadata providers for better discovery by developers as well as by the end-user. This will help digital library developers identify content for new services and will promote wider access by end-users. It responds to the need for a user-friendly registry or discovery tool geared towards the end-user noted in the 2003 survey.
Emory University has created a series of curriculum materials for use in OAI best practices training. This curriculum series includes eight separate documents that together provide a concise set of materials for training institutional teams in best practices for OAI implementation. Emory University is currently developing an online system that would allow searching and collaborative updating of the OAI Best Practices, and controlled output of selected information into formatted training materials. Briefly annotated below, the series is currently available at DLF's Web site, http://www.diglib.org/architectures/oai/imls2004/training/:
- DLF OAI Implementers Workshop: Agenda
- The Project Abstract: outlines the purpose and goals of the IMLS project.
- The Case for OAI
- OAI Implementation: Administrative Planning
- OAI "Cheat Sheet": A Taxonomy of Rapid OAI Deployment Strategies
- Summary of OAI Metadata Best Practices
- Summary of the DLF Aquifer MODS Profile
Most major digital library systems now offer OAI data provider (and increasingly harvesting) capability. The DLF's "OAI Tools" handout annotates seven of them: ContentDM, CWIS (Collection Workflow Integration System), DLXS, DSpace, Ex Libris' DigiTool, Fedora, and Greenstone. While this list concentrates primarily on open source and non-commercial systems, DLF recognizes that many library vendors and software developers are building OAI data provider functionality into their overall digital content management systems. Among major library vendors offering turnkey solutions are: Endeavor's EnCompass; ProQuest's Digital Commons; IndexData's Keystone; Fretwell Downing Informatics' CPORTAL; VTLS' Vortex; and SirsiDynix's 3.0 release of the Hyperion system.
3.1.2 The Case for Sharing Metadata and Improving Its Quality
"Marketing with Metadata - How Metadata Can Increase Exposure and Visibility of Online Content" (Moffat 2006) makes a succinct and persuasive case for exposing metadata by outlining its benefits:
- Allow your content to be found from a large number of locations (e.g. portals, aggregators, search engines).
- Allow aggregators to expose and thereby help to promote your materials in novel ways.
- Enhance the visibility and awareness of your available resources.
- Be a useful way to expose materials to new markets.
- Allow potential users to determine the relevance of resources without having to access them first.
- Facilitate the production of interoperable services.
- Improve the visibility of your content in search engines such as Google, Google Scholar and Yahoo.
- Drive traffic and business to websites.
(Source: Version 1.0 8th March 2006 http://www.icbl.hw.ac.uk/perx/advocacy/exposingmetadata.htm)
This advocacy document provides case studies for three ways of exposing metadata: harvesting metadata via OAI, exposing metadata via distributed searching (e.g., Z39.50, SRU/SRW), and exposing content for syndication (e.g., RSS). It also provides answers to common questions, such as:
"If it's all about sharing content why can't we just provide you with a link to our content?"
- Simply providing a link to your content does not allow it to be shared and re-purposed easily and in a standard way. The beauty of exposing metadata in a standard way is that little effort is required for third parties to reuse your metadata and make it available to their visitors.
"I don't like the thought of giving away our content for others to use."
- Exposed metadata usually only contains a brief description of the actual content - just enough to generate interest in potential users. These users will be directed back to your site by links in the metadata in order to access the full content in the normal way (i.e. freely available, subscription based, pay-per-view, etc).
"Will exposing my metadata mean that it is indexed by search engines such as Google or Google Scholar?"
- This depends on how your metadata is exposed and the indexing approaches taken by individual search engines. Exposing metadata via OAI certainly can improve ranking in search engines. "A normal Google or Google Scholar search favours OAI-repository material and normally ranks it higher than an individual's own website." Recent developments such as 'search engine-OAI bridges' are improving search engines indexing of OAI compliant repositories. Many OAI repositories are now indexed by a number of search engines, e.g. Cogprints, a repository for cognitive sciences, is indexed by Google, Google Scholar, Yahoo, Scirus and Citebase.
"Why can't I simply make my content available to Google and let people find my stuff that way?"
- You can, and in many cases this will be a perfectly appropriate thing to do. This is particularly true for freely available full text resources. However, in some cases, for example where most of your resources are not text-based, exposing them to Google may not help much. In other cases, you may not want to make the full content freely available. In these situations, exposing metadata may be more appropriate. By making your metadata freely available, you can allow people to discover your resources more readily.
The case made, and given their extensive experience in building services based on OAI harvesting, it is no surprise that the DLF, in partnership with the National Science Digital Library (NSDL), is developing "Best Practices for Shareable Metadata." They point out that the attributes of high-quality metadata in a local context do not necessarily equate with the best metadata in a shared environment. The guidelines identify the following additional desirable characteristics for shareable metadata:
Proper context. In a shared environment, metadata records will become separated from any high-level context applying to all records in a group, and from other records presented together in a local environment. It is therefore essential that each record contain the context necessary for understanding the resource the record describes, without relying on outside information.
Content coherence. Metadata records for a shared environment need to contain enough information such that the record makes sense standing on its own, yet exclude information that only makes sense in a local environment. This can be described as sharing a 'view' of the native metadata.
Use of standard vocabularies. The use of standard vocabularies enables the better integration of metadata records from one source with records from other sources.
Consistency. Even high-quality metadata will vary somewhat among metadata creators. All decisions made about application of elements, syntax of metadata values, and usage of controlled vocabularies, should be consistent within an identifiable set of metadata records so those using this metadata can apply any necessary transformation steps without having to process inconsistencies within such a set.
Technical conformance. Metadata should conform to the specified XML schemas and should be properly encoded. (DLF and NSDL 2005)
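Some of these characteristics lend themselves to automated checks before a record set is exposed. The sketch below is illustrative only and is not drawn from the DLF/NSDL guidelines: it assumes records are plain dictionaries, that "title", "identifier", and "collection" are the fields needed for context, and that date syntax is the consistency property of interest. All names are hypothetical.

```python
import re
from collections import Counter

# Date syntaxes the sketch recognizes; anything else is reported as "other".
DATE_SHAPES = [
    ("YYYY", re.compile(r"^\d{4}$")),
    ("YYYY-MM-DD", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def date_shape(value):
    """Classify a date string by its syntax."""
    for name, pattern in DATE_SHAPES:
        if pattern.match(value):
            return name
    return "other"


def check_record_set(records, required=("title", "identifier", "collection")):
    """Flag records lacking context fields and mixed date syntaxes in a set."""
    missing = {}
    shapes = Counter()
    for index, record in enumerate(records):
        absent = [name for name in required if not record.get(name)]
        if absent:
            missing[index] = absent
        if record.get("date"):
            shapes[date_shape(record["date"])] += 1
    return {
        "missing": missing,
        "date_shapes": dict(shapes),
        "consistent_dates": len(shapes) <= 1,
    }


if __name__ == "__main__":
    sample = [
        {"title": "Letter", "identifier": "a1", "collection": "Papers", "date": "1901"},
        {"title": "Photo", "identifier": "a2", "date": "1901-03-03"},
    ]
    print(check_record_set(sample))
```

A report of this kind addresses only proper context and consistency; vocabulary use and schema conformance would need separate checks.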
Since August 2005, a working draft of the "Best Practices" has been available at the project's wiki located at the NSDL's community portal. DLF and NSDL are working with their respective communities and library vendors alike to raise awareness of why these guidelines are important. Publication of the document is anticipated later in 2006.
A key recommendation emanating from this collaboration is to endorse MODS (Metadata Object Description Schema) as the preferred metadata schema, particularly for use in describing cultural heritage and humanities digital resources. In December 2005, the DLF released for public comment, "MODS Implementation Guidelines for Cultural Heritage Materials." Among other features, the guidelines help to address the particular difficulties inherent in describing digital objects that have analog originals by distinguishing between "the intellectual content and genre of a resource and its digital format and location." The richer MODS descriptive schema helps to pave the way for enhanced service features, such as those identified by DLF's Scholars' Advisory Panel.
MODS high-level elements include:
- Title information
- Type of resource
- Table of contents
Institutions currently creating MODS records include: the Library of Congress, Indiana University, OCLC, and the University of Chicago (refer to section 4.1.8 for sample records).
3.1.3 DLF 2006 Survey Responses about Metadata
The DLF survey respondents represent primarily OAI service providers so it comes as no surprise that most are eagerly anticipating promulgation of the "best practices" to improve the quality of metadata they harvest. Indeed representatives from a number of these services are members of the DLF Best Practices Task Force. When asked if they expected to change their metadata creation practices in light of the forthcoming DLF/NSDL OAI best practice guidelines, the two respondents below reflect the hopes of most service providers:
We expect the uncertainty of metadata normalization and enhancement that we have to do to lessen as better standards/guidelines are promulgated for mapping native content management metadata into OAI records for harvesting.
We would like to incorporate alternate metadata formats, besides oai_dc, whenever possible. We would also like to incorporate collection descriptions into the search interface when feasible. However, as an aggregator we are pretty much stuck with whatever metadata is available from our sources. We do some date normalization and will follow best practices as applicable; however, our hope is that the best practices will influence the repositories from which we harvest so that we can take advantage of the improved metadata to provide better search and browse services.
Although they are at the forefront of devising tools that help to migrate, remediate, and enhance metadata, these service providers also join other survey respondents calling for more automatic metadata tools.
Respondents noted some of the following accomplishments relative to metadata:
- Emory University created The Metadata Migrator software package, funded by the Institute of Museum and Library Services. It allows institutions such as museums, archives, research centers, and small libraries to make their locally stored records available for online searching using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), http://www.metascholar.org/sw/mm/.
- The California Digital Library drafted "Specifications for Metadata Processing Tools" (Tennant, n.d.), http://www.cdlib.org/inside/projects/harvesting/metadata_tools.htm, and created a Date Normalization Tool. This Java utility takes non-machine readable Common Era dates as input and outputs machine-readable dates in order to enhance digital collections to support date range queries, http://www.cdlib.org/inside/diglib/datenorm/. Through its Metasearch Initiative, CDL established an SRU-compliant gateway to OAI-harvested metadata. These initiatives are described at http://www.cdlib.org/inside/projects/harvesting/.
- The CIC launched a new consortial metadata portal under the leadership of the University of Illinois; drafted "CIC-OAI Project Recommendations for Dublin Core Metadata Providers," Version 1.0 06/18/2004, edited by Muriel Foulonneau and Timothy W. Cole, http://cicharvest.grainger.uiuc.edu/dcguidelines.asp; and refined its workflow and filtering processes for metadata as described at http://cicharvest.grainger.uiuc.edu/aggregation.asp. It enriched and normalized the metadata to support various browse and search interfaces. Recently it developed a new enhanced OAI data provider for the registry to allow not only simple Dublin Core records that describe each repository harvested, but also the much richer collections that were created manually along with the repository descriptions imported from OAIster. Because each of these sets and subsets has rich collection-level metadata derived from the registry, it allows harvesters to associate collection-level metadata with individually harvested items more easily.
- OAIster provided UI with additional metadata about all of its OAI repositories (e.g., title, description, home page, and historical record counts) and now it refers new data providers to UI for registration and validation before harvesting their metadata. In March 2006, OAIster announced the availability of its metadata for use by federated search engines via SRU and created a Web page with instructions about how to use its metadata outside OAIster's interface (http://oaister.umdl.umich.edu/o/oaister/sru.html).
- The DLF launched a new portal based on MODS (see section 4.1.8), http://www.hti.umich.edu/m/mods.
- The Directory of Open Access Journals (DOAJ) began to make article-level metadata available in addition to journal metadata, http://www.doaj.org/articles/questions#metadataA.
- The Open Language Archives Community (OLAC) developed a metadata quality report system and implemented an interactive survey of OLAC metadata implementations that permits users to see how any attribute or field of OLAC metadata is used by other OLAC archives. Its new search engine includes in the results a metadata quality-centric sorting algorithm.
- SMETE resources are cataloged to meet the requirements of the IEEE Learning Object Metadata Standard and SMETE has developed tools to transform local application profiles to normalized application profiles, http://smete.org/smete/ (see Technology).
- BEN has created metadata validation software tools for contributors to its portal, http://www.biosciednet.org/project_site/.
- DLESE developed a distributed Web-based cataloging tool to support multiple collections and multiple metadata frameworks. With other entities, it created the ADN Metadata Framework, http://www.dlese.org/Metadata/adn-item/history.htm.
- MERLOT developed a Metadata Services Agreement for use with participating external vendors.
- Cornucopia migrated to a new software system and realigned almost all of its data structure to conform to the RSLP (Research Support Libraries Programme) Collection Level Description Metadata Schema, http://www.ukoln.ac.uk/metadata/rslp/.
- In creating the collection registry model, the IMLS Digital Collections & Content (DCC) gateway draws on research about how to define and describe collections, ultimately opting to adapt the RSLP Collection Description Schema and the Dublin Core Collection Description Application Profile (Cole and Shreeves 2004, 312). The DCC arrived at a collection description metadata schema with four classes of entities
- The Collaborative Digitization Program (e.g., Heritage West) revised and updated its CDP Dublin Core Metadata Best Practices, http://www.cdpheritage.org/cdp/documents/CDPDCMBP.pdf.
- The American West carried out preliminary work on metadata enhancement to support topical clustering and faceted browsing.
- DLF Aquifer developed a descriptive metadata (MODS) profile (as described above), http://www.diglib.org/aquifer/DLF_MODS_ImpGuidelines_ver4.pdf. Next steps will be developing middleware tools that support metadata management activities such as migration, taxonomy assignment, and metadata enrichment.
- The INFOMINE database has been populated with robot records, mostly created from the iVia virtual library crawler and machine-generated metadata (using iVia classifiers). The iVia software supports automated metadata generation to assign Library of Congress Subject Headings and LC Classifications to resources. The iVia software also enabled NSDL to harvest item-level metadata from iVia's server for selected NSDL collections that did not include detailed metadata.
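Date normalization of the kind mentioned in the California Digital Library item above can be illustrated with a short sketch. This is not the CDL utility, which is written in Java and whose rules are not described here; it is a hypothetical Python example that assumes the goal is a (start year, end year) pair suitable for date range queries and that it only needs to handle four-digit years, year ranges, and decades such as "1920s".

```python
import re

# Matches decades such as "1920s", optionally preceded by "c." or "ca.".
DECADE = re.compile(r"(?:ca?\.?\s*)?(\d{3})0s")
# Matches a four-digit year that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def normalize_date(text):
    """Return (start_year, end_year) for a free-text date, or None."""
    cleaned = text.strip().lower()
    decade = DECADE.fullmatch(cleaned)
    if decade:
        start = int(decade.group(1)) * 10
        return (start, start + 9)
    years = [int(year) for year in YEAR.findall(cleaned)]
    if not years:
        return None
    # A single year yields a one-year range; several years yield their span.
    return (min(years), max(years))


if __name__ == "__main__":
    for raw in ["ca. 1905", "1890-1895", "March 3, 1901", "1920s", "undated"]:
        print(raw, "->", normalize_date(raw))
```

Values the sketch cannot interpret return None so that they can be routed to human review rather than silently guessed.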
Among the challenges, respondents noted:
- the willingness (or not) to make harvestable metadata a local priority (DLF Portal);
- lack of a good metadata editor and metadata cleansing tools (OLAC);
- quality control of metadata and learning objects (NEEDS);
- incompatible metadata standards (Sheet Music Consortium);
- automatic metadata creation tools (Intute); and
- the need for robust, flexible, open source tools for metadata normalization and enrichment (CDL).
Finally, turning to the goals of "next generation" services, the Sheet Music Consortium seeks enriched metadata to provide better retrieval services and INFOMINE looks forward to harvesting and sharing metadata with other digital libraries. Meanwhile, NSDL's conversion to a Fedora repository marks a major transition from a metadata-centric to a resource-centric data model and search service; and DLF Aquifer anticipates experimentation with methods of aggregation other than metadata harvesting, namely the ability to move digital objects from domain to domain, perhaps modifying and re-depositing them in a different location in the process.
3.2 Digital Library Services Registries (DLSR)
The "service registry" is central to enabling digital libraries to interoperate in distributed information environments with service-oriented architectures based on common standards and protocols. Dempsey refers to the service registry's collective data as the equivalent of the "systemwide 'intelligence'" within a network to run distributed applications. At a March 2006 workshop, participants wrote a draft definition of the concept: "A digital library service registry allows a machine or human to discover available digital library services, locate those services, and obtain configuration information to services for the purpose of interfacing."
While there are numerous examples of application-specific registries like those for OAI-PMH (described in the next section) or the OpenURL Registry, maintained by OCLC, two efforts underway in the UK and US respectively share the goal of creating "service agnostic" registries to facilitate resource discovery and use by a multitude of applications (e.g., Z39.50, SRW/U, OAI-PMH, OpenURL). The Information Environment Service Registry (IESR, http://www.iesr.ac.uk/) sponsored by JISC (described in section 2.1.1 above) uses a centralized approach (depicted as part of the "shared infrastructure" in Figure 06), whereas the Ockham DLSR (http://www.ockham.org/) sponsored by the NSF's National Science Digital Library (NSDL) relies on a distributed model.
The DLSR serves three primary functions:
Discovery - allowing a user or a machine to discover available, relevant services;
Resolution - providing the ability for a person or a machine to locate, or resolve to, a
Configuration - provide information necessary for a client to access a particular service. (Frumkin 2006a, 24)
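The three functions can be pictured as operations on a simple in-memory registry. The sketch below is a hypothetical illustration rather than the IESR or Ockham design: the class names, fields, and example.org endpoints are assumptions, and resolution is assumed to mean returning a service's network address.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One registered service (all fields are illustrative assumptions)."""
    service_id: str
    title: str
    protocol: str
    endpoint: str
    config: dict = field(default_factory=dict)


class ServiceRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.service_id] = entry

    def discover(self, keyword, protocol=None):
        """Discovery: find services whose title mentions the keyword."""
        wanted = keyword.lower()
        return [
            entry for entry in self._entries.values()
            if wanted in entry.title.lower()
            and (protocol is None or entry.protocol == protocol)
        ]

    def resolve(self, service_id):
        """Resolution: return the address at which the service is reached."""
        return self._entries[service_id].endpoint

    def configuration(self, service_id):
        """Configuration: return what a client needs to call the service."""
        entry = self._entries[service_id]
        return {"protocol": entry.protocol, "endpoint": entry.endpoint, **entry.config}


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register(ServiceEntry(
        "pauling-oai", "Linus Pauling Papers", "OAI-PMH",
        "http://archive.example.org/oai", {"metadataPrefix": "oai_dc"}))
    for hit in registry.discover("pauling"):
        print(hit.service_id, registry.configuration(hit.service_id))
```

A distributed registry would replace the single dictionary with lookups across several nodes, but the three operations seen by a client could stay the same.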
As part of the DLSR Workshop noted above, Frumkin devised a series of "use cases" that demonstrate how DLSR fits into different scenarios. In the two examples below, the first depicts how the DLSR would help a user through personalized metasearching and the second illustrates how the DLSR would facilitate development of OAI aggregations.
Bernie the researcher is exploring the history of science for a book he is working on. He uses his library's metasearch tool to search through history-related databases and collections. He also searches the web for collections and resources that do not reside within the context of his library's collections and services. During his web searching, he discovers the Linus Pauling papers at Oregon State University. Bernie would like to add the Linus Pauling collection to his default metasearch, so he goes into the metasearch tool, clicks the 'customize this search' button, and then searches for the Linus Pauling collection to see if he might be able to add it. The metasearch application searches the digital library service registry for the Linus Pauling collection, and shows Bernie that there are actually five collections which match his search. Bernie is delighted to find four more collections, and checks all of them to be added. The metasearch application then discovers that these collections are not immediately searchable via any standard protocol, but they are harvestable via the OAI-PMH protocol. The metasearch tool is intelligent enough to be able to automatically start harvesting the metadata from these collections into a local index, and then include them as part of Bernie's default search.
Name Authority Identification:
Jim is in charge of setting up an OAI-PMH aggregator that will gather distributed metadata records and then reuse them in a science digital library. He is concerned about the quality of the collected records and would like to apply some normalization and cleanup to them. One particular area of concern is the uncontrolled use of personal and corporate names in the records. He uses the service registry to locate existing name authority services offered by various organizations, and plans an aggregation strategy that uses these services for metadata cleanup.
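The cleanup step in this use case might look like the following once an authority source has been located. This is a minimal sketch under stated assumptions: the authority list is a local stand-in for a remote name authority service, the matching rule (comparing sorted alphabetic tokens) is a deliberately simple heuristic, and all function names are hypothetical.

```python
import re


def name_key(name):
    """Build an order-insensitive key from the alphabetic tokens of a name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def build_authority_index(authorized_names):
    """Index authorized headings by key (stand-in for an authority service)."""
    return {name_key(name): name for name in authorized_names}


def reconcile(names, index):
    """Pair each harvested name with its authorized form, or None if unmatched."""
    return [(name, index.get(name_key(name))) for name in names]


if __name__ == "__main__":
    index = build_authority_index(["Pauling, Linus, 1901-1994"])
    harvested = ["Linus Pauling", "PAULING, LINUS.", "Pauling, L."]
    for original, authorized in reconcile(harvested, index):
        print(original, "->", authorized)
```

Unmatched names such as abbreviated forms are returned with None, leaving them for the expert human review that such cleanup is expected to require.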