
Metadata harvesting. Notes and next steps, based on a meeting held at The Andrew W. Mellon Foundation on 24 October 2000

D Greenstein
2 November 2000

Present: Micah Altman (Harvard), Caroline Arms (Library of Congress), Priscilla Caplan (Florida State University), Tim Cole (University of Illinois), Dale Flecker (Harvard), Ira Fuchs (Mellon), Daniel Greenstein (DLF), Martin Halbert (Emory), Ted Hanss (Internet2), Tom Hickerson (Cornell), Thom Hickey (OCLC), Jim Lloyd (University of Tennessee), John Ockerbloom (University of Pennsylvania), John Perkins (CIMI), Tom Peters (Committee on Institutional Cooperation, CIC), John Price-Wilkin (University of Michigan), Thornton Staples (University of Virginia), Donald Waters (Mellon)

Apologies: Kris Brancolini (University of Indiana)

1. Points of consensus

1.1. There are compelling incentives for libraries to offer metadata to harvesting services. Participants agreed that the activity promises exposure for library collections and will help harvesting services illuminate the "hidden web" to the advantage of scholars.

1.2. Metadata harvesting services have potential scholarly and cultural value. There was agreement here as well, particularly about the following generic types of harvesting services:

  • hidden-web wide - an academic, Lycos-style service that delivers "what you can't get anywhere else" - the hidden web, including information stored in databases not accessible to commercial search engines that look exclusively at static HTML files;
  • hidden-web narrow - community-specific services, where community may be defined in terms of discipline, region/location/institution, format of metadata (e.g. EAD), or even the format or genre of the information objects the metadata describe (texts, screenplays); and
  • brokering services (which expose harvested metadata assembled by any or all of the above and present it, perhaps with some pre-processing, to other harvesting services).

1.3. Harvesting services can be selective about the metadata they expose, but a testbed should not be so selective.

1.3.1. We discussed the relative merits of harvesting services that expose metadata attached to or pointing at digital information objects vs. those that include metadata referring to books, artifacts, and other non-digital objects. We did not see any reason at this stage to restrict our exploration to one or the other type of service.

1.3.2. We discussed the relative merits and difficulties involved with harvesting services that expose metadata referring to objects that have no rights restrictions vs. those that deal with metadata and/or referred objects that have some rights restrictions. We did not see any reason at this stage to restrict our exploration to one or the other type of service.

1.4. The meaning of Dublin Core. We had questions about whether we need to be more prescriptive about how metadata suppliers use Dublin Core. We agreed the testbed would "create a market in good and bad metadata practices" and consequently that it will be desirable at this early stage to encourage metadata suppliers to expose DC metadata in their own image (unconstrained by guidelines that may be imposed at this time).

1.5. The role of research libraries as harvesting services. We disagreed about whether research libraries had a role as harvesting services. Some felt harvesting services needed to come from the likes of OCLC, RLG, or commercial third parties. Others felt that libraries have a role in prototyping such services, in the hope that prototypes will find their own organizational legs and sustaining business models, or at a minimum encourage third-party suppliers into the game. I hope we can agree to disagree on this point and to encourage those who want to step forward and offer the innovation we require.

1.6. Open source is valuable. There was ready consensus that any tools, prototypes, etc. should be developed along Open Source lines and could be registered, e.g. at OCLC.

1.7. OAI-conformant servers in a box could be important. Tools are good where they help to lower the barriers for those who want to make item-level metadata available for harvesting.
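In OAI terms, such a server answers simple HTTP requests built from a verb and a few parameters, which is part of why "servers in a box" are plausible. A minimal sketch of constructing a harvesting request follows; the repository endpoint is hypothetical, and `oai_dc` is the unqualified Dublin Core format that OAI repositories are expected to support as a common denominator:

```python
from urllib.parse import urlencode

def build_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Construct an OAI-style ListRecords harvesting request URL.

    base_url is a hypothetical repository endpoint. Passing from_date
    asks only for records changed since that date, i.e. incremental
    rather than full harvesting.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

url = build_list_records_url("http://repository.example.edu/oai",
                             from_date="2000-10-24")
```

A harvesting service would issue such requests periodically and parse the XML responses; the request side, at least, is simple enough to package.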

1.8. [Some] testbeds might be mounted for a specified duration and on a limited-access basis. This might help us secure contributions of metadata that carry access restrictions and/or metadata referring to objects that have access restrictions.

1.9. The special case of EADs. It might be appropriate to sponsor research into some very specific topics: how EADs are being developed and applied, with a view to recommending "good practices" that could support their representation in Dublin Core and their exposure to harvesting services; and how search facilities are currently being used (e.g. to guide our deliberations about fielded vs. Google-style searching).

2. Next steps

2.1. Encouraging development of prototype harvesting services

There are good opportunities to encourage the development of a small number of prototype harvesting services, hopefully ones that will be broadly representative of the types we identified in our discussion (see above). Some more discussion about process etc. is required. In the meantime, potential services need to clarify their stance on issues that arose at the meeting, including:
  • Description of service, its role/importance and intended audience/user community
  • Nature / extent of metadata that the service expects to harvest (presumably all will have access to any list of core metadata that will be available for harvesting and will be in a position to go out and drum up other offers appropriate to their services)
  • Requirements the service might have of any registry (i.e. would it envisage using a common registry service or setting up one specific to the community of metadata suppliers it was hoping to deal with)
  • Any access management issues involved in developing the service and how they might be addressed
  • Any presentation tools (covering metadata search and retrieval scenarios, but also delivery of any content that might be linked to harvested metadata, where that content could be a digital object, a URL, a call number...)
  • Presentation styles (the bells, whistles, thesauri, etc. a service might have to make it appropriate to the information requirements of an intended user community / audience)
  • Selectivity/partitioning - how if at all a service would select appropriate metadata collections from a registry or even item level metadata from within collections
  • Records management - how a service would address issues of de-duplication, updating and removing harvested records, periodically polling metadata sources, etc.
  • Scalability - how a service would scale its activities in a far less tightly controlled environment than the one envisaged for any testbed activity
  • Formal evaluation - how a service would involve users in design, development, and evaluation, and how it would assess its performance / value vis-à-vis the performance / value of other complementary services (e.g. Lycos, subject gateways, etc.)
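The records-management item above can be made concrete with a small sketch. Assuming each harvested record carries a unique identifier and a datestamp (as OAI-style harvesting supplies per record; the exact record shape and function name here are illustrative), a service might fold each newly harvested batch into its local store like this:

```python
def merge_harvested(store, records):
    """Fold a batch of harvested records into a local store (a dict
    keyed by identifier), handling de-duplication, updates, and
    upstream deletions.

    Each record is assumed to be a dict with 'identifier' and
    'datestamp' keys, plus an optional 'deleted' flag.
    """
    for rec in records:
        oid = rec["identifier"]
        current = store.get(oid)
        if current and current["datestamp"] >= rec["datestamp"]:
            continue  # already hold an equal-or-newer copy: de-duplicate
        if rec.get("deleted"):
            store.pop(oid, None)  # honor a deletion at the source
        else:
            store[oid] = rec  # new record, or update of an older copy
    return store
```

Re-running this against each metadata source on a polling schedule gives the periodic-refresh behavior the bullet describes.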

2.2. Next steps. Developing a pool of harvestable metadata

Given our confidence that some prototype harvesting services can be encouraged, building a pool of harvestable metadata is an essential next step. Taking it requires additional information about the metadata that has been offered, as follows:
  • whether those offers are still on the table;
  • a more precise characterization of the metadata on offer, including:
      • number of records;
      • nature, format and availability (or restrictions upon availability) of objects to which metadata records refer;
      • descriptions of native metadata formats with sample cross-walks to DC;
      • any access restrictions that apply to the metadata and their use in harvesting services.
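At its simplest, a cross-walk to DC is a field mapping applied record by record. The native field names below are hypothetical; real cross-walks (e.g. from MARC or EAD to Dublin Core) involve many more rules and judgment calls, but the basic shape is the same:

```python
# Hypothetical mapping from a simple native record format to
# unqualified Dublin Core element names.
CROSSWALK = {
    "main_title": "title",
    "author": "creator",
    "pub_year": "date",
    "call_number": "identifier",
}

def to_dublin_core(native_record):
    """Translate one native record (a dict) into a dict of DC
    elements, each holding a list of values; unmapped fields are
    simply dropped."""
    dc = {}
    for field, value in native_record.items():
        element = CROSSWALK.get(field)
        if element:
            dc.setdefault(element, []).append(value)
    return dc
```

Sample cross-walks of this kind, supplied alongside the native-format descriptions, would let harvesting services judge how much meaning survives the translation to DC.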
