Metadata harvesting. Notes and next
steps, based on a meeting held at The Andrew W. Mellon Foundation
on 24 October 2000
D Greenstein
2 November 2000
Present: Micah Altman (Harvard), Caroline Arms (Library of
Congress), Priscilla Caplan (Florida State University), Tim Cole
(University of Illinois), Dale Flecker (Harvard), Ira Fuchs
(Mellon), Daniel Greenstein (DLF), Martin Halbert (Emory), Ted
Hanss (Internet2), Tom Hickerson (Cornell), Thom Hickey (OCLC),
Jim Lloyd (University of Tennessee), John Ockerbloom (University
of Pennsylvania), John Perkins (CIMI), Tom Peters (Committee on
Institutional Cooperation, CIC), John Price-Wilkin (University of
Michigan), Thornton Staples (University of Virginia), Donald
Waters (Mellon)
Apologies: Kris Brancolini (Indiana University)
1. Points of consensus
1.1. There are compelling incentives for libraries to offer
metadata to harvesting services. Participants agreed that the
activity promises exposure for library collections and will help
harvesting services illuminate the "hidden web" to the advantage
of scholars.
1.2. Metadata harvesting services have potential scholarly
and cultural value. Participants agreed on this as well, particularly
with respect to the following generic types of harvesting services:
- hidden-web wide - an "Academic Lycos" style service that
delivers "what you can't get anywhere else" - the hidden web,
including information stored in databases not accessible to
commercial search engines that look exclusively at static HTML
files;
- hidden-web narrow - community-specific services where
community may be defined in terms of discipline,
region/location/institution, format of metadata (EAD), even
format or genre of the information objects that metadata describe
(texts, screenplays); and
- brokering services (that expose harvested metadata assembled
by any or all of the above and present it, perhaps with some
pre-processing, to other harvesting services).
1.3. Harvesting services can be selective about the
metadata they expose, but a testbed should not be so
selective.
1.3.1. We discussed the relative merits of harvesting services
that expose metadata attached to or pointing at digital
information objects vs. those that included metadata referring to
books, artifacts, and other non-digital objects. We did not see
any reason at this stage to restrict our exploration to one or
the other type of service.
1.3.2. We discussed the relative merits and difficulties
involved with harvesting services that expose metadata and refer
to objects that have no rights restrictions vs. those that deal
with metadata and/or referred objects that have some rights
restrictions. We did not see any reason at this stage to restrict
our exploration to one or the other type of service.
1.4. The meaning of Dublin Core. We had questions about
whether we need to be more prescriptive about how metadata
suppliers use Dublin Core. We agreed the testbed would "create a
market in good and bad metadata practices" and consequently that
it will be desirable at this early stage to encourage metadata
suppliers to expose DC metadata in their own image (unconstrained
by guidelines that may be imposed at this time).
1.5. The role of research libraries as harvesting
services. We disagreed about whether research libraries had a
role as harvesting services. Some felt harvesting services needed
to come from the likes of OCLC, RLG, or commercial third parties.
Others felt that libraries have a role in prototyping such
services in the hopes that prototypes will find their own
organizational legs and sustaining business models or at a
minimum encourage third party suppliers into the game. I hope we
can agree to disagree on this point and to encourage those who
want to step forward and offer the innovation we require.
1.6. Open source is valuable. There was ready consensus that any
tools, prototypes, etc. should be developed along Open Source
lines and could be registered, e.g. at OCLC.
1.7. OAI-conformant servers in a box could be
important. Tools are good where they help to lower the
barriers for those who want to make item-level metadata available
for harvesting.
1.8. [Some] testbeds might be mounted for a specified
duration and on a limited-access basis. This might help us
attract contributions of metadata that have access
restrictions and/or that refer to objects that have
access restrictions.
1.9. The special case of EADs. It might be appropriate
to sponsor some research into some very specific topics: how EADs
are being developed and applied with a view to recommending "good
practices" that could support their representation in Dublin Core
and their exposure to harvesting services; how search facilities
are currently being used (e.g. to guide our deliberations about
fielded vs. Google-style searching).
2. Next steps
2.1. Encouraging development of prototype harvesting
services
There are good opportunities to encourage the development of a
small number of prototype harvesting services, ideally ones
that will be broadly representative of some of the types we
identified in our discussion (see 1.2 above). Some more discussion
about process etc. is required. In the meantime, potential
services need to clarify their stance on issues that arose at the
meeting, including:
- Description of service, its role/importance and intended
audience/user community
- Nature / extent of metadata that the service expects to harvest
(presumably all will have access to any list of core metadata
that will be available for harvesting and will be in a position
to go out and drum up other offers appropriate to their
services)
- Requirements the service might have of any registry (i.e. would
it envisage using a common registry service or setting up one
specific to the community of metadata suppliers it was hoping to
deal with)
- Any access management issues involved in developing the
service and how they might be addressed
- Any presentation tools (about metadata search and retrieval
scenarios but also delivery of any content that might be linked
to harvested metadata where that content could be a digital
object, a URL, a call number...)
- Presentation styles (about bells, whistles, thesauruses,
etc. a service might have to make it appropriate to the
information requirements of an intended user community /
audience)
- Selectivity/partitioning - how if at all a service would
select appropriate metadata collections from a registry or even
item level metadata from within collections
- Records management stuff - how a service would address issues
of de-duplication, updating and removing harvested records,
periodically polling metadata sources, etc.
- Scalability - how a service would scale activities in a far
less tightly controlled environment than the one envisaged for
any testbed activity
- Formal evaluation - how a service would involve users in
design, development, and evaluation, and how it would assess its
performance / value vis-à-vis the performance / value of other
complementary services (e.g. Lycos, subject gateways, etc.)
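
The records management questions in the list above (de-duplication, updating and removing harvested records, periodic polling) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not a description of any service discussed at the meeting: it assumes each supplier gives every record a stable identifier, an ISO 8601 datestamp of last change, and a flag for withdrawn records. The names `HarvestedRecord`, `merge_harvest`, and the supplier name `lib-x` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class HarvestedRecord:
    """One harvested metadata record (hypothetical shape, for illustration)."""
    identifier: str               # assumed stable and unique within a source
    source: str                   # hypothetical name of the metadata supplier
    datestamp: str                # ISO 8601 date of last change, e.g. "2000-10-24"
    title: Optional[str] = None
    deleted: bool = False         # assumed flag for records the supplier withdrew


Store = Dict[Tuple[str, str], HarvestedRecord]


def merge_harvest(store: Store, batch: Iterable[HarvestedRecord]) -> Dict[str, int]:
    """Fold one polling run into the local store and report what changed."""
    counts = {"added": 0, "updated": 0, "removed": 0, "skipped": 0}
    for rec in batch:
        key = (rec.source, rec.identifier)   # de-duplication key
        current = store.get(key)
        if rec.deleted:
            if current is not None:
                del store[key]               # remove a withdrawn record
                counts["removed"] += 1
            else:
                counts["skipped"] += 1
        elif current is None:
            store[key] = rec                 # first time this record is seen
            counts["added"] += 1
        elif rec.datestamp > current.datestamp:
            store[key] = rec                 # ISO dates sort correctly as strings
            counts["updated"] += 1
        else:
            counts["skipped"] += 1           # duplicate or stale copy
    return counts


# Two polling runs against the same hypothetical supplier.
store: Store = {}
merge_harvest(store, [
    HarvestedRecord("a1", "lib-x", "2000-10-01", "Maps"),
    HarvestedRecord("a2", "lib-x", "2000-10-01", "Letters"),
])
merge_harvest(store, [
    HarvestedRecord("a1", "lib-x", "2000-10-20", "Maps, revised"),
    HarvestedRecord("a2", "lib-x", "2000-10-21", deleted=True),
])
```

Keying on the pair (source, identifier) only removes duplicates within one supplier. Detecting the same object described by two different suppliers is a harder matching problem that this sketch does not attempt.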
2.2. Developing a pool of harvestable
metadata
Given our confidence that some prototype harvesting
services can be encouraged, building a pool of metadata is an
essential next step. Taking that step requires the following
additional information about the metadata that have been offered:
- whether those offers are still on the table;
- a more precise characterization of the metadata on offer,
including:
  - number of records;
  - nature, format and availability (or restrictions upon
  availability) of objects to which metadata records refer;
  - descriptions of native metadata formats with sample
  cross-walks to DC;
- any access restrictions that apply to the metadata and their
use in harvesting services.
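
To show what a "sample cross-walk to DC" might look like in practice, here is a minimal sketch. The fifteen Dublin Core element names are the standard ones. Everything else is an assumption made for illustration: the native field names (`main_title`, `author`, `topic`, `issued`, `call_number`, `access_note`), the mapping itself, and the function name `to_dublin_core` are hypothetical and do not describe any supplier's actual format.

```python
from typing import Dict, List

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical native field names mapped to DC elements (illustration only).
CROSSWALK = {
    "main_title": "title",
    "author": "creator",
    "topic": "subject",
    "issued": "date",
    "call_number": "identifier",
    "access_note": "rights",
}


def to_dublin_core(native: Dict[str, object],
                   crosswalk: Dict[str, str] = CROSSWALK) -> Dict[str, List[str]]:
    """Map one native record to repeatable DC elements, dropping unmapped fields."""
    bad = set(crosswalk.values()) - DC_ELEMENTS
    if bad:
        raise ValueError(f"crosswalk targets are not DC elements: {sorted(bad)}")
    dc: Dict[str, List[str]] = {}
    for field, value in native.items():
        element = crosswalk.get(field)
        if element is None:
            continue                      # no DC equivalent in this crosswalk
        values = value if isinstance(value, list) else [value]
        cleaned = [str(v) for v in values if v not in (None, "")]
        if cleaned:
            dc.setdefault(element, []).extend(cleaned)
    return dc


# A hypothetical native record; "shelf" has no mapping and is dropped.
sample = {
    "main_title": "Letters",
    "author": ["A. Writer", "B. Editor"],
    "shelf": "Bay 4",
    "topic": "",
}
dc_record = to_dublin_core(sample)
```

A supplier could attach a table like `CROSSWALK` to its offer so that harvesting services can see which native fields are lost or merged when records are exposed as DC.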