DLF Fall Forum 2005 Program

DIGITAL LIBRARY FEDERATION
FALL FORUM 2005
CHARLOTTESVILLE, VA
NOVEMBER 7—9, 2005

Omni Charlottesville Hotel
245 West Main Street
Charlottesville, VA 22902
(434) 971-5500

PRECONFERENCE: SUNDAY, NOVEMBER 6

9:00am—4:00pm American West Partners Meeting—for project participants
(Salon C, Lobby Level)

PRECONFERENCE: MONDAY, NOVEMBER 7

9:00am—11:30am NDIIPP Technical Architecture Affinity Group Meeting—
for project participants (Lewis/Clark, Lobby Level)

8:30am—12:30pm DLF Aquifer Meeting—for project participants (James Monroe, Lobby Level)

DAY ONE: MONDAY, NOVEMBER 7

10:30am—1:00pm Registration (Prefunction Area, Lobby Level)

11:30am—12:15pm First-time Attendee Orientation (Salon A, Lobby Level)

12:45pm—1:00pm Opening Remarks by David Seaman and Barrie Howard (Salons A and B, Lobby Level)

1:00pm—2:30pm Session 1: THE NATIONAL DIGITAL INFORMATION INFRASTRUCTURE AND PRESERVATION PROGRAM (Salon A,
Lobby Level)

Maintaining Archive Integrity During Inter-repository Transfer: Lessons Learned from the NDIIPP Archive Ingest and Handling Test

Martha Anderson, Project Manager, Library of Congress, Moderator
Clay Shirky, NDIIPP Technical Lead, NYU. PRESENTATION
Michael Nelson, Old Dominion University. PRESENTATION
Tim DiLauro, Johns Hopkins University. PRESENTATION
Keith Johnson, Stanford University. PRESENTATION
Stephen Abrams, Harvard University. PRESENTATION

The Archive Ingest and Handling Test (AIHT), a practical experience with the proposed National Digital Information Infrastructure and Preservation Program (NDIIPP) architecture, completed its work in early summer 2005. Project teams from Harvard, Johns Hopkins, Old Dominion and Stanford investigated diverse technologies and approaches ranging from the examination of new models of preservation with self-archiving object technologies to testing content repository technologies as platforms for preservation. Risk assessment tools and file evaluation and validation tools were developed and used. A final phase of the project simulated change over time by testing file migration to different formats.

Surprising challenges to basic assumptions about file and whole archive transfer were brought to light. The tested approaches and lessons learned about the transfer, assessment and management of a digital archive will be discussed by principal investigators from participating institutions and the Library of Congress.

1:00pm—2:30pm Session 2: METADATA STRATEGIES (Salon B, Lobby Level)

A Taxonomic Approach to the Organization of Penn State Web Space

Michael Pelikan, Pennsylvania State University.

Penn State University has had broad experience with several search engines, and currently holds a University-wide license for the Google Search Appliance. While useful for general-purpose Web searches, the Google Search Appliance does not currently address critical search and retrieval issues in the Penn State Web environment.

The Taxonomic Tags group was formed as a university-level group with representatives from Information Technology Services, Finance & Business Administration, and the Penn State Libraries, to examine whether a taxonomic model, expressed as metadata tags and systematically applied across the university's Web pages, could:

permit specific pages to be the top hit for specific searches,
make it easier to find specific pages from among the University's more than 1,000,000 public Web pages,
remain useful amidst increasing adoption of content management systems across Penn State,
remain useful over time as search engines continue to evolve, regardless of whether open source or commercial (and often proprietary) search algorithms are employed.

The Tags group has developed recommendations to address these issues. These recommendations include the development of a controlled vocabulary, along with synonyms, for Penn State departments, colleges, administrative units, etc. These terms would be incorporated both into Web pages and into the university's LDAP system.

The Group has recommended that a broadcast search mechanism be developed for the main Penn State Web search screen. Under this system, user search terms will be submitted both to a Web search appliance and to the university's LDAP system. The results will be combined, identified and presented to the user.
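The broadcast search idea can be illustrated with a minimal Python sketch. It is not the Tags Group's design: the two backend functions, their sample data, and the `example.edu` URL are hypothetical stand-ins for a Web search appliance and an LDAP directory. The sketch only shows one query being sent to several backends, with each hit labeled by its source before the results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def search_web(query):
    # Hypothetical stand-in for a Web search appliance.
    pages = {"registrar": ["https://example.edu/registrar/"]}
    return pages.get(query.lower(), [])


def search_directory(query):
    # Hypothetical stand-in for an LDAP directory lookup.
    units = {"registrar": ["Office of the University Registrar"]}
    return units.get(query.lower(), [])


def broadcast_search(query, backends):
    """Send one query to every backend and label each hit with its source."""
    with ThreadPoolExecutor() as pool:
        # Submit the same query to all backends so they run concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in backends.items()}
        return [
            {"source": name, "hit": hit}
            for name, future in futures.items()
            for hit in future.result()
        ]


BACKENDS = {"web": search_web, "directory": search_directory}
results = broadcast_search("Registrar", BACKENDS)
for row in results:
    print(row["source"], row["hit"])
```

Keeping the source label on every hit is what would let a results page identify where each entry came from.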

Members of the Tags Group will present background on the project and report on its progress to date. The Tags Group is highly interested in questions and comments, and will tailor the presentation to permit as much time for discussion as possible.

Unpacking the Interpretation of METS Markup

David Dubin, University of Illinois at Urbana-Champaign. PRESENTATION

Like most XML applications, METS, the Metadata Encoding and Transmission Standard, overloads a small number of generic syntactic relationships (e.g., parent/child) to represent a variety of specific semantic relationships. Human beings correctly infer the meaning of METS markup, and these understandings inform the logic and design of applications that import, export, and transform METS-encoded resources and descriptions.

However, METS's flexibility and generality invite diverse interpretations, posing challenges for processing across different METS profiles and local adaptations. Robust processing requires support in the form of a general software library for reasoning about METS documents. We describe the current state of development for such a library.

This METS interpretation software is an application of the BECHAMEL markup semantics framework (Dubin et al., 2003). BECHAMEL applications translate properties and relationships expressed in conventional markup into logical assertions that unpack the overloaded XML-based syntax. The inference problems we aim to support include identifying inline and external storage objects, mapping storage objects to resources and descriptions, and correctly classifying the role of namespaces.

Another goal of explicating the interpretation of METS documents is to reserialize them in XML, directly asserting as many of the inferred facts as we can. In this way we hope to improve prospects for long term digital preservation.

D. Dubin, C. M. Sperberg-McQueen, A. Renear, and C. Huitfeldt. A logic programming environment for document semantics and inference. Literary and Linguistic Computing, 18(2):225-233, 2003.

The Problem with Duplicates

Esme Cowles, UCSD Libraries.

When harvesting large numbers of non-unique metadata records from several different institutions, duplicate records are inevitable. Typically, duplicates are identified by comparing globally-unique identifiers (such as ISBNs) or definitive metadata elements (such as creator and title statements). However, some disciplines (such as art) have neither unique identifiers nor definitive metadata, making the task of identifying duplicate records much harder.

The Union Catalog for Art Images (UCAI) project aggregated 920,000 metadata records for slides and digital images from six institutions, mapped all records to a common schema (VRA Core) and attempted to identify and merge duplicate records. Drawing on clustering techniques, controlled vocabularies, and string-comparison algorithms, the UCAI team developed tools and software to compare metadata records and identify duplicate records.

The experience of the UCAI team demonstrates the challenges of working with metadata created without common content standards, controlled vocabularies, or unique identifiers, and provides guidance for content producers and aggregators.
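As a rough illustration of string-comparison matching when no unique identifier exists, the Python sketch below normalizes creator and title strings and flags record pairs whose similarity exceeds a threshold. It is not the UCAI software: the record fields, the sample data, and the 0.85 threshold are assumptions chosen for the example, and the comparison uses the standard library's `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase, drop punctuation, and sort words so inverted names match."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(sorted(text.split()))


def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(records, threshold=0.85):
    """Return index pairs whose creator AND title both look alike."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = min(
            similarity(a["creator"], b["creator"]),
            similarity(a["title"], b["title"]),
        )
        if score >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample records, not taken from any real catalog.
records = [
    {"creator": "Monet, Claude", "title": "Water Lilies"},
    {"creator": "Claude Monet", "title": "Water lilies."},
    {"creator": "Hokusai", "title": "The Great Wave"},
]
print(likely_duplicates(records))
```

Pairwise comparison grows quadratically with the number of records, so a real aggregation of this size would need to cluster or block records first and compare only within each group.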

2:30pm—3:00pm Break (Prefunction Area, Lobby Level)

3:00pm—4:30pm Session 3: DIGITAL PRESERVATION (Salon A, Lobby Level)

Preserving Digital Resources: Complexities and Emerging Solutions
(A View from the NDIIPP Partners Early Work) PRESENTATION

Joanne Kaczmarek, UIUC.
Patricia Cruse, CDL.
Martin Halbert, Emory.
Anthony Ramirez, University of Maryland.
Bill LeFurgy, Library of Congress.
Jim Tuttle, NCSU.
Nan Rubin, WNET.

The general purpose of this discussion panel is to present the challenges of digital preservation as experienced to date by institutions engaged in digital preservation projects through the NDIIPP initiative, and, specifically, to examine and discuss examples of solutions currently implemented or being considered by NDIIPP partners.

Discussion will open with remarks from the Library of Congress on specific challenges presented by the "digital preservation problem." After a brief introduction of NDIIPP partners, representatives from each project will provide concise examples of currently implemented or pondered solutions to the specific challenges they are encountering.

Challenges arising include varied technical issues, as well as issues related to workflow, rights clearance, and the economic sustainability of preservation activities. This discussion will also include consideration of the role the NDIIPP partnership model might play in developing preservation strategies and solutions. The session format is intended to foster discussion among panelists, and audience participation is encouraged.

Libraries and archives have traditionally played the role of "trusted repositories," assuming long-term responsibilities for assuring the integrity and authenticity of materials deposited with or collected by them. With the proliferation of digital resources, the role of a "trusted repository" takes on a new aspect, requiring libraries, archives and other institutions to re-conceptualize their place in providing assurances of long-term digital preservation.

Recognizing the need for a coordinated approach to preserving digital resources, the Library of Congress launched a $99.8M national digital strategy effort through the National Digital Information Infrastructure and Preservation Program (NDIIPP). Its mission is to "develop a national strategy to collect, archive and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations." In Fall 2004, the eight projects participating in this proposed panel were awarded funding. More information about NDIIPP and its partners may be found at http://www.digitalpreservation.gov.

3:00pm—4:30pm Session 4: MANAGING DIGITAL LIBRARY CONTENT AND CODE (Salon B, Lobby Level)

Flipping the Switch: Lessons Learned from a Major Digital Library Migration Project

Jon Dunn, Mark Notess, and Ryan Scherle, Indiana University Digital Library Program. PRESENTATION

In 2005, the Variations2 digital music library transitioned from being only a research project to becoming the replacement for the heavily-used Variations online music service at Indiana University. The effort to bring this second-generation digital library into production included re-processing and checking approximately 10,000 digitized sound recordings, the creation of a new digital ingest tool, and the development of an access control mechanism to ensure appropriate copyright safeguards.

Moving nine years' worth of digitized recordings proved to be more than a simple matter of pointing the new tool at the old files. We had to retrieve, and in some cases locate, the original .wav files and re-encode them to support the superior capabilities of the new tool. We also moved the production file server from a tape-based system to hard disks. Re-encoding ran 24x7 for approximately two months. Subsequent error checking and clean-up took several months more.

The transition provided an opportunity to reassess and improve our audio ingest process. The new digitization tools were designed in consultation with the digitizing staff and fit much better with the digitization workflow, increasing throughput and improving quality.

The Variations2 access control mechanism limits out-of-library use of copyrighted materials according to a new access policy, based on student course enrollment. With this access mechanism in place, we are distributing the Variations2 client software to students to support home access of streaming audio and scanned score images.

This talk describes the lessons learned, and the surprises along the way, during the Variations to Variations2 migration.

Organizing Project Code for Digital Library Applications

Eric Stedfeld, New York University. PRESENTATION

Many digital library projects suffer a circuitous evolution, starting as a demo or proof of concept in a scripting language, then adding on requirements and expectations until the project reaches production status.

This can result in disorganized code that is difficult to maintain, update, extend, or scale, let alone transition into another coding environment such as Java. Even more challenging, programmers familiar with the original scripting language may have little background in the principles and methods of the new environment that lead to good programming practice.

This presentation provides some approaches for better structuring and maintaining such code, based on Java Guidelines and Patterns. The example application, a digitized collection of Colonial and Early American documents, utilizes servlets, JSP, JavaBeans, a database backend, and XML files generated with the METS Java Toolkit. The principles presented can help session participants make their application code more manageable, extensible and scalable, while saving time and reducing frustration in software development.

4:30pm—4:45pm Break (Prefunction Area, Lobby Level)

4:45pm—6:00pm BIRDS OF A FEATHER SESSION 1

1) The DLF Electronic Resource Management Initiative: Phase 2 (Salon A, Lobby Level)

Adam Chandler, Cornell. PAPER

This presentation will describe the scope of Phase 2 of the DLF Electronic Resource Management Initiative ("ERMI 2"), including timeline, objectives and deliverables.

2) OCKHAM: Digital Library Service Registries (Salon B, Lobby Level)

Jeremy Frumkin, Oregon State University. PRESENTATION

Martin Halbert, Emory University.

The Ockham Project will hold a BOF session to explore Digital Library Service Registries (DLSRs) at the DLF Fall Forum. With the growing popularity of metasearch tools, collections available through OAI-PMH, the use of OpenURL resolvers, and the emergence of new efforts such as COinS, there is a growing need for registries to support access to these services. This BOF will focus on the concept of the DLSR and the functions a DLSR supports, and will examine current DLSR efforts, including the OCLC OpenURL registry, the JISC/IESR DLSR, and the Ockham Distributed DLSR.

In addition, we will discuss how DLSRs might play a key role in enabling new digital library functionality. Combined with the concept of "Autodiscovery" (techniques for automatically finding machine-processable resources associated with a particular web page), can we utilize DLSRs to lower the barriers to information integration while at the same time enabling greater and more complex information workflows? Can we create a "digital library dialtone" which makes connecting digital library services and content as easy as placing a phone call? Come to this BOF to find out!

3) Archival Information Control (Ashlawn/Highlands, Lobby Level)

Stephen Davis, Columbia, with Ellie Brown and Karen Calhoun, Cornell.

A discussion about how to get control institutionally over finding aid creation and management as well as the full lifecycle of archival collection information. Lee Mandell will also report briefly on the status of the Mellon-funded Archivists Toolkit project.

4) Digital Preservation for Photographs (Lewis/Clark, Lobby Level)

Erin Rhodes, U.S. National Archives and Records Administration. PRESENTATION

David Seaman, DLF.

Last year, NARA produced a very well received electronic publication, Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files - Raster Images (Steven Puglia, Jeffrey A. Reed, and Erin Rhodes, the U.S. National Archives and Records Administration, June 2004), subsequently issued as a print document by DLF (see http://www.diglib.org/pubs/dlfpubs.htm#nara-raster).

This report addresses a spectrum of considerations for digitizing a variety of textual and photographic records, including file formats, image capture, metadata, and quality assessment. In April 2005, a team of experts from across the DLF and beyond, including Harvard, NARA, LC, Kodak, Chicago Albumen Works, RLG, RIT, and the Swiss Federal Institute of Technology, convened to build on this work to produce a guide for the digitization of photographs for preservation reformatting. This session allows you to hear about our plans, and have input into the process. You will also be able to preview a proposed new quality control target designed for image digitizing workflows in libraries and museums.

7:00pm—9:30pm Reception (Harrison Institute, University of Virginia)

Note: Round-trip transportation will be provided from the Omni hotel to the reception site on the University of Virginia campus.

DAY TWO: TUESDAY, NOVEMBER 8, 2005

8:00am—9:00am Breakfast (Atrium, Lobby Level)

9:00am—10:30am Session 5: REMODELING DIGITAL LIBRARY SYSTEMS (Salon A, Lobby Level)

Re-architecting a Digital Library System for XML/XSL and Unicode:
Lessons Learned

Phil Farber, Alan Pagliere, Chris Powell, John Weise, and Perry Willett
(all University of Michigan). PRESENTATION

The University of Michigan digital library system was developed in the 1990s to process SGML files using Perl scripts. At Michigan, this system provides access to over 20,000 texts, 250,000 images, and 850 archival finding aids. In addition, this system is used at 28 other institutions.

This presentation will describe the transformation of our digital library system for XML/XSL and Unicode, and the lessons learned. Panelists will describe the original system, discuss the reasons for this major undertaking, and will cover topics such as the planning process, how production systems were maintained during the re-architecture, large-scale data conversion from SGML/ISO-Latin1 to XML/Unicode files, tools for conversion, processing, version control and debugging, testing, and what they've learned in the process.

9:00am—10:30am Session 6: DLF AQUIFER (Salon B, Lobby Level)

Katherine Kott, DLF Aquifer. PRESENTATION
Leslie Johnston, UVA.
Sarah Shreeves, UIUC. PRESENTATION
Jon Dunn, Indiana.
Martin Halbert, Emory.

DLF Aquifer, the Digital Library Federation distributed open digital library, is in implementation. Since the DLF Spring Forum, project working groups have created a collection development policy and a DLF Aquifer MODS metadata profile to support service development, and have selected a small set of digital collections to use as an initial test-bed. The University of Michigan will begin harvesting metadata for the project soon.

The services working group has identified target audiences, developed use cases and surveyed DLF libraries to learn what is already known about digital collection use. Taking their cues from the services working group, the technology/architecture working group completed a draft of architectural principles and selected the "repository neutral" framework designed at Johns Hopkins as the DLF Aquifer "middleware layer".

This panel will review the accomplishments of the past six months and outline the phase I deliverables that will be demonstrated at the DLF Spring Forum in Austin next April.

10:30am—11:00am Break (Prefunction Area, Lobby Level)

11:00am—12:30pm Session 7: DYNAMIC DIGITAL ENVIRONMENTS
(Salon A, Lobby Level)

A Format-registry Based Automated Workflow for the Ingest and Preservation of Electronic Journals

Evan Owens, Chief Technology Officer, Portico. PRESENTATION

Portico (http://www.portico.org), with funding from The Andrew W. Mellon Foundation, Ithaka, and JSTOR, has developed an automated workflow for the ingest of publisher-supplied e-journal source files into a preservation repository. Electronic journals as a preservation challenge sit somewhere between traditional digitization projects and Web-harvesting projects in that the formats are known and controlled but by the content provider rather than by the archive. The workflow that Portico has developed builds on concepts that were developed in the preliminary work towards a Global Digital Format Registry (GDFR) and on the JHOVE tool set. The components of the workflow include package disassembly, format identification and verification, structure mapping, automated metadata harvesting, rule-based format normalization, and support for quality control and inspection. The system implementation uses a service-based architecture built upon a format registry and a tool registry with support for distributed and pluggable tools. This presentation will review the workflow and system design and discuss our experience in designing and building a system based on a format registry.
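The pairing of a format registry with a tool registry can be sketched in a few lines of Python. This is an illustrative toy, not Portico's system, JHOVE, or the GDFR: the registry contents, the single PDF check, and the sample package are assumptions made for the example. It shows only the general pattern of identifying a file's format from a signature and then looking up whichever validation tool is registered for that format.

```python
# Format registry: leading byte signature -> format identifier.
FORMAT_REGISTRY = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"<?xml": "application/xml",
}


def pdf_has_eof_marker(data):
    # Deliberately shallow check: a PDF should end with an %%EOF marker.
    return data.rstrip().endswith(b"%%EOF")


# Tool registry: format identifier -> pluggable validation function.
TOOL_REGISTRY = {"application/pdf": pdf_has_eof_marker}


def identify(data):
    """Return the registered format whose signature starts the data, if any."""
    for signature, fmt in FORMAT_REGISTRY.items():
        if data.startswith(signature):
            return fmt
    return None


def ingest(package):
    """Identify each file, then validate it if a tool is registered."""
    report = {}
    for name, data in package.items():
        fmt = identify(data)
        tool = TOOL_REGISTRY.get(fmt)
        report[name] = {"format": fmt, "valid": tool(data) if tool else None}
    return report


# Hypothetical publisher package: file name -> raw bytes.
package = {
    "article.pdf": b"%PDF-1.4\n%%EOF\n",
    "figure.png": b"\x89PNG\r\n\x1a\n",
    "notes.bin": b"\x00\x01",
}
report = ingest(package)
for name, result in report.items():
    print(name, result)
```

Because tools are looked up by format rather than called directly, adding support for a new format means adding registry entries rather than changing the ingest loop.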

WikiD—Applying Wiki Principles to Structured Data

Jeffrey A. Young, OCLC Online Computer Library Center, Inc. PRESENTATION

Ward Cunningham describes a wiki as "the simplest online database that could possibly work". The cost of this simplicity is that wikis are generally limited to a single collection containing a single kind of record (viz. WikiMarkupLanguage records). WikiD extends the Wiki model to support multiple collections containing arbitrary schemas of XML records with minimal additional complexity.

WikiD is essentially a lightweight framework combining:

Open-source implementations of various loosely-coupled open-standard protocols (e.g. OpenURL, SRW/U, SRW Update, OAI-PMH, RSS)
An open-source version-controlled database.
A set of bootstrap collections:
    CollectionCollection - the master collection of all collections defined in WikiD
    CollectionExternalSchemas - a registry of XML Schemas used to constrain the items in WikiD collections
    CollectionWikiPages - the default collection that not only provides WikiD's conventional out-of-the-box wiki functionality but also acts as the user interface for the creation and maintenance of other collections.
XSL Stylesheets to render collection-level open-standard protocol responses into HTML for human consumption. Automated processes can ignore the stylesheet reference and use the open-standard protocol responses directly.

Possible applications for WikiD include collaborative maintenance of registries, thesauri, taxonomies, reviews, and documentation. In addition to a standard set of features available for all collections, custom code (e.g. Java or XSL) can also be assigned to provide new types of Web services related to individual collections.

The WikiD project page can be found at http://www.oclc.org/research/projects/wikid/default.htm. A demo is running at http://alcme.oclc.org/wikid/. Instructions for creating a new collection can be found at http://alcme.oclc.org/wikid/DemoInstructions. A J2EE Web app distribution is in the works.

11:00am—12:30pm Session 8: COLLABORATIVE METADATA AGGREGATIONS (Salon B, Lobby Level)

Collaborative Metadata Aggregations: The Road to Shareable Metadata

Sarah L. Shreeves, University of Illinois at Urbana-Champaign, Moderator
Bill Landis, California Digital Library. PRESENTATION
Trish Rose, University of California at San Diego. PRESENTATION
Timothy Cole, University of Illinois at Urbana-Champaign. PRESENTATION
Jenn Riley, Indiana University. PRESENTATION

DLF emphasizes the role of collaboration to better understand how best to share our digital content and metadata. To this end it has supported the writing of the DLF-NSDL Best Practices for OAI Data Provider Implementations and Shareable Metadata in order to cope with the most common difficulties in the exchange of metadata between content providers and aggregators. This best practices work has drawn on the experiences of collaborative projects involving metadata sharing and has highlighted the importance of such projects to facilitate the dialogue between content providers and aggregators.

This panel will briefly highlight the experiences of several OAI-based and non-OAI-based collaborative aggregations and best-practices initiatives. It will then turn to an open discussion of the issues facing these collaborative projects and initiatives, and of how they help foster more efficient mechanisms for sharing metadata.

12:30pm—2:30pm Break for Lunch (Individual Choice)

Note: There is an open-air pedestrian mall just outside the Omni hotel with many restaurants to choose from for lunch.

2:30pm—4:00pm Session 9: NARA'S ELECTRONIC RECORDS ARCHIVES (Salon A, Lobby Level)

Metadata Implementation Perspectives for the ERA System

Quyen Nguyen, Systems Engineering Division, ERA Program Management Office, U.S. National Archives and Records Administration. PRESENTATION

With the advent of Information Technology, more and more records today are born digital. In order to continue to fulfill its mission in the computer age, the U.S. National Archives and Records Administration (NARA) has made the decision to develop the Electronic Records Archives (ERA) system.

The ERA system represents an endeavor undertaken by the agency to preserve digital records, and make those records accessible independently of the hardware and software with which they were created. Metadata is an important element in such a system, whose core functionality is digital preservation for long term access by the public.

This paper will discuss the potential issues and impact of implementing metadata in ERA from the perspectives of system architecture, data management, and software design.

We also present the information technologies that we are considering for the implementation of metadata within the ERA system, such as XML and Web services. By referring to the OAIS information model, we will look at different types of metadata, and how the system could support the creation and maintenance of these metadata automatically and manually via workflow. Meeting the security requirements with different levels of data and metadata classification also constitutes a challenge in the system design process. The decision of how to store metadata vis-à-vis records and record aggregates will significantly impact the software design object model, the storage size, and data replication. Metadata as well as data replication are critical to ensure the availability and safeguarding of the archival records.

Indexing and Search Implementation Perspectives for the ERA System

Dyung Le, Systems Engineering Division, ERA Program Management Office, U.S. National Archives and Records Administration. PRESENTATION

With the advent of Information Technology, more and more records today are born digital. In order to continue to fulfill its mission in the computer age, the U.S. National Archives and Records Administration (NARA) has made the decision to develop the Electronic Records Archives (ERA) system. The ERA system represents an endeavor undertaken by the agency to preserve digital records over an indefinite period of time, and make those records accessible independently of the hardware and software with which they were created. The capability for indexing and searching of its assets is an important element in a system whose core functionality is digital preservation of electronic records for long term access by NARA and the public.

This paper will discuss the potential issues and impacts related to the implementation of Indexing and Search functionality in ERA from the perspectives of system architecture, software design, usability, and long term maintainability. We will present the information technologies represented by the major vendors of Enterprise Search COTS that we are analyzing for possible implementation of Indexing and Search services within the ERA system.

Meeting the information retrieval needs of the diverse, and potentially huge, ERA user community, given the resource limitations of ERA, is a serious challenge. We will discuss options being considered by NARA to meet this challenge. Meeting the security requirements of a solution which must house records with different levels of classification or that contain sensitive information also constitutes a challenge for the Indexing and Search service. ERA is intended to exist for an essentially indefinite period of time, and its service oriented architecture provides the flexibility to evolve over time as technology changes, including changing out COTS products. There are no current or emerging standards (other than for metadata) governing the Enterprise Search arena. Hence there is a real danger of becoming locked into a particular Enterprise Search vendor's proprietary approach. The paper will discuss the related technical issues and possible mitigation.

2:30pm—4:00pm Session 10: OAI FOR DIGITAL LIBRARY AGGREGATION (Salon B, Lobby Level)

David Seaman, DLF PRESENTATION
Kat Hagedorn, Michigan PRESENTATION
Martin Halbert, Emory PRESENTATION
Sarah Shreeves, UIUC PRESENTATION
Tom Habing, UIUC PRESENTATION
Perry Willett, Michigan

DLF, in partnership with Emory, Michigan, and UIUC, is researching, designing, and prototyping a "second generation" OAI finding system, capitalizing on the lessons learned from the first wave of OAI harvesting and using as its raw material collections drawn from across the DLF membership. The aim is to foster better teaching and scholarship through easier, more relevant discovery of digital resources, and enhance libraries' ability to build more responsive local services on top of a distributed metadata platform.

This panel will update the DLF community on the progress of this work, and solicit feedback while we are still in medias res. The major deliverables will be described and demonstrated such as they can be, with particular emphasis on the first three, which are furthest along at this point:

1) Best Practices guidance for OAI use in libraries, with particular emphasis on the recommendation that we adopt MODS as the metadata schema to convey the richness of description that we are convinced we need to build OAI records that truly support innovative scholarship. The first version of the Best Practices document will be available online by the Forum and in print soon after, as grant deliverables. Emory University and the other DLF IMLS partners have also developed a curriculum series of OAI best practices training materials. These materials will be used to train staff and coordinate activities of DLF libraries interested in sharing metadata concerning their digital collections, and are intended to be shared with the larger digital library community.

2) A pair of portal prototype finding systems, informed heavily by feedback received from the grant-funded Scholars Advisory Panel, http://www.diglib.org/architectures/oai/imls2004/OAISAP05.htm. One portal, The DLF OAI Portal, offers a single place to access all OAI records (items and collections) from DLF institutions: http://www.hti.umich.edu/cgi/b/bib/bib-idx?c=imls;page=simple The second, in production now, takes 330,000 MODS-based OAI records from four DLF institutions and is building a prototype service that reflects the service and functional desires of our scholarly team.

3) An Experimental OAI Registry at UIUC of use principally to builders of OAI services: http://gita.grainger.uiuc.edu/registry. The most significant recent additions to the registry are rich, human-generated collection descriptions for many of the DLF member OAI data providers, including description of select subsets. These data are browsable via the registry Web interface or as XML which conforms to the DC Collection Description profile.

4) A Survey of Digital Library Aggregation Services, version 2: as part of the grant, Martha Brogan is revisiting her 2003 Survey, and we will be publishing the results early in 2006.

As the grant progresses, we are also expecting to look at auto-characterization of data, Web services, and interfacing our prototype systems with Google. Evaluation will be a significant component later in the grant period.

4:00pm—4:30pm Break (Prefunction Area, Lobby Level)

4:30pm—6:00pm BIRDS OF A FEATHER SESSION 2

5) Open Archives Initiative Protocol for Metadata Harvesting: Best Practices for Data Provider Implementations and Shareable Metadata (James Monroe, Lobby Level)

Sarah L. Shreeves, University of Illinois Library at Urbana-Champaign

A working group made up of members of the DLF and NSDL and representing both service and data providers has been developing a set of best practices for OAI data provider implementations and shareable metadata (http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?TableOfContents). Specifically, the Best Practices for OAI Data Provider Implementations (http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?DataProviderPractices) offer guidelines and recommendations for a range of integral and optional pieces of the protocol (deleted records, sets, datestamps, descriptive containers). The Best Practices for Shareable Metadata (http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl?IntroductionMetadataContent) outline general guidelines for authoring metadata that is useful and effective within larger aggregations.

These best practices are now beginning to undergo public review and comment in preparation for publication in the coming year. This session is open to anyone who would like to discuss these best practices and ask questions or raise concerns with members of the working group.
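For attendees less familiar with the protocol pieces named above (sets, datestamps, deleted records), the following is a minimal sketch in Python of how a harvester might build a selective ListRecords request and read record headers. It is illustrative only and is not taken from the working group's documents: the base URL, the identifiers, and the sample response are made-up placeholders, and the function names are hypothetical. The request parameters (verb, metadataPrefix, from, set) and the status="deleted" header attribute are those defined in the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-11-08T16:30:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2005-10-01</datestamp>
        <setSpec>maps</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Placeholder title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:item-2</identifier>
        <datestamp>2005-10-15</datestamp>
        <setSpec>maps</setSpec>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def build_list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build a selective-harvest request URL (datestamp and set are optional)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def summarize_headers(xml_text):
    """Return identifier, datestamp, sets, and deleted flag for each record header."""
    root = ET.fromstring(xml_text)
    summaries = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        summaries.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
            # Deleted records carry status="deleted" and no metadata element.
            "deleted": header.get("status") == "deleted",
        })
    return summaries


if __name__ == "__main__":
    print(build_list_records_url("http://example.org/oai", "oai_dc",
                                 from_date="2005-10-01", set_spec="maps"))
    for row in summarize_headers(SAMPLE_RESPONSE):
        print(row)
```

A service provider that tracks the deleted flag and datestamp in this way can keep its aggregation in step with a data provider between harvests, which is one reason consistent handling of these protocol pieces matters.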

6) Preservation Planning for Digital Objects and Repositories (Ashlawn/Highlands, Lobby Level)

Taylor Surface, OCLC
Stephen Abrams, Harvard.

Many of us are implementing or operating repositories of digital materials with an eye toward preserving the objects for the long term. While there has been much theoretical discussion on practices, those of us with these repositories now face the very real challenge of providing these services. Come to this BOF to share your current practice for digital preservation planning and discuss opportunities for creating best practices.

7) METS Implementation and METS Profile Building (Lewis/Clark, Lobby Level)

Nancy J. Hoebelheinrich, Stanford University Libraries.

The METS Editorial Board will hold a Birds of a Feather session on Technical Issues related to METS implementation and METS Profile Building. Members of the METS Editorial Board who have successfully written, registered and implemented METS profiles will be in attendance, as well as members of the METS community who are in the process of developing METS profiles (including Arwen Hutt from UCSD and Rob Wolfe from MIT/DSpace). Topics to be discussed include identifying the purpose and function that a METS profile can serve in local implementations of METS, how and whether local workflow should influence the development of a profile, and how profiles are designed to facilitate content and metadata sharing among members of the METS community. Lively discussion will be encouraged!

DAY THREE: WEDNESDAY, NOVEMBER 9

8:00am—9:00am Breakfast (Atrium, Lobby Level)

9:00am—10:30am Session 11: DIGITAL LIBRARY INITIATIVES IN SOUTH AFRICA AND ENGLAND (Salon A, Lobby Level)

Herding Big Cats: An African Experience of Collaboration

D. P. Peters, University of KwaZulu Natal. PRESENTATION

DISA, Digital Imaging South Africa, is a national collaborative digitization project, funded by The Andrew W. Mellon Foundation, to make the repressed archival documentation of the apartheid era in South Africa available for international research. Some 70,000 pages have already been made available from http://disa.nu.ac.za.

In partnership with Aluka, a project of Ithaka Harbors Inc., the project has recently embarked upon a second phase, to make available research resources of neighboring regimes under the regional topic, Southern African Freedom Struggles, 1950-1994. This presentation will share an African experience of collaboration, in building partnerships with users to provide context, and with institutions to provide content.

The South African apartheid system ranks among the crimes against humanity that history and responsible stewardship must help prevent in the future. Digital technologies are ideally suited to this aim through the dissemination of information, but the process of building collaboration is not without pitfalls, beyond the organizational challenges.

The DISA project has recently focused on developing a common understanding amongst librarians, archivists, scholars and politicians, of its role in interpreting a sensitive and painful period of history.

Engagement of the scholarly community serves to build contextual layers in the information architecture, with descriptive essays linked to the archival resources by means of topic maps. The objective is to build a research resource for teaching and learning, stimulating curriculum development in this subject area. But archival concerns for a perceived loss of ownership must be juggled with access, and local heritage preservation with global cultural imperialism.

This presentation will investigate some of the benefits and pitfalls experienced in building new user communities in national and international collaboration.

Digital Library Activities at Oxford

Michael Popham, Oxford University. PRESENTATION

The libraries of Oxford University have a long-standing interest in digital technologies. Large-scale digitization projects undertaken more than a decade ago such as "Early Manuscripts at Oxford" (http://www.image.ox.ac.uk/) are still attracting a growing number of users who wish to access selected items from our extensive holdings. However, by 2000 it had become apparent that the hitherto piecemeal and largely project-based approach to the selective digitization of material was not going to be sustainable in the longer term. Quite apart from the resources required to support and maintain dozens of separate Web sites built on a variety of applications and data standards, it was clear that digital surrogates were becoming acceptable to the scholarly community and greatly increasing public access to material that most readers would be unlikely to see first hand. A new approach was required.

In the summer of 2001, Oxford University Library Services established the Oxford Digital Library (ODL): a combination of services and technologies intended to develop, test, and implement the policies and standards that would underpin a University-wide framework for the digitization of library holdings. Thanks to a generous grant from the Andrew W. Mellon Foundation, a Development Fund was established to create a testbed of core content for the ODL intended to be used by researchers, teachers, and the global community of learners.

The initial four-year development phase of the ODL concludes in October 2005. This presentation will outline the lessons we have learned to date, consider the implications for the way the ODL is likely to develop, and look at the impact of such endeavors as the Oxford-Google digitization partnership.

This presentation will provide an update on digital library developments at the University of Oxford, outline the lessons that have been learned from the initial four-year development phase of the Oxford Digital Library, and discuss the likely impact of the Oxford-Google digitization agreement.

9:00am—10:30am Session 12: WEB ARCHIVING SERVICES (Salon B, Lobby Level)

Martha Anderson, Library of Congress. PRESENTATION
John Tuck, The British Library. PRESENTATION
Taylor Surface, OCLC. PRESENTATION
John Kunze, CDL. PRESENTATION

Web archiving services emerging at a number of different institutions will enable librarians and other document selectors to extend their historic collection-building roles into the domain of web-based materials. Such services will allow curators to initiate and monitor web crawls relevant to specific topic areas, analyze and annotate harvested data, and search and browse local archives built from sites that may have been harvested multiple times.

1) "Introduction to Web Archiving" (Martha Anderson): a brief overview of the current landscape of challenges and opportunities of archiving web resources.

2) "Web Archiving at the British Library" (John Tuck): The British Library is the lead partner in the UK Web Archiving Consortium (UKWAC) (www.webarchive.org.uk) and is a member of the International Internet Preservation Consortium (IIPC).

The focus of the presentation will be on collaborative working nationally and internationally. There will be specific reference to the challenges faced by UKWAC in areas such as permissions and legal deposit, software, and collection development and, in the case of IIPC, to current initiatives including progress on procurement for an automated smart crawler in conjunction with the Bibliothèque nationale de France.

3) "UIUC/OCLC's ECHO DEPository Project" (Taylor Surface): OCLC, as part of the ECHO DEPository NDIIPP project, is leading the development of a suite of open source web archiving tools named the Web Archives Workbench, which is based on an archival selection model developed at the Arizona State Library. OCLC will discuss the challenges facing state libraries in the collection of web information and review how the tools of the Web Archives Workbench help with those challenges.

4) "CDL's Web Archiving Service" (John Kunze): An overview of CDL's Web Archiving Service (WAS) and its approach to long-term preservation. The approach includes generating "desiccated data" (long-lived, low-tech derivatives for certain formats), defining service levels, assigning persistent identifiers, and replicating content at geographically distant locations.

10:30am—11:00am Break (Prefunction Area, Lobby Level)

11:00am—12:30pm Session 13: SUSTAINING DIGITAL SCHOLARSHIP (Salon A,
Lobby Level)

Bradley Daigle, Mike Furlough, Thornton Staples, and Madelyn Wessel (all University of Virginia Library).

Sustaining Digital Scholarship ("SDS") is a project at the University of Virginia Library that explores the complex technical, legal, institutional and policy issues arising for libraries in the development and formal collection of original digital scholarship. In the humanities, these born-digital scholarly efforts tend to look less like existing genres based on print models (e.g., monographs, articles) and more like exhibitions, library collections, and thematic research archives. Such projects challenge us to develop consistent methods for production, delivery, rights management, access, and archiving of digital content of multiple media and content types. To sustain original digital scholarship, we assume that we must move beyond providing individual piecemeal solutions and define standard methods by which the library collects these projects.

SDS is a collaboration among the University of Virginia Library, NINES (Networked Interface for Nineteenth Century Electronic Scholarship), the Tibetan and Himalayan Digital Library, and the Virginia Center for Digital History. Pilot projects under SDS assume that: (1) the library will formally select, collect, preserve and distribute original digital scholarly projects through a digital library architecture based upon Fedora; (2) the intellectual property rights of those projects will allow open access to the broadest extent possible; (3) the library will strive to preserve the intellectual content, structures, and designs of the project; and (4) the library will elaborate formal collection agreements with the scholars and possibly other institutions.

Mike Furlough will moderate and discuss the overall aims of the SDS project at Virginia; Thornton Staples will outline the project's theory of collection and content aggregation; Madelyn Wessel will review the policy and legal issues that the project raises for libraries and scholars; Bradley Daigle will discuss the implementation of pilot projects and expected outcomes.

11:00am—12:30pm Session 14: ARCHIVE-IT: A WEB ARCHIVING APPLICATION (Salon B, Lobby Level) PRESENTATION

Archive-it

Merrilee Proffitt, RLG, Moderator
Michele Kimpton, Director Web Archive, Internet Archive.
Carolyn Palaima, Lanic Project Director, University of Texas at Austin.
Cecile Jagodzinski, Indiana University.
Kathy Jordan, Electronic Resources Manager, Library of Virginia.
Dan Avery, Senior Crawl Engineer, Internet Archive.

Archive-it is a Web application designed specifically for the needs of university and government institutions interested in preserving Web content. The application allows organizations with limited infrastructure and technical staff to collect, catalogue, search and manage archived Web content through a Web interface.

The Internet Archive (IA), a nonprofit that manages the largest publicly available Web archive, developed Archive-it. IA currently provides these services to large institutions such as the Library of Congress and the US National Archives. It is working with RLG and a handful of other organizations to make the same service available at a scale and cost that is broadly accessible. RLG member institutions participating in this pilot are Indiana University, the International Institute of Social History, Swarthmore College (with partner Haverford College), and the University of Toronto.

Other pilot participants working directly with the Internet Archive include the Library of Virginia, University of Texas at Austin, and North Carolina State Archives. The pilot run of the service is scheduled to conclude in November 2005, and the service is scheduled to launch in January 2006.

This presentation will include an overview of Archive-it and its major functions; pilot participants will give an overview of why they are interested in Web archiving, the challenges they face in their own institutions regarding Web archiving, what they've learned so far using the Archive-it Web application, and how it's being applied in their institutions. By the time of this panel, participants will be able to discuss their experience with Archive-it and the challenges of Web archiving in general, and to provide information and informed experiences to audience members.

12:30pm Adjourn

POST-CONFERENCE, NOVEMBER 9

1:00pm—5:00pm METS Editorial Board Meeting (Monticello, Lobby Level)

12:30pm—4:30pm OAI Vendors' Panel—for project participants (Ashlawn,
Lobby Level)

1:00pm—5:00pm DLF Developers' Forum Meeting—for project participants
(James Monroe, Lobby Level)

1:30pm—5:30pm CCS docWORKS Workshop—for project participants
(Computer Classroom, Fourth Floor, Alderman Library, University of Virginia)

POST-CONFERENCE, NOVEMBER 10

9:00am—5:00pm METS Editorial Board Meeting
(James Monroe, Lobby Level)

8:30am—3:30pm DLF OAI Implementers Workshop—for project participants
(Computer Classroom, Fourth Floor, Alderman Library, University of Virginia)