1	NDIIPP Preservation Architecture: Archive Ingest and Handling Test. Interim Report. Digital Library Federation, October 2004, Baltimore, MD
2	National Digital Information Infrastructure and Preservation Program: Goals. Develop a national digital collection and preservation strategy; work with industry, concerned federal agencies, libraries, research institutions and not-for-profit entities; help identify and preserve at-risk digital content; support development of improved tools, models, and methods for digital preservation.
3	NDIIPP Focus Areas: network of preservation partners; preservation architecture; digital preservation research.
4	What is the Preservation Architecture? A conceptual framework for supporting the technical functions and developing tools required for cooperative, distributed preservation of digital content. It must: support relationships between institutions; allow questions of preservation to be handled separately from questions of public access; be built modularly, using existing technology and efforts wherever possible; be able to be assembled over time; be specified using broadly adoptable protocols.
5	Archive Ingest & Handling Test. AIHT is a first test of the proposed preservation architecture. The test is conducted with a common data set: the George Mason University 9/11 Archive. Phase I tests ingest and data handling in local systems. Phase II tests export and import between institutions.
7	Participants: Harvard University Library; The Johns Hopkins University, Sheridan Libraries; Old Dominion University, Department of Computer Science; Stanford University Libraries & Academic Information Resources; The Library of Congress, Office of Strategic Initiatives.
8	Harvard University: Background. Current policy limits deposit to: approved workflows; a small set of formats; content accompanied by preservation metadata. Evolution towards an institutional repository: arbitrary content; unknown provenance.
9	Harvard University: Approach. Use JHOVE to provide enriched technical metadata; build tools to generate SIP packages automatically; enhance metadata model to record PREMIS-like provenance information; add export functionality to repository API; investigate TIFF-to-JPEG 2000 transformations.
10	Harvard University: Team. Dale Flecker – Principal investigator; Stephen Abrams – Project manager; Stephen Chapman – Reformatting analyst; Sue Kriegsman – Project administration and reporting; Gary McGath – Developer; Germain Seac – Operations; Robin Wendler – Metadata analyst. Technologies. Digital Repository Service (DRS): Oracle (metadata), Java API, RAID (content), Solaris, XML-based SIP package; JHOVE for extraction of encapsulated technical properties; automated SIP creation tools.
11	Harvard University: Observations. JHOVE can process 97% of the 57,000 files (ASCII/UTF-8, HTML, JPEG, WAV, TIFF, PDF, GIF, AIFF, XML). The PREMIS event model is very flexible, but it is difficult to determine the best way to capture provenance metadata. Data manipulation issues: you can FTP 13 GB as one file in 3 hours, but to FTP it as 57,000 files takes 35+ hours; some FTP clients do not like zero-length files; some ZIP tools have a file size limitation; some network appliance file servers have a file size limitation. The data does not include any infected files!
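A rough worked calculation of the transfer figures above, as a sketch rather than a measurement. It assumes 1 GB = 1024**3 bytes and reads "35+ hours" as exactly 35 hours, so the per-file overhead it derives is only a lower bound. The function and constant names are illustrative and do not come from the project.

```python
# Back-of-envelope arithmetic for the figures quoted on the slide.
# Assumptions (not from the source): 1 GB = 1024**3 bytes, and "35+ hours"
# is taken as exactly 35 hours, so the overhead below is a lower bound.

TOTAL_BYTES = 13 * 1024**3   # 13 GB data set
FILE_COUNT = 57_000          # number of files in the data set
SINGLE_FILE_HOURS = 3        # FTP time when sent as one file
MANY_FILES_HOURS = 35        # FTP time when sent file by file


def throughput_mb_per_s(total_bytes: int, hours: float) -> float:
    """Average transfer rate in MB/s for a transfer of the given duration."""
    return total_bytes / (hours * 3600) / 1024**2


def per_file_overhead_s(single_hours: float, many_hours: float, file_count: int) -> float:
    """Extra seconds spent per file when the same bytes go as many files."""
    return (many_hours - single_hours) * 3600 / file_count


bulk_rate = throughput_mb_per_s(TOTAL_BYTES, SINGLE_FILE_HOURS)
overhead = per_file_overhead_s(SINGLE_FILE_HOURS, MANY_FILES_HOURS, FILE_COUNT)

# 97% of files processed by JHOVE leaves 3% unprocessed.
unprocessed = FILE_COUNT * 3 // 100

print(f"bulk rate: {bulk_rate:.2f} MB/s")
print(f"per-file overhead: at least {overhead:.2f} s")
print(f"files JHOVE could not process: about {unprocessed}")
```

Under these assumptions the extra 32 hours works out to roughly two seconds of overhead per file, which is consistent with the practice of bundling many small files into a single archive before transfer.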
12	The Johns Hopkins University: Background. Johns Hopkins University Sheridan Libraries has been investigating multiple repositories; AIHT provides a digital preservation use case. Project Approach. Large-scale ingestion with a repository-agnostic design.
13	The Johns Hopkins University: Team. Mark Patton (developer); Sayeed Choudhury (PI); Tim DiLauro (tech lead); Jacque Gourley (project manager); Ying Gu (student); David Reynolds (metadata); Jason Riesa (student). Technologies: DSpace, Fedora, METS, Java, OS X.
14	The Johns Hopkins University: Observations. Bulk ingestion of a complex archive is a good way to stress-test repository interfaces; coordinate between provider and recipient as much as possible; design metadata from established standards, instead of attempting to shoehorn; there is no seamless way to ingest to multiple repositories, so a repository-agnostic layer is needed.
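One way to picture a repository-agnostic layer is a shared ingest interface with one adapter per backend. The sketch below is purely illustrative: `IngestItem`, `Repository`, `InMemoryRepository`, and `ingest_everywhere` are hypothetical names, the backends are in-memory stand-ins, and nothing here reflects the actual DSpace or Fedora interfaces or the Johns Hopkins design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class IngestItem:
    """One archive object plus its metadata (hypothetical shape)."""
    identifier: str
    content: bytes
    metadata: dict = field(default_factory=dict)


class Repository(ABC):
    """Hypothetical repository-agnostic interface; not a DSpace or Fedora API."""

    @abstractmethod
    def ingest(self, item: IngestItem) -> str:
        """Store the item and return the repository's own identifier for it."""


class InMemoryRepository(Repository):
    """Stand-in backend. A real adapter would call the repository's own interface."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self.store = {}

    def ingest(self, item: IngestItem) -> str:
        local_id = f"{self.prefix}:{item.identifier}"
        self.store[local_id] = item
        return local_id


def ingest_everywhere(items, repositories):
    """Send every item to every backend through the shared interface."""
    receipts = {}
    for item in items:
        receipts[item.identifier] = [repo.ingest(item) for repo in repositories]
    return receipts


backends = [InMemoryRepository("repo-a"), InMemoryRepository("repo-b")]
items = [IngestItem("img-0001", b"example bytes", {"title": "example"})]
receipts = ingest_everywhere(items, backends)
print(receipts)
```

The ingest loop never names a specific repository, so supporting another backend would mean writing one more adapter rather than changing the ingestion code.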
15	Old Dominion University: Background. Experiment with alternate archive architectures; create self-preserving digital objects. Project Approach. Build an ingestion tool to test individual file validity; use JHOVE, “file”, Fred, etc. to generate technical metadata; create an MPEG-21 DIDL that contains preservation analysis, technical metadata, original tar file, current tar file, and “deltas” (cf. diff/patch semantics) for intermediate versions; store DIDLs in self-contained, mobile archivelets (“buckets”); compare the archived version with versions available on the open Internet (original site, Google, Yahoo, IA, etc.).
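The diff/patch idea behind the "deltas" can be sketched with Python's standard `difflib`: keep the original, and for each later version record only the changed line ranges. This is a minimal illustration on lists of text lines, not the Old Dominion tooling. `make_delta` and `apply_delta` are hypothetical helpers, and how deltas are actually encoded inside the DIDL is not stated in the source.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the operations needed to turn `old` into `new`.

    Each entry is (start, end, replacement): lines old[start:end] are
    replaced by `replacement` (empty for a deletion; start == end for
    an insertion).
    """
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, new[j1:j2]))
    return delta


def apply_delta(old, delta):
    """Replay a delta against `old` to rebuild the newer version."""
    result = []
    cursor = 0
    for start, end, replacement in delta:
        result.extend(old[cursor:start])  # copy the unchanged run
        result.extend(replacement)        # add inserted or replacing lines
        cursor = end                      # skip deleted or replaced lines
    result.extend(old[cursor:])
    return result


# Three versions of a small file: original, one intermediate, current.
versions = [
    ["title", "body line", "footer"],
    ["title", "BODY LINE", "footer", "appendix"],
    ["title", "BODY LINE", "appendix"],
]

# Store the original plus one delta per step.
deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]

# Rebuild every later version from the original and the stored deltas.
rebuilt = []
state = versions[0]
for delta in deltas:
    state = apply_delta(state, delta)
    rebuilt.append(state)

print(rebuilt == versions[1:])
```

Storing the original, the current copy, and the deltas in between lets any intermediate version be rebuilt on demand without keeping a full copy of each one.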
16	Old Dominion University: Project Team. Professors: Michael L. Nelson, Johan Bollen. Graduate students: Giridhar Manepalli, Rabia Haq. Technologies: Bucket 3.0 Digital Objects; MPEG-21 DIDL; JHOVE, file, Fred; locally developed ingestion/conversion scripts.
17	Old Dominion University: Observations. Significant learning curve for MPEG-21 DIDL; hoping to incorporate MPEG-21 Rights Expression Language (REL) in the AIHT testbed; conversion utilities (e.g. ImageMagick) are assumed to exist outside of the archive and to be transient services. Significant discrepancies between archived and live site:
18	Stanford University: Background. The Stanford Digital Repository originally focused on highly normative bibliographic digital objects. The AIHT provides an opportunity to develop capabilities for real-world, non-normative collections.
19	Stanford University: Approach. Develop or integrate tools: Stanford Empirical Walker™, JHOVE. Automate digital collection assessment: technical metadata harvesting; structural description; preservation risk assessment.
20	Stanford University: Team. Richard Anderson – Programming; Keith Johnson – Project Management; Hannah Frost – Preservation Methodologies; Nancy Hoebelheinrich – Metadata; Jerry Persons – Information Architecture; Cathy Aster – Reporting and Financial Management. Technologies: METS, Harvard METS Toolkit, JHOVE, PREMIS, Java, Solaris, Windows.
21	Stanford University: Observations. User-supplied metadata can be messy and difficult to transform to a standard format. Expected preservability status: 70% HIGH, 27.5% ACCEPTABLE, 2.5% MINIMAL. A large file collection generates a large METS file, which requires lots of memory and processing power; parallel metadata hierarchy vs. single XML file. The PREMIS data elements/model look very promising for storing preservation status and methodologies.
22	Things We’ve Learned. Great minds don't think alike. Metadata is worldview. Simple operations are harder than you think. Support for forensics is essential. 1% times a big number is a big number. It's all triage.
23	Next Steps: next revision of Transfer Metadata format; work on inspection tools (Empirical Walker™); explore format registry (Fred); work on whole-archive export/ingest; work on format conversion (JPEG->TIFF); web sites as complex objects.