As many of you probably know, Stanford was involved in the Archive Ingest and Handling Test (AIHT), a project of NDIIPP in 2004-05. We collaborated with LC, Harvard, Old Dominion, and Johns Hopkins to ingest and disseminate the 9/11 archive collected by George Mason University.
That project has been covered here at DLF forums in the past, and the Dec 2005 issue of D-Lib Magazine was devoted to the subject. Official project reports are available from the NDIIPP web site.
So we are not here to report on that project again.
Nancy and I do want to share with this audience some of the specifics of a methodology that our team developed in the course of the AIHT project to automatically assess files in a digital collection for risks associated with their long-term preservation. I’ll be discussing the theory and concepts underlying the method and walking through the process.
NANCY will discuss the metadata implications of the assessment process, how we recorded the results in METS using PREMIS, and some of the issues regarding managing metadata over time.
The Stanford Digital Repository, from the early days of its conception, has been expected to offer a range of services, from bit preservation to format migrations, metadata encoding, and anything between and beyond, given available resources, technology, and know-how.
Levels of Service will be necessary because we anticipate a varied range of clients with varied needs.
This slide – which I liberally borrowed from Keith Johnson – concisely conveys our need for more than one (and no doubt more than these two) preservation workflows.
A prescriptive workflow for digital content created by the libraries in our digitization labs, which assumes that we have total control over the quality and integrity of our own output, and that it is more homogeneous in nature and thus easier to manage and more predictable over time.
We also need a reactive workflow, for “real world”, “off the street” content that we must assume is untamed and heterogeneous in nature.
A reactive workflow translates into repository services for
Stanford Curators who are acquiring: websites (both snapshots grabbed from web and entire archives from commercial enterprises)
electronic records in the personal papers of prolific poets (email, business records).
Then there’s the content produced by Stanford’s academic departments:
We are just kicking off a Faculty Advisory Board for the SDR to help develop, review, and advise on University policy on the preservation of Stanford's digital assets.
And then there is the talk of the possibility of non-Stanford entities availing themselves of SDR services.
We feel that any viable digital stewardship organization that accepts content from such varied sources, content that does not conform to de facto preservation “requirements”, must find a way to minimize unpredictability in its workflows to be efficient, especially if the repository intends to offer levels of service beyond bitstream preservation.
We decided that the best way to minimize unpredictability was to identify and gauge the risks associated with preserving digital content prior to its ingestion. It was clear that we needed some framework in which to make agreements between the repository and its clients (the content owners) about preservation commitments. Such a framework would also serve to inform and guide the development of repository services, e.g., metadata encoding, pre-ingestion transformation, long-term format migration and delivery, as well as simple bit preservation.
A few years ago, a small team of folks at Stanford set about researching file formats and the assessment of arbitrarily created digital objects, and conducted a close evaluation of technical metadata for image and text formats.
The research project resulted in a questionnaire designed to serve as a data collection tool, and to be used in a conversation between repository staff and content depositor.
In its original form, the questionnaire was limited in its utility. First of all, its scope was limited to text and image formats. And ultimately it served mostly to gauge and curb human expectations about long-term prospects for files that are the target of preservation services. It required manual analysis on a file-by-file basis, so data collection was time-consuming and awkward, and the results were not yet machine-actionable. While it felt like the right start, clearly it was inefficient and incomplete.
Then along came the AIHT project, which offered Stanford a substantial opportunity to:
-- design an ingestion workflow for a real world digital archive,
-- build on our previous format research and test our assessment methodology
-- by incorporating the assessment process into the workflow, begin to consider how to generate and record pertinent metadata
The test’s subject – the 9/11 archive -- was a true “off the street” collection. Its heterogeneous nature required us to expand the scope of our assessment tool to include a number of format types.
We also recognized the need to automate the assessment tool. Automation, we presumed, would enable the methodology to scale sufficiently, would provide more realistic and trustworthy data for technical assessment, and therefore would help us maintain more control over workflow.
It would at the same time allow us to treat a heterogeneous collection as a more manageable set of object classes, enabling directed investment in preservation actions and the possibility of levels of service beyond bit preservation for select classes.
Two key developments since the original work on file assessment and development of the questionnaire:
Carl Fleischhauer and Caroline Arms of LC Office of Strategic Initiatives presented their framework for evaluating formats under the title “Digital File Formats: Factors for Sustainability, Functionality, and Quality”
And the tool known as JHOVE was born.
I assume that many of you are familiar with JHOVE’s capabilities; in the interest of time, I won’t go into great detail. Suffice it to say, it greatly simplifies the process of identifying formats and exposing technical characteristics.
The work of Fleischhauer and Arms was something of a revelation, because it effectively generalizes and categorizes much of the information we were trying to capture, at a more detailed file level, in our original questionnaire.
We built on this work, adopting most of it, and developed a matrix for the analysis of predominant formats. It is this matrix that now serves as the basis underlying the SDR preservation assessment tool and associated format risk policies. The tool is encoded in XML and implemented in a program written in Java, which is used during repository ingestion to produce a Submission Information Package.
Assessment takes place in three steps.
What the matrix looks like:
Partial view: only a selection of common image formats (TIFF, JPEG, and GIF).
The full matrix includes a wide range of formats found in the AIHT target collection, including a number of other image formats, as well as text (plain and mark-up), office documents, audio, video, animation, and page viewer formats.
Formats are evaluated against the five sustainability factors listed in the left columns. The format scores positively for a given factor if that factor is considered not to impede future digital preservation efforts; a negative score is assigned when the factor is expected to hinder preservation. Negative scores accumulate into the Format Score, and the more of them a format collects, the lower its status. A higher Format Score therefore indicates that a format's prospects for sustainability are poorer than those of a format with a lower score. If no match on the matrix is made, or if the MIME type is determined to be application/octet-stream, a file undergoing assessment receives a score of 5.
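To make the scoring rule concrete, here is a minimal sketch in Python (the actual tool is encoded in XML and implemented in Java, so this is only an illustration). It assumes the Format Score is simply the count of factors flagged as hindering preservation. The matrix rows and the MIME types `image/x-example-open` and `image/x-example-closed` are made up for the example and are not rows from the SDR matrix.

```python
from typing import Dict, Optional, Tuple

# From the text: files with no matrix match, or typed as
# application/octet-stream, receive a score of 5.
UNKNOWN_SCORE = 5

# Hypothetical matrix rows, one flag per sustainability factor.
# True means the factor is expected to hinder preservation.
FORMAT_MATRIX: Dict[str, Tuple[bool, bool, bool, bool, bool]] = {
    "image/x-example-open": (False, False, False, False, False),
    "image/x-example-closed": (True, True, False, True, True),
}


def format_score(mime_type: Optional[str]) -> int:
    """Return the Format Score: the number of negative factors (0 is best)."""
    if mime_type is None or mime_type == "application/octet-stream":
        return UNKNOWN_SCORE
    flags = FORMAT_MATRIX.get(mime_type)
    if flags is None:
        # No match on the matrix.
        return UNKNOWN_SCORE
    return sum(flags)
```

Under these assumptions an unrecognized file is treated exactly like a format that fails every factor, which matches the conservative default described above.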
The resulting Format Score is matched to a corresponding Preservation Quality Value and assigned a Policy Status.
The idea here is to apply qualitative terminology to assist in our understanding of and communication about the relative prognosis for the file or a collection of files. Such terminology can be useful in contract specifications, determination of service level commitments and so on.
Like many institutional repositories, Stanford also settled on the concept of a “preferred” status to focus our commitment to (and our depositors’ expectations of) the most thorough preservation services on a small number of formats. We hope this will result in more predictability in our workflow and increased efficiency of our preservation activities. Note that not all formats with a Format Score of zero automatically earn the “preferred” status; here we consider what Fleischhauer and Arms refer to as quality and functionality factors to help us choose. Similarly, assigning “preferred” status does not require a Format Score of zero; a format that is both highly suited to a specific purpose in our institutional context and free of risk factors does not always exist.
This slide puts the three charts together, displaying the format scores, quality values, and policy ratings associated with the image formats.
I threw in audio formats too, for further demonstration.
The supplemental analysis phase is where the file assessment result is fine-tuned. It enables our automated workflow to assess preservation risk against format-specific rules. The analysis has two primary goals.
The first is to examine the metadata created in the file inspection and validation process (produced by JHOVE) to identify:
- Invalid or not well-formed files
- Files with format-specific preservation risks
- Files with format-specific preservation advantages
We apply a set of rules to search for files that are exceptions to the general conditions embodied by the format matrix. Identified files are “red-flagged” and/or their Policy Status is adjusted as necessary.
The second goal of supplemental analysis is to identify files that are suitable candidates for preservation or normalization services, according to SDR policy. For instance, though files in Adobe Photoshop format currently carry a poor Format Score (4), those meeting certain technical profiles can be reformatted to PNG or TIFF with little or no loss, thereby creating equivalent derivatives with an improved Policy Status. Similarly, a file in MS Word can be reformatted to plain text, creating a highly preservable derivative of its textual content.
If the previous steps of the assessment process involve sustainability considerations that the greater digital preservation community agrees on by and large, then it is possible to see this step of the process as the place where local rules apply, where conditions particular to the local repository are in place (e.g., we don’t accept invalid files, we don’t store outdated versions of TIFFs).
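The rule-based red-flagging described above might look something like the following Python sketch. It is not the Empirical Walker's implementation: `FileReport` is a simplified, hypothetical stand-in for the metadata a validator such as JHOVE reports, the rule labels are invented, and the set of "outdated" TIFF versions is an assumed local policy chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical local policy: TIFF versions this repository treats as outdated.
OUTDATED_TIFF_VERSIONS = {"4.0", "5.0"}


@dataclass
class FileReport:
    """Simplified stand-in for validator output on a single file."""
    path: str
    mime_type: str
    version: str
    well_formed: bool
    valid: bool


# Each rule pairs a red-flag label with a predicate over the report.
Rule = Tuple[str, Callable[[FileReport], bool]]

RULES: List[Rule] = [
    ("not-well-formed", lambda r: not r.well_formed),
    ("invalid", lambda r: r.well_formed and not r.valid),
    ("outdated-tiff-version",
     lambda r: r.mime_type == "image/tiff"
     and r.version in OUTDATED_TIFF_VERSIONS),
]


def red_flags(report: FileReport) -> List[str]:
    """Return the labels of every rule that the file triggers."""
    return [label for label, applies in RULES if applies(report)]
```

Keeping the rules in a plain list, separate from the format matrix, is one way to keep local conditions apart from the more general sustainability factors.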
The comparison of the numbers suggests that the automated workflow in the Empirical Walker results in numbers quite close to those of the manual process, an indication that the Empirical Walker is a successful execution of the assessment methodology specification.
Can an automated assessment methodology, prototyped in Stanford’s Empirical Walker, help to maintain control over workflow and extend to the development of services beyond bit preservation in the long term? While the AIHT project provided Stanford with only a small taste of ingesting and handling “off-the-street” data in a preservation repository environment, we believe that current indications are net positive.
The preservation assessment process creates new, additional metadata to manage. The value of that metadata depends on whether the efficiency it brings to the preservation process outweighs the cost of creating and managing it over time. The costs of managing the assessment metadata we generated are currently unknown. The quantity of assessment metadata is relatively small, so the cost of its physical storage is not an immediate concern. However, maintaining the infrastructure to support the methodology is more daunting; it requires a complex layer of management that includes ongoing technology watch, format research, policy maintenance, and possibly deposit agreement renegotiations. Perhaps a federated approach to some of this activity, as a service to a community of repositories and their users, would be most economical. In any case, the costs to be borne are not inconsequential. And yet there are a number of ways in which the assessment data can inform decisions to be made at different points in the preservation cycle, and it is conceivable that, if used effectively, the data’s value in the decision-making process offsets or exceeds the cost to create and maintain it.
Currently in the process of specifying the boundaries and functional requirements of the SDR administrative system. Once this is more formal, we intend to incorporate assessment meth
The three-step assessment process is somewhat adaptable thanks to its modular design:
Ideally the process would be built to allow for swapping in new and updated tools.
It is possible to separate local rules and conditions from more general digital preservation guidelines.
The assessment outcome can be used to perform triage, that is, to prioritize preservation actions in a repository with limited resources.
Results can be used to inform decisions about when the most efficient time to act is. The preservation risk assessment metadata can be used to weigh the risks and costs of acting now, acting later, or delaying the decision.
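As one illustration of triage, the Python sketch below ranks assessed files so that the highest Format Scores (the poorest prospects) come first, and keeps only as many as the repository can act on. The function name `triage`, the "riskiest first" ordering, and the `capacity` parameter are assumptions made for this example; a real policy could weigh cost, collection value, or timing quite differently.

```python
from typing import List, Tuple


def triage(assessed: List[Tuple[str, int]], capacity: int) -> List[str]:
    """Pick up to `capacity` file paths, highest Format Score first.

    `assessed` holds (path, format_score) pairs; ties are broken by path
    so that the ordering is deterministic.
    """
    ranked = sorted(assessed, key=lambda item: (-item[1], item[0]))
    return [path for path, _score in ranked[:capacity]]


# Made-up sample data: paths and scores are illustrative only.
queue = triage(
    [("a.tif", 0), ("b.psd", 4), ("c.bin", 5), ("d.doc", 4)],
    capacity=3,
)
```

With the sample data above, `queue` holds the unidentified file first, followed by the two files that score 4.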
In devising the Format Score Matrix, none of the factors are weighted. We are still not sure how to do that effectively; it is something to investigate and test in future iterations of the method.
A second area identified as needing improvement is support for multiple MIME type mappings, because for some file types there is more than one possible MIME type. We need a way to resolve this ambiguity with some authority because it could affect the status assigned to the file further downstream in the assessment process.
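One simple way to approach the MIME type ambiguity is an alias table that maps every known variant to a single canonical type before the matrix lookup. The Python sketch below assumes that approach. The table is a tiny illustrative sample, and the choice of which variant counts as canonical is an assumption; it is not SDR policy and it does not supply the "authority" the method still needs.

```python
# Illustrative alias table: each variant maps to one canonical MIME type.
# The choice of canonical form here is an assumption for the example.
MIME_ALIASES = {
    "audio/x-wav": "audio/vnd.wave",
    "audio/wav": "audio/vnd.wave",
    "image/pjpeg": "image/jpeg",
}


def canonical_mime(mime_type: str) -> str:
    """Normalize case, drop parameters, and resolve known aliases."""
    base = mime_type.strip().lower().split(";", 1)[0].strip()
    return MIME_ALIASES.get(base, base)
```

Resolving aliases before scoring means that two files in the same format cannot end up with different statuses merely because they were labeled differently.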
The current method is designed to assess the preservability of individual files, not complex or compound objects. Future methods should have the capability to test for the integrity of compound objects.