1
Format Dependencies in Repository Operation
  • Stephen L. Abrams
  • Gary McGath
  • Harvard University Library
2
Introduction
  • Almost all aspects of repository operation are dependent upon the format of the objects in the repository
  • Without proper characterization of digital objects (format typing and technical metadata), effective long-term preservation is difficult, if not impossible
  • Repositories need to ensure that:
    • Digital object content streams are valid with respect to their format
    • Metadata encapsulated within object content streams are consistent with externally supplied metadata
3
A Case Study
  • Retrospective examination of the digital objects deposited in the Harvard University Library repository over its 4-year production history
    • Are they valid?
    • Do we have consistent characterizations in repository metadata?
4
Digital Repository Service (DRS)
  • 4 years of production operation
    • 1.9 million objects (7.5 TB)
    • 27 administrative units with custodial oversight
    • 10 depositing agents
    • 5 on-campus conversion labs; 3 commercial vendors
  • Administrative/technical metadata stored in Oracle 9i
  • Content streams stored on RAID file system
  • Technical metadata consistent with evolving standards
    • Image metadata: NISO Z39.87
    • Audio metadata: AES-X098B
    • Text metadata: METS text extension schema
5
DRS Growth Trend
6
Efficiency and Automation
  • We are experiencing exponential growth in the number of objects and their size
    • Accessioning on the order of 8,000 objects/day
  • At the same time, staffing and budget remain constant
  • We need to exploit operational efficiencies wherever possible
  • Efficiency comes from automation
    • We can’t afford manual intervention
7
DRS Submission
  • Submission Information Package (SIP) composed of content files and an XML control file FTP’ed to a drop box
8
SIP Control File
  • Contains externally-supplied administrative, structural, and technical metadata for all content files in the SIP
9
Object and Metadata Validity
  • Depositing agent is responsible for creating/validating the SIP
  • Technical metadata is (typically) supplied by the conversion lab that created the objects
    • Nominally based on DRS specifications
    • Values specified by staff as parameters to creating software
  • Are the objects submitted for deposit being created properly and are they being characterized properly?


10
A Test of Object and Metadata Validity
  • 1,152,454 DRS objects checked for validity and metadata consistency in August 2004
  • 97 hours running time
    • Sun E450, 2×300 MHz, 2 GB RAM, Solaris 2.6
  • Object validity determined by JHOVE
  • Internal metadata extracted by JHOVE and tested for consistency with DRS metadata
11
JHOVE
  • JSTOR/Harvard Object Validation Environment
    • Extensible Java framework for format-specific identification, validation, and characterization
  • Plug-in modules available for:
    • AIFF / AIFF-C
    • ASCII
    • GIF (87a, 89a)
    • HTML (3.2, 4.0, 4.01, XHTML 1.0 and 1.1) (forthcoming)
    • JPEG (ISO 10918-1, JFIF, Exif, SPIFF, JTIP, JPEG-LS)
    • JPEG 2000 (JP2, JPX)
    • PDF (1.0-1.4, Linearized, Tagged, PDF/X, PDF/A)
    • TIFF (4.0-6.0, Class B, G, P, R, Y, TIFF/IT, TIFF/EP, Exif, GeoTIFF, FX)
    • UTF-8
    • WAVE / BWF
    • XML 1.0
12
Summary Results
  • Object validity determined by JHOVE
    • A small but non-trivial number of objects are not valid
      • 6,323 (1.4%) of all TIFF objects
      • 937 (4%) of all XML objects
      • 101 (<1%) of all JPEG objects
      • TIFFs make up 40% of all DRS objects; XML, 2%; and JPEG, 35%
  • Internal metadata extracted by JHOVE and tested for consistency with DRS metadata
    • Significant numbers of inconsistencies
    • Some systemic errors
13
Object Validity Errors
  • Some errors are technical violations that nevertheless do not generally affect interpretation of object content
    • 18,139 instances of value constraint violations
    • 5,954 instances of non-word-aligned offsets
    • 368 instances of components out of sequence
    • 5 instances of unknown data structures
14
DateTime Format Errors
  • TIFF DateTime tag
    • “The format is: “YYYY:MM:DD HH:MM:SS”, with hours like those on a 24-hour clock” [TIFF Revision 6.0, p. 31]
    • “2004.02.12 13.02.16”
    • “2004:00:09 10:08:34”
    • “Mon Apr 10 12:41:59 1995”
  • Vendor tools include:  imgPrep [v. 2.27], iFilters [v. 2.44], ImageGear [11.00.023], Pixel Translations PixView 3.0, TMS Sequoia ScanFix 4.01, ImageSpace 3.2, …
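The DateTime rule quoted above can be checked mechanically. The following is a minimal sketch, not JHOVE's implementation; the function name `is_valid_tiff_datetime` is hypothetical, and the first sample string is an assumed corrected form of the first invalid value on this slide.

```python
import re
from datetime import datetime

# TIFF 6.0 DateTime layout: "YYYY:MM:DD HH:MM:SS" on a 24-hour clock.
_TIFF_DATETIME = re.compile(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}")


def is_valid_tiff_datetime(value: str) -> bool:
    """Return True if value has the TIFF layout and names a real date and time."""
    if not _TIFF_DATETIME.fullmatch(value):
        return False  # wrong separators or wrong overall shape
    try:
        datetime.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return False  # right shape, impossible value (for example, month 00)
    return True


samples = [
    "2004:02:12 13:02:16",       # assumed corrected form (hypothetical)
    "2004.02.12 13.02.16",       # periods instead of colons
    "2004:00:09 10:08:34",       # month 00
    "Mon Apr 10 12:41:59 1995",  # different layout entirely
]
for sample in samples:
    print(repr(sample), is_valid_tiff_datetime(sample))
```

Only the first sample passes; the three values taken from the slide are rejected.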
15
Non-Word-Aligned Offsets
  • TIFF offsets must be word-aligned
    • “The [IFD] directory may be at any location in the file after the header but must begin on a word boundary. … The Value is expected to begin on a word boundary; the corresponding Value Offset will thus be an even number” [TIFF Revision 6.0, pp. 13, 15]
    • IFD at offset 62811
    • Tag value at offset 1365
  • Vendor tools include: Nikon D100, Photoshop 7.0, ScanFix, Enhanced Pixel Translations Inc., PIXTIFF Version 54.2.21, …
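A word-alignment check needs only the 8-byte TIFF header: a byte-order mark, the number 42, and the offset of the first IFD. The sketch below is illustrative rather than JHOVE's own code; the function names are hypothetical, and the header is synthesized to carry the odd offset reported on this slide.

```python
import struct


def first_ifd_offset(header: bytes) -> int:
    """Read the offset of the first IFD from an 8-byte TIFF header."""
    byte_order = header[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian
    elif byte_order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF header")
    magic, offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a TIFF header")
    return offset


def is_word_aligned(offset: int) -> bool:
    """TIFF words are 2 bytes, so an aligned offset is an even number."""
    return offset % 2 == 0


# Synthetic little-endian header whose first IFD sits at offset 62811.
header = b"II" + struct.pack("<HI", 42, 62811)
offset = first_ifd_offset(header)
print(offset, is_word_aligned(offset))
```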
16
Components Out of Sequence
  • TIFF tags must be in numerical order
    • “The entries in an IFD must be sorted in ascending order by Tag” [TIFF Revision 6.0, p. 15]
    • …
    •     00000202: 34665 (34665) LONG 1 = 770
    •     00000214: 34853 (GPSInfo) LONG 1 = 824
    •     00000226: 34675 (InterColourProfile) UNDEFINED 480
    •     …
  • Some software may have tag order dependencies
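The ordering rule is simple to test once the tag numbers of an IFD have been read. A minimal sketch, with a hypothetical function name, applied to the three tag numbers shown in the dump above:

```python
def out_of_sequence_tags(tags):
    """Return (previous, current) pairs where a tag is smaller than its predecessor."""
    return [(prev, cur) for prev, cur in zip(tags, tags[1:]) if cur < prev]


# Tag numbers in the order they appear in the dump on this slide.
print(out_of_sequence_tags([34665, 34853, 34675]))  # [(34853, 34675)]
```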
17
Unknown Data Structures
  • TIFF tags and data types
    • “Readers should skip over fields containing an unexpected field type” [TIFF Revision 6.0, p. 16]
    • Undefined tag 0 of undefined type 0
    • The specification gives no instruction on how to treat unknown tags
18
Object Validity Errors
  • Other errors have substantial effect on proper interpretation of object content
    • 2,284 instances of value constraint violations
    • 938 instances of external dependencies
    • 108 instances of non-well-formedness
    • 96 instances of invalid headers
    • 3 instances of invalid component definition
    • 1 instance of unexpected EOF
19
Value Constraint Violations
  • TIFF TileLength and TileWidth values are constrained to multiples of 16
    • “TileWidth … [and] TileLength must be a multiple of 16” [TIFF Revision 6.0, pp. 67-68]
    • TileLength value is 129
    • Visual corruption
  • Vendor tools include: Scitex Leaf Volare, Leaf ColorShop 5.x
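The tile constraint reduces to a modulus test. In this sketch the function name is hypothetical and the TileWidth of 128 is an assumed value; only the TileLength of 129 comes from the slide.

```python
def tile_dimension_errors(tile_width, tile_length):
    """List the tile dimensions that are not positive multiples of 16."""
    errors = []
    for name, value in (("TileWidth", tile_width), ("TileLength", tile_length)):
        if value <= 0 or value % 16 != 0:
            errors.append(f"{name} value {value} is not a positive multiple of 16")
    return errors


# TileWidth of 128 is assumed for illustration; 129 is the reported TileLength.
print(tile_dimension_errors(128, 129))
```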
20
Value Constraint Violations
21
External Dependencies
  • XML files with hardcoded local DTD filepath
    • <!DOCTYPE indexMap SYSTEM
    •    "file:///home/indxadm/prod/conf/indexMap.dtd">
    • Attempts to validate in contexts without a local copy of the DTD fail
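A pre-deposit check could flag such DOCTYPE declarations before the file leaves the system on which the path resolves. The sketch below uses a regular expression instead of a full XML parser, which is an assumption made for brevity; the function name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Captures the system identifier of a DOCTYPE declaration (SYSTEM or PUBLIC form).
_SYSTEM_ID = re.compile(
    r"""<!DOCTYPE\s+[^\s>\[]+\s+(?:SYSTEM|PUBLIC\s+(?:"[^"]*"|'[^']*'))\s+(?:"([^"]*)"|'([^']*)')"""
)


def local_dtd_path(xml_text):
    """Return the system identifier if it points at the local file system, else None."""
    match = _SYSTEM_ID.search(xml_text)
    if match is None:
        return None
    system_id = match.group(1) if match.group(1) is not None else match.group(2)
    parsed = urlparse(system_id)
    if parsed.scheme == "file" or (parsed.scheme == "" and system_id.startswith("/")):
        return system_id
    return None


doctype = '<!DOCTYPE indexMap SYSTEM\n   "file:///home/indxadm/prod/conf/indexMap.dtd">'
print(local_dtd_path(doctype))
```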





22
Non-Well-Formedness
  • XML files not well-formed
    • ID “471319” referenced, but not defined
    • ID “599384” not unique
    • End tag </FileGrp> with no corresponding start tag
    • References to non-extant or non-unique IDs can lead to software failures
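The three kinds of problem listed above can be detected with the Python standard library. This is a sketch under stated assumptions: identifiers are taken to live in an `ID` attribute and references in a `FILEID` attribute, and the element names in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_ids(xml_text, id_attr="ID", ref_attr="FILEID"):
    """Report parse failures, duplicate IDs, and references to undefined IDs."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    ids = Counter(e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None)
    refs = {e.get(ref_attr) for e in root.iter() if e.get(ref_attr) is not None}
    problems = [f'ID "{i}" not unique' for i, n in ids.items() if n > 1]
    problems += [f'ID "{r}" referenced, but not defined' for r in sorted(refs) if r not in ids]
    return problems


# Hypothetical sample reproducing the two ID problems from the slide.
sample = '<mets><file ID="599384"/><file ID="599384"/><fptr FILEID="471319"/></mets>'
for problem in check_ids(sample):
    print(problem)
```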
23
Invalid Headers
  • JPEG files do not start with an SOI marker (0xFFD8)
    • “SOI: Start of image marker – Marks the start of a compressed image represented in the interchange format or abbreviated format” [ISO/IEC 10918-1:1994, p. 34]
    • SOI marker is the JPEG magic number; without it, rendering tools will not recognize the file as a JPEG
  • Vendor tools include: Phase One FX, PhotoShop 6.x
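Testing for the SOI marker takes a two-byte comparison. A minimal sketch with a hypothetical function name:

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker


def starts_with_soi(data: bytes) -> bool:
    """Return True if the byte stream begins with the JPEG SOI marker."""
    return data[:2] == SOI


print(starts_with_soi(b"\xff\xd8\xff\xe0"))  # True
print(starts_with_soi(b"\x00\x00\xff\xd8"))  # False
```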
24
Invalid Component Definition
  • TIFF strip definition
    • Combination of starting offset and length defines a strip that extends beyond the physical size of the file
    • Visual corruption
    • Potential buffer overflow security hole
      • Cf. the recent MSIE JPEG buffer overflow exploit (US-CERT TA04-260A)
  • Vendor tools include: ScanFix, Enhanced Pixel Translations Inc., PIXTIFF Version 54.2.210
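A bounds check on strip definitions compares each offset plus byte count against the file size. In this sketch the function name and all numeric values are hypothetical; the two lists stand for the values of the TIFF StripOffsets and StripByteCounts tags.

```python
def strips_outside_file(strip_offsets, strip_byte_counts, file_size):
    """Return the indexes of strips that extend past the end of the file."""
    bad = []
    for index, (offset, count) in enumerate(zip(strip_offsets, strip_byte_counts)):
        if offset + count > file_size:
            bad.append(index)
    return bad


# Hypothetical values: the second strip would end at byte 8200 of a 6000-byte file.
print(strips_outside_file([8, 4104], [4096, 4096], 6000))  # [1]
```

Rejecting such strips before any read is attempted also avoids the out-of-bounds access behind the buffer overflow concern.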
25
Invalid Component Definition
26
Unexpected End-of-File
  • JPEG file truncated prior to deposit
    • Irretrievable loss of data
  • Vendor tools include: EZImage, Phase One Light Phase
  • Transferred to DRS in a damaged condition; SIP checksum value matched
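A matching checksum shows only that the file was not altered in transfer; it cannot show that the file was complete to begin with. One inexpensive structural test is to look for the end-of-image marker (0xFFD9). This is a heuristic sketch with a hypothetical function name: some otherwise complete files carry extra bytes after the marker, so a failure is a prompt for inspection rather than proof of truncation.

```python
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def appears_truncated(data: bytes) -> bool:
    """Heuristic: flag a JPEG stream that does not end with the EOI marker."""
    return not data.endswith(EOI)


# Synthetic stand-in for a JPEG stream: SOI, filler bytes, EOI.
complete = b"\xff\xd8" + b"\x00" * 10 + b"\xff\xd9"
print(appears_truncated(complete))       # False
print(appears_truncated(complete[:-5]))  # True
```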
27
Unexpected End-of-File
28
External vs. Internal Metadata
  • We uncovered a significant number of cases of discrepancies between values reported by the DRS metadata (derived from SIP data) and those reported by JHOVE
    • 100,000 instances of image channel size mismatches
    • 29,683 instances of image size mismatches
    • 1,068 instances of invalid data values
    • 401 instances of resolution mismatches
    • 160 instances of compression type mismatches
29
Image Channel Size Mismatches
  • TIFF and JPEG BitsPerSample values are not consistent
    • BitsPerSample reported as “8”; should be “8 8 8”
    • BitsPerSample reported as “8 8 8”; should be “8”
    • BitsPerSample reported as “8 8 8”; should be “16 16 16”
    • BitsPerSample reported as “16 16 16”; should be “8 8 8”
    • Human error; failure to use the correct SIP template
    • BitsPerSample reported as “24”; should be “8 8 8”
    • Deprecated form of SIP control file markup
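The cases above fall into two groups: values that simply disagree, and a single total (“24”) that matches the sum of the per-channel values. A comparison routine can tell them apart. This is a sketch only; the function name and the category labels are hypothetical.

```python
def classify_bits_per_sample(reported: str, actual: str) -> str:
    """Compare a reported BitsPerSample string with the value read from the object."""
    reported_values = [int(v) for v in reported.split()]
    actual_values = [int(v) for v in actual.split()]
    if reported_values == actual_values:
        return "consistent"
    # A single number equal to the sum of the channels suggests the total-bits form.
    if len(reported_values) == 1 and len(actual_values) > 1 \
            and reported_values[0] == sum(actual_values):
        return "deprecated total form"
    return "mismatch"


for reported, actual in [("8", "8 8 8"), ("8 8 8", "16 16 16"), ("24", "8 8 8")]:
    print(reported, "|", actual, "|", classify_bits_per_sample(reported, actual))
```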
30
Image Size Mismatches
  • Image height and width values are not consistent
    • Image size reported as “1274 x 2240”; should be “2240 x 1274”
    • Values are transposed, most probably by a (non-JHOVE) metadata extraction tool
    • Image size reported as “5104 x 6614”; should be “6738 x 5268”
    • SIP value may reflect the cropped, rather than the full image size
    • Image height is 3 times as likely to be wrong as image width
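Transposed values can be separated automatically from other disagreements, which helps when deciding which records can be corrected in bulk. A minimal sketch, with a hypothetical function name and labels, applied to the two examples above:

```python
def classify_size_mismatch(reported, actual):
    """Compare a reported pair of dimensions with the pair read from the object."""
    if reported == actual:
        return "consistent"
    if reported == (actual[1], actual[0]):
        return "transposed"
    return "different"


print(classify_size_mismatch((1274, 2240), (2240, 1274)))  # transposed
print(classify_size_mismatch((5104, 6614), (6738, 5268)))  # different
```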
31
Invalid Data Values
  • Invalid values in DRS metadata
    • DRS reports “” (zero-length string) as the value for image X and/or Y resolution
    • SIP control file contains empty element “<xres></xres>”
    • Empty element is interpreted by loader as a zero-length string, not null
    • SIP control file defined by DTD, not a Schema that would permit numeric typing of the <xres> element
  • Error in DRS loader, not the objects themselves
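Without schema typing, the loader has to treat an empty element as a missing value itself. The sketch below shows one way to do that in Python; it is not the DRS loader's code, the function name is hypothetical, and the `<image>` wrapper and `<yres>` element are assumed for illustration.

```python
import xml.etree.ElementTree as ET


def optional_number(element):
    """Return the element's numeric value, or None when it is absent or empty."""
    if element is None or element.text is None or not element.text.strip():
        return None
    return float(element.text)


# Hypothetical control-file fragment with an empty <xres> element.
record = ET.fromstring("<image><xres></xres><yres>300</yres></image>")
print(optional_number(record.find("xres")))  # None
print(optional_number(record.find("yres")))  # 300.0
```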
32
Image Resolution Mismatches
  • Image resolution values are not consistent
    • Resolution reported as “300 x 400”; should be “400 x 300”
    • Values are transposed, most probably by a metadata extraction tool
33
Compression Value Mismatches
  • TIFF compression types inconsistent
    • Compression reported as “4” (Group 4); should be “5” (LZW)
    • Compression reported as “4” (Group 4); should be “1” (none)
    • Compression reported as “1” (none); should be “5” (LZW)
    • Human error; failure to use the correct SIP template
34
Processing System Mismatches
  • Many formats provide an internal property to report systems used to create the formatted objects
    • TIFF reports System/Software tags as “Topax IX/Lino Color 6.x”; should be “Nikon D100/Photoshop 7.0”
    • JPEG reports Software as “DeBabelizer Pro 4.5”; should be “Photoshop 5.2”
    • Many tools applied downstream in a processing path appear to overwrite any trace of their predecessors
35
Why Should We Care?
  • Validity errors
    • Ability to preserve information content is compromised
  • Consistency errors
    • Reliance on external metadata to guide workflow processes
    • Inconsistencies breed doubt: which one is correct?
  • We work closely with vendors, who are highly technically competent; what will happen when we work with faculty, students, and staff?
    • Minimize the opportunities for these errors to occur
36
DSIP – DRS SIP Packaging Tool
  • Automated tool for creating DRS SIP control files based on technical metadata extracted from the SIP objects by JHOVE
  • Implemented by a custom JHOVE output handler
  • Local configuration files permit the specification of metadata that JHOVE cannot provide, and the overriding of extracted values, if appropriate
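The precedence described above (configured values fill gaps, and explicit overrides replace extracted values) can be sketched as a dictionary merge. This illustrates the idea only and is not DSIP's implementation; the function name and every field name and value are hypothetical.

```python
def build_sip_metadata(extracted, defaults=None, overrides=None):
    """Merge metadata: overrides beat extracted values, which beat defaults."""
    merged = dict(defaults or {})
    merged.update(extracted)
    merged.update(overrides or {})
    return merged


# Hypothetical field names and values.
extracted = {"bitsPerSample": "8 8 8", "software": "Photoshop 7.0"}
defaults = {"owner": "Example Library"}            # something extraction cannot supply
overrides = {"software": "Scanner capture suite"}  # deliberate replacement

print(build_sip_metadata(extracted, defaults, overrides))
```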
37
DSIP/JHOVE in Repository Workflow
38
Validation at Point of Deposit
  • Try to discover errors as far upstream as possible
  • With DSIP, format errors can be caught and corrected before transfer to the repository
    • It’s the least disruptive point to deal with errors
    • It’s the point at which errors can be most easily handled with the support of the object owner, curator, creator
    • It’s the point at which errors can be handled with the least expense
  • Having depositing agents/object creators correct errors helps to focus their attention on format quality concerns
39
Retrospective Cleanup of DRS
  • Where appropriate and practicable, create clean versions of invalid objects
    • We probably won’t worry about dates
    • We will concentrate on objects that cannot be rendered properly
  • Update metadata to bring it into consistency with the objects
40
“Trust, but verify”
  • Question assumptions about the behavior of tools and vendors
  • Integrate validation and extraction tools into repository workflows to reduce human error
  • RLG Automatic Exposure
    • Encourage tool and system vendors to populate image formats with the fullest possible set of correct metadata
      • However, values must not be provided solely for the purpose of compliance
      • A wrong or misleading value is worse than no value at all
    • Encourage vendors to “play well with others”
      • Tools should not erase traces of previous processing
41
“Spare the rod, spoil the child”
  • Forgiving software encourages sloppy practice
  • Just because poorly formed objects are usable today doesn’t mean that they will be usable in the future
    • Just because a creation or processing tool does not return an error condition does not mean the object is correct
    • Just because an object renders does not mean that it is correct
  • Since it has become easy to distinguish between the two, we should demand well-formed objects and not tolerate malformed objects
42
“All errors are equal; but some are more equal than others”
  • An invalid date can probably be ignored without consequence; an invalid header cannot
    • How do we determine the appropriate balance between strict compliance and collection development?
  • What should be done with malformed or invalid objects?
    • The DRS is not under an obligation to accept any object; an institutional repository may not have this latitude
    • Accept, but normalize on deposit?
    • Accept as is, but with the proviso of lower service level?
43
More Information