1
|
- Stephen L. Abrams
- Gary McGath
- Harvard University Library
|
2
|
- Almost all aspects of repository operation are dependent upon the format
of the objects in the repository
- Without proper characterization of digital objects (format typing and
technical metadata), effective long-term preservation is difficult, if
not impossible
- Repositories need to ensure that:
- Digital object content streams are valid with respect to their format
- Metadata encapsulated within object content streams are consistent with
externally supplied metadata
|
3
|
- Retrospective examination of the digital objects deposited in the
Harvard University Library repository over its 4-year production history
- Are they valid?
- Do we have consistent characterizations in repository metadata?
|
4
|
- 4 years of production operation
- 1.9 million objects (7.5 TB)
- 27 administrative units with custodial oversight
- 10 depositing agents
- 5 on-campus conversion labs; 3 commercial vendors
- Administrative/technical metadata stored in Oracle 9i
- Content streams stored on RAID file system
- Technical metadata consistent with evolving standards
- Image metadata: NISO Z39.87
- Audio metadata: AES-X098B
- Text metadata: METS text extension schema
|
5
|
|
6
|
- We are experiencing exponential growth in the number of objects and
their size
- Accessioning on the order of 8,000 objects/day
- At the same time, staffing and budget remain constant
- We need to exploit operational efficiencies wherever possible
- Efficiency comes from automation
- We can’t afford manual intervention
|
7
|
- Submission Information Package (SIP) composed of content files and an
XML control file FTP’ed to a drop box
|
8
|
- Contains externally-supplied administrative, structural, and technical
metadata for all content files in the SIP
|
9
|
- Depositing agent is responsible for creating/validating the SIP
- Technical metadata is (typically) supplied by the conversion lab that
created the objects
- Nominally based on DRS specifications
- Values specified by staff as parameters to creating software
- Are the objects submitted for deposit being created properly and are
they being characterized properly?
|
10
|
- 1,152,454 DRS objects checked for validity and metadata consistency in
August 2004
- 97 hours running time
- Sun E450, 2 × 300 MHz, 2 GB RAM, Solaris 2.6
- Object validity determined by JHOVE
- Internal metadata extracted by JHOVE and tested for consistency with DRS
metadata
|
11
|
- JSTOR/Harvard Object Validation Environment
- Extensible Java framework for format-specific identification,
validation, and characterization
- Plug-in modules available for:
- AIFF / AIFF-C
- ASCII
- GIF (87a, 89a)
- HTML (3.2, 4.0, 4.01, XHTML 1.0 and 1.1) (forthcoming)
- JPEG (ISO 10918-1, JFIF, Exif, SPIFF, JTIP, JPEG-LS)
- JPEG 2000 (JP2, JPX)
- PDF (1.0-1.4, Linearized, Tagged, PDF/X, PDF/A)
- TIFF (4.0-6.0, Class B, G, P, R, Y, TIFF/IT, TIFF/EP, Exif, GeoTIFF,
FX)
- UTF-8
- WAVE / BWF
- XML 1.0
|
12
|
- Object validity determined by JHOVE
- A small but non-trivial number of objects are not valid
- 6,323 (1.4%) of all TIFF objects
- 937 (4%) of all XML objects
- 101 (<1%) of all JPEG objects
- TIFFs make up 40% of all DRS objects; XML, 2%; and JPEG, 35%
- Internal metadata extracted by JHOVE and tested for consistency with
DRS metadata
- Significant numbers of inconsistencies
- Some systemic errors
|
13
|
- Some errors are technical violations that nevertheless do not generally
affect interpretation of object content
- 18,139 instances of value constraint violations
- 5,954 instances of non-word-aligned offsets
- 368 instances of components out of sequence
- 5 instances of unknown data structures
|
14
|
- TIFF DateTime tag
- “The format is: “YYYY:MM:DD HH:MM:SS”, with hours like those on a
24-hour clock” [TIFF Revision 6.0, p. 31]
- “2004.02.12 13.02.16”
- “2004:00:09 10:08:34”
- “Mon Apr 10 12:41:59 1995”
- Vendor tools include: imgPrep [v.
2.27], iFilters [v. 2.44], ImageGear [11.00.023], Pixel Translations
PixView 3.0, TMS Sequoia ScanFix 4.01, ImageSpace 3.2, …
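
The DateTime rule quoted above can be checked mechanically. The following is a minimal sketch, not JHOVE's implementation; the function name `is_valid_tiff_datetime` is hypothetical, and the check covers only the 19-character "YYYY:MM:DD HH:MM:SS" layout.

```python
from datetime import datetime

def is_valid_tiff_datetime(value: str) -> bool:
    """Return True if value follows the 'YYYY:MM:DD HH:MM:SS' layout."""
    # strptime tolerates unpadded fields, so enforce the fixed width first.
    if len(value) != 19:
        return False
    try:
        datetime.strptime(value, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        # Wrong separators, month 00, hour 25, and similar cases land here.
        return False
    return True

# The three malformed samples from the slide, plus one well-formed value.
for sample in ("2004.02.12 13.02.16",
               "2004:00:09 10:08:34",
               "Mon Apr 10 12:41:59 1995",
               "2004:02:12 13:02:16"):
    print(sample, is_valid_tiff_datetime(sample))
```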
|
15
|
- TIFF offsets must be word-aligned
- “The [IFD] directory may be at any location in the file after the
header but must begin on a word boundary….The Value Offset is expected
to begin on a word boundary; the corresponding Value Offset will thus
be an even number” [TIFF Revision 6.0, pp. 13, 15]
- IFD at offset 62811
- Tag value at offset 1365
- Vendor tools include: Nikon D100, Photoshop 7.0, ScanFix, Enhanced Pixel
Translations Inc., PIXTIFF Version 54.2.21, …
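
Word alignment means the offset is an even number. The sketch below reads the first IFD offset from the 8-byte TIFF header and tests it. It is illustrative only; the helper names are hypothetical and the sample header is constructed by hand to reproduce the offset from the slide.

```python
import struct

def first_ifd_offset(header: bytes) -> int:
    """Read the first IFD offset from the 8-byte TIFF header."""
    if header[:2] == b"II":      # little-endian byte order
        order = "<"
    elif header[:2] == b"MM":    # big-endian byte order
        order = ">"
    else:
        raise ValueError("not a TIFF header")
    magic, offset = struct.unpack(order + "HI", header[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    return offset

def is_word_aligned(offset: int) -> bool:
    """TIFF 6.0 words are 2 bytes, so an aligned offset is even."""
    return offset % 2 == 0

# Hand-built header pointing at the odd IFD offset reported on the slide.
header = b"II" + struct.pack("<HI", 42, 62811)
print(first_ifd_offset(header), is_word_aligned(first_ifd_offset(header)))
```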
|
16
|
- TIFF tags must be in numerical order
- “The entries in an IFD must be sorted in ascending order by Tag” [TIFF
Revision 6.0, p. 15]
- …
- 00000202: 34665 (34665) LONG 1 = 770
- 00000214: 34853 (GPSInfo) LONG 1 = 824
- 00000226: 34675 (InterColourProfile) UNDEFINED 480
- …
- Some software may have tag order dependencies
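
The ordering rule can be tested by comparing each tag number with its predecessor. This is a minimal sketch with a hypothetical function name; it flags only descending pairs and does not look for duplicate tags.

```python
def out_of_sequence(tags):
    """Return (previous, current) pairs where the tag number decreases."""
    return [(prev, cur) for prev, cur in zip(tags, tags[1:]) if cur < prev]

# Tag numbers in the order shown in the IFD dump above.
print(out_of_sequence([34665, 34853, 34675]))  # [(34853, 34675)]
```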
|
17
|
- TIFF tags and data types
- “Readers should skip over fields containing an unexpected field type” [TIFF
Revision 6.0, p. 16]
- Undefined tag 0 of undefined type 0
- The specification gives no instruction on how to treat unknown tags
|
18
|
- Other errors have substantial effect on proper interpretation of object
content
- 2,284 instances of value constraint violations
- 938 instances of external dependencies
- 108 instances of non-well-formedness
- 96 instances of invalid headers
- 3 instances of invalid component definition
- 1 instance of unexpected EOF
|
19
|
- TIFF TileLength and TileWidth values are constrained to multiples of 16
- “TileWidth … [and] TileLength must be a multiple of 16” [TIFF Revision
6.0, pp. 67-68]
- TileLength value is 129
- Visual corruption
- Vendor tools include: Scitex Leaf Volare, Leaf ColorShop 5.x
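
The tile constraint reduces to a modulo test. In this sketch the function name is hypothetical and the TileWidth of 128 is an assumed value for illustration; only the TileLength of 129 comes from the slide.

```python
def tile_dimension_errors(tile_width: int, tile_length: int):
    """List the tile dimensions that are not multiples of 16."""
    errors = []
    for name, value in (("TileWidth", tile_width), ("TileLength", tile_length)):
        if value % 16 != 0:
            errors.append(f"{name} value {value} is not a multiple of 16")
    return errors

# TileWidth 128 is assumed; TileLength 129 is the reported value.
print(tile_dimension_errors(128, 129))
```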
|
20
|
|
21
|
- XML files with hardcoded local DTD filepath
- <!DOCTYPE indexMap SYSTEM
"file:///home/indxadm/prod/conf/indexMap.dtd">
- Attempts to validate in contexts without a local copy of the DTD fail
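
A pre-deposit check could flag such declarations before validation is attempted. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it only catches system identifiers that use the `file:` scheme, not bare absolute paths.

```python
import re

# Matches <!DOCTYPE name SYSTEM "identifier"> and captures the identifier.
DOCTYPE_SYSTEM = re.compile(r'<!DOCTYPE\s+\S+\s+SYSTEM\s+["\']([^"\']+)["\']')

def local_dtd_reference(xml_text: str):
    """Return the system identifier if it is a local file: URL, else None."""
    match = DOCTYPE_SYSTEM.search(xml_text)
    if match and match.group(1).startswith("file:"):
        return match.group(1)
    return None

doc = '<!DOCTYPE indexMap SYSTEM\n "file:///home/indxadm/prod/conf/indexMap.dtd">'
print(local_dtd_reference(doc))
```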
|
22
|
- XML files not well-formed
- ID “471319” referenced, but not defined
- ID “599384” not unique
- End tag </FileGrp> with no corresponding start tag
- References to nonexistent or non-unique IDs can lead to software
failures
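
Duplicate and dangling IDs can be found without a validating parser. This is a minimal sketch that assumes the identifier attribute is named `ID` and the referencing attribute `FILEID`; both names, the function name, and the sample document are assumptions for illustration. A document with a stray end tag such as `</FileGrp>` would fail earlier, at `ET.fromstring`, with a `ParseError`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def check_ids(xml_text: str, id_attr: str = "ID", ref_attr: str = "FILEID"):
    """Return (duplicate IDs, referenced-but-undefined IDs), both sorted."""
    root = ET.fromstring(xml_text)
    ids = Counter(e.get(id_attr) for e in root.iter()
                  if e.get(id_attr) is not None)
    refs = {e.get(ref_attr) for e in root.iter()
            if e.get(ref_attr) is not None}
    duplicates = sorted(i for i, count in ids.items() if count > 1)
    undefined = sorted(refs - set(ids))
    return duplicates, undefined

# Hypothetical document reproducing the two ID problems from the slide.
sample = ('<mets><file ID="599384"/><file ID="599384"/>'
          '<fptr FILEID="471319"/></mets>')
print(check_ids(sample))
```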
|
23
|
- JPEG files do not start with a SOI marker (0xFFD8)
- “SOI: Start of image marker – Marks the start of a compressed image
represented in the interchange format or abbreviated format” [ISO/IEC
10918-1:1994, p. 34]
- SOI marker is the JPEG magic number; without it, rendering tools will
not recognize the file as a JPEG
- Vendor tools include: Phase One FX, PhotoShop 6.x
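
Testing for the SOI marker takes a two-byte comparison. The sketch below uses a hypothetical function name and hand-made byte strings rather than real image files.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker

def starts_with_soi(data: bytes) -> bool:
    """Return True if the byte stream begins with the SOI marker."""
    return data[:2] == SOI

print(starts_with_soi(b"\xff\xd8\xff\xe0"))  # True
print(starts_with_soi(b"\x00\x00\xff\xd8"))  # False
```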
|
24
|
- TIFF strip definition
- Combination of starting offset and length defines a strip that extends
beyond the physical size of the file
- Visual corruption
- Potential buffer overflow security hole
- Cf. the recent MSIE JPEG buffer overflow exploit (US-CERT TA04-260A)
- Vendor tools include: ScanFix, Enhanced Pixel Translations Inc., PIXTIFF
Version 54.2.210
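
A reader can guard against this case by comparing each strip's extent with the file size before reading. This is a minimal sketch under the assumption that the StripOffsets and StripByteCounts values have already been extracted; the function name and sample numbers are hypothetical.

```python
def strips_out_of_bounds(offsets, byte_counts, file_size):
    """Return indexes of strips whose offset + length exceeds the file size."""
    return [index
            for index, (offset, count) in enumerate(zip(offsets, byte_counts))
            if offset + count > file_size]

# Hypothetical values: the second strip runs 96 bytes past a 1,000-byte file.
print(strips_out_of_bounds([8, 500], [492, 596], 1000))  # [1]
```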
|
25
|
|
26
|
- JPEG file truncated prior to deposit
- Irretrievable loss of data
- Vendor tools include: EZImage, Phase One Light Phase
- Transferred to DRS in a damaged condition; SIP checksum value matched
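
A matching checksum only shows that the file arrived as it was sent, so truncation has to be detected from the content itself. One heuristic, sketched below with a hypothetical function name, is to test for the JPEG end-of-image marker (0xFFD9) at the end of the stream. This is an assumption about what a check could look like, not a description of how JHOVE detected the case, and files with trailing bytes after the marker would need a more tolerant test.

```python
EOI = b"\xff\xd9"  # JPEG end-of-image marker

def ends_with_eoi(data: bytes) -> bool:
    """Return True if the byte stream ends with the EOI marker."""
    return data.endswith(EOI)

complete = b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9"
truncated = complete[:10]  # simulate a file cut short before deposit
print(ends_with_eoi(complete), ends_with_eoi(truncated))  # True False
```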
|
27
|
|
28
|
- We uncovered a significant number of cases of discrepancies between
values reported by the DRS metadata (derived from SIP data) and those
reported by JHOVE
- 100,000 instances of image channel size mismatches
- 29,683 instances of image size mismatches
- 1,068 instances of invalid data values
- 401 instances of resolution mismatches
- 160 instances of compression type mismatches
|
29
|
- TIFF and JPEG BitsPerSample values are not consistent
- BitsPerSample reported as “8”; should be “8 8 8”
- BitsPerSample reported as “8 8 8”; should be “8”
- BitsPerSample reported as “8 8 8”; should be “16 16 16”
- BitsPerSample reported as “16 16 16”; should be “8 8 8”
- Human error; failure to use the correct SIP template
- BitsPerSample reported as “24”; should be “8 8 8”
- Deprecated form of SIP control file markup
|
30
|
- Image height and width values are not consistent
- Image size reported as “1274 x 2240”; should be “2240 x 1274”
- Values are transposed, most probably by a (non-JHOVE) metadata
extraction tool
- Image size reported as “5104 x 6614”; should be “6738 x 5268”
- SIP value may reflect the cropped, rather than the full image size
- Image height is 3 times as likely to be wrong as image width
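
A consistency check can separate transposed values from other mismatches, which points to different root causes. The sketch below is illustrative; the function name and result labels are hypothetical, and each pair is assumed to be given in the same order on both sides.

```python
def compare_dimensions(reported, extracted):
    """Classify a reported value pair against the extracted pair."""
    if reported == extracted:
        return "consistent"
    if reported == extracted[::-1]:
        return "transposed"
    return "mismatch"

# The two examples from the slide: SIP-reported value vs. extracted value.
print(compare_dimensions((1274, 2240), (2240, 1274)))  # transposed
print(compare_dimensions((5104, 6614), (6738, 5268)))  # mismatch
```

The same comparison applies to any two-valued property, such as X and Y resolution.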
|
31
|
- Invalid values in DRS metadata
- DRS reports “” (zero-length string) as the value for image X and/or Y
resolution
- SIP control file contains empty element “<xres></xres>”
- Empty element is interpreted by the loader as a zero-length string, not null
- SIP control file defined by DTD, not a Schema that would permit numeric
typing of the <xres> element
- Error in DRS loader, not the objects themselves
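
A loader can map empty elements to a null value explicitly instead of storing a zero-length string. This is a minimal sketch of that idea, not the DRS loader's code; the function name is hypothetical and the resolution is assumed to be a plain decimal number.

```python
import xml.etree.ElementTree as ET

def numeric_or_none(element):
    """Return the element text as a float, or None if the element is empty."""
    text = (element.text or "").strip()
    if not text:
        return None  # empty element: store null, not ""
    return float(text)

print(numeric_or_none(ET.fromstring("<xres></xres>")))     # None
print(numeric_or_none(ET.fromstring("<xres>300</xres>")))  # 300.0
```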
|
32
|
- Image resolution values are not consistent
- Resolution reported as “300 x 400”; should be “400 x 300”
- Values are transposed, most probably by a metadata extraction tool
|
33
|
- TIFF compression types inconsistent
- Compression reported as “4” (Group 4); should be “5” (LZW)
- Compression reported as “4” (Group 4); should be “1” (none)
- Compression reported as “1” (none); should be “5” (LZW)
- Human error; failure to use the correct SIP template
|
34
|
- Many formats provide an internal property to report systems used to
create the formatted objects
- TIFF reports System/Software tags as “Topax IX/Lino Color 6.x”; should
be “Nikon D100/Photoshop 7.0”
- JPEG reports Software as “DeBabelizer Pro 4.5”; should be “Photoshop
5.2”
- Many tools applied downstream in a processing path appear to overwrite
any trace of their predecessors
|
35
|
- Validity errors
- Ability to preserve information content is compromised
- Consistency errors
- Reliance on external metadata to guide workflow processes
- Inconsistencies breed doubt: which one is correct?
- We work closely with vendors, who are highly technically competent; what
will happen when we work with faculty, students, and staff?
- Minimize the opportunities for these errors to occur
|
36
|
- Automated tool for creating DRS SIP control files based on technical
metadata extracted from the SIP objects by JHOVE
- Implemented by a custom JHOVE output handler
- Local configuration files permit the specification of metadata that
JHOVE cannot provide, and the overriding of extracted values, if
appropriate
|
37
|
|
38
|
- Try to discover errors as far upstream as possible
- With DSIP, format errors can be caught and corrected before transfer to
the repository
- It’s the least disruptive point at which to deal with errors
- It’s the point at which errors can be most easily handled with the
support of the object owner, curator, creator
- It’s the point at which errors can be handled with the least expense
- Having depositing agents/object creators correct errors helps to focus
their attention on format quality concerns
|
39
|
- Where appropriate and practicable, create clean versions of invalid
objects
- We probably won’t worry about dates
- We will concentrate on objects that cannot be rendered properly
- Update metadata to bring into consistency
|
40
|
- Question assumptions about the behavior of tools and vendors
- Integrate validation and extraction tools into repository workflows to
reduce human error
- RLG Automatic Exposure
- Encourage tool and system vendors to populate image formats with the
fullest possible set of correct metadata
- However, values must not be provided solely for the purpose of
compliance
- A wrong or misleading value is worse than no value at all
- Encourage vendors to “play well with others”
- Tools should not erase traces of previous processing
|
41
|
- Forgiving software encourages sloppy practice
- Just because poorly formed objects are usable today doesn’t mean that
they will be usable in the future
- Just because a creation or processing tool does not return an error
condition does not mean the object is correct
- Just because an object renders does not mean that it is correct
- Since it has become easy to distinguish between the two, we should
demand well-formed objects and not tolerate malformed objects
|
42
|
- An invalid date can probably be ignored without consequence; an invalid
header cannot
- How do we determine the appropriate balance between strict compliance
and collection development?
- What should be done with malformed or invalid objects?
- The DRS is not under an obligation to accept any object; an
institutional repository may not have this latitude
- Accept, but normalize on deposit?
- Accept as is, but with the proviso of lower service level?
|
43
|
|