Almost all aspects
of repository operation are conditioned by the format of the objects in the
repository. The Harvard University Library Digital Repository Service (DRS)
has been in production operation for four years and has over 1.5 million
digital objects (7 TB) under managed storage.
A recent comparison
of internal technical metadata extracted from these objects with the external
metadata supplied in the objects' Submission Information Packages (SIPs)
revealed some troubling inconsistencies. Additionally, a small percentage of
objects were found to be invalid or malformed with respect to their formats.
A post-mortem investigation determined that the causes of these problems
included both human and system failures, some of which recurred systematically
for particular formats.
We report on the
findings of this effort and discuss systems that are now in place and under
development for automated SIP construction and pre-ingest validation intended
to mitigate such problems in the future. We also present an update on JHOVE,
the JSTOR/Harvard Object Validation Environment, which supports format-specific
object identification, validation, and characterization.