PROPOSAL FOR A STUDY OF ELECTRONIC
JOURNAL ARCHIVING
Submitted to the Andrew W. Mellon Foundation
October 13, 2000
The Harvard University Library is requesting funding from the
Andrew W. Mellon Foundation to create a plan for the archiving of
electronic journals, in response to a letter of invitation
received from the Foundation in August 2000.
BACKGROUND
The Harvard University Library, from its founding in 1638, has
sustained a commitment to preserve its vast research collections
for current and future use by scholars worldwide. In the
twentieth century, traditional book repair and rebinding
activities were complemented by a preservation microfilming
program that is more vigorous today than ever before and
continues to strengthen. Programs for collections conservation,
special collections conservation, digitization, studio
photography, and sound re-recording are equally vigorous, as are
preservation field services, disaster preparedness activities,
and efforts to upgrade environmental controls in library
buildings throughout the institution.
Two years ago the Harvard libraries received a $12 million
grant from the University to build the infrastructure needed to
support the creation, storage, and delivery of digital
library collections. Since then a team of library and computer
specialists has been assembled, and the development of many basic
architectural components for Harvard's digital library is now
well underway. We believe that this infrastructure will provide
the basis for a robust digital preservation program. Although we
already have enough experience to be keenly aware that
establishment of a preservation program for digital library
resources will be a complex and expensive undertaking, we
anticipate that building such a program on top of the library's
general digital collections infrastructure will help to control
development and operating costs. The components of our
infrastructure that we believe are applicable to a digital
archiving project include:
- A repository system to support reliable, sustained storage
and management of very large numbers of digital objects;
- A "naming" system to support the identification and location
of digital objects in a persistent and system-independent
way;
- A generalized model of access management to protect
intellectual property;
- Expertise in the design of metadata schemes, the information
necessary to manage, display, preserve, and re-purpose digital
objects;
- Scalable operations support to ensure the professional
operation of an ongoing production service for the ingestion and
storage of a significant number of digital objects.
The library has funding to carry development of this
infrastructure forward for at least three more years.
Harvard's level of acquisition of published digital resources
is growing rapidly, and the University now lists more than 2,000
journals and databases in its electronic gateway. Shared expenses
for electronic resources across Harvard now exceed one million
dollars, and combined expenditures for the purchase of digital
resources by individual Harvard libraries are at least equal to
this. Strategically, the growth of digital acquisitions at
Harvard has been guided by collection priorities and the
cultivation of best practices rather than by market trends. Most
journals that we subscribe to electronically are individually
selected by subject bibliographers on the basis of merit rather
than being acquired through publisher-driven "all-or-nothing"
packages. Licensing guidelines have been developed to advise the
libraries and prospective vendors in the acquisition and
negotiation process, with an emphasis on fair use protections,
archival rights, and vendor accountability.
Currently, most of the electronic journals we offer are also
acquired in paper format. In large part this is because both
librarians and the faculty they serve are concerned about
preserving archival access for the long term, for which currently
only the paper copy will suffice. This duplication involves
duplicate costs and efforts, and is not sustainable over time.
Addressing the archiving of electronic journals is thus
strategically important to Harvard.
PLANNING TOPICS
Digital preservation is a new and very immature field. An
E-journal archiving program will involve the exploration of many
complex and ill-defined issues. During the planning process we
will analyze a number of key areas, with the intent of developing
sufficient understanding to allow us to prepare a concrete plan
for a four-year archiving project. Among the key areas we will
address in the planning process are:
Target content
We face at least two key questions regarding content. The first
is, "What specific journals will we archive?" We envision a
three-part selection strategy.
- We will seek at least one arrangement with a publisher
providing a significant volume of articles (the "anchor tenant")
to test the scaling of our archive.
- We will seek agreement with at least one scholarly society
with an active publishing program. Working with these two
sources, which we believe will present quite different
perspectives on the project, we will develop a model for the
archive-publisher relationship.
- We will analyze the list of E-journals for which we acquire
only digital copies, and select a number of titles
for archiving. These titles will likely involve quite different
challenges from those with the "anchor tenant," reflecting
different business models, technical expertise, and institutional
roles. We are also very likely to try to extend our archiving
efforts to include non-journal electronic serials, where we are
acquiring an increasing number of titles with equally pressing
preservation issues.
The second content-related question has to do with which of
the various components of the selected journals we will encompass
in the archive. Journal articles are reasonably straightforward
as objects. They have obvious granularity, well-developed
conventions and systems for bibliographic description, and
established identification schemes (SICI, DOI). The other parts
of journals are less obvious. Does one include such textual
elements as letters to the editor or book reviews?
Advertisements? Does one need to replicate the "issue" as
originally published, including such elements as cover and table
of contents page? Are links in articles included, and what
responsibility for these does the archive assume? E-journals are
beginning to include sophisticated related materials (models,
data sets, simulations, programs, etc.). Are these to be
included? And preserved? A key task in our planning year will be
to analyze such additional content, create an inventory of types,
draft a policy addressing how such materials should be handled,
and discuss with our peers this proposed policy (see "External
relationships" below).
Discussions with publishers
A critical element in any attempt to archive commercial
intellectual property in electronic form will be negotiating the
respective rights and responsibilities of publishers and
archiving institutions. To what degree will publishers bear the
cost of preparing archival formats of published materials? Or of
creating the metadata required for archiving? Who will be able to
access archived materials, and under what circumstances? Will
publishers support part of the cost of archiving?
We will use the planning year to begin discussions with
selected publishers about these issues. Given the uncertainty of
funding for archival operations, we cannot begin negotiating
archiving licenses, but we hope to raise both our own level of
sophistication and that of possible project publishing partners,
and to begin to build a consensus regarding what "normal"
practice will be in this area.
Technical requirements
While it is popular to contend there are more policy than
technical issues involved in archiving, Harvard believes that
there are indeed a great number of technical issues and
developments involved in a sophisticated archiving program. Among
the areas for investigation are:
- Accession automation. In order to control costs and
make a large-scale operation possible, as much as possible of the
process of adding materials to the archive will need to be
handled by computers rather than people. Areas such as ingestion,
quality control, format transformations, and metadata ingestion
and creation lend themselves to automation.
- Format investigation. The cost and complexity of
archiving will depend to a significant degree on the technical
formats of the objects archived. Investigation of the formats
available from publishers, as well as the formats preferable for
archiving, will be a key planning issue. It may be necessary to
develop a policy regarding what formats will be accepted into the
archive.
- On-going validation or auditing. Archival collections
will require periodic evaluation to ensure against the loss of
content or viability. Some form of systems support for such
auditing will be required. Auditing may also require third-party
access facilities (discussed below).
- Bibliographic control. It is not clear what level of
bibliographic control over the archival collection will be
needed. It is likely that some new bibliographic facility will be
required for at least some objects (such as journal issues, as
opposed to articles).
- Naming. Analysis is needed to determine what archival
objects will require formal naming, where such names should be
recorded (particularly outside of the archiving institution), and
what the relationship is between archived objects and existing
formal naming systems such as the Digital Object Identifier.
- Access management. While Harvard has established a
generalized access management facility for digital content, it is
primarily oriented towards internal users. Depending on the
nature of archiving licenses, extensions will probably be needed
to provide for appropriate external access. (Such access will
presumably be offered at the institutional rather than individual
level, based on the Mellon guideline specifying that the primary
responsibility of an archive should be to institutional
subscribers.)
- Storage strategy. As envisioned, Harvard's archiving
project is likely to involve a considerable amount of content.
During the planning year we will create estimates of the number
of objects involved and their size, and devise an appropriate
strategy to minimize the combined cost of storage and
operations.
- Output facilities. A key issue will be who has access
to archived materials, and what the service expectations of such
users will be. Will we provide Harvard users with day-to-day
access to materials, perhaps as part of ongoing validation? If
so, a reasonably sophisticated user interface to objects will be
required. Will selected outside users have access in order to
validate the functioning of the archive? If so, what delivery
format will be required? The Mellon guidelines require that
archives deliver objects to other institutional subscribers when
the "fail safe" is exercised. Will there be standard formats in
which such objects will be delivered?
These issues will be investigated during the planning phase of
the proposed project to determine the nature and scale of
developments required, and to create a plan and timeline.
Prototype and exploratory developments will be pursued in the
planning year as appropriate.
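As an illustration only, the systems support for ongoing validation noted above could rest on a simple fixity check: a checksum is recorded for each object at accession, and later audits recompute the checksums to detect loss or alteration. The following is a minimal sketch in Python under stated assumptions. The function names (build_manifest, audit), the one-file-per-object directory layout, and the choice of SHA-256 are hypothetical and are not drawn from Harvard's existing infrastructure.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root (hypothetical accession step)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the archive's current contents with the stored manifest."""
    current = build_manifest(root)
    return {
        # Objects recorded at accession that can no longer be found.
        "missing": sorted(set(manifest) - set(current)),
        # Objects still present whose bytes no longer match the record.
        "altered": sorted(
            name
            for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Objects present in storage but never recorded at accession.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

A check of this kind detects loss or corruption of stored bits only. It does not address the separate question of whether a format remains usable over time, which would need to be assessed by other means.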
Internal institutional roles
Traditional preservation activities are a shared
responsibility between curatorial departments and preservation
technologists. At Harvard, curatorial responsibility is heavily
decentralized along topical subject and language lines. It seems
unlikely that the scope of an archiving program will be defined
in terms of curatorial responsibility, particularly since such a
program will likely involve inter-institutional arrangements. Yet
archiving necessarily involves fundamental curatorial issues,
such as deciding when to migrate materials and to what format,
and periodically assessing the state of the archival collection.
The planning process will address how to allocate responsibility
across current (or new) institutional structures.
External relationships
No institution will archive solely for its own use, nor rely
solely on its own archiving program. Therefore no institution can
decide on the nature of its archiving operation in isolation.
Harvard proposes to work with its peers in the NorthEast Research
Libraries consortium (NERL) on several key aspects of its
program:
- requirements for archiving licenses, and the negotiation of
such licenses,
- decisions about what specific components of journals should
be included in an archive,
- the standards and "good practices" an archive must follow to
be relied upon by other institutions subscribing to the archived
content,
- the appropriate level of inter-institutional auditing,
and
- the output formats that must be immediately available from an
archive in order to respond to the access needs of other
institutions.
Costs and economic issues
Harvard believes that the most important single economic issue
in archiving electronic journals will be managing costs. The
level of archiving eventually undertaken, and the distribution of
responsibilities and costs among the players will be very
sensitive to the total cost of the enterprise. It is critical
that the archiving process be designed to minimize marginal
costs. We suspect that three factors will have the greatest
impact on costs:
- effective automation of the archiving process to reduce labor
and complexity;
- sophisticated selection of archiving hardware and software;
and
- the efficiencies inherent in large-scale archiving
operations.
All three factors will be a constant focus of our efforts in
both the planning and implementation phases.
A proposed model for archive cost distribution will be
addressed during the planning phase. For paper-based
publications, archiving costs were borne in a largely unplanned
pattern, with the great majority of libraries and scholars
relying on the preservation activities of a relatively small
number of institutions. In the archiving institutions, the costs
of preservation (the binding of journals, for example) were also
largely bundled into collections budgets, as much of the required
activity was also related to daily services to users. With
digital publications, archiving and day-to-day access can be (and
are frequently likely to be) separately calculated and budgeted.
This will require that the community more explicitly address the
issue of cost distribution.
There are three immediately obvious sources of support for
archiving: publishers, the digital archiving institutions, and
other subscribing institutions (particularly those that would
previously have archived paper copies of a title). In addition,
there may be new sources of support, including governments (some
of which have been active in the support of microform-based
preservation), foundations, scholarly societies, and cooperatives
of various sorts (library, scholarly, higher education). It seems
likely that the pattern for support will not be the same for all
types of publications. The funding structure for archiving the
publications of a scholarly society, for example, may be
different from that for publications issued by a for-profit
publisher or by an academic department.
During the planning phase, we will undertake several
initiatives related to long-term funding:
- analysis of different types of publications for which
different sources of funding might be appropriate;
- exploration with publishers of the distribution of costs,
particularly those related to the preparation of objects and
metadata for archiving; and
- discussions within NERL about the distribution of costs,
perhaps making this an explicit part of the negotiations at the
point of acquisition. (NERL's primary role today is the joint
licensing of commercial electronic resources.)
A key product of the planning phase will be a proposed budget
for the archiving project, addressing staffing (development,
operations, and management), hardware and software acquisitions
and maintenance, third-party service vendor costs, and
overhead.
PROJECT STAFFING
The planning project will be carried out primarily by a
Project Manager with both library and technology experience,
working with various teams within Harvard and without. The
Project Manager will be Y. Kathy Kwan, a librarian with
significant systems experience who recently joined Harvard after
serving as an Associate Fellow at the National Library of
Medicine. Kwan will be dedicated full time to the planning
project. Within the Harvard libraries we will form two primary
teams:
Project Steering Committee. This group will be composed
of senior curators, preservation experts, and library systems
staff, and will address functional and organizational issues.
Proposed members of the Committee are:
- Marianne Burke (Assistant Director for Resource Management,
Countway Library of Medicine)
- Dale Flecker (Associate Director for Planning and Systems,
Harvard University Library)
- Diane Garner (Librarian for the Social Sciences, Harvard
College Library)
- Jeffrey Horrell (Associate Librarian of Harvard College for
Collections)
- Y. Kathy Kwan (Project Manager)
- Jan Merrill-Oldham (Malloy-Rabinowitz Preservation
Librarian)
- Constance Rinaldo (Librarian, Ernst Mayr Library of the
Museum of Comparative Zoology)
- Lynne Schmelz (Librarian for the Sciences, Harvard College
Library)
- MacKenzie Smith (Digital Library Projects Manager)
Technical Team. An internal team composed of staff with
significant experience in digital library development will
investigate technical issues and systems requirements. Proposed
members of the Team include:
- Stephen Chapman (Preservation Librarian for Digital
Projects)
- Dale Flecker (Associate Director for Planning and Systems,
Harvard University Library)
- Y. Kathy Kwan (Project Manager)
- MacKenzie Smith (Digital Library Projects Manager)
- Robin Wendler (Metadata Analyst)
In addition, developers from the current Library Digital
Initiative technical staff with appropriate expertise (e.g.,
repositories, access management) will participate on the
Technical Team as appropriate.
A fundamental feature of the LDI development process is the
regular open review of all technical analyses and designs by a
group of technical experts drawn from across the university.
Representatives from academic computing groups, libraries, and
museums, and the senior staff of the chief information officer
participate in these reviews. The same methodology will be used
in reviewing all technical work in the archiving project.
PLAN OF WORK
Policy and functional issues will be addressed by the Steering
Committee, and technical issues will be handled by the Technical
Team. Project activities will be coordinated and documented by
the Project Manager. A programmer will be involved in preparing
technical specifications and in exploratory development.
1. Review literature and current practices (Quarter one)
- Review current efforts in digital archiving,
specifically E-journal archiving.
2. Secure core publishers (Quarters one and two)
- Approach a potential "anchor tenant" and one scholarly
society publisher to explore interest in project
participation.
- Discuss in detail with the committed publishers the business
and technical issues associated with archiving of E-journals,
including title selection, right of access, archival format,
metadata creation, and cost distribution.
3. Analyze targeted journals (Quarters one and two)
- Examine content of the targeted journals and inventory their
types and formats.
- Decide which components will be archived.
4. Foster partnership with NERL (Quarters one and three)
- Meet with NERL members to establish common goals and
requirements for an E-journal archive, including content analysis
of targeted journals, archival format, good/established practices
in handling different data types, trusted access, licensing
issues, cost distribution, and ongoing audit.
- Formulate policy for the archive based on the outcome of
the above discussions.
5. Define functional requirements (Quarters one, two and
three)
- Develop requirements for the archive in the areas of
accession, archival format, validation/audit, bibliographic
control, metadata, naming, access management, storage, and output
facilities.
6. Design high-level system architecture (Quarters two,
three, and four)
- Design high-level system architecture based on functional
requirements.
- Propose specifications for software and hardware.
7. Select additional titles (Quarters two, three, and
four)
- Add to the project a selection of titles that Harvard has
acquired in electronic form only.
- Discuss project participation with respective
publishers.
8. Explore internal institutional roles (Quarters two, three,
and four)
- Conduct exploratory discussions with relevant parties within
Harvard to determine how responsibility, such as selection of
content, format migration, system maintenance, and assessment of
the state of the archival collection, will be distributed.
9. Exploratory development (Quarters three and four)
- Test the validity and explore possible inadequacy of
functional requirements as drafted.
- Select, with agreement from a relevant publisher, a very
small sample for the test.
- Develop a system prototype for exploration.
- Analyze the result to refine the proposed functional
requirements and system design.
10. Define the archive policy (Quarter four)
- Consolidate the outcome from discussions with publishers,
NERL, Harvard institutions, and the project team, to formulate
appropriate policy for the E-journal archive.
- Identify all players for implementation and clarify their
roles.
11. Estimate the cost of an archiving project (Quarter
four)
- Chart all parts of the E-journal archive implementation with
proposed timeline.
- Estimate software and hardware needs, including acquisition
and maintenance costs and scale.
- Estimate staffing needs.
- Estimate the overall budget for program implementation.
12. Prepare the program implementation plan (Quarter
four)
- Report the findings of the planning year.
- Propose the implementation plan for an E-journal archive at
Harvard, detailing its mission, guiding policies, scope,
participating parties, technical infrastructure, budget, and
estimated outcome.