PROPOSAL FOR A STUDY OF ELECTRONIC JOURNAL ARCHIVING

Submitted to the Andrew W. Mellon Foundation
October 13, 2000

The Harvard University Library is requesting funding from the Andrew W. Mellon Foundation to create a plan for the archiving of electronic journals, in response to a letter of invitation received from the Foundation in August 2000.

BACKGROUND

The Harvard University Library, from its founding in 1638, has sustained a commitment to preserve its vast research collections for current and future use by scholars worldwide. In the twentieth century, traditional book repair and rebinding activities were complemented by a preservation microfilming program that is more vigorous today than ever before and continues to strengthen. Programs for collections conservation, special collections conservation, digitization, studio photography, and sound re-recording are equally vigorous, as are preservation field services, disaster preparedness activities, and efforts to upgrade environmental controls in library buildings throughout the institution.

Two years ago the Harvard libraries received a $12 million grant from the University to build the infrastructure needed to support the creation, storage, and delivery of digital library collections. Since then a team of library and computer specialists has been assembled, and the development of many basic architectural components for Harvard's digital library is now well underway. We believe that this infrastructure will provide the basis for a robust digital preservation program. Although we already have enough experience to be keenly aware that establishment of a preservation program for digital library resources will be a complex and expensive undertaking, we anticipate that building such a program on top of the library's general digital collections infrastructure will help to control development and operating costs. The components of our infrastructure that we believe are applicable to a digital archiving project include:

  • A repository system to support reliable, sustained storage and management of very large numbers of digital objects;
  • A "naming" system to support the identification and location of digital objects in a persistent and system-independent way;
  • A generalized model of access management to protect intellectual property;
  • Expertise in the design of metadata schemes, the information necessary to manage, display, preserve, and re-purpose digital objects;
  • Scalable operations support to ensure the professional operation of an ongoing production service for the ingestion and storage of a significant number of digital objects.

The library has funding to carry development of this infrastructure forward for at least 3 more years.

Harvard's level of acquisition of published digital resources is growing rapidly, and the University now lists more than 2,000 journals and databases in its electronic gateway. Shared expenses for electronic resources across Harvard now exceed one million dollars, and combined expenditures for the purchase of digital resources by individual Harvard libraries are at least equal to this. Strategically, the growth of digital acquisitions at Harvard has been directed by collection priorities and the cultivation of best practices rather than by market trends. Most journals that we subscribe to electronically are individually selected by subject bibliographers on the basis of merit rather than being acquired through publisher-driven "all-or-nothing" packages. Licensing guidelines have been developed to advise the libraries and prospective vendors in the acquisition and negotiation process, with an emphasis on fair use protections, archival rights, and vendor accountability.

Currently, most of the electronic journals we offer are also acquired in paper format. In large part this is because both librarians and the faculty they serve are concerned about preserving archival access for the long term, for which currently only the paper copy will suffice. This duplication incurs redundant costs and effort, and is not sustainable over time. Addressing the archiving of electronic journals is thus strategically important to Harvard.

PLANNING TOPICS

Digital preservation is a new and very immature field. An E-journal archiving program will involve the exploration of many complex and ill-defined issues. During the planning process we will analyze a number of key areas, with the intent of developing sufficient understanding to allow us to prepare a concrete plan for a 4-year archiving project. Among the key areas we will address in the planning process are:

Target content

We face at least 2 key questions regarding content. The first is, "What specific journals will we archive?" We envision a three-part selection strategy.

  • We will seek at least one arrangement with a publisher providing a significant volume of articles (the "anchor tenant") to test the scaling of our archive.
  • We will seek agreement with at least one scholarly society with an active publishing program. Working with these two sources, which we believe will present quite different perspectives on the project, we will develop a model for the archive-publisher relationship.
  • We will analyze the list of E-journals for which we acquire only digital copies, and select a number of titles for archiving. These titles will likely involve quite different challenges from those with the "anchor tenant," reflecting different business models, technical expertise, and institutional roles. We are also very likely to try to extend our archiving efforts to include non-journal electronic serials, where we are acquiring an increasing number of titles with equally pressing preservation issues.

The second content-related question has to do with which of the various components of the selected journals we will encompass in the archive. Journal articles are reasonably straightforward as objects. They have obvious granularity, well-developed conventions and systems for bibliographic description, and established identification schemes (SICI, DOI). The other parts of journals are less obvious. Does one include such textual elements as letters to the editor or book reviews? Advertisements? Does one need to replicate the "issue" as originally published, including such elements as cover and table of contents page? Are links in articles included, and what responsibility for these does the archive assume? E-journals are beginning to include sophisticated related materials (models, data sets, simulations, programs, etc.). Are these to be included? And preserved? A key task in our planning year will be to analyze such additional content, create an inventory of types, draft a policy addressing how such materials should be handled, and discuss with our peers this proposed policy (see "Discussion of NERL role" below).

Discussions with publishers

A critical element in any attempt to archive commercial intellectual property in electronic form will be negotiating the respective rights and responsibilities of publishers and archiving institutions. To what degree will publishers bear the cost of preparing archival formats of published materials? Or of creating the metadata required for archiving? Who will be able to access archived materials, and under what circumstances? Will publishers support part of the cost of archiving?

We will use the planning year to begin discussions with selected publishers about these issues. Given the uncertainty of funding for archival operations, we cannot begin negotiating archiving licenses, but we hope to raise both our own level of sophistication and that of possible project publishing partners, and to begin to build a consensus regarding what "normal" practice will be in this area.

Technical requirements

While it is popular to contend there are more policy than technical issues involved in archiving, Harvard believes that there are indeed a great number of technical issues and developments involved in a sophisticated archiving program. Among the areas for investigation are:

  • Accession automation. In order to control costs and make a large-scale operation possible, as much as possible of the process of adding materials to the archive will need to be handled by computers rather than people. Areas such as ingestion, quality control, format transformations, and metadata ingestion and creation lend themselves to automation.
  • Format investigation. The cost and complexity of archiving will depend to a significant degree on the technical formats of the objects archived. Investigation of the formats available from publishers, as well as the formats preferable for archiving, will be a key planning issue. It may be necessary to develop a policy regarding what formats will be accepted into the archive.
  • On-going validation or auditing. Archival collections will require periodic evaluation to ensure against the loss of content or viability. Some form of systems support for such auditing will be required. Auditing may also require third-party access facilities (discussed below).
  • Bibliographic control. It is not clear what level of bibliographic control over the archival collection will be needed. It is likely that some new bibliographic facility will be required for at least some objects (such as journal issues, as opposed to articles).
  • Naming. Analysis is needed to determine what archival objects will require formal naming, where such names should be recorded (particularly outside of the archiving institution), and what the relationship is between archived objects and existing formal naming systems such as the Digital Object Identifier.
  • Access management. While Harvard has established a generalized access management facility for digital content, it is primarily oriented towards internal users. Depending on the nature of archiving licenses, extensions will probably be needed to provide for appropriate external access. (Such access will presumably be offered at the institutional rather than individual level, based on the Mellon guideline specifying that the primary responsibility of an archive should be to institutional subscribers.)
  • Storage strategy. As envisioned, Harvard's archiving project is likely to involve a considerable amount of content. During the planning year we will create estimates of the number of objects involved and their size, and devise an appropriate strategy to minimize the combined cost of storage and operations.
  • Output facilities. A key issue will be who has access to archived materials, and what the service expectations of such users will be. Will we provide Harvard users with day-to-day access to materials, perhaps as part of ongoing validation? If so, a reasonably sophisticated user interface to objects will be required. Will selected outside users have access in order to validate the functioning of the archive? If so, what delivery format will be required? The Mellon guidelines require that archives deliver objects to other institutional subscribers when the "fail safe" is exercised. Will there be standard formats in which such objects will be delivered?

These issues will be investigated during the planning phase of the proposed project to determine the nature and scale of developments required, and to create a plan and timeline. Prototype and exploratory developments will be pursued in the planning year as appropriate.
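
As one illustration of the "on-going validation or auditing" item above, the following is a minimal sketch of a checksum-based audit. It is not part of the proposal and not a description of Harvard's systems. It assumes archived objects are held as files under a single directory and that a digest for each file is recorded at accession time; the SHA-256 algorithm, the function names (`sha256_of`, `build_manifest`, `audit`), and the report layout are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest.

    Hypothetical accession-time step: the result would be stored
    separately from the archived objects themselves.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the current contents of root against a stored manifest."""
    current = build_manifest(root)
    return {
        # Objects recorded at accession that can no longer be found.
        "missing": sorted(set(manifest) - set(current)),
        # Objects still present whose bytes no longer match the record.
        "altered": sorted(
            name
            for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Files present in the store that were never recorded.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

A periodic job could call `audit` against the stored manifest and report any non-empty list for follow-up. A check of this kind detects loss or alteration of bits only; it does not address whether a format remains usable, which the proposal treats as a separate question of viability.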

Internal institutional roles

Traditional preservation activities are a shared responsibility between curatorial departments and preservation technologists. At Harvard, curatorial responsibility is heavily decentralized along topical subject and language lines. It seems unlikely that the scope of an archiving program will be defined in terms of curatorial responsibility, particularly since such a program will likely involve inter-institutional arrangements. Yet archiving necessarily involves fundamental curatorial issues, such as deciding when to migrate materials and to what format, and periodically assessing the state of the archival collection. The planning process will address how to allocate responsibility across current (or new) institutional structures.

External relationships

No institution will archive solely for its own use, nor rely solely on its own archiving program. Therefore no institution can decide on the nature of its archiving operation in isolation. Harvard proposes to work with its peers in the NorthEast Research Libraries consortium (NERL) on several key aspects of its program:

  • requirements for archiving licenses, and the negotiation of such licenses,
  • decisions about what specific components of journals should be included in an archive,
  • the standards and "good practices" an archive must follow to be relied upon by other institutions subscribing to the archived content,
  • the appropriate level of inter-institutional auditing, and
  • the output formats that must be immediately available from an archive in order to respond to the access needs of other institutions.

Costs and economic issues

Harvard believes that the most important single economic issue in archiving electronic journals will be managing costs. The level of archiving eventually undertaken, and the distribution of responsibilities and costs among the players will be very sensitive to the total cost of the enterprise. It is critical that the archiving process be designed to minimize marginal costs. We suspect that three factors will have the greatest impact on costs:

  • effective automation of the archiving process to reduce labor and complexity;
  • sophisticated selection of archiving hardware and software; and
  • the efficiencies inherent in large-scale archiving operations.

All three factors will be a constant focus of our efforts in both the planning and implementation phases.

A proposed model for archive cost distribution will be addressed during the planning phase. For paper-based publications, archiving costs were borne in a largely unplanned pattern, with the great majority of libraries and scholars relying on the preservation activities of a relatively small number of institutions. In the archiving institutions, the costs of preservation (the binding of journals, for example) were also largely bundled into collections budgets, as much of the required activity was also related to daily services to users. With digital publications, archiving and day-to-day access can be (and are likely to frequently be) separately calculated and budgeted. This will require that the community more explicitly address the issue of cost distribution.

There are 3 immediately obvious sources of support for archiving: publishers, the digital archiving institutions, and other subscribing institutions (particularly those that would previously have archived paper copies of a title). In addition, there may be new sources of support, including governments (some of which have been active in the support of microform-based preservation), foundations, scholarly societies, and cooperatives of various sorts (library, scholarly, higher education). It seems likely that the pattern for support will not be the same for all types of publications. The funding structure for archiving the publications of a scholarly society, for example, may be different from that for publications issued by a for-profit publisher or by an academic department.

During the planning phase, we will undertake several initiatives related to long-term funding:

  • analysis of different types of publications for which different sources of funding might be appropriate;
  • exploration with publishers of the distribution of costs, particularly those related to the preparation of objects and metadata for archiving; and
  • discussions within NERL about the distribution of costs, perhaps making this an explicit part of the negotiations at the point of acquisition. (NERL's primary role today is the joint licensing of commercial electronic resources.)

A key product of the planning phase will be a proposed budget for the archiving project, addressing staffing (development, operations, and management), hardware and software acquisitions and maintenance, third-party service vendor costs, and overhead.

PROJECT STAFFING

The planning project will be carried out primarily by a Project Manager with both library and technology experience, working with various teams within Harvard and without. The Project Manager will be Y. Kathy Kwan, a librarian with significant systems experience who recently joined Harvard after serving as an Associate Fellow at the National Library of Medicine. Kwan will be dedicated full time to the planning project. Within the Harvard libraries we will form two primary teams:

Project Steering Committee. This group will be composed of senior curators, preservation experts, and library systems staff, and will address functional and organizational issues. Proposed members of the Committee are:

  • Marianne Burke (Assistant Director for Resource Management, Countway Library of Medicine)
  • Dale Flecker (Associate Director for Planning and Systems, Harvard University Library)
  • Diane Garner (Librarian for the Social Sciences, Harvard College Library)
  • Jeffrey Horrell (Associate Librarian of Harvard College for Collections)
  • Y. Kathy Kwan (Project Manager)
  • Jan Merrill-Oldham (Malloy-Rabinowitz Preservation Librarian)
  • Constance Rinaldo (Librarian, Ernst Mayr Library of the Museum of Comparative Zoology)
  • Lynne Schmelz (Librarian for the Sciences, Harvard College Library)
  • MacKenzie Smith (Digital Library Projects Manager)

Technical Team. An internal team composed of staff with significant experience in digital library development will investigate technical issues and systems requirements. Proposed members of the Team include:

  • Stephen Chapman (Preservation Librarian for Digital Projects)
  • Dale Flecker (Associate Director for Planning and Systems, Harvard University Library)
  • Y. Kathy Kwan (Project Manager)
  • MacKenzie Smith (Digital Library Projects Manager)
  • Robin Wendler (Metadata Analyst)

In addition, developers from the current Library Digital Initiative technical staff with appropriate expertise (e.g., repositories, access management) will participate on the Technical Team as appropriate.

A fundamental feature of the LDI development process is the regular open review of all technical analyses and designs by a group of technical experts drawn from across the university. Representatives from academic computing groups, libraries, and museums, and the senior staff of the chief information officer participate in these reviews. The same methodology will be used in reviewing all technical work in the archiving project.

PLAN OF WORK

Policy and functional issues will be addressed by the Steering Committee, and technical issues will be handled by the Technical Team. Project activities will be coordinated and documented by the Project Manager. The programmer will be involved in preparing technical specifications and in exploratory development.

1. Review literature and current practices (Quarter one)

  • Review current efforts in digital electronic archiving, specifically E-journal archiving.

2. Secure core publishers (Quarters one and two)

  • Approach a potential "anchor tenant" and one scholarly society publisher to explore interest in project participation.
  • Discuss in detail with the committed publishers the business and technical issues associated with archiving of E-journals, including title selection, right of access, archival format, metadata creation, and cost distribution.

3. Analyze targeted journals (Quarters one and two)

  • Examine content of the targeted journals and inventory their types and format.
  • Decide which components will be archived.

4. Foster partnership with NERL (Quarters one and three)

  • Meet with NERL members to establish common goals and requirements for an E-journal archive, including content analysis of targeted journals, archival format, good/established practices in handling different data types, trusted access, licensing issues, cost distribution, and ongoing audit.
  • Formulate policy for the archive based on the outcome of above discussion.

5. Define functional requirements (Quarters one, two, and three)

  • Develop requirements for the archive in the areas of accession, archival format, validation/audit, bibliographic control, metadata, naming, access management, storage, and output facilities.

6. Design high-level system architecture (Quarters two, three, and four)

  • Design high-level system architecture based on functional requirements.
  • Propose specifications for software and hardware.

7. Select additional titles (Quarters two, three, and four)

  • Add to the project a selection of titles that Harvard has acquired in electronic form only.
  • Discuss project participation with respective publishers.

8. Explore internal institutional roles (Quarters two, three, and four)

  • Conduct exploratory discussions with relevant parties within Harvard to determine how responsibility, such as selection of content, format migration, system maintenance, and assessment of the state of the archival collection, will be distributed.

9. Exploratory development (Quarters three and four)

  • Test the validity and explore possible inadequacy of functional requirements as drafted.
  • Select, with agreement from a relevant publisher, a very small sample for the test.
  • Develop a system prototype for exploration.
  • Analyze the result to refine the proposed functional requirements and system design.

10. Define the archive policy (Quarter four)

  • Consolidate the outcome from discussions with publishers, NERL, Harvard institutions, and the project team, to formulate appropriate policy for the E-journal archive.
  • Identify all players for implementation and clarify their roles.

11. Estimate the cost of an archiving project (Quarter four)

  • Chart all parts of the E-journal archive implementation with proposed timeline.
  • Estimate software and hardware needs, including acquisition and maintenance costs and scale.
  • Estimate staffing needs.
  • Estimate the overall budget for program implementation.

12. Prepare the program implementation plan (Quarter four)

  • Report the findings of the planning year.
  • Propose the implementation plan for an E-journal archive at Harvard, detailing its mission, guiding policies, scope, participating parties, technical infrastructure, budget, and estimated outcome.
