UNIVERSITY OF PENNSYLVANIA

PROPOSAL FOR A PLANNING GRANT FOR ARCHIVING AND PRESERVATION OF ELECTRONIC JOURNALS

The problem

Like many research libraries, the University of Pennsylvania Library subscribes to an increasing number of journals in electronic form. The digital form of journals offers some significant benefits compared to the print form, including enhanced accessibility, search facilities, linking services, and the ability to include and extract new kinds of content (such as data sets) in scholarly publications. However, subscribing to publications in electronic form also carries substantial risks. When we obtain print journals, we can be reasonably sure that we will continue to have full access to them in the years to come, at little cost to us, because the technology of print preservation is well understood and because access to and preservation of our print journals are not constrained by legal restrictions or the continuing fortunes or policy of a publisher. On the other hand, technology for preserving digital information is unproven over the long term. Furthermore, the legal and economic environment of electronic journal subscriptions is also in flux, making our ability to maintain control over our copies of electronic journals uncertain. These risks have made libraries hesitant to take full advantage of the benefits electronic journals can provide. They also threaten the long-term viability of the record of scholarship.

Reliable, persistent electronic journal archives solve many of these problems. Such archives benefit the libraries that subscribe to electronic journals, by ensuring that they will continue to have access to journal content, and be able to use it effectively for scholarly activity, over the long term. Such archives also benefit the publishers of electronic journals: they make it more attractive for libraries to buy subscriptions, since the libraries know they will have long-term access to current and past issues of the journals. Archives also benefit the authors of journal content, by ensuring that their scholarship remains in public view.

Since the Penn Library started acquiring and creating digital content, we have recognized the need to preserve it for decades, or even centuries. All of our digital collections need to have preservation support: not just our electronic journals, but also our digitized images, our on-line books, our catalogs and finding aids, our databases, and our multimedia content. Journal literature, as the primary record of progress in scholarly research, is especially important to preserve over the long term. Moreover, we expect to use techniques and tools for archiving and preserving electronic journals to archive and preserve other types of digital information as well.

The proposal

Therefore, the Penn Library proposes to establish a long-term digital archive for electronic journals, as part of the Mellon Electronic Journal Archiving Program. We intend to make arrangements with selected publishers of electronic journals to archive their publications. We intend to set up a system that can ensure their long-term accessibility. We intend to study how such systems can be set up to effectively archive electronic journals at low cost. We intend to share our findings, and (where permitted by applicable licenses) our archival systems and content, with the broader library community.

Paul Mosher, the Director of Penn Libraries and Vice-Provost of the University of Pennsylvania, is very interested in moving this project forward. He will be visiting with publishers later this month to start partnerships with them in creating such an archive.

We propose to start with a one-year phase to design and plan the archive, to start archiving selected journals on an experimental basis, and to study the costs and benefits, and optimal strategies, for maintaining such an archive. We expect that this phase would produce several important contributions:

Agreements with publishers to provide electronic journal content for perpetual use and archiving, and models for agreements on the rights and responsibilities of electronic journal archives.
A design, and the beginnings of a first implementation, for sustainable, distributed archives for electronic journal content.
A framework (both technical and procedural) for working with peer institutions to share information and responsibilities concerning archived electronic journals, including a consensus on minimum criteria for trustworthy electronic archives.
An experience report and evaluation on best practices for starting and maintaining journal archives.

Advantages of Penn as a pilot site

Penn's digital library program has several features that make it especially well-equipped to act as a pilot site for this program:

We have in place, or are working on developing, many of the pieces of the digital architecture that would be needed to support an electronic journal archive. In particular:
1. We have acquired a terabyte-scale networked disk storage unit for on-line access to our digital holdings, which should provide reliable access and backup. While this unit is largely already allocated to other projects, we can expand the existing structure for archiving electronic journal issues.
2. We have implemented, and now maintain, a database containing information about the electronic journals to which we subscribe. We are now developing tools for librarians to input new information into this database, and for patrons to browse and search the database to gain access to full-text journal content. The database, and accompanying tools, could be extended to manage metadata about journals we archive, and provide access to them.
3. We have installed a Handle server, and are implementing tools for maintaining Handle databases, that should provide persistent identification and location of digital content, including electronic journal content, that is not dependent on the location of the content or the technology used to manage or serve it.
We are participating in collaborative programs organized by RLG and the DLF to provide metadata about our digital collections to peer institutions using standard formats and protocols. We hope to use similar mechanisms to give libraries information about our electronic journal holdings, and share content where licenses permit it.
We have been working with Oxford University Press, a major publisher of scholarly books and journals, for over a year to make their current book releases in history available to our local community. We have gained experience in working with publishers to receive their digital content, convert it to a form suitable for on-line use, maintain it, and make it available under terms mutually beneficial to the publisher and our scholarly community. Our OUP book project has included tools to browse and search the content of books. It aims to study how electronic books can be used most effectively in a university environment. We hope to reuse infrastructure, findings, and experience from this project in our journal-archiving efforts.
We have a large body of professionals who are experienced and knowledgeable in applying digital library technology to meet the needs of library users. All major branches of the library, from public services to cataloging to special collections, have integrated digital technology into their operations for several years. In addition, we have specialized staff, with library, computer science, and information technology experience, whose primary job is to research and develop digital library technology. They work with the rest of the library staff to deploy it in effective ways.

In the past year, our digital library group has gained substantial experience in migrating digital data and metadata into new forms suitable for long-term use. Projects we have worked on have included converting digital image data and metadata from low-resolution and proprietary formats to high-resolution, standard formats; converting static web pages and scripts into database forms that can be presented and browsed in a variety of ways; and converting a Yiddish database built on 1980s-era dBase programs and private character encodings into a database searchable and browsable on the Web that uses standard Unicode representations for non-Roman characters.

We can use, and make available to other institutions, a distributed software architecture for managing data formats, and for supporting operations that extract information from these formats and migrate data from old to new data formats while controlling information loss. This architecture, known as the Typed Object Model (TOM), has been developed by a computer scientist now on our library staff, has been used for several years as the basis for a web-based conversion service, and has open-source software implementations available through the Penn library. TOM should be an important tool in keeping electronic journals usable even if the data formats in which they were originally published become obsolete.

Planned Activities

Here is what we plan to do during the planning phase of the project:

1. Select a set of electronic journals to start archiving, and make arrangements with their publishers to archive them.

We intend to concentrate on academic publishers, and to archive as many of each publisher's journals as feasible. For the initial phase of the project, we hope to set up archiving for at least 80 journals from at least two publishers. Although we do not yet have confirmed journal archiving partners, we plan to initially approach Oxford University Press (with whom we already have an arrangement to store and provide on-line books), and Cambridge University Press, with whom we also have working ties. Our library director, Paul Mosher, will be visiting both publishers in late October. We already subscribe to about 120 electronic journals provided by Oxford and Cambridge, which would be a sufficient base for the initial phase of this project. If our needs and scale warrant, we would also approach other publishers.

For the planning phase, we would start by archiving issues that are already in electronic format, and issues that appear in the future. However, we would design the archive so that newly digitized past issues could also be included. (Retrospective conversion may play more of a role in the second phase of the program, but some initial experimentation, to assess how the archive could accommodate different production workflows and formats, may also be conducted during the planning phase.)

Milestones:

By 3 months from the start of the grant period, we expect to have made initial arrangements with at least two publishers to work with us in setting up an archive for their journals.
By 6 months, we plan to start collecting journal content from these publishers, either harvested from the Web, or sent to us by special arrangement with the publishers.

2. Negotiate agreements (licenses) of archival rights and responsibilities with our publisher partners.

When we started to make Oxford University Press books available to our users, a simple "gentleman's agreement" was sufficient for the first phase of our collaboration. In the longer term, however, archives must negotiate more explicit licensing agreements that clearly spell out the rights and responsibilities of the publishers and archivers of electronic journals, and guarantee them into the future. Otherwise, an archive may find itself unable to keep the promises that it has made to preserve archived issues and provide them to the scholarly community.

We will need to negotiate licenses with the publishers sufficient to enable us to maintain the utility of the archived issues and provide them to the scholarly community. In some respects, the terms of these licenses can simply guarantee the same legal rights and abilities that libraries already have for archiving their print journals. In particular, we would seek

the right to store the electronic copies, and provide access to our campus users, in perpetuity.
the right to provide access to other institutions and mirror sites (based on their subscriptions, parallel or consortial arrangements with publishers, and/or the passage of time since the original publication of the journal).
the right to create derivative works based on the originally licensed material, for the purpose of maintaining useful, high-quality access to the journal content as technology changes.
the right to create and supply metadata on the journal content to the public at large.
the right to transfer the electronic content, rights, and responsibilities, to another archiving institution.

At the same time, we would prepare and publish statements of the archival responsibilities assumed for each journal or set of journals. Because the exact form of an electronic journal may need to change as technology changes, these statements would need to make clear exactly what functions of the journal would be preserved. For example, the commitments made for one journal might include preserving the text, charts, and other illustrations of the editorial matter of the journal, and its table of contents, but not include preserving the exact pagination and layout, or advertising matter. (A journal of more historical interest than direct scholarly interest, though, might have its page images preserved, in contrast.) Other commitments may relate to the metadata preserved for journals, the policy for corrections and errata, supplementary data sets, or value-added services like full-text indexing or reference linking. Because no institution is guaranteed to go on forever, we would also need to account for the possibility of transferring responsibility for these commitments to other parties, both in our licensing agreements and in our statements of responsibility.

We intend to seek guidance from our university counsel, and possibly from our law school and library, in crafting such agreements. Even more importantly, we intend to collaborate with other partner libraries in the journal archiving project to produce a set of model licenses and statements of responsibility. We believe the most effective journal archiving system will involve many different publishers, with archival responsibilities shared by multiple institutions. Standard licenses and statements of responsibilities, developed through the collaboration of several active archives and publishers, can greatly smooth the operation of a distributed archival system.

Milestones:

By 3 months from the start of the grant period, we will have locally prepared an initial draft of rights and responsibilities we expect our archive to have.
By 6 months, we expect to have made initial legal agreements with the publishers we started working with, of at least limited duration. At this point, we would provide the details of these agreements to partners in the archiving project, for discussion and consultation.
By 12 months, we expect to have "permanent" agreements with these publishers (that is, with archival rights for materials collected to date granted in perpetuity).

3. Determine the necessary metadata and workflow needed to receive, validate, archive, and provide access to electronic journals. Make necessary arrangements with publishers, and within the library, to support this workflow and maintain this metadata. Decide on standard initial formats and protocols to use to communicate our electronic journal data and metadata.

Our agreements with journal publishers would include specifications of the content formats and metadata we expect them to provide. We would also create our own metadata for descriptive and administrative purposes, and for supporting value-added services. We would plan to share non-administrative metadata with the public at large, so that others can see what we hold, what we're committed to providing, and how we keep track of our holdings.
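
As an illustration only, the split between shared and administrative metadata could be modeled along the following lines. This is a minimal sketch in Python: the record type, its field names, and the sample values are all hypothetical, and no metadata scheme has yet been chosen for the project.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class IssueRecord:
    """Hypothetical metadata record for one archived journal issue."""
    journal_title: str
    issn: str
    volume: str
    issue: str
    content_format: str  # format in which the publisher supplied the issue
    # Administrative fields stay internal and are never published.
    administrative: dict = field(default_factory=dict)

    def public_view(self) -> dict:
        """Return only the metadata intended for the public at large."""
        record = asdict(self)         # deep copy; the record itself is unchanged
        record.pop("administrative")  # withhold the internal fields
        return record


# Placeholder values, for illustration only.
record = IssueRecord(
    journal_title="Example Journal",
    issn="0000-0000",
    volume="12",
    issue="3",
    content_format="SGML",
    administrative={"license_id": "demo-license"},
)
print(record.public_view())
```

Keeping the administrative fields in one clearly separated place is a design choice that makes it harder to publish them by accident.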

Milestones:

By 4 months into the grant period, we intend to have detailed descriptions of the metadata we intend to use in the project, and the initial workflow we plan to follow.
By 7 months, we plan to have provided the details of the metadata and workflow actually used to partner institutions.
By 12 months, we plan to have at least some of this metadata publicly viewable.

4. Design, and begin installing, an information base (consisting of databases and filesystems) for storing and accessing archived journal content and related metadata.

This would include installing tools for checking the integrity of this data, and for backing it up (both within the organization, and with mirror sites). We would also install (and implement, where needed) required client and server software for communicating e-journal data and metadata, under appropriate access control.
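
One common way to check the integrity of stored files is to record a checksum for each file and recompute it later. The sketch below assumes that approach, using SHA-256 and an in-memory manifest; the function names are hypothetical, and the tools actually installed for the archive may work differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents changed."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

A manifest built when content is received could be stored alongside the backups, so that a mirror site can run the same check against its own copy.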

Unlike the archival commitments specified earlier, the exact choices of formats, protocols, and software used in the information base can change over time as needed. However, it is still important to make initial choices that will support our archival commitments, that will be compatible with the practices of our content suppliers, archiving partners, and readers, and that can be transferred or translated for new technology when needed.

Milestones:

By 6 months into the grant period, we expect to have an initial information base set up to store journal content and metadata.
By 12 months, we intend to have made initial agreements with institutional partners to mirror material.

5. Plan, and begin to install, documentation and support for the data formats and protocols used by the archive.

Support should include, where feasible, translation of data to alternative formats, and/or services to provide access to information managed using formats and protocols not directly supported by clients. (That is to say, we should set up migration and emulation services for our data.)

Even if there is no immediate need for migration and emulation services, due to the exclusive use of widely-used formats and protocols, such services will eventually be necessary as technology changes. By setting up support for migration and emulation from the start, we hope to reduce costs incurred later on, by building on the support and knowledge base already built for our old systems. Furthermore, the startup costs for this form of data format support give us some empirical basis for estimating, and amortizing, the costs of keeping up with technology change over the long term.
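
To illustrate the idea of migration support, the sketch below treats registered converters as edges in a graph and searches for the shortest chain of conversions from an old format to a current one. It is a generic illustration in Python and is not the TOM interface; the registry contents and the function name are hypothetical.

```python
from collections import deque

# Hypothetical registry: each (source, target) pair names an available converter.
CONVERTERS = {
    ("wordperfect", "rtf"),
    ("rtf", "html"),
    ("html", "xml"),
    ("tiff", "png"),
}


def migration_path(source, target, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of conversions.

    Returns the list of formats visited, or None if no chain exists.
    """
    if source == target:
        return [source]
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for src, dst in sorted(converters):
            if src == path[-1] and dst not in seen:
                if dst == target:
                    return path + [dst]
                seen.add(dst)
                queue.append(path + [dst])
    return None


print(migration_path("wordperfect", "xml"))
```

Preferring the shortest chain is one simple way to limit information loss, since each conversion step may discard some detail. A fuller service would also weigh how much each individual converter preserves.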

Milestones:

By 9 months, we will have available documentation (written by ourselves or by third parties) of the standard formats and protocols used in our information base.
By 11 months, the standard data formats used in our archive will be registered with our type brokers (see http://tom.library.upenn.edu/ ) with converters or data extraction services available.

6. Start acquiring and archiving journals, indexing them, providing metadata, and providing content to authorized parties, on an experimental basis.

From the start, measure computing resources and labor required to enter and maintain the journals. Measure their usage as well. Using this data, evaluate the costs and benefits of running the archive, and consider possibilities for ongoing funding sources (such as internal funding, shared costs in a consortium, and/or revenue from subscriptions). Our experience and analysis would also permit us to make well-informed plans for ongoing staffing, acquisitions, and access policies for the archive.
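
The cost evaluation could start from very simple arithmetic, as in the sketch below. The cost categories and function names are assumptions made for illustration, and any figures passed in would come from the measurements described above.

```python
from typing import Optional


def annual_cost(staff_hours: float, hourly_rate: float,
                storage_gb: float, cost_per_gb: float) -> float:
    """Labor cost plus storage cost for one year of archive operation."""
    return staff_hours * hourly_rate + storage_gb * cost_per_gb


def cost_per_use(total_cost: float, uses: int) -> Optional[float]:
    """Cost per recorded use, or None when no usage was recorded."""
    return total_cost / uses if uses else None
```

Comparing cost per use across journals or publishers would be one input into the staffing, acquisitions, and access policies mentioned above.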

Milestones:

By 6 months, we will have decided on measurement and assessment practices for our initial phase. Measurement will begin on incoming content. Also at this point, initial proposals for sustaining the archive will be made, and publicized to partners.
By 9 months, local users will be able to examine content in our information base under appropriate access control and usage logging.

7. Produce a report, at the end of the year, describing our experience and findings from the planning year. Assuming our findings are reasonably positive, produce a plan for continuing, and growing, the archive on a permanent basis.

We would work with fellow archiving and journal-using sites to reach agreement on suitable protocols, practices, and architecture for long-term, distributed journal archives. We would also distribute any software and metadata that we develop for managing the electronic journal archives. (For example, this could include TOM servers and associated access and translation scripts for data formats used in the archive.)

Milestones:

By 12 months, an experience report will be published, and a plan for continuing the archive will be submitted to Mellon. Software developed by us, and not subject to licensing constraints, would also be made available when in a reasonably stable state.

Schedule

Here is a summary of the milestones described above, arranged chronologically. (Note that these are dates of expected completions. Work on these items will start considerably earlier. In particular, work towards the milestones of the first 6 months would start as soon as possible.)

3 months:	Initial archiving arrangements made with publishers. Draft licenses/statements of rights and responsibilities written.
4 months:	Detailed descriptions of metadata and workflow created.
6 months:	Initial journal archive information base created. Initial licenses/statements of rights and responsibilities agreed to and distributed for comment. Collection of journal content and metadata begins. Initial plans for sustainability proposed.
7 months:	Details of metadata and workflow provided to partner institutions.
9 months:	Journal archive content available to local users. Standard formats and protocols publicly documented.
11 months:	Broker services available for standard archive data formats.
12 months:	Initial mirroring arrangements made. "Permanent" licenses/statements of rights and responsibilities agreed to. Journal metadata publicly viewable. Experience report and plan for continuation. Software depot set up for locally developed tools.

Staffing

The following staff members will be working on this project:

John Mark Ockerbloom (project lead) is digital library architect and planner at the University of Pennsylvania library. He joined the Penn Library in 1999, coming from Carnegie Mellon University, where he earned a Ph.D. in computer science in 1998. He has been involved in digital library-related projects since 1993. His thesis work, for instance, included the design and implementation of a distributed architecture for managing diverse data formats, which has applications in digital library interoperability and preservation. He was a consultant for the Universal Library project at Carnegie Mellon, and has presented his work to the Coalition for Networked Information, the ACM Digital Libraries Conference, and the Mid-Atlantic Regional Archives Conference. At Penn, his achievements in migrating old technology to new uses have included the successful adaptation of a multilingual Yiddish song database from 1980s dBase technology and private character sets to a Web-accessible searchable form using Unicode and standard HTML. He also edits The On-Line Books Page, the Internet's longest-running site indexing and supporting freely available on-line books.

Delphine Khanna is a digital projects librarian at Penn. She joined the Penn Library in 1999. She came to Penn from Rutgers, where she worked on several digital-library projects of the Center for Electronic Text in the Humanities. At Penn, she has led several digital library projects, including an ongoing project to archive and provide access to hundreds of thousands of digital images from our Fine Arts library. She has experience and expertise implementing projects with many formats and programs used in digital libraries, including SGML and XML, EAD, TEI, Cold Fusion, Verity, and DLXS. She holds a Masters of Library Science from Syracuse University, and a Masters of Linguistics and Computer Science from the University of Paris, France.

Michael Winkler is the Web manager at the Penn Library. He joined the Penn Library in 1999, after working in a similar role at North Carolina State University. At Penn, he has defined both data structures and workflows for changing our electronic journals, databases, and new materials information from ad-hoc, hand-maintained web pages to structured, low-overhead databases which are being integrated with our Franklin library catalog, and delivered to the public through new, powerful search tools. (Some of these tools can now be found at http://www.library.upenn.edu/prototypes/public.html)

Roy Heinz is the director of Library Computing at Penn Library, which supports all aspects of digital library use and development at Penn. He is also the project lead for our Oxford University Press on-line books project, funded by the Mellon Foundation.

Grover McKenzie is senior systems programmer and Unix system administrator at the Penn Library. He has worked for Library Computing for the past nine years, and has installed, maintained, and upgraded our Unix-based servers (including those for our Oracle database, and our new terabyte-scale networked storage unit), and our networking and backup infrastructure.

We plan to expand our staff, using funds from various sources (including this grant), to include at least one more person providing programming and database support for our digital library projects. This person, once hired, may (among other duties) build the tools and database we would use for this project. Before then, John Ockerbloom, Delphine Khanna, and Michael Winkler would be developing or installing any necessary tools for this phase of the project. John Ockerbloom would be in charge of planning, designing, and reporting on the project, and maintaining liaisons with our external partners. Grover McKenzie would provide support for the basic network and storage infrastructure. Roy Heinz, along with other high-level library management staff, would oversee the progress of the project, and facilitate relations with our external partners. the project, and helping to build a robust infrastructure and community for electronic journal archiving.
