UNIVERSITY OF PENNSYLVANIA
PROPOSAL FOR A PLANNING GRANT FOR ARCHIVING AND PRESERVATION OF ELECTRONIC
JOURNALS
The problem
Like many research libraries, the
University of Pennsylvania Library subscribes to an increasing number of
journals in electronic form. The digital form of journals offers some
significant benefits compared to the print form, including enhanced
accessibility, search facilities, linking services, and the ability to include
and extract new kinds of content (such as data sets) in scholarly publications.
However, subscribing to publications in electronic form also carries substantial
risks. When we obtain print journals, we can be reasonably sure that we will
continue to have full access to them in the years to come, at little cost to us,
because the technology of print preservation is well understood and because
access to and preservation of our print journals are not constrained by legal
restrictions or by the continuing fortunes or policies of a publisher. On the other
hand, technology for preserving digital information is unproven over the
long term. Furthermore, the legal and economic environment of
electronic journal subscriptions is also in flux, making our ability to maintain
control over our copies of electronic journals uncertain. These risks have made
libraries hesitant to take full advantage of the benefits electronic journals
can provide. They also threaten the long-term viability of the record of
scholarship.
Reliable, persistent electronic journal archives solve many of
these problems. Such archives benefit the libraries that subscribe to
electronic journals, by ensuring that they will continue to have access to
journal content, and be able to use it effectively for scholarly activity, over
the long term. Such archives also benefit the publishers of electronic
journals: they make it more attractive for libraries to buy subscriptions, since
libraries know they will have long-term access to current and past issues of the
journals. Archives also benefit the authors of journal content, by ensuring
that their scholarship remains in public view.
Since the Penn Library started
acquiring and creating digital content, we have recognized the need to preserve
it for decades, or even centuries. All of our digital collections need to have
preservation support: not just our electronic journals, but also our digitized
images, our on-line books, our catalogs and finding aids, our databases, and our
multimedia content. Journal literature, as the primary record of progress in
scholarly research, is especially important to preserve over the long term.
Moreover, we expect to use techniques and tools for archiving and preserving
electronic journals to archive and preserve other types of digital information
as well.
The proposal
Therefore, the Penn Library proposes to
establish a long-term digital archive for electronic journals, as part of the
Mellon Electronic Journal Archiving Program. We intend to make arrangements
with selected publishers of electronic journals to archive their publications.
We intend to set up a system that can ensure their long-term accessibility. We
intend to study how such systems can be set up to effectively archive electronic
journals at low cost. We intend to share our findings, and (where permitted by
applicable licenses) our archival systems and content, with the broader library
community.
Paul Mosher, the Director of Penn Libraries and Vice-Provost of
the University of Pennsylvania, is very interested in moving this project
forward. He will be visiting with publishers later this month to start
partnerships with them in creating such an archive.
We propose to start with
a one-year phase to design and plan the archive, to start archiving selected
journals on an experimental basis, and to study the costs and benefits, and
optimal strategies, for maintaining such an archive. We expect that this phase
would produce several important contributions:
- Agreements with
publishers to provide electronic journal content for perpetual use and
archiving, and models for agreements on the rights and responsibilities of
electronic journal archives.
- A design, and the beginnings of a first
implementation, for sustainable, distributed archives for electronic journal
content.
- A framework (both technical and procedural) for working with peer
institutions to share information and responsibilities concerning archived
electronic journals, including a consensus on minimum criteria for trustworthy
electronic archives.
- An experience report and evaluation on best practices
for starting and maintaining journal archives.
Advantages of Penn as a pilot site
Penn's digital library program has several features that make
it especially well-equipped to act as a pilot site for this program:
- We
have in place, or are working on developing, many of the pieces of the digital
architecture that would be needed to support an electronic journal archive. In
particular:
- We have acquired a terabyte-scale networked disk storage
unit for on-line access to our digital holdings, which should provide reliable
access and backup. While this unit is largely already allocated to other
projects, we can expand the existing structure for archiving electronic journal
issues.
- We have implemented, and now maintain, a database containing
information about the electronic journals to which we subscribe. We are now
developing tools for librarians to input new information into this database, and
for patrons to browse and search the database to gain access to full-text
journal content. The database, and accompanying tools, could be extended to
manage metadata about journals we archive, and provide access to them.
- We
have installed a Handle server, and are implementing tools for maintaining
Handle databases, that should provide persistent identification and location of
digital content, including electronic journal content, that is not dependent on
the location of the content or the technology used to manage or serve it.
- We are participating in collaborative programs organized by RLG and the DLF
to provide metadata about our digital collections to peer institutions using
standard formats and protocols. We hope to use similar mechanisms to give
libraries information about our electronic journal holdings, and share content
where licenses permit it.
- We have been working with Oxford University Press,
a major publisher of scholarly books and journals, for over a year to make their
current book releases in history available to our local community. We have
gained experience in working with publishers to receive their digital content,
convert it to a form suitable for on-line use, maintain it, and make it
available under terms mutually beneficial to the publisher and our scholarly
community. Our OUP book project has included tools to browse and search the
content of books. It aims to study how electronic books can be used most
effectively in a university environment. We hope to reuse infrastructure,
findings, and experience from this project in our journal-archiving efforts.
- We have a large body of professionals who are experienced and knowledgeable
in applying digital library technology to meet the needs of library users. All
major branches of the library, from public services to cataloging to special
collections, have been integrating digital technology into their operations for
several years. In addition, we have specialized staff, with library, computer
science, and information technology experience, whose primary job is to research
and develop digital library technology. They work with the rest of the library
staff to deploy it in effective ways.
In the past year, our digital
library group has gained substantial experience in migrating digital data and
metadata into new forms suitable for long-term use. Projects we have worked on
have included converting digital image data and metadata from low-resolution and
proprietary formats to high-resolution, standard formats; converting static
web pages and scripts into database forms that can be presented and browsed in a
variety of ways; and converting a Yiddish database built on 1980s-era dBase
programs and private character encodings into a Web-searchable and Web-browsable
database that uses standard Unicode representations for non-Roman
characters.
- We can use, and make available to other institutions, a
distributed software architecture for managing data formats, and for supporting
operations that extract information from these formats and migrate data from old
to new data formats while controlling information loss. This architecture,
known as the Typed Object Model (TOM), has been developed by a computer
scientist now on our library staff, has been used for several years as the basis
for a web-based conversion service, and has open-source software implementations
available through the Penn library. TOM should be an important tool in keeping
electronic journals usable even if the data formats in which they were
originally published become obsolete.
Planned Activities
Here is what we plan to do during the planning phase of the project:
1. Select a set of electronic journals to start archiving, and make arrangements
with their publishers to archive them.
We intend to concentrate on
academic publishers, and to archive as many of each publisher's journals as
feasible. For the initial phase of the project, we hope to set up archiving for
at least 80 journals from at least two publishers. Although we do not yet have
confirmed journal archiving partners, we plan to initially approach Oxford
University Press (with whom we already have an arrangement to store and provide
on-line books), and Cambridge University Press, with whom we also have working
ties. Our library director, Paul Mosher, will be visiting both publishers in
late October. We already subscribe to about 120 electronic journals provided by
Oxford and Cambridge, which would be a sufficient base for the initial phase of
this project. If our needs and scale warrant, we would also approach other
publishers.
For the planning phase, we would start by archiving issues
that are already in electronic format, and issues that appear in the future.
However, we would design the archive so that newly digitized past issues could
also be included. (Retrospective conversion may play more of a role in the
second phase of the program, but some initial experimentation, to assess how the
archive could accommodate different production workflows and formats, may also
be conducted during the planning phase.)
Milestones:
- By 3 months from the start of the grant period, we expect to have made
initial arrangements with at least two publishers to work with us in setting up
an archive for their journals.
- By 6 months, we plan to start collecting
journal content from these publishers, either harvested from the Web, or sent to
us by special arrangement with the publishers.
2. Negotiate
agreements (licenses) of archival rights and responsibilities with our publisher
partners.
When we started to make Oxford University Press books
available to our users, a simple "gentleman's agreement" was sufficient for the
first phase of our collaboration. In the longer term, however, archives must
negotiate more explicit licensing agreements that clearly spell out the rights
and responsibilities of the publishers and archivers of electronic journals, and
guarantee them into the future. Otherwise, an archive may find itself unable to
keep the promises that it has made to preserve archived issues and provide them
to the scholarly community.
We will need to negotiate licenses with the
publishers sufficient to enable us to maintain the utility of the archived
issues and provide them to the scholarly community. In some respects, the terms
of these licenses can simply guarantee the same legal rights and abilities that
libraries already have for archiving their print journals. In particular, we
would seek
- the right to store the electronic copies, and provide access
to our campus users, in perpetuity.
- the right to provide access to other
institutions and mirror sites (based on their subscriptions, parallel or
consortial arrangements with publishers, and/or the passage of time since the
original publication of the journal).
- the right to create derivative works
based on the originally licensed material, for the purpose of maintaining
useful, high-quality access to the journal content as technology changes.
- the right to create and supply metadata on the journal content to the public
at large.
- the right to transfer the electronic content, rights, and
responsibilities, to another archiving institution.
At the same
time, we would prepare and publish statements of the archival responsibilities
assumed for each journal or set of journals. Because the exact form of an
electronic journal may need to change as technology changes, these statements
would need to make clear exactly what functions of the journal would be
preserved. For example, the commitments made for one journal might include
preserving the text, charts, and other illustrations of the editorial matter of
the journal, and its table of contents, but not include preserving the exact
pagination and layout, or advertising matter. (By contrast, a journal of more
historical than direct scholarly interest might have its page images
preserved.) Other commitments may relate to the metadata preserved
for journals, the policy for corrections and errata, supplementary data sets, or
value-added services like full-text indexing or reference linking. Because no
institution is guaranteed to go on forever, we would also need to account for
the possibility of transferring responsibility for these commitments to other
parties, both in our licensing agreements and in our statements of
responsibility.
We intend to seek guidance from our university counsel,
and possibly from our law school and library, in crafting such agreements. Even
more importantly, we intend to collaborate with other partner libraries in the
journal archiving project to produce a set of model licenses and statements of
responsibility. We believe the most effective journal archiving system will
involve many different publishers, with archival responsibilities shared by
multiple institutions. Standard licenses and statements of responsibilities,
developed through the collaboration of several active archives and publishers,
can greatly smooth the operation of a distributed archival system.
Milestones:
- By 3 months from the start of the grant period,
we will have locally prepared an initial draft of rights and responsibilities we
expect our archive to have.
- By 6 months, we expect to have made initial
legal agreements with the publishers we started working with, of at least
limited duration. At this point, we would provide the details of these
agreements to partners in the archiving project, for discussion and
consultation.
- By 12 months, we expect to have "permanent" agreements with
these publishers (that is, with archival rights for materials collected to date
granted in perpetuity).
3. Determine the necessary metadata and
workflow needed to receive, validate, archive, and provide access to electronic
journals. Make necessary arrangements with publishers, and within the library,
to support this workflow and maintain this metadata. Decide on standard
initial formats and protocols to use to communicate our electronic journal data
and metadata.
Our agreements with journal publishers would include
specifications of the content formats and metadata we expect them to provide.
We would also create our own metadata for descriptive and administrative
purposes, and for supporting value-added services. We would plan to share
non-administrative metadata with the public at large, so that others can see what we
hold, what we're committed to providing, and how we keep track of our holdings.
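As a minimal sketch of the split between administrative and non-administrative metadata described above, a journal record could keep both kinds of fields together and expose only the shareable ones. The field names, the `JournalRecord` class, and the `public_view` helper are all hypothetical illustrations; the actual metadata would be settled during the planning phase.

```python
from dataclasses import dataclass, asdict, field


@dataclass
class JournalRecord:
    """Hypothetical metadata record for one archived journal."""
    # Non-administrative fields, intended to be shared publicly.
    title: str
    publisher: str
    issues_held: list = field(default_factory=list)
    commitments: list = field(default_factory=list)
    # Administrative fields, kept internal to the archive.
    license_terms: str = ""
    storage_path: str = ""


# Names of the fields treated as administrative (an assumption of this sketch).
ADMINISTRATIVE_FIELDS = {"license_terms", "storage_path"}


def public_view(record: JournalRecord) -> dict:
    """Return only the fields that would be shared outside the archive."""
    return {
        name: value
        for name, value in asdict(record).items()
        if name not in ADMINISTRATIVE_FIELDS
    }
```

Keeping one record with an explicit list of administrative fields means the public view is derived rather than maintained separately, so the two cannot drift apart.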
Milestones:
- By 4 months into the grant period, we
intend to have detailed descriptions of the metadata we intend to use in the
project, and the initial workflow we plan to follow.
- By 7 months, we plan to
have provided partner institutions with the details of the metadata and workflow
actually in use.
- By 12 months, we plan to have at least some of this metadata
publicly viewable.
4. Design, and begin installing, an
information base (consisting of databases and filesystems) for storing and
accessing archived journal content and related metadata.
This would
include installing tools for checking the integrity of this data, and for
backing it up (both within the organization, and with mirror sites). We would
also install (and implement, where needed) required client and server software
for communicating e-journal data and metadata, under appropriate access control.
Unlike the archival commitments specified earlier, the exact choices of
formats, protocols, and software used in the information base can change over
time as needed. However, it is still important to make initial choices that
will support our archival commitments, that will be compatible with the
practices of our content suppliers, archiving partners, and readers, and that
can be transferred or translated for new technology when needed.
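As an illustration of the kind of integrity checking mentioned above, the following is a minimal sketch, assuming archived content is stored as ordinary files under a single directory. It records a checksum for each file and later reports any file that is missing or has changed. The function names and the choice of SHA-256 are assumptions made for this example, not a description of tools we have selected.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return relative paths that are missing or whose content has changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

The same manifest could be run against a mirror site's copy, so that one check serves both local integrity and backup verification.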
Milestones:
- By 6 months into the grant period, we expect to
have an initial information base set up to store journal content and metadata.
- By 12 months, we intend to have made initial agreements with institutional
partners to mirror material.
5. Plan, and begin to install,
documentation and support for the data formats and protocols used by the
archive.
Support should include, where feasible, translation of
data to alternative formats, and/or services to provide access to information
managed using formats and protocols not directly supported by clients. (That
is to say, we should set up migration and emulation services for our data.)
Even if there is no immediate need for migration and emulation services, due
to the exclusive use of widely-used formats and protocols, such services will
eventually be necessary as technology changes. By setting up support for
migration and emulation from the start, we hope to reduce costs incurred later
on, by building on the support and knowledge base already built for our old
systems. Furthermore, the startup costs for this form of data format support
give us some empirical basis for estimating, and amortizing, the costs of
keeping up with technology change over the long term.
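To make the migration idea concrete, here is a minimal sketch, assuming the archive keeps a table of the format-to-format converters it has available. Given a source and a target format, it finds the shortest chain of conversions between them. The format names and the `migration_path` function are hypothetical, and this is not the TOM interface; it only illustrates the lookup that a registry of converters makes possible.

```python
from collections import deque

# Hypothetical converter table: each (source, target) pair names a format
# conversion the archive knows how to perform. Format names are examples only.
CONVERTERS = {
    ("sgml", "xml"),
    ("xml", "html"),
    ("tiff", "png"),
}


def migration_path(source, target, converters=CONVERTERS):
    """Return the shortest chain of formats from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Breadth-first search: extend the path by every known converter
        # that starts from the last format reached so far.
        for src, dst in sorted(converters):
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None
```

A result of `None` signals a format for which no migration route exists yet, which is the case the archive would want to detect well before that format becomes obsolete.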
Milestones:
- By 9 months, we will have available
documentation (written by ourselves or by third parties) of the standard formats
and protocols used in our information base.
- By 11 months, the standard data
formats used in our archive will be registered with our type brokers (see
http://tom.library.upenn.edu/) with
converters or data extraction services available.
6. Start
acquiring and archiving journals, indexing them, providing metadata, and
providing content to authorized parties, on an experimental basis.
From the start, measure computing resources and labor required to enter and
maintain the journals. Measure their usage as well. Using this data, evaluate
the costs and benefits of running the archive, and consider possibilities for
ongoing funding sources (such as internal funding, shared costs in a consortium,
and/or revenue from subscriptions). Our experience and analysis would also
permit us to make well-informed plans for ongoing staffing, acquisitions, and
access policies for the archive.
Milestones:
- By 6
months, we will have decided on measurement and assessment practices for our
initial phase. Measurement will begin on incoming content. Also at this point,
initial proposals for sustaining the archive will be made, and publicized to
partners.
- By 9 months, local users will be able to examine content in our
information base under appropriate access control and usage logging.
7. Produce a report, at the end of the year, describing our experience and
findings from the planning year. Assuming our findings are reasonably positive,
produce a plan for continuing, and growing, the archive on a permanent basis.
We would work with fellow archiving and journal-using sites to
reach agreement on suitable protocols, practices, and architecture for
long-term, distributed journal archives. We would also distribute any software and
metadata that we develop for managing the electronic journal archives. (For
example, this could include TOM servers and associated access and translation
scripts for data formats used in the archive.)
Milestones:
- By 12 months, an experience report will be published, and a plan for
continuing the archive will be submitted to Mellon. Software developed by us,
and not subject to licensing constraints, would also be made available when in a
reasonably stable state.
Schedule
Here is a summary of the milestones described above, arranged
chronologically. (Note that these are dates of expected completions. Work on
these items will start considerably earlier. In particular, work towards the
milestones of the first 6 months would start as soon as possible.)
- 3 months: Initial archiving arrangements made with publishers. Draft
licenses/statements of rights and responsibilities written.
- 4 months: Detailed descriptions of metadata and workflow created.
- 6 months: Initial journal archive information base created. Initial
licenses/statements of rights and responsibilities agreed to and distributed
for comment. Collection of journal content and metadata begins. Initial plans
for sustainability proposed.
- 7 months: Details of metadata and workflow provided to partner institutions.
- 9 months: Journal archive content available to local users. Standard formats
and protocols publicly documented.
- 11 months: Broker services available for standard archive data formats.
- 12 months: Initial mirroring arrangements made. "Permanent"
licenses/statements of rights and responsibilities agreed to. Journal metadata
publicly viewable. Experience report and plan for continuation. Software depot
set up for locally developed tools.
Staffing
The following staff members will be working on this
project:
John Mark Ockerbloom (project lead) is digital library
architect and planner at the University of Pennsylvania library. He joined the
Penn Library in 1999, coming from Carnegie Mellon University, where he earned a
Ph.D. in computer science in 1998. He has been involved in
digital library-related projects since 1993. His thesis work, for instance, included the
design and implementation of a distributed architecture for managing diverse
data formats, which has applications in digital library interoperability and
preservation. He was a consultant for the Universal Library project at Carnegie
Mellon, and has presented his work to the Coalition for Networked Information,
the ACM Digital Libraries Conference, and the Mid-Atlantic Regional Archives
Conference. At Penn, his achievements in migrating old technology to new uses
have included the successful adaptation of a multilingual Yiddish song database
from 1980s dBase technology and private character sets to a Web-accessible
searchable form using Unicode and standard HTML. He also edits The On-Line Books
Page, the Internet's longest-running site indexing and supporting freely
available on-line books.
Delphine Khanna is a digital projects
librarian at Penn. She joined the Penn Library in 1999. She came to Penn from
Rutgers, where she worked on several digital-library projects of the Center for
Electronic Text in the Humanities. At Penn, she has led several digital library
projects, including an ongoing project to archive and provide access to
hundreds of thousands of digital images from our Fine Arts library. She has
experience and expertise implementing projects with many formats and programs
used in digital libraries, including SGML and XML, EAD, TEI, Cold Fusion,
Verity, and DLXS. She holds a Masters of Library Science from Syracuse
University, and a Masters of Linguistics and Computer Science from the
University of Paris, France.
Michael Winkler is the Web manager at the Penn
Library. He joined the Penn library in 1999, after working in a similar role at
North Carolina State University. At Penn, he has defined both data structures
and workflows for changing our electronic journals, databases, and new materials
information from ad-hoc, hand-maintained web pages to structured, low-overhead
databases which are being integrated with our Franklin library catalog, and
delivered to the public through new, powerful search tools. (Some of these
tools can now be found at http://www.library.upenn.edu/prototypes/public.html)
Roy Heinz is the director of Library Computing at Penn Library, which supports
all aspects of digital library use and development at Penn. He is also the
project lead for our Oxford University Press on-line books project, funded by
the Mellon Foundation.
Grover McKenzie is senior systems programmer and Unix
system administrator at the Penn Library. He has worked for Library Computing
for the past nine years, and has installed, maintained, and upgraded our
Unix-based servers (including those for our Oracle database, and our new
terabyte-scale networked storage unit), and our networking and backup infrastructure.
We plan to expand our staff, using funds from various sources (including this
grant), to include at least one more person providing programming and database
support for our digital library projects. This person, once hired, may (among
other duties) build the tools and database we would use for this project.
Before then, John Ockerbloom, Delphine Khanna, and Michael Winkler would be
developing or installing any necessary tools for this phase of the project. John
Ockerbloom would be in charge of planning, designing, and reporting on the
project, and would maintain liaisons with our external partners. Grover McKenzie would
provide support for the basic network and storage infrastructure. Roy Heinz,
along with other high-level library management staff, would oversee the progress
of the project, and facilitate relations with our external partners. the
project, and helping to build a robust infrastructure and community for
electronic journal archiving.