YEA: The Yale Electronic Archive
One Year of Progress
Report on the Digital Preservation Planning Project
A collaboration between
Yale University Library and Elsevier Science
Funded by the Andrew W. Mellon Foundation
New Haven, CT
Archiving Scholarly Journals
Librarianship is a curious profession in which we select materials we don't know will be wanted, which we can only imperfectly assess, against criteria which cannot be precisely defined, for people we've usually never met and if anything important happens as a result we shall probably never know, often because the user doesn't realize it himself. — Charlton quoted by Revill in presentation by Peter Brophy, Manchester Metropolitan University, 4th Northumbria Conference, August 2001
(Project members began in January 2001 and are continuing with the project unless otherwise indicated)
Yale University Library
Scott Bennett, University Librarian (Principal Investigator, January – July 2001)
Paul Conway, Director of Preservation (Project Manager, January – June 2001)
David Gewirtz, Senior Systems Manager, Yale ITS (Project Technical Director)
Fred Martz, Director of Library Systems (Project Technical Advisor)
Ann Okerson, Associate University Librarian (Co-Principal Investigator, January – July 2001; Principal Investigator July 2001 –)
Kimberly Parker, Electronic Publishing & Collections Librarian (Metadata Investigator)
Richard Szary, Director of University Manuscripts & Archives (Investigator for Archival Uses)
Additional advice and support from:
Matthew Beacom, Catalog Librarian for Networked Information, Yale Library
Jean-Claude Guédon, Professor of Comparative Literature and History of Sciences, Université de Montréal
James Shetler, Asst. Head of Acquisitions, Yale Library
Rodney Stenlake, Esq., Independent Legal Consultant
Stephen Yearl, Digital Systems Archivist, Yale Library
Elsevier Science
Karen Hunter, Senior Vice President for Strategy
Geoffrey Adams, Director of IT Solutions
Emeka Akaezuwa, Associate Director, Information Technology Implementation
Additional advice and support from:
Haroon Chohan, Elsevier Science, IT Consultant
Paul Mostert, Senior Production IT Manager, Hybrid/Local Solutions, ScienceDirect, Elsevier Amsterdam
Particular thanks go to the following:
Scott Bennett, for his thoughtful and elegant framing of the issues from the outset to the midpoint of the Project and for always keeping our team on target and on time. Since his retirement on July 31, 2001, we have sincerely missed his dedication and contributions to digital preservation, in which he believed passionately.
The Andrew W. Mellon Foundation, for a tangible demonstration of faith, both in the scholarly community's ability to tackle and begin to solve the vital issues associated with long-term digital preservation, and in the ability of the Yale Library to be one of the players helping to find solutions. Particularly we thank Don Waters of the Foundation for his deep commitment to electronic archiving and preservation and for his help to us.
Our team counterparts at Elsevier Science, for proving to be true partners, giving unstintingly their commitment, time, and thoughtfulness to our joint effort. They have fully shared information, data, and technology expertise. We have learned that our two entities, working together, are much stronger than the sum of our parts.
Yale Information Technology Services, for donating far more of David Gewirtz's time than we had any right to expect and for offering their enthusiastic support and advice.
Other Mellon planning projects and their staffs, for giving us help and insights along the way.
Finally, I personally wish to thank each of our team members, because everyone, out of commitment to the long-term mission of libraries, excitement about digital information technologies and their potential, the thrill of learning, and genuine respect for each other's contributions, did more than their share and contributed to a strong planning phase.
Ann Okerson, Principal Investigator
Networked information technology offers immense new possibilities in collecting, managing, and accessing unimaginable quantities of information. But the media in which such information lives are remarkably more ephemeral and fragile than even traditional print media. For such information to have a place in scientific and academic discourse, there must be assurance of long-term preservation of data in a form that can be accessed by users with the kind of assurance they now bring to print materials preserved in libraries. The Yale-Elsevier planning project undertook to study the challenges and opportunities for such preservation posed by a large collection of commercially published scientific journals.
Despite the natural interdependence between libraries and publishers, skepticism remains in both communities about the potential for successful library/publisher collaborations, especially in the electronic archiving arena. The e-archiving planning effort between the Yale University Library and Elsevier Science, funded by the Andrew W. Mellon Foundation, has made substantial gains in bridging the traditional divide between these two groups and has paved the way for continuing collaboration. The goals of our effort were to understand better the scope and scale of digital journal preservation and to reach a position in which it was possible to identify practical next steps in some degree of detail and with a high level of confidence. We believe we have achieved these goals.
From the outset, the Yale-Elsevier team recognized and respected the important and fundamental differences in our respective missions. Any successful and robust e-archive must be built on an infrastructure created specifically to respond to preservation needs and that can only be done with a clear understanding of those missions. Managing library preservation responsibilities for electronic content while protecting a publisher's commercial interests is thus no small task. We have begun with a mutually-beneficial learning process. Work during the Mellon planning year gave us a better understanding of the commercial life cycle of electronic journals, of the ways in which journal production will impact the success of an e-archive, and of the motives that each party brings to the process and the benefits that each party expects.
From the start, the exploration was based on the premise of separating content from functionality. Embedded here is the belief that users of the e-archive are not bench scientists for whom ease of use of the most current scientific information is critical. We envision potential users of the e-archive to be focused primarily on the content. They must be confident that it remains true to what was published and is not altered by the changes in technology that undoubtedly affect functionality. Minimally acceptable standards of access can be defined without mirroring all features of a publisher's evolving interface.
Our determinations include the following:
- Migration of data offers a more realistic strategy than emulation of obsolete systems;
- Preservation metadata differs from that required for production systems and adds real value to the content;
- No preservation program is an island, hence success will depend on adherence to broadly accepted standards and best practices;
- A reasonable preservation process is one that identifies clearly the "trigger events" that would require consultation of the archive and plans accordingly.
We have made effective use of the information learned by library peers in their efforts and in turn have shared the results of our work. Ultimately, the future of electronic archives depends fundamentally on a network of cooperating archives, built on well-conceived and broadly-adopted standards.
Nevertheless, the relationship between publisher and archiver is fundamental. We have begun work on a model license that draws on Yale's extensive experience in developing and modeling successful license agreements. Such an agreement will shape the publisher/archive relationship in ways that control costs, increase effectiveness, and give the archive a place in the economic and intellectual life cycle of the journals preserved.
The Yale-Elsevier team has demonstrated that working collaboratively, we can now begin to build a small prototype archive using emerging standards and available software. This prototype has the potential to become the cornerstone of an e-journal archive environment that provides full backup, preservation, refreshing, and migration functions. We have demonstrated that the prototype — offering content from many or all of the more than 1,200 Elsevier Science journals — can and will function reliably. We are guardedly optimistic about the economic prospects for such archives, but can only test that optimism against a large-scale prototype.
For this archive to become a reality, we must play a continuing lead role in the development and application of standards; help shape and support the notion of a network of cooperating archives; explore further the potential archival uses; and understand and model the economic and sustainability implications.
The following report provides in some detail the results of the Mellon planning year. We believe it demonstrates the deep commitment of the Yale University Library and Elsevier Science to the success of this kind of collaboration, the urgency with which such investigations must be pursued, and the value that can be found in thus assuring the responsible preservation of critical scientific discourse.
The Big Issues
The tension that underlies digital preservation issues is the fundamental human tension between mutability and immortality. The ancient Platonic philosophers thought that divine nature was unchanging and immortal and human nature was changeable and mortal. They were right about the human nature at least.
Human products in the normal course of affairs share the fates of their makers. If they are not eradicated, they are changed in ways that run beyond the imagination of their makers. Stewart Brand's How Buildings Learn has lessons for all those who imagine they are involved in the preservation of cultural artifacts, not just people concerned with buildings.
But a very limited class of things has managed a special kind of fate. The invention of writing and the development of cultural practices associated with it has created a unique kind of survival. It is rarely the case that the original artifact of writing itself survives any substantial length of time, and when it does, that artifact itself is rarely the object of much active reading. Most of us have read the "Declaration of Independence" but few do so while standing in front of the signed original in the National Archives.
Written texts have emerged as man-made artifacts with a peculiar kind of near-immortality. Copied and recopied, transforming physically from one generation to the next, they remain still somehow the same, functionally identical to what has gone before. A modern edition of Plato is utterly unlike in every physical dimension the thing that Plato wrote and yet functions for most readers as a sufficient surrogate for the original artifact. For more modern authors, where the distance between original artifact and present surrogate is shorter, the functional utility of the latter is even greater.
That extraordinary cultural fact creates extraordinary expectations. The idea that when we move to a next generation of technologies we will be able to carry forward the expectations and practices of the last generation blithely and effortlessly is probably widely shared — and deeply misleading. The shift from organic forms of information storage (from papyrus to animal skin to paper) back to inorganic ones (returning, in silicon, to the same material on which the ancients carved words meant to last forever and in fact, lasting mainly only a few decades or centuries) is part of a larger shift of cultural practices that puts the long-term survival of the text newly at risk.
Some of the principal factors that put our expectations in a digital age at risk include:
- The ephemeral nature of the specific digital materials we use — ephemeral both in that storage materials (e.g., disks, tapes) are fragile and of unknown but surely very limited lifespan, and in that storage technologies (e.g., computers, operating systems, software) change rapidly and thus create reading environments hostile to older materials.
- The dependence of the reader on technologies in order to view content. It is impossible to use digital materials without hardware and software systems compatible with the task. All the software that a traditional book requires can be pre-loaded into a human brain (e.g., familiarity with format and structural conventions, knowledge of languages and scripts) and the brain and eyes have the ability to compensate routinely for errors in format and presentation (e.g., typographical errors).
The combined effect of those facts makes it impossible for digital materials to survive usefully without continuing human attention and modification. A digital text cannot be left unattended for a few hundred years, or even a few years, and still be readable.
- The print medium is relatively standard among disciplines and even countries. A physicist in Finland and a poet in Portugal expect their cultural materials to be stored in media that are essentially interchangeable. A digital environment allows for multiple kinds of digital objects and encourages different groups to pursue different goals and standards, thus multiplying the kinds of objects (and kinds of hardware and software supporting different kinds of things) that various disciplines can produce and expect to be preserved.
- Rapidity of change is a feature of digital information technology. That rapidity means that any steps contemplated to seek stability and permanence are themselves at risk of obsolescing before they can be properly adopted.
- The intellectual property regimes under which we operate encourage privatization of various kinds, including restricted access to information as well as the creation of proprietary systems designed to encrypt and hide information from unauthorized users until that information no longer has commercial value, at which point the owner of the property may forget and neglect it.
- The quantity of works created in digital form threatens to overwhelm our traditional practices for management.
- The aggregation of factors so far outlined threatens to impose costs for management that at this moment we cannot estimate with even an order of magnitude accuracy. Thus we do not have a way of knowing just how much we can hope to do to achieve our goals and where some tradeoffs may be required.
- Finally, the ephemeral nature of the media of recording and transmission imposes a particular sense of urgency on our considerations. Time is not on our side.
Against all of these entropic tendencies lies the powerful force of expectation. Our deepest cultural practices and expectations have to such an extent naturalized ideas of preservation, permanence, and broad accessibility that even the most resistant software manufacturers and anxious owners of intellectual property will normally respond positively, at least in principle, to concerns of preservation. That is a great advantage and the foundation on which this project is built.
Social and Organizational Challenges
The reader is the elusive center of our attention, but not many readers attend conferences on digital preservation. We find ourselves instead working among authors, publishers, and libraries, with occasional intervention from benevolent and interested foundations and other institutional agencies. The motives and goals of these different players are roughly aligned, but subtly different.
Authors: Authors require in the first instance that their work be made widely known and, for scholarly and scientific information, made known in a form that carries with it the authorization of something like peer review. Raw mass dissemination is not sufficient. Authors require in the second instance that what they write be available to interested users for the useful life of the information. This period varies from discipline to discipline. Authors require in the third instance that their work be accessible for a very long time. This goal is the hardest to achieve and the least effectively pursued, but it nevertheless gives the author community a real interest in preservation. However, the first and second areas of concern are more vital and will lead authors to submit work for publication through channels that offer those services first. In practice, this means that the second motive — desire to see material remain in use for some substantial period of time — is the strongest authorial intent on which a project such as ours can draw. It will drive authors towards reputable and long-standing publishing operations and away from the local, the ephemeral, and the purely experimental.
It should be noted that at a recent digital preservation meeting, certain authors affirmed volubly their right to be forgotten, i.e., at the least to have their content not included in digital archives or to be able to remove it from those archives. Curiously, even as we worry about creating long-term, formal electronic archives that are guaranteed to last, we note that the electronic environment shows quite a multiplier effect: once a work is available on the Web, chances are it can be relatively easily copied or downloaded and shared with correspondents or lists (a matter that worries some rights owners immensely even though those re-sends are rarely of more than a single bibliographical object such as an article). This means that once an author has published something, the chances of exercising his or her right to be forgotten are as slim as they were in the days of modern printing, or even slimmer. Consider whether "special collections," as we have defined them in the print/fixed-format era, have a meaning in the digital environment and, if so, what that meaning might be. The focus may be not on the materials collected so much as on the expertise and commitment to collect and continue to collect materials at particular centers of excellence, especially where the ongoing task of collection is complex, exacting, difficult, and/or particularly unremunerative.
Publishers: Publishers require in the first instance that they recruit and retain authors who will produce material of reliable and widely-recognized value. Hence, publishers and authors share a strong common interest in peer review and similar systems whose functioning is quite independent of any question of survival and preservation. Publishers are as well — perhaps even more — motivated by their paying customers, e.g., the libraries in this case, who can influence the strategic direction of the publisher no less directly. Consequently, publishers require in the second instance that the material they publish be of continuing value in a specific way. That is, publishers of the type we are concerned with here publish serials and depend for their revenue and the intellectual continuity of their operations on a continuous flow of material of comparable kind and quality. The demand for such material is itself conditioned on that continuity, and the availability of previous issues is itself a mark and instrument of that continuity. The publisher, therefore, understands that readers want the latest material but also material that is not only the latest; the scientific journal is thus crucially different from a newspaper. Publishers also have more insubstantial motivations about continuing reputation and may have themselves long histories of distinguished work. But as with authors, it is the second of the motives outlined here — where self-interest intersects the interest of others — that will most reliably encourage publishers to participate in preservation strategies.
Libraries: Libraries require in the first instance that users find in their collections the materials most urgently needed for their professional and academic work. This need leads them to pay high prices for current information, even when they could get a better price by waiting (e.g., buying new hard cover books rather than waiting for soft cover or remainder prices, or continuing to purchase electronic newspaper subscriptions rather than depend on content aggregators such as Lexis-Nexis, which they may also be purchasing). They require in the second instance that they build collections that usefully reflect the state and quality of scientific, scholarly, and cultural discourse in their chosen areas of coverage.
Research library collections may be built largely independently of day-to-day use, but libraries expect certain kinds of use over a long period of time. Where information may be published to meet a particular information need, these libraries will retain that information in order to meet their other mission of collecting a cultural heritage for future generations. Yale has been pursuing that project for 300 years. With traditional media, libraries, museums, archives, and other cultural institutions have pursued that project in various independent and cooperative ways. It is reasonable to expect that such cooperation among traditional players and new players will only continue and grow more powerful when the objects of preservation are digital.
All of the above tantalizing complications of long-term electronic archiving drew us into this planning project and continued to be themes throughout the year-long planning process.
Yale Library as a Player
The Yale Library has, for close to three decades, been cognizant of and aggressive about providing access to the numerous emerging forms of scholarly and popular publications appearing in "new" electronic formats, in particular those delivered via the Internet and increasingly through a Web interface. Initially, indexing and abstracting services and other "reference" type works found their way into electronic formats and rapidly displaced their former print instantiations. By 1996, full-text content from serious and substantive players such as Academic Press (AP) and JSTOR was entering the marketplace, and libraries and their readers became quickly captivated by the utility, effectiveness, efficiency, and convenience of academic content freed from its traditional fixed formats. By the summer of 2000, Yale Library, like most of its peer institutions, was spending over $1 million annually on these new publication forms, offering several hundred reference databases and thousands of full-text electronic journals to its readers. Expenditures for electronic content and access paid to outside providers last academic year alone totaled nearly $1.8 million. In addition, we spend increasing sums both on the creation of digital content and on the tools to manage digital content internally, e.g., a growing digitized image collection, a fledgling collection of university electronic records, digital finding aids for analog materials in our collections, and our online library management system, which includes but is not limited to the public access catalog.
The electronic resources that Yale makes available to readers, both on its own and in conjunction with its partners in the NorthEast Research Libraries consortium (NERL), quickly become heavily used and wildly popular — and increasingly duplicative, in terms of effort and price, with a number of print reference works, journals, and books. The Yale Library, with its significant resources, 300 years of collections, and long-term commitment to acquiring and preserving collections not only for its own but also for a global body of readers, has acted cautiously and prudently in retaining print texts not only for immediate use but also for long-time ownership and access. It has treated its growing, largely licensed (and thus un-owned) electronic collections, moreover, as a boon for reader productivity enhancement and convenience, even to the point of acquiring materials that seem to duplicate content but offer different functionality.
Nonetheless, it is clear that the Library (and this is true of all libraries) cannot endlessly continue on such a dual pathway, for several compelling reasons: 1) the growing level of print and electronic duplication is very costly in staff resources and to collections budgets; 2) increasingly, readers, at least in certain fields such as sciences, technology, and medicine (STM), as well as certain social sciences, strongly prefer the electronic format and the use of print in those areas is rapidly diminishing; and 3) traditional library stacks are costly to build or renovate. All libraries face what NERL members have dubbed the "Carol Fleishauer" problem: "We have to subscribe to the electronic resources for many good reasons, but we cannot drop print — where we might wish to — because we do not trust the permanence of electronic media." The challenge for us all, then, is how to solve the problem that Carol so pointedly articulated at a NERL meeting.
An Opportunity to Learn and Plan
When the Andrew W. Mellon Foundation first began to signal its keen interest in long-term preservation of digital resources, the Yale Library had already acquired, loaded on site, and migrated certain selected databases and full-text files for its readers. The Library had also explored, between 1997 and 1999, the possibility of serving as a local repository site for either all of Elsevier Science's more than 1,100 full-text e-journal titles or at least for the 300 or so that had been identified by NERL members as the most important for their programs. Yale information technology (IT) staff experimented with a year's worth of Elsevier electronic journals, using the Science Server software for loading and providing functionality to those journals. In the end, the fuller functionality of ScienceDirect (Elsevier's commercial Web product) proved more attractive for immediate access, particularly as interlinking capabilities between publishers were developed and were more easily exploitable from the commercial site than through a local load. Nonetheless, Yale's positive experience in working with Elsevier content led our staff enthusiastically to consider Elsevier as a possible e-journal archive planning partner.
When we were invited to consider applying for a planning grant (summer 2000) we contacted Karen Hunter, Elsevier Science's Senior Vice President for Strategy, who signaled a keen personal and corporate interest in long-term electronic archiving solutions. Additional attractions for Yale in working with Elsevier related to the huge amount of important content, particularly in STM, that this publisher provides. It also seemed to us that the strong commercial motivation of a for-profit publisher made such a partnership more important and interesting, at least initially, than one with a not-for-profit publisher. That is, a for-profit entity might not feel so much of a long-term or eternal commitment to its content as would a learned society. We have learned in the course of this planning project that Elsevier has its own serious commitment in this area. With Ms. Hunter and other senior Elsevier staff, a self-identified Yale Library team began to formulate e-journal questions and opportunities, and Elsevier officers strongly supported our submission of the Fall 2000 application to Mellon.
Throughout the project, Yale and Elsevier team members have met regularly. Key topics have been identified and pursued in those meetings, as well as by phone and e-mail, through the work of small sub-groups, and through site visits by team members and visitors to our establishments. The principal lines of inquiry have been in the following areas: 1) "trigger" events; 2) a beginning exploration of the fascinating area of "archival uses," i.e., how real-life users might use the kind of archive we were proposing to develop; 3) contractual and licensing issues; 4) metadata identification and analysis, particularly through comparison and cross-mapping with datasets recommended by such players as the British Library and OCLC/RLG; and 5) technical issues, beginning with an examination of Elsevier's production and workflow processes, and with particular emphasis after summer of 2001 on building a small prototype archive based on the OAIS (Open Archival Information System) model, focusing on the archival store component. Interlaced have been the economic and sustainability questions that are crucial to all electronic preservation projects.
An Interesting Mid-Year Development: Acquisition of Academic Press (AP)
In July 2001, the acquisition by Elsevier Science of a number of Harcourt properties and their component journal imprints was announced. This real-life business event presented interesting and significant challenges, not only to Elsevier Science but also prospectively to the Yale Electronic Archive (YEA). At that time, Elsevier had not fully formulated its organizational or technical plans for the new imprints, which included not only AP (whose IDEAL service represented one of the earliest commercially-licensed groups of important scientific journals, introduced to the marketplace in 1995), but also Harcourt, Saunders, Mosby, and Churchill Livingstone, i.e., primarily medical titles which are to be integrated into a new Elsevier division called Health Science. Elsevier staff believe that the imprints of the acquired titles will survive; it is not clear whether the IDEAL electronic platform will continue or whether (more likely) it will be integrated into ScienceDirect production processes.
As a result of this acquisition, Elsevier's total number of journal titles rose from around 1,100-1,200 to about 1,500-1,600, i.e., by nearly 50 percent. In 2002, the combined publications will collectively add 240,000 new articles to Elsevier's electronic offerings, bringing the Elsevier e-archive to 1.7 million articles. Additionally, Elsevier, like many other scholarly and scientific publishers, is pursuing a program of retrospective digitization of its journals, back to Volume One, Number One. The backfiles for the years 1800-1995 (estimated at four to six million entries) will require 4.5 terabytes of storage. The years 1995-2000 require 0.75 terabytes. Production output for 2001 is estimated at 150-200 gigabytes using the new ScienceDirect OnSite (SDOS) 3.0 format, SGML, and graphics. The Harcourt titles add another 1.25 terabytes, for a grand total of nearly 7 terabytes of storage required for all Elsevier Science content back to Volume One, Number One.
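The storage total quoted above can be checked with a little arithmetic. The following is a minimal sketch in Python that sums the figures given in the report. The conversion of 1,000 gigabytes per terabyte and all names in the code are assumptions made for illustration; the report does not state which convention it uses.

```python
# Storage figures quoted in the report, in terabytes (TB).
BACKFILES_1800_1995_TB = 4.5
YEARS_1995_2000_TB = 0.75
HARCOURT_TITLES_TB = 1.25

# 2001 production output is quoted as a range, in gigabytes (GB).
PRODUCTION_2001_GB = (150, 200)

# Assumption: decimal units (1 TB = 1,000 GB); the report does not say.
GB_PER_TB = 1000


def total_storage_tb(production_gb):
    """Sum all quoted components, converting the 2001 output to TB."""
    return (
        BACKFILES_1800_1995_TB
        + YEARS_1995_2000_TB
        + HARCOURT_TITLES_TB
        + production_gb / GB_PER_TB
    )


low_tb, high_tb = (total_storage_tb(gb) for gb in PRODUCTION_2001_GB)
print(f"Estimated total: {low_tb:.2f} to {high_tb:.2f} TB")
```

Under these assumptions the total falls between 6.65 and 6.70 terabytes, which is consistent with the report's figure of nearly 7 terabytes.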
Although our e-archiving project is not yet working with non-journal electronic formats, it is worth noting that just as with Elsevier Science's publishing output, the new AP acquisitions include serials such as the Mosby yearbooks, various Advances in…, and certain major reference works. The Yale team believes that this non-journal output is worthy of attention and also poses significant challenges.
After the acquisition, Ken Metzner, AP's Director of Electronic Publishing, was invited to attend project meetings; scheduling permitted him to participate only to a limited extent. Because of the relative newness of this additional content, the YEA team participants were unable to determine the specific impacts of the acquisition on our electronic archiving pursuits. Details will become clearer during 2002. What is clear (and has been all along, though not so dramatically) is that publishers' content is fluid, at least around the edges. Titles come and go; rights are gained and lost, bought and sold. Any archive needs to consider carefully its desiderata with regard to such normally occurring business changes and how contractual obligations must be expressed to accommodate them.
In January 2001, the Mellon Foundation approved a one-year planning grant for the Yale Library in partnership with Elsevier Science. The two organizations issued a press release and named members to the planning team. The work proceeded according to certain assumptions, some of which were identified very early in the process and others that evolved as we began to pursue various lines of inquiry. While it is not always easy to distinguish, one year into the planning process, assumptions from findings, we have made an effort to do so and to detail our a priori assumptions here:
- The digital archive being planned will meet long-term needs. The team adopted a definition of "long" as one hundred or more years. One hundred years, while rather less than the life of well cared for library print collections, is significantly more than any period of time that digital media have so far lasted or needed to last. Users of an electronic archive must be secure in the knowledge that the content they find in that archive is the content the author created and the publisher published, the same kind of assurance they generally have with print collections.
- Accordingly, and given the rapid changes in technologies of all sorts, the archive has the responsibility to migrate content through numerous generations of hardware and software. Content can be defined as the data and discourse about the data that authors submit to the publisher. Content includes, as well, other discourse-related values added by the publisher's editorial process, such as revisions prompted by peer review or copy editing or editorial content such as letters, reviews, journal information and the like. Specifically, content might comprise articles and their abstracts, footnotes, and references, as well as supplementary materials such as datasets, audio, visual and other enhancements. For practical purposes, of course, the content consists of the finished product made available to readers.
- The archive will not compete with the publisher's presentation of or functionality for the same content nor with the publisher's revenue stream. Functionality is defined as a set of further value-adding activities that do not have a major impact on the reader's ability to read content but may help the reader to locate, interact with, and understand the content. The YEA project will not concern itself with reproducing or preserving the different instances of publisher's or provider's functionality, because this functionality is very mutable and appears differently based on who delivers the content (publisher, vendor, aggregator, and so on). That is, an increasing number of journals, books, and other databases are provided through more than one source and thus multiple interfaces are already available, in many instances, for the same works.
- Should the archive be seen as potentially competitive, the publisher would certainly have no incentive (and the authors might not have any incentive either) to cooperate with the archive by contributing to it, particularly content formatted for easy ingestion into the archive.
- That said, some immediate uses of the archive seem desirable. We imagined that if/when the archive — or some portion of its offerings, such as metadata — could be conceived of as having rather different and possibly more limited uses than the primary, extensive uses the publisher's commercial service provides, the archive could be deployed early in its existence for those uses.
- Once each set of data is loaded at regular, frequent intervals, the archive remains an entity separate from and independent of the publisher's production system and store. The publisher's data and the archive's data become, as it were, "fraternal twins" going their separate behavioral ways.
- Such environmental separation enables the archive's content to be managed and rendered separately and to be migrated separately — and flexibly — over long periods of time, unencumbered by the formatting and production processes that the publisher needs to deploy for its own print and electronic dissemination.
- The archive, accordingly, commits to preserving the author's content and does not make an effort to reproduce or preserve the publisher's presentation of the content, providing, at this time, basic functionality and content with no visible loss. The YEA is committed at this point in time only to a minimum "no frills" standard of presentation of content.
- Where extensive functionality is required, the YEA project assumes that functionality will be created — perhaps by the archive or by a separate contractor — when the time comes, and that the functionality will be created in standards applicable to that (present or future) time.
- The archive does not, in general, create electronic content that does not exist in the publisher's electronic offering of the same content, even though such content may exist in the printed version of the journals. The archive is not intended to mimic the printed version. (For example, if the print journal includes "letters to the editor" and the e-journal version does not include these, the e-archive will not create them.)
- The archive will likely need to create metadata or other elements to define or describe the publisher's electronic content if the publisher has not provided these data during its production processes. However, it is desirable for the archive to create as few of these electronic elements as possible. The most accurate and efficient (and cost-effective) archive will be created if the publisher creates those data. This in turn indicates strongly the need both for industry-wide electronic preservation standards and close partnerships between archives (such as libraries) and publishers.
- The archive will work with the publisher to facilitate at-source creation of all electronic elements. The archive will work with other similar publishers and archives to develop, as quickly as possible, standards for such elements, in order to deliver consistent and complete archive ingestion packages.
- The archive will develop a system to ingest content regularly and frequently. In this way, it will record information identical to that generated by the authors and publishers. Any adjustments to content after ingestion into the archive will be added and identified as such.
- At Yale, the e-journals archival system and site will be part of a larger digital store and infrastructure comprising and integrating numerous other digital items, many of which are already on site, such as internally (to the University and Library) digitized content, images, University records, born-digital acquisitions that are purchased rather than leased, finding aids, preservation re-formatting, the online catalog, and others. In this way, efficiencies and synergies can be advanced and exploited.
- The archive will regularly and frequently "test" content, preferably both with automated systems that verify accuracy and completeness and with real users seeking content from the archive.
- YEA team members assume that the archive may be searched by outside intelligent agents or harvesters or "bots," and that it must be constructed in a way that both facilitates such searching and at the same time respects the rights agreements that have been made with the copyright owners.
- "Triggers" for ingestion are frequent and immediate; "triggers" for use of the archive are a different matter and will comply with rules developed between the publisher and the archive. These triggers would be identified during the course of the planning project.
- The archive will be constructed to comply with emerging standards such as the OAIS model, certain metadata recommendations if possible, XML, and the like. Standards that enable data portability are key to YEA development.
- The archive will be developed using, wherever possible, software tools as well as concepts being created in other institutions.
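Two of the assumptions above, that ingested content is identical to what the publisher supplied and that the archive regularly "tests" content for accuracy and completeness, can be illustrated with a small fixity check. This is a minimal sketch only: the report does not specify a mechanism, so the use of SHA-256 checksums, the in-memory `store` dictionary, and all function and field names here are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ArchivedItem:
    """One ingested object plus the fixity data recorded at ingest time."""
    identifier: str
    content: bytes
    sha256: str
    ingested_at: str
    # Post-ingest adjustments are appended here and so remain identifiable.
    adjustments: list = field(default_factory=list)


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def ingest(store: dict, identifier: str, content: bytes) -> ArchivedItem:
    """Record the content together with its checksum and ingest time."""
    stamp = datetime.now(timezone.utc).isoformat()
    item = ArchivedItem(identifier, content, digest(content), stamp)
    store[identifier] = item
    return item


def audit(store: dict, expected_ids: set) -> dict:
    """List identifiers that are missing (completeness) or altered (accuracy)."""
    missing = sorted(expected_ids - set(store))
    altered = sorted(
        ident for ident, item in store.items()
        if digest(item.content) != item.sha256
    )
    return {"missing": missing, "altered": altered}


# Hypothetical demonstration with made-up identifiers and content.
store = {}
ingest(store, "article-001", b"<article>first</article>")
ingest(store, "article-002", b"<article>second</article>")
store["article-002"].content = b"damaged"  # simulate silent corruption
report = audit(store, {"article-001", "article-002", "article-003"})
print(report)  # {'missing': ['article-003'], 'altered': ['article-002']}
```

A production archive would keep checksums in durable storage separate from the content and would run such audits on a schedule, but those design decisions are beyond what the planning assumptions state.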
Notes About Archival Approaches and the Absolute Necessity for Standards in E-Archival Development
The assumptions listed above speak to four major activities of the YEA, which are 1) preservation planning, 2) administration of the archive, 3) access to the archive, and 4) ingestion of content. Like other Mellon research projects, the YEA defines activities associated with these processes within the context of the OAIS reference model of a digital archive. The model itself states that implementations will vary depending upon the needs of an archival community.
Research conducted during the planning year has identified four different approaches to preservation: emulation, migration, hard copy, and computer museums. These different approaches should not, however, be viewed as mutually exclusive. They can be used in conjunction with each other. Additionally, as we are not pursuing the present study as simply an academic exercise but rather, as a very practical investigation of what it will take to build, operate, and maintain a fully functioning production-mode digital archive, we cannot possibly discount the financial implications of the different approaches and the impact of standards on a choice of approach.
In choosing our approach, we quickly discounted both hard copy and computer museums. Hard copy decays over time and multimedia objects cannot be printed. Computer museums were also discounted as impractical. To function as an archive, computer museum equipment would need to be operational, not simply a collection of static exhibits. In turn, operation would lead to inevitable wear and tear on the equipment, with the consequential need for maintenance and repair work. When considering the repair of "antique" computer equipment, one has to ask about the source of spare parts — do all of them have to be hand-made at enormous expense? Even if money were available for such expensive work, does the "antique" equipment come with adequate diagnostic and testing equipment, wiring diagrams, component specifications, and the like, which would make the "museum" choice technically feasible in the first place? Even if the antique equipment were to receive only the most minimal usage, the silicon chips would deteriorate over time. In such a museum environment, one would question whether we would be in danger of losing our focus, ending up as a living history-of-computers museum rather than an archive of digital materials.
Rejecting the hard copy and museum options left the team with two very different approaches to the storage of content in an OAIS archive. One approach to content preservation is to store objects based upon emerging standards such as XML and then migrate them to new formats as new paradigms emerge. The other approach, advanced by and associated with one of its chief proponents, Jeff Rothenberg, is to preserve content through emulation. Both approaches have potential and value, but probably to different subcultures in the archival community. A goal of standards is to preserve the essential meaning or argument contained in the digital object. Archives that are charged with the responsibility of preserving text-based objects such as e-journals are likely to adopt a migratory approach. Archives that need to preserve an exact replication or clone of the digital objects may choose, in the future, to deploy emulation as an archival approach. Both approaches are rooted in the use of standards. Contrary to an argument advanced in a research paper by Rothenberg, YEA team members maintain that standards, despite their flaws, represent an essential component of any coherent preservation strategy adopted.
Rothenberg has criticized the use of migration as an approach to digital longevity, and he makes an insightful and enterprising case for the practice of emulation as the "true" answer to the digital longevity problem. The "central idea of the approach," he writes, "is to enable the emulation of obsolete systems on future, unknown systems, so that a digital document's original software can be run in the future despite being obsolete." Rothenberg avers that only by preservation of a digital object's context (or, simply stated, an object's original hardware and software environment) can the object's originality (look, feel, and meaning) be protected and preserved from technological decay and software dependency.
The foundation of this approach rests on hardware emulation, which is a common practice in the field of data processing. Rothenberg logically argues that once a hardware system is emulated, all else just naturally follows. The operating system designed to run on the hardware works and software application(s) that were written for the operating system also work. Consequently, the digital object behaves and interacts with the software as originally designed.
However, emulation cannot escape standards. Processors and peripherals are designed with the use of standards. If the manufacturer of a piece of hardware did not adhere 100 percent to the standard, then the emulation will reflect that imperfection or flaw. Consequently, it is never truly possible, as Rothenberg suggests it is, to construct a generalized specification for an emulator of a hardware platform. In the data processing trenches, system programmers are well acquainted with the imperfections and problems of emulation. For example, the IBM operating system MVS never ran without problems under IBM's VM operating system. It was a good emulation but it was not perfect. Another major practical problem with emulation is its financial implications. The specification, development, and testing of an emulator require large amounts of very sophisticated and expensive resources.
At this stage, the YEA team believes the most productive line of research is a migratory approach based upon standards. Standards development must, therefore, feature front and center in the next phase of e-journal archiving activities. If one listens closely to academic discourse, the most seductive adverb of all is one not found in a dictionary; it is spelled "just" and pronounced "jist" and is heard repeatedly in optimistic and transparent schemes for making the world a better place. If scientists would "jist" insist on contributing to publishing venues with the appropriate high-minded standards of broad access, we would all be better off. If users would "jist" insist on using open source operating systems like Linux, we would all be better off. If libraries would "jist" spend more money on acquisitions, we would all be better off.
Many of those propositions are undoubtedly true, but the adverb is their Achilles' heel. In each case the "jist" masks the crucial point of difficulty, the sticking point to movement. To identify those sticking points reliably is the first step to progress in any realistic plan for action. In some cases, the plan for action itself is arguably a good one, but building the consensus and the commonality is the difficulty; in other cases, the plan of action is fatally flawed because the "jist" masks not merely a difficulty but an impossibility.
It would be a comparatively easy thing to design, for any given journal and any given publisher, a reliable system of digital information architecture and a plan for preservation that would be absolutely bulletproof — as long as the other players in the system would "jist" accept the self-evident virtue of the system proposed. Unfortunately, the acceptance of self-evident virtue is a practice far less widely emulated than one could wish.
It is fundamental to the intention of a project such as the YEA that the product — the preserved artifact — be as independent of mischance and the need for special supervising providence as possible. That means that, like it or not, YEA and all other seriously aspiring archives must work in an environment of hardware, software, and information architecture that is as collaboratively developed and as broadly supported as possible, as open and inviting to other participants as possible, and as likely to have a clear migration forward into the future as possible.
The lesson is simple: standards mean durability. Adhering to commonly and widely recognized data standards will create records in a form that lends itself to adaptation as technologies change. Best of all is to identify standards that are in the course of emerging, i.e., that appear powerful at the present moment and are likely to have a strong future in front of them. Identifying those standards has an element of risk about it if we choose the version that has no future, but at the moment some of the choices seem fairly clear.
Standards point not only to the future but also to the present in another way. The well-chosen standard positions itself at a crossroads, from which multiple paths of data transformation radiate. The right standards are the ones that allow transformation into as many forms as the present and foreseeable user could wish. Thus PDF is a less desirable, though widely used, standard because it does not convert into structured text. The XML suite of technology standards is most desirable because it is portable, extensible, and transformative: it can generate everything from ASCII to HTML to PDF and beyond.
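The "crossroads" idea, one structured source from which several presentations radiate, can be shown with a very small example. The sketch below is an illustration under stated assumptions: the element names (`article`, `title`, `abstract`, `para`) are a made-up schema, not Elsevier's DTD or any published standard, and the two functions are hypothetical helpers written with Python's standard library.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up, minimal article record used only for this illustration.
SAMPLE = """<article>
  <title>Archiving Scholarly Journals</title>
  <abstract>Standards mean durability.</abstract>
  <para>Content is stored once, as structured XML.</para>
  <para>Presentations are generated from it on demand.</para>
</article>"""


def to_text(root: ET.Element) -> str:
    """Render the article as plain text, title in capitals."""
    lines = [root.findtext("title", default="").upper(), ""]
    lines.append(root.findtext("abstract", default=""))
    lines.extend(p.text or "" for p in root.findall("para"))
    return "\n".join(lines)


def to_html(root: ET.Element) -> str:
    """Render the same article as a fragment of HTML."""
    parts = [f"<h1>{escape(root.findtext('title', default=''))}</h1>"]
    abstract = escape(root.findtext("abstract", default=""))
    parts.append(f"<p><em>{abstract}</em></p>")
    parts.extend(f"<p>{escape(p.text or '')}</p>" for p in root.findall("para"))
    return "\n".join(parts)


root = ET.fromstring(SAMPLE)
plain = to_text(root)      # one stored source ...
html_page = to_html(root)  # ... two different presentations
print(plain)
print(html_page)
```

Because the structure is explicit in the source, adding a further output form means writing another small rendering function and leaves the archived content untouched. Recovering that structure from a page-oriented format is the harder direction, which is the disadvantage the text attributes to PDF.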
Makers of an archive need to be very explicit about one question: what is the archive for? The correct answer to that question is not a large idealistic answer about assuring the future of science and culture but a practical one: when and how and for what purpose will this archive be put to use? Any ongoing daily source needs to be backed up reliably, somewhere away from the risks of the live server, and that backup copy becomes the de facto archive and the basis for serious preservation activities.
Types of Archives
The team discovered during the course of its explorations that there is no single type of archive. While it is true that all or most digital archives might share a common mission, i.e., the provision of permanent access to content, as we noted in our original proposal to the Mellon Foundation, "This simple truth grows immensely complicated when one acknowledges that such access is also the basis of the publishers' business and that, in the digital arena (unlike the print arena), the archival agent owns nothing that it may preserve and cannot control the terms on which access to preserved information is provided."
In beginning to think about triggers, business models, and sustainability, the project team modeled three kinds of archival agents. The first two types of archives include a de facto archival agent, defined as a library or consortium having a current license to load all of a publisher's journals locally, or a self-designated archival agent. Both of these types are commercial transactions, even though they do not conduct their business in the same ways or necessarily to meet the same missions. The third type of archive is a publisher-archival agent partnership and the focus of our investigation. Whether this type can now be brought into existence turns on the business viability of an archive that is not heavily accessed. Project participants varied in their views about whether an archive with an as yet uncertain mission can be created and sustained over time and whether, if created, an individual library such as Yale or a wide-reaching library enterprise like OCLC would be the more likely archival partner.
Accessing the Archive
So when does one access the archive? Or does one ever access it? If the archive is never to be accessed (until, say, material passes into the public domain, which currently in the United States is seventy years plus the lifetime of the author or rights holder), then the incentives for building it diminish greatly, or at least the cost per use becomes infinite. There is talk these days of "dark" archives, that is, collections of data intended for no use but only for preservation in the abstract. Such a "dark" archive concept is at the least risky and in the end possibly absurd.
Planning for access to the e-archive requires two elements. The less clearly defined at the present is the technical manner of opening and reading the archive, for this will depend on the state of technology at the point of need. The more clearly defined, however, will be what we have chosen to call "trigger" events. In developing an archival arrangement with a publisher or other rights holder, it will be necessary for the archive to specify the circumstances in which 1) the move to the archive will be authorized, which is much easier to agree to than the point at which 2) users may access the archive's content. The publisher or rights holder will naturally discourage too early or too easy authorization, for then the archive would begin to attract traffic that should go by rights to the commercial source. Many rights holders will also naturally resist thinking about the eventuality in which they are involuntarily removed from the scene by corporate transformation or other misadventure, but it is precisely such circumstances that need to be most carefully defined.
Project participants worked and thought hard to identify conditions that could prompt a transfer of access responsibilities from the publisher to the archival agent. These conditions would be the key factors on which a business plan for a digital archive would turn. The investigation began by trying to identify events that would trigger such a transfer, but it concluded that most such events led back to questions about the marketplace for and the life cycle of electronic information that were as yet impossible to answer. Team members agreed that too little is known about the relatively young business of electronic publishing to enable us now to identify definitively situations in which it would be reasonable for publishers to transfer access responsibility to an archival agent.
Possible Trigger Events
That said, some of the possible trigger events identified during numerous discussions by the project team were:
Long-term physical damage to the primary source. Note that we have not imagined the e-journal archive to serve as a temporary emergency service. We expect formal publishers to make provision for such access. Nevertheless, in the case of cataclysmic event, the publisher could have an agreement with the archive that would allow the publisher to recopy material for ongoing use.
Loss of access or abdication of responsibility for access by the rights holder or his/her successor, or no successor for the rights holder is identified. In other words, the content of the archive could be made widely available by the archive if the content is no longer commercially available from the publisher or future owner of that content. We should note that at this point in time, we were not easily able to imagine a situation in which the owner or successor would not make provision precisely because in the event of a sale or bankruptcy, content is a primary transactional asset. But that is not to say that such situations will not occur or that the new owner might not choose to deal with the archive as, in some way, the distributor of the previous owner's content.
Lapse of a specified period of time. That is, it could be negotiated in advance that the archive would become the primary source after a negotiated period or "moving wall," of the sort that JSTOR has introduced into the e-journal world's common parlance. It may be that the "free science" movement embodied in PubMed Central or Public Library of Science might set new norms in which scientific content is made widely available from any and all sources after a period of time, to be chosen by the rights owner. This is a variant on the "moving wall" model.
On-site visitors. Elsevier, our partner in this planning venture, has agreed that at the least, its content could be made available to any onsite visitors at the archive site, and possibly to other institutions licensing the content from the publisher. Another possibility is provision of access to institutions that have previously licensed the content. This latter option goes directly to development of financial and sustainability models that will be key in Phase II.
Archival Uses. Elsevier is very interested in continuing to explore the notion of so-called "archival uses," which are very different from the uses made by current subscribers in support of today's active science. Elsevier has stated that if we can identify such "archival uses," it might be willing to consider opening the archive to those. Some such uses might be studies in the history, sociology, or culture of sciences, for example. This thread in our planning processes has motivated the YEA team to devote some time to early exploration of archival uses with a view to expanding and deepening such exploration in Phase II.
Metadata Uses. In the course of preservation activity it could be imagined that new metadata elements and structures would be created that would turn out to have use beyond the archive. Appropriate uses of such data would need to be negotiated with the original rights holder.
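Some of the trigger events listed above lend themselves to being expressed as explicit rules. The following is a hedged sketch rather than an agreed policy: the report does not fix any of these rules, the five-year moving wall is an invented figure, and the `AccessRequest` fields and function names are hypothetical. Actual conditions would be negotiated between the archive and the rights holder.

```python
from dataclasses import dataclass
from datetime import date

MOVING_WALL_YEARS = 5  # hypothetical value; the report sets no figure


@dataclass
class AccessRequest:
    publication_date: date
    request_date: date
    onsite_visitor: bool = False
    publisher_still_provides: bool = True
    archival_use_approved: bool = False


def wall_has_passed(published: date, today: date,
                    years: int = MOVING_WALL_YEARS) -> bool:
    """True once at least `years` full years separate the two dates."""
    # Tuple comparison avoids building an invalid date such as Feb 29.
    shifted = (today.year - years, today.month, today.day)
    return shifted >= (published.year, published.month, published.day)


def access_triggers(req: AccessRequest) -> list:
    """Return the trigger events that would apply to this request."""
    reasons = []
    if not req.publisher_still_provides:
        reasons.append("loss of access from rights holder")
    if wall_has_passed(req.publication_date, req.request_date):
        reasons.append("moving wall elapsed")
    if req.onsite_visitor:
        reasons.append("on-site visitor")
    if req.archival_use_approved:
        reasons.append("approved archival use")
    return reasons


recent = AccessRequest(date(2000, 1, 15), date(2001, 12, 1))
older = AccessRequest(date(1995, 3, 1), date(2001, 12, 1), onsite_visitor=True)
print(access_triggers(recent))  # []
print(access_triggers(older))   # ['moving wall elapsed', 'on-site visitor']
```

An empty list means that no trigger applies and the request would be referred to the publisher's own service. Returning the reasons, and not a bare yes or no, keeps a record of which negotiated condition opened the archive in each case.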
Economic considerations are key to developing systems of digital archives. Accordingly, in our proposal, the Yale Library expressed its intention better "to understand the ordinary commercial life cycle of scientific journal archives..." In that proposal, our list of additional important questions included concerns about costs of creating and sustaining the archive, as well as sources of ongoing revenues to support the archive. While the issues of sustainability lurked in our thinking throughout the project, we determined relatively early on that the time was not right to address these matters substantively, because we had as yet insufficient data and skill to make any but the very broadest of generalizations. But that lack of hard data did not stop the group from discussing and returning frequently to economic matters.
Neither were the views of various individuals and organizations of definitive help to us. For example, the best study about e-archiving known to us attempted to analyze costs, but the information is somewhat dated. A large school of thought affirms that e-archives and even e-journal archives will be immensely expensive to develop and maintain, perhaps impossibly so. Some of the arguments include:
Huge Costs. Formal publishers' e-journal titles, i.e., those presented in fairly "standard" formats, will be very costly to archive because even those publishers do not provide clean, consistent, fully tagged data. Accordingly, the e-archive will have to perform significant repair and enhancement, particularly in the ingestion process; e.g., the creation of the Submission Information Package (SIP) will be particularly expensive. Furthermore, this reasoning goes, as the size, variety, and complexity of the content increases, associated costs will rise, as they will whenever formats need to be migrated and as storage size increases.
The universe of e-journals — which includes a great volume as well as diversity of subjects and formats, including Web sites, newsletters, dynamic publications, e-zines, and scholarly journals, and includes a huge variety of possible technical formats — will surely be difficult and costly to archive when one considers that universe as a whole.
Information Will Be Free. On the other hand, a great deal of today's "popular" scientific literature, promulgated by working scientists themselves, argues that electronic archiving is very cheap indeed. Proponents of this optimistic line of argument reason that colleges, universities, research laboratories, and the like already support the most costly piece of the action: that electronic infrastructure comprises computers, internal networks, and fast links to the external world, and institutions are obligated in any case aggressively to maintain their investments and frequently to update them. That being the case, the reasoning is that willing authors can put high quality material "out there," leaving it for search engines and harvesters to find. In such arguments, the value-adding services heretofore provided by editors, reviewers, publishers, and libraries are doomed to obsolescence and are withering away even as this report is being written.
Our guess is that the "truth" will be found to lie in between those two polarities, but of course that guess is a little glib and perhaps even more unfounded than the above arguments.
Even though during the planning year we were unable to make economic issues a topic of focused inquiry, we have begun to develop specific and detailed costs for building the YEA for e-journals in preparation for the next granting phase, and those calculations are starting to provide us with a sense of scale for such an operation. In addition, throughout the year, team members articulated certain general views about the economics of e-journal archives, which we share here below.
Five Cost Life-Cycle Stages of an e-Journal Archive
The task of archiving electronic journals may be divided into five parts: the difficult part (infrastructure development and startup), the easier part (maintenance), the sometimes tricky part (collaborations and standards), the messy part (comprehensiveness), and the part where it becomes difficult again (new technologies, migration).
- The difficult part (development and startup). Initial electronic archiving efforts involve such activities as establishing the data architecture, verifying a prototype, validating the assumptions, and testing the adequacy of the degree of detail of realization. The magnitude and complexity of the issues and the detail involved in e-journal archiving are considerable. That said, it does not lie beyond the scope of human imagination, and the big lesson we have learned in this planning year is that it is indeed possible to get one's arms around the problem, and that several different projects have discovered more or less the same thing in the same time period. In fact, Yale Library is already involved in other types of archiving projects related to several other digital initiatives. The greatest difficulties do not lie in having to invent a new technology, nor do they lie in coping with immense magnitudes. Rather, they reside in resolving a large, but not unimaginably large, set of problems in an adequate degree of detail to cope with a broad range of possibilities.
- The easier part (ongoing maintenance and problem resolution). Where we are encouraged is in believing that once the first structure-building steps have been taken, the active operationalization and maintenance of an e-journal archiving project, in partnership with one or more well-resourced and cooperative publishers, can become relatively straightforward, particularly as standards develop to which all parties can adhere. There will be costs, but after start-up many of these will be increasingly marginal costs to the act of publishing the electronic journal in the first place. For new data being created going forward, attaching appropriate metadata and conforming to agreed standards will require up-front investment of time and attention, especially retrofitting the first years of journals to standards newly enacted, but once that is done, the ongoing tasks will become more transparent. In theory, the hosting of the archive could be part and parcel of the operational side of the publishing, and the servers and staff involved in that case would most likely be the same ones involved in the actual publication. Alternatively, as we imagine it, the long-term archiving piece of business will be taken aboard by existing centers distributed among hosting universities with similar synergies of costs.
- The tricky part (collaboration and standards). Because different people and organizations in different settings have been working on electronic preservation issues for the last few years, there may already be appreciable numbers of similar but nonidentical sets of solutions coming to life. Working around the world to build sufficient communities of interest and standards to allow genuinely interoperable archives and real standards will take a great deal of "social work." Every archive will continue to devote some percentage of its operation to external collaborations driven by the desire to optimize functional interoperability.
- The messy part (comprehensiveness). There will be a fair number of journals that either choose not to cooperate or are financially or organizationally ill-equipped to cooperate in a venture of the scope imagined. It will be in the interest of the library and user communities generally to identify those under-resourced or recalcitrant organizations and find the means — financial, organizational, political — to bring as many of them aboard as possible. It may prove to be the case that 90 percent of formal publishers' journals can be brought aboard for a modest price, and the other 10 percent may require as much money or more to come in line with the broader community.
- The part where it becomes difficult — and probably very expensive — again (migration). The solutions we now envision will sustain themselves only as long as the current technical framework holds. When the next technological or conceptual revolution gives people powers of presentation they now lack and that do not allow themselves to be represented by the technical solutions we now envision, then we will require the next revolution in archiving. The good news at that point is that some well-made and well-observed standards and practices today should be able to be carried forward as a subset of whatever superset of practices need to be devised in the future. Elsevier Science has a foretaste of this in its current, very costly migration to XML.
Needless to say, the above overview is somewhat simplified. For example, in our planning year, we were surprised to find just how few of the 1,100+ Elsevier e-journal titles carried complex information objects, compared to what we expected to find. Complex media, data sets, and other electronic-only features exist that have yet to find their place as regular or dominant players in e-journals, and creating ways to deal with these types of digital information (let alone standard ways) will be costly, as are all initial structural activities (see "The difficult part" above).
Cost-Effective Collaboration and Organization for e-Archiving
That said, it appears that willing collaborators have yet a little time both to address and to solve the hefty problems of presenting and archiving complex digital information objects. To archive a single e-journal or small set of journals is to do relatively little. But to develop standards that will serve e-preservation well — let alone to facilitate access to the most simple of e-archives that begin to bloom like a hundred flowers — all the players will need to work together. We imagine an aggregation of archiving efforts, whether in physical co-location or at least virtual association and coordination.
But how might such archival universes be organized?
- Archives could be subject-based, arranged by discipline and subdiscipline. Such an arrangement would allow some specialization of features, easier cross-journal searching, and creation of a community of stakeholders.
- Archives could be format-based. This arrangement would probably overlap with subject-based arrangement in many fields, would be easier to operate and manage, but would sacrifice at least some functionality for users — an important consideration, given that archival retrieval is likely to occur in ways that put at least some demand on users to navigate unfamiliar interfaces.
- Archives could be publisher-based. Such an arrangement would offer real conveniences at the very outset, but would need close examination to assure that standards and interoperability are maintained against the natural interest of a given rights holder to cling to prerogatives and privileges.
- Archives could be nationally-based. Australia, Japan, Canada, Sweden, and other nations could reasonably feel that they have a mission to preserve their own scientific and cultural products and not to depend on others.
- Archives could be organized entrepreneurially by hosts. This is probably the weakest model, inasmuch as it would create the least coherence for users and searching.
Each of these alternate universes has its own gravitational force and all will probably come into existence in one form or another. Such multiplicity creates potentially severe problems of scalability and cost. One remedy could be for official archives to operate as service providers feeding other archives. Hence, a publisher's agreed archive could feed some of its journals to one subject-based archive and others to national archives.
One way to begin to anticipate and plan for this likely multiplicity would be to create a consortium now of interested parties to address the difficult issues such as redundancy, certification, economic models, collection of fees, standards, and so on. No one organization can solve these problems alone, but coordination among problem-solvers now and soon will be very cost-effective in the long run. In OCLC's proposal to create a digital preservation cooperative and, on a larger scale, in the Library of Congress's recent National Digital Information Infrastructure and Preservation Program, we may be seeing the emergence of such movements. It may be possible to turn the Mellon planning projects into such an overarching group of groups.
Who Will Pay and How Will They Pay?
No preservation ambitions will be realized without a sustainable economic model. As we have noted above, the costs of archiving are much in dispute and our study will examine those costs in great detail in the next phase. For now, it would appear that the initial costs are high, although manageable, and the ongoing costs, at least for standard publisher's journals, could be relatively predictable and eventually stable over time.
If that is true, then various models for paying for the archiving process suggest themselves. This is an area about which there has been much soft discourse but in which there has been little experience, save perhaps for JSTOR whose staff have given the topic a great deal of thought.
Up-front payment. The most dramatic and simple way to finance the e-journal archives would be the "lifetime annuity model": that is, users (presumably institutional entities, such as libraries, professional societies, governments, or cultural institutions, but some speak of enhanced "page charges" from authors or other variants on current practices) pay for a defined quantum of storage and with that one-time payment comes an eternity of preservation. The up-front payment would be invested partly in ongoing archival development and partly in an "endowment" or rainy day fund. The risk in this case is that inadequate funding may lead to future difficulties of operation.
Ongoing archival fees. An "insurance premium" on the other hand could give an ongoing supply of money, adjustable as costs change, and modest at all stages. This reduces the risk to the provider but increases the uncertainty for the beneficiary. The ongoing fee could be a visible part of a subscription fee or a fee for services charged by the archive.
The traditional library model. The library (or museum or archive) picks up the tab and is funded by third-party sources.
Fee for services operation. The archive provides certain services (special metadata, support for specialized archives) in return for payments.
Hybrid. If no single arrangement seems sufficient — as it likely will not — then a hybrid system likely will emerge, perhaps with one set of stakeholders sharing the up-front costs while another enters into agreement to provide ongoing funding for maintenance and potential access.
Much more could be said on the topic of who pays but at the moment most of it would be speculation. The choice of models will influence development of methods for paying fees and the agents who will collect those fees. Before making specific recommendations it will be important for our project to develop a much more specific sense of real costs of the e-archive. We imagine that we might want to develop both cost and charging models in conjunction with other libraries, i.e., prospective users of the archive. In Yale's case the collaborative effort might happen with our local electronic resource licensing consortium NERL.
Publishers and librarians have reluctantly grown accustomed to having licenses that articulate the terms and conditions under which digital publications may be used. These licenses are necessary because in their absence the uses to which digital files could be put would be limited by restrictions (and ambiguities) on reproduction and related uses that are intrinsic within copyright law. Licenses clarify ambiguities and often remove, or at least significantly reduce, limitations while also acknowledging certain restrictions on unlimited access or use.
A licensing agreement between a digital information provider and an archival repository presents several unique challenges not generally faced in the standard licensing agreement context between an information provider and an end-user. Discussed below are several of the issues that must be addressed in any final agreement:
- Term and termination. The intended agreement is perpetual in nature, but even "forever" is in fact a relative rather than an absolute term. One has to think in funereal terms of "perpetual care" and of the minimum length of time required to make an archiving agreement reasonable as to expectations and investments. Some issues that need to be addressed are the appropriate length of any such agreement, as well as provisions for termination of the agreement and/or "handing off" the archive to a third party. Underlying the concerns of term and termination is the need to ensure that the parties' investments in the archive are sufficiently protected and that the materials are sufficiently maintained and supported.
- Sharing responsibility between the archive and the digital information provider. There are elements of a service level agreement that must be incorporated into the license because the rights and responsibilities are different in an archival agreement than in a normal license. That is, an archive is not the same as a traditional end-user; in many ways the archive is stepping into the shoes of the digital information provider in order (eventually) to provide access to end-users. The rights and responsibilities of the archive will no doubt vary depending on when the material will become accessible and on whether there are any differentiations between the level and timing of access by end-users. This issue will have an impact on the level of technical and informational support each party is required to provide to end-users and to each other, as well as on responsibility for content (including the right to withdraw or change information in the archive) and on responsibilities concerning protecting against the unauthorized use of the material.
- Level and timing of access. While all licenses describe who are the authorized users, the parties to an archival agreement must try to anticipate and articulate the circumstances (i.e., "trigger events") under which the contents of the archive can be made available to readers, possibly without restriction. When the information will be transmitted to the archive and, more importantly, how that information is made available to end-users are also critical questions. Several models have been discussed and this may be an issue best addressed in detailed appendices reflecting particular concerns related to individual publications.
- Costs and fees. The financial terms of the agreement are much different from those of a conventional publisher-user license. Though it is difficult to conceive of one standard or agreed financial model, it is clear that an archival agreement will have a different set of financial considerations from a "normal" license. Arrangements must be made for the recovery of costs for services to end-users, as well as any sharing of costs between the archive and the digital information provider. These costs may include transmission costs, the development of archive and end-user access software, and hardware and other costs involved in preserving and maintaining the data.
- Submission of the materials to the archive. The issues of format of the deposited work ("submission") take on new considerations as there is a need for more information than typically comes with an online or even locally-held database. Describing the means for initial and subsequent transfers of digital information to the archive requires a balance between providing sufficient detail to ensure all technical requirements for receiving and understanding the material are met, while at the same time providing sufficient flexibility for differing technologies used in storing and accessing the materials throughout the life of the contract. One means of dealing with the submission issues is to provide in the agreement general language concerning the transmission of the materials, with reference to appendices that can contain precise protocols for different materials in different time periods. If detailed appendices are the preferred method for dealing with submission matters, mechanisms must be developed for modifying the specifics during the life of the agreement without triggering a formal renegotiation of the entire contract.
- Integrity of the archive. The integrity and comprehensiveness of the archive must be considered. The contract must address the question: "If the publisher 'withdraws' a publication, is it also withdrawn from the archive?"
YEA and Elsevier Science have come to basic agreement on what they would be comfortable with as a model license. In some areas alternatives are clearly available and other archival agencies working with other publishers will choose different alternatives. Reaching a general agreement was, however, surprisingly easy as the agreement flowed naturally out of the year-long discussions on what we were trying to accomplish. The current draft license is not supplied in this document because it has a number of "unpolished" areas and some unresolved details, but it could be submitted and discussed upon request.
The team made certain choices with regard to the contractual issues noted above:
- Term. The team opted for an initial ten-year term with subsequent ten-year renewals. This provides the library with sufficient assurance that its investments will be protected and assures the publisher that there is a long-term commitment. The team also recognized that circumstances can change and has attempted to provide for what we hope will be an orderly transfer to another archival repository.
- Rights and responsibilities. The agreement includes statements of rights and responsibilities that are quite different from a traditional digital license. The publisher agrees, among other things, to conform to submission standards. The library agrees, among other things, to receive, maintain, and migrate the files over time.
- Trigger events. Discussions of "trigger events" provided some of the most interesting, if also frustrating, aspects of the year. In the end, the only trigger event that all completely agreed upon was that condition under which the digital materials being archived were no longer commercially available either from the original publisher or from someone who had acquired them as assets for further utilization. Given that it is quite hard to imagine a circumstance in which journal files of this magnitude would be judged to have no commercial value and would not be commercially offered, does it make sense to maintain such an archive at all? Will money be invested year after year as a precaution or protection against an event that will never occur? Though the team agreed it is necessary to proceed with long-term electronic archival agreements, clearly serious issues are at stake.
The team also identified a second side to the trigger question: if the archive were not going to be exposed to wide use by readers, how could the archival agent "exercise" it in order to assure its technical viability? This topic is discussed more fully in the "Trigger Events" section of the report. Briefly here, the team was concerned that a totally dark archive might become technically unusable over time and wanted to provide agreed-upon applications that would make the archive at least "dim" and subject to some level of use, e.g., available to local authorized users. The second, perhaps more important, notion was that there would be archival uses that could be distinguishable from normal journal use. The team tried to identify such uses but so far has not received the feedback from the history of science community (for example) that we would have wished. Therefore, "archival uses" remain more theory than reality, but at the same time they represent a topic we are committed to exploring in the next phase of work. An alternative would be to have the archive serve as a service provider to former subscribers, but this changes the nature of the archive to being a "normal" host, which could be a questionable consideration. These issues are not currently reflected in the draft license.
- Financial terms were viewed as neutral at this time, i.e., no money would change hands. In our current thinking, the publisher provides the files without charge and the archival agency accepts the perpetual archiving responsibility without financing from the publisher. Obviously, one could argue that the publisher should be financing some part of this activity. However, in the longer term it is probably more realistic to develop alternative financing arrangements that are independent of the publisher.
- Technical provisions. Early on, the team agreed on the OAIS model for submission and subsequent activities. The license reflects this in terms of the need to define metadata provided by the publisher. The specific metadata elements have not yet been finalized, however. This is also relevant in defining what use can be made by the archive of the metadata. Publishers such as Elsevier that have secondary publishing businesses want to be sure that those businesses are not compromised by an archive distributing abstracts for free, for example. The model license does not yet reflect this point but it is recognized as an issue.
- Withdrawal of content. The current draft license provides for appropriate notices when an item is withdrawn by the publisher. The team has discussed and will likely incorporate into the license the notion that the archive will "sequester" rather than remove a withdrawn item.
The model license is still evolving and not yet ready for signature. However, there are no identified points of contention — only points for further reflection and agreement on wording. All the participants were very much pleased with the team's ability to come to early understandings of licensing issues and to resolve some of these at the planning stage. This success arises out of close working relationships and communications over about a year-and-a-half of cooperative effort.
As part of its work, the Yale-Elsevier team began to investigate whether and how the uses of an archive of electronic journals would differ significantly from those of the active product distributed by the publisher. This investigation was launched to help determine what needed to be preserved and maintained in the archive; to inform the design of a discovery, navigation, and presentation mechanism for materials in the archive; and to determine the circumstances under which materials in the archive could be made available for research use without compromising the publisher's commercial interests.
The group reviewed traditional archival theory and practice and began preliminary consultations with historians of science and scholarly communication to understand past and contemporary uses of scientific journal literature. A number of issues became particularly significant in the group's discussions: the selection of documentation of long-term significance, the importance of topological and structural relationships within the content, and the importance of the archive as a guarantor of authenticity.
Selection and Appraisal
The first area in which there might be useful approaches is that of archival appraisal, i.e., the selection of those materials worth the resources needed for their long-term preservation and ongoing access. Archival appraisal considers the continuing need of the creating entity for documentation in order to carry out its mission and functions and to maintain its legal and administrative accountability, as well as other potential uses for the materials. These other uses generally fall into the category of support for historical research, although there may be others such as establishing and proving the existence of personal rights which may also be secondary to the original purpose of the documentation in question.
Archivists also consider the context of the documentation as well as its content in determining long-term significance. In some cases, the significance of the documentation lies in the particular content that is recorded; the presentation of that content is not critical to its usefulness or interpretation. The content of the documentation can be extracted, put into other applications, and made to serve useful purposes even as it is divorced from its original recording technology and form. In other cases, however, the role of documentation as evidence requires that the original form of the document and information about the circumstances under which it was created and used also be preserved in order to establish and maintain its authenticity and usefulness.
With these selection approaches in mind, a number of issues arose in the e-journal archiving context and in the work of the team. The first question was whether it was sufficient for the archive to preserve and provide access to "just" the content of the published material (primarily text and figures) in a standard format, which might or might not be the format in which the publisher distributed the content. Preserving only the content, insofar as that is possible, forgoes the preservation of any functionality that controlled and facilitated the discovery, navigation, and presentation of the materials, on the assumption that functionality was of little or no long-term research interest. The decision to preserve content only would eliminate the need to deal with changing display formats, search mechanisms and indices, and linking capabilities.
While the group has adopted this narrow definition of the scope of the archive as a working assumption, such a narrow approach does preclude the study of the diplomatics of these documents — "digital paleography," as one of our advisors termed it. How essential to future researchers' interpretations of the use of these documents is it for them to know what tools contemporary users had available to them, e.g., indices that did not address particular components of the document, thus making them unfindable through the publisher's interface? At the conclusion of the planning period the team had not changed the main focus of its attention on content, but it was sufficiently intrigued by the issues of digital paleography that it will propose that this assumption be investigated more thoroughly in its implementation proposal.
The long-held approach in the archival profession governing how archives are organized, described, and provided to users once they become part of the repository's holdings is deeply informed by the principle of provenance and the rules that flow from it: respect des fonds (records of a creator should remain together) and original order (which has significance for the interpretation of records and should be preserved whenever possible). These principles reflect the nature of archival records. They are by-products created by an organizational entity in the course of carrying out its functions. The primary significance of the records is as evidence of those functions and activities. These principles reflect the needs of research for bodies of materials that are as strongly evidential as possible and reflect minimal interaction by custodial agencies other than the creator. The assumption is that solid historical research and interpretation require knowledge of the circumstances under which the materials were created and maintained and not just access to the raw content.
Access to archival materials is often characterized by two factors that take advantage of the provenance approach. Searches are often conducted to document a particular event or issue rather than for a known item; they may also be based on characteristics of the creators rather than on characteristics of the records themselves. Comprehensive and accurate recording of the circumstances of creation, including characteristics of records creators and the relationships among them, are central parts of archival description. The implications for developing an approach to downstream uses of e-journal literature include the potential need of contextual metadata regarding the authors and other circumstances affecting the publication of a given article/issue that are not found in a structured way in the published materials. Information regarding the context in which the article was submitted, reviewed, and edited for publication is important in studies of scholarly communication, especially as to questions of how institutional affiliations might be important in certain lines of inquiry and who had the power to accept or reject submissions.
Some of this information is explicitly disseminated in online products, e.g., in the form of members of an editorial board or descriptions of the purpose and audience of the journal, but it may be presented separately from any particular volume, issue, or article; may reflect only current (and not historic) information; and is rarely structured or encoded in such a way as to facilitate its direct use in scholarly studies. Other information about the context of creation and use that historians of science might find useful is not published; rather, it is found in the publisher's records of the review process and circulation figures. Capturing and linking of title-level publication information are additional areas of investigation that the team intends to pursue in its implementation proposal.
Preservation of Structural Information
The mass of archival records that repositories select for long-range retention and are responsible for, and the imperative of the principle of provenance to maintain and document the recordkeeping system in which the records were created and lived, combine to foster the archival practice of top-down, hierarchical, and collective description. This type of descriptive practice provides a way both of reflecting the arrangement of the original recordkeeping system and of allowing the archival agency to select, for each body of records, the level beyond which the costs of description outweigh the benefits of access, and to complete its descriptive work just before that point is reached.
This principle and practice highlight for scientific journals the importance of preserving the relationship among the materials that the publisher was distributing, especially the need to link articles that the publisher presented as a "volume," "special issue," or some other sort of chronological or topical grouping. These relationships represent another form of contextual information important to the study of scholarly communications, in terms of which articles were released simultaneously or in some other relationship to each other. While the team recognized the need to be aware of new forms of publishing that would not necessarily follow the traditional patterns adopted by the hard-copy print world, it asserted that those structures do need to be saved as long as they are used.
With respect to other methods of navigating among digitally presented articles, such as linking to articles cited, the team found that many of these capabilities existed not as part of the content, but as added functionality that might be managed by processes external to the content or to the publisher's product (e.g., CrossRef). The team felt that these capabilities should be preserved as part of the archive, which requires maintaining an enduring naming scheme for unambiguous identification of particular pieces. The plan for the implementation project will include a closer look at the requirements for supporting important navigational capabilities.
Finally, the authenticity of any document that purports to be evidence rests in some part on a chain of custody that ensures that the document was created as described and that it has not been altered from its original form or content. Once an archival agency takes charge of documentation it is obligated to keep explicit records documenting the circumstances of its transfer or acquisition and any subsequent uses of it. Records are rarely removed, either for use or retrospective retention by an office, but when this is necessary the circumstances of that action need to be documented and available. This assumption, along with the unique nature and intrinsic value of the materials, leads to the circumstance of secure reading rooms for archival materials and all of the security paraphernalia associated with them, as well as to detailed recordkeeping of use and work performed on the records.
The assumption that the archival agency is responsible for preserving the "authentic" version of documentation suggests that transfer of content to the official archival agency should take place as soon as the publisher disseminates such content, and that once placed into the archive content will not be modified in any way. This includes instances of typographical errors, the release of inaccurate (and potentially dangerous) information, or the publication of materials not meeting professional standards for review, citation, and similar issues. Instead, the archive should maintain a system of errata and appropriate flagging and sequestering of such materials that were released and later corrected or withdrawn, ensuring that the record of what was distributed to the scholarly community, however flawed, would be preserved.
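The policy described above (deposited content is never modified; corrections and withdrawals are recorded alongside it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names, the status values, and the notice format are hypothetical and are not drawn from the YEA design or the draft license.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    AVAILABLE = "available"
    SEQUESTERED = "sequestered"  # withdrawn by the publisher, but still preserved


@dataclass
class Deposit:
    """Hypothetical archive entry: the content plus the notices attached to it."""
    content: bytes
    status: Status = Status.AVAILABLE
    notices: list = field(default_factory=list)


class Archive:
    """Hypothetical write-once store; not the YEA implementation."""

    def __init__(self):
        self._items = {}

    def deposit(self, identifier: str, content: bytes) -> None:
        # Content is accepted once and never overwritten afterwards.
        if identifier in self._items:
            raise ValueError("deposited content is never overwritten")
        self._items[identifier] = Deposit(content)

    def add_erratum(self, identifier: str, note: str) -> None:
        # A correction is recorded next to the original, not applied to it.
        self._items[identifier].notices.append(("erratum", note))

    def withdraw(self, identifier: str, reason: str) -> None:
        # Withdrawal flags and sequesters the item; the bytes remain intact.
        item = self._items[identifier]
        item.status = Status.SEQUESTERED
        item.notices.append(("withdrawn", reason))

    def get(self, identifier: str) -> Deposit:
        return self._items[identifier]
```

In this sketch a withdrawn article remains retrievable by the archive itself, with its status and the reason recorded, so the record of what was distributed survives even when the item is no longer exposed.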
Issues related to authenticity also suggest that one circumstance under which transferred content could be released, even while the publisher retains a business interest in it, is when questions are raised as to the authenticity of content still available under normal business arrangements. Longer-term safeguards will need to be in place within the archival repository to ensure the authenticity of the content.
Other issues relating to the nature and mission of an archival repository appear elsewhere in this report, especially in the discussion of trigger events. The issues discussed in this section, however, are especially germane to the question of how anticipated use of preserved electronic journals should inform the selection of materials. The Yale-Elsevier team has found many archival use topics central to the definition and purpose of an archive for electronic journals and plans to pursue them more completely in the implementation project.
The Role of Metadata in an e-Archive
It is impossible to create a system designed to authenticate, preserve, and make available electronic journals for an extended period of time without addressing the topic of metadata. "Metadata" is a term that has been used so often in different contexts that it has become somewhat imprecise in meaning. Therefore, it is probably wise to begin a discussion of metadata for an archival system by narrowing the array of possible connotations. In the context of this investigation, metadata makes possible certain key functions:
- Metadata permits search and extraction of content from an archival entity in unique ways (descriptive metadata). Metadata does this by describing the materials (in our case journals and articles) in full bibliographic detail.
- Metadata also permits the management of the content for the archive (administrative metadata) by describing in detail the technical aspects of the ingested content (format, relevant transformations, etc.), the way content was ingested into the archive, and activities that have since taken place within the archive, thereby affecting the ingested item.
Taken together, both types of metadata facilitate the preservation of the content for the future (preservation metadata). Preservation ensures the retrievability of protected materials, their authentication, and their content.
Using metadata to describe the characteristics of an archived item is important for a number of reasons. With care, metadata can highlight the sensitivity to technological obsolescence of content under the care of an archival agency (i.e., items of a complex technical nature that are more susceptible to small changes in formats or browsers). Metadata can also prevent contractual conflicts by pinpointing issues related to an archived item's governance while under the care of an archive; e.g., "the archive has permission to copy this item for a subscriber but not for a nonsubscriber." Finally, metadata can permit the archival agency to examine the requirements of the item during its life cycle within the archive; e.g., "this object has been migrated four times since it was deposited and it is now difficult to find a browser for its current format."
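The distinction between descriptive and administrative metadata, and the two example questions quoted above, can be illustrated with a minimal sketch. The class names, field names, and sample values below are assumptions made for illustration only; they are not taken from the YEA element set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptiveMetadata:
    # Bibliographic description used for search and extraction.
    journal_title: str
    article_title: str
    authors: List[str]
    year: int


@dataclass
class MigrationEvent:
    # One format migration performed inside the archive.
    date: str
    from_format: str
    to_format: str


@dataclass
class AdministrativeMetadata:
    # Technical and governance details used to manage the item.
    current_format: str
    ingest_date: str
    copy_allowed_for: List[str]  # hypothetical user classes
    migrations: List[MigrationEvent] = field(default_factory=list)


@dataclass
class ArchivedItem:
    descriptive: DescriptiveMetadata
    administrative: AdministrativeMetadata


def may_copy_for(item: ArchivedItem, user_class: str) -> bool:
    """Answer the governance question: may the archive copy this item for this user?"""
    return user_class in item.administrative.copy_allowed_for


def migration_count(item: ArchivedItem) -> int:
    """Answer the life-cycle question: how often has this item been migrated?"""
    return len(item.administrative.migrations)


# Invented sample record, for illustration only.
example_item = ArchivedItem(
    descriptive=DescriptiveMetadata(
        journal_title="Example Journal",
        article_title="An Example Article",
        authors=["A. Author"],
        year=2001,
    ),
    administrative=AdministrativeMetadata(
        current_format="format-C",
        ingest_date="2001-10-01",
        copy_allowed_for=["subscriber"],
        migrations=[
            MigrationEvent("2005-01-01", "format-A", "format-B"),
            MigrationEvent("2010-01-01", "format-B", "format-C"),
        ],
    ),
)
```

In this sketch the rights statement and the migration history both live in the administrative record, so neither question requires opening the content object itself.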
The Open Archival Information System (OAIS) model to which the YEA project has chosen to conform refers to metadata as preservation description information (PDI). There are four types of PDI within OAIS: 1) reference information, 2) context information, 3) provenance information, and 4) fixity information. Not all of these forms of PDI need be present in the Submission Information Package (SIP) ingested by the archive, but they all must be a part of the Archival Information Package (AIP) stored in the archive. This implies that some of these PDI elements are created during ingestion or input by the archive.
Reference Information refers to standards used to define identifiers of the content. While YEA uses reference information and supplies this context in appendices to our metadata element set, we do not refer to it as metadata. Context Information documents the relationships of the content to its environment. For YEA, this is part of the descriptive metadata. Provenance Information documents the history of the content including its storage, handling, and migration. Fixity Information documents the authentication mechanisms and provides authentication keys to ensure that the content object has not been altered in an undocumented manner. Both Provenance and Fixity are part of administrative metadata for YEA.
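The relationship between the four PDI types, the YEA metadata categories, and the SIP and AIP can be sketched as follows. This is an illustrative sketch only: representing a package as a dictionary, and the function names `missing_pdi` and `build_aip`, are assumptions and are not part of OAIS or of the YEA design.

```python
# The four PDI types named by OAIS.
PDI_TYPES = ("reference", "context", "provenance", "fixity")

# How YEA classifies each PDI type, as described in the text above.
YEA_CATEGORY = {
    "reference": "appendices to the element set (not treated as metadata)",
    "context": "descriptive metadata",
    "provenance": "administrative metadata",
    "fixity": "administrative metadata",
}


def missing_pdi(package: dict) -> list:
    """Return the PDI types that are absent or empty in a package."""
    return [pdi for pdi in PDI_TYPES if not package.get(pdi)]


def build_aip(sip: dict, archive_supplied: dict) -> dict:
    """Combine publisher-supplied PDI with PDI created by the archive.

    A SIP may omit some PDI types, but the resulting AIP must carry all four.
    Where both sources supply a value, the archive's value wins (an assumption).
    """
    aip = {**sip, **archive_supplied}
    gaps = missing_pdi(aip)
    if gaps:
        raise ValueError("AIP is missing PDI: " + ", ".join(gaps))
    return aip
```

For example, a SIP that carries only reference and context information becomes a valid AIP once the archive adds provenance and fixity information during ingestion.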
Given the focus YEA has chosen to place on a preservation model that serves as an archive as well as a guarantor for the content placed in its care, authenticity was an issue of importance for the group to explore. In its early investigations, the team was much struck by the detailed analysis of the InterPARES project on the subject of authenticity. While some of the InterPARES work is highly specific to records and manuscripts — and thus irrelevant to the journal archiving on which YEA is focusing — some general principles remain the same. It is important to record as much detail as possible about the original object brought under the care of the archive in order both to prove that a migrated or "refreshed" item is still the derivative of the original and to permit an analysis to be conducted in the future about when and how specific types of recorded information have changed or are being reinvented, or where totally new forms are emerging.
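One way to document that a migrated item derives from the original is to record a digest of the source and of the result at every migration, so that the chain can later be checked. The sketch below assumes SHA-256 as the digest and a simple list of dictionaries as the provenance log; neither choice comes from the YEA project, and the function names are hypothetical.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return a hex digest used as a fixity value (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def record_migration(provenance: list, source: bytes, derived: bytes, note: str) -> None:
    """Append a provenance event linking a source object to its migrated form."""
    provenance.append(
        {
            "source_sha256": fixity_digest(source),
            "derived_sha256": fixity_digest(derived),
            "note": note,
        }
    )


def traces_to_original(provenance: list, original_digest: str, current: bytes) -> bool:
    """Check that documented migrations lead from the original to the current object."""
    expected = original_digest
    for event in provenance:
        if event["source_sha256"] != expected:
            return False  # a step in the chain is undocumented
        expected = event["derived_sha256"]
    return fixity_digest(current) == expected
```

A check of this kind shows only that every change was documented. It does not show that a migration preserved the intellectual content, which still depends on the detailed description of the original object.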
Finally, as YEA has examined the issue of metadata for a system designed to authenticate, preserve, and make available electronic journals for an extended length of time, we have tried to keep in mind that metadata will not just be static; rather, metadata will be interacted with, often by individuals who are seeking knowledge. To this end, we acknowledge the four user tasks identified by the International Federation of Library Associations and Institutions (IFLA) in its report on functional requirements for bibliographic records: metadata exist because individuals want to find, identify, select, or obtain informational materials.
The Metadata Analysis
YEA began its analysis of needed metadata for a preservation archive of electronic journals by conducting a review of extant literature and projects. In this process the team discovered and closely explored a number of models and schemes. The first document — and the one we returned to most strongly in the end — described the OAIS model, although OAIS provides only a general framework and leaves the details to be defined by implementers. We also examined the Making of America's testbed project white paper and determined it was compatible with OAIS. Next, we examined the 15 January 2001 RLG/OCLC Preservation Metadata Review document and determined that while all of the major projects described (CEDARS, NEDLIB, PANDORA) were compliant with the OAIS structure, none of them had the level of detail, particularly in contextual information, that we believed necessary for a long-term electronic journal archive. We also explored the InterPARES project (mentioned above) and found there a level of detail in contextual information that we had not seen delineated in the RLG/OCLC review of the other projects.
At the same time, the library and publisher participants in the project were exploring the extant metadata sets used by Elsevier Science to transport descriptions of their journal materials for their own document handling systems and customer interfaces. In addition to their EFFECT standard (see section describing Elsevier Science's Technical Systems and Processes), we also examined portions of the more detailed Elsevier Science "Full Length Article DTD 4.2.0." Due to the solid pre-existing work by Elsevier Science in this area and the thorough documentation of the metadata elements that Elsevier Science is already using, we were able to proceed directly to an analysis of the extant Elsevier metadata to determine what additional information might need to be created or recorded during production for and by YEA.
About halfway through the project year, the team made connections with British Library staff who were themselves just completing a metadata element set definition project and who generously shared with the team their draft version. While the British Library draft document was more expansive in scope than the needs of the YEA project (i.e., the British Library document covers manuscripts, films, and many other items beyond the scope of any e-journal focus), the metadata elements defined therein and the level of detail in each area of coverage were almost precisely on target for what the e-archiving team sought to create. Thus, with the kind consent of contacts at the British Library, the team began working with the draft, stripping away unneeded elements, and inserting some missing items.
In the fall of 2001, the YEA team committed to creating a working prototype or proof-of-concept which demonstrated it would indeed be possible to ingest data supplied by Elsevier Science into a minimalistic environment conducive to archival maintenance. The prototype-building activity briefly diverted the metadata focus from assembling a full set of needed elements for the archival system to defining a very minimal set of elements for use in the prototype. The technical explorations of the prototype eventually led us to simply use the metadata supplied by Elsevier and the prototype metadata element set was never used. The one remaining activity associated with metadata performed for the prototype was to map the Elsevier EFFECT metadata to Dublin Core so that it could be exposed for harvesting.
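The mapping step can be pictured as a simple crosswalk from source field names to Dublin Core element names. In the sketch below the Dublin Core elements (title, creator, identifier, source, date) are real, but the source-side field names are hypothetical stand-ins; the actual EFFECT element names are not given in this report.

```python
# Hypothetical source field names mapped to real Dublin Core elements.
EFFECT_TO_DC = {
    "article_title": "title",
    "author": "creator",
    "pii": "identifier",
    "journal_title": "source",
    "cover_date": "date",
}


def to_dublin_core(effect_record: dict) -> dict:
    """Map a source record to Dublin Core; unmapped fields are dropped.

    Every Dublin Core element is repeatable, so values are kept as lists.
    """
    dc = {}
    for field_name, value in effect_record.items():
        element = EFFECT_TO_DC.get(field_name)
        if element is None:
            continue  # no Dublin Core equivalent in this crosswalk
        values = value if isinstance(value, list) else [value]
        dc.setdefault(element, []).extend(values)
    return dc
```

Because simple Dublin Core has only fifteen elements, a crosswalk of this kind necessarily loses detail that a richer element set retains.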
Once the prototype subset element set was identified, YEA returned to the question of a complete metadata element set for a working archive. As the British Library draft document was examined, reviewed, and assessed, many decisions were made to include or exclude elements, particularly descriptive metadata elements. These decisions were informed in part by the recurring theme of whether the presence of such an item of information would assist individuals performing inquiries of the archive. The questions related to uses of scholarly journal materials for archival explorations are dealt with more fully elsewhere in this report.
The full metadata element set was completed by YEA as a recommended set of metadata to be used in a future full archive construction. It is important to reiterate that our approach to producing this set of metadata was inclusive. In creating an archival architecture it is not enough to delineate the descriptive metadata that must be acquired from the publisher or created by the archive while leaving out the administrative metadata elements that permit the archive to function in its preserving role. Neither is it sufficient to focus on the administrative metadata aspects that are unique to an archive while setting aside the descriptive metadata elements, i.e., assuming they are sufficiently defined by other standards. Preservation metadata are the conflation of the two types of metadata and, in fact, both types of metadata work jointly to ensure the preservation of and continuing access to the materials under the care of the archive.
One other fact may be of interest to those reviewing the description of metadata elements for the YEA: where possible, we used external standards and lists as materials upon which the archive would depend. For example, we refer to the DCMI-Type Vocabulary as the reference list of the element called "resource type."
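Depending on an external list means that the archive validates a value against that list rather than defining its own. A minimal sketch follows, assuming the twelve terms of the DCMI Type Vocabulary as currently published (the list was shorter when this report was written); the function name is hypothetical.

```python
# Terms of the DCMI Type Vocabulary as currently published by DCMI.
DCMI_TYPES = frozenset(
    {
        "Collection",
        "Dataset",
        "Event",
        "Image",
        "InteractiveResource",
        "MovingImage",
        "PhysicalObject",
        "Service",
        "Software",
        "Sound",
        "StillImage",
        "Text",
    }
)


def validate_resource_type(value: str) -> str:
    """Accept a "resource type" value only if it is a DCMI Type term."""
    if value not in DCMI_TYPES:
        raise ValueError(f"{value!r} is not a DCMI Type Vocabulary term")
    return value
```

A journal article would normally be recorded as "Text"; a locally invented value such as "Article" would be rejected.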
We certainly do not expect that the element set created by YEA will proceed into implementation in a future full archive construction without any further changes. It will undoubtedly be influenced by work done by other groups such as the E-Journal Archive DTD Feasibility Study prepared for the Harvard University Library e-journal archiving project. However, we now have a reference by which to assess whether proposed inclusions, exclusions, or modifications fit the structure we imagine an archive will need in order to preserve electronic journals properly.
Metadata in Phase II
In the next phase of the e-archiving project, the YEA team intends to define and refine further the metadata needed for a system designed to authenticate, preserve, and make available electronic journals for an extended period of time. We will need to connect with others working informally or formally to create a standard or standards for preservation metadata. As noted above, further investigations may influence a revision to the initial metadata set defined during the planning phase. Additionally, we intend to rework the element set into an XML schema for Open Archives Initiative (OAI) manifestation and harvesting. With our small prototype, we have demonstrated that OAI harvesting can occur from the simple Dublin Core metadata set to which we mapped the Elsevier EFFECT elements. However, OAI interaction can occur much more richly if a fuller dataset is in use, and we intend to accomplish this schema transformation to enable fuller interaction with the archive as it develops.
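To indicate what exposing simple Dublin Core for harvesting involves, the sketch below serializes a record as an `oai_dc` XML fragment. It assumes the namespaces defined by version 2.0 of the OAI protocol, which postdates the prototype described here, and it omits the protocol envelope and the schema location attributes that a real repository response would carry. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the oai_dc metadata format in OAI-PMH 2.0.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def to_oai_dc(dc_record: dict) -> str:
    """Serialize a mapping of Dublin Core element names to value lists."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, values in dc_record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

A fuller element set would need its own XML schema and metadata prefix alongside `oai_dc`, which is the schema work anticipated for the next phase.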
As the next phase moves forward, another avenue of exploration will be to assess and examine our metadata element choices in a live environment. We are most particularly interested in testing those elements included or excluded on the basis of assumptions made regarding the likelihood of archival inquiries targeting specific elements for exploration. Such choices can only be validated over time and with the interaction of individuals conducting authentic research into the history of their fields. Finally, we look forward to testing our element choices for administrative metadata under the stress of daily archive administration and maintenance. Only in such a live environment can an archive be truly confirmed as a functioning entity for preserving the materials under its care.
Elsevier Science is a major producer of scholarly communication and scientific journals that are distributed globally. The headquarters for production, along with the electronic warehouse, is located in Amsterdam, Netherlands. There the company maintains two office buildings and deploys several hundred staff to organize, produce, and distribute its content. The production of electronic scholarly information is a highly complex process that occurs in a distributed geographical environment involving many businesses beyond Elsevier Science. Changes to the manufacturing process can take years to percolate through the entire chain of assembly and are considered significant business risks. Consequently, Elsevier is moved to make changes to production only when compelling market demands exist. For example, the Computer Aided Production (CAP) workflow is now under modification because Science Direct, an internal customer of Elsevier Science, is experiencing market pressure to bring published items to its customers in a shorter time than ever before.
Prior to the creation of the Electronic Warehouse (EW) in 1995, Elsevier Science had no standard processes to create and distribute journals or content. The production of journals was based upon a loose confederation of many smaller publishing houses owned by Elsevier Science. Content was produced using methods that were extant when Elsevier acquired a given publisher. Consequently, prior to the creation of the EW there was no uniformity in the structure or style of content marketed under the name of Elsevier Science. Each publishing house set its own standards for creation and distribution. The lack of a central infrastructure for creating and distributing content also served as an impediment to the rapid distribution of scholarly communication to the market.
With the creation of networks in the early 1990s, the perception of delay was amplified. Scientists began to use the network themselves to share communications with one another instantly. The scholarly community would no longer accept long delays between the submission of manuscripts to a publisher and their appearance in paper journals. Scientists and publishers realized that reporting research in electronic format could significantly close the time gap between publication and distribution of content to the scholarly community. The origin of Elsevier Science's Electronic Warehouse is rooted in this realization. Elsevier Science's early solution to the problem was to support a research project known as The University Licensing Program, commonly referred to as TULIP.
The TULIP Project (1992-1996) grew out of a series of Coalition for Networked Information (CNI) meetings in which Elsevier Science, a CNI member, invited academic libraries to partner with it to consider how best to develop online delivery capabilities for scientific journals. The purpose of the project was to discuss the need to build large-scale systems and infrastructures to support the production and rapid delivery of such journals over a network to the scholarly community. Given a critical mass of interest from the university communities, Elsevier Science justified a large investment that would create a manufacturing function for converting paper journals into an electronic format for network distribution. This process became known as the PRECAP method for the creation of an electronic version of a journal. The creation of this conversion function served as the foundation for the present-day EW. Near the end of the TULIP project, in 1995, Elsevier Science adopted plans for an EW, which was built by the end of 1996. By 1997 the EW could produce over one thousand journals using a standard means of production.
The creation and success of the EW in producing and distributing journals was a very significant accomplishment for Elsevier Science because 1) many individual publishers had to be converted one by one, 2) standards for production were evolving from 1994 through 2000, and 3) suppliers who created content for the producers needed to be continuously trained and retooled to adhere to the evolving standards. At the same time, these suppliers met their obligation to produce content, on time, for Elsevier Science.
Elsevier Science maintains four production sites based in the United Kingdom (Oxford and Exeter), Ireland (Shannon), the United States (New York), and the Netherlands (Amsterdam). Each site provides content to the EW where this content is stored as an S300 dataset. The contents of each dataset represent an entire issue of a particular journal. The storage system at the EW originally used vanilla IBM technology, i.e., ADSTAR Distributed Storage Manager (ADSM), to create tape backup datasets of content stored on magnetic and optical storage. Access to the data was based only upon the file name of the S300 dataset. As of Summer 2001, the old hierarchical storage system was replaced by an all-magnetic disk-based system providing more flexibility and enabling faster throughput and production times.
The CAP Workflow
The following is a concise description and discussion of the Computer Aided Production (CAP) workflow. An item is accepted for publication by means of a peer review process. After peer review the item enters the CAP workflow via the Login Function in which a publication item identifier (PII) is assigned to the content. This is a tag that the EW uses to track the item through the production process, and it also serves as a piece of metadata used for the long-term storage of the item. Since this identifier is unique it could also be used as a digital object identifier for an information package in an OAIS archive. In addition to assigning the PII, the login process also obtains other metadata about the author and item such as the first author's name, address, e-mail address, and number of pages, tables, and figures in the item. This and other similar metadata are entered into a Production Tracking System (PTS) that is maintained by the Production Control system.
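The Login Function can be pictured as assigning a unique identifier and opening a tracking record keyed by it. The sketch below is illustrative only: the class and method names are assumptions, and the identifier it generates is a placeholder that does not follow the real PII syntax.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackingRecord:
    # Metadata captured at login, as listed in the text above.
    pii: str
    first_author: str
    address: str
    email: str
    pages: int
    tables: int
    figures: int


class ProductionTrackingSystem:
    """Hypothetical stand-in for the PTS: one record per logged-in item."""

    def __init__(self) -> None:
        self._records = {}
        self._counter = itertools.count(1)

    def login(self, first_author, address, email, pages, tables, figures) -> TrackingRecord:
        # Placeholder identifier; real PIIs have their own format.
        pii = f"ITEM-{next(self._counter):06d}"
        record = TrackingRecord(pii, first_author, address, email, pages, tables, figures)
        self._records[pii] = record
        return record

    def lookup(self, pii: str) -> TrackingRecord:
        return self._records[pii]
```

Because the identifier is assigned once and never reused, the same value can follow the item from production into long-term storage.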
The item is then sent electronically to a supplier (Elsevier has sixteen suppliers, distributed on a worldwide basis). There the item undergoes media conversion, file structuring, copy editing, and typesetting. The output of this processing is a first generation (no corrections) SGML markup of the item, a PDF file, and artwork for the item. These units of work are then sent to the author for corrections. The author makes the necessary corrections, then sends the item to Production Control where information in the PTS system is updated. Thereafter, Production Control sends the item to an Issues Manager. Any problems found in the content are worked out between the author and the Issues Manager. If there are no problems, the supplier sends the content directly to Production Control.
The Issues Manager then passes the corrections on to the supplier and begins to compile the issue. This involves making decisions about proofs, cover pages, advertising, and building of indexes. On average, an Issues Manager is responsible for five to ten journals or about fifteen thousand pages a year. Once content is received, the supplier then creates a second-generation SGML and PDF file and new artwork, if necessary. This cycle is repeated until the complete issue is assembled. Once the issue is fully assembled the Issues Manager directs the supplier to create a distribution dataset called S300 which contains the entire issue. The supplier sends this file to the EW where the file serves as input for the creation of distribution datasets for customers such as Science Direct. At EW this dataset is added to an ADSM-based storage system that serves as a depository — not an archive — for all electronic data produced for the EW. The S300 dataset is also sent to a printer where a paper version of the issue is created and then distributed to customers. The paper version of the journal is also stored in a warehouse. Most printing occurs in the Netherlands and the United Kingdom.
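The issue-based cycle described above can be sketched as follows: each item passes through one or more correction rounds, and the S300 dataset can be built only once every item in the issue is ready. The class names and the `approved` flag are assumptions introduced for illustration; the real S300 structure is not described in this report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    pii: str
    generation: int = 1     # 1 = first-generation SGML and PDF, no corrections
    approved: bool = False  # hypothetical flag set by the Issues Manager

    def apply_corrections(self) -> None:
        """The supplier produces the next generation of SGML, PDF, and artwork."""
        self.generation += 1

    def approve(self) -> None:
        self.approved = True


@dataclass
class Issue:
    journal: str
    items: List[Item] = field(default_factory=list)

    def is_complete(self) -> bool:
        return bool(self.items) and all(item.approved for item in self.items)

    def build_s300(self) -> dict:
        """Create the issue-level distribution dataset once the issue is complete."""
        if not self.is_complete():
            raise RuntimeError("issue is not fully assembled")
        return {
            "dataset": "S300",
            "journal": self.journal,
            "piis": [item.pii for item in self.items],
        }
```

The sketch makes the bottleneck visible: a single unapproved item holds back the distribution of every other item in the issue.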
The current issue-based workflow has two serious problems. The first is that production does not produce content for distribution in a timely fashion for customers like Science Direct, and the second is that issue-based processing generates high and low periods of work for suppliers. A steady stream of work passing through the manufacturing process would be more efficient for the suppliers and would result in a more timely delivery of content to Elsevier's customers such as Science Direct. The driving force behind a need for change, as mentioned above, is not EW but rather, Science Direct as an internal customer of the EW. The resolution to these workflow problems is to change the fundamental unit of work for production from an issue to an article, something Elsevier recognizes and is currently working toward.
The new article-based e-workflow being developed by Science Direct will streamline interactions between authors, producers, and suppliers. At a management level, automation of key functions will yield the following efficiencies: 1) in the e-workflow model, Web sites will be created to automate the electronic submission of articles to an editorial office and to establish an electronic peer review system, and 2) the peer review system will interface with a more automated login and tracking system maintained by the EW.
The new Production Tracking System can then be used by the EW, suppliers, and customers to manage the production and distribution processes more efficiently. Functionally, the new workflow would also produce two additional intermediary datasets called S100 and S200. These datasets could be sent to the EW for distribution to customers at the time of creation by the supplier and before an S300 dataset was sent to the EW. For example, the physics community, which uses letter journals, would directly benefit from this change in production. Under the e-workflow model, the supplier could, immediately upon creation, send an S100 dataset that contained a first-generation version of the letter or item (i.e., no author corrections) directly to the EW for distribution to a Science Direct Web site. In addition, Science Direct would also be able to distribute content at the article level in the form of an S200 dataset that contained second-generation, or corrected, SGML and PDF data. This content would be sent to a Web site before the supplier sent the S300 dataset representing the entire issue to the EW. It is interesting to note that the EW does not save intermediary datasets once an S300 dataset is created. Pilot projects have been launched to test the e-workflow model.
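The article-level flow can be sketched as a warehouse that distributes S100 and S200 datasets as they arrive and discards them once the issue-level S300 dataset is received. The class, its attributes, and its method names are assumptions for illustration, and the payloads are placeholders.

```python
class Warehouse:
    """Hypothetical model of how the EW handles S100, S200, and S300 datasets."""

    ARTICLE_LEVEL = ("S100", "S200")

    def __init__(self) -> None:
        self.intermediate = {}  # pii -> {"S100": payload, "S200": payload}
        self.issues = {}        # issue id -> S300 payload
        self.distributed = []   # log of (dataset kind, identifier)

    def receive_article(self, kind: str, pii: str, payload: bytes) -> None:
        """Accept an article-level dataset and distribute it immediately."""
        if kind not in self.ARTICLE_LEVEL:
            raise ValueError(f"{kind} is not an article-level dataset")
        self.intermediate.setdefault(pii, {})[kind] = payload
        self.distributed.append((kind, pii))

    def receive_issue(self, issue_id: str, piis: list, payload: bytes) -> None:
        """Accept the S300 dataset; intermediary datasets are not kept afterwards."""
        self.issues[issue_id] = payload
        for pii in piis:
            self.intermediate.pop(pii, None)
        self.distributed.append(("S300", issue_id))
```

For an archive, the final step matters: if the first-generation version of a letter is ever to be preserved, it would have to be captured before the S300 dataset replaces it.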
Finally, it should be noted that as the use of the EW developed and evolved over time, it became apparent — for operational and customer support reasons — that some additional support systems would be needed. For example, one of these systems facilitates Elsevier's ability to support customers in auditing the completeness of their collections. Another tracks the history of publications that Elsevier distributes.
In the early 1990s Elsevier Science recognized that production and delivery of electronic content could best be facilitated by conversion of documents to an SGML format. SGML is a tool that enables the rendering of a document to be separated from the content structure of a document. This division is achieved through the use of a document type definition (DTD) and a style sheet. The DTD is a tool by which the structure of a document can be defined through the use of mark-up tags. In addition, the DTD defines a grammar or set of rules that constrain how these tags can be used to mark up a document. A style sheet defines how the content should be rendered, i.e., character sets, fonts, and type styles. Together, these two tools make documents portable across different computer systems and more easily manipulated by database applications. In addition, the separation of content from rendering is also critical to the long-term preservation of electronic scholarly information. That said, the evolution of production and distribution of content by Elsevier Science or the EW has been tightly coupled to 1) the development of a universal DTD for their publications, 2) the successful adoption of a DTD by EW suppliers, and 3) the emergence of the Portable Document Format known as PDF. On average it took two years for all suppliers (at one time more than two hundred) to integrate a new DTD into production. As can be inferred from Table I below, by the time one DTD was fully implemented by all suppliers another version of the DTD was being released.