Lots of Copies Keep Stuff Safe
LOCKSS
Mellon Foundation
2/6/2001
[updated 2/14/01]

LOCKSS
Building a Digital Preservation Internet Appliance

Librarians Keep Paper Publications Accessible
Distribute & house copies worldwide
Loan copies to libraries on request
Readers find a copy easily
It is hard to find & destroy all copies

Librarians Currently Ensure Documents Are Not “Unpublished”
Publisher takeovers, buyouts, etc.
Malicious act
Natural disaster
Being lost
Official edict
Simply by taking actions to support their local communities

LOCKSS Technical Requirements
Be affordable
Cheap PC, Open-source software
Low administration “appliance”
Have low probability of failure
Many replicas, Resists attack, No secrets
Scale to enormous rates of publishing
Preserve access
Links resolve, Searches work
Conform to publishers' access controls
Libraries take custody of content

LOCKSS
Provides a simple web cache that
Never gets flushed
Holds authorized content
The cache
Pre-fetches content as published
Continuously validates against other caches
Repairs gaps from publisher and other caches
Persistence via redundancy
Not via media archiving
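
The cache behavior listed above can be sketched in a few lines. This is an illustrative toy, not the LOCKSS implementation: the class name PersistentCache, the Fetcher callable type, and the "publisher first, then peers" repair order are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical fetcher type: returns the bytes for a URL, or None if the
# source cannot supply it (for example, the publisher has gone away).
Fetcher = Callable[[str], Optional[bytes]]


class PersistentCache:
    """Toy web cache whose entries are never flushed."""

    def __init__(self, publisher: Fetcher, peers: Iterable[Fetcher]) -> None:
        self._store: Dict[str, bytes] = {}
        self._publisher = publisher
        self._peers = list(peers)

    def prefetch(self, urls: Iterable[str]) -> None:
        """Collect newly published content before any reader asks for it."""
        for url in urls:
            if url not in self._store:
                self.repair(url)

    def repair(self, url: str) -> bool:
        """Fill a gap from the publisher, falling back to the other caches."""
        for source in [self._publisher, *self._peers]:
            content = source(url)
            if content is not None:
                self._store[url] = content
                return True
        return False

    def get(self, url: str) -> Optional[bytes]:
        """Serve from the local store, attempting a repair on a miss."""
        if url not in self._store:
            self.repair(url)
        return self._store.get(url)
```

Because nothing is ever evicted, content fetched once stays readable even after the publisher stops serving it, and a gap can still be filled as long as some other cache holds a copy.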

LOCKSS Works For Content
Any format
gif, jpeg, html, video, audio
Delivered through HTTP
More or less immutable
Not intended for dynamic content
Good match for peer-reviewed articles

LOCKSS Project Status
Support
National Science Foundation
Sun Microsystems Labs
Stanford University Libraries
Mellon Foundation
Software
Technical design complete
Prototype working
Alpha test ends winter 2001
Beta test starts spring 2001

Alpha Test Lessons
LOCKSS is feasible
~15 caches, 10 months, ~160MB of Science Online
Collected content, detected & repaired deliberate damage
Survived fire, relocation, flaky hardware
Basic mechanisms work
Mixed multicast/unicast communication
Over-replicated fault tolerance by “opinion poll”
Linux-based “internet appliance”
But work needed before beta
Administrator GUI
Repairing damage using copies from other caches
Hardening against attack
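
The "opinion poll" idea mentioned above can be illustrated with a minimal sketch in which a cache compares a digest of its own copy against digests reported by other caches. The slides do not describe the actual LOCKSS protocol, so the function names, the use of SHA-256, and the strict-majority rule here are all assumptions.

```python
import hashlib
from collections import Counter
from typing import List


def digest(content: bytes) -> str:
    """Hash a copy so caches can compare content without exchanging it."""
    # SHA-256 is an assumption; the slides do not name a hash function.
    return hashlib.sha256(content).hexdigest()


def opinion_poll(local_copy: bytes, peer_digests: List[str]) -> str:
    """Return 'agree', 'damaged', or 'inconclusive' for the local copy.

    The local copy is judged against the digest reported by a strict
    majority of the polled caches (a hypothetical rule for this sketch).
    """
    if not peer_digests:
        return "inconclusive"
    winner, votes = Counter(peer_digests).most_common(1)[0]
    if votes * 2 <= len(peer_digests):
        return "inconclusive"  # no digest has more than half the votes
    return "agree" if digest(local_copy) == winner else "damaged"
```

A cache that gets "damaged" would then fetch a replacement copy from the publisher or another cache; with many replicas voting, a few bad or missing copies do not change the outcome.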

30 Beta Test Publishers
American Association for the Advancement of Science, American Physical Society, Federation of American Societies for Experimental Biology, Biophysical Society, Annual Reviews, Rockefeller University Press, The Endocrine Society, American Society for Biochemistry and Molecular Biology, American Association for Clinical Chemistry, National Academy of Sciences, British Medical Journal, American Psychiatric Publishing Inc., Oxford University Press, Company of Biologists Ltd, New England Journal of Medicine, American Society for Clinical Investigation, Radiological Society of North America, Society for General Microbiology, The Histochemical Society, American Thoracic Society, BMJ Publishing Group, American Society of Neuroradiology, Lipid Research Inc., American Society for Investigative Pathology, American Society of Plant Physiologists, The Royal College of Psychiatrists, Society for the Study of Reproduction, American Society for Microbiology, Cold Spring Harbor Lab Press, American Society for Pharmacology & Experimental Therapeutics

Beta Test Plans Include
Libraries
~60 widely distributed & varyingly configured caches
Test security, usability, performance
Journals
Not using real journals' URLs
Simulating content [Science, PNAS, JBC, BMJ, a few US Gov Docs] on shadow servers
Isolate LOCKSS data streams & measure traffic
Test the system by turning off the publisher

LOCKSS
If it works, will provide access to content for many future generations
Disclaimer: monolithic, homogeneous solutions are likely to fail; many digital preservation approaches are required