1
|
- Michael L. Nelson, Joan A. Smith, Martin Klein
- Old Dominion University
- Norfolk VA
- www.cs.odu.edu/~{mln,jsmit,mklein}
- DLF Spring 2006
- Austin TX
- April 10-12, 2006
|
2
|
- Get a lot of $
- Buy a lot of disks, machines, tapes, etc.
- Hire an army of staff
- Load a small amount of data
- “Look upon my archive, ye Mighty, and despair!”
|
3
|
- Lazy Preservation
- Let Google, IA et al. preserve your website
- Just-In-Time Preservation
- Find a “good enough” replacement web page
- Web Server Based Preservation
- Use Apache modules to create archival-ready resources
- Shared Infrastructure Preservation
- Push your content to sites that might preserve it
|
4
|
- Can we (re)use existing installed network infrastructure for
preservation purposes?
|
5
|
- Inject the contents of an OAI-PMH repository directly into:
- Email (SMTP)
- Usenet News (NNTP)
- Instrument existing email, news servers
- Use mod_oai (www.modoai.org) to do resource harvesting
- complex object formats (e.g. MPEG-21 DIDL) used to encode the resources
as “lumps of XML”
- results are generalizable to any repository system
- Analyze testbed, simulate very large collections
|
6
|
- Website with 72 files
- HTML, PDF, PNG, JPEG, GIF
- 1 KB to 1.5 MB
- Used a script to harvest the MPEG-21 DIDLs, and then:
- attach to outbound email messages
- post to a moderated newsgroup (repository.odu.test1)
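A minimal sketch of the "attach to outbound email" step, assuming Python's standard `email` library. The script used in the testbed is not shown in the slides, so the function name `attach_didl`, the addresses, the filename, and the placeholder payload are all hypothetical.

```python
from email.message import EmailMessage


def attach_didl(msg: EmailMessage, didl_xml: bytes, filename: str) -> EmailMessage:
    """Attach one harvested DIDL document to an outbound message as an XML file."""
    msg.add_attachment(
        didl_xml,
        maintype="application",
        subtype="xml",
        filename=filename,
    )
    return msg


# Hypothetical outbound message; the payload is a placeholder, not a real DIDL record.
msg = EmailMessage()
msg["From"] = "sender@example.org"
msg["To"] = "recipient@example.org"
msg["Subject"] = "Meeting notes"
msg.set_content("See you Thursday.")

didl = b"<?xml version='1.0' encoding='UTF-8'?><DIDL><!-- placeholder --></DIDL>"
attach_didl(msg, didl, "record-0001.xml")
```

The ordinary message body is left untouched; the resource simply rides along as a "lump of XML". Handing the message to a mail server (for example with `smtplib`) is omitted here.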
|
11
|
- 30 days of traffic
- 505,987 messages
- 4,081 unique hosts
- daily message volume
- mean: 16,866
- std dev: 5,147
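The daily mean follows directly from the totals above. The short check below reproduces it; the per-host average is a derived figure that does not appear on the slide, and the standard deviation cannot be recomputed without the per-day counts.

```python
total_messages = 505_987   # from the slide
days = 30                  # from the slide
unique_hosts = 4081        # from the slide

mean_per_day = total_messages / days
mean_per_host = total_messages / unique_hosts  # derived, not on the slide

print(round(mean_per_day))   # 16866
print(round(mean_per_host))  # 124
```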
|
16
|
- Repository
- 100,000 items
- 1 MB/item
- 100 daily additions
- 400 daily updates
- Time
- Email
- granularity=1
- follows ODU power law example
- News
- servers hold contents for 30 days
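A back-of-the-envelope reading of these parameters, not the simulation itself. The slides do not give a posting rate, so `posts_per_day` and the function name `news_coverage` are hypothetical, and the model assumes every post is a distinct item and ignores repository growth.

```python
def news_coverage(repo_items: int, posts_per_day: int, retention_days: int = 30) -> float:
    """Fraction of the repository held on a news server at steady state.

    Assumes each post is a distinct item that expires after retention_days;
    both are simplifying assumptions, not taken from the slides.
    """
    held = min(repo_items, posts_per_day * retention_days)
    return held / repo_items


REPO_ITEMS = 100_000        # from the slide
ITEM_MB = 1                 # from the slide
DAILY_CHANGES = 100 + 400   # additions + updates, from the slide

# Volume that must be sent by value each day just to keep up with changes.
daily_change_mb = DAILY_CHANGES * ITEM_MB

# Posting rate needed to fit the whole repository inside the 30-day window.
min_rate_full_coverage = -(-REPO_ITEMS // 30)  # ceiling division

print(daily_change_mb)                    # 500
print(min_rate_full_coverage)             # 3334
print(news_coverage(REPO_ITEMS, 500))     # 0.15
```

Under these assumptions, posting only the 500 daily changes keeps 15% of the repository on the news server at any time; holding all of it would take roughly 3,334 posts per day.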
|
20
|
- We’ve examined the worst-case scenario
- large, active repository
- sending contents by-value
- Optimizations / Alternatives
- smaller, less dynamic repositories
- sending contents by-reference
- use for repository discovery, not for content interchange
- instead of sending “GetRecord” results, send “Identify” results and
let interested parties return to your site with proper harvesters
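The difference between the two requests can be sketched with standard OAI-PMH request URLs. The base URL and record identifier below are hypothetical, and `oai_didl` as the `metadataPrefix` is an assumption about how the repository is configured.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/modoai"  # hypothetical base URL


def oai_request(base_url: str, verb: str, **args: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


# By-value: the response carries one full record, which is then shipped as-is.
get_record = oai_request(
    BASE_URL, "GetRecord",
    identifier="oai:example.org:item-1",  # hypothetical identifier
    metadataPrefix="oai_didl",            # assumed metadata format
)

# By-reference: the response only describes the repository, so the message
# stays small and harvesters come back for the content themselves.
identify = oai_request(BASE_URL, "Identify")

print(identify)
print(get_record)
```

An "Identify" response is the same size no matter how large the repository is, which is what makes it suitable for discovery rather than content interchange.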
|
21
|
- Shared, existing infrastructure can be used to push content to unknown
preservation partners
- exploiting not just hardware infrastructure, but human communication
patterns for resource discovery as well
- While not possessing ideal DL/Archival capabilities, these methods are
congruent with standard web practices
- Gmail, Google Groups, etc. will always have more disks than you…
|