1
Repository Synchronization Using NNTP and SMTP
  • Michael L. Nelson, Joan A. Smith, Martin Klein
  • Old Dominion University
  • Norfolk VA
  • www.cs.odu.edu/~{mln,jsmit,mklein}


  • DLF Spring 2006
  • Austin TX
  • April 10-12, 2006
2
Preservation: Fortress Model
    • Get a lot of $
    • Buy a lot of disks, machines, tapes, etc.
    • Hire an army of staff
    • Load a small amount of data
    • “Look upon my archive ye Mighty, and despair!”
3
Alternate Models of Preservation
  • Lazy Preservation
    • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
    • Find a “good enough” replacement web page
  • Web Server Based Preservation
    • Use Apache modules to create archival-ready resources
  • Shared Infrastructure Preservation
    • Push your content to sites that might preserve it
4
Shared, Existing Infrastructure
  • Can we (re)use existing installed network infrastructure for preservation purposes?


5
Experiment & Simulation
  • Inject the contents of an OAI-PMH repository directly into:
    • Email (SMTP)
    • Usenet News (NNTP)
  • Instrument existing email and news servers
  • Use mod_oai (www.modoai.org) to do resource harvesting
    • complex object formats (e.g., MPEG-21 DIDL) encode the resources as “lumps of XML”
    • results are generalizable to any repository system
  • Analyze testbed, simulate very large collections
6
Test Repository
  • Website with 72 files
    • HTML, PDF, PNG, JPEG, GIF
    • 1KB - 1.5 MB
  • Used a script to harvest the MPEG-21 DIDLs, and then:
    • attach to outbound email messages
    • post to a moderated newsgroup (repository.odu.test1)
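
The email half of this step can be sketched in Python. This is a minimal illustration, not the script used in the experiment: the function names, addresses, and file name are hypothetical, and the DIDL is assumed to be already harvested as bytes. Posting to the newsgroup would need an NNTP client and is not shown.

```python
from email.message import EmailMessage


def build_outbound_message(sender: str, recipient: str,
                           subject: str, body: str) -> EmailMessage:
    """Build an ordinary outbound message (stands in for normal user mail)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def attach_didl(msg: EmailMessage, didl_xml: bytes, filename: str) -> EmailMessage:
    """Attach one harvested DIDL document to the message as application/xml."""
    msg.add_attachment(didl_xml, maintype="application", subtype="xml",
                       filename=filename)
    return msg


if __name__ == "__main__":
    # Hypothetical addresses and payload, for illustration only.
    message = build_outbound_message("alice@example.org", "bob@example.org",
                                     "Meeting notes", "See you Tuesday.")
    attach_didl(message, b"<didl/>", "item-0001.xml")
    print(message.get_content_type())
```

The message could then be handed to an SMTP client such as `smtplib.SMTP.send_message`. The original recipient sees a normal message with one extra attachment.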
7
Email
8
Adding Email Attachments / Headers
9
Email Headers
10
SMTP Overhead
11
Email Traffic @ mail.cs.odu.edu
  • 30 days of traffic
    • 505,987 messages
    • 4,081 unique hosts
    • daily
      • mean: 16,866
      • std dev: 5,147
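
The daily mean follows directly from the 30-day total. The short check below recomputes it. The coefficient of variation and the messages-per-host ratio are derived here for illustration and are not figures from the slides.

```python
# Figures reported for 30 days of traffic at mail.cs.odu.edu
total_messages = 505_987
unique_hosts = 4_081
days = 30
daily_std_dev = 5_147

# Mean messages per day: total divided by the observation window
mean_daily = total_messages / days

# Derived for illustration only (not stated in the slides)
coefficient_of_variation = daily_std_dev / mean_daily
messages_per_host = total_messages / unique_hosts

print(f"mean daily messages: {mean_daily:.0f}")
print(f"coefficient of variation: {coefficient_of_variation:.2f}")
print(f"messages per unique host: {messages_per_host:.0f}")
```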
12
News
13
News Posting
14
News Overhead
15
News Policies
16
Simulation Parameters
  • Repository
    • 100,000 items
    • 1MB/item
    • 100 daily additions
    • 400 daily updates
  • Time
    • 2,000 days (about 5.5 years)



  • Email
    • granularity=1
    • follows ODU power law example
  • News
    • servers hold contents for 30 days
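
The repository parameters above imply some simple totals, sketched below. This is only bookkeeping on the stated parameters, not the simulation itself. The class and function names are hypothetical, and the retention estimate assumes each addition or update is posted exactly once and ignores the initial 100,000 items.

```python
from dataclasses import dataclass


@dataclass
class RepositoryParams:
    initial_items: int = 100_000
    item_size_mb: float = 1.0
    daily_additions: int = 100
    daily_updates: int = 400
    days: int = 2000


def items_on_day(p: RepositoryParams, day: int) -> int:
    """Repository size after `day` days; updates do not add items."""
    return p.initial_items + p.daily_additions * day


def daily_change_volume_mb(p: RepositoryParams) -> float:
    """Data touched per day by additions and updates, sent by-value."""
    return (p.daily_additions + p.daily_updates) * p.item_size_mb


def changes_within_retention(p: RepositoryParams, retention_days: int = 30) -> int:
    """Changed items still held by a news server (assumes one post per change)."""
    return (p.daily_additions + p.daily_updates) * retention_days


if __name__ == "__main__":
    params = RepositoryParams()
    print(items_on_day(params, params.days))   # items at the end of the run
    print(daily_change_volume_mb(params))      # MB of changes per day
    print(changes_within_retention(params))    # changed items in a 30-day window
```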

17
NNTP Results
18
Email Results (Without Memory)
19
Email Results (With Memory)
20
Discussion
  • We’ve examined the worst-case scenario
    • large, active repository
    • sending contents by-value
  • Optimizations / Alternatives
    • smaller, less dynamic repositories
    • sending contents by-reference
    • use for repository discovery, not for content interchange
      • instead of sending “GetRecord” results, send “Identify” results and let interested parties return to your site with proper harvesters
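
The difference between the two options is visible in the OAI-PMH requests themselves. The sketch below builds both request URLs. `Identify` and `GetRecord` are standard OAI-PMH verbs, while the base URL, the item identifier, and the `oai_didl` metadata prefix are assumed values for illustration.

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH base URL, for illustration only
BASE_URL = "http://www.example.org/modoai"


def identify_url(base_url: str) -> str:
    """Discovery request: describes the repository, carries no content."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Content request: returns one record in the requested format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(identify_url(BASE_URL))
    # Identifier and metadata prefix are assumed values
    print(get_record_url(BASE_URL, "oai:example.org:item1", "oai_didl"))
```

An “Identify” response is small and rarely changes, so pushing it costs far less than pushing every record. Interested parties can then harvest the content with their own tools.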
21
Summary
  • Shared, existing infrastructure can be used to push content to unknown preservation partners
    • exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
  • While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
    • Gmail, Google Groups, etc. will always have more disks than you…