2
VRC Funding
  • Part of a 4(5)-year NSF-funded project
    • supported by the Digital Libraries Initiative, Phase 2 (Grant No. IIS-9905955, the Prism Project)
  • Also partially funded by a grant from The Andrew W. Mellon Foundation
    • Political Communications Web Archiving http://www.crl.edu/content/PolitWeb.htm
  • For updates:
    • http://irisresearch.library.cornell.edu/VRC/
3
Current Team
    • Anne R. Kenney, Research Advisor
    • Nancy Y. McGovern, Project Manager
    • Richard Entlich, Sr. Researcher
    • William R. Kehoe, Technology Coordinator
    • Ellie Buckley, Digital Research Specialist

4
Research
  • "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism"
    • by Kenney, McGovern, et al., in D-Lib Magazine, January 2002
      • http://www.dlib.org/dlib/january02/kenney/01kenney.html
  • "Virtual Remote Control: Building a Preservation Risk Management Toolbox for Web Resources"
    • by McGovern, Kenney, et al., in D-Lib Magazine, April 2004
      • http://www.dlib.org/dlib/april04/mcgovern/04mcgovern.html


5
Virtual…
  • because VRC develops models to represent essential features of selected Web sites
  • that enable ongoing monitoring over time
  • to identify, respond to, and mitigate potential risks to the site integrity and longevity
6
Remote…
  • because VRC is intended for use by cultural heritage institutions
  • interested in the longevity of Web resources
  • residing on remote servers –
    • not owned or managed by the monitoring institution

7
Control…
  • because at the most proactive end of the VRC approach
  • a monitoring organization may act to protect another organization's resources
  • by agreement or implicit consent
  • through notification and/or action
8
Purpose
  • Develop a model for research libraries (adaptable to other contexts)
  • Support spectrum from passive monitoring to active capture
  • Lifecycle support: selection to capture
  • Understand nature of Web resources
  • Promulgate good practice
9
Types of Web Resources
  • Two types of initiatives for monitoring and/or capture of:
  • Web-based publications [Web site as a means]
  • All (or a subset) of a Web site consisting of pages within a boundary defined by a URL - or a portion of one [Web site as an end] (VRC)
10
Nature of Risks
  • Two perspectives on Web-based risk:
  • potential liability of an institution based upon the content of its Web site, or a Web site for which it is responsible
  • potential threats to the integrity and longevity of a Web resource (VRC)
11
Types of Risks
  • Include:
  • technological obsolescence
  • security weaknesses and breaches
  • human error in developing/maintaining sites
  • organizational issues; benign neglect
  • power and technology failures
  • inadequate backup and secondary systems
12
Risk Factors
  • Organizational Context
  • Combination of indicators
  • Monitoring (change/loss over time)
  • Triggers (events, organizational, upgrades)
  • Degradation of site management indicators
13
VRC Stages
  • Identification
  • Analysis
  • Appraisal
  • Strategy
  • Detection
  • Response
14
Human – Tool Scenario
  • 1. Identification
    • Human: identify Web resources of interest
    • Toolbox: verify list, expand list
  • 2. Analysis
    • Toolbox: crawl sites, generate characterizations
    • Human: accept/revise characterizations
  • 3. Appraisal
    • Human: define/review attributes of value
    • Toolbox: support appraisal, capture results
15
Human – Tool Scenario
  • 4. Strategy
    • Human: develop/review strategies
    • Toolbox: plot appraisals, compile strategies
  • 5. Detection
    • Human: define risk parameters
    • Toolbox: identify/assess risks; propose responses
  • 6. Response
    • Toolbox: propose risk response based on rules; automatic response for some risk categories
    • Human: monitor automated responses; select response based on recommended actions
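
The Response stage above can be pictured as a small rule table that splits detected risks into those handled automatically and those left for a person to decide. This is only a sketch: the risk category names, the responses, and the names `Rule`, `RULES`, `propose_response`, and `triage` are hypothetical and are not taken from the VRC toolbox.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    response: str    # the proposed response for a risk category
    automatic: bool  # True if the toolbox may act without human review


# Hypothetical rule table; real categories and responses would come
# from the strategies developed in the Strategy stage.
RULES = {
    "page_missing": Rule("capture last known copy", automatic=True),
    "server_unpatched": Rule("notify site owner", automatic=False),
}

# Anything without a rule is routed to a human.
DEFAULT_RULE = Rule("flag for review", automatic=False)


def propose_response(risk_category):
    """Toolbox side: look up the rule-based response for one risk."""
    return RULES.get(risk_category, DEFAULT_RULE)


def triage(detected_risks):
    """Split detected risks into automatic responses and human decisions."""
    automatic, for_review = [], []
    for category in detected_risks:
        rule = propose_response(category)
        target = automatic if rule.automatic else for_review
        target.append((category, rule.response))
    return automatic, for_review
```

Keeping the rules in a plain table means a person can review or revise them without touching the detection logic, which matches the human role of monitoring automated responses.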
16
Risk Display Grid
17
Monitoring Layers
18
Server-level Monitoring
  • Potential multi-site impact
  • Server vulnerabilities put site content at risk
    • deletion or modification
  • Patches and new versions of Microsoft IIS and Apache server released frequently
  • Apache HTTP Server 1.3 security updates
    • to version 1.3.26 on June 18, 2002
    • to version 1.3.27 on October 3, 2002
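
One way to watch for the server-level risk described above is to compare the version a site advertises in its HTTP `Server` response header against the latest known security release. The sketch below assumes the header is available as a string and uses 1.3.27 (the most recent update listed above) as the baseline; the function names are hypothetical, and many servers hide or alter this header, so a missing version means "unknown", not "safe".

```python
import re

# Baseline taken from the most recent Apache 1.3 security update listed above.
MIN_PATCHED = (1, 3, 27)


def parse_apache_version(server_header):
    """Return (major, minor, patch) from a Server header, or None if absent."""
    match = re.search(r"Apache/(\d+)\.(\d+)\.(\d+)", server_header)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_behind_patch_level(server_header, minimum=MIN_PATCHED):
    """Return True/False for the same release line, None if it cannot be judged."""
    version = parse_apache_version(server_header)
    if version is None:
        return None  # not Apache, or the version is not advertised
    if version[:2] != minimum[:2]:
        return None  # different release line; this baseline does not apply
    return version < minimum
```

For example, `is_behind_patch_level("Apache/1.3.26 (Unix)")` reports that the server predates the October 2002 update.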
19
Server-level Monitoring
20
VRC Toolbox
  • Identify tools for each stage (adopt, adapt, define, devise)
  • Leverage existing; apply to longevity
  • Analyze steps - automated and manual
  • Formalize protocol
  • Provide a framework to map existing, plug gaps with developments
21
VRC Toolkit
  • Development steps:
    • extensive literature review
    • development of tool categories
    • definition of categories and test protocols
    • survey of existing tools for evaluation
    • selection of representative tools for testing
    • findings highlighted in category summaries
22
Web Crawling
  • traversing Web sites via links
  • a capability common to most tools, but with different purposes and results
  • the VRC toolkit needs more than just Web crawlers
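
As a rough illustration of traversing a Web site via links, the sketch below follows `<a href>` links breadth-first and stays inside a boundary defined by a URL prefix. It is a minimal sketch, not one of the tools evaluated: `fetch` is a hypothetical callable that returns a page's HTML or `None` when the page cannot be retrieved, so the traversal can be tried on an in-memory site without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, boundary, fetch):
    """Breadth-first traversal of pages whose URL starts with `boundary`.

    Returns (seen, broken): every in-boundary URL discovered, and the
    subset that `fetch` could not retrieve.
    """
    seen = {start_url}
    queue = deque([start_url])
    broken = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            broken.append(url)
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            if target.startswith(boundary) and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, broken
```

The same traversal yields different results depending on what is done at each page, such as recording broken links, storing content, or mapping structure, which is one reason crawling alone does not cover every tool category.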
23
Tool Categories
  • Link checkers
  • Site monitors
  • Web crawlers
  • Site managers
  • Change Detectors
  • Site Mappers (includes visualization)
  • HTML Validators
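
To make one of these categories concrete, a change detector can be sketched as a comparison between two snapshots of a site. The example assumes each snapshot is a dictionary mapping URL to page text (a hypothetical representation) and compares SHA-256 fingerprints; it flags any byte-level difference and makes no attempt to judge whether a change is significant.

```python
import hashlib


def fingerprint(content):
    """Return a SHA-256 hex digest of a page's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(previous, current):
    """Compare two snapshots (dict of URL -> page text).

    Returns sorted lists of URLs that were added, removed, or modified.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    modified = sorted(
        url
        for url in set(previous) & set(current)
        if fingerprint(previous[url]) != fingerprint(current[url])
    )
    return {"added": added, "removed": removed, "modified": modified}
```

Storing only the fingerprints between runs would be enough to detect change, though not to recover what the earlier page said.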
27
Management Issues
  • Frequency of capture – determined by
    • nature of sites/pages
    • events: technological, organizational
    • resources
  • Informed crawling
  • Valuable vs. archival
28
Mandate
  • Fully document the site by capturing all changes to the pages/sites
  • Capture significant changes to pages/sites
  • Record periodic versions of the site
  • Capture one-time copy of pages/sites
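
The four mandates above lead to different capture decisions for the same observed change. The sketch below encodes them as a policy function; the names, the `changed_fraction` measure, and the default thresholds (20% change, 30 days) are illustrative assumptions, not values from the project.

```python
from enum import Enum


class Mandate(Enum):
    ALL_CHANGES = "capture all changes"
    SIGNIFICANT_CHANGES = "capture significant changes"
    PERIODIC = "record periodic versions"
    ONE_TIME = "capture a one-time copy"


def should_capture(mandate, changed_fraction, days_since_last,
                   significance=0.2, period_days=30):
    """Decide whether to capture now.

    changed_fraction: share of the site that changed since the last capture.
    days_since_last: days since the last capture, or None if never captured.
    significance and period_days are assumed thresholds.
    """
    if days_since_last is None:
        return True  # every mandate needs at least one copy
    if mandate is Mandate.ALL_CHANGES:
        return changed_fraction > 0
    if mandate is Mandate.SIGNIFICANT_CHANGES:
        return changed_fraction >= significance
    if mandate is Mandate.PERIODIC:
        return days_since_last >= period_days
    return False  # ONE_TIME: a copy already exists
```

Under these assumptions, a site that changed by 5% ten days after its last capture would be captured again only under the first mandate.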
29
Current Activities
  • VRC Preservation Risk Management Program:
    • Map stages to tool requirements
    • Apply to potential organizational scenarios
    • Enable risk/response scenario development
  • Toolkit:
    • Revise and populate tool inventory
    • VRC Control Site


30
Future Projects
  • Develop approach for building human sexuality collection: capturing Web blogs and other Internet communications
  • State Government Web site case study
  • Demonstrators for toolkit scenarios