1
mod_oai: Metadata Harvesting for Everyone
  • Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango


  • {mln,aelango}@cs.odu.edu
  • {herbertv,liu_x}@lanl.gov


  • DLF 2004 Fall Forum
  • Baltimore MD
  • October 25-27, 2004
2
Outline
  • mod_oai
    • crawling vs. harvesting
    • complex objects & OAI-PMH
    • how mod_oai works
    • scenarios
    • demos
  • More information
    • http://www.modoai.org/
    • http://www.openarchives.org/


5
mod_oai
  • Goal: integrate OAI-PMH functionality into the web server itself…
  • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
    • written in C
    • respects values in .htaccess, httpd.conf
  • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
      • www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
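
A minimal sketch of how a harvester might assemble such a selective request. The helper name `build_oai_request` is hypothetical and not part of mod_oai; the base URL is the placeholder host from the slide, and the argument names (`verb`, `metadataPrefix`, `from`, `set`) are the standard OAI-PMH ones.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL (hypothetical helper, for illustration)."""
    # OAI-PMH passes the verb and its arguments as ordinary query parameters.
    params = {"verb": verb}
    # Leave out arguments that were not supplied.
    params.update({k: v for k, v in arguments.items() if v is not None})
    # Keep ":" unescaped so set specs such as video:mpeg stay readable.
    return base_url + "?" + urlencode(params, safe=":")


# "from" is a Python keyword, so the arguments are passed via a dict.
url = build_oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    **{"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```

Adding an `until` argument in the same way would bound the datestamp range from above.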
6
OAI-PMH data model
7
OAI-PMH and complex models
  • OAI-PMH record == modeled representation of the resource
  • Can be selectively harvested via OAI-PMH by datestamp and set
  • Resource can be:
    • simple object (1 file)
    • compound object (multiple files)
  • OAI-PMH records can contain:
    • Typical metadata
    • Actual resource(s)
      • By-Value – base64 encoded
      • By-Reference – http address of resource
      • both
    • Identifiers of metadata and resource(s), unambiguously mapped to the identified data
    • A variety of secondary information
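
The by-value and by-reference options above can be sketched roughly as follows. The dictionary layout and the function names are assumptions made for illustration only; they are not the record format used by mod_oai or by any particular repository.

```python
import base64


def by_value(data: bytes, mime_type: str) -> dict:
    """Embed the resource itself, base64 encoded (illustrative layout)."""
    return {
        "mimeType": mime_type,
        "encoding": "base64",
        "content": base64.b64encode(data).decode("ascii"),
    }


def by_reference(url: str, mime_type: str) -> dict:
    """Point at the resource by its HTTP address instead of embedding it."""
    return {"mimeType": mime_type, "ref": url}


def embedded_bytes(resource: dict):
    """Return the decoded bytes for a by-value resource, else None."""
    if resource.get("encoding") == "base64":
        return base64.b64decode(resource["content"])
    return None


inline = by_value(b"hello", "text/plain")
linked = by_reference("http://www.foo.edu/clip.mpg", "video/mpeg")
print(inline["content"], linked["ref"])
```

A record that carries both forms would simply hold one entry of each kind for the same resource.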
8
Complex Objects & OAI-PMH
  • LANL Repository
    • OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
  • APS/LANL/LoC Mirroring
    • OAI-PMH transfer of APS content represented in application neutral format (DIDLs)
  • LANL DSpace Plug-in
    • Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure
9
How mod_oai works
  • Install on an Apache 2.0 server
    • compile & edit httpd.conf




10
OAI-PMH characteristics: Typical Repository
12
OAI-PMH characteristics: mod_oai
13
OAI-PMH Concepts
14
http_header
15
Use Cases
  • Regular Web Crawling
    • use ListIdentifiers to discover URLs
    • add new URLs to the list of URLs to be crawled
  • Harvesting Resources w/ OAI-PMH
    • use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP
16
Regular Crawling: ListIdentifiers
  • harvester issues a ListIdentifiers request, finds the updates, and does HTTP GETs on just the updates
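
A minimal sketch of the crawler side of this step, assuming each OAI-PMH identifier returned by the server is the URL of the resource itself. The sample response is hand-written for illustration (not captured from a real server), and `updated_urls` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; a real one carries more elements.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-16</datestamp>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/old.html</identifier>
      <datestamp>2004-09-17</datestamp>
    </header>
    <header>
      <identifier>http://www.foo.edu/video/clip.mpg</identifier>
      <datestamp>2004-09-18</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def updated_urls(response_xml):
    """Return identifiers of updated, non-deleted items in a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.findall(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue  # nothing to fetch for deleted items
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if identifier:
            urls.append(identifier.strip())
    return urls


# Add only URLs that are not already queued to the crawl list.
to_crawl = ["http://www.foo.edu/index.html"]
for found in updated_urls(SAMPLE_RESPONSE):
    if found not in to_crawl:
        to_crawl.append(found)
print(to_crawl)
```

The crawler would then issue ordinary HTTP GETs for the URLs in `to_crawl`.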
17
Resource Harvesting: ListRecords
  • harvester issues a ListRecords request and gets the updates in DIDLs (http headers + by-value or by-ref resources)
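
A rough sketch of how a harvester might separate by-value from by-reference resources in a DIDL. The fragment below is a heavily simplified, hand-written example and should not be read as the exact output of mod_oai; it assumes the MPEG-21 DIDL namespace and `Resource` elements carrying `mimeType`, `encoding`, and `ref` attributes.

```python
import base64
import xml.etree.ElementTree as ET

DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"

# Simplified, hand-written DIDL fragment for illustration only.
SAMPLE_DIDL = """
<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Component>
      <didl:Resource mimeType="text/plain" encoding="base64">aGVsbG8=</didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="video/mpeg" ref="http://www.foo.edu/clip.mpg"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>
"""


def split_resources(didl_xml):
    """Return (by_value, by_reference) lists from a DIDL document."""
    root = ET.fromstring(didl_xml)
    embedded, linked = [], []
    for res in root.iter("{%s}Resource" % DIDL_NS):
        mime = res.get("mimeType")
        if res.get("ref"):
            # By-reference: only the HTTP address is carried.
            linked.append((mime, res.get("ref")))
        elif res.get("encoding") == "base64":
            # By-value: the content travels inside the record.
            embedded.append((mime, base64.b64decode((res.text or "").strip())))
    return embedded, linked


embedded, linked = split_resources(SAMPLE_DIDL)
print(embedded, linked)
```

By-reference resources still need a follow-up HTTP GET, while by-value resources are complete as soon as the record arrives.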
18
Demo
  • Repository Explorer
    • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
    • http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
  • Direct URLs
    • http://whiskey.cs.odu.edu/modoai?verb=Identify
    • http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
    • http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
    • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
    • http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
19
Datestamps and Etags
  • Procedure
    • 16 harvests over 1 month of 465,374 .dk domains
    • 5,543,470 possible downloads
    • 5,182,034 successful downloads
    • 599,143 changes
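
The figures above can be put in proportion with a few lines of arithmetic. The sketch assumes that the 599,143 changes were counted among the successful downloads; the slide does not state this explicitly.

```python
possible = 5_543_470    # possible downloads
successful = 5_182_034  # successful downloads
changes = 599_143       # observed changes

failed = possible - successful
success_rate = successful / possible
# Assumption: changes are measured against successful downloads.
change_rate = changes / successful

print(f"failed downloads: {failed:,}")
print(f"success rate:     {success_rate:.1%}")
print(f"change rate:      {change_rate:.1%}")
```

Under that assumption, roughly one successful download in nine reflected a change.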
20
Errors in Datestamps and Etags Indicating Change
21
mod_oai…
  • is:
    • a simple way to harvest web pages more efficiently
    • a possible influence on how robots.txt is used
    • fully OAI-PMH compliant
      • works with existing harvesters
  • is not:
    • yet suitable for dynamic files
    • a replacement for
      • DSpace
      • Fedora
      • eprints.org
      • other digital libraries / repositories / cms