1
|
- Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
- {mln,aelango}@cs.odu.edu
- {herbertv,liu_x}@lanl.gov
- DLF 2004 Fall Forum
- Baltimore MD
- October 25-27, 2004
|
2
|
- mod_oai
- crawling vs. harvesting
- complex objects & OAI-PMH
- how mod_oai works
- scenarios
- demos
- More information
- http://www.modoai.org/
- http://www.openarchives.org/
|
3
|
|
4
|
|
5
|
- Goal: integrate OAI-PMH functionality into the web server itself…
- mod_oai: an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
- written in C
- respects values in .htaccess, httpd.conf
- Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
- www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
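A request like the one above is an ordinary HTTP query string, so it can be assembled with standard tools. The following is a minimal sketch, not part of mod_oai: the helper name `build_oai_request` is hypothetical, and the base URL is the placeholder host from the slide.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments.

    Hypothetical helper for illustration; not part of mod_oai.
    """
    # OAI-PMH carries the verb and its arguments as plain query parameters.
    query = {"verb": verb}
    # Skip arguments that were not supplied.
    query.update({k: v for k, v in params.items() if v is not None})
    # Keep ":" unescaped so set names such as "video:mpeg" stay readable.
    return base_url + "?" + urlencode(query, safe=":")


# "from" is a Python keyword, so the arguments are passed as a dict.
url = build_oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    **{"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```

The `from` and `set` arguments are what give the harvest its selective semantics: only resources changed since the given date, within the given set, are listed.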
|
6
|
|
7
|
- OAI-PMH record == modeled representation of the resource
- Can be selectively harvested via OAI-PMH (by datestamp, set)
- Resource can be:
- simple object (1 file)
- compound object (multiple files)
- OAI-PMH records can contain:
- Typical metadata
- Actual resource(s)
- By-Value – base64 encoded
- By-Reference – http address of resource
- both
- Identifiers of metadata and resource(s), unambiguously mapped to the
identified data
- A variety of secondary information
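The by-value and by-reference options can be illustrated with a small data model. This is only a sketch under stated assumptions: the `Resource` class and its field names are hypothetical and do not reflect the actual DIDL schema or mod_oai internals.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """One file of a harvested object (hypothetical model for illustration)."""

    identifier: str                      # identifier of the resource
    by_reference: Optional[str] = None   # HTTP address of the resource
    by_value: Optional[str] = None       # base64-encoded bytes of the resource

    def content(self) -> Optional[bytes]:
        """Return the embedded bytes, or None if only a reference is present."""
        if self.by_value is None:
            return None
        return base64.b64decode(self.by_value)


payload = b"<html>hello</html>"

# A record may carry the resource by value, by reference, or both.
res = Resource(
    identifier="http://www.foo.edu/index.html",
    by_reference="http://www.foo.edu/index.html",
    by_value=base64.b64encode(payload).decode("ascii"),
)
ref_only = Resource(
    identifier="http://www.foo.edu/big.mpg",
    by_reference="http://www.foo.edu/big.mpg",
)
```

A harvester would call `content()` first and fall back to an HTTP GET on `by_reference` when it returns `None`.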
|
8
|
- LANL Repository
- OAI-PMH as a Repository Access Protocol to access metadata and content
represented as DIDLs
- APS/LANL/LoC Mirroring
- OAI-PMH transfer of APS content represented in an application-neutral format (DIDLs)
- LANL DSpace Plug-in
- Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH
infrastructure
|
9
|
- Install on an Apache 2.0 server
- compile & edit httpd.conf
|
10
|
|
11
|
|
12
|
|
13
|
|
14
|
|
15
|
- Regular Web Crawling
- use ListIdentifiers to discover URLs
- add new URLs to the list of URLs to be crawled
- Harvesting Resources w/ OAI-PMH
- use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP
|
16
|
- harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
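The ListIdentifiers step can be sketched as follows. The response below is a hand-written sample in the OAI-PMH 2.0 response format, not output captured from mod_oai, and the function name `updated_urls` is hypothetical. It assumes, as the crawling scenario suggests, that each identifier is the URL of the resource.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response (assumption, for illustration only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-10-25T12:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc"
           from="2004-09-15">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-20T08:15:00Z</datestamp>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/old.html</identifier>
      <datestamp>2004-09-18T10:00:00Z</datestamp>
    </header>
    <header>
      <identifier>http://www.foo.edu/video/talk.mpg</identifier>
      <datestamp>2004-10-01T14:30:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def updated_urls(response_xml):
    """Return identifiers of records that changed and still exist."""
    root = ET.fromstring(response_xml)
    urls = []
    for header in root.iterfind("oai:ListIdentifiers/oai:header", OAI_NS):
        # Deleted records have nothing left to fetch.
        if header.get("status") == "deleted":
            continue
        urls.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return urls


# The harvester would then issue one HTTP GET per returned URL.
to_fetch = updated_urls(SAMPLE_RESPONSE)
print(to_fetch)
```

A full harvester would also follow the `resumptionToken` that OAI-PMH uses for paging through long lists; that is omitted here for brevity.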
|
17
|
- harvester issues a ListRecords and gets the updates in DIDLs (HTTP headers + by-value or by-ref resources)
|
18
|
- Repository Explorer
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
- Direct URLs
- http://whiskey.cs.odu.edu/modoai?verb=Identify
- http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
- http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
- http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
- http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
|
19
|
- Procedure
- 16 harvests over 1 month of 465,374 .dk domains
- 5,543,470 possible downloads
- 5,182,034 successful downloads
- 599,143 changes
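The figures above can be turned into rates with simple arithmetic. This sketch assumes that the changes were counted among the successful downloads; the slide does not state the denominator, so the second ratio is an interpretation rather than a reported result.

```python
# Counts taken from the slide.
possible = 5_543_470
successful = 5_182_034
changes = 599_143

# Share of attempted downloads that succeeded.
success_rate = successful / possible

# Assumption: changes are measured against successful downloads.
change_rate = changes / successful

print(f"success rate: {success_rate:.1%}")
print(f"change rate:  {change_rate:.1%}")
```

Under that assumption, most downloads retrieved content that had not changed, which is the inefficiency that datestamp-based harvesting is meant to avoid.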
|
20
|
|
21
|
- mod_oai is:
- a simple way to more efficiently harvest web pages
- a possible impact on robots.txt
- fully OAI-PMH compliant
- compatible with existing harvesters
- mod_oai is not:
- yet suitable for dynamic files
- a replacement for
- DSpace
- Fedora
- eprints.org
- other digital libraries / repositories / cms
|