1
Tools and Findings of the Emory MetaCombine Project
  • Martin Halbert & Aaron Krowne
  • Emory University
  • DLF Spring 2006 Forum
  • Tuesday, April 11, 2006
  • Austin, TX
2
Overview
  • Some project context and goals
    • Problems with scholarly portal technologies
    • MetaCombine Project goals
  • Open Source Tools Developed
    • Clustering and classification tools
    • Focused web crawlers
  • Next Steps
3
Scholarly Portals
  • New services created and maintained by libraries for learning communities, either for single campuses or for multiple institutions
  • An emerging field (Examples: OAIster, UIUC portals, Emory portals, AmWest)
  • Usually seek to implement metasearch through simultaneous searches across multiple harvested OAI repositories
4
Problems in First Generation Portals based on OAI-PMH
  • Metadata problems
    • Inadequate (not enough subjects, etc.)
    • Inconsistent (fields not standardized)
  • Important information “realms” not addressed
    • Web pages
    • Selective parts of library catalogs
    • Archives not exposed via OAI-PMH
5
Critiques from Scholars
  • “I want to be able to browse (not just keyword search) the holdings of a digital library to understand what’s in it.”
  • “Any assemblage of information is subject to bias; I want to know who selected the sources to include in this database.”
  • “I am a scholar of a specific subject; I don’t want to have to wade through everything in the universe.  Can’t you just show me the stuff that I’m interested in?”
6
Emory Metadata Harvesting Projects: 2001-2004
  • Partnered with teams of scholars to understand how to best design portal services for research needs, especially in specific subject domains / area studies
  • Developed and adapted open source software (OSS) tools for metadata harvesting, data providers, indexing, and searching
  • Explored models for inter-institutional cooperation in creating metadata aggregation networks using OAI-PMH
7
MetaCombine Project Research Questions
  • Can standardized subject taxonomies be assigned and/or derived for ad hoc information aggregations? (To answer the need: “I want to browse the collection.”)
  • What are the most effective interfaces for browsing such taxonomies?
  • Can institutions collaborate on metadata remediation activities through loosely coupled digital library frameworks?
  • Undertaken 2004-2006, sponsored by a grant from the Andrew W. Mellon Foundation
8
Classification and Clustering
  • Classification: assigning information to one or more access points in a predesignated subject taxonomy
  • Clustering: semantically analyzing a body of information to see what patterns can be found in the corpus, and then creating a subject taxonomy based on these patterns
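To make the distinction above concrete, here is a minimal Python sketch. It is an illustration only, not the MetaCombine tools: the taxonomy, seed terms, sample records, and the function names `classify` and `cluster` are all hypothetical, and simple word overlap stands in for real semantic analysis.

```python
def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def classify(record, taxonomy):
    """Classification: pick the predesignated category whose seed terms
    overlap most with the record. Returns None when nothing overlaps."""
    words = tokens(record)
    best = max(taxonomy, key=lambda cat: len(words & taxonomy[cat]))
    return best if words & taxonomy[best] else None

def cluster(records, threshold=0.2):
    """Clustering: no taxonomy is given. Greedily group records whose
    token sets are similar; the groups emerge from the data itself."""
    clusters = []  # each cluster is a list of record indices
    for i, rec in enumerate(records):
        for members in clusters:
            # Compare against the first member of each existing cluster.
            if jaccard(tokens(rec), tokens(records[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

TAXONOMY = {  # hypothetical categories and seed terms
    "music": {"blues", "jazz", "gospel"},
    "foodways": {"barbecue", "grits", "cooking"},
}
RECORDS = [  # hypothetical record titles
    "delta blues recordings 1930s",
    "gospel and blues field recordings",
    "barbecue cooking traditions",
]

print([classify(r, TAXONOMY) for r in RECORDS])
print(cluster(RECORDS))
```

In the classification case the category names exist before any record is seen; in the clustering case the groups come out unnamed and still need a person to label them.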
9
MetaCombine Findings
  • Clustering and classification tools still need refining, but are beginning to be genuinely useful for portal functions, provided you invest in the required expertise
  • Different groups of scholars want to rank and present results according to different criteria, especially in a metasearch context where information from different “realms” is being retrieved
  • Loosely coupled DL frameworks can be useful for metadata remediation and enhancement, but need standards for exchanging information consistently
10
MetaCombine Technologies Developed
  • Clustering tools, using new NMF algorithms
  • Classification tools, using a training set based on the Encyclopedia of Southern Culture
  • Web services framework for inter-institutional exchange and remediation of metadata, built using OAI-PMH
  • Visualization tools for graphical browsing of subject clusters
  • (Through affiliated IMLS work) Search engine that can rank results according to quality metrics of various groups of scholars
11
Software Tools Created (all OSS)
  • Major Tool Groups
    • Core clustering
    • Visualization and editing
    • Focused crawling
    • Web services
    • Other interfacing tools
  • Federated Framework Model (OCKHAM-xform)
12
Core Clustering
  • "Ab initio" (unsupervised) topic discovery/structuring
  • Useful when you must organize resources but:
    • You lack an ontology
    • You have specialized/nuanced collections
    • There is no training set available
  • Developed by Aaron Krowne, Steve Ingram
  • Our system is based on a recent NMF algorithm
  • Flat and two hierarchical methods
  • Depends on SparseLib++
  • Also developed support tools:
    • Phrase finder
    • Vectorizer
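As a rough illustration of flat NMF clustering, the sketch below factors a small document-term matrix and assigns each document to its strongest topic. The slides do not say which NMF algorithm the project used, so the classic multiplicative update rule is swapped in here as an assumption. The pure-Python dense matrices, the toy counts, and the function names are hypothetical and are not taken from the actual tool, which depends on a sparse matrix library.

```python
import random

def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*A)]

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (docs x terms) as W @ H with
    W, H >= 0, using multiplicative updates for squared error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def assign_clusters(W):
    """Flat clustering: each document goes to its strongest topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

# Toy document-term counts: rows 0-1 share terms 0-1, rows 2-3 share terms 2-3.
V = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
]
W, H = nmf(V, k=2)
labels = assign_clusters(W)
print(labels)
```

Rows of `H` can be read as topics (weights over terms), which is one place a phrase finder or cluster-naming step could draw candidate labels from. A hierarchical variant could, for example, rerun the same routine inside each cluster, though the slides do not describe how the two hierarchical methods work.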

13
Sample cluster guidance report
14
Visualization
  • Navigating/interpreting clustering results
  • dev. Steve Ingram (sup. Aaron Krowne)
  • Java-based system
  • based on Prefuse viz. library
  • hierarchical, "drilling down"
  • use MDS techniques to project to 2D (PCA and NMF)
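The projection step can be sketched with plain PCA. This is a minimal pure-Python version, assuming small dense vectors and using power iteration with deflation; it is not the Prefuse-based implementation, and all function names and sample points are hypothetical.

```python
import math

def center(X):
    """Subtract the column means so each feature has mean zero."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def covariance(X):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(X), len(X[0])
    return [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(C, iters=200):
    """Power iteration: dominant eigenvalue and unit eigenvector of C."""
    d = len(C)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            return 0.0, [0.0] * d
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam, v

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = center(X)
    C = covariance(Xc)
    lam1, v1 = top_eigenpair(C)
    d = len(C)
    # Deflate: remove the first component, then find the next one.
    C2 = [[C[i][j] - lam1 * v1[i] * v1[j] for j in range(d)] for i in range(d)]
    _, v2 = top_eigenpair(C2)
    return [[sum(a * b for a, b in zip(row, v1)),
             sum(a * b for a, b in zip(row, v2))] for row in Xc]

# Hypothetical cluster centroids in a 3-dimensional feature space.
POINTS = [[0, 0, 0], [4, 0, 0], [0, 2, 0], [4, 2, 0]]
for xy in pca_2d(POINTS):
    print([round(c, 3) for c in xy])
```

The sign of each axis is arbitrary in PCA, so a layout built this way may appear mirrored from run to run if the starting vector changes.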
15
Visualization Interface
16
Scheme Editor
  • Massaging/expert tweaking of clustering results
  • Dev. Steve Ingram (sup. Aaron Krowne)
  • Java-based
  • Esp. naming clusters (into "real" categories)
  • Merging/deleting clusters
  • Works with XML clustering organization files
17
Scheme Editor Interface
18
Focused Crawling
19
Focused Crawling
  • "Focused crawling system" (FCS)
  • dev. by Aaron Krowne, Saurabh Pathak, in cooperation with Donna Bergmark
  • Built on Heritrix
  • Purpose: efficient, topic-driven discovery of web resources
  • "Focused" with a classifier (BOW)
  • Based on BOW module for Heritrix
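The general idea of classifier-guided crawling can be sketched as a best-first search over links. The example below is an assumption-laden toy: it does not use Heritrix or its APIs, a simple term-fraction scorer stands in for the real classifier, and the in-memory `TOY_WEB`, `toy_fetch`, and `focused_crawl` names are hypothetical.

```python
import heapq

def relevance(text, topic_terms):
    """Fraction of words on the page that are topic terms (toy classifier)."""
    words = text.lower().split()
    return sum(w in topic_terms for w in words) / len(words) if words else 0.0

def focused_crawl(seeds, fetch, topic_terms, threshold=0.2, max_pages=10):
    """Best-first crawl: expand the most promising URL next, and only
    follow links out of pages the classifier judges on-topic."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    queued, relevant, fetched = set(seeds), [], 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        fetched += 1
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # off-topic: keep nothing, follow nothing
        relevant.append(url)
        for link in links:
            if link not in queued:
                queued.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("southern blues music archive", ["a", "b"]),
    "a": ("delta blues music recordings", ["c"]),
    "b": ("used cars for sale", ["d"]),
    "c": ("gospel music history", []),
    "d": ("blues music festival", []),
}

def toy_fetch(url):
    """Stand-in for an HTTP fetch plus link extraction."""
    return TOY_WEB[url]

print(focused_crawl(["seed"], toy_fetch, {"blues", "music", "gospel"}))
```

Page "d" is on-topic but is never visited, because the only path to it runs through an off-topic page. That trade of recall for efficiency is the core design choice in a focused crawler.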
20
Focused Crawling (cont.)
  • Why build:
    • needed something that worked
    • needed something unencumbered by IP
    • needed something easier for digital librarians to use
  • Guided bootstrapping:
    • Ability to utilize phrase/keyword lists
    • Gleaning seeds through search engine (Google)
    • Seeding through Open Directory (also for negative set)
    • Seeding/training via OAI repositories
    • Development of phrase lists with phrase finder
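One simple way to develop a phrase list is to count adjacent word pairs that recur across seed documents. The sketch below is a guess at the general technique, not the project's phrase finder; the stopword list, sample texts, and `find_phrases` name are hypothetical.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "a", "to", "for"}  # illustrative only

def find_phrases(texts, min_count=2):
    """Return two-word phrases that occur at least `min_count` times
    across the texts, most frequent first, skipping stopword pairs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[(first, second)] += 1
    return [" ".join(pair) for pair, n in counts.most_common() if n >= min_count]

SEED_TEXTS = [  # hypothetical seed documents
    "civil rights movement in the south",
    "history of the civil rights era",
    "civil war letters",
]

print(find_phrases(SEED_TEXTS))
```

The resulting phrases could then be fed back in as keyword lists or as search engine queries for gleaning further seeds.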

21
Web Services
  • Clustering and classification
  • SOAP web services as wrapper around actual machine learning systems
  • dev. Urvashi Gadi (sup. Aaron Krowne)
  • Key benefits:
    • Don't need to install machine learning tools
    • Don't need to have the computation resources
    • Can use training sets to which you may not have direct access
22
Web Services (cont.)
  • I/O:
    • input: OAI repository base URL
    • output: new "static" temporary repository at new URL w/set structure
    • alt. output: XML organization file
  • Modules (in PHP):
    • Server module (depends upon BOW or NMF tool)
    • Client module
23
CWIS Interfacing
  • CWIS clustered scheme import script
  • Demonstrates porting of clusterings to mainstream DL software
  • I/O:
    • Input: clustered organization XML file
    • Updates CWIS database to add categories, classifications
    • Must also import records
24
CWIS Interfacing (cont.)
25
OCKHAM-xform Model
  • Use OAI repositories as atomic, portable objects
  • These objects can be transformed to useful ends
  • The transformed results can be loaded into digital libraries for new purposes
  • Enhanced services can be built upon them
  • MetaCombine web services and OAIcopy fit into this framework (developed in the OCKHAM NSDL/DLF project)
26
Transform Workflow
27
OAIcopy
  • "Glue" tool of OCKHAM-xform
  • I/O:
    • Input: OAI repository (base URL)
    • Output: "static" OAI repository (OAI-XMLFile-based) in a subdirectory
  • "Save" the OAI repository output of MetaCombine Web Services (classification or clustering)
  • Single, intuitive command: oaicopy <base_url> <local_directory>
  • Afterwards: new OAI repository at http://your_server/web_root/subdir_name/
  • Other uses:
    • Repository caching (with no database required)
    • Repository upgrading (1.0/1.1 -> 2.0)
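The harvesting half of such a copy can be sketched with the standard OAI-PMH `ListRecords` verb and resumption tokens. This is not OAIcopy's implementation: writing the static repository is omitted, the `fetch` callable is a stand-in for an HTTP client, and the example base URL, identifiers, and canned responses are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, token=None, prefix="oai_dc"):
    """Build a ListRecords request. In OAI-PMH 2.0 a resumptionToken
    is an exclusive argument, so metadataPrefix is sent only at first."""
    args = {"verb": "ListRecords"}
    if token:
        args["resumptionToken"] = token
    else:
        args["metadataPrefix"] = prefix
    return base_url + "?" + urlencode(args)

def harvest(base_url, fetch):
    """Yield every record identifier, following resumption tokens.
    `fetch` maps a request URL to the response XML as text."""
    token = None
    while True:
        root = ET.fromstring(fetch(list_records_url(base_url, token)))
        for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
            yield header.findtext("oai:identifier", namespaces=OAI_NS)
        token = (root.findtext(".//oai:resumptionToken",
                               namespaces=OAI_NS) or "").strip()
        if not token:  # missing or empty token marks the last page
            break

def page(ids, token=""):
    """Build a canned ListRecords response for the demo below."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in ids)
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f"<ListRecords>{records}"
            f"<resumptionToken>{token}</resumptionToken>"
            "</ListRecords></OAI-PMH>")

# Hypothetical repository served from a dictionary instead of the network.
BASE = "http://example.org/oai"
PAGES = {
    BASE + "?verb=ListRecords&metadataPrefix=oai_dc":
        page(["oai:ex:1", "oai:ex:2"], "t1"),
    BASE + "?verb=ListRecords&resumptionToken=t1":
        page(["oai:ex:3"]),
}

print(list(harvest(BASE, PAGES.__getitem__)))
```

Passing `fetch` in as a parameter keeps the protocol logic testable without a live repository; a real run would wrap an HTTP call instead of the dictionary lookup.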
28
Coming Experiments in Workflow
  • "Pipelining" tools/facilities
  • Digital librarian-friendly management tools
  • More collaborative transform services, e.g.
    • Thumbnail generation
    • Collection-level metadata augmentation
    • Date normalization
    • UTF sanitization
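Two of the transform services listed above, date normalization and UTF sanitization, can be sketched in a few lines. The slides do not define either service, so the behavior here is an assumption: dates are mapped to ISO 8601 from a short, hypothetical list of input formats, and text is decoded as UTF-8, normalized to NFC, and stripped of control characters. Month names are parsed with the default C locale.

```python
import unicodedata
from datetime import datetime

# Hypothetical set of input formats a service might accept.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y", "%Y")

def normalize_date(value):
    """Return an ISO 8601 date (YYYY-MM-DD, or YYYY for a bare year),
    or None if no known format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return parsed.strftime("%Y" if fmt == "%Y" else "%Y-%m-%d")
    return None

def sanitize_text(raw):
    """Decode bytes as UTF-8 (replacing invalid sequences), normalize to
    NFC, and drop control characters other than tab and newline."""
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text
                   if ch in "\t\n" or unicodedata.category(ch) != "Cc")

print(normalize_date("April 11, 2006"))
print(normalize_date("circa 1900"))
print(sanitize_text(b"caf\xc3\xa9\x00"))
```

Returning `None` for unparseable dates, rather than guessing, leaves ambiguous values such as "circa 1900" for a person or a later service to resolve.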
29
Next Steps
  • New funding from the Mellon Foundation will go toward developing production-level versions of these services over the next two years
  • Emory will develop a new portal for Southern Studies based on these technologies
  • Emory will collaborate with other institutions in the DLF Aquifer project on frameworks for metadata remediation and enhancement; a workshop will be held in Atlanta in July 2006