1	Tools and Findings of the Emory MetaCombine Project
	Martin Halbert & Aaron Krowne, Emory University
	DLF Spring 2006 Forum
	Tuesday, April 11, 2006, Austin, TX
2	Overview
	Some project context and goals
	Problems with scholarly portal technologies
	MetaCombine Project goals
	Open Source Tools Developed
		Clustering and classification tools
		Focused web crawlers
	Next Steps
3	Scholarly Portals
	New services created and maintained by libraries for learning communities, either for single campuses or for multiple institutions
	An emerging field (Examples: OAIster, UIUC portals, Emory portals, AmWest)
	Usually seek to implement metasearch through simultaneous searches across multiple harvested OAI repositories
4	Problems in First Generation Portals based on OAI-PMH
	Metadata problems
		Inadequate (not enough subjects, etc.)
		Inconsistent (fields not standardized)
	Important information “realms” not addressed
		Web pages
		Selective parts of library catalogs
		Archives not exposed via OAI-PMH
5	Critiques from Scholars
	“I want to be able to browse (not just keyword search) the holdings of a digital library to understand what’s in it.”
	“Any assemblage of information is subject to bias; I want to know who selected the sources to include in this database.”
	“I am a scholar of a specific subject; I don’t want to have to wade through everything in the universe. Can’t you just show me the stuff that I’m interested in?”
6	Emory Metadata Harvesting Projects: 2001-2004
	Partnered with teams of scholars to understand how best to design portal services for research needs, especially in specific subject domains / area studies
	Developed and adapted open source software (OSS) tools for metadata harvesting, data providers, indexing, and searching
	Explored models for inter-institutional cooperation in creating metadata aggregation networks using OAI-PMH
7	MetaCombine Project Research Questions
	Can standardized subject taxonomies be assigned and/or derived for ad hoc information aggregations? (To answer the need: “I want to browse the collection.”)
	What are the most effective interfaces for browsing such taxonomies?
	Can institutions collaborate on metadata remediation activities through loosely coupled digital library frameworks?
	Undertaken 2004-2006, sponsored by a grant from the Andrew W. Mellon Foundation
8	Classification and Clustering
	Classification: assigning information to one or more access points in a predesignated subject taxonomy
	Clustering: semantically analyzing a body of information to see what patterns can be found in the corpus, and then creating a subject taxonomy based on these patterns
9	MetaCombine Findings
	Clustering and classification tools still need refining, but are beginning to be genuinely useful for portal functions, provided you invest in the required expertise
	Different groups of scholars want to rank and present results according to different criteria, especially in a metasearch context where information from different “realms” is being retrieved
	Loosely coupled DL frameworks can be useful for metadata remediation and enhancement, but need standards for exchanging information consistently
10	MetaCombine Technologies Developed
	Clustering tools, using new NMF algorithms
	Classification tools, using a training set based on the Encyclopedia of Southern Culture
	Web services framework for inter-institutional exchange and remediation of metadata, built using OAI-PMH
	Visualization tools for graphical browsing of subject clusters
	(Through affiliated IMLS work) Search engine that can rank results according to quality metrics of various groups of scholars
11	Software Tools Created (all OSS)
	Major Tool Groups
		Core clustering
		Visualization and editing
		Focused crawling
		Web services
		Other interfacing tools
	Federated Framework Model (OCKHAM-xform)
12	Core Clustering
	"Ab initio" (unsupervised) topic discovery/structuring
	Useful when you must organize resources but:
		You lack an ontology
		You have specialized/nuanced collections
		There is no training set available
	Developed by Aaron Krowne, Steve Ingram
	Our system is based on a recent non-negative matrix factorization (NMF) algorithm
		Flat and two hierarchical methods
		Depends on SparseLib++
	Also developed support tools:
		Phrase finder
		Vectorizer
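To make the NMF idea concrete, here is a minimal sketch of flat clustering by non-negative matrix factorization. The slides do not say which NMF variant the project used, so this assumes the standard multiplicative-update rules (Lee and Seung) as a stand-in; the function names and the toy term-by-document matrix are hypothetical and are not taken from the MetaCombine code.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and its approximation W*H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative terms-by-documents matrix V into W (terms x k)
    and H (k x documents) using multiplicative updates (assumed variant)."""
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_docs)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_terms)]
    return w, h

def assign_clusters(h):
    """Assign each document to the topic with the largest weight in H."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]

# Toy matrix: rows are terms, columns are documents (illustrative values only).
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]
W, H = nmf(V, k=2)
labels = assign_clusters(H)
```

Each column of H gives a document's weight on each discovered topic, and each column of W gives the terms that characterize a topic, which is the kind of output a person could then name and tidy up by hand.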
13	Sample cluster guidance report
14	Visualization
	Navigating/interpreting clustering results
	Dev. Steve Ingram (sup. Aaron Krowne)
	Java-based system built on the Prefuse visualization library
	Hierarchical, "drilling down"
	Uses MDS techniques to project to 2D (PCA and NMF)
15	Visualization Interface
16	Scheme Editor
	Massaging/expert tweaking of clustering results
	Dev. Steve Ingram (sup. Aaron Krowne)
	Java-based
	Esp. naming clusters (into "real" categories)
	Merging/deleting clusters
	Works with XML clustering organization files
17	Scheme Editor Interface
18	Focused Crawling
19	Focused Crawling
	"Focused crawling system" (FCS)
	Dev. by Aaron Krowne, Saurabh Pathak, in cooperation with Donna Bergmark
	Built on Heritrix
	Purpose: efficient, topic-driven discovery of web resources
	"Focused" with a classifier (BOW)
	Based on BOW module for Heritrix
20	Focused Crawling (cont.)
	Why build:
		Needed something that worked
		Needed something unencumbered by IP
		Needed something easier for digital librarians to use
	Guided bootstrapping:
		Ability to utilize phrase/keyword lists
		Gleaning seeds through search engine (Google)
		Seeding through Open Directory (also for negative set)
		Seeding/training via OAI repositories
		Development of phrase lists with phrase finder
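The general shape of a focused crawl can be sketched in a few lines. This is only an illustration of the technique, not the FCS itself (which is built on Heritrix, a Java crawler): it assumes a simple phrase-list scorer in place of a trained classifier, and it walks a small in-memory "web" instead of fetching pages. All names and the sample pages are hypothetical.

```python
import heapq

def relevance(text, phrases):
    """Fraction of the topic phrases that occur in the page text."""
    lowered = text.lower()
    return sum(1 for p in phrases if p.lower() in lowered) / len(phrases)

def focused_crawl(pages, seeds, phrases, threshold=0.3, limit=10):
    """Best-first crawl: links found on relevant pages are visited first,
    and links on off-topic pages are never followed."""
    frontier = [(-1.0, url) for url in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    accepted = []
    while frontier and len(accepted) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages.get(url, ("", []))
        score = relevance(text, phrases)
        if score < threshold:
            continue  # off-topic page: discard it and its outgoing links
        accepted.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return accepted

# Toy web: url -> (page text, outgoing links). Illustrative data only.
PAGES = {
    "seed": ("Southern culture and folklore archive", ["a", "b"]),
    "a": ("Blues music in southern culture", ["c"]),
    "b": ("Discount car insurance quotes", ["d"]),
    "c": ("Folklore of the delta blues", []),
    "d": ("More insurance offers", []),
}
PHRASES = ["southern culture", "folklore", "blues"]
result = focused_crawl(PAGES, ["seed"], PHRASES)
```

In this toy run the crawl keeps the three on-topic pages and never reaches page "d", because the only page linking to it scored below the threshold.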
21	Web Services
	Clustering and classification SOAP web services as wrapper around actual machine learning systems
	Dev. Urvashi Gadi (sup. Aaron Krowne)
	Key benefits:
		Don't need to install machine learning tools
		Don't need to have the computation resources
		Can use training sets to which you may not have direct access
22	Web Services (cont.)
	I/O:
		Input: OAI repository base URL
		Output: new "static" temporary repository at a new URL, with set structure
		Alt. output: XML organization file
	Modules (in PHP):
		Server module (depends upon BOW or NMF tool)
		Client module
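Because the output is an ordinary OAI repository with a set structure, a client could read one cluster or category back with standard OAI-PMH requests. The sketch below only builds the request URLs; the base URL and the set name `cluster_3` are made-up placeholders, and the actual set naming used by the MetaCombine services is not given in the slides.

```python
from urllib.parse import urlencode

def oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"

# Hypothetical temporary repository returned by the web service.
BASE = "http://example.org/tmp_repo/oai"

# List the sets (clusters or categories) the repository exposes.
list_sets_url = oai_request(BASE, "ListSets")

# Fetch the Dublin Core records assigned to one set.
records_url = oai_request(BASE, "ListRecords",
                          metadataPrefix="oai_dc", set="cluster_3")
```

`ListSets`, `ListRecords`, `metadataPrefix`, and `set` are standard OAI-PMH 2.0 names, so any existing harvester should be able to consume such a repository without special support.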
23	CWIS Interfacing
	CWIS clustered scheme import script
	Demonstrates porting of clusterings to mainstream DL software
	I/O:
		Input: clustered organization XML file
		Updates CWIS database to add categories, classifications
	Must also import records
24	CWIS Interfacing (cont.)
25	OCKHAM-xform Model
	Use OAI repositories as atomic, portable objects
	These objects can be transformed to useful ends
	The transformed results can be loaded into digital libraries for new purposes
	Enhanced services can be built upon them
	MetaCombine web services and OAIcopy fit into this framework (developed in the OCKHAM NSDL/DLF project)
26	Transform Workflow
27	OAIcopy
	"Glue" tool of OCKHAM-xform
	I/O:
		Input: OAI repository (base URL)
		Output: "static" OAI repository (OAI-XMLFile-based) in a subdirectory
	"Save" OAI repository output of MetaCombine Web Services (classification, clustering)
	Single, intuitive command: oaicopy <base_url> <local_directory>
	Afterwards: new OAI repository at http://your_server/web_root/subdir_name/
	Other uses:
		Repository caching (with no database required)
		Repository upgrading (1.0/1.1 -> 2.0)
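The harvesting half of such a copy can be sketched as a loop over OAI-PMH `ListRecords` responses that follows resumption tokens until the repository is exhausted. This is not OAIcopy's implementation, which the slides do not describe; it is a minimal illustration under the assumption of an OAI-PMH 2.0 source, with the HTTP call passed in as a `fetch` function (a hypothetical parameter) so the logic can be tried without a network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_identifiers(base_url, fetch, metadata_prefix="oai_dc"):
    """Collect every record identifier from a repository.

    `fetch` is any callable that takes a URL and returns the response
    body as a string (for example, a thin wrapper around urllib).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    identifiers = []
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        for header in root.iter(f"{OAI_NS}header"):
            identifiers.append(header.findtext(f"{OAI_NS}identifier"))
        token = root.findtext(f".//{OAI_NS}resumptionToken")
        if not token:
            return identifiers  # missing or empty token: last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

A real copy would also write each record to disk in whatever layout the static repository expects; that step is omitted here because the file layout is not specified in the slides.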
28	Coming Experiments in Workflow
	"Pipelining" tools/facilities
	Digital librarian-friendly management tools
	More collaborative transform services, e.g.
		Thumbnail generation
		Collection-level metadata augmentation
		Date normalization
		UTF sanitization
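As one illustration of what a transform service such as date normalization might do, the sketch below rewrites a few common date spellings into ISO 8601. The list of accepted formats is an assumption chosen for the example (including reading `04/11/2006` as month/day/year); the slides do not specify which formats the planned service would handle.

```python
from datetime import datetime

# Assumed input formats, tried in order. Month names rely on the default
# (English) locale.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y"]

def normalize_date(raw):
    """Return an ISO 8601 date string, a bare year, or None if unrecognized."""
    text = raw.strip()
    if len(text) == 4 and text.isdigit():
        return text  # year-only values are already valid ISO 8601
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unrecognized values for human review
```

Returning `None` instead of guessing keeps ambiguous values, such as "circa 1900", out of the normalized field so that they can be reviewed by a person.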
29	Next Steps
	New funding from the Mellon Foundation will go toward developing production-level versions of these services over the next 2 years
	Emory will develop a new portal for Southern Studies based on these technologies
	Emory will collaborate with other institutions in the DLF Aquifer project on frameworks for metadata remediation and enhancement; a workshop will be held in Atlanta in July 2006