1	Go Fish! Experiments with Topical Metadata Enhancement in the American West Project. DLF Spring Forum, Austin, TX, Tuesday, 11 April 2006. Bill Landis, Metadata Coordinator, CDL
2	Overview of Presentation: American West Project (CDL) context for topic enhancement; Clustering: experimentation & “for real”; Issues for ongoing work/cogitation; Discussion/Questions for Speakers
3	Contexts :: American West Project. Assembled testbed collection of OAI-harvested metadata (approx. 250K objects). 8 content-providing partners: CDL, Collaborative Digitization Program, Harvard, LC, Universities of IN/MI/VA/WA. CDL Common Framework repository: manage, remediate, enrich metadata objects; access platform (XTF indexes, Struts Action Framework). Explore CDL infrastructure needs for OAI harvesting and metadata enhancement.
4	Contexts :: Interface Exploration. Faceted hierarchical browse surfaced as primary access mechanism. Inspired by Marti Hearst’s Flamenco project work at SIMS-UCB. Need to “tailor” specific metadata elements for browse to function efficiently. AmWest focus initially on Topic, Geographic locations, Dates, [Genres]. CDL goal to do user assessment on faceted browse (teachers, undergrads, ?).
5	Clustering :: American West Project. Experimentation: Enough metadata in OAI records to get good results? Explore process/workload. Harvested approx. 360K records from AmWest-likely OAI sets from partners and other data providers. Did 7 topic model runs on this prototype “collection.” Used only dc:title, dc:description, and dc:subject from harvested records. “For real” clustering: Approx. 240K records from AmWest partners only. Did 4 topic model runs using the same DC elements noted above.
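As an illustration of the field selection described on this slide, the following is a minimal sketch of pulling only dc:title, dc:description, and dc:subject out of one harvested oai_dc record. The function name `record_text` and the sample record are assumptions made up for this example; the slides do not describe the actual extraction code.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The three elements the slides say were used for topic modeling.
FIELDS = ("title", "description", "subject")


def record_text(oai_dc_xml: str) -> str:
    """Concatenate the text of dc:title, dc:description and dc:subject.

    Hypothetical helper: every other DC element (creator, date, ...) is ignored.
    """
    root = ET.fromstring(oai_dc_xml)
    parts = []
    for field in FIELDS:
        for elem in root.iter(f"{{{DC_NS}}}{field}"):
            if elem.text and elem.text.strip():
                parts.append(elem.text.strip())
    return " ".join(parts)


# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Loggers at a timber camp</dc:title>
  <dc:creator>Unknown</dc:creator>
  <dc:subject>Logging</dc:subject>
  <dc:description>Crew posed beside a donkey engine.</dc:description>
</oai_dc:dc>"""

print(record_text(SAMPLE))
```

The resulting string per record is what a preprocessing and topic-modeling step would consume; repeated elements (e.g., several dc:subject values) are all kept.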
6	Clustering :: Preprocessing: convert to lowercase; remove punctuation; delete words with ≤ 2 characters; replace collocations (e.g., war_relocation_authority); apply stemming (‘rivers’ → ‘river’); delete words starting with a digit (e.g., dates); delete words in stopword list; delete infrequent words (< 10 occurrences)
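The preprocessing steps on this slide can be sketched as a small pipeline, applied in the order listed. This is only an illustration under stated assumptions: the slides do not name the stemmer, collocation list, or stopword list that were used, so `naive_stem`, `COLLOCATIONS`, and `STOPWORDS` below are hypothetical stand-ins.

```python
import re
from collections import Counter

# Illustrative stand-ins (assumptions): the real lists are not given in the slides.
COLLOCATIONS = {"war relocation authority": "war_relocation_authority"}
STOPWORDS = {"the", "and", "for", "with"}
MIN_COUNT = 10  # slide: delete words with < 10 occurrences


def naive_stem(word):
    """Toy stemmer (assumption): strips a trailing plural 's' only."""
    if "_" not in word and len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Apply the per-record steps from the slide, in order."""
    text = text.lower()                                # convert to lowercase
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    words = [w for w in text.split() if len(w) > 2]    # drop words with <= 2 chars
    text = " ".join(words)
    for phrase, joined in COLLOCATIONS.items():        # replace collocations
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", joined, text)
    words = [naive_stem(w) for w in text.split()]      # apply stemming
    words = [w for w in words if not w[0].isdigit()]   # drop words starting with a digit
    return [w for w in words if w not in STOPWORDS]    # drop stopwords


def preprocess(docs, min_count=MIN_COUNT):
    """Tokenize all records, then drop words that are rare across the collection."""
    tokenized = [tokenize(d) for d in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
```

The infrequent-word filter has to run last and over the whole collection, since word counts are only known after every record has been tokenized.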
7	Clustering :: Issues :: Stopwords. Words that don’t contribute topically should be in a stopword list. Issue: what to do with proper nouns? States: remove (captured by Geographical browse facet). Localities: remove only if they impact topic clustering (e.g., Seattle, Port Townsend). Personal and corporate names: probably best to remove, since their importance as known entities is fairly subjective.
8	Clustering :: Human Mediation. Reviewed quality of topics using Web-based topic browser: Does list of most likely words in topic make sense? Can we give this topic a label (LCSH, other)? Do the most likely objects in this topic seem reasonable? Do the topics assigned to each object seem reasonable? Assigned topic labels. Experimental round: LCSH (too big/granular to work with for deriving approx. 150 topics). “For real” round: TGM II, Subjects (worked quite well for this particular application, with a few stretches).
9	Clustering :: Relevance of Bag o’ Words
10	Clustering :: Bag o’ Words = Topic Label
11
12
13
14
15	Clustering :: American West Stats
16	And the reason why we’re doing all this work?
17
18
19
20	Issues. Thought we could push all normalization/enhancement activities up to point of ingest. NOT! Out-of-scope materials have dramatic impact on surfacing topics through clustering. “Curators” of digital content need a mechanism for subsetting out unwanted materials from OAI sets prior to any topic-surfacing activities. The better one is able to articulate the scope of a “collection,” and the more focused that “collection” is on the needs of a tangible audience, the more successful clustering will likely be.
21	Issues. Sustainability??? With AmWest, we have topically enriched metadata records, but no clear process at this point for reharvesting or adding additional materials. YIKES! Clustering → Classifying on ingest? CDL Experiment (fingers crossed): Can we build a classifier into our ingest routine for harvested sets to filter records matching 1+ of the 147 topics/word bags we’re calling “The American West”? Maybe at some point need to re-cluster to test validity of those topic/word bags and see if additional valid topics can be surfaced?
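One very simple way to picture the proposed ingest-time filter is a word-overlap check between a record’s tokens and each topic’s word bag. This is a sketch under stated assumptions, not the classifier CDL built or planned: the slide poses it as an open experiment, and the function names and the `min_hits` threshold are hypothetical. The single bag shown reuses the Logging example quoted in the Parting Thoughts slide; the real set would hold all 147 topics.

```python
# One topic/word-bag pair, taken from the Logging example in the Parting Thoughts slide.
TOPIC_BAGS = {
    "Logging": {
        "logging", "lumber", "lumber_company", "company", "loggers", "logs",
        "mill", "industry", "crew", "business", "timber", "steam", "shingle",
        "town", "donkeys", "log", "camp", "railroad", "donkey_engine", "sawmill",
    },
}


def matching_topics(tokens, topic_bags=TOPIC_BAGS, min_hits=3):
    """Return labels of topics whose bag shares >= min_hits distinct words with the record.

    min_hits is a hypothetical threshold chosen for illustration.
    """
    token_set = set(tokens)
    hits = {label: len(token_set & bag) for label, bag in topic_bags.items()}
    return sorted(label for label, n in hits.items() if n >= min_hits)


def in_scope(tokens, topic_bags=TOPIC_BAGS, min_hits=3):
    """Keep a harvested record only if it matches 1+ topics."""
    return bool(matching_topics(tokens, topic_bags, min_hits))


print(matching_topics(["loggers", "timber", "camp", "crew"]))
print(in_scope(["harvard", "yard", "town"]))
```

A raw overlap count ignores how likely each word is within its topic; a real classifier would presumably weight words, which is part of why periodic re-clustering to validate the bags, as the slide suggests, would still matter.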
22	Issues. Metadata for “cultural heritage material” is all over the map and will likely continue to be so. Best practices are great, but they won’t directly (maybe indirectly) get us the consistent metadata we need to drive our specific applications/implementations. Why do this at all? Who is the audience? This is a bigger question that the Academic/Digital Library community really needs to explore. At what level of application does this buck produce a bang? Collections to end all collections? Collections built by front-line librarians to meet very specific, assessable faculty teaching/research needs? Bit of both?
23	Parting Thoughts. Some flavors of metadata enhancement are more straightforward than others (e.g., dates, maybe geographic location). Some flavors don’t rely as much on “collection” context (a date normalized for indexing is a date normalized for indexing …). Need a more common understanding of the balance between what we can do globally/collaboratively and what makes sense only more locally. Clustering::global::shared topic label/bag o’ words pairs (e.g., Logging (LCSH) = logging lumber lumber_company company loggers logs mill industry crew business timber steam shingle town donkeys log camp railroad donkey_engine sawmill …). Classification::local::remediating specific metadata records for specific purposes. Aquifer a possible experimentation platform for arriving at a better understanding of this balance?