1
How (Not) to Use a Semi-automated Clustering Tool
  • Kat Hagedorn
  • University of Michigan
  • April 11, 2006
2
Update on UM’s efforts
  • Built three research portals
    • DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
    • MODS <http://www.hti.umich.edu/m/mods>
    • Aquifer <http://www.hti.umich.edu/a/aquifer>
  • Improvements for search / display
    • Integration of MODS format records
    • Simple vs. advanced searching
    • Inclusion of thumbnails
3
The need to cluster
  • Want to offer more than search within a generic, large corpus of data
  • How to partition the data?
  • Emory’s MetaCombine tool promising as a topical clustering agent
  • (Also interested in clustering by format, access restriction, OAI software used, etc.)
4
Clustering vs. classification
  • Clustering is main focus
    • Huge amount of data
    • Needed a tool to “find the topic”
    • Preferably a disjunctive tool (placing files under more than one topic)
  • Classification is secondary focus
    • Have potential classification (UM’s browse)
    • Marrying it to the current system is nigh on impossible
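
A minimal sketch of what "disjunctive" means here: one file may sit under more than one topic. The input shape, a list of (filename, topic) pairs, and the filenames themselves are assumptions for illustration; the slides do not describe MetaCombine's actual output format.

```python
from collections import defaultdict

# Hypothetical clustering output: (filename, topic) pairs.
# A filename may appear more than once, once per topic it belongs to.
assignments = [
    ("rec001.xml", "europe"),
    ("rec001.xml", "architecture"),
    ("rec002.xml", "mechanical"),
]

def build_topic_index(pairs):
    """Map each topic to the sorted list of filenames assigned to it."""
    index = defaultdict(set)
    for filename, topic in pairs:
        index[topic].add(filename)
    return {topic: sorted(names) for topic, names in index.items()}

topic_index = build_topic_index(assignments)
```

Here `rec001.xml` is listed under both "europe" and "architecture", which a strictly partitioning tool would not allow.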
5
Results: duration
  • First tried with a small repository of ~5,500 records (amnh)
    • Took around 25 minutes
  • Multiple tries with a larger repository of ~270K records (dlps)
    • Took around 12 hours
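
As a rough check on the two timings above, the throughput can be worked out directly. The figures are the approximate ones from this slide, and the sketch assumes the 12 hours refers to a single run over the larger repository.

```python
def records_per_minute(records, minutes):
    """Average throughput of a clustering run."""
    return records / minutes

# Approximate figures from the slide.
small_rate = records_per_minute(5_500, 25)         # amnh
large_rate = records_per_minute(270_000, 12 * 60)  # dlps

print(f"amnh: {small_rate:.0f} records/min")
print(f"dlps: {large_rate:.0f} records/min")
```

On these numbers the larger run averaged about 375 records per minute against about 220 for the smaller one, so per-record cost did not get worse as the repository grew; the 12 hours comes from sheer volume.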
6
Results: cluster names
  • Examples of set names from clustering UM’s metadata
    • Good: “europe”, “mechanical”, “architecture”
    • Not so good: “general”, “michigan”, “build”
    • Favorite: “southern literari literature fine messenger”
  • Granted…
    • Only asked for 20 clusters
    • Didn’t cluster hierarchically
7
Caveats
  • Metadata will always be difficult to cluster
  • Using a tool developed as a Web service, with obvious benefits
  • Expect necessity of mapping set names to real topical cluster names
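
The mapping step in the last bullet could be as simple as a hand-curated lookup table. This is a sketch only: the display labels on the right are illustrative guesses, not names used in UM's portals, and `display_name` is a hypothetical helper.

```python
# Hypothetical curated mapping from generated set names to display labels.
# The keys are set names quoted in the slides; the labels are illustrative.
SET_LABELS = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "southern literari literature fine messenger": "Southern Literary Messenger",
}

def display_name(set_name, labels=SET_LABELS):
    """Return the curated label, falling back to the raw set name."""
    return labels.get(set_name, set_name)

def unmapped(set_names, labels=SET_LABELS):
    """List set names that still need a human-assigned label."""
    return [name for name in set_names if name not in labels]

needs_review = unmapped(["europe", "general", "build"])
```

Falling back to the raw name keeps the portal usable while vague sets such as "general" and "build" wait for review.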
8
What we need
  • Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
  • Better set names…does this mean a better algorithm?
  • Ability to cluster by any criterion, not just topic, i.e., a post-processing module
  • Disjunctive clustering of filenames rather than file copies, so that placing a record under several topics doesn't hog storage
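
A post-processing module of the kind asked for above might group already-harvested records on any metadata field while storing only filenames. The record fields (`format`, `access`) and the sample values are assumptions for illustration, not the portals' actual schema.

```python
from collections import defaultdict

# Hypothetical harvested records, reduced to a few fields.
records = [
    {"filename": "a.xml", "format": "image", "access": "open"},
    {"filename": "b.xml", "format": "text", "access": "restricted"},
    {"filename": "c.xml", "format": "image", "access": "restricted"},
]

def cluster_by(records, field):
    """Group filenames by the value of one metadata field."""
    groups = defaultdict(list)
    for rec in records:
        # Records missing the field are collected under "unknown".
        groups[rec.get(field, "unknown")].append(rec["filename"])
    return dict(groups)

by_format = cluster_by(records, "format")
by_access = cluster_by(records, "access")
```

Because each group holds filenames only, the same record can appear in as many groupings as needed without duplicating the file itself.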