1
|
- Kat Hagedorn
- University of Michigan
- April 11, 2006
|
2
|
- Built three research portals
- DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
- MODS <http://www.hti.umich.edu/m/mods>
- Aquifer <http://www.hti.umich.edu/a/aquifer>
- Improvements for search / display
- Integration of MODS format records
- Simple vs. advanced searching
- Inclusion of thumbnails
|
3
|
- Want to offer more than search within a generic, large corpus of data
- How to partition the data?
- Emory’s MetaCombine tool promising as a topical clustering agent
- (Also interested in clustering by format, access restriction, OAI
software used, etc.)
|
4
|
- Clustering is main focus
- Huge amount of data
- Needed a tool to “find the topic”
- Preferably a disjunctive tool (placing files under more than one topic)
- Classification is secondary focus
- Have potential classification (UM’s browse)
- Marrying to current system nigh on impossible
|
5
|
- First tried with small repository of ~5500 records (amnh)
- Took around 25 minutes
- Multiple tries with larger repository of ~270K records (dlps)
- Took around 12 hours
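
As a rough check on the two timings above, throughput can be worked out from the quoted figures. This is only a sketch: the record counts and run times are the approximate values from the slide, and the `runs` and `records_per_minute` names are introduced here for illustration.

```python
# Approximate figures quoted on the slide: ~5500 records in ~25 minutes (amnh),
# ~270K records in ~12 hours (dlps).
runs = {
    "amnh": {"records": 5_500, "minutes": 25},
    "dlps": {"records": 270_000, "minutes": 12 * 60},
}


def records_per_minute(records, minutes):
    """Average throughput for one clustering run."""
    return records / minutes


rates = {
    name: records_per_minute(run["records"], run["minutes"])
    for name, run in runs.items()
}

for name, rate in rates.items():
    print(f"{name}: ~{rate:.0f} records/minute")
```

On these numbers the larger run averaged about 375 records per minute against about 220 for the smaller one, so throughput did not obviously degrade with size. Two approximate data points are not enough to say how the tool scales.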
|
6
|
- Examples of set names from clustering UM’s metadata
- Good: “europe”, “mechanical”, “architecture”
- Not so good: “general”, “michigan”, “build”
- Favorite: “southern literari literature fine messenger”
- Granted…
- Only asked for 20 clusters
- Didn’t cluster hierarchically
|
7
|
- Metadata will always be difficult to cluster
- Using a tool developed as a Web service, with obvious benefits
- Expect necessity of mapping set names to real topical cluster names
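
The mapping step in the last bullet could be as simple as a hand-maintained lookup table. The sketch below is an assumption, not part of the tool: the keys are set names quoted earlier in the talk, while the display names on the right and the `DISPLAY_NAMES` and `display_name` identifiers are hypothetical.

```python
# Hypothetical curated mapping from generated set names to display names.
# Only the keys come from the clustering output shown in the talk; the values
# are illustrative guesses a curator might supply.
DISPLAY_NAMES = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "southern literari literature fine messenger": "Southern Literary Messenger",
}


def display_name(set_name, mapping=DISPLAY_NAMES):
    """Return the curated name, or None if the set still needs manual review."""
    return mapping.get(set_name.strip().lower())


generated = ["europe", "general", "southern literari literature fine messenger"]

# Set names with no curated equivalent are flagged rather than shown as-is.
needs_review = [name for name in generated if display_name(name) is None]
print(needs_review)
```

Returning `None` for unmapped names keeps vague labels such as “general” out of the interface until someone has looked at them.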
|
8
|
- Running the tool locally, with a local WSDL instance, would save lots
(and lots) of time
- Better set names…does this mean a better algorithm?
- Ability to cluster by any criterion, not just topic, i.e., a
post-processing module
- Disjunctive clustering by filename rather than by file, so that placing
a record under more than one topic does not hog storage
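
The filename-based clustering in the last bullet can be pictured as an index from cluster name to filenames, so a record listed under several topics is stored once and only referenced many times. This is a minimal sketch under that assumption; the `build_cluster_index` function and the sample filenames are hypothetical.

```python
from collections import defaultdict


def build_cluster_index(assignments):
    """Group filenames by cluster name.

    `assignments` is an iterable of (filename, cluster_name) pairs. The same
    filename may appear under several clusters; only the name is stored, never
    a copy of the file.
    """
    index = defaultdict(set)
    for filename, cluster in assignments:
        index[cluster].add(filename)
    return {cluster: sorted(names) for cluster, names in index.items()}


# Hypothetical assignments; rec0001.xml belongs to two topics.
assignments = [
    ("rec0001.xml", "europe"),
    ("rec0001.xml", "architecture"),
    ("rec0002.xml", "mechanical"),
]

index = build_cluster_index(assignments)
print(index)
```

The same structure would serve for non-topical criteria such as format or access restriction, since the key can be any label attached to a record.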
|