Outline

1	Automated Subject Indexing of Document Collections. DLF Spring Forum 2006. David Newman, University of California, Irvine
2	I. Subject Indexing: keyword inconsistency; suboptimal subject categories; artificial domain or institutional boundaries; different subject headings for different collections (LCSH, MeSH); controlling subject heading vocabulary; managing a dynamic collection; designing a good user interface; integrating automatically generated topics into existing systems
3	Keyword Inconsistency. Example from the Pennsylvania Gazette: the top 3 keywords are advertisement (16,000 articles), real-estate (7,000 articles), and runaways (5,000 articles)
4	Suboptimal Subject Categories
5	II. Classification and Clustering: classifiers; clusterers
6	What are they?
7	How do they work?
8	How do they work?
9	How do they work?
10	How do they work?
11	Clustering: The “Bag-of-Words”
12	Preprocessing
13	Preprocessing
14	Preprocessing
15	Preprocessing
16	Preprocessing
17	Clustering Algorithms
18	The Pennsylvania Gazette
19	The Pennsylvania Gazette
20	Metadata Enhancement
21	Topic Trends
22	III. Lessons Learned
	Clustering algorithms are similar: all algorithms start with the same bag-of-words representation; it is preferable to use an algorithm that can assign multiple topics to a single document; quality of the result is highly dependent on preprocessing
	Clustering is limited: short documents (or limited metadata) can be difficult to categorize; all methods produce junk topics; when to freeze topics?
	Human input is required: the clustering algorithm is automated, but everything else isn’t; preprocessing (tokenization and stopword removal) is key; a human is needed to interpret topics and assign labels; the number of topics must be chosen
23	How Automated is Automated?
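
Slide 22 names tokenization and stopword removal as the key preprocessing steps, and the bag-of-words representation as the common starting point for all the clustering algorithms. The sketch below shows one minimal way those steps could look. It is not taken from the talk: the stopword list, the function names, and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch;
# a real collection would need a much longer, hand-tuned list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "from", "be"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the words in one document, ignoring word order and stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Made-up sample documents, not actual Pennsylvania Gazette text.
docs = [
    "Ran away from the subscriber, a servant man.",
    "To be sold, a tract of land in the county of Chester.",
]

# One bag (word -> count) per document; a clustering algorithm would take
# these counts as its input.
bags = [bag_of_words(d) for d in docs]
```

Changing the stopword list or the tokenizer changes every downstream count, which is one way to see why the quality of the result depends so heavily on preprocessing.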