1
Automated Subject Indexing of Document Collections
  • DLF Spring Forum 2006

  • David Newman
  • University of California, Irvine
2
I. Subject Indexing
    • Keyword inconsistency


    • Suboptimal subject categories


    • Artificial domain or institutional boundaries
      • Different subject headings for different collections (LCSH, MeSH)


    • Controlling subject heading vocabulary
      • Managing a dynamic collection


    • Designing a good user interface
      • Integrating automatically generated topics into existing systems

3
Keyword Inconsistency

  • Example from the Pennsylvania Gazette:


  • The three most frequent keywords are:
    • advertisement (16,000 articles)
    • real-estate (7,000 articles)
    • runaways (5,000 articles)
4
Suboptimal Subject Categories
5
II. Classification and Clustering


    • Classifiers





    • Clusterers
6
What are they?
7
How do they work?
8
How do they work?
9
How do they work?
10
How do they work?
11
Clustering: The “Bag-of-Words”
12
Preprocessing
13
Preprocessing
14
Preprocessing
15
Preprocessing
16
Preprocessing
17
Clustering Algorithms
18
The Pennsylvania Gazette
19
The Pennsylvania Gazette
20
Metadata Enhancement
21
Topic Trends
22
III. Lessons Learned
    • Clustering algorithms are similar
      • All algorithms start with the same bag-of-words representation
      • Preferable to use an algorithm that can assign multiple topics to a single document
      • Quality of result is highly dependent on preprocessing


    • Clustering is limited
      • Short documents (or limited metadata) can be difficult to categorize
      • All methods produce junk topics
      • When to freeze topics?


    • Human input is required
      • Clustering algorithm is automated, but everything else isn’t
      • Preprocessing (tokenization and stopword removal) is key
      • Need a human to interpret topics and assign labels
      • Need to choose number of topics
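
The preprocessing steps named above (tokenization and stopword removal) and the bag-of-words representation that every clustering algorithm starts from can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the talk: the stopword list, the function names, and the two sample sentences are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; a real list would be
# much longer and tuned by a human to the collection.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the words in one document, ignoring word order and stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


def document_term_matrix(docs):
    """Return a sorted vocabulary and one row of word counts per document."""
    bags = [bag_of_words(doc) for doc in docs]
    vocab = sorted(set().union(*bags))
    rows = [[bag[word] for word in vocab] for bag in bags]
    return vocab, rows


# Made-up sample documents (not taken from any real collection).
docs = ["The ship sailed to the port", "A ship in the port"]
vocab, rows = document_term_matrix(docs)
print(vocab)  # ['port', 'sailed', 'ship']
print(rows)   # [[1, 1, 1], [1, 0, 1]]
```

The resulting count matrix is the input a clustering algorithm would consume. Changing the stopword list or the tokenization rule changes that matrix, which is one way to see why the quality of the result depends so heavily on preprocessing.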



23
How Automated is Automated?