Clustering: The “Bag-of-Words”
Clustering of text documents starts with the bag-of-words representation of the collection: each document is reduced to counts of the terms it contains, and word order is ignored. Producing this representation is called preprocessing.
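As a rough illustration of that preprocessing step, the sketch below turns raw text into term counts and then into rows of a document-term matrix. The two documents and the names `bag_of_words`, `docs`, `bags`, `vocabulary` and `matrix` are hypothetical and are not taken from the slide's collection. Real preprocessing usually also removes stop words and stems terms, which is left out here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical toy documents (assumed for illustration only).
docs = {
    "docA": "Water rights and cattle: water for the land.",
    "docB": "Navajo reservation land use.",
}

# One bag (term -> count) per document.
bags = {name: bag_of_words(text) for name, text in docs.items()}

# The vocabulary is every term seen anywhere in the collection.
vocabulary = sorted(set().union(*bags.values()))

# Each document becomes one row of counts, one column per vocabulary term.
# A Counter returns 0 for terms the document does not contain.
matrix = {name: [bag[term] for term in vocabulary] for name, bag in bags.items()}

for name, row in matrix.items():
    print(name, row)
```

Each row plays the role of one line in a term-count table: the column order is fixed by `vocabulary`, and a zero marks a term that does not occur in the document.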
[Table: term counts for twelve documents (doc1 through doc9, d10 through d12). One row per document, one column per term; each filled cell holds a count of 1 or 2. Terms: disease, war, dispossession, reservation, navajo, chumash, maize, dust, use, land, hunting, cattle, rights, water.]
• Which pairs of documents are similar?
• Do the documents cluster into subject areas?
• Could I categorize a new document?
Collection of American West documents
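The question of which pairs of documents are similar can be made concrete once each document is a vector of term counts. The sketch below uses cosine similarity, which is one common choice and an assumption here, since the slide does not name a measure. The three count vectors `x`, `y` and `z` are hypothetical and do not come from the table.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term counts (assumed for illustration only).
x = Counter({"water": 2, "rights": 1, "cattle": 1})
y = Counter({"water": 1, "rights": 1, "land": 1})
z = Counter({"navajo": 1, "reservation": 1, "land": 1})

print(round(cosine_similarity(x, y), 3))  # x and y share "water" and "rights"
print(round(cosine_similarity(x, z), 3))  # x and z share no terms
print(round(cosine_similarity(y, z), 3))  # y and z share only "land"
```

Documents that share many terms score close to 1 and documents with no terms in common score 0. Grouping documents whose pairwise scores are high is one way to look for subject areas, and a new document could be compared against each group in the same way.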