Clustering: The “Bag-of-Words”
Clustering of text documents starts with a bag-of-words representation of the
collection: each document is reduced to the counts of the terms it contains,
ignoring word order.
Producing this representation is called preprocessing.
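To make the preprocessing step concrete, here is a minimal sketch in Python. The two documents, the names `bag_of_words`, `docs`, `vocab`, and `matrix`, and the tokenization rule (lowercase, alphabetic tokens only) are all assumptions made for illustration; they are not taken from the collection on the slide.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical toy documents (not the collection shown on the slide).
docs = {
    "docA": "Water rights and cattle: water for the herd.",
    "docB": "Navajo reservation land and water.",
}

# One bag (term -> count) per document; word order is discarded.
bags = {name: bag_of_words(text) for name, text in docs.items()}

# A fixed, sorted vocabulary so every document maps onto the same columns.
vocab = sorted(set().union(*bags.values()))

# Term-document count matrix: one row per document, one column per term.
# A Counter returns 0 for terms the document does not contain.
matrix = {name: [bag[term] for term in vocab] for name, bag in bags.items()}

for name, row in matrix.items():
    print(name, dict(zip(vocab, row)))
```

A real pipeline would usually add further steps such as stop-word removal or stemming; this sketch leaves them out to keep the counting idea visible.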
[Figure: term-document count matrix. Rows: doc1 to doc12. Columns (terms): water, rights, cattle, hunting, land, use, dust, maize, chumash, navajo, reservation, dispossession, war, disease. Each filled cell holds the number of times the term occurs in the document (1 or 2).]
• Which pairs of documents are similar?
• Do the documents cluster into subject areas?
• Could I categorize a new document?
Collection of Am West docs
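The first question above, which pairs of documents are similar, can be explored directly on count vectors. The slide does not name a similarity measure; cosine similarity is used below as one common choice, and that choice is an assumption. The vectors and the names `cosine_similarity`, `vectors`, and `ranked` are hypothetical and do not reproduce the matrix from the slide.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical count vectors over the terms (water, rights, cattle, navajo).
vectors = {
    "docA": [2, 1, 1, 0],
    "docB": [1, 1, 0, 0],
    "docC": [0, 0, 0, 2],
}

# Score every pair of documents and list the most similar pairs first.
ranked = sorted(
    (
        ((a, b), cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ),
    key=lambda item: item[1],
    reverse=True,
)

for pair, score in ranked:
    print(pair, round(score, 3))
```

The same scores could serve the other two questions in a simple way: grouping documents whose pairwise similarity is high gives a rough clustering, and a new document could be assigned to the group of its most similar existing document.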