III. Lessons Learned
• Clustering algorithms are similar
    • All algorithms start with the same bag-of-words representation
    • Preferable to use an algorithm that can assign multiple topics to a single document
    • Quality of results is highly dependent on preprocessing

• Clustering is limited
    • Short documents (or documents with limited metadata) can be difficult to categorize
    • All methods produce junk topics
    • When to freeze topics?

• Human input is required
    • Clustering algorithm is automated, but everything else isn’t
    • Preprocessing (tokenization and stopword removal) is key
    • Need a human to interpret topics and assign labels
    • Need to choose the number of topics

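As a rough illustration of the preprocessing steps named above (tokenization, stopword removal, and the shared bag-of-words representation), here is a minimal sketch in plain Python. The stopword list, the sample documents, and the helper names `tokenize` and `bag_of_words` are assumptions made up for this example, not part of the original work; a real pipeline would likely use a much longer stopword list and a dedicated library.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the tokens of one document, skipping stopwords."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


# Made-up sample documents.
docs = [
    "Clustering of the text documents in a collection",
    "Subject indexing is metadata for documents",
]

bags = [bag_of_words(d) for d in docs]

# Shared vocabulary across all documents, in a fixed (sorted) order.
vocabulary = sorted(set().union(*bags))

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = [[bag[word] for word in vocabulary] for bag in bags]
```

Changing the stopword list or the tokenization rule changes every row of `matrix`, which is one way to see why the quality of the result depends so heavily on preprocessing, whichever clustering algorithm consumes the matrix.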
Clustering algorithms can be useful for enhancing metadata or for subject indexing of text document collections.