Metadata Enhancement

III. Lessons Learned

• Clustering algorithms are similar
    • All algorithms start with the same bag-of-words representation
    • Preferable to use an algorithm that can assign multiple topics to a single document
    • Quality of the results is highly dependent on preprocessing
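As a rough illustration of the two points above, the sketch below builds a bag-of-words count vector and then assigns every topic whose weight clears a cutoff, so one document can carry several topics. The function names, the three-term vocabulary, the topic weights, and the 0.25 threshold are all assumptions made for this example, not values from the project.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return a count vector over a fixed vocabulary; word order is discarded."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def assign_topics(topic_weights, threshold=0.25):
    """Return the index of every topic whose weight meets the threshold."""
    return [k for k, weight in enumerate(topic_weights) if weight >= threshold]


# Hypothetical vocabulary and document tokens.
vocabulary = ["cluster", "metadata", "topic"]
vector = bag_of_words(["topic", "cluster", "topic"], vocabulary)

# Hypothetical per-document topic weights, as a soft-clustering method might produce.
labels = assign_topics([0.55, 0.05, 0.40])
```

A hard-clustering method would keep only the single highest-weight topic; thresholding is one simple way to allow more than one.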

• Clustering is limited
    • Short documents (or limited metadata) can be difficult to categorize
    • All methods produce junk topics
    • When to freeze topics?
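One possible way to handle the short-document problem is to set aside records with too few tokens before clustering, for example for manual review. The sketch below assumes documents are already tokenized; the function name `split_by_length`, the record identifiers, and the `min_tokens` cutoff are hypothetical choices for illustration.

```python
def split_by_length(docs, min_tokens=5):
    """Split document ids into those long enough to cluster and those too short.

    docs maps a document id to its list of tokens.
    """
    usable, too_short = [], []
    for doc_id, tokens in docs.items():
        if len(tokens) >= min_tokens:
            usable.append(doc_id)
        else:
            too_short.append(doc_id)
    return usable, too_short


# Hypothetical records: one with a full description, one with a title only.
docs = {
    "rec-1": ["clustering", "metadata", "subject", "indexing", "collections"],
    "rec-2": ["untitled", "report"],
}
usable, too_short = split_by_length(docs)
```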

• Human input is required
    • Clustering algorithm is automated, but everything else isn’t
    • Preprocessing (tokenization and stopword removal) is key
    • Need a human to interpret topics and assign labels
    • Need to choose the number of topics
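The preprocessing step named above can be sketched in a few lines. This is a minimal version under stated assumptions: the tokenizer keeps only runs of ASCII letters, and the stopword list is a tiny illustrative set, whereas a real list would be longer and tuned by hand to the collection.

```python
import re

# Illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("Clustering of the metadata for text collections")
```

Choosing what counts as a token and which words to drop is one of the manual decisions that shapes the resulting topics.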

Clustering algorithms can be useful for enhancing metadata or for subject indexing of text document collections.