Diagnostics

From IMC wiki

From: http://article.gmane.org/gmane.comp.ai.mallet.devel/1483/

See also: http://people.cs.umass.edu/~wallach/publications/wallach09rethinking.pdf

"The tuning question is an interesting one, since the whole purpose of topic modeling has usually been to be useful: one can tune to reduce LL/token or tune to try to make the top-ten-lists more intuitive, but those might not always work in the same directions." http://sappingattention.blogspot.co.uk/2013/01/keeping-words-in-topic-models.html

The statistics defined in the file are as follows. Suggestions are welcome!

  • Word count (total number of word tokens assigned to the topic).
    • Small topics are often illogical; large topics are often overly general.
  • Word length (for each word, count the number of characters).
    • Topics with lots of very short words tend to be problematic.
    • This metric normalizes word length against the average length of the top words over all topics, so negative values indicate shorter-than-average words and positive values indicate longer-than-average words.
  • Coherence (probability of words given higher-ranked words).
    • This metric picks out illogical combinations.
  • Distance from uniformity.
    • Higher values indicate more probability concentrated on a few words, lower values indicate more dispersed probability.
  • Distance from corpus.
    • Higher values indicate more specific topics. Topics with lower values look like what you would get by counting all the words in the corpus, regardless of topic.
  • Effective number of words.
    • The inverse of the sum of squared probabilities.
    • Higher values indicate less concentration on top words.
    • This metric is similar to distance from uniformity.
  • Token/document difference.
    • Higher values indicate burstiness: one of the top words appears many times in a small number of documents (i.e., in fewer documents than expected given its token count).
  • Documents at rank 1.
    • Vacuous or overly general topics often make up a small share of many documents.
    • This metric counts, out of the documents that contain a given topic, how many times that topic is the single most common topic in a document.
    • Low numbers indicate possibly uninteresting topics.
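
Several of the statistics above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the code that produces the diagnostics file. The function names and input shapes are hypothetical. KL divergence is assumed as the "distance" for the uniformity and corpus metrics, since the list does not say which measure is used. The coherence function follows the "probability of words given higher-ranked words" description using document co-occurrence counts, with a smoothing constant of 1.0 chosen arbitrarily. Documents at rank 1 is computed as a fraction rather than a raw count.

```python
import math


def effective_number_of_words(probs):
    """Inverse of the sum of squared word probabilities for one topic."""
    return 1.0 / sum(p * p for p in probs)


def kl_divergence(p, q):
    """KL(p || q) in nats; entries with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distance_from_uniformity(probs):
    """Assumed here to be KL divergence from the uniform distribution."""
    uniform = [1.0 / len(probs)] * len(probs)
    return kl_divergence(probs, uniform)


def distance_from_corpus(topic_probs, corpus_probs):
    """Assumed here to be KL divergence from the corpus-wide distribution."""
    return kl_divergence(topic_probs, corpus_probs)


def coherence(top_words, documents, smoothing=1.0):
    """Sum of log P(lower-ranked word | higher-ranked word), estimated from
    document co-occurrence. `top_words` is ordered from highest rank down;
    `documents` is a list of token lists. The smoothing value is an assumption.
    """
    doc_sets = [set(doc) for doc in documents]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            higher, lower = top_words[l], top_words[m]
            docs_with_higher = sum(1 for d in doc_sets if higher in d)
            if docs_with_higher == 0:
                continue  # word never occurs; skip rather than divide by zero
            docs_with_both = sum(
                1 for d in doc_sets if higher in d and lower in d
            )
            score += math.log((docs_with_both + smoothing) / docs_with_higher)
    return score


def rank_1_fraction(doc_topic_counts, topic):
    """Of the documents that contain `topic`, the fraction in which it is the
    single most common topic (ties do not count). `doc_topic_counts` is a list
    of {topic_id: token_count} dicts, one per document.
    """
    containing = [d for d in doc_topic_counts if d.get(topic, 0) > 0]
    if not containing:
        return 0.0
    at_rank_1 = sum(
        1
        for d in containing
        if all(d[topic] > count for t, count in d.items() if t != topic)
    )
    return at_rank_1 / len(containing)
```

With a four-word vocabulary, a perfectly flat topic `[0.25, 0.25, 0.25, 0.25]` has an effective number of words of 4 and a distance from uniformity of 0, while a concentrated topic such as `[0.7, 0.1, 0.1, 0.1]` scores lower on the first and higher on the second. That is the inverse relationship between those two metrics noted in the list.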