"The tuning question is an interesting one, since the whole purpose of topic modeling has usually been to be useful: one can tune to reduce LL/token or tune to try to make the top-ten-lists more intuitive, but those might not always work in the same directions." http://sappingattention.blogspot.co.uk/2013/01/keeping-words-in-topic-models.html
The statistics defined in the file are as follows. Suggestions are welcome!
- Word count (total number of word tokens assigned to the topic).
    - Small topics are often illogical; large topics are often overly general.
- Word length (for each word, count the number of characters; sketch below).
    - Topics with lots of very short words tend to be problematic.
    - This metric normalizes word length against the average word length of top words across all topics, so negative values indicate shorter-than-average words and positive values indicate longer-than-average words.
- Coherence (probability of each top word given the higher-ranked words; sketch below).
    - This metric picks out illogical combinations: top words that rarely co-occur in the same documents produce low coherence.
- Distance from uniformity (sketch below).
    - Higher values indicate more probability concentrated on a few words; lower values indicate more dispersed probability.
- Distance from corpus (sketch below).
    - Higher values indicate more specific topics; topics with lower values look like what you would get by counting all the words in the corpus, regardless of topic.
- Effective number of words (sketch below).
    - The inverse of the sum of squared word probabilities.
    - Higher values indicate less concentration on the top words.
    - This metric captures the same notion as distance from uniformity, but runs in the opposite direction: a near-uniform topic has a large effective vocabulary and a small distance from uniformity.
- Token/document difference (sketch below).
    - Higher values indicate burstiness: one of the top words appears many times in a small number of documents, i.e. in fewer documents than expected given its token count.
- Documents at rank 1 (sketch below).
    - Vacuous or overly general topics often appear in small proportions across many documents.
    - This metric counts, out of the documents that contain a given topic at all, how many have that topic as their single most common topic.
    - Low numbers indicate possibly uninteresting topics.
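A minimal sketch of the word-length statistic, assuming `top_words_per_topic` is a list of ranked top-word lists, one per topic (an illustrative format, not any particular tool's API):

```python
def word_length_scores(top_words_per_topic):
    # Average character length of each topic's top words.
    topic_means = [sum(len(w) for w in words) / len(words)
                   for words in top_words_per_topic]
    # Center on the average over all topics, so negative scores flag
    # topics whose top words are shorter than average.
    overall = sum(topic_means) / len(topic_means)
    return [m - overall for m in topic_means]

# The second topic's very short words pull its score below zero.
print(word_length_scores([["network", "probability", "inference"],
                          ["db", "ir", "nlp"]]))
```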
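A sketch of coherence in the spirit of the description above (probability of each top word given the higher-ranked words), scored from co-document frequencies; the smoothing constant and exact formula are assumptions:

```python
import math

def coherence(ranked_words, documents, smoothing=1.0):
    # ranked_words: a topic's top words, best first.
    # documents: iterable of token collections (assumed input format).
    doc_sets = [set(doc) for doc in documents]
    score = 0.0
    for m in range(1, len(ranked_words)):
        for l in range(m):
            co_docs = sum(1 for d in doc_sets
                          if ranked_words[m] in d and ranked_words[l] in d)
            higher_docs = sum(1 for d in doc_sets if ranked_words[l] in d)
            if higher_docs:  # top words normally occur somewhere in the corpus
                # Smoothed log conditional probability; word pairs that never
                # co-occur drag the score down, flagging illogical topics.
                score += math.log((co_docs + smoothing) / higher_docs)
    return score
```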
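Distance from uniformity, sketched as a KL divergence (one common choice of distance; the actual implementation may use another), where `word_probs` is a topic's word distribution:

```python
import math

def distance_from_uniform(word_probs):
    # KL divergence from the topic's word distribution to the uniform
    # distribution over the same vocabulary: sum of p * log(p * V).
    vocab_size = len(word_probs)
    return sum(p * math.log(p * vocab_size) for p in word_probs if p > 0)

print(distance_from_uniform([0.25, 0.25, 0.25, 0.25]))  # 0.0: exactly uniform
print(distance_from_uniform([0.85, 0.05, 0.05, 0.05]))  # concentrated, > 0
```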
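Distance from the corpus, again sketched as a KL divergence (an assumed choice); `corpus_probs` is the word distribution you get by counting all tokens regardless of topic:

```python
import math

def distance_from_corpus(topic_probs, corpus_probs):
    # KL divergence from the topic's word distribution to the empirical
    # corpus distribution. A topic that just mirrors raw corpus counts
    # scores near zero; a specific topic scores high. corpus_probs is
    # positive for every observed word, so the ratio is well defined.
    return sum(p * math.log(p / q)
               for p, q in zip(topic_probs, corpus_probs) if p > 0)
```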
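The effective number of words follows directly from the definition above, the inverse of the sum of squared probabilities (the inverse Simpson index):

```python
def effective_num_words(word_probs):
    # Inverse of the sum of squared probabilities. A uniform distribution
    # over V words scores exactly V; a single dominant word drives the
    # score toward 1.
    return 1.0 / sum(p * p for p in word_probs)

print(effective_num_words([0.25] * 4))                # 4.0
print(effective_num_words([0.97, 0.01, 0.01, 0.01]))  # about 1.06
```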
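A sketch of the token/document difference; the exact formula is not specified here, so this version simply compares each top word's share of the topic's tokens with its share of document appearances, both normalized over the top words:

```python
def token_doc_difference(token_counts, doc_counts):
    # token_counts[i]: tokens of top word i assigned to the topic;
    # doc_counts[i]: documents in which top word i appears with the topic.
    tok_total = sum(token_counts)
    doc_total = sum(doc_counts)
    # A bursty word packs a large token share into few documents, so its
    # token share exceeds its document share.
    return sum(abs(t / tok_total - d / doc_total)
               for t, d in zip(token_counts, doc_counts))

# The first word contributes most of the tokens but appears in only two docs.
print(token_doc_difference([100, 5, 5], [2, 5, 5]))
```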
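Documents at rank 1, sketched with `doc_topic_counts` as one `{topic: token_count}` dict per document (an assumed representation):

```python
def docs_at_rank1(doc_topic_counts, topic):
    # Among documents containing the topic at all, count those where it
    # is the most common topic (ties count as rank 1 in this sketch).
    rank1 = 0
    for counts in doc_topic_counts:
        if counts.get(topic, 0) > 0 and counts[topic] == max(counts.values()):
            rank1 += 1
    return rank1

docs = [{0: 12, 1: 3}, {0: 1, 1: 9}, {0: 2, 2: 2}]
print(docs_at_rank1(docs, 0))  # 2: topic 0 leads the 1st doc, ties in the 3rd
```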