Latent Dirichlet allocation
LDA (see the paper) is the simplest proper topic model. It is a generative statistical model which assumes that topics, distributions over a fixed vocabulary, are generated first, before the documents; the idea is that documents can exhibit multiple topics and that each word's creation can be attributed to one of them.
Given words indexed in a vocabulary $\{1, \dots, V\}$:
- represent words as unit-basis vectors: the $i$-th word in the vocabulary is the vector $w$ such that $w^i = 1$ and $w^j = 0$ for all $j \neq i$
- represent documents as sequences of $N$ words $\mathbf{w} = (w_1, w_2, \dots, w_N)$, so that $w_i$ is the $i$-th word in the document
- represent a corpus as a collection of $M$ documents $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$
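The representations above can be sketched in a few lines of Python (the vocabulary and words here are made up for illustration):

```python
# Vocabulary of V = 3 words; indices are the positions in this list.
vocab = ["cat", "dog", "runs"]

def one_hot(word):
    """Unit-basis vector: 1 at the word's vocabulary index, 0 elsewhere."""
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

document = [one_hot(w) for w in ["cat", "runs"]]  # a sequence of N = 2 words
corpus = [document]                               # a collection of M = 1 documents
```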
We look for a probabilistic model that assigns high probability to members of the corpus and also to other similar documents; LDA is a generative model of a corpus, based on the idea that documents are random mixtures over latent topics and topics are distributions over words.
For each document $\mathbf{w}$ in a corpus $D$, the following generative process is assumed (see the linked page for the distributions mentioned):
- 1. Choose $N \sim \text{Poisson}(\xi)$ (other assumptions on the document length are allowed)
- 2. Choose $\theta \sim \text{Dir}(\alpha)$
- 3. For each of the $N$ words $w_n$,
  - Choose a topic $z_n \sim \text{Multinomial}(\theta)$
  - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial conditioned on the topic $z_n$

Here $\theta$ is a $k$-vector with components $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$, i.e. a point in the $(k-1)$-simplex, and $\beta$ is a $k \times V$ matrix of word probabilities with $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$.
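The generative process above can be sketched directly with the standard library (a minimal sketch; the Dirichlet is sampled via normalised Gamma draws and the Poisson via Knuth's method, since `random` has neither built in):

```python
import math
import random

def sample_poisson(lam):
    """Knuth's method, adequate for small lambda."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def sample_dirichlet(alpha):
    """Normalised independent Gamma(alpha_i, 1) draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs):
    """Inverse-CDF draw from a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(alpha, beta, xi):
    """One draw from the LDA generative process (steps 1-3 above).
    beta[j] is the word distribution of topic j; returns word ids."""
    N = sample_poisson(xi)           # 1. document length
    theta = sample_dirichlet(alpha)  # 2. topic mixture of this document
    words = []
    for _ in range(N):               # 3. for each word:
        z = sample_categorical(theta)              # choose a topic
        words.append(sample_categorical(beta[z]))  # choose a word given the topic
    return theta, words
```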
The Dirichlet distribution is chosen because of convenience: it is in the exponential family, and is conjugate to the multinomial.
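The conjugacy can be checked directly: multiplying the Dirichlet prior by the multinomial likelihood of the topic assignments gives another Dirichlet, with the topic counts added to the prior parameters:

```latex
p(\theta \mid \mathbf{z}, \alpha)
  \propto p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)
  \propto \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \cdot \prod_{i=1}^{k} \theta_i^{n_i}
  = \prod_{i=1}^{k} \theta_i^{\alpha_i + n_i - 1}
```

where $n_i$ is the number of words assigned to topic $i$, so that $\theta \mid \mathbf{z} \sim \text{Dir}(\alpha_1 + n_1, \dots, \alpha_k + n_k)$.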
Given the parameters $\alpha$ and $\beta$, we have

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

where the left-hand side is the joint distribution of topic mixture $\theta$, set of $N$ topics $\mathbf{z}$ and $N$ words $\mathbf{w}$, and $p(z_n \mid \theta)$ is equal to $\theta_j$ for the unique $j$ such that $z_n^j = 1$. Integrating over $\theta$ and summing over the topics, we get the marginal distribution of a document

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

And the probability of a corpus is the product over the documents

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
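The marginal document probability has no closed form, but it can be estimated by Monte Carlo: draw $\theta \sim \text{Dir}(\alpha)$ and average the inner product-of-sums over the draws (a sketch; function names are illustrative):

```python
import random

def sample_dirichlet(alpha, rng):
    """Normalised independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def doc_marginal_mc(doc, alpha, beta, samples=5000, seed=0):
    """Monte Carlo estimate of p(w | alpha, beta): average over theta of
    prod_n sum_z p(z | theta) p(w_n | z, beta). doc is a list of word ids,
    beta[k] is the word distribution of topic k."""
    rng = random.Random(seed)
    K = len(alpha)
    total = 0.0
    for _ in range(samples):
        theta = sample_dirichlet(alpha, rng)
        prod = 1.0
        for w in doc:
            prod *= sum(theta[k] * beta[k][w] for k in range(K))
        total += prod
    return total / samples
```

As a sanity check, when all topics share the same word distribution the topic mixture is irrelevant and the estimate reduces exactly to the unigram probability of the document.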
Plate representation of LDA
- $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once when generating a corpus
- $\theta_d$ are the document-level variables, assumed to be sampled once per document
- $z_{dn}$ and $w_{dn}$ are the word-level variables, assumed to be sampled once per word in each document
In the figure, plate representation of a unigram model.
It's a very simplistic model: the words of every document are drawn independently from a single multinomial distribution, so that the probability of a document is

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$
In the figure, plate representation of a mixture of unigram models.
In this one, the unigram model gets augmented with a discrete topic variable $c$: each document is generated by first choosing a topic $c$ and then generating its $N$ words independently from the conditional multinomial $p(w \mid c)$, so that the probability of a document is

$$p(\mathbf{w}) = \sum_{c} p(c) \prod_{n=1}^{N} p(w_n \mid c)$$
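The two formulas above are straightforward to compute exactly (a sketch with illustrative names; word ids index into the probability tables):

```python
def unigram_prob(doc, p_w):
    """p(w) = prod_n p(w_n): a single multinomial over the vocabulary."""
    prob = 1.0
    for w in doc:
        prob *= p_w[w]
    return prob

def mixture_of_unigrams_prob(doc, p_c, p_w_given_c):
    """p(w) = sum_c p(c) prod_n p(w_n | c): one topic per document."""
    total = 0.0
    for c, pc in enumerate(p_c):
        prod = 1.0
        for w in doc:
            prod *= p_w_given_c[c][w]
        total += pc * prod
    return total
```

Note how the mixture sums over topics once for the whole document, whereas LDA's marginal sums over a topic per word, inside the product: that is exactly what lets a single LDA document exhibit several topics.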
In contrast, LDA allows documents to exhibit multiple topics to different degrees.
PLSI (see the page on PLSA) posits that a document label $d$ and a word $w_n$ are conditionally independent given an unobserved topic $c$:

$$p(d, w_n) = p(d) \sum_{c} p(w_n \mid c)\, p(c \mid d)$$

It relaxes the mixture-of-unigrams assumption that each document is generated from only one topic: the last factor $p(c \mid d)$ serves as the mixture weights of the topics for document $d$. But, given that $d$ is a label in the training set, the model learns the topic mixtures $p(c \mid d)$ only for the documents on which it is trained, and this is why PLSA is not a generative model: there is no natural way to assign probabilities to unknown documents. Furthermore, the number of parameters grows linearly with the number of training documents, so it is quite prone to overfitting.
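The difference in parameter growth is easy to make concrete (a back-of-the-envelope count, assuming $k$ topics, a vocabulary of $V$ words and $M$ training documents):

```python
def plsi_param_count(k, V, M):
    # p(c|d): k-1 free parameters per training document,
    # p(w|c): V-1 free parameters per topic.
    return M * (k - 1) + k * (V - 1)

def lda_param_count(k, V):
    # alpha: k Dirichlet parameters, beta: V-1 free parameters per topic.
    # Note there is no dependence on the corpus size M.
    return k + k * (V - 1)
```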
LDA overcomes this problem by treating the topic mixture weights as a $k$-parameter hidden random variable $\theta$ rather than a large set of individual parameters explicitly tied to the training set: this easily generalises to new documents, and the number of parameters does not grow with the size of the training corpus.
Learning the distributions (the set of topics, their word probabilities, the topic of each word, and the topic mixture of each document) is a problem of Bayesian inference. Exact inference is intractable, but approximate inference can be obtained in several ways.
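One such way is collapsed Gibbs sampling, which integrates out $\theta$ and $\beta$ and resamples each word's topic from its full conditional given all the other assignments (a minimal sketch in pure Python; `alpha` and `eta` are symmetric Dirichlet hyperparameters, and all names are illustrative):

```python
import random

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (variational Bayes is another
    common choice). docs: list of documents, each a list of word ids
    in range(V). Returns the final count matrices."""
    rng = random.Random(seed)
    # Random initial topic assignment for every word.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]      # words of document d assigned to topic k
    nkw = [[0] * V for _ in range(K)]  # occurrences of word w under topic k
    nk = [0] * K                       # total words assigned to topic k
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove this word's current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z_n = j | rest), up to normalisation.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + eta)
                           / (nk[j] + V * eta) for j in range(K)]
                r = rng.random() * sum(weights)
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw
```

After enough iterations, `ndk` and `nkw` (smoothed by the hyperparameters) give point estimates of the per-document topic mixtures and the per-topic word distributions.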