Latent Dirichlet allocation

What it is

LDA (see the paper) is the simplest proper topic model. It assumes that topics are generated first, before the documents, and that topics are distributions over a fixed vocabulary. It is a generative statistical model whose central idea is that documents can exhibit multiple topics and that each word's creation can be attributed to one of them.
It is a technique mostly used in NLP, that is, on text, but not exclusively. Apart from the original paper, Wikipedia's page on the topic is quite extensive.

How it works

Given words indexed in a vocabulary $\{1, \ldots, V\}$, we use the following representations (a small code sketch follows the list):
  • represent each word as a $V$-vector with a single component equal to 1, the one at the word's index, that is, such that $w^i = 1$ for word $i$ in the vocabulary and $w^j = 0$ for all $j \neq i$
  • represent a document as a sequence of $N$ words $\bar w = (w_1, \ldots, w_N)$, so that $w_i$ is the $i$-th word in it
  • represent a corpus as a collection of $M$ documents $D = \{ \bar w_1, \ldots, \bar w_M \}$
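As a concrete illustration of these representations, a minimal sketch in NumPy (the toy vocabulary and documents are made up for the example):

```python
import numpy as np

# Toy vocabulary of V = 5 terms (hypothetical example).
vocab = ["cat", "dog", "runs", "sleeps", "fast"]
V = len(vocab)
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the V-vector with a 1 at the word's index and 0 elsewhere."""
    v = np.zeros(V, dtype=int)
    v[word_index[word]] = 1
    return v

# A document is a sequence of N one-hot word vectors (an N x V array),
# and a corpus is a collection of M such documents.
doc = np.array([one_hot(w) for w in ["cat", "runs", "fast"]])
corpus = [doc, np.array([one_hot(w) for w in ["dog", "sleeps"]])]
```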
We look for a probabilistic model that assigns high probability to the members of the corpus and also to other, similar documents; LDA is a generative model of a corpus, based on the idea that documents are random mixtures over latent topics and topics are distributions over words.
For each document $\bar w \in D$, the following generative process is assumed (see the linked page for the distributions mentioned; a code sketch follows the list):
  1. Choose $N$ from a Poisson distribution $P(\xi)$ (other assumptions are allowed)
  2. Choose $\theta$ from a Dirichlet distribution $D(\alpha)$
  3. For each of the $N$ words $w_i$,
    • choose a topic $c_i$ from a multinomial distribution $M(\theta)$
    • choose a word $w_i$ from $P(w_i \mid c_i, \beta)$, a multinomial conditioned on the topic $c_i$
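A minimal sketch of this generative process in NumPy, assuming $k$ topics and made-up corpus-level parameters: $\alpha$ is a $k$-vector and $\beta$ a $k \times V$ matrix whose rows are the topic word distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, k = 5, 2                                # vocabulary size and number of topics (toy values)
xi = 6                                     # Poisson mean for the document length (assumed)
alpha = np.full(k, 0.5)                    # Dirichlet parameter (assumed)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word distributions (assumed)

def generate_document():
    N = max(1, rng.poisson(xi))            # 1. document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. topic mixture theta ~ D(alpha)
    words, topics = [], []
    for _ in range(N):                     # 3. for each of the N words:
        c = rng.choice(k, p=theta)         #    choose a topic c_i ~ M(theta)
        w = rng.choice(V, p=beta[c])       #    choose a word w_i ~ P(w | c_i, beta)
        topics.append(c)
        words.append(w)
    return words, topics, theta

words, topics, theta = generate_document()
```

Calling generate_document() repeatedly produces documents whose words come from different mixtures of the $k$ topics, which is exactly the behaviour the model is meant to capture.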
$\theta$ varies in a ($k-1$)-dimensional simplex and has PDF
$$P(\theta \mid \alpha) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \, \theta_1^{\alpha_1-1} \cdots \theta_k^{\alpha_k-1} \ ,$$
with $\alpha$ being a $k$-vector with components $\alpha_i > 0$.
The Dirichlet distribution is chosen for convenience: it is in the exponential family and is conjugate to the multinomial.
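Concretely, conjugacy means that if $\theta \sim D(\alpha)$ and topic counts $n = (n_1, \ldots, n_k)$ are then observed from $M(\theta)$, the posterior over $\theta$ is again a Dirichlet:
$$P(\theta \mid n, \alpha) \propto \prod_{i=1}^k \theta_i^{n_i} \, \prod_{i=1}^k \theta_i^{\alpha_i - 1} = \prod_{i=1}^k \theta_i^{\alpha_i + n_i - 1} \quad \Longrightarrow \quad \theta \mid n \sim D(\alpha + n) \ .$$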
Given $\alpha$ and $\beta$, we have
$$P(\theta, \bar c, \bar w \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{i=1}^N P(c_i \mid \theta) \, P(w_i \mid c_i, \beta) \ ,$$
the joint distribution of the topic mixture $\theta$, the set of $N$ topics $\bar c$ and the $N$ words $\bar w$; here $P(c_i \mid \theta)$ is equal to $\theta_j$ for the unique $j$ such that $c_i^j = 1$.
Integrating over $\theta$ and summing over the topics, we get the marginal distribution of a document
$$P(\bar w \mid \alpha, \beta) = \int P(\theta \mid \alpha) \prod_{i=1}^N \sum_{c_i} P(c_i \mid \theta) \, P(w_i \mid c_i, \beta) \, d\theta$$
and the probability of a corpus is the product over its documents
$$P(D \mid \alpha, \beta) = \prod_{d=1}^M \int P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} \sum_{c_{d_i}} P(c_{d_i} \mid \theta_d) \, P(w_{d_i} \mid c_{d_i}, \beta) \, d\theta_d$$
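These integrals have no closed form, but for a toy model the document marginal can be approximated by Monte Carlo: sample $\theta$ from $P(\theta \mid \alpha)$ many times and average the product over words of $\sum_c \theta_c \, \beta_{c, w}$. A minimal sketch with made-up parameters, in the same conventions as the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)

V, k = 5, 2
alpha = np.full(k, 0.5)                    # Dirichlet parameter (assumed)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word distributions (assumed)
doc = [0, 2, 2, 4]                         # a document as a list of word indices (made up)

def doc_marginal(doc, alpha, beta, n_samples=100_000):
    """Monte Carlo estimate of P(w_bar | alpha, beta):
    average over theta ~ D(alpha) of prod_i sum_c theta_c * beta[c, w_i]."""
    thetas = rng.dirichlet(alpha, size=n_samples)   # samples from P(theta | alpha)
    per_word = thetas @ beta[:, doc]                # (n_samples, N): sum_c theta_c * beta_{c, w_i}
    return per_word.prod(axis=1).mean()

print(doc_marginal(doc, alpha, beta))
```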
In the figure, plate representation of LDA:
  • $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once when generating a corpus
  • $\theta_d$ are the document-level variables, assumed to be sampled once per document
  • $w_{d_i}$ are the word-level variables, assumed to be sampled once per word in each document

Comparison to other latent variable methods

Unigram model

In the figure, plate representation of a unigram model.
It's a very simplistic model: the words of every document are drawn independently from a single multinomial distribution, so that the probability of the document is
$$P(\bar w) = \prod_{i=1}^N P(w_i)$$

Mixture of unigram models

In the figure, plate representation of a mixture of unigram models.
In this one, the unigram model gets augmented with a discrete topic variable $c$: each document is generated by first choosing a topic $c$ and then generating its $N$ words independently from the conditional multinomial $P(w \mid c)$, so that the probability of a document is
$$P(\bar w) = \sum_c P(c) \prod_{i=1}^N P(w_i \mid c)$$
In contrast, LDA allows documents to exhibit multiple topics to different degrees.
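For comparison, a minimal sketch of this mixture-of-unigrams document probability, with made-up parameters; note that a single topic $c$ is summed over for the whole document, rather than one topic per word as in LDA:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 2, 5
p_c = np.array([0.6, 0.4])                        # topic prior P(c) (assumed)
p_w_given_c = rng.dirichlet(np.ones(V), size=k)   # k x V conditional multinomials P(w | c)
doc = [0, 2, 2, 4]                                # word indices of a toy document (made up)

# P(w_bar) = sum_c P(c) * prod_i P(w_i | c): one topic for the whole document.
p_doc = (p_c * p_w_given_c[:, doc].prod(axis=1)).sum()
print(p_doc)
```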

PLSI

PLSI (see the page on PLSA) posits that a document label $d$ and a word $w_i$ are conditionally independent given an unobserved topic $c$:
$$P(d, w_i) = P(d) \sum_c P(w_i \mid c) \, P(c \mid d)$$
It relaxes the mixture-of-unigrams assumption that each document is generated from only one topic: the last factor, $P(c \mid d)$, serves as the mixture weights of the topics for document $d$. But, given that $d$ is a label in the training set, the model learns the topic mixtures $P(c \mid d)$ only for the documents on which it is trained, and this is why PLSA is not a well-defined generative model: there is no natural way to assign probabilities to previously unseen documents. Furthermore, the number of parameters grows linearly with the number of training documents, so the model is quite prone to overfitting.
LDA overcomes this problem by treating the topic mixture weights as a $k$-dimensional hidden random variable rather than as a large set of parameters tied to the individual training documents: it generalises easily to new documents, and the number of parameters does not grow with the size of the training corpus.

Inference

Learning the distributions (the set of topics, their word probabilities, the topic of each word, the topic mixture of each document) is a problem of Bayesian inference. Exact inference is intractable, but approximate inference can be obtained in several ways, for instance with variational methods or Gibbs sampling.
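As a practical example of approximate inference, here is a minimal sketch using scikit-learn's LatentDirichletAllocation, which implements a variational Bayes approximation; the toy corpus and number of topics are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sleeps on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)   # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures (estimates of theta)
topic_words = lda.components_       # unnormalised topic-word weights (estimates related to beta)
```

Each row of doc_topics is the inferred topic mixture of a document, and each row of lda.components_ can be normalised to give that topic's distribution over the vocabulary.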

References

  1. D. M. Blei, A. Y. Ng, M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 2003