Topic Modelling Using Non-Negative Matrix Factorization
                 Anjusha C
                 MA18M008
            Department of Mathematics
                   IIT Madras
                 15/05/2020
What is topic modelling?
    • Topic modelling is a method for discovering, from a collection of
      documents, the topics that best represent the information in the
      collection.
    • Topics are 'repeating patterns of co-occurring terms in a corpus'.
How to mathematically quantify them?
    • Consider two documents:
        • Doc A: on 'Biology'
        • Doc B: on 'Linear Algebra'
      Consider the words 'decompose', 'matrix', 'mitochondria' and
      'eigenvector'. The word 'decompose' can occur in both documents,
      but its relative frequency of occurrence in Doc A will be higher
      than in Doc B. Similarly, 'matrix' might occur in both documents,
      but its relative frequency in Doc B will be higher than in Doc A.
    • Consider the set of words occurring in the documents:
      $(w_1, w_2, \ldots, w_n)$. Each document can be represented as a
      vector whose $i$-th entry is the frequency of the word $w_i$:
        • Doc A = $(\nu_{A1}, \nu_{A2}, \ldots, \nu_{An})$
        • Doc B = $(\nu_{B1}, \nu_{B2}, \ldots, \nu_{Bn})$
    • These vectors are the two topics, one corresponding to Biology
      and the other to Linear Algebra.
Multiple topics in the same document
   Each document is a linear combination of the topic vectors (basis
   vectors).
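   For example, a hypothetical document mixing both subjects might decompose
   as (the coefficients here are made up purely for illustration):

      $\mathbf{v}_{\text{doc}} \approx 0.7\,\mathbf{t}_{\text{Bio}} + 0.3\,\mathbf{t}_{\text{LA}}, \qquad \text{coefficients} \ge 0$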
Term-document matrix and term-weighting
               Source: Introduction to Information Retrieval
    • Stop-word removal / tf-idf scoring
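   For reference, the standard tf-idf weight of a term t in a document d
   (following Introduction to Information Retrieval) is

      $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \log \frac{N}{\text{df}_t}$

   where $\text{tf}_{t,d}$ is the frequency of t in d, $\text{df}_t$ is the
   number of documents containing t, and N is the total number of documents.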
Linear Dimensionality Reduction
   AIM: extract these topic vectors from the term-document matrix.
   Source: http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
NMF
  • Previous approach: W and H may contain negative entries.
  • NMF assumption: entries must be non-negative, which makes this a
    constrained optimization problem.
  • AIM: given the factorization rank k (i.e. the number of topics),
    find W and H such that V ≈ WH, with W ≥ 0 and H ≥ 0.
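  Concretely, with m terms, n documents and k topics, the factorization has
  shapes

      $V \in \mathbb{R}_{\ge 0}^{m \times n}, \quad W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times n},$

  so the columns of W are the topic vectors and each column of H holds a
  document's non-negative mixture weights.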
Cost functions
   How do we measure the goodness of the approximation V ≈ WH?
    • Frobenius norm:

      $\|V - WH\|_F^2 = \sum_{i,j} \big(V_{ij} - (WH)_{ij}\big)^2$

    • K-L divergence:

      $D(V \,\|\, WH) = \sum_{i,j} \Big(V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big)$
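   A minimal NumPy sketch of both cost functions (the function names and the
   eps guard against division by zero and log of zero are my own):

      import numpy as np

      def frobenius_cost(V, W, H):
          """Squared Frobenius norm ||V - WH||_F^2."""
          return np.sum((V - W @ H) ** 2)

      def kl_cost(V, W, H, eps=1e-10):
          """Generalised K-L divergence D(V || WH)."""
          WH = W @ H
          return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)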
Method
   • Gradient descent.
   • Naive GD: no guarantee of convergence.
   • The problem is convex in W alone or in H alone, but not in both
     simultaneously.
   • So we optimize W keeping H fixed, then optimize H keeping W fixed,
     and keep alternating until convergence (the alternating least
     squares approach); a sketch follows.
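   A minimal sketch of this alternating scheme, using scipy.optimize.nnls to
   solve each non-negative least-squares subproblem exactly (the function
   name als_nmf and the iteration count are assumptions, not the talk's
   code):

      import numpy as np
      from scipy.optimize import nnls

      def als_nmf(V, k, n_iter=50, seed=0):
          """Alternate between solving for H with W fixed and W with H fixed."""
          rng = np.random.default_rng(seed)
          m, n = V.shape
          W = rng.random((m, k))
          H = rng.random((k, n))
          for _ in range(n_iter):
              # With W fixed, each column of H is an independent NNLS problem
              for j in range(n):
                  H[:, j], _ = nnls(W, V[:, j])
              # With H fixed, each row of W is an independent NNLS problem
              for i in range(m):
                  W[i, :], _ = nnls(H.T, V[i, :])
          return W, H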
Lee-Seung multiplicative updates
    • Put forward by Daniel D. Lee and H. Sebastian Seung
      (link: Lee-Seung paper)
    • These gradient-descent-style updates guarantee that the cost
      function is non-increasing.
    • Frobenius norm update:

      $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$

    • K-L divergence update:

      $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu}/(WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
Implementation
   Dataset: the '20 newsgroups' dataset is used. Newsgroups were
   discussion groups on Usenet, which was popular in the 80s and 90s
   before the web really took off. The dataset contains about 18,000
   newsgroup posts across 20 topics.
   Only 4 topics are chosen: 'alt.atheism', 'talk.religion.misc',
   'comp.graphics' and 'sci.space'. The dataset is vectorised (the
   document-term matrix is constructed), as sketched below.
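   A sketch of the loading and vectorisation step with scikit-learn (the
   exact vectoriser settings, e.g. max_features, are assumptions):

      from sklearn.datasets import fetch_20newsgroups
      from sklearn.feature_extraction.text import TfidfVectorizer

      categories = ['alt.atheism', 'talk.religion.misc',
                    'comp.graphics', 'sci.space']
      posts = fetch_20newsgroups(subset='train', categories=categories,
                                 remove=('headers', 'footers', 'quotes'))

      # Tf-idf weighting with English stop words removed
      vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
      V = vectorizer.fit_transform(posts.data)  # documents x terms (sparse)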
NMF in scikit-learn
   The 'show topics' function picks out the top 8 words of each topic,
   i.e. the words with the highest weights in each topic vector; these
   are the words most specific to that topic. The rows of H are the
   topics. A sketch of this step is given below.
   The time taken is approximately 5.92 s.
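   A sketch of this step (show_topics here is my own reconstruction, not
   necessarily the talk's code; get_feature_names_out requires a recent
   scikit-learn):

      import numpy as np
      from sklearn.decomposition import NMF

      model = NMF(n_components=4, init='nndsvd', random_state=0)
      W = model.fit_transform(V)   # documents x topics
      H = model.components_        # topics x terms

      def show_topics(H, vocab, n_top=8):
          """Print the n_top highest-weighted words of each topic (row of H)."""
          for t, row in enumerate(H):
              top = np.argsort(row)[::-1][:n_top]
              print(f"Topic {t}:", ' '.join(vocab[i] for i in top))

      show_topics(H, vectorizer.get_feature_names_out())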
Lee-Seung update: Frobenius norm
The time taken is 21.27 s.
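A minimal NumPy sketch of the Frobenius-norm multiplicative updates (not the
talk's exact code; eps avoids division by zero, and V is assumed dense, e.g.
V.toarray() for the tf-idf matrix):

   import numpy as np

   def nmf_frobenius(V, k, n_iter=200, eps=1e-10, seed=0):
       """Lee-Seung multiplicative updates for ||V - WH||_F^2."""
       rng = np.random.default_rng(seed)
       m, n = V.shape
       W = rng.random((m, k))
       H = rng.random((k, n))
       for _ in range(n_iter):
           H *= (W.T @ V) / (W.T @ W @ H + eps)
           W *= (V @ H.T) / (W @ H @ H.T + eps)
       return W, H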
Error plot for Frobenius norm
Lee-Seung update: K-L divergence
The time taken is 54.04 s.
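The corresponding sketch for the K-L divergence updates (same caveats as
above):

   import numpy as np

   def nmf_kl(V, k, n_iter=200, eps=1e-10, seed=0):
       """Lee-Seung multiplicative updates for D(V || WH)."""
       rng = np.random.default_rng(seed)
       m, n = V.shape
       W = rng.random((m, k))
       H = rng.random((k, n))
       for _ in range(n_iter):
           H *= (W.T @ (V / (W @ H + eps))) / W.sum(axis=0)[:, None]
           W *= ((V / (W @ H + eps)) @ H.T) / H.sum(axis=1)[None, :]
       return W, H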
Error plot for K-L divergence
References
    • Daniel D. Lee and H. Sebastian Seung, Algorithms for non-negative
      matrix factorization, Advances in Neural Information Processing
      Systems, 2001.
    • Nicolas Gillis, The Why and How of Nonnegative Matrix
      Factorization, arXiv:1401.5226v2, 2014.
    • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
      Introduction to Information Retrieval, Cambridge University
      Press, 2009.
    • NMF tutorial by Rachel Thomas.
    • http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf