Topic Modelling Using Non-Negative Matrix Factorization
                 Anjusha C
                 MA18M008
            Department of Mathematics
                   IIT Madras
                 15/05/2020
What is topic modelling?
    • Topic modelling is a method for discovering, from a collection of
      documents, the topics that best represent the information in the
      collection.
    • Topics are 'repeating patterns of co-occurring terms in a corpus'.
How to mathematically quantify them?
    • Consider two documents:
        • Doc A: on 'Biology'
        • Doc B: on 'Linear Algebra'
      Consider the words 'decompose', 'matrix', 'mitochondria' and
      'eigenvector'. The word 'decompose' can occur in both documents,
      but its relative frequency of occurrence in Doc A will be higher
      than in Doc B. Similarly, 'matrix' might occur in both documents,
      but its relative frequency in Doc B will be higher than in Doc A.
    • Consider the set of words occurring in the documents:
      $(w_1, w_2, \ldots, w_n)$. Each document can be represented as a
      vector whose $i$-th entry is the frequency of the word $w_i$:
        • Doc A = $(\nu_{A1}, \nu_{A2}, \ldots, \nu_{An})$
        • Doc B = $(\nu_{B1}, \nu_{B2}, \ldots, \nu_{Bn})$
    • These vectors are the two topics, one corresponding to Biology
      and the other to Linear Algebra.
Multiple topics in the same document
   Each document is a linear combination of the topic vectors (basis
   vectors).
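   For example, a hypothetical document mixing both subjects might decompose
   as (the coefficients here are made up purely for illustration):

      $\mathbf{v}_{\text{doc}} \approx 0.7\,\mathbf{t}_{\text{Bio}} + 0.3\,\mathbf{t}_{\text{LA}}, \qquad \text{coefficients} \ge 0$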
Term-document matrix and term-weighting
               Source: Introduction to Information Retrieval
    • Stop-word removal / tf-idf scoring
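   For reference, the standard tf-idf weight of a term t in a document d
   (following Introduction to Information Retrieval) is

      $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \log \frac{N}{\text{df}_t}$

   where $\text{tf}_{t,d}$ is the frequency of t in d, $\text{df}_t$ is the
   number of documents containing t, and N is the total number of documents.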
Linear Dimensionality Reduction
   AIM: extract these topic vectors from the term-document matrix.
   Source: http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
NMF
  • Previous approach: W and H may contain negative entries.
  • NMF assumption: entries must be non-negative, which makes this a
    constrained optimization problem.
  • AIM: given the factorization rank k (i.e. the number of topics),
    find W and H such that V ≈ WH, with W ≥ 0 and H ≥ 0.
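  Concretely, with m terms, n documents and k topics, the factorization has
  shapes

      $V \in \mathbb{R}_{\ge 0}^{m \times n}, \quad W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times n},$

  so the columns of W are the topic vectors and each column of H holds a
  document's non-negative mixture weights.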
Cost functions
   How do we measure the goodness of the approximation V ≈ WH?
    • Frobenius norm:

      $\|V - WH\|_F^2 = \sum_{i,j} \big(V_{ij} - (WH)_{ij}\big)^2$

    • K-L divergence:

      $D(V \,\|\, WH) = \sum_{i,j} \Big(V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij}\Big)$
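   A minimal NumPy sketch of both cost functions (the function names and the
   eps guard against division by zero and log of zero are my own):

      import numpy as np

      def frobenius_cost(V, W, H):
          """Squared Frobenius norm ||V - WH||_F^2."""
          return np.sum((V - W @ H) ** 2)

      def kl_cost(V, W, H, eps=1e-10):
          """Generalised K-L divergence D(V || WH)."""
          WH = W @ H
          return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)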
Method
   • Gradient descent.
   • Naive GD: no guarantee of convergence.
   • The problem is convex in W alone or in H alone, but not in both
     simultaneously.
   • So we optimize W keeping H fixed, then optimize H keeping W fixed,
     and keep alternating until convergence (the alternating least
     squares approach); a sketch follows.
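   A minimal sketch of this alternating scheme, using scipy.optimize.nnls to
   solve each non-negative least-squares subproblem exactly (the function
   name als_nmf and the iteration count are assumptions, not the talk's
   code):

      import numpy as np
      from scipy.optimize import nnls

      def als_nmf(V, k, n_iter=50, seed=0):
          """Alternate between solving for H with W fixed and W with H fixed."""
          rng = np.random.default_rng(seed)
          m, n = V.shape
          W = rng.random((m, k))
          H = rng.random((k, n))
          for _ in range(n_iter):
              # With W fixed, each column of H is an independent NNLS problem
              for j in range(n):
                  H[:, j], _ = nnls(W, V[:, j])
              # With H fixed, each row of W is an independent NNLS problem
              for i in range(m):
                  W[i, :], _ = nnls(H.T, V[i, :])
          return W, H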
Lee-Seung multiplicative updates
    • Put forward by Daniel D. Lee and H. Sebastian Seung
      (link: Lee-Seung paper)
    • These gradient-descent-style updates guarantee that the cost
      function is non-increasing.
    • Frobenius norm update:

      $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$

    • K-L divergence update:

      $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu}/(WH)_{i\mu}}{\sum_\nu H_{a\nu}}$
Implementation
   Dataset: the '20 newsgroups' dataset is used. Newsgroups were
   discussion groups on Usenet, which was popular in the 80s and 90s
   before the web really took off. The dataset contains about 18,000
   newsgroup posts across 20 topics.
   Only 4 topics are chosen: 'alt.atheism', 'talk.religion.misc',
   'comp.graphics' and 'sci.space'. The dataset is vectorised (the
   document-term matrix is constructed), as sketched below.
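   A sketch of the loading and vectorisation step with scikit-learn (the
   exact vectoriser settings, e.g. max_features, are assumptions):

      from sklearn.datasets import fetch_20newsgroups
      from sklearn.feature_extraction.text import TfidfVectorizer

      categories = ['alt.atheism', 'talk.religion.misc',
                    'comp.graphics', 'sci.space']
      posts = fetch_20newsgroups(subset='train', categories=categories,
                                 remove=('headers', 'footers', 'quotes'))

      # Tf-idf weighting with English stop words removed
      vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
      V = vectorizer.fit_transform(posts.data)  # documents x terms (sparse)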
NMF in scikit-learn
   The 'show topics' function picks out the top 8 words of each topic,
   i.e. the words with the highest weights in each topic vector; these
   are the words most specific to that topic. The rows of H are the
   topics. A sketch of this step is given below.
   The time taken is approximately 5.92 s.
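   A sketch of this step (show_topics here is my own reconstruction, not
   necessarily the talk's code; get_feature_names_out requires a recent
   scikit-learn):

      import numpy as np
      from sklearn.decomposition import NMF

      model = NMF(n_components=4, init='nndsvd', random_state=0)
      W = model.fit_transform(V)   # documents x topics
      H = model.components_        # topics x terms

      def show_topics(H, vocab, n_top=8):
          """Print the n_top highest-weighted words of each topic (row of H)."""
          for t, row in enumerate(H):
              top = np.argsort(row)[::-1][:n_top]
              print(f"Topic {t}:", ' '.join(vocab[i] for i in top))

      show_topics(H, vectorizer.get_feature_names_out())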
Lee-Seung update: Frobenius norm
The time taken is 21.27 s.
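A minimal NumPy sketch of the Frobenius-norm multiplicative updates (not the
talk's exact code; eps avoids division by zero, and V is assumed dense, e.g.
V.toarray() for the tf-idf matrix):

   import numpy as np

   def nmf_frobenius(V, k, n_iter=200, eps=1e-10, seed=0):
       """Lee-Seung multiplicative updates for ||V - WH||_F^2."""
       rng = np.random.default_rng(seed)
       m, n = V.shape
       W = rng.random((m, k))
       H = rng.random((k, n))
       for _ in range(n_iter):
           H *= (W.T @ V) / (W.T @ W @ H + eps)
           W *= (V @ H.T) / (W @ H @ H.T + eps)
       return W, H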
Error plot for Frobenius norm
Lee-Seung update: K-L divergence
The time taken is 54.04 s.
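The corresponding sketch for the K-L divergence updates (same caveats as
above):

   import numpy as np

   def nmf_kl(V, k, n_iter=200, eps=1e-10, seed=0):
       """Lee-Seung multiplicative updates for D(V || WH)."""
       rng = np.random.default_rng(seed)
       m, n = V.shape
       W = rng.random((m, k))
       H = rng.random((k, n))
       for _ in range(n_iter):
           H *= (W.T @ (V / (W @ H + eps))) / W.sum(axis=0)[:, None]
           W *= ((V / (W @ H + eps)) @ H.T) / H.sum(axis=1)[None, :]
       return W, H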
Error plot for K-L divergence
References
    • Daniel D. Lee and H. Sebastian Seung, Algorithms for non-negative
      matrix factorization, Advances in Neural Information Processing
      Systems, 2001.
    • Nicolas Gillis, The Why and How of Nonnegative Matrix
      Factorization, arXiv:1401.5226v2, 2014.
    • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
      Introduction to Information Retrieval, Cambridge University
      Press, 2009.
    • NMF tutorial by Rachel Thomas.
    • http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf