
CONFIDENTIAL & RESTRICTED

Document Classification Using Machine Learning

What is a Document Classifier?


Document classification is an application of Machine Learning (ML),
specifically of Natural Language Processing (NLP). By classifying
text, we aim to assign one or more classes or categories to a
document, making it easier to manage and sort. This is especially
useful for publishers, news sites, blogs, or anyone who deals with
a lot of content.

Document Classification vs. Text Classification
Text classification can be used for a wide variety of tasks such as
sentiment analysis, topic detection, intent identification, and
much more. But when it comes to classification, a common question
is whether it is better to analyze documents as a whole, or whether
it is more convenient to preprocess the documents and divide them
into smaller units before doing the analysis. Unfortunately, there
is no one-size-fits-all answer: which approach is more appropriate
depends on your data and on your goals for the analysis.

Four different levels of scope can be applied in text
classification:

 Document level obtains the relevant categories of a full
document.
 Paragraph level obtains the relevant categories of a single
paragraph.
 Sentence level obtains the relevant categories of a single
sentence.
 Sub-sentence level obtains the relevant categories of
sub-expressions within a sentence (also known as opinion units).

Steps for Document Classification


1. The dataset
The quality of the tagged dataset is by far the most important
component of a statistical NLP classifier. The dataset needs to be
large enough to have an adequate number of documents in each
class. For 500 possible document categories, you may require 100
documents per category, so a total of 50,000 documents may be
required.

The dataset also needs to be of high enough quality in terms of
how distinct the documents in the different categories are from
each other, to allow clear delineation between the categories.

2. Preprocessing
A naive approach would give equal importance to each and every
word when creating document vectors. Instead, we can do some
preprocessing and give different weightings to words based on
their importance to the document in question. A common
methodology for this is TF-IDF (term frequency - inverse document
frequency). The TF-IDF weighting for a word increases with the
number of times the word appears in the document, but decreases
based on how frequently the word appears in the entire document
set.
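The weighting just described can be sketched in a few lines. This is one common textbook formulation (libraries such as scikit-learn use smoothed variants), and the toy corpus is purely illustrative:

```python
import math

# Toy corpus: each "document" is a list of tokens (hypothetical data).
docs = [
    ["football", "match", "goal", "football"],
    ["election", "vote", "match"],
    ["football", "election"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document taken up by this term.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms found in fewer documents score higher.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "football" is frequent in doc 0, but appearing in 2 of 3 documents
# dampens its weight.
score = tf_idf("football", docs[0], docs)
```

Note how the two factors pull in opposite directions: a term repeated within one document gains weight, while a term spread across the whole collection loses it.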

3. Classification Algorithm and Strategy

The simplest strategy classifies a document by comparing the
number of matching terms in the document vectors. In the real
world, more sophisticated algorithms are commonly used, such as
Support Vector Machines (SVMs), Naive Bayes and Decision Trees.
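The matching-terms idea can be sketched as a toy classifier; the category term profiles below are hypothetical, not taken from any real system:

```python
# Hypothetical term profiles for two categories.
category_terms = {
    "sport": {"football", "goal", "match", "striker"},
    "politics": {"election", "vote", "parliament"},
}

def classify(tokens):
    # Score each category by how many of its terms appear in the
    # document, then pick the category with the largest overlap.
    overlap = {cat: len(terms & set(tokens))
               for cat, terms in category_terms.items()}
    return max(overlap, key=overlap.get)

label = classify(["the", "football", "match", "today"])
```

The more sophisticated algorithms listed above replace this raw overlap count with a learned decision function, but the input, a term-based document vector, is the same.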

Real World Example of a Document Classifier

We will build a document classifier for the BBC datasets (two news
article datasets).

1. Data input and pre-processing

We load this dataset locally, as a CSV file, and add a column
encoding the category as an integer (categorical variables are
often better represented by integers than strings).
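A minimal sketch of this encoding step, assuming the CSV has 'category' and 'text' columns (the column names and the inline rows are assumptions, not taken from the source):

```python
import pandas as pd

# Stand-in for pd.read_csv("bbc.csv"): a tiny frame built inline.
df = pd.DataFrame({
    "category": ["sport", "politics", "sport", "tech"],
    "text": ["goal scored", "vote counted", "match won", "chip released"],
})

# Encode the string category as an integer id; factorize also returns
# the mapping from integer code back to the original label.
df["category_id"], categories = pd.factorize(df["category"])
```

Keeping the `categories` mapping around lets us translate predicted integer ids back into readable labels later.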

2. Data exploration
Before diving head-first into training machine learning models, we
should become familiar with the structure and characteristics of
our dataset: these properties might inform our problem-solving
approach.

A first step would be to look at some random examples, and the
number of examples in each class:
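Both inspections are one-liners in pandas; the miniature dataframe below is a hypothetical stand-in for the BBC data:

```python
import pandas as pd

# Hypothetical miniature of the dataframe loaded in step 1.
df = pd.DataFrame({
    "category": ["sport", "politics", "sport", "tech", "sport"],
    "text": ["goal late", "vote held", "match won", "chip made", "cup final"],
})

# A few random rows, then the number of articles per class.
print(df.sample(2, random_state=0))
counts = df["category"].value_counts()
print(counts)
```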

3. TF-IDF
Here, we see that the number of articles per class is roughly
balanced, which is helpful! If our dataset were imbalanced, we
would need to carefully configure our model or artificially balance
the dataset, for example by undersampling or oversampling each
class.

To further analyze our dataset, we need to transform each article's
text into a feature vector: a list of numerical values representing
some of the text's characteristics. This is because most ML models
cannot process raw text, instead only dealing with numerical
values.

One common approach for extracting features from text is to use
the bag-of-words model: a model where, for each document (an
article in our case), the presence (and often the frequency) of
words is taken into consideration, but the order in which they
occur is ignored.

Specifically, for each term in our dataset, we will calculate a
measure called term frequency - inverse document frequency,
abbreviated to tf-idf. This statistic represents a word's
importance in each document. We use a word's frequency as a proxy
for its importance: if “football” is mentioned 25 times in a
document, it might be more important than if it was only mentioned
once.

4. Chi Square
Each of our 2225 documents is now represented by 14415 features,
representing the tf-idf score for different unigrams and bigrams.

This representation is not only useful for solving our
classification task, but also for familiarizing ourselves with the
dataset. For example, we can use the chi-squared test to find the
terms that are most correlated with each of the categories:

5. Dimensionality Reduction

Dimensionality-reduction techniques project a high-dimensional
vector into a lower number of dimensions, with different
guarantees on this projection according to the method used:

 Principal Component Analysis (PCA) and Truncated Singular
Value Decomposition (Truncated SVD) are two popular methods,
given their scalability to a large number of dimensions, but they
perform poorly on some classes of problems (when the
correlations in the data are non-linear).
 Kernel PCA, self-organizing maps and auto-encoders are often
used when the correlations between the features' dimensions
are non-linear.
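A minimal sketch of the Truncated SVD option on tf-idf features (the toy articles are hypothetical; `n_components=2` is chosen only to make the output plottable):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Four toy articles standing in for the BBC texts (hypothetical data).
texts = [
    "football match goal",
    "election vote law",
    "football club striker",
    "parliament vote debate",
]

# tf-idf yields a sparse, high-dimensional matrix; Truncated SVD works
# directly on sparse input (unlike plain PCA, which needs a dense,
# centered matrix).
X = TfidfVectorizer().fit_transform(texts)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
```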

6. Model training and evaluation

One common mistake when evaluating a model is to train and test
it on the same dataset: this is problematic because it will not
evaluate how well the model works in realistic conditions, on
unseen data, and models that overfit the data will seem to
perform better.

It is common practice to split the data into three parts:

1. A training set that the model will be trained on.
2. A validation set used for tuning the model's hyperparameters.
3. A test set to evaluate the model's performance.
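One way to produce the three-way split is two successive calls to scikit-learn's `train_test_split`; the 60/20/20 proportions and the placeholder data are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split

# Hypothetical features and labels (100 examples).
X = [[float(i)] for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out a test set, then split the remainder into train and
# validation, giving a 60/20/20 split overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80 = 20
```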

7. Result of Training

The results for the RandomForest model show a large variance, a
sign of a model that is overfitting its training data. Running
cross-validation is vital, because results from a single train/test
split can be misleading. We also notice that
both MultinomialNB (Naive Bayes) and LogisticRegression perform
extremely well, with LogisticRegression having a slight advantage
at a median accuracy of around 97%.
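The comparison above can be reproduced in miniature with `cross_val_score`. Synthetic data stands in for the BBC tf-idf features here, so the accuracies will not match the 97% figure:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the tf-idf matrix (hypothetical data);
# MultinomialNB requires non-negative inputs, so we shift the features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X - X.min()

# Mean accuracy across 5 cross-validation folds for each model.
models = [MultinomialNB(), LogisticRegression(max_iter=1000)]
results = {type(m).__name__: cross_val_score(m, X, y, cv=5).mean()
           for m in models}
```

Comparing the per-fold scores (not just the mean) is what reveals the high variance the RandomForest model showed.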
