
CONFIDENTIAL & RESTRICTED

Document Classification Using Machine Learning

What is a Document Classifier?


Document classification is an application of Machine Learning (ML),
specifically of Natural Language Processing (NLP). By classifying
text, we aim to assign one or more classes or categories to a
document, making it easier to manage and sort. This is especially
useful for publishers, news sites, blogs, or anyone who deals with
a lot of content.

Document Classification vs. Text Classification
Text classification can be used for a wide variety of tasks such as
sentiment analysis, topic detection, intent identification, and
much more. But when it comes to classification, a common question
is whether it is better to analyze documents as a whole, or whether
it is more convenient to preprocess the documents and divide them
into smaller units before doing the analysis. Unfortunately, there
is no one-size-fits-all answer: which approach is more appropriate
depends on your data and on your goals for the analysis.

Four different levels of scope can be applied in text
classification:

 Document level obtains the relevant categories of a full
document.
 Paragraph level obtains the relevant categories of a single
paragraph.
 Sentence level obtains the relevant categories of a single
sentence.
 Sub-sentence level obtains the relevant categories of
sub-expressions within a sentence (also known as opinion units).

Steps for Document Classification


1. The dataset
The quality of the tagged dataset is by far the most important
component of a statistical NLP classifier. The dataset needs to be
large enough to have an adequate number of documents in each
class. For 500 possible document categories, you may require 100
documents per category, so a total of 50,000 documents may be
required.

The dataset also needs to be of high enough quality in terms of
how distinct the documents in the different categories are from
each other, to allow clear delineation between the categories.

2. Preprocessing
A naive approach would give equal importance to each and every
word when creating document vectors. Instead, we can do some
preprocessing and give different weightings to words based on
their importance to the document in question. A common
methodology for this is TF-IDF (term frequency - inverse document
frequency). The TF-IDF weighting for a word increases with the
number of times the word appears in the document, but decreases
based on how frequently the word appears in the entire document
set.
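The weighting just described can be sketched in a few lines. This is one common textbook formulation (libraries such as scikit-learn use smoothed variants), and the toy corpus is purely illustrative:

```python
import math

# Toy corpus: each "document" is a list of tokens (hypothetical data).
docs = [
    ["football", "match", "goal", "football"],
    ["election", "vote", "match"],
    ["football", "election"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: share of the document taken up by this term.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms found in fewer documents score higher.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "football" is frequent in doc 0, but appearing in 2 of 3 documents
# dampens its weight.
score = tf_idf("football", docs[0], docs)
```

Note how the two factors pull in opposite directions: a term repeated within one document gains weight, while a term spread across the whole collection loses it.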

3. Classification Algorithm and Strategy

The simplest strategy classifies a document by comparing the
number of matching terms in the document vectors. In the real
world, more sophisticated algorithms are commonly used, such as
Support Vector Machines (SVMs), Naive Bayes and Decision Trees.
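The matching-terms idea can be sketched as a toy classifier; the category term profiles below are hypothetical, not taken from any real system:

```python
# Hypothetical term profiles for two categories.
category_terms = {
    "sport": {"football", "goal", "match", "striker"},
    "politics": {"election", "vote", "parliament"},
}

def classify(tokens):
    # Score each category by how many of its terms appear in the
    # document, then pick the category with the largest overlap.
    overlap = {cat: len(terms & set(tokens))
               for cat, terms in category_terms.items()}
    return max(overlap, key=overlap.get)

label = classify(["the", "football", "match", "today"])
```

The more sophisticated algorithms listed above replace this raw overlap count with a learned decision function, but the input, a term-based document vector, is the same.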

Real World Example of a Document Classifier

We will build a document classifier for the BBC datasets (two news
article datasets).

1. Data input and pre-processing

We load this dataset locally, as a CSV file, and add a column
encoding the category as an integer (categorical variables are
often better represented by integers than strings).
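A minimal sketch of this encoding step, assuming the CSV has 'category' and 'text' columns (the column names and the inline rows are assumptions, not taken from the source):

```python
import pandas as pd

# Stand-in for pd.read_csv("bbc.csv"): a tiny frame built inline.
df = pd.DataFrame({
    "category": ["sport", "politics", "sport", "tech"],
    "text": ["goal scored", "vote counted", "match won", "chip released"],
})

# Encode the string category as an integer id; factorize also returns
# the mapping from integer code back to the original label.
df["category_id"], categories = pd.factorize(df["category"])
```

Keeping the `categories` mapping around lets us translate predicted integer ids back into readable labels later.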

2. Data exploration
Before diving head-first into training machine learning models, we
should become familiar with the structure and characteristics of
our dataset: these properties might inform our problem-solving
approach.

A first step would be to look at some random examples, and the
number of examples in each class:
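Both inspections are one-liners in pandas; the miniature dataframe below is a hypothetical stand-in for the BBC data:

```python
import pandas as pd

# Hypothetical miniature of the dataframe loaded in step 1.
df = pd.DataFrame({
    "category": ["sport", "politics", "sport", "tech", "sport"],
    "text": ["goal late", "vote held", "match won", "chip made", "cup final"],
})

# A few random rows, then the number of articles per class.
print(df.sample(2, random_state=0))
counts = df["category"].value_counts()
print(counts)
```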

3. TF-IDF
Here, we see that the number of articles per class is roughly
balanced, which is helpful! If our dataset were imbalanced, we
would need to carefully configure our model or artificially balance
the dataset, for example by undersampling or oversampling each
class.

To further analyze our dataset, we need to transform each article's
text into a feature vector: a list of numerical values representing
some of the text's characteristics. This is because most ML models
cannot process raw text, instead only dealing with numerical
values.

One common approach for extracting features from text is to use
the bag-of-words model: a model where, for each document (an
article in our case), the presence (and often the frequency) of
words is taken into consideration, but the order in which they
occur is ignored.

Specifically, for each term in our dataset, we will calculate a
measure called term frequency - inverse document frequency,
abbreviated to tf-idf. This statistic represents a word's
importance in each document. We use a word's frequency as a proxy
for its importance: if “football” is mentioned 25 times in a
document, it might be more important than if it was only mentioned
once.

4. Chi Square
Each of our 2225 documents is now represented by 14415 features,
representing the tf-idf score for different unigrams and bigrams.

This representation is not only useful for solving our
classification task, but also for familiarizing ourselves with the
dataset. For example, we can use the chi-squared test to find the
terms that are most correlated with each of the categories:

5. Dimensionality Reduction

Dimensionality-reduction techniques project a high-dimensional
vector into a lower number of dimensions, with different
guarantees on this projection according to the method used:

 Principal Component Analysis (PCA) and Truncated Singular
Value Decomposition (Truncated SVD) are two popular methods,
given their scalability to a large number of dimensions, but they
perform poorly on some classes of problems (when the
correlations in the data are non-linear).
 Kernel PCA, self-organizing maps and auto-encoders are often
used when the correlations between the features' dimensions
are non-linear.
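A minimal sketch of the Truncated SVD option on tf-idf features (the toy articles are hypothetical; `n_components=2` is chosen only to make the output plottable):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Four toy articles standing in for the BBC texts (hypothetical data).
texts = [
    "football match goal",
    "election vote law",
    "football club striker",
    "parliament vote debate",
]

# tf-idf yields a sparse, high-dimensional matrix; Truncated SVD works
# directly on sparse input (unlike plain PCA, which needs a dense,
# centered matrix).
X = TfidfVectorizer().fit_transform(texts)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
```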

6. Model training and evaluation

One common mistake when evaluating a model is to train and test
it on the same dataset: this is problematic because it will not
evaluate how well the model works in realistic conditions, on
unseen data, and models that overfit the data will seem to
perform better.

It is common practice to split the data into three parts:

1. A training set that the model will be trained on.
2. A validation set used for tuning the model's hyperparameters.
3. A test set to evaluate the model's performance.
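One way to produce the three-way split is two successive calls to scikit-learn's `train_test_split`; the 60/20/20 proportions and the placeholder data are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split

# Hypothetical features and labels (100 examples).
X = [[float(i)] for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out a test set, then split the remainder into train and
# validation, giving a 60/20/20 split overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80 = 20
```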

7. Result of Training

The results for the RandomForest model show a large variance, a
sign of a model that is overfitting its training data. Running
cross-validation is vital, because results from a single train/test
split can be misleading. We also notice that
both MultinomialNB (Naive Bayes) and LogisticRegression perform
extremely well, with LogisticRegression having a slight advantage
at a median accuracy of around 97%.
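The comparison above can be reproduced in miniature with `cross_val_score`. Synthetic data stands in for the BBC tf-idf features here, so the accuracies will not match the 97% figure:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the tf-idf matrix (hypothetical data);
# MultinomialNB requires non-negative inputs, so we shift the features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X - X.min()

# Mean accuracy across 5 cross-validation folds for each model.
models = [MultinomialNB(), LogisticRegression(max_iter=1000)]
results = {type(m).__name__: cross_val_score(m, X, y, cv=5).mean()
           for m in models}
```

Comparing the per-fold scores (not just the mean) is what reveals the high variance the RandomForest model showed.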
