Bayesian learning for classifying Netnews text articles
Zakaria Khitirishvili
Report Due Date: 6/21/2024
June 11, 2024
1 Problem Statement
In the digital age, the volume of text data generated daily is immense, making it challenging to organize and classify information efficiently. One common application is the classification of news articles into predefined categories. Accurate classification aids in better content organization, improved search functionality, and enhanced user experience. Traditional methods struggle with the high-dimensional and sparse nature of text data, requiring advanced machine learning techniques.
2 Solution
The solution involves using the Naive Bayes classifier, a probabilistic machine learning algorithm well-suited for text classification due to its simplicity and effectiveness. This project will classify 20,000 Netnews articles into 20 distinct categories using a sparse word-count feature model to represent the text data. The algorithm will be implemented in Python, leveraging libraries such as scikit-learn for feature extraction, the Naive Bayes classifier, and evaluation. Traditional methods like manual classification or rule-based systems are not scalable for large datasets and cannot handle the variability and nuances in text data effectively. The Naive Bayes classifier, on the other hand, provides a robust approach to modeling the probabilistic relationships between words and document classes, offering a scalable and automated solution.
Input: Preprocessed text data represented as feature vectors.
Output: Predicted class labels for each document.
Target Variable: Newsgroup category labels for each document.
Bayesian learning for classifying Netnews text articles CSCI 575
The dataset consists of 20,000 newsgroup messages, with 1,000 documents from each of the 20 newsgroups. This data will be preprocessed to convert the text into numerical features using a bag-of-words model. The dataset is available for download from a provided link.
Features: sparse word-count values representing the textual content of each document.
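The bag-of-words conversion can be sketched with scikit-learn's CountVectorizer; the two example documents below are illustrative stand-ins, not messages from the dataset:

```python
# Sketch: converting raw text into sparse word-count (bag-of-words)
# features with scikit-learn. Each row of X counts word occurrences
# in one document over the learned vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the graphics card renders the image",
    "atheism debates religion and belief",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (2, vocab_size)

print(X.shape[0])            # 2 documents
print(int(X.toarray().sum()))  # 11 total word tokens across both documents
```

Because the matrix is stored sparsely, only the nonzero counts are kept, which is what makes this representation practical for a large, mostly-zero vocabulary.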
3 Assumptions, Constraints and Implications
The Naive Bayes classifier makes several key assumptions that impact its performance.
It assumes that features (words) are conditionally independent given the class label, simpli-
fying computation but potentially overlooking contextual dependencies in real-world text.
The model also assumes a fixed vocabulary derived from the training data, meaning any
words in the test data not present in the training set will be ignored, possibly losing some
information. Furthermore, the evaluation presumes that the class distribution in the dataset
is balanced; if not, the model might favor majority classes, leading to misleading accuracy
metrics. Lastly, the model assumes that the training data is representative of the overall
distribution of text documents in each category; otherwise, it may fail to generalize well to
new, unseen data, resulting in poor test performance.
Limited computational resources (CPU/GPU, memory) may restrict the size of the
dataset or the complexity of the model, which calls for simpler models or smaller datasets
that might not fully capture the problem’s complexity. The quality of the input text data,
including issues like noise, irrelevant information, and inconsistent formatting, can impact
model performance, requiring extensive and time-consuming preprocessing such as removing
stop words. The length and complexity of text documents can affect the performance of
the TF-IDF vectorizer and the Naive Bayes classifier, with very short or very long documents
possibly leading to overfitting or underfitting.
The model is likely to perform well on classes with distinct vocabularies, providing high accuracy, precision, recall, and F1-scores, but may struggle with overlapping classes or nuanced language, resulting in degraded performance. Its ability to generalize to new data depends on the representativeness of the training data and the validity of the independence assumption. While this approach is scalable to reasonably large datasets, it may struggle with very large datasets or high dimensionality, requiring more scalable methods or dimensionality reduction techniques.
4 Solution implementation
The solution utilized the 20 Newsgroups dataset, which consists of approximately 20,000 documents categorized into 20 different newsgroups. This dataset was provided as a zip file for the course assignment and stored locally, with each category represented by a directory containing 1,000 text files. For model training and evaluation, specific categories were selected for class-vs-class (1 vs 1) and tri-class (3 vs 3) classification, allowing the model to be trained and tested on distinct subsets of the data.
The text data underwent preprocessing to ensure high-quality input before feature extraction. This preprocessing included removing stop words (common, uninformative words like "the" and "is"), tokenizing the text (splitting it into individual words or tokens), and standardizing the text by handling punctuation and converting it to lowercase. The primary feature learning technique applied was Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, which transforms text documents into numerical vectors that reflect the importance of each word relative to the entire dataset. TF-IDF combines Term Frequency (TF), which measures how often a word appears in a document, with Inverse Document Frequency (IDF), which measures how unique or rare a word is across all documents, thus weighting frequent but less informative words lower.
The model chosen for classification was the Multinomial Naive Bayes classifier, well-suited for text classification due to its efficiency and effectiveness in handling high-dimensional data. This algorithm leverages the probabilistic relationships between words and class labels. To streamline the process, a pipeline was created using make_pipeline from scikit-learn, which sequentially combines the TF-IDF vectorizer and the Naive Bayes classifier, ensuring consistent application of text data transformation and model training. The dataset was split into training and test sets using an 80-20 split, with the training set used to fit the model and the test set used to evaluate its performance. The model was trained on 5 groups of (1 vs 1) and (3 vs 3) category combinations, and its effectiveness was assessed using performance metrics such as accuracy, precision, recall, and F1-score.
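A minimal sketch of this setup is shown below; the documents and labels are toy stand-ins for two newsgroup categories rather than the actual files from the assignment:

```python
# Sketch of the described pipeline: TF-IDF vectorizer + Multinomial
# Naive Bayes, with an 80-20 train/test split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for two categories with distinct vocabularies.
texts = [
    "god atheism belief religion", "atheism debate morality",
    "gpu graphics rendering image", "graphics card pixel shader",
] * 10
labels = (["alt.atheism"] * 2 + ["comp.graphics"] * 2) * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out 20%
```

Wrapping both steps in one pipeline ensures the test documents are vectorized with exactly the vocabulary and IDF weights learned from the training set.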
5 Analysis of Performance
We used the following performance metrics:
Precision: Precision measures the proportion of correctly predicted positive observations to the total predicted positives.
Recall: Recall measures the proportion of correctly predicted positive observations to all observations in the actual class.
F1-score: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both concerns.
Accuracy: Accuracy is the proportion of correctly predicted observations to the total observations.
Macro Average (macro avg): Macro averaging calculates the metric independently for each class and then takes the average (hence treating all classes equally).
Weighted Average (weighted avg): Weighted averaging calculates the metric independently for each class and then takes the average weighted by the number of instances in each class.
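For a binary case, these definitions reduce to simple ratios of the four confusion-matrix counts; the numbers below are illustrative, not results from this project:

```python
# Illustrative counts: tp = true positives, fp = false positives,
# fn = false negatives, tn = true negatives.
tp, fp, fn, tn = 90, 10, 5, 95

precision = tp / (tp + fp)                          # 90 / 100 = 0.9
recall = tp / (tp + fn)                             # 90 / 95
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 185 / 200 = 0.925

print(precision, recall, f1, accuracy)
```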
In Table 1 below, precision, recall, F1-score, and accuracy are all 1.00, indicating perfect classification performance for these two categories. This suggests that the model has no difficulty distinguishing between 'alt.atheism' and 'comp.graphics', likely because these categories have distinct vocabularies.
Precision Recall F1-score Support
alt.atheism 1.00 1.00 1.00 199
comp.graphics 1.00 1.00 1.00 201
Accuracy 1.00 400
Macro avg 1.00 1.00 1.00 400
Weighted avg 1.00 1.00 1.00 400
Table 1: Results for categories: (’alt.atheism’, ’comp.graphics’)
We used a confusion matrix to calculate the performance metrics shown in the tables. Figure 1 below shows the confusion matrix for Table 1. As we can see, the model distinguished the two groups perfectly.
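The confusion matrices themselves can be produced with scikit-learn; the labels below are a tiny illustrative example, not this project's predictions:

```python
# Sketch: computing a confusion matrix with scikit-learn.
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

y_true = ["alt.atheism", "alt.atheism", "comp.graphics", "comp.graphics"]
y_pred = ["alt.atheism", "alt.atheism", "comp.graphics", "alt.atheism"]

cm = confusion_matrix(y_true, y_pred, labels=["alt.atheism", "comp.graphics"])
print(cm)  # off-diagonal entries are misclassifications: here one
           # 'comp.graphics' document was predicted as 'alt.atheism'
```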
Figure 1: Confusion Matrix graph
In Table 2 below, precision, recall, F1-score, and accuracy have almost perfect scores, with a slight drop in precision for 'alt.atheism' (0.99). This small dip indicates a very minor challenge in distinguishing between these two categories, but overall performance is excellent.
Precision Recall F1-score Support
alt.atheism 0.99 1.00 1.00 199
comp.sys.ibm.pc.hardware 1.00 1.00 1.00 201
Accuracy 1.00 400
Macro avg 1.00 1.00 1.00 400
Weighted avg 1.00 1.00 1.00 400
Table 2: Results for categories: (’alt.atheism’, ’comp.sys.ibm.pc.hardware’)
The confusion matrix for Table 2 is shown below in Figure 2. We observe that precision was not 100% because, for one test example, our model predicted 'alt.atheism' when the true label was 'comp.sys.ibm.pc.hardware'.
Figure 2: Confusion Matrix graph
In Table 3 below, alt.atheism achieves perfect scores across precision, recall, and F1-score, indicating the model has no difficulty classifying documents in this category. In comp.graphics, precision is high at 0.98, but recall is slightly lower at 0.93, resulting in an F1-score of 0.96. This suggests that while most predicted 'comp.graphics' documents are correct, some true 'comp.graphics' documents are missed. comp.sys.ibm.pc.hardware shows good performance with a precision of 0.94 and a high recall of 0.98, leading to an F1-score of 0.96. This indicates that most 'comp.sys.ibm.pc.hardware' documents are correctly identified, though there are a few false positives. The overall accuracy is 0.97, indicating that the model performs very well across these categories. Both macro and weighted averages for precision, recall, and F1-score are 0.97, demonstrating balanced performance across categories.
Precision Recall F1-score Support
alt.atheism 1.00 1.00 1.00 217
comp.graphics 0.98 0.93 0.96 197
comp.sys.ibm.pc.hardware 0.94 0.98 0.96 186
Accuracy 0.97 600
Macro avg 0.97 0.97 0.97 600
Weighted avg 0.97 0.97 0.97 600
Table 3: Results for categories: (’alt.atheism’, ’comp.graphics’, ’comp.sys.ibm.pc.hardware’)
Similar to before, Figure 3 below shows the confusion matrix for Table 3. We see that the model struggled most with 'comp.graphics' documents, misclassifying them as 'comp.sys.ibm.pc.hardware' 12 times.
Figure 3: Confusion Matrix graph
In Table 4 below, alt.atheism again achieves near-perfect scores. comp.graphics maintains high precision at 0.99 but lower recall at 0.93, leading to an F1-score of 0.96. Similar to the previous set, true 'comp.graphics' documents are sometimes missed. comp.sys.mac.hardware shows strong performance with precision at 0.93 and a high recall of 0.98, resulting in an F1-score of 0.96, indicating effective classification with few false positives. The overall accuracy remains high at 0.97, showing the model's effectiveness. Both macro and weighted averages are 0.97, reflecting consistent performance across the categories.
Precision Recall F1-score Support
alt.atheism 0.99 1.00 0.99 217
comp.graphics 0.99 0.93 0.96 197
comp.sys.mac.hardware 0.93 0.98 0.96 186
Accuracy 0.97 600
Macro avg 0.97 0.97 0.97 600
Weighted avg 0.97 0.97 0.97 600
Table 4: Results for categories: (’alt.atheism’, ’comp.graphics’, ’comp.sys.mac.hardware’)
Figure 4 below shows the confusion matrix for Table 4. We observe that the model struggled most with 'comp.graphics' documents, misclassifying them as 'comp.sys.mac.hardware' 13 times.
Figure 4: Confusion Matrix graph
Overall, we see that when the topics of different groups are similar, the model scores lower because it has a harder time distinguishing between them. Conversely, if the topics (i.e., the words) in the categories are more distinct, the model scores higher since they are easier to separate. We observed this across the board when running all possible (1 vs 1) and (3 vs 3) combinations.
6 Summary
The immense volume of daily text data poses a challenge for efficient classification,
essential for better content organization and search functionality. This project addressed
the problem using the Naive Bayes classifier to categorize 20,000 Netnews articles into 20
categories. The solution involved extensive preprocessing, including stop word removal,
tokenization, and text standardization, followed by TF-IDF vectorization to transform text
data into numerical features.
Key assumptions include feature independence and a fixed vocabulary, which simplify computation but may overlook contextual dependencies. Constraints such as limited computational resources, data quality issues, and the availability of labeled data impact performance, necessitating thorough preprocessing. Despite these challenges, the Naive Bayes
classifier, combined with TF-IDF vectorization, offers a scalable and interpretable solution,
performing well on distinct vocabularies but struggling with overlapping classes.
The implementation used the 20 Newsgroups dataset, with each category stored as text files. The model was evaluated on pairs and groups of categories using metrics like accuracy, precision, recall, and F1-score, demonstrating its effectiveness and scalability for text classification tasks.