0% found this document useful (0 votes)

96 views12 pages

Spam News Detection Report: Manikiran

Uploaded by

Mani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

96 views12 pages

Spam News Detection Report: Manikiran

Uploaded by

Mani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 12

Spam News Detection Report

Manikiran
Internship Project

October 15, 2024

Contents

1 Introduction 2

2 Problem Statement 3
2.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 Dataset Overview 4

4 Data Preprocessing 5
4.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2 Lowercasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.3 Stop Word Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.4 TF-IDF Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

5 Logistic Regression Model 7

5.1 Introduction to Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . 7
5.2 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

6 Model Evaluation 8
6.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6.3 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
6.4 F1 Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

7 Results and Analysis 9

8 Conclusion 10

9 References 11

1
Chapter 2

Introduction

In the age of information, digital platforms have become primary sources of news for millions.
However, with the rapid dissemination of information, there is a growing threat of spam or
fake news, which can mislead readers and spread false information. Detecting and mitigating
this threat has become a significant challenge for both researchers and developers.
In this report, we develop a machine learning model capable of detecting spam news arti-
cles. The model aims to classify news as either true or false using logistic regression. This
report outlines the problem, the dataset used, the preprocessing techniques applied, and the
evaluation of the model’s performance.

2
Chapter 3

Problem Statement

Spam news, or fake news, is a pervasive issue in modern media. It has the potential to
distort public opinion and misinform large populations, leading to serious societal
consequences. The challenge lies in distinguishing between legitimate news and fake news,
as the latter is often designed to appear authentic.
The main objective is to use machine learning techniques to classify news articles into ”spam”
or ”real” based on the content of the text. The problem can be framed as a binary classifi-
cation task, where the target variable is whether the news is spam or true.

2.1 Objectives
The primary objectives of this project are:
• To develop a machine learning model that can accurately classify news articles as spam
or true.
• To evaluate the performance of the model using standard evaluation metrics.
• To explore different preprocessing techniques for improving the model’s accuracy.

3
Chapter 4

Dataset Overview

The dataset used for this project contains news articles, each labeled as either true or spam.
The dataset comprises two key columns:

• Text: The content of the news article.

• Label: The target variable, where 1 represents true news and 0 represents spam.
The dataset contains a balanced number of true and spam articles, allowing for fair training
and evaluation of the model. Below is an overview of the dataset:

• Total number of articles: 10,000

• True news articles: 5,000
• Spam news articles: 5,000

4
Chapter 5

Data Preprocessing

Before applying machine learning algorithms, the text data needs to be preprocessed. The
steps involved in preprocessing the text are as follows:

4.1 Tokenization
Tokenization is the process of splitting the text into individual words (tokens). This helps
the model understand the content of the text on a word-by-word basis.

4.2 Lowercasing
All text is converted to lowercase to ensure uniformity. For example, ”News” and ”news”
are treated as the same word after this step.

4.3 Stop Word Removal

Stop words are common words like ”the”, ”is”, ”in”, etc., which do not carry significant
meaning. These are removed to reduce noise in the data.

4.4 TF-IDF Vectorization

To convert the text into numerical features that can be fed into the machine learning model,
we use Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. TF-IDF as-
signs higher weights to important words and reduces the impact of frequent but unimportant
words.
Code Example:
from sklearn . feature_extraction . text import TfidfVectorizer

# Initialize the vectorizer

vectorizer = TfidfVectorizer ( max_features =5000)

5
Spam News Detection Report 6

# Transform the training data

X_train_tfidf = vectorizer . fit_transform ( X_train )
X_test_tfidf = vectorizer . transform ( X_test )
Chapter 7

Logistic Regression Model

5.1 Introduction to Logistic Regression

Logistic Regression is a widely-used algorithm for binary classification problems. It
estimates the probability that a given input belongs to a particular class by fitting a logistic
curve to the data. In the context of spam news detection, logistic regression will predict the
probability that a news article is spam.

5.2 Model Training

We split the dataset into training and testing sets. The training set is used to fit the model,
while the testing set evaluates the model’s performance.
Code Example:
from sklearn . linear_model import LogisticRegression

# Initialize the model

model = LogisticRegression ()

# Fit the model on the training data

model . fit ( X_train_tfidf , y_train )

# Predict on the test data

y_pred = model . predict ( X_test_tfidf )

7
Chapter 8

Model Evaluation

To evaluate the performance of the model, we use various metrics such as accuracy, precision,
recall, and the F1 score.

6.1 Confusion Matrix

The confusion matrix provides insights into the number of true positives, false positives, true
negatives, and false negatives.
Code Example:
from sklearn . metrics import confusion_matrix

# Generate the confusion matrix

conf_matrix = confusion_matrix ( y_test , y_pred )

6.2 Accuracy
Accuracy is the proportion of correctly classified news articles (both true and spam) out of
the total number of articles.

6.3 Precision and Recall

Precision is the proportion of predicted spam news that is actually spam, while recall is the
proportion of actual spam news that was correctly predicted by the model.

6.4 F1 Score
The F1 score is the harmonic mean of precision and recall, providing a balanced measure of
the model’s performance.

8
Chapter 9

Results and Analysis

The Logistic Regression model achieved the following results on the test set:

• Accuracy: 92.5%
• Precision: 91.8%
• Recall: 93.2%
• F1 Score: 92.5%
These results indicate that the model is effective in detecting spam news, with a high accuracy
and balanced precision-recall performance.

9
Chapter
Conclusion

In this report, we developed a logistic regression model to classify news articles as either
spam or true. The model was trained on a labeled dataset and evaluated using standard
performance metrics. The results demonstrate that logistic regression is a robust and effective
method for detecting spam news.
Future work could involve testing more advanced models like deep learning or exploring ad-
ditional text preprocessing techniques to further improve accuracy. Additionally, expanding
the dataset to include more diverse sources of news could make the model more generalizable.

1
Chapter
References

• Scikit-learn documentation: https://scikit-learn.org/stable/documentation.html

• Spam News Detection Dataset: https://example.com/dataset
• Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, 12, 2825-2830.

K Means Clustering Project
100% (1)
K Means Clustering Project
2 pages
Fake News Detection via Evolutionary Model
No ratings yet
Fake News Detection via Evolutionary Model
19 pages
Data Mining With Clustering AND Classification
No ratings yet
Data Mining With Clustering AND Classification
16 pages
Logistic Regression
100% (1)
Logistic Regression
56 pages
07 Outlier Detection
No ratings yet
07 Outlier Detection
54 pages
Seminar Report Machine Learning
No ratings yet
Seminar Report Machine Learning
20 pages
Lecture 14 - Logistic and Softmax Regression - Plain
No ratings yet
Lecture 14 - Logistic and Softmax Regression - Plain
12 pages
Aiml Project Report
No ratings yet
Aiml Project Report
46 pages
Nearest Neighbour Algorithm
No ratings yet
Nearest Neighbour Algorithm
20 pages
3 - The Data Science Method
No ratings yet
3 - The Data Science Method
8 pages
Retail Data Insights & Strategies
No ratings yet
Retail Data Insights & Strategies
24 pages
Module2 Ids 240201 162026
No ratings yet
Module2 Ids 240201 162026
11 pages
Overfitting vs Underfitting in ML
No ratings yet
Overfitting vs Underfitting in ML
20 pages
Poly
100% (1)
Poly
108 pages
UE20CS302 Unit3 Slides
No ratings yet
UE20CS302 Unit3 Slides
308 pages
Chapter Four
No ratings yet
Chapter Four
75 pages
Handling Imbalanced Datasets in Machine Learning - by Baptiste Rocca - Towards Data Science
No ratings yet
Handling Imbalanced Datasets in Machine Learning - by Baptiste Rocca - Towards Data Science
24 pages
6 - KNN Classifier
No ratings yet
6 - KNN Classifier
10 pages
Fake News Detection: Using Machine Learning & Python (Predicting Website)
No ratings yet
Fake News Detection: Using Machine Learning & Python (Predicting Website)
13 pages
Introduction To Business Forecasting and Predictive Analytics
No ratings yet
Introduction To Business Forecasting and Predictive Analytics
25 pages
Chapter 4 Descriptive Data Mining
No ratings yet
Chapter 4 Descriptive Data Mining
6 pages
Tutorial 2018 Optimization
No ratings yet
Tutorial 2018 Optimization
7 pages
Unsupervised Learning
No ratings yet
Unsupervised Learning
95 pages
Logistic Regression Tutorial
100% (1)
Logistic Regression Tutorial
22 pages
Uses of Predictive Analytics
No ratings yet
Uses of Predictive Analytics
4 pages
Binary Classification
No ratings yet
Binary Classification
1 page
CS229 Lecture 3 PDF
100% (1)
CS229 Lecture 3 PDF
35 pages
Classification Metrics in Machine Learning
No ratings yet
Classification Metrics in Machine Learning
6 pages
Blank: CFC Cumulative Forecast Error or Bias Error
100% (1)
Blank: CFC Cumulative Forecast Error or Bias Error
2 pages
Section 2 Text Analytics and Text Mining Overview
No ratings yet
Section 2 Text Analytics and Text Mining Overview
47 pages
Cheatsheet Midterms 2 - 3
No ratings yet
Cheatsheet Midterms 2 - 3
2 pages
Crime Hotspot Prediction Using Machine Learning v4
No ratings yet
Crime Hotspot Prediction Using Machine Learning v4
20 pages
Types of Data (Qualitative and Quantitative)
No ratings yet
Types of Data (Qualitative and Quantitative)
89 pages
Sajjadiani Et Al - 2019 - Using Machine Learning To Translate Applicant Work History Into Predictors of
No ratings yet
Sajjadiani Et Al - 2019 - Using Machine Learning To Translate Applicant Work History Into Predictors of
61 pages
Human Life Span Prediction Using Machine Learning
100% (1)
Human Life Span Prediction Using Machine Learning
9 pages
Lec 37
No ratings yet
Lec 37
13 pages
EDA Lecture Module 2
100% (1)
EDA Lecture Module 2
42 pages
Cluster
100% (1)
Cluster
72 pages
Introduction To Data Mining Unit1
No ratings yet
Introduction To Data Mining Unit1
37 pages
Lecture Notes - Random Forests PDF
100% (1)
Lecture Notes - Random Forests PDF
4 pages
Recommendation System in Python
No ratings yet
Recommendation System in Python
13 pages
Dokumen - Pub Approaching Almost Any Machine Learning Problem 9788269211528 L 5276104
100% (1)
Dokumen - Pub Approaching Almost Any Machine Learning Problem 9788269211528 L 5276104
151 pages
Anomaly Detection
No ratings yet
Anomaly Detection
11 pages
1
100% (1)
1
385 pages
Unit I - Text Mining
No ratings yet
Unit I - Text Mining
48 pages
Statistics: Statistics, Data, & Statistical Thinking
No ratings yet
Statistics: Statistics, Data, & Statistical Thinking
24 pages
Machine Learning Ppts
No ratings yet
Machine Learning Ppts
38 pages
Project 5 PDF
100% (1)
Project 5 PDF
48 pages
Fake News Detection Using Machine Learning
No ratings yet
Fake News Detection Using Machine Learning
8 pages
Logistic Regression Model Study Assignment
100% (1)
Logistic Regression Model Study Assignment
5 pages
BoS - Session 1
100% (1)
BoS - Session 1
37 pages
CS178 Homework #1: Problem 0: Getting Connected
No ratings yet
CS178 Homework #1: Problem 0: Getting Connected
4 pages
NIMBLE User Manual
No ratings yet
NIMBLE User Manual
194 pages
Lecture 3 Data Mining
No ratings yet
Lecture 3 Data Mining
30 pages
Non Parametric Methods 8
No ratings yet
Non Parametric Methods 8
23 pages
Linear Regression (Check List)
100% (1)
Linear Regression (Check List)
2 pages
Bayes Classification for Fish Sorting
No ratings yet
Bayes Classification for Fish Sorting
86 pages
Gradient Descent
No ratings yet
Gradient Descent
18 pages
Spam News Detection Report
No ratings yet
Spam News Detection Report
9 pages
2nd Project Darling
No ratings yet
2nd Project Darling
9 pages
CSC Attestation Guide for Applicants
No ratings yet
CSC Attestation Guide for Applicants
2 pages
Financial Literacy-A Systematic Review and Bibliometric Analysis
No ratings yet
Financial Literacy-A Systematic Review and Bibliometric Analysis
26 pages
Issues On Human Development
No ratings yet
Issues On Human Development
20 pages
9.SINIF KONU-SORU DAĞILIMI Yeni
No ratings yet
9.SINIF KONU-SORU DAĞILIMI Yeni
1 page
Gross Motor Development 6to12years
No ratings yet
Gross Motor Development 6to12years
17 pages
Sarthak Marksheet Class 8th
No ratings yet
Sarthak Marksheet Class 8th
1 page
ENGL 211 Syllabus
No ratings yet
ENGL 211 Syllabus
1 page
Gap Certificate
No ratings yet
Gap Certificate
1 page
Smart Notebook Lesson
No ratings yet
Smart Notebook Lesson
5 pages
BE Biomedical SEM 1 To 8 Curriculum
No ratings yet
BE Biomedical SEM 1 To 8 Curriculum
8 pages
Absentees List
No ratings yet
Absentees List
36 pages
Elab ENG Empower b1, Eng1
No ratings yet
Elab ENG Empower b1, Eng1
3 pages
Exploring The Power of Non-Verbal, Written and Visual Communication
No ratings yet
Exploring The Power of Non-Verbal, Written and Visual Communication
21 pages
Addition of Whole Numbers
No ratings yet
Addition of Whole Numbers
16 pages
Give Scientific Reason. Sometimes, Higher Plants and Animals Too Perform Anaerobic Respiration. - Science and Technology 2 Sha
No ratings yet
Give Scientific Reason. Sometimes, Higher Plants and Animals Too Perform Anaerobic Respiration. - Science and Technology 2 Sha
1 page
Application Form: Professional Regulation Commission
No ratings yet
Application Form: Professional Regulation Commission
1 page
Exploratory Factor Analysis and Construct Validity of The Male Role Norms Inventory-Adolescent-revised (MRNI-A-r)
No ratings yet
Exploratory Factor Analysis and Construct Validity of The Male Role Norms Inventory-Adolescent-revised (MRNI-A-r)
13 pages
Week 8 - Accessing Global Opportunities
No ratings yet
Week 8 - Accessing Global Opportunities
13 pages
Quadrilateral Problem Solutions
No ratings yet
Quadrilateral Problem Solutions
3 pages
Be - Information Technology - Semester 8 - 2023 - December - Blockchain and DLT Rev 2019 C Scheme
No ratings yet
Be - Information Technology - Semester 8 - 2023 - December - Blockchain and DLT Rev 2019 C Scheme
1 page
Rules Grades Expectations 2010-11
No ratings yet
Rules Grades Expectations 2010-11
3 pages
Manual For The Atraumatic Restaurative Treatment ART
No ratings yet
Manual For The Atraumatic Restaurative Treatment ART
58 pages
The Present Perfect Simple Tense
No ratings yet
The Present Perfect Simple Tense
6 pages
50.english For Academic Purpose N
No ratings yet
50.english For Academic Purpose N
7 pages
3PPS112 Assignment
No ratings yet
3PPS112 Assignment
5 pages
Introducing Rethinking Economics
No ratings yet
Introducing Rethinking Economics
21 pages
Real Options Other Topics in Capital Budgeting
No ratings yet
Real Options Other Topics in Capital Budgeting
14 pages
Nelson Math Sampler K-6
100% (5)
Nelson Math Sampler K-6
163 pages
3 Tenses Uscire
No ratings yet
3 Tenses Uscire
3 pages
Exploring Fatherhood in Bangladesh
No ratings yet
Exploring Fatherhood in Bangladesh
5 pages