FAKE NEWS DETECTION : using machine learning
submitted by: Ms. Addala Teja sri seethalu
(2302001)
AIMLDS
KGRL COLLEGE OF PG COURSES (A)
BHIMAVARAM
M.C.A
JULY - 2024
PROJECT: FAKE NEWS DETECTION USING MACHINE LEARNING
Uncovering the truth has never been easier! Learn how machine learning
algorithms can help combat fake news with our fake news detection project tutorial!
Imagine a scenario where a false news story spreads rapidly on social
media, claiming that a particular medication is a cure for a deadly disease. People start
hoarding the medication, causing scarcity and preventing those who need it from accessing
it.
This example scenario shows one of the several real-world risks of fake news.
The rapid spread of fake news has become a major issue worldwide. The spread of false
and misleading news has led to significant economic consequences, affecting sectors from finance
to healthcare. For example, in 2020, during the COVID-19 pandemic, several countries
witnessed a spike in false news about the virus, leading to confusion and panic among
people. Misinformation and fake news can have a long-term impact, especially when people
rely on accurate information to make critical decisions. The need for detecting fake news
has never been more crucial. Machine learning techniques can help us detect fake news
efficiently and accurately. Using natural language processing techniques, machine learning
algorithms can detect and categorize true and false news. ML systems may
distinguish between true news and false news by analyzing patterns in the language and
sources used in news reports.
This blog will explore a fake news detection project using machine learning and discuss how
machine learning algorithms can efficiently detect and distinguish false news from real news.
We will also explore the key machine learning algorithms used to identify false and true
news and real-world use cases of fake news detection.
Table of Contents
What is a fake news detection using machine learning project?
Advantages and disadvantages of fake news detection using machine learning
Top 5 machine learning algorithms for fake news detection
Top 5 fake news detection project datasets
Fake news detection real-world use cases / applications
Fake news detection is very useful to us
Top 3 fake news detection projects in GitHub
Build a fake news detection project in Python with source code: a step-by-step approach
Boost your career with fake news detection project by
FAQs
What is a fake news detection using machine learning project?
Detecting fake news using machine learning techniques means
having an automatic detection system that looks at a piece of text (tweets, news articles,
WhatsApp messages) and determines how likely it is to be false news. The
system will be a machine learning model trained on a large enough dataset containing
examples of real and false news from various sources and styles. However, since machine
learning models only work with numerical features, we must first perform natural language
processing on this text corpus (collection of text samples).
Natural language processing here covers data cleaning,
stemming, lemmatization, and vectorization using one of the many available techniques,
converting sentences into vectors of numbers that machine learning models can
interpret. Once this is done, we can train models like naïve Bayes, logistic regression, and
random forests and observe their results.
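To make the vectorization step concrete, here is a minimal sketch of the classic TF-IDF weighting written with the standard library only. The function name `tfidf_vectors` and the three sample sentences are made up for illustration, and libraries such as scikit-learn use smoothed and normalized variants of this formula, so the exact numbers will differ from theirs.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of already-cleaned strings."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        # Term frequency times inverse document frequency, one value per vocabulary term.
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical sample sentences, not taken from any real dataset.
docs = ["miracle cure found", "council approves budget", "miracle budget"]
vocab, vecs = tfidf_vectors(docs)
```

In the first sentence, "cure" appears in only one document and therefore receives a higher weight than "miracle", which appears in two. This is the property that lets a classifier focus on the words that set one article apart from the others.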
If we find that the performance of these machine learning techniques is lacking on the
dataset, we can delve into deep learning and look at LSTM or attention-based models to
perform text classification.
But first, let us see why you should use machine learning for detecting false news and what
drawbacks you should be aware of while doing so.
FAKE NEWS DETECTION
MADE BY : NAVYA SRI (22PD1A0552)
Advantages and disadvantages of fake news detection using machine learning:
Machine learning has led to significant developments in fake news detection. However,
machine learning has both advantages and disadvantages when detecting false news. This section
will explore the pros and cons of fake news prediction using machine learning.
Pros and cons of detecting fake news using ML:
Scalability (advantage)
Privacy concerns (disadvantage)
Maintenance (disadvantage)
Top 5 machine learning algorithms for a fake news detection data science project:
GNN
(Graph neural networks)
Bi-LSTM + Attention
(Bi-directional LSTM with attention, used for stance detection in the context of checking fake news)
CNN + DNN
(Deep neural network that ends with a softmax layer)
CNN + Boosted Trees
(Combining convolutional neural networks and boosted trees allows for more robust and accurate
detection of false news.)
MLP
(Multilayer perceptron, a type of feed-forward neural network)
Top 5 fake news detection project datasets:
FakeNewsNet: dataset of political and gossip tweets
Fake News Corpus
FakeHealth
CONSTRAINT COVID-19 fake news dataset
FNC-1 (Fake News Challenge Stage 1)
Fake news prediction using machine learning: real-world use cases / applications
Fake news detection has a wide range of applications across various industries. Let us
explore some real-world applications of false news detection.
o Social media
(Fake news spreads quickly on social media platforms)
o News / journalism
(News organizations use machine learning algorithms to verify information and sources)
o Politics
(Fake news can significantly impact political campaigns and elections)
o Finance
(Fake news can also significantly impact financial markets)
o Healthcare
(Fake news can also seriously affect the healthcare industry)
Top 3 fake news detection projects on GitHub:
There are many research projects on false news that one can explore to understand the
scope of the problem and the best available approaches. To get started, we list some of the
better projects available publicly on GitHub for detecting fake news with Python.
Comprehensive project for fake news analysis using machine learning
Build a fake news detection project in Python with source code: a step-by-step approach
DATASET DESCRIPTION
Train.csv: A full training dataset with the following attributes:
o Id: unique id for a news article
o Title: the title of a news article
o Author: author of the news article
o Text: the text of the article
o Label: a label that marks the article as potentially unreliable
… 1: unreliable
… 0: reliable
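As a rough illustration of this layout, the sketch below reads rows in the Train.csv format and joins title, author, and text into one field per article. It assumes the header row uses lowercase column names (`id`, `title`, `author`, `text`, `label`). The two sample rows and the helper name `load_articles` are invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the Train.csv layout described above.
SAMPLE_CSV = """id,title,author,text,label
0,Council approves budget,A. Reporter,The city council approved the budget.,0
1,Miracle cure found,,Doctors hate this one trick.,1
"""


def load_articles(handle):
    """Read rows and build one combined text field from title, author, and text."""
    articles = []
    for row in csv.DictReader(handle):
        parts = [row.get("title") or "", row.get("author") or "", row.get("text") or ""]
        # Skip empty fields so a missing author does not leave stray spaces.
        combined = " ".join(part.strip() for part in parts if part.strip())
        articles.append({"id": row["id"], "content": combined, "label": int(row["label"])})
    return articles


articles = load_articles(io.StringIO(SAMPLE_CSV))
# Label distribution: 1 = unreliable, 0 = reliable.
label_counts = Counter(article["label"] for article in articles)
```

Checking `label_counts` before training is a quick way to see whether the two classes are balanced. If they are not, accuracy alone can be misleading.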
Here is a basic example of fake news detection using machine learning with Python
and scikit-learn. This example uses a logistic regression model, but you can
experiment with other models for better results.
### Prerequisites
Make sure you have the following libraries installed:
- pandas
- scikit-learn
- nltk (for natural language processing tasks)
You can install them using pip:
```sh
pip install pandas scikit-learn nltk
```
### Step-by-Step Code
1. *Import necessary libraries:*
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')  # one-time download of the NLTK stop-word list
```
2. *Load and preprocess the dataset:*
For this example, we will use a CSV file containing labeled news articles. You can use
any suitable dataset.
```python
# Load dataset; the CSV file is expected to have columns like 'text' and 'label'
df = pd.read_csv('fake_news_dataset.csv')

# Basic text cleaning function
def clean_text(text):
    text = re.sub(r'\W', ' ', text)   # replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    text = text.lower()
    return text

# Missing articles become empty strings so clean_text always receives a str
df['text'] = df['text'].fillna('').apply(clean_text)

# Define features and labels
X = df['text']
y = df['label']
```
3. *Split the dataset into training and testing sets:*
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
4. *Convert text data to numerical data using TF-IDF Vectorizer:*
```python
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'), max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary on training data only
X_test_tfidf = vectorizer.transform(X_test)        # reuse the same vocabulary for test data
```
5. *Train the machine learning model:*
```python
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
```
6. *Evaluate the model:*
```python
y_pred = model.predict(X_test_tfidf)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
```
### Complete Example
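```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

nltk.download('stopwords')

# Load dataset; the CSV file is expected to have columns like 'text' and 'label'
df = pd.read_csv('fake_news_dataset.csv')

# Basic text cleaning function
def clean_text(text):
    text = re.sub(r'\W', ' ', text)   # replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    return text.lower()

df['text'] = df['text'].fillna('').apply(clean_text)
X, y = df['text'], df['label']

# Split, vectorize, train, and evaluate (steps 3 to 6 above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'), max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
```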
Detecting fake news using machine learning involves training algorithms to identify
patterns and features associated with false information. Here’s an overview of the
process:
1. *Data Collection*: Gather datasets containing labeled examples of fake and real
news. Commonly used datasets include the Fake News Challenge (FNC) and the LIAR
dataset.
2. *Data Preprocessing*: Clean and prepare the text data for analysis. This includes:
- Tokenization: Breaking down text into individual words or tokens.
- Stop Word Removal: Eliminating common words that do not carry significant
meaning.
- Lemmatization/Stemming: Reducing words to their base or root form.
- Vectorization: Converting text into numerical representations, such as TF-IDF
(Term Frequency-Inverse Document Frequency) or word embeddings (e.g.,
Word2Vec, GloVe).
3. *Feature Engineering*: Identify and create features that help distinguish fake
news from real news. These features can be:
- Text-based features: Word frequencies, n-grams, sentiment scores.
- Source-based features: Credibility and reputation of the news source.
- Social context features: Engagement metrics, user profiles, and propagation
patterns on social media.
4. *Model Selection*: Choose machine learning algorithms to train on the processed
data. Common models include:
- Logistic Regression
- Support Vector Machines (SVM)
- Decision Trees and Random Forests
- Gradient Boosting Machines (GBM)
- Neural Networks, especially Recurrent Neural Networks (RNN) and Transformers
for handling sequential text data.
5. *Training and Evaluation*: Split the data into training and testing sets. Train the
model on the training set and evaluate its performance on the testing set using
metrics like accuracy, precision, recall, and F1-score.
6. *Model Tuning*: Optimize hyperparameters and improve the model’s
performance through techniques like cross-validation, grid search, or random search.
7. *Deployment*: Integrate the trained model into applications or systems that can
automatically flag potential fake news articles. Continuous monitoring and retraining
of the model are necessary to adapt to new patterns and changes in the data.
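The preprocessing and text-based feature steps above (tokenization, stop word removal, n-grams) can be sketched in a few lines of plain Python. The stop-word set here is a deliberately tiny, made-up list and the helper names are illustrative. A real project would more likely rely on NLTK or scikit-learn for these steps.

```python
import re

# Tiny illustrative stop-word list; real projects use a fuller list (e.g. NLTK's).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "this", "that"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop common words that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical headline used only to exercise the helpers.
tokens = remove_stop_words(tokenize("This miracle pill is the CURE doctors hide!"))
bigrams = ngrams(tokens, 2)
```

Bigrams such as `("miracle", "pill")` keep a little word order that single-word counts lose, which is why n-grams are listed among the text-based features.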
By leveraging these steps, machine learning models can help automate and enhance
the detection of fake news, contributing to more reliable and trustworthy
information dissemination.
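The evaluation metrics named in step 5 are simple ratios over the confusion-matrix counts. The following sketch computes them by hand for a binary problem. The label lists are invented, and `binary_metrics` is an illustrative helper, not a library function. In practice scikit-learn's `classification_report` reports the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: eight articles, positive class = 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Precision answers "of the articles flagged as positive, how many really were?", while recall answers "of the truly positive articles, how many were caught?". Which one matters more depends on whether false alarms or misses are costlier in the application.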
To implement a fake news detection system using machine learning, let's assume you
have a dataset with two CSV files: true.csv and fake.csv. Below is a step-by-step guide
to building a machine learning model for fake news detection:
### Step 1: Import Libraries
First, you'll need to import the necessary libraries for data manipulation,
visualization, and machine learning.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```
### Step 2: Load the Data
Load the true.csv and fake.csv datasets.
```python
# Load the datasets
true_news = pd.read_csv('path/to/true.csv')
fake_news = pd.read_csv('path/to/fake.csv')

# Add a label column to each dataset
true_news['label'] = 1  # 1 indicates true news
fake_news['label'] = 0  # 0 indicates fake news

# Combine the datasets
news = pd.concat([true_news, fake_news]).reset_index(drop=True)
```
### Step 3: Preprocess the Data
Clean and preprocess the text data.
```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word list
STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

# Function to clean the text
def clean_text(text):
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^A-Za-z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()                      # Convert to lowercase
    # Remove stopwords
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# Apply the function to the text column (missing values become empty strings)
news['text'] = news['text'].fillna('').apply(clean_text)
```
### Step 4: Split the Data
Split the data into training and testing sets.
```python
X = news['text']
y = news['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
### Step 5: Feature Extraction
Convert the text data to numerical features using TF-IDF.
```python
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # keep the 5,000 most frequent terms
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```
### Step 6: Train the Model
Train a logistic regression model.
```python
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
```
### Step 7: Evaluate the Model
Evaluate the model on the test data.
```python
y_pred = model.predict(X_test_tfidf)

# Calculate evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Plot the confusion matrix (rows = actual, columns = predicted; 0 = Fake, 1 = True)
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Fake', 'True'], yticklabels=['Fake', 'True'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```
### Step 8: Save and Deploy the Model
Save the trained model and the TF-IDF vectorizer for future use.
```python
import joblib

# Save the model
joblib.dump(model, 'fake_news_model.pkl')

# Save the vectorizer
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.pkl')
```
You can now deploy this model to detect fake news in real-time by loading the saved
model and vectorizer, then using them to predict the labels for new news articles!
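A minimal sketch of that loading-and-predicting step might look like the following. The helper names `flag_articles` and `load_and_flag` are hypothetical, the file names are the ones used in Step 8, and the label convention (1 = true, 0 = fake) follows Step 2. New text should pass through the same cleaning function that was used before training, which is why an optional `clean` callable is accepted.

```python
def flag_articles(model, vectorizer, texts, clean=None):
    """Return (text, label_name) pairs, assuming 1 = true and 0 = fake as in Step 2."""
    # Apply the same cleaning that was used at training time, if one is supplied.
    prepared = [clean(text) for text in texts] if clean else list(texts)
    features = vectorizer.transform(prepared)
    predictions = model.predict(features)
    return [
        (text, "true" if int(pred) == 1 else "fake")
        for text, pred in zip(texts, predictions)
    ]


def load_and_flag(texts, model_path="fake_news_model.pkl",
                  vectorizer_path="tfidf_vectorizer.pkl", clean=None):
    """Load the saved artifacts from Step 8 and classify new texts."""
    import joblib  # imported lazily so flag_articles can be used without joblib

    model = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)
    return flag_articles(model, vectorizer, texts, clean)
```

For example, `load_and_flag(["Some new headline ..."], clean=clean_text)` would return a list with one `(text, label)` pair, provided the two `.pkl` files from Step 8 exist in the working directory.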
To work on fake news detection using machine learning, you'll need a suitable
dataset to train and test your models. Here are some popular datasets commonly
used for this purpose:
1. *LIAR Dataset*:
- Contains over 12,000 labeled short statements from PolitiFact, with labels like
pants-fire, false, barely-true, half-true, mostly-true, and true.
- Available at: [LIAR Dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip)
2. *Fake News Challenge (FNC-1) Dataset*:
- Contains roughly 50,000 labeled headline and article-body pairs in its training set,
with stance detection as the primary task.
- Available at: [FNC-1 Dataset](http://www.fakenewschallenge.org/)
3. *Kaggle Fake News Dataset*:
- Contains a mix of true and fake news articles.
- Available at: [Kaggle Fake News](https://www.kaggle.com/c/fake-news)
4. *BuzzFeed News Dataset*:
- Consists of political news articles labeled by BuzzFeed journalists as either true,
false, or mixed.
- Available at: [BuzzFeed News Dataset](https://github.com/BuzzFeedNews/2016-10-facebook-fact-check)
5. *ISOT Fake News Dataset*:
- Contains two CSV files: one for fake news and one for true news.
- Available at: [ISOT Dataset](https://www.uvic.ca/engineering/ece/isot/datasets/fake-news/index.php)
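The six LIAR labels listed above do not map directly onto the binary true/fake setup used in the walkthrough. One possible way to reuse that pipeline is to collapse them into two classes. Where to draw the line is a modelling decision, and the split below is only an assumption made for illustration. The helper name `to_binary` is likewise hypothetical.

```python
# The six LIAR labels, ordered from least to most truthful.
LIAR_LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]

# Assumed cut-off: the three least truthful labels count as fake.
DEFAULT_FAKE_LABELS = frozenset({"pants-fire", "false", "barely-true"})


def to_binary(label, fake_labels=DEFAULT_FAKE_LABELS):
    """Map a LIAR label to 0 (fake) or 1 (true), matching the 1 = true convention."""
    if label not in LIAR_LABELS:
        raise ValueError(f"Unknown LIAR label: {label!r}")
    return 0 if label in fake_labels else 1


binary_labels = [to_binary(label) for label in LIAR_LABELS]
```

Passing a different `fake_labels` set, for instance one that also includes `"half-true"`, makes the classifier stricter about what counts as true news.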
To start working on fake news detection, follow these steps:
1. *Data Collection*:
- Choose a dataset from the above options and download it.
2. *Data Preprocessing*:
- Clean the text data (remove punctuation, stop words, etc.).
- Tokenize the text.
- Convert text to numerical representations (e.g., TF-IDF, word embeddings).
3. *Model Selection*:
- Choose a machine learning model (e.g., logistic regression, SVM, random forest,
deep learning models like LSTM, BERT).
- Split your data into training and testing sets.
4. *Model Training*:
- Train your chosen model on the training data.
- Evaluate the model on the test data using appropriate metrics (accuracy,
precision, recall, F1-score).
5. *Model Evaluation*:
- Fine-tune your model based on the evaluation metrics.
- Consider cross-validation for a more robust evaluation.
6. *Deployment*:
- Once satisfied with the model performance, deploy it for real-time fake news
detection.
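The cross-validation mentioned in step 5 can be pictured as repeatedly re-splitting the data so that every sample is used for testing exactly once. The sketch below generates such splits with the standard library. `k_fold_indices` is an illustrative helper, and in a scikit-learn project `KFold` or `cross_val_score` would normally do this job.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Ten hypothetical samples split into five folds.
splits = list(k_fold_indices(10, k=5))
```

Training and scoring the model once per split, then averaging the scores, gives a steadier estimate than a single 80/20 split.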
This project was made by a student of
KGRL COLLEGE OF PG COURSES (A)
BHIMAVARAM
REPORT: I am Addala Teja Sri Seethalu (2302001), M.C.A. I am very interested in this
fake news detection project, and I now know how to tell real news from fake news.
Thank you, sir, for giving me this valuable opportunity.