
DETECTING FAKE NEWS USING MACHINE LEARNING AND DEEP LEARNING ALGORITHMS

Project Report
Submitted in partial fulfilment for the award of the degree of

MASTER OF COMPUTER APPLICATIONS

Submitted by
M. PAVAN KUMAR
(Regd. No. 18L35F0009)

Under the Guidance of


Mr. K. LEELA PRASAD
Assistant Professor
DEPARTMENT OF INFORMATION TECHNOLOGY

VIGNAN’S INSTITUTE OF INFORMATION TECHNOLOGY (Autonomous)
Affiliated to JNTU Kakinada & Approved by AICTE, New Delhi
Re-Accredited by NAAC (CGPA of 3.41/4.00)
ISO 9001:2008, ISO 14001:2004, OHSAS 18001:2007 Certified Institution
VISAKHAPATNAM – 530 039
April 2020

VIGNAN’S INSTITUTE OF INFORMATION TECHNOLOGY (A)


Department of Master of Computer Applications

CERTIFICATE

This is to certify that the project report entitled “DETECTING FAKE NEWS USING MACHINE LEARNING AND DEEP LEARNING ALGORITHMS” is a bonafide record of project work carried out under my supervision by M. PAVAN KUMAR (18L35F0009) during the academic year 2019-2020, in partial fulfilment of the requirements for the award of the degree of Master of Computer Applications of Jawaharlal Nehru Technological University, Kakinada. The results embodied in this project report have not been submitted to any other University or Institute for the award of any Degree or Diploma.

Head of the Department Signature of Project Guide


Dr. B. Prasad Mr. K. Leela Prasad
(Associate Professor) (Assistant Professor)

External Examiner

DECLARATION

I hereby declare that the project report entitled “DETECTING FAKE NEWS USING MACHINE LEARNING AND DEEP LEARNING ALGORITHMS” has been carried out by me and has not been submitted either in part or in whole for the award of any degree, diploma or any other similar title to this or any other university.

PLACE: VISAKHAPATNAM M. PAVAN KUMAR

DATE: (18L35F0009)

ACKNOWLEDGEMENT
It gives me a great sense of pleasure to acknowledge the assistance and cooperation I have received from several persons while undertaking this MCA final year project. I owe a special debt of gratitude to Mr. K. Leela Prasad, Asst. Prof., Department of Information Technology, for his constant support and guidance throughout the course of this work. His guidance has been a constant source of inspiration for me.

I also take the opportunity to acknowledge the contribution of the HOD, Dr. B. Prasad, Assoc. Prof., Department of Master of Computer Applications, for his full support and assistance during the development of the project.

I want to thank Dr. B. Arundhati, Principal of VIIT, and the Management for providing all the necessary facilities.

I also acknowledge the contribution of all faculty members of the department for their kind assistance and cooperation during the project.

M.PAVAN KUMAR

(18L35F0009)

ABSTRACT

This project applies NLP (Natural Language Processing) techniques to detect ‘fake news’, that is, misleading news stories that come from non-reputable sources. Building a model based only on a count vectorizer (using word tallies) or a TF-IDF (Term Frequency-Inverse Document Frequency) matrix (word tallies relative to how often they are used in other articles in the dataset) can only get you so far, because such models do not consider important qualities like word ordering and context. It is quite possible that two articles similar in their word counts are completely different in their meaning. The data science community has responded by taking action against the problem. This work therefore proposes assembling a dataset of both fake and real news and employing Naive Bayes, logistic regression and random forest classifiers in order to create a model that classifies an article as fake or real based on its words and phrases.

TABLE OF CONTENTS

S.NO  TITLE

1. INTRODUCTION
2. LITERATURE SURVEY
   2.0 Literature Review
   2.1 Existing System
   2.2 Proposed System
3. SYSTEM ANALYSIS AND DESIGN
   3.1 Proposed Models
   3.2 N-gram Model
   3.3 Data Pre-processing
   3.4 Features Extraction
   3.5 Classification Process
   3.6 Requirements Analysis
4. ALGORITHMS
5. METHODOLOGY
6. SAMPLE CODE
7. FORMS AND REPORTS
8. CONCLUSION
9. REFERENCES

CHAPTER-1
INTRODUCTION

1.1 INTRODUCTION:

The rise of fake news during the 2016 U.S. Presidential Election highlighted not only the dangers of the effects of fake news but also the challenges presented when attempting to separate fake news from real news. Fake news may be a relatively new term, but it is not necessarily a new phenomenon. Fake news has technically been around at least since the appearance and popularity of one-sided, partisan newspapers in the 19th century. However, advances in technology and the spread of news through different types of media have increased the spread of fake news today. As such, the effects of fake news have increased exponentially in the recent past, and something must be done to prevent this from continuing in the future.

I have identified the three most prevalent motivations for writing fake news and chosen only one as the target for this project, as a means to narrow the search in a meaningful way. The first motivation for writing fake news, which dates back to the 19th-century one-sided party newspapers, is to influence public opinion. The second, which requires more recent advances in technology, is the use of fake headlines as clickbait to raise money. The third motivation for writing fake news, which is equally prominent yet arguably less dangerous, is satirical writing. While all three subsets of fake news, namely (1) clickbait, (2) influential, and (3) satire, share the common thread of being fictitious, their widespread effects are vastly different. As such, this paper will focus primarily on fake news as defined by PolitiFact: “fabricated content that intentionally masquerades as news coverage of actual events.” This definition excludes satire, which is intended to be humorous and not deceptive to readers.

Most satirical articles come from sources like The Onion, which specifically distinguish themselves as satire, and satire can already be classified by machine learning techniques according to prior work. Therefore, our goal is to move beyond these achievements and use machine learning to classify, at least as well as humans, more difficult discrepancies between real and fake news.

The dangerous effects of fake news, as previously defined, are made clear by events such as the one in which a man attacked a pizzeria due to a widespread fake news article. This story, along with related analysis, provides evidence that humans are not very good at detecting fake news, possibly no better than chance. As such, the question remains whether or not machines can do a better job. There are two methods by which machines could attempt to solve the fake news problem better than humans.

The first is that machines are better at detecting and keeping track of statistics than humans; for example, it is easier for a machine to detect that the majority of verbs used are “suggests” and “implies” versus “states” and “proves.”

Additionally, machines may be more efficient in surveying a knowledge base to find all relevant articles and answering based on those many different sources. Either of these methods could prove useful in detecting fake news, but we decided to focus on how a machine can solve the fake news problem using supervised learning that extracts features of the language and content only within the source in question, without utilizing any fact-checking knowledge base. For many fake news detection techniques, a fake article published by a trustworthy author through a trustworthy source would not be caught. This approach would combat those “false negative” classifications of fake news. In essence, the task would be equivalent to what a human faces when reading a hard copy of a newspaper article, without internet access or outside knowledge of the subject (versus reading something online, where one can simply look up relevant sources). The machine, like a human reading in a coffee shop, will have access only to the words in the article and must use strategies that do not rely on blacklists of authors and sources.

The current project involves utilizing machine learning and Natural Language Processing techniques to create a model that can expose documents that are, with high probability, fake news articles. Many of the current automated approaches to this problem are centred around a “blacklist” of authors and sources that are known producers of fake news. But what about when the author is unknown, or when fake news is published through a generally reliable source? In these cases, it is necessary to rely simply on the content of the news article to make a decision on whether or not it is fake. By collecting examples of both real and fake news and training a model, it should be possible to classify fake news articles with a certain degree of accuracy.

1.1.1 Introduction to Fake News Analysis

The goal of this project is to find the effectiveness and limitations of language-based techniques for the detection of fake news through the use of machine learning algorithms, including but not limited to convolutional neural networks and recurrent neural networks.


In practice, a lot of sentences convey affect through underlying meaning rather than affect adjectives. For example, the text “My husband just filed for divorce and he wants to take custody of my children away from me” certainly evokes strong emotions, but uses no affect keywords and, therefore, cannot be classified using a keyword spotting approach.

Lexical affinity is slightly more sophisticated than keyword spotting: rather than simply detecting obvious affect words, it assigns arbitrary words a probabilistic ‘affinity’ for a particular emotion. For example, ‘accident’ might be assigned a 75% probability of indicating a negative affect, as in ‘car accident’ or ‘hurt by accident’. These probabilities are usually trained from linguistic corpora.

Though often outperforming pure keyword spotting, there are two main problems with the approach. First, lexical affinity, operating solely at the word level, can easily be tricked by sentences like “I avoided an accident” (negation) and “I met my girlfriend by accident” (other word senses). Second, lexical affinity probabilities are often biased toward text of a particular genre, dictated by the source of the linguistic corpora. This makes it difficult to develop a reusable, domain-independent model.

Statistical methods, such as Bayesian inference and support vector machines, have been popular for affect classification of texts. By feeding a machine learning algorithm a large training corpus of affectively annotated texts, it is possible for the system not only to learn the affective valence of affect keywords (as in the keyword spotting approach), but also to take into account the valence of other arbitrary keywords (as in lexical affinity), punctuation, and word co-occurrence frequencies. However, traditional statistical methods are generally semantically weak, meaning that, with the exception of obvious affect keywords, other lexical or co-occurrence elements in a statistical model have little predictive value individually. As a result, statistical text classifiers only work with acceptable accuracy when given a sufficiently large text input. So, while these methods may be able to affectively classify a user’s text at the page or paragraph level, they do not work well on smaller text units such as sentences or clauses.

The rapid growth of opinion sharing on social media has led to an increased interest in sentiment analysis of social media texts. Sentiment analysis can provide invaluable insights, ranging from product reviews to capturing trending topics to designing business models for targeted advertisements. Many organizations today rely heavily on sentiment analysis of social media texts to monitor
the performance of their products and take the user feedback into account while upgrading to newer
versions.


Social media texts are informal, with several linguistic differences. In multilingual societies like India, users generally combine the prominent language, like English, with their native languages. This process of switching texts between two or more languages is referred to as code-mixing. Millions of internet users in India communicate by mixing their regional languages with English, which generates an enormous amount of code-mixed social media text. One such popular combination is the mixing of Hindi and English, resulting in Hindi-English (Hi-En) code-mixed data. For example, “yeh gaana bohut super hai” (“this song very super is”), meaning “this is a superb song”, is a Hi-En code-mixed text.

Apart from several existing challenges, such as the presence of multiple entities in the text and sarcasm detection, code-mixing brings with it many other unique challenges. The linguistic complexity of code-mixed content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. Along with diverse sentence constructions, words in Hindi can have multiple variations when written in English, which leads to a large number of sparse and rare tokens. For instance, “pyaar” (love) can be written as “peyar”, “pyar”, “piyar”, “piyaar”, “pyaarrrr”, etc.

Code-mixing is a well-known problem in the field of NLP. Researchers have put in efforts on language identification, POS tagging and Named Entity Recognition of code-mixed data. Over the past years, researchers have established deep neural network based state-of-the-art models for sentiment analysis of English data. For the problem of sentiment analysis of Hi-En code-mixed data, subword-level representations in LSTMs have shown promising results. However, since code-mixed data is noisy in nature and the available datasets are too small to tune deep learning models, we hypothesize that n-gram based traditional models should be able to assist deep learning based models in improving the overall accuracy of sentiment analysis of code-mixed data.

In this project, we propose an ensemble model in which we combine the outputs of a character-trigram based LSTM model and a word n-gram based MNB model to predict the sentiment of Hi-En code-mixed texts. While the LSTM model encodes deep sequential patterns in the text, MNB captures low-level word combinations of keywords to compensate for the grammatical inconsistencies.

1.2 EXISTING SYSTEM

A number of features have been proposed and evaluated in view of the impact of features on the accuracy of the classifier. Part-of-speech (POS) tags and Linguistic Inquiry and Word Count features, in combination with unigrams and bigrams, are used. In the existing work, the system uses only semi-supervised learning. It performs only text classification on sentiment and never detects fake news.

Disadvantages:

In the existing work, the system uses only semi-supervised learning.

It performs only text classification on sentiment and never detects fake news.

1.3 PROPOSED SYSTEM

In the proposed system, each news item goes through a tokenization process first. Then, unnecessary words are removed and candidate feature words are generated. Each candidate feature word is checked against the dictionary and, if its entry is available in the dictionary, its frequency is counted and added to the column in the feature vector that corresponds to the numeric map of the word. Along with counting frequency, the length of the review is measured and added to the feature vector. Finally, the sentiment score that is available in the data set is added to the feature vector. We have assigned negative sentiment a value of zero and positive sentiment some positive value in the feature vector. The system is very fast and effective due to semi-supervised and supervised learning. It focuses on review-content based approaches. As features we have used word frequency count, sentiment polarity and length of review. A minimal sketch of this feature-vector construction is shown below.
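The following is a minimal illustrative sketch of the feature-vector construction described above. The dictionary contents, the helper name build_feature_vector and the example text are assumptions for illustration, not the project's actual implementation.

# Illustrative sketch only: `dictionary` maps feature words to column indices,
# and `sentiment` is the data set's sentiment label (0 = negative, positive value otherwise).
import re

def build_feature_vector(text, dictionary, sentiment):
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenization
    vector = [0] * (len(dictionary) + 2)            # word counts + length + sentiment
    for tok in tokens:
        if tok in dictionary:                       # candidate feature word found in dictionary
            vector[dictionary[tok]] += 1            # frequency added to the word's mapped column
    vector[-2] = len(tokens)                        # length of the review
    vector[-1] = sentiment                          # sentiment score from the data set
    return vector

dictionary = {"fake": 0, "breaking": 1, "official": 2}
print(build_feature_vector("Breaking: official report is FAKE fake news", dictionary, 0))
# -> [2, 1, 1, 7, 0]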


1.4 Architecture of Proposed System

Fig 1.4: Architecture of Proposed System

Advantages:

The system is very fast and effective due to semi-supervised and supervised learning.
It focuses on review-content based approaches. As features we have used word frequency count, sentiment polarity and length of review.

CHAPTER-2

LITERATURE SURVEY


There are some existing applications, like BS Detector and PolitiFact, which to some extent help users identify misleading news, but they require human intervention; moreover, the domain is limited in the case of BS Detector, which does not give the user the extent to which an article is fake.

Prior approaches use linguistic cues and network analysis to design a basic fake news detector which provides high accuracy in terms of classification tasks. They propose a hybrid system with features like multi-layer linguistic processing and the addition of network behavior. They propose a method to detect online deceptive text by using a logistic regression classifier, which is based on POS tags extracted from a corpus of deceptive and truthful texts, and achieve an accuracy of 72%, which could be further improved by performing cross-corpus analysis of classification models and reducing the size of the input feature vector.

2.1 Existing System

Detecting fake news on social media has been presented from a data mining perspective, which includes fake news characterization based on psychology and social theories. This work discusses two major factors responsible for the widespread acceptance of fake news by users: Naive Realism and Confirmation Bias. Further, it proposes a two-phase general data mining framework which includes 1) Feature Extraction and 2) Model Construction, and discusses the datasets and evaluation metrics for fake news detection research. It also proposes an SVM-based algorithm with five predictive features, i.e. Absurdity, Humour, Grammar, Negative Affect, and Punctuation, and uses satirical cues to detect misleading news. The work translates theories of humor, irony, and satire into a predictive model for satire detection with 87% accuracy.

Another work proposes a new model for fake news detection using stance detection and the TF-IDF method for analyzing data taken from various datasets of fake and legitimate news, with a Random Forest classifier for classifying the output into four classes, namely True, Fake, Mostly True, and Mostly Fake. Using Random Forest gives the advantage of handling binary features; moreover, Random Forests do not expect linear features.


2.2 Proposed System

In [1], Shloka Gilda presented a concept of how NLP is relevant to detecting fake news. They used term frequency-inverse document frequency (TF-IDF) of bi-grams and probabilistic context-free grammar (PCFG) detection. They tested their dataset over multiple classification algorithms to find the best model.

They find that TF-IDF of bi-grams fed into a Stochastic Gradient Descent model identifies non-credible sources with an accuracy of 77.2%.

In [2], Mykhailo Granik proposed a simple technique for fake news detection using a Naive Bayes classifier. They used BuzzFeed news for training and testing the Naive Bayes classifier. The dataset was taken from Facebook news posts, and they achieved an accuracy of up to 74% on the test set.

In [3], Cody Buntain developed a method for automating fake news detection on Twitter. They applied this method to Twitter content sourced from BuzzFeed’s fake news dataset. Furthermore, leveraging non-professional, crowdsourced workers instead of journalists presents a useful and much less costly way to classify true and fake stories on Twitter rapidly.

In [4], Marco L. Della offered a paper which helps us understand how social networks and machine learning (ML) techniques can be used for fake news detection. They used a novel ML fake news detection method, implemented this approach within a Facebook Messenger chatbot and validated it with a real-world application, obtaining a fake news detection accuracy of 81.7%.

In [5], Rishabh Kaushal applied three learning algorithms, namely Naive Bayes, Clustering and Decision Trees, on a number of tweet-level and user-level features such as Followers/Followees, URLs, Spam Words, Replies and HashTags. Improvement in spam detection is measured on the basis of overall Accuracy, Spammer Detection Accuracy and Non-Spammer Detection Accuracy.


In [6], Saranya Krishnan used an advanced framework to identify fake news content. Initially, they extracted content features and user features via the Twitter API.

Then features including statistical analysis of Twitter user accounts, reverse image searching and verification of fake news sources are used by data mining algorithms for classification and analysis.

[7] detects fake news through various machine learning models. The machine learning models implemented are a Naive Bayes classifier and a support vector machine. No specific accuracy was recorded, as only the models were discussed.

[8] detects whether given tweets are credible or not. The machine learning models implemented are Naive Bayes classifiers, decision trees, support vector machines and neural networks. With both tweet and user features, the best F1 score is 0.94. Higher accuracy could have been attained by taking non-credible news into account.

[9] presents a method for automating fake news detection on Twitter by learning to predict accuracy assessments in two credibility-focused Twitter datasets. The accuracy rate of the given models is 70.28%. The main limitation lies in the structural difference between CREDBANK and PHEME, which could affect model transfer.

[10] Wang, Guan; Xie, Sihong; Liu, Bing; Yu, Philip S., “Review graph based online store review spammer detection” (2011), used a Bayesian approach and formulated a clustering problem for opinion spam detection.

[11] Sun, Chengai; Du, Qiaolin; Tian, Gang, “Exploiting product related review features for fake review detection”, Mathematical Problems in Engineering (2016), is among other works in the literature that used the Support Vector Machine (SVM) as a classifier and took a supervised learning approach.


CHAPTER-3

SYSTEM ANALYSIS AND DESIGN


Fig 3.0: System Design


Fig 3.0.1: Flow Chart


Proposed Models

3.1 N-gram Model: N-gram modeling is a popular feature identification and analysis approach used in language modeling and natural language processing. An n-gram is a contiguous sequence of items of length n. It could be a sequence of words, bytes, syllables, or characters. The most used n-gram models in text categorization are word-based and character-based n-grams. In this work, we use word-based n-grams to represent the context of the document and generate features to classify the document. We develop a simple n-gram based classifier to differentiate between fake and honest news articles. The idea is to generate various sets of n-gram frequency profiles from the training data to represent fake and truthful news articles. We used several baseline n-gram features based on words and examined the effect of the n-gram length on the accuracy of different classification algorithms. A short sketch of n-gram feature generation is shown below.
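As a short sketch of the idea (assuming scikit-learn, which the sample code in Chapter 6 also uses), word-based n-gram frequency profiles can be generated as follows; the toy documents are illustrative only.

# Sketch: word-based unigram + bigram frequency profiles with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the senator stated the budget figures",
        "shocking secret the senator wants hidden"]

vect = CountVectorizer(ngram_range=(1, 2))  # ngram_range controls the n-gram length
X = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # the n-gram vocabulary
print(X.toarray())                   # one n-gram frequency profile per document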

3.2 Data Pre-processing: Before representing the data using n-gram and vector-based models, the data need to be subjected to certain refinements like stop-word removal, tokenization, lowercasing, sentence segmentation, and punctuation removal. This helps reduce the size of the actual data by removing the irrelevant information that exists in it. We created a generic processing function to remove punctuation and non-letter characters from each document; then we lowered the letter case in the document. In addition, an n-gram word-based tokenizer was created to slice the text based on the length of n.

Stop Word Removal: Stop words are insignificant words in a language that create noise when used as features in text classification. These are words commonly used in sentences to help connect thoughts or to assist in the sentence structure. Articles, prepositions, conjunctions and some pronouns are considered stop words. We removed common words such as a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, too, was, what, when, where, who, will, etc. Those words were removed from each document, and the processed documents were stored and passed on to the next step.

Stemming: After tokenizing the data, the next step is to transform the tokens into a standard form. Stemming simply means changing words into their original form, decreasing the number of word types or classes in the data. For example, the words “Running”, “Ran” and “Runner” will be reduced to the word “run.” We use stemming to make classification faster and more efficient. Furthermore, we use the Porter stemmer, which is among the most commonly used stemming algorithms due to its accuracy. A minimal pre-processing sketch follows.
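A minimal pre-processing sketch, assuming NLTK's stop-word list and Porter stemmer (both imported in the sample code of Chapter 6); the project's actual processing function may differ.

# Sketch: punctuation removal, lowercasing, stop-word removal and Porter stemming.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords')  # required once
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()            # keep letters only, lowercase
    tokens = [t for t in text.split() if t not in stop_words]   # drop stop words
    return [stemmer.stem(t) for t in tokens]                    # reduce words to their stems

print(preprocess("The runners were running, and they ran fast!"))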


3.3 Features Extraction: One of the challenges of text categorization is learning from high-dimensional data. There is a large number of terms, words, and phrases in documents, which leads to a high computational burden for the learning process. Furthermore, irrelevant and redundant features can hurt the accuracy and performance of the classifiers. Thus, it is best to perform feature reduction to reduce the text feature size and avoid a large feature space dimension. We studied in this research two different feature selection methods, namely Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). These methods are described in the following.

Term Frequency (TF): Term Frequency is an approach that utilizes the counts of words appearing in the documents to figure out the similarity between documents. Each document is represented by an equal-length vector that contains the word counts. Next, each vector is normalized in such a way that the sum of its elements adds to one. Each word count is then converted into the probability of such a word existing in the documents. For example, if a word is in a certain document it will be represented as one, and if it is not in the document, it will be set to zero. Thus, each document is represented by groups of words.

TF-IDF: The Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting metric often used in information retrieval and natural language processing. It is a statistical metric used to measure how important a term is to a document in a dataset. A term's importance increases with the number of times the word appears in the document; however, this is counteracted by the frequency of the word in the corpus. One of the main characteristics of IDF is that it weighs down frequent terms while scaling up rare ones. For example, words such as “the” and “then” often appear in text, and if we only use TF, terms such as these will dominate the frequency count. However, using IDF scales down the impact of these terms. A brief sketch contrasting TF and TF-IDF follows.
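A brief sketch contrasting raw term frequency with TF-IDF weighting, again using scikit-learn; the toy corpus is illustrative.

# Sketch: TF counts versus TF-IDF weights on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the election was rigged the media lied",
        "the election results were certified today",
        "the committee certified the budget"]

tf = CountVectorizer().fit_transform(docs)    # raw counts: "the" dominates every document
tfidf_vect = TfidfVectorizer().fit(docs)
tfidf = tfidf_vect.transform(docs)            # IDF scales down words common to all documents

weights = dict(zip(tfidf_vect.get_feature_names_out(), tfidf.toarray()[0].round(2)))
print(weights)  # with plain TF, "the" counts double "rigged"; IDF narrows that gap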

3.4 Classification Process: The process starts with pre-processing the data set by removing unnecessary characters and words from the data. N-gram features are extracted, and a features matrix is formed representing the documents involved. The last step in the classification process is to train the classifier. We investigated different classifiers to predict the class of the documents, specifically the following machine learning algorithms: Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Linear Support Vector Machines (LSVM), K-Nearest Neighbour (KNN) and Decision Trees (DT). We used implementations of these classifiers from the Python Natural Language Toolkit (NLTK). We split the dataset into training and testing sets. For instance, in the experiments presented subsequently, we use 5-fold cross validation, so in each validation round about 80% of the dataset is used for training and 20% for testing. An illustrative 5-fold split is sketched after the figure below.

Fig 3.4.1: Classification process
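As an illustration of the 5-fold split described above (assuming scikit-learn; the toy texts and the SGD classifier configuration are placeholders, not the project's tuned setup):

# Sketch: 5-fold cross validation, so each fold trains on ~80% and tests on ~20%.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["shocking secret claim number %d" % i for i in range(5)] + \
        ["official report states fact %d" % i for i in range(5)]
labels = ["FAKE"] * 5 + ["REAL"] * 5

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
scores = cross_val_score(SGDClassifier(random_state=0), X, labels, cv=5)
print(scores.mean())  # average accuracy over the 5 validation folds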

3.5 REQUIREMENTS ANALYSIS:

HARDWARE REQUIREMENTS:
Processor: Core i3 or higher
Speed: 2.1 GHz
RAM: 4 GB (min)
Hard Disk: 200 GB

SOFTWARE REQUIREMENTS:
Operating system: Windows 8
Coding language: Python
IDE: Jupyter Notebook

CHAPTER-4
ALGORITHMS


4.0 Logistic Regression: Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes. The LR model uses gradient descent to converge onto the optimal set of weights (θ) for the training set. For our model, the hypothesis used is the sigmoid function: hθ(x) = 1 / (1 + e^(−θᵀx)). A compact illustrative sketch follows.
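A compact sketch of this hypothesis and the gradient descent update (illustrative only; the experiments themselves use scikit-learn's LogisticRegression, as in Chapter 6):

# Sketch: sigmoid hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x)) with gradient descent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, epochs=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)   # gradient of the log-loss
        theta -= lr * grad                               # descend toward the optimal weights
    return theta

X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])   # first column is the bias term
y = np.array([0., 0., 1., 1.])
theta = fit_logreg(X, y)
print(sigmoid(X @ theta).round(2))   # probabilities, mapped to classes at the 0.5 threshold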

4.1 Support Vector Machine (SVM): A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression purposes. SVMs are mostly used in classification problems. SVMs are founded on the idea of finding a hyperplane that best divides a dataset into two classes. Support vectors are the data points nearest to the hyperplane, the points of a data set that, if deleted, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set. The distance between the hyperplane and the nearest data point from either set is known as the margin. The aim is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a higher chance of new data being classified correctly.


4.2 Naive Bayes Classification: In machine learning, Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers. The Naive Bayes classifier picks the class c that maximizes the posterior P(c|x) = P(x|c) · P(c) / P(x).

A pseudo-count is implemented in every probability estimate. This ensures that no probability will be zero; it is a way of regularizing Naive Bayes. The pseudo-count α > 0 is the smoothing parameter (α = 0 corresponds to no smoothing). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will lie between the empirical estimate xi / N and the uniform probability 1/d. Most of the time α is taken as 1, but a smaller value can also be chosen depending on the requirements. The frequency-based probability might introduce zeros when multiplying the probabilities, leading to a failure in preserving the information contributed by the non-zero probabilities. Therefore, a smoothing approach, for example Lidstone smoothing, must be adopted to counter this problem. After addressing these problems, the Naive Bayes classifier can be used to obtain reasonable results; a smoothing approach increases the accuracy of the task being attempted. It has also been seen that the Naive Bayes classifier is both simple and powerful for Natural Language Processing (NLP) tasks such as text classification problems. A small numeric sketch of additive smoothing is given below.
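A small numeric sketch of additive (Lidstone) smoothing; the function name is illustrative. The smoothed estimate lies between the empirical xi / N and the uniform 1/d, and no probability is zero.

def smoothed_probs(counts, alpha=1.0):
    N, d = sum(counts), len(counts)
    return [(x + alpha) / (N + alpha * d) for x in counts]

counts = [3, 0, 1]                       # a zero count would zero out the product of probabilities
print(smoothed_probs(counts, alpha=0))   # [0.75, 0.0, 0.25] -- no smoothing, the zero survives
print(smoothed_probs(counts, alpha=1))   # approx. [0.57, 0.14, 0.29] -- no zero probability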

4.3 Evaluation Metrics: To evaluate the performance of algorithms for the fake news detection problem, various evaluation metrics have been used. In this subsection, we review the most widely used metrics for fake news detection. Most existing approaches treat the fake news problem as a classification problem that predicts whether a news article is fake or not:

• True Positive (TP): when predicted fake news pieces are actually annotated as fake news;

• True Negative (TN): when predicted true news pieces are actually annotated as true news;

• False Negative (FN): when predicted true news pieces are actually annotated as fake news;

• False Positive (FP): when predicted fake news pieces are actually annotated as true news.

By formulating this as a classification problem, we can define following metrics,

Precision = |TP| / (|TP| + |FP|)   (2)

Recall = |TP| / (|TP| + |FN|)   (3)

F1 = 2 · Precision · Recall / (Precision + Recall)   (4)

Accuracy = (|TP| + |TN|) / (|TP| + |TN| + |FP| + |FN|)   (5)

These metrics are commonly used in the machine learning community and enable us to evaluate the performance of a classifier from different perspectives. Specifically, accuracy measures the agreement between predicted fake news and actual fake news. Precision measures the fraction of all detected fake news that is annotated as fake news, addressing the important problem of identifying which news is fake. However, because fake news datasets are often skewed, a high precision can easily be achieved by making fewer positive predictions. Thus, recall is used to measure the sensitivity, or the fraction of annotated fake news articles that are predicted to be fake news. F1 is used to combine precision and recall, which can provide an overall prediction performance for fake news detection. Note that for Precision, Recall, F1, and Accuracy, the higher the value, the better the performance. A worked example of equations (2)-(5) is given below.
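A worked example of equations (2)-(5); the confusion-matrix counts are hypothetical.

# Hypothetical confusion-matrix counts for a fake news classifier.
TP, TN, FP, FN = 80, 70, 20, 30

precision = TP / (TP + FP)                           # eq. (2): 0.80
recall = TP / (TP + FN)                              # eq. (3): ~0.73
f1 = 2 * precision * recall / (precision + recall)   # eq. (4): ~0.76
accuracy = (TP + TN) / (TP + TN + FP + FN)           # eq. (5): 0.75

print(precision, round(recall, 2), round(f1, 2), accuracy)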


CHAPTER-5

METHODOLOGY


5.1 Sentence-Level Baselines
I have run baselines consisting of multi-class classification done via logistic regression and support vector machines. The features used were n-grams and TF-IDF. N-grams are consecutive groups of words, up to size “n”. For example, bi-grams are pairs of words seen next to each other. Features for a sentence or phrase are created from n-grams by having a vector that is the length of the new “vocabulary set,” i.e. it has a spot for each unique n-gram that receives a 0 or 1 based on whether or not that n-gram is present in the sentence or phrase in question. TF-IDF stands for term frequency-inverse document frequency. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. As a feature, TF-IDF can be used for stop-word filtering, i.e. discounting the value of words like “and,” “the,” etc., whose counts likely have no effect on the classification of the text. An alternative approach is removing stop words (as defined in various packages, such as Python's NLTK). The results of this preliminary evaluation are found in Table 5.1.

Table 5.1: Preliminary Baseline Results


Additionally, we explored some of the characteristic n-grams that may influence Logistic Regression and other classifiers. In calculating the most frequent n-grams for “pants-fire” phrases and those of “true” phrases, we found that the word “wants” more frequently appears in “pants-fire” (i.e. fake news) phrases and the word “states” more frequently appears in “true” (i.e. real news) phrases. Intuitively this makes sense, because it is easier to lie about what a politician wants than to lie about what he or she has stated, since the former is more difficult to confirm. This observation motivates the experiments in Section 5.2, which aim to find a set of similarly intuitive patterns in the body texts of fake news and real news articles. A sketch of how such per-class n-gram counts can be computed is shown below.
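A sketch of how such per-class frequent n-grams could be computed; the helper and toy phrases are illustrative, not the project's dataset.

from collections import Counter

def top_ngrams(texts, n=1, k=3):
    counts = Counter()
    for t in texts:
        words = t.lower().split()
        counts.update(zip(*[words[i:] for i in range(n)]))   # all n-grams as tuples
    return counts.most_common(k)

pants_fire = ["he wants to ban all news", "she wants more power"]
true_phrases = ["the senator states the budget", "the report states the facts"]
print(top_ngrams(pants_fire))    # "wants" surfaces for fake news phrases
print(top_ngrams(true_phrases))  # "states" surfaces for real news phrases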

5.2 Document-Level

Deep neural networks have shown promising results in NLP for other classification tasks. CNNs are well suited for picking up multiple patterns, but sentences do not provide enough data for this to be useful. Accordingly, a CNN baseline modelled off one described for NLP did not show a large improvement in accuracy on this task using the Liar dataset; this is due to the lack of context provided in sentences. Not surprisingly, the same CNN's performance on the full body-text datasets we created was much higher.

5.2.1 Tracking Important Trigrams

The nature of this project was to decide if and how machine learning could be useful in detecting patterns characteristic of real and fake news articles. In accordance with this purpose, we did not attempt to build deeper and better neural nets in order to improve performance, which was already much higher than expected. Instead, we took steps to analyze the most basic neural net. We wanted to learn what patterns it was learning that resulted in such a high accuracy in classifying fake and real news.

If a human were to take on the task of picking out phrases that indicate fake or real news, they might follow published guidelines; such guidelines often encourage readers to look for evidence supporting claims, because fake news claims are often unbacked by evidence. Likewise, these guidelines encourage people to read the full story, looking for details that seem “far-fetched.” Figures 5.1 and 5.2 show examples of the phrases a human might pick up on to decide if an article is fake or real news. We were curious to see if a neural net might pick up on similar patterns.

Figure 5.1: Which trigrams might a human find indicative of real news?


Figure 5.2: Which trigrams might a human find indicative of fake news?

The best way to do this was to simplify the network so that it had only one filter size. The network was originally tuned to learn filter sizes 3, 4, and 5. With this intricacy, the model was able to learn overlapping segments. For example, the 4-gram “Donald Trumps presidential election” could be learned in addition to the trigrams “Donald Trumps presidential” and “Trumps presidential election”. To avoid this overlapping, we simplified the network to only look at filter size 3, i.e. trigrams. We found that this did not cause a significant drop in accuracy; there was less than a half-percent decrease in accuracy from the model with filter sizes = [3,4,5] to the model with filter sizes = [3]. We limited the data to 1000 words because less than ten percent of the data was over this limit, and we found that, most of the time, when an article was longer than 1000 words it contained excess information at the end that was not relevant to the article itself. For example, lengthy ads were sometimes found at the end of articles, causing them to go over 1000 words. There were no noticeable drops in accuracy across trials when we restricted the document length to 1000 words.

In order to obtain the trigrams that were most important in the classification decision, we essentially had to back-propagate from the output layer to the raw data (i.e. the actual body text being classified), as seen in Figures 5.3, 5.4, 5.5 and 5.6. For any text being evaluated by the CNN, we can find the trigrams that were “most fake” and “most real” by looking at weight_i × activation_i for each individual neuron i when that text was evaluated. I will explain the process for finding the most real trigrams; the same process can be used to find the most fake trigrams. The only difference is which column of the 2 columns in each layer you choose to look at. The first step in this process is looking at the max-pool layer, where you will find a down-sampled version of the convolutional layer (see Figure 5.4). Each of the 128 values is selected as the max of 998 values in the previous layer, i.e. one dimension in the output of the convolutional layer with the ReLU function applied. Due to the dropout probability, we expect that a different pattern will cause the highest activation for each of these neurons. As such, the max-pool layer represents the value of the trigram that was closest to this pattern and made the neuron's activation the highest. Each value in the max-pool layer is representative of the neuron i's weight_i × activation_i for that text. Therefore, we can select the neurons with the highest (most positive) weight_i × activation_i to ultimately find the “most real” trigrams, or we can select the neurons with the lowest (most negative) weight_i × activation_i to ultimately find the “least real” trigrams. A numpy sketch of these steps follows the figures below.

Now, we have 998 values to look at. One of these values was chosen to be the max-pooled value, so we must look at all of them and find the match. Once we find the matching number, we have its index. Its index is representative of the trigram index in the original text. So if the index is 0, we look at the first trigram (words at indices 0, 1, and 2), and if the index is 1, we look at the second trigram (words at indices 1, 2 and 3).

Figure 5.3: The output layer of the CNN where the higher value indicates the final classification
of the text


Figure 5.4: Step 1: The max-pool values hold weight_i × activation_i for each of the neurons i.


Figure 5.5: Step 2: Find the index of the max pooled value from Step 1 in the convolutional layer.


Figure 5.6: Step 3: The index in the convolutional layer found in Step 2 represents which of the 998 trigrams caused the max-pooled value from Step 1. Use that same index to find the corresponding trigram.
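The three steps above can be sketched in numpy as follows. The shapes (998 trigram positions, 128 filters) follow the text, while the variable names conv_out, weights and tokens are illustrative assumptions, not the project's actual code.

import numpy as np

def most_real_trigram(conv_out, weights, tokens):
    # conv_out: (998, 128) ReLU conv activations; weights: (128,) "real"-column output weights
    pooled = conv_out.max(axis=0)            # Step 1: max pool over the 998 positions
    scores = weights * pooled                # weight_i * activation_i for each neuron i
    i = int(scores.argmax())                 # neuron contributing most to the "real" output
    pos = int(conv_out[:, i].argmax())       # Step 2: position whose value was max-pooled
    return tokens[pos:pos + 3]               # Step 3: index pos maps to words pos, pos+1, pos+2

rng = np.random.default_rng(0)
tokens = ["w%d" % i for i in range(1000)]
print(most_real_trigram(rng.random((998, 128)), rng.normal(size=128), tokens))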


5.3 Topic Dependency

As we suspected from the makeup of the dataset, which can be seen from Figure 5.7 (a general overview of the makeup of both datasets), there is a significant difference in the subjects being written about in fake news and real news, even in the same time range with the same current events going on. More specifically, you can see that the concentration of articles that involve “Hillary”, “Wikileaks”, and “republican” is higher in fake news than it is in real news. This is not to say that these words did not appear in real news, but they were not among the “most frequent” words there. Additionally, words like “football” and “love” appear very frequently in the real news dataset, but these are topics that you can imagine would not be written about, or would rarely be written about, in fake news. The “hot topics” of fake news present another issue in this task. We do not want a model that simply chooses a classification based on the probability that a fake or real news article would be written on that topic, just as we would never tell a person that every article written about Hillary is fake news or that every article written about love is real news.

The way we accounted for these differences in the dataset was by separating our training and test sets based on the presence/absence of certain words. We tried this for a number of topics that were present in both fake news and real news but had different proportions in the two categories. The words we chose were “Trump”, “election”, “war”, and “email.”

To create a model that was not biased by the presence of one of these words, we extracted all body texts which did not contain that word. We used this set as the training set. Then, we used the remaining body texts that did contain the target word as the test set. The accuracy of the model on the test set represents transfer learning, in the sense that the model was trained on a number of articles about topics other than the target word and had to use what it learned to classify texts about the target word. The accuracies were still quite high, as demonstrated in Section 5. This shows that the model was learning patterns of language other than those specific words. This could mean that it learned similar words because of the word embeddings, or it could mean that it learned completely different words to “pay attention” to, or both. A sketch of this keyword-based split is shown below.
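A sketch of this keyword-based split, assuming a dataframe with text and label columns as in the sample code chapter; the toy rows are placeholders.

import pandas as pd

def topic_split(df, word):
    has_word = df["text"].str.contains(word, case=False)
    return df[~has_word], df[has_word]   # train on articles without the word, test on the rest

df = pd.DataFrame({"text": ["Trump wins vote", "budget passes", "war ends"],
                   "label": ["FAKE", "REAL", "REAL"]})
train, test = topic_split(df, "Trump")
print(len(train), len(test))             # 2 1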


Figure 5.7: Words exclusively common to one category (Fake/Real). (a) Fake news frequent words; (b) real news frequent words.

5.4 Cleaning

Pre-processing data is a normal first step before training and evaluating the data using a neural network. Machine learning algorithms are only as good as the data you are feeding them. It is crucial that data is formatted properly and meaningful features are included in order to have sufficient consistency that will result in the best possible results. For computer vision machine learning algorithms, pre-processing the data involves many steps, including normalizing image inputs and dimensionality reduction. The goal of these is to take away some of the unimportant distinguishing features between different images, such as darkness or brightness.

The task of pre-processing data is often an iterative task rather than a linear one. This was the case in this project, where we used a new and not yet standardized dataset. As we found certain unmeaningful features that the neural net was learning, we learned what more we needed to pre-process from the data.

5.5 Non-English Word Removal

Two observations that led us to more pre-processing were the presence of run-on words and proper nouns in the most important trigrams for classification. An example of a run-on word that we saw frequently in the “most fake” trigram category was “NotMyPresident”, which came from a trending “hashtag” on Twitter. There were also decisive trigrams that were simply proper nouns like “Donald J Trump.” Proper nouns could not possibly be helpful in a meaningful way to a machine learning algorithm trying to detect language patterns indicative of real or fake news. We want our algorithm to be agnostic to the subject material and make a decision based on the types of words used to describe whatever the subject is. Another algorithm might aim to fact-check statements in news articles. In that situation, it would be important to maintain the proper nouns/subjects, because changing the proper noun in the sentence “Donald J. Trump is our current president” to “Hillary Clinton is our current president” changes the classification of true fact to false fact. However, our purpose is not fact checking but rather language pattern checking, so removal of proper nouns should aid in pointing the machine learning algorithms in the right direction as far as finding meaningful features.

We removed “non-English” words by using PyEnchant's version of the English dictionary. This also accounted for the removal of digits, which should not be useful in this classification task, and websites. While links to websites may be useful in classifying the page rank of an article, they are not useful for the specific tool we were trying to create. A minimal sketch is shown below.
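A minimal sketch of the idea with PyEnchant (pip install pyenchant); the filtering rule shown is an assumption about how the dictionary check could be applied, not the project's exact code.

import enchant

d = enchant.Dict("en_US")   # PyEnchant's English dictionary

def keep_english(text):
    # drop tokens containing digits/punctuation and words the dictionary does not know
    return " ".join(w for w in text.split() if w.isalpha() and d.check(w.lower()))

print(keep_english("NotMyPresident was shared 5000 times on twitter.com today"))
# -> "was shared times on today"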


5.6 Source Pattern Removal

Another observation was that the two real news sources had some specific patterns that were easily learnable by the machine learning algorithms. This was more of an issue with the real news sources than the fake news sources, because there were many more fake news sources than real news sources. More specifically, there were 244 fake news sources and only 128 neurons, so the algorithm could not simply attune one neuron to each fake news source's patterns. There were only two real news sources, however. Therefore, the algorithm was able to pick up easily on the presence or absence of these patterns and use that, without much help from other words or phrases, to classify the data.

There were a few separate steps in removing patterns from the real news sources. The New York Times articles of a particularly common section often started off with “Good morning. (or evening) Here's what you need to know:”. This, along with other repeated sentences, was always in italics. To account for the lack of consistency in the exact sentences that were repeated, we had to scrape the data again from the URLs and remove anything that was originally in italics. Another repeated pattern in the New York Times articles was parenthetical questions with links to sign up for emails, for example “Want to get California Today by email? Sign up.” Another pattern was in The Guardian, where articles almost always ended with “Share on Facebook Share on Twitter Share via Email Share on LinkedIn Share on Pinterest Share on Google+ Share on WhatsApp Share on Messenger Reuse this content”, which is the result of links/buttons at the bottom of the webpage to share the article. When removing the non-English words, we were left with “on on on on on this content”, which was enough of a pattern to force the model to learn classification almost solely based on its presence.

Note that this was a particularly strong pattern because it was consistent throughout the Guardian articles from all sections of the Guardian. Also, the majority of articles in our real news set are from the Guardian. A sketch of this pattern stripping is shown below.
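A sketch of this pattern stripping; the exact regular expressions used in the project are not recorded, so those below are assumptions built from the examples in the text.

import re

SOURCE_PATTERNS = [                                  # assumed regexes from the examples above
    r"Good (morning|evening)\. Here's what you need to know:",
    r"Want to get California Today by email\? Sign up\.",
    r"Share (on|via) \w+\s*",                        # Guardian share-button residue
    r"Reuse this content",
]

def strip_source_patterns(text):
    for pat in SOURCE_PATTERNS:
        text = re.sub(pat, " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_source_patterns("Good morning. Here's what you need to know: Votes were counted. "
                            "Share on Facebook Share on Twitter Reuse this content"))
# -> "Votes were counted."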


5.7 Describing Neurons:

Although the accuracy was high in the classification task even after extensive pre-processing of the data, we wanted a way to more qualitatively evaluate how and what the neural net was learning for the classification. Understanding and visualizing the way a CNN encodes information is an ongoing question. It is an infinitely more challenging problem when there is more than one convolutional layer, which is why we kept our neural net shallow. For CNNs with one convolutional layer, prior work shows a way to visualize any single CNN neuron as a filter in the first layer, in terms of the image space. We were able to use a similar method to “visualize” the CNN neurons as filters in the first (and only) layer in terms of text space.

Instead of finding the location in each image of the window that caused each neuron to fire the most, we find the location in the pre-processed text of the trigram (or length-3 sequence of words) that caused each neuron to fire the most. Just as the authors of that work were able to identify patterns of colors/lines in images that caused firing, we were able to identify textual patterns that caused firing. Textual patterns are more difficult to visualize than image space patterns. While similar but non-identical RGB pixel values look similar, two words that are mathematically “similar” in their embedding but non-identical do not look similar. They do, however, have similar meanings.

In order to get a general grasp of the meaning of the words/trigrams that each neuron was firing most highly for, we followed steps similar to those described in Section 5.2.1. However, instead of finding those neurons that had the highest/lowest weight × activation, we looked at each neuron and at which trigram in each body text resulted in the pooled value for that neuron. Then, we accumulated all of the trigrams for each neuron and summarized them by counting the instances of each word in the trigrams. Our algorithm reported the words with the highest counts, excluding stop words as defined by NLTK (i.e. words like “the”, “a”, “by”, “it”, which are not meaningful in this circumstance). A sketch of this summarization is given below.
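A sketch of this summarization step; pooled_trigrams and describe_neuron are illustrative names, assuming the winning trigram per document has already been collected for each neuron.

from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))   # nltk.download('stopwords') required once

def describe_neuron(trigrams, k=3):
    # count word occurrences across a neuron's winning trigrams, excluding stop words
    counts = Counter(w for tri in trigrams for w in tri if w not in stop_words)
    return counts.most_common(k)

pooled_trigrams = {0: [("officials", "said", "the"), ("said", "on", "tuesday")]}
print(describe_neuron(pooled_trigrams[0]))   # [('said', 2), ('officials', 1), ('tuesday', 1)]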


CHAPTER-6
SAMPLE CODE


6.0 CODE:

# Include Libraries

import pandas as pd

print(pd.__version__)

from sklearn.model_selection import train_test_split

import sklearn

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import plot_confusion_matrix

from sklearn.svm import SVC

from sklearn import metrics

from matplotlib import pyplot as plt

from sklearn.linear_model import PassiveAggressiveClassifier

from sklearn.feature_extraction.text import HashingVectorizer

import itertools


import numpy as np

import re

import csv

import nltk

from nltk.corpus import stopwords

from nltk.stem import PorterStemmer

from nltk.stem.wordnet import WordNetLemmatizer

from nltk.stem import SnowballStemmer

# Importing dataset using pandas dataframe

df = pd.read_csv("fake_or_real_news.csv")

# Inspect the shape of `df`

print(df.shape)

# Set index

df = df.set_index("Unnamed: 0")


# Print `df` (pandas abbreviates the middle rows of long frames)

print(df)

# Separate the labels and set up training and test datasets

y = df.label

df = df.drop("label", axis=1)  # drop the label column so only the article text remains as features (the article-numbering column already became the index above)
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)

# Save the train/test splits to CSV files for reference
with open('Train-SetX.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for line in X_train:
        writer.writerow([line])

with open('Test-SetX.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for line in X_test:
        writer.writerow([line])

with open('Train-SetY.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for line in y_train:
        writer.writerow([line])

with open('Test-SetY.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for line in y_test:
        writer.writerow([line])

count_vectorizer = CountVectorizer(stop_words='english')

count_train = count_vectorizer.fit_transform(X_train)  # Learn the vocabulary dictionary and return the term-document matrix

count_test = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)  # Ignore words which appear in more than 70% of the articles

tfidf_train = tfidf_vectorizer.fit_transform(X_train)

tfidf_test = tfidf_vectorizer.transform(X_test)

n_vect = CountVectorizer(min_df=5, ngram_range=(1, 2))  # unigrams and bigrams occurring in at least 5 documents

n_train = n_vect.fit_transform(X_train)

n_test = n_vect.transform(X_test)

# Function to plot the confusion matrix


def plot_confusion_matrix(cm, classes, normalize=False, title='', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()


# Naive Bayes classifier for Multinomial model
def NaiveBayes(xtrain, ytrain, xtest, ytest, ac):
    clf = MultinomialNB(alpha=.01, fit_prior=True)
    clf.fit(xtrain, ytrain)
    pred = clf.predict(xtest)
    score = metrics.accuracy_score(ytest, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(ytest, pred, labels=['FAKE', 'REAL'])
    plot_confusion_matrix(cm, classes=['FAKE', 'REAL'], title='Confusion matrix Naive Bayes')
    print(cm)
    ac.append(score)

def Logreg(xtrain, ytrain, xtest, ytest, ac):
    logreg = LogisticRegression(C=9)
    logreg.fit(xtrain, ytrain)
    pred = logreg.predict(xtest)
    score = metrics.accuracy_score(ytest, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(ytest, pred, labels=['FAKE', 'REAL'])
    plot_confusion_matrix(cm, classes=['FAKE', 'REAL'], title='Confusion matrix Logistic Regression')
    print(cm)
    ac.append(score)

def RForest(xtrain, ytrain, xtest, ytest, ac):
    clf1 = RandomForestClassifier(max_depth=50, random_state=0, n_estimators=25)
    clf1.fit(xtrain, ytrain)
    pred = clf1.predict(xtest)
    score = metrics.accuracy_score(ytest, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(ytest, pred, labels=['FAKE', 'REAL'])
    plot_confusion_matrix(cm, classes=['FAKE', 'REAL'], title='Confusion matrix Random Forest')
    print(cm)
    ac.append(score)

def SVM(xtrain, ytrain, xtest, ytest, ac):
    clf3 = SVC(C=100, gamma=0.1)
    clf3.fit(xtrain, ytrain)
    pred = clf3.predict(xtest)
    score = metrics.accuracy_score(ytest, pred)
    print("accuracy: %0.3f" % score)
    cm = metrics.confusion_matrix(ytest, pred, labels=['FAKE', 'REAL'])
    plot_confusion_matrix(cm, classes=['FAKE', 'REAL'], title='Confusion matrix SVM')
    print(cm)
    ac.append(score)

def process(xtrain, ytrain, xtest, ytest, ac):
    print("For Multinomial Naive Bayes Model")
    NaiveBayes(xtrain, ytrain, xtest, ytest, ac)
    print("For Random Forest Classifier")
    RForest(xtrain, ytrain, xtest, ytest, ac)
    print("For Support Vector Machine (Radial Basis Function) Classifier")
    SVM(xtrain, ytrain, xtest, ytest, ac)
    print("For Logistic Regression Classifier")
    Logreg(xtrain, ytrain, xtest, ytest, ac)

colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#8c564b"]
explode = (0.1, 0, 0, 0, 0)  # pie-chart offsets (defined but not used below)
al = ["NaiveBayes", "Random Forest", "SVM", "LogisticRegression"]
cac = []  # accuracies with count vectors
tac = []  # accuracies with TF-IDF vectors
nac = []  # accuracies with n-gram vectors


process(count_train,y_train,count_test,y_test,cac)

print(cac)

result2 = open('CountAccuracy.csv', 'w')
result2.write("Algorithm,Accuracy" + "\n")
for i in range(0, len(cac)):
    print(al[i] + "," + str(cac[i]))
    result2.write(al[i] + "," + str(cac[i]) + "\n")
result2.close()

fig = plt.figure(0)

df = pd.read_csv('CountAccuracy.csv')

acc = df["Accuracy"]

alc = df["Algorithm"]

plt.bar(alc, acc, align='center', alpha=0.5, color=colors)

plt.xlabel('Algorithm')

plt.ylabel('Accuracy')

plt.title('Count Accuracy Value')

fig.savefig('CountAccuracy.png')

plt.show()

process(tfidf_train,y_train,tfidf_test,y_test,tac)

print(tac)


result2 = open('TfidfAccuracy.csv', 'w')
result2.write("Algorithm,Accuracy" + "\n")
for i in range(0, len(tac)):
    print(al[i] + "," + str(tac[i]))
    result2.write(al[i] + "," + str(tac[i]) + "\n")
result2.close()


fig = plt.figure(0)

df = pd.read_csv('TfidfAccuracy.csv')

acc = df["Accuracy"]

alc = df["Algorithm"]

plt.bar(alc, acc, align='center', alpha=0.5,color=colors)

plt.xlabel('Algorithm')

plt.ylabel('Accuracy')

plt.title('Tfidf Accuracy Value')

fig.savefig('TfidfAccuracy.png')

plt.show()

process(n_train,y_train,n_test,y_test,nac)

print(nac)

result2 = open('NgramAccuracy.csv', 'w')
result2.write("Algorithm,Accuracy" + "\n")
for i in range(0, len(nac)):
    print(al[i] + "," + str(nac[i]))
    result2.write(al[i] + "," + str(nac[i]) + "\n")
result2.close()

fig = plt.figure(0)

df = pd.read_csv('NgramAccuracy.csv')

acc = df["Accuracy"]

alc = df["Algorithm"]

plt.bar(alc, acc, align='center', alpha=0.5,color=colors)

plt.xlabel('Algorithm')

plt.ylabel('Accuracy')

plt.title('Ngram Accuracy Value')

fig.savefig('NgramAccuracy.png')

plt.show()


CHAPTER-7

FORMS AND REPORTS


7.1 OUTPUT:
>>>

RESTART: C:\Users\hp\Downloads\Compressed\fake news detection project\fake_news.py

1.0.1

                                                        title  ... label
Unnamed: 0
8476                             You Can Smell Hillary’s Fear  ...  FAKE
10294       Watch The Exact Moment Paul Ryan Committed Pol...  ...  FAKE
3608              Kerry to go to Paris in gesture of sympathy  ...  REAL
10142       Bernie supporters on Twitter erupt in anger ag...  ...  FAKE
875          The Battle of New York: Why This Primary Matters  ...  REAL
...                                                       ...  ...   ...
4490        State Department says it can't find emails fro...  ...  REAL
8062         The ‘P’ in PBS Should Stand for ‘Plutocratic’...  ...  FAKE
8622        Anti-Trump Protesters Are Tools of the Oligarc...  ...  FAKE
4021        In Ethiopia, Obama seeks progress on peace, se...  ...  REAL
4330        Jeb Bush Is Suddenly Attacking Trump. Here's W...  ...  REAL

[6335 rows x 3 columns]

For Multinomial Naive Bayes Model
accuracy: 0.892
Confusion matrix, without normalization
[[879 129]
 [ 96 987]]
For Random Forest Classifier
accuracy: 0.857
Confusion matrix, without normalization
[[892 116]
 [184 899]]
For Support Vector Machine (Radial Basis Function) Classifier
accuracy: 0.521
Confusion matrix, without normalization
[[1006    2]
 [ 999   84]]
For Logistic Regression Classifier
accuracy: 0.902
Confusion matrix, without normalization
[[939  69]
 [135 948]]

[0.8923959827833573, 0.8565279770444764, 0.5212816834050693, 0.9024390243902439]

NaiveBayes,0.8923959827833573


Random Forest,0.8565279770444764

SVM,0.5212816834050693

LogisticRegression,0.9024390243902439

For Multinomial Naive Bayes Model
accuracy: 0.903
Confusion matrix, without normalization
[[891 117]
 [ 85 998]]
For Random Forest Classifier
accuracy: 0.868
Confusion matrix, without normalization
[[888 120]
 [157 926]]
For Support Vector Machine (Radial Basis Function) Classifier
accuracy: 0.937
Confusion matrix, without normalization
[[ 959   49]
 [  82 1001]]
For Logistic Regression Classifier
accuracy: 0.931
Confusion matrix, without normalization
[[965  43]
 [101 982]]

[0.9033955045432808, 0.8675274988043998, 0.937350549976088, 0.9311334289813487]

NaiveBayes,0.9033955045432808

Random Forest,0.8675274988043998

SVM,0.937350549976088

LogisticRegression,0.9311334289813487

For Multinomial Naive Bayes Model
accuracy: 0.912
Confusion matrix, without normalization
[[ 904  104]
 [  80 1003]]
For Random Forest Classifier
accuracy: 0.864
Confusion matrix, without normalization
[[910  98]
 [186 897]]
For Support Vector Machine (Radial Basis Function) Classifier
accuracy: 0.519
Confusion matrix, without normalization
[[1008    0]
 [1006   77]]
For Logistic Regression Classifier
accuracy: 0.917
Confusion matrix, without normalization
[[949  59]
 [114 969]]

[0.9120038259206121, 0.8641798182687709, 0.5188904830224773, 0.9172644667623147]

NaiveBayes,0.9120038259206121

Random Forest,0.8641798182687709

SVM,0.5188904830224773

LogisticRegression,0.9172644667623147

DETERMINING THE OUTPUT:

Confusion matrix: A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized.


True Positive (TP): Observation is positive, and is predicted to be positive.

False Negative (FN): Observation is positive, but is predicted to be negative.

True Negative (TN): Observation is negative, and is predicted to be negative.

False Positive (FP): Observation is negative, but is predicted to be positive.

Classification Rate/Accuracy:

Classification Rate or Accuracy is given by the following relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
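
As a quick sanity check (illustrative only, not part of the project code), the reported accuracy can be recomputed from any of the printed confusion matrices; here the Naive Bayes count-vectorizer matrix from the output above is used:

import numpy as np

cm = np.array([[879, 129],     # FAKE row: 879 correct, 129 misclassified
               [ 96, 987]])    # REAL row: 96 misclassified, 987 correct
accuracy = np.trace(cm) / cm.sum()   # (TP + TN) / (TP + TN + FP + FN)
print("accuracy: %0.3f" % accuracy)  # -> accuracy: 0.892, matching the output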


7.2 COUNT ACCURACY:

Fig 7.2.1 Confusion matrix of Naïve Bayes          Fig 7.2.2 Confusion matrix of Random Forest

Fig 7.2.3 Confusion matrix of SVM                  Fig 7.2.4 Confusion matrix of Logistic Regression


7.2.5 COMPARISON OF COUNT ACCURACY:

Fig 7.2.5 Count accuracy value


7.3 N-GRAM ACCURACY:

Fig 7.3.1 Confusion matrix of Naïve Bayes          Fig 7.3.2 Confusion matrix of Random Forest

Fig 7.3.3 Confusion matrix of SVM                  Fig 7.3.4 Confusion matrix of Logistic Regression


COMPARISON OF N-GRAM ACCURACY:

Fig 7.3.5 N-gram accuracy value


7.4 TF-IDF ACCURACY:

Fig 7.4.1 Confusion matrix of Naïve Bayes          Fig 7.4.2 Confusion matrix of Random Forest

Fig 7.4.3 Confusion matrix of SVM                  Fig 7.4.4 Confusion matrix of Logistic Regression


COMPARISON OF TF-IDF ACCURACY:

Fig 7.4.5 TF-IDF accuracy value


CHAPTER-8
CONCLUSION


8. CONCLUSION:

In the 21st century, the majority of tasks are done online. Newspapers that were earlier preferred as hard copies are now being substituted by applications like Facebook and Twitter and by news articles read online. The growing problem of fake news only makes things more complicated, and it attempts to change or hamper people's opinions and attitudes towards the use of digital technology. When a person is deceived by fake news, two things can happen. First, people start believing that their perceptions about a particular topic are true as assumed. Second, even if a news article is available that contradicts a supposedly fake one, people believe the words that simply support their own thinking, without taking the facts involved into account. Thus, in order to curb the phenomenon, platforms such as Google and Facebook have begun taking measures to detect and flag fake news, and automated classifiers like the ones evaluated in this project are a step in that direction.


CHAPTER-9
REFERENCES


REFERENCES:

1. Conroy, N., Rubin, V., Chen, Y.: Automatic Deception Detection: Methods for Finding Fake News (2015).

2. Ball, L., Elworthy, J.: Journal of Marketing Analytics 2, 187 (2014). https://doi.org/10.1057/jma.2014.15

3. Lu, T.C., Yu, T., Chen, S.H.: Information Manipulation and Web Credibility. In: Bucciarelli, E., Chen, S.H., Corchado, J. (eds.) Decision Economics: In the Tradition of Herbert A. Simon's Heritage. DCAI 2017. Advances in Intelligent Systems and Computing, vol. 618. Springer, Cham (2018).

4. Rubin, V., Conroy, N., Chen, Y., Cornwell, S.: Fake News or Truth? Using Satirical Cues to Detect Potentially Misleading News (2016). doi:10.18653/v1/W16-0802

5. Wang, W.Y.: "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection. arXiv preprint arXiv:1705.00648 (2017).

6. Rubin, V.L., et al.: Fake news or truth? Using satirical cues to detect potentially misleading news. In: Proceedings of NAACL-HLT (2016).

7. Parikh, S.B., Atrey, P.K.: Media-Rich Fake News Detection: A Survey. In: IEEE Conference on Multimedia Information Processing and Retrieval (2018).

8. Allcott, H., Gentzkow, M.: Social Media and Fake News in the 2016 Election. Technical report, National Bureau of Economic Research (2017).
