
Final Report Data Mining

This document appears to be a final project report for a Twitter data analysis project on the topic of demonetization in India. It includes sections on introduction and objectives, literature review on related work, experimental details on machine learning methods and the dataset, and planned sections on results and discussions, summary and conclusions, and references. The introduction provides background on demonetization in India and the objective to perform sentiment analysis on Twitter data related to this topic. The literature review covers related work on analyzing information and sentiment on Twitter. Experimental details include plans to use machine learning clustering and classification methods like K-means on a Twitter dataset from Kaggle about demonetization.

School of Information Technology & Engineering

M-Tech Software Engineering


SWE2009- DATA MINING TECHNIQUES
TWITTER DATA ANALYSIS
FINAL REVIEW

Group Members
Name Reg.no
V.VAMSI KRISHNA 16MIS0170
K. SAITEJA 16MIS0272

Submitted to
Faculty: Prof.SUDHA.M
SLOT: B2+TB2

CERTIFICATE

This is to certify that the Project work entitled “TWITTER DATA ANALYSIS”
that is being submitted by “VAMSI KRISHNA AND K. SAITEJA” in M. Tech
(S.E) for SWE2009: DATA MINING TECHNIQUES is a record of bonafide
work done under my supervision. The contents of this Project work, in full or in
parts, have neither been taken from any other source nor have been submitted for
any other course.

Signature of faculty

(SUDHA.M)

ACKNOWLEDGEMENT

We are thankful to the Department, through which we have gained confidence in
innovative thinking and enhanced our professional skills so as to become
competent in this field.

In carrying out our project, we had the help and guidance of some respected
persons, who deserve our greatest gratitude. The completion of this project gives
us much pleasure. We would like to express our gratitude to Prof. SUDHA.M, SITE,
VIT University, for giving us good guidance on the project throughout numerous
consultations. We would also like to extend our deepest gratitude to all those
who have directly and indirectly guided us in this project.

Thank you,

V.VAMSI KRISHNA 16MIS0170


K. SAITEJA 16MIS0272

TABLE OF CONTENTS

S.no Topics
Abstract
1. Introduction
1.1 Introduction
1.2 Objective of the work
1.3 Scope of the work
2. Literature review
2.1 Introduction
2.2 Background
2.3 Challenges
2.4 Problem definition and approach
3. Experimental details
3.1 Machine learning methods
3.2 Design framework
3.3 Dataset, Data source, characterization, Pre-processing
3.4 Processing techniques
4. Results and Discussions
5. Summary and Conclusions
6. References

Abstract

Withdrawal of a particular form of currency (such as currency notes) from circulation is known
as demonetization. On November 8th, 2016, India’s Prime Minister announced that 86% of the
country’s currency would be rendered null and void within 50 days: all 500 and 1,000 rupee
notes, the country’s most popular currency denominations, would be withdrawn from circulation,
while a new 2,000 rupee note would be introduced. It was posited as a move to crack down on
corruption and the country’s booming, under-regulated and virtually untaxed grassroots economy.
To assess the response to the implementation of demonetization, we apply sentiment analysis to
a data set of tweets. We also identify the types of users, the number of tweets per hour and
the rate of increase in the number of tweets, so that we can gauge user interest in
demonetization.

CHAPTER-1

Introduction

Twitter is a micro-blogging website that has become increasingly popular with the online
community. Users post short messages, known as tweets, which are limited to 140
characters. Through tweets, users share their personal opinions on many subjects, discuss
current topics and write about life events. The platform is favoured by many users because it
has no political or economic restrictions and is easily available to a large number of people.
As the number of users increases, micro-blogging platforms are becoming a place to find strong
viewpoints and sentiment. People use Twitter to forecast and analyse in many different areas.

Objective of the work

Withdrawal of a particular form of currency (such as currency notes) from circulation is known
as demonetization. On November 8th, 2016, India’s Prime Minister announced that 86% of the
country’s currency would be rendered null and void within 50 days: all 500 and 1,000 rupee
notes, the country’s most popular currency denominations, would be withdrawn from circulation,
while a new 2,000 rupee note would be introduced. It was posited as a move to crack down on
corruption and the country’s booming, under-regulated and virtually untaxed grassroots economy.
The objective of this work is to apply sentiment analysis to tweets on this topic in order to
gauge user interest in demonetization.

Scope of the work

We take the sample dataset, pre-process and transform the data, and collect the relevant
information using a Jupyter notebook in Python. The analysis uses a word cloud to show the
most common words used with the tag “Narendhramodhi”, and the tweet counts are displayed as
ratios. A pie chart and other graphical representations show the most used sources of the
tweets, such as iPhone or Android. The main aim of the proposed system is sentiment
analysis.
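As a minimal sketch of the word-frequency step that sits behind a word cloud, assuming the tweets are available as plain strings. The function name `top_words`, the stop-word list and the sample tweets are hypothetical and are not taken from the project code.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "is", "of", "to", "and", "in", "on", "for", "at", "are", "rt"}

def top_words(tweets, n=3):
    """Return the n most frequent non-stop-words across the given tweets."""
    counts = Counter()
    for text in tweets:
        # Lowercase the tweet and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical sample tweets, for illustration only.
sample = [
    "Demonetization is a bold move",
    "Long queues at the bank after demonetization",
    "Bank queues are growing",
]
print(top_words(sample, 3))
```

The resulting (word, count) pairs are the kind of input a word-cloud renderer would scale font sizes from.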

CHAPTER-2

Literature Survey

The recent advancements in Web technologies have attracted a large number of internet users
to use online social networks like Facebook and Twitter for varied purposes, including events
update and data sharing. As a result, social network applications are emerging as a powerful
online tool for users to express and share their views with other users around the globe. Twitter
is one such social media application with a large and rapidly growing user base. It has become
the most popular micro-blogging social networking website in which users share their views in
the form of very short message limited to 140 characters called “tweets”. Besides events update
and data sharing, Twitter is also being used for many other purposes, including product
marketing, political campaign, and market research. In addition, Twitter is also being used by
the users to express their opinions and views about prominent issues of day-to-day life that
may be social, political, or entertainment. Analysing tweets to spot emerging issues and trends
and to assess public opinion concerning topics and events is of considerable interest to various
stakeholders, including government, companies, and security agencies.

Background

In this section, we present the functional details of our proposed tweet mining approach, which
aims to classify tweets based on their relatedness to various events. Figure 1 presents the
workflow of the proposed method and highlights the functioning details of the various working
modules. Tweet crawling retrieves tweets from the server and stores them on a local
machine for analysis. Tweet pre-processing and tokenization extracts the tweet
contents, filters out unwanted constituents such as embedded emoticons and URLs, and tokenizes
them into 1-grams for further processing. Feature extraction and social network generation
identifies significant key terms from the tweets using the Latent Dirichlet Allocation (LDA)
method and uses them to model the tweets as a social network. Finally, Markov clustering is
applied to the generated social network to crystallize it into various clusters, each one
representing a particular event.
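The pre-processing and tokenization step described above can be sketched as follows. This is an illustration under our own assumptions: the regular expressions cover only simple URLs and a few ASCII emoticons, and the function name `tokenize` is hypothetical.

```python
import re

# Matches http/https links (a simple illustrative pattern).
URL_RE = re.compile(r"https?://\S+")
# Matches a few common ASCII emoticons such as :) :-( ;D (illustrative only).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-']?[\)\(\]\[DPp/\\|](?!\w)")

def tokenize(tweet):
    """Strip URLs and emoticons, then split the tweet into lowercase 1-grams."""
    text = URL_RE.sub(" ", tweet)
    text = EMOTICON_RE.sub(" ", text)
    # Keep letters, digits, hashtags, mentions, underscores and apostrophes.
    return re.findall(r"[a-z0-9#@_']+", text.lower())

print(tokenize("Long queues at ATM :( https://t.co/abc #demonetization"))
```

The token lists produced this way are what a later feature extraction step, such as LDA, would consume.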

Challenges

The wide range of information available on Twitter makes it one of the most
appropriate virtual environments for information monitoring and tracking. In this paper, the
authors review different information analysis techniques, starting with the analysis of
hashtags, Twitter’s network topology, event spread over the network and identification of
influence, and finally the analysis of sentiment. Future research and development work is
also addressed.

Problem definition and approach

The project addresses the problem of sentiment analysis on Twitter, that is, classifying tweets
according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online
micro-blogging and social-networking platform which allows users to write short status
updates with a maximum length of 140 characters. It is a rapidly expanding service with over 200
million registered users, of whom 100 million are active users, and half of them log on to
Twitter on a daily basis, generating nearly 250 million tweets per day. Given this large amount
of usage, we hope to obtain a reflection of public sentiment by analysing the sentiments
expressed in the tweets. Analysing public sentiment is important for many applications, such
as firms trying to find out the response to their products in the market, predicting political
elections and predicting socioeconomic phenomena like the stock exchange. The aim of this
project is to develop a functional classifier for accurate and automatic sentiment classification
of an unknown tweet stream.

CHAPTER-3

Experimental Details

Machine learning methods

Clustering with K-means

Given k, the k-means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partitioning.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when the assignment does not change.
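The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the function names, the round-robin initial partition, the reseeding of empty clusters and the sample points are our own assumptions.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: partition the objects into k nonempty subsets.
    # Shuffled indices are dealt round-robin so that no subset is empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for pos, idx in enumerate(order):
        labels[idx] = pos % k

    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                # Assumption: an empty cluster is reseeded at a random point.
                centroids.append(rng.choice(points))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Step 4: stop when the assignment no longer changes, else repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(pts, 2)
print(labels)
```

The cluster numbering depends on the random initial partition, so only the grouping of points, not the label values, is meaningful.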

Design Framework

Data Set

Data Source

https://www.kaggle.com/zoupet/exploratory-data-analysis.

Pre-processing

• Replaced 0 values with the mean, but no performance improvement was observed while
  evaluating models.
• Replacing missing values with the mean improved performance while evaluating models.
• Dropped rows with 0 values; performance seemed to improve, but the dataset was reduced
  to half.
• Split the data into train and test sets.
• Applied feature selection, but performance did not change much, so those code lines were
  disabled.
• Total number of rows: 912
• Total number of noise: 49
• Total number of missing: 56
• Total number of outlier: 11
• Total number of errors: 28
• Total number of rows after data pre-processing: 768
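The sketch below illustrates mean imputation and checks the row counts listed above. It assumes that missing values are represented as `None` and that each noisy, missing, outlier or erroneous row falls into exactly one category and is dropped; the function name `impute_mean` and the sample column are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical column with one missing value.
print(impute_mean([2.0, None, 4.0]))  # [2.0, 3.0, 4.0]

# Row accounting using the counts reported in the list above.
total_rows = 912
removed = {"noise": 49, "missing": 56, "outlier": 11, "errors": 28}
remaining = total_rows - sum(removed.values())
print(remaining)  # 768
```

Under that assumption, the 144 removed rows account exactly for the drop from 912 to 768 rows.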

Processing techniques

Sentimental Analysis

We take the sample dataset, pre-process and transform the data, and collect the relevant
information using a Jupyter notebook in Python. The analysis uses a word cloud to show the
most common words used with the tag “Narendhramodhi”, and the tweet counts are displayed as
ratios. A pie chart and other graphical representations show the most used sources of the
tweets, such as iPhone or Android. The main aim of the proposed system is sentiment
analysis. The same code and method can be used on other data sets to identify the most
commonly used text. We also tried it with another data path and obtained about 95% correct
output. The approach can be applied to any real-life scenario to analyse the response.
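The report does not state which sentiment method was used, so the following is only a minimal lexicon-based sketch of classifying a tweet as positive, negative or neutral. The word lists, the function name `classify` and the sample tweets are hypothetical.

```python
import re

# Tiny illustrative lexicons (assumptions, not the project's word lists).
POSITIVE = {"good", "great", "bold", "support", "welcome"}
NEGATIVE = {"bad", "chaos", "queues", "fail", "suffer"}

def classify(tweet):
    """Label a tweet by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z]+", tweet.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for text in ["Great and bold move", "Long queues and chaos at banks", "New notes announced"]:
    print(text, "->", classify(text))
```

A trained classifier would replace the fixed word lists, but the three output labels would stay the same.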

CHAPTER-4

Results and discussions

Getting data

For the word Narendra Modi

For the word Terrorist

Time series plotting

No of retweets per hour
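A count of this kind can be computed along the following lines. This is a sketch under our own assumptions: each record is a (timestamp, retweet count) pair, the timestamp format is as shown, and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def retweets_per_hour(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Sum retweet counts into hourly buckets keyed by 'YYYY-MM-DD HH:00'."""
    totals = defaultdict(int)
    for created, retweets in records:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.strptime(created, fmt).replace(minute=0, second=0)
        totals[hour.strftime("%Y-%m-%d %H:00")] += retweets
    # Return the buckets in chronological order.
    return dict(sorted(totals.items()))

# Hypothetical records, for illustration only.
records = [
    ("2016-11-22 10:05:00", 3),
    ("2016-11-22 10:40:00", 2),
    ("2016-11-22 11:15:00", 5),
]
print(retweets_per_hour(records))
```

The hourly totals can then be plotted as a time series.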

Source of retweets

No of retweet by source bis
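The share of each source, as would be shown in a pie chart, can be sketched as below. The function name `source_shares` and the sample source labels are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def source_shares(sources):
    """Return the percentage share of each source, rounded to one decimal place."""
    counts = Counter(sources)
    total = sum(counts.values())
    return {src: round(100 * n / total, 1) for src, n in counts.items()}

# Hypothetical source labels, for illustration only.
print(source_shares(["iPhone", "Android", "Android", "Web"]))
```

These percentages are the values a pie chart of sources would display.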

Cluster plotting

Correlation analysis

Sentiment Analysis

CHAPTER-5

Summary and conclusions

The results above are shown for each component computed over the total dataset: the number of
tweets in a particular time period, the device used for the tweets, and the correlation. We
ran the experiment on almost 700 records from the data and obtained the output, but it took
almost 25 minutes to produce. This project gave us experience in analysing data along
different categories. The approach can be applied to any real-life scenario to analyse whether
the response is positive, negative or partial.

References

[1] Chung, J. E., & Mustafaraj, E. (2011, August). Can collective sentiment expressed on twitter predict
political elections? In Twenty-Fifth AAAI Conference on Artificial Intelligence.

[2] Pak, A., & Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion mining.
In LREC (Vol. 10, No. 2010, pp. 1320-1326).

[3] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter users: Real-time event detection
by social sensors,” in Proceedings of the 19th international conference on World wide web, 2010, pp.
851–860.

[4] M. Cheong and V. Lee, “A study on detecting patterns in twitter intratopic user and message
clustering,” in Proceedings of the 2010 20th International Conference on Pattern Recognition, 2010, pp.
3125–3128.

[5] M. Thelwall, K. Buckley, and G. Paltoglou, “Sentiment in twitter events,” Journal of the American
Society for Information Science and Technology, vol. 62, no. 2, pp. 406–418, 2011.
