Using twint to scrape Twitter, we apply natural language processing (NLP) techniques to analyze the sentiment of tweets relating to masks and coronavirus, classifying each tweet as Negative, Neutral, or Positive. Through text processing, exploratory data analysis (EDA), and feature engineering, we discover insights into how important words, topics, and subjectivity relate to sentiment. We then build predictive models to provide further insight and confirm our findings from EDA.
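For reference, here is a minimal twint sketch of the kind of search used to collect the data. The query terms, date range, and output file are illustrative assumptions, not the project's exact settings (see twitter_scraping_notebook.ipynb for those):

```python
import twint

# Configure a search for mask-related coronavirus tweets.
# Query, dates, and output path below are illustrative only.
c = twint.Config()
c.Search = "mask coronavirus"   # hypothetical query
c.Since = "2020-01-01"
c.Until = "2020-05-31"
c.Lang = "en"
c.Store_csv = True              # save results for the cleaning notebook
c.Output = "tweets.csv"
c.Hide_output = True            # don't print every tweet to the console

twint.run.Search(c)
```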
- How does the sentiment of tweets change over time?
- Hypothesis: Tweets will be more negative on average in January and get more positive on average as time goes on.
- Will Twitter stats (number of likes, replies, retweets) play a role in determining sentiment?
- Hypothesis: The most important features will most likely be the words themselves.
- Does topic modeling provide any insight toward tweet sentiment or the COVID-19 crisis?
- Hypothesis: Topic modeling should be a factor in determining sentiment and can give us insights into the pandemic.
- What insights can be provided by using machine learning?
- Hypothesis: The lion's share of the insights will come during EDA.
- What are the most frequent words? And do they play a role in determining sentiment?
- Tweets were generally more negative in January but relatively constant from February through May (there were also far fewer relevant tweets in January).
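As a sketch of how this monthly breakdown can be computed, the snippet below labels tweets with TextBlob polarity (the thresholds are illustrative, and the project may have used a different labeling scheme) and tallies sentiment shares per month with pandas; the file and column names are hypothetical:

```python
import pandas as pd
from textblob import TextBlob

df = pd.read_csv("tweets.csv", parse_dates=["date"])  # hypothetical file/columns

def label_sentiment(text):
    """Label a tweet by polarity sign; cutoffs here are illustrative."""
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.05:
        return "Positive"
    if polarity < -0.05:
        return "Negative"
    return "Neutral"

df["sentiment"] = df["tweet"].apply(label_sentiment)

# Share of each sentiment class per calendar month.
monthly = (
    df.groupby([df["date"].dt.to_period("M"), "sentiment"])
      .size()
      .unstack(fill_value=0)
)
print(monthly.div(monthly.sum(axis=1), axis=0).round(3))
```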
- After removing common English stopwords as well as topical stopwords like mask and virus, the ten most frequently occurring words were: hand, need, spread, protect, make, help, say, glove, public, and hospital.
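A minimal sketch of that frequency count, assuming NLTK's English stopword list plus a hand-picked topical list (the project's exact topical stopwords and tokenizer may differ):

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# English stopwords plus topical words that appear in nearly every tweet.
# The topical set below is an assumption, not the project's exact list.
stop_words = set(stopwords.words("english")) | {
    "mask", "masks", "virus", "coronavirus", "covid", "covid19",
}

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

tweets = [
    "Wash your hands and wear a mask to protect others",
    "Hospitals need gloves and masks to help the public",
]  # stand-in corpus; the project used the scraped tweets

counts = Counter(word for tweet in tweets for word in tokenize(tweet))
print(counts.most_common(10))
```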
- A 10-topic LDA model grouped words into the following topics, each with its predominant sentiment (a modeling sketch follows this list):
- Healthcare workers, hospitals: split positive/negative
- Social distancing: positive
- Protesting, lockdowns: positive
- Government, health organizations: negative
- Spreading the virus: positive
- Emojis, swear words: positive
- COVID19 statistics: split positive/negative
- Preventing infection: positive
- General opinions: negative
- Riots, BLM: neutral
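A minimal sketch of the topic-modeling step with gensim, assuming tokenized tweets as input (the stand-in documents and hyperparameters below are illustrative; the actual model is built in nlp_features_notebook.ipynb):

```python
from gensim import corpora, models

# 'docs' is a list of token lists, e.g. output of a tokenizer like the one above.
docs = [
    ["hand", "wash", "protect", "public"],
    ["hospital", "glove", "need", "help"],
]  # stand-in documents

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# 10 topics to match the project's model; passes/random_state are illustrative.
lda = models.LdaModel(
    corpus,
    num_topics=10,
    id2word=dictionary,
    passes=10,
    random_state=42,
)

for topic_id, words in lda.print_topics(num_topics=10, num_words=8):
    print(topic_id, words)
```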
- Topic modeling provided some interesting insights but was not helpful for predictive modeling.
- Some of the features that the prediction models weighted most heavily were surprising (a sketch of how such a ranking can be produced follows these lists):
- Subjectivity score
- Number of likes
- Number of retweets
- The words 'need', 'spread', 'protect', 'make', 'help', 'say', 'glove', 'public', 'hospital', and 'new'
- Ranked by importance score, the top ten features were:
- Subjectivity score (0.0611)
- Number of likes (0.0139)
- 'protect' (0.0132)
- 'help' (0.0129)
- 'infected' (0.0115)
- 'safe' (0.0094)
- 'please' (0.0083)
- 'death' (0.0083)
- 'hand' (0.0076)
- Number of replies (0.0072)
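As a rough sketch of how such a ranking can be produced, the snippet below combines word counts with the continuous features (TextBlob subjectivity plus Twitter stats) and reads importances off a scikit-learn decision tree. The file, column names, and hyperparameters are assumptions; the project's actual pipeline lives in nlp_features_notebook.ipynb and modeling_notebook.ipynb:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from textblob import TextBlob

# Hypothetical columns: tweet, likes, retweets, replies, sentiment.
df = pd.read_csv("tweets_labeled.csv")

# Bag-of-words features for the tweet text.
vectorizer = CountVectorizer(max_features=1000, stop_words="english")
word_counts = vectorizer.fit_transform(df["tweet"])

# Continuous features: subjectivity score and Twitter stats.
df["subjectivity"] = df["tweet"].apply(lambda t: TextBlob(t).sentiment.subjectivity)
extra = df[["subjectivity", "likes", "retweets", "replies"]].to_numpy()

X = hstack([word_counts, extra]).tocsr()
y = df["sentiment"]

tree = DecisionTreeClassifier(max_depth=10, random_state=42).fit(X, y)

# Rank every feature (words + continuous stats) by importance.
feature_names = list(vectorizer.get_feature_names_out()) + [
    "subjectivity", "likes", "retweets", "replies",
]
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print(f"{name}: {score:.4f}")
```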
The overall sentiment of tweets was fairly evenly divided between positive and negative throughout the five months. Our prediction models produced some interesting results, namely that continuous variables like subjectivity score, number of likes, and number of replies were among the most important variables for predicting a tweet's sentiment. Other important features were high-frequency words. Given more time, we would try to improve accuracy with a deep learning model, such as an LSTM. Finally, we would like to further investigate sentiment toward the word mask (or masks) in particular, as opposed to the overall sentiment of the tweet as a whole.
- Images folder - charts and visualizations created during the project
- .gitignore - list of files and pathways to ignore
- README.md - this very file!
- data_cleaning_notebook.ipynb - notebook for compiling and cleaning our dataframes
- eda_visualizations_notebook.ipynb - notebook with EDA and chart/visualization creation
- functions.py - file with functions used in this project
- modeling_notebook.ipynb - notebook with Naive Bayes and Decision Tree models
- nlp_features_notebook.ipynb - notebook with text processing, LDA topic modeling, and subjectivity scoring
- twitter_scraping_notebook.ipynb - notebook detailing our scraping of tweets
- presentation.pdf - slides for our presentation of this project