Sentivoter is an advanced cross-media data collection and sentiment analysis pipeline developed to track and analyze public opinion during the 2024 U.S. presidential election. By leveraging data from Facebook and YouTube, the project seeks to provide a comprehensive understanding of the sentiments and emotions expressed by users, particularly regarding political figures like Kamala Harris and Donald Trump. The pipeline combines tools such as Selenium for web scraping, Apache Spark for large-scale data processing, and Elasticsearch for efficient indexing, offering a rich analysis of the public's reactions to the unfolding election campaign.

This project relies on Elections-Crawler as a core component of the framework, responsible for collecting the data from Facebook.
- Docker/Docker Compose: Ensure you have a fully functional Docker and Docker Compose installation on your local computer.
- Prepare the dataset: since the analysis runs in batch mode, you first need to prepare the required files.
  - Facebook data should be collected using the Elections-Crawler module, exporting the scraped content into a MySQL dump located at `mysql/dump.sql`.
  - YouTube data can be generated using the scripts provided in the `yt_data` directory. Make sure to flatten the raw data to make it ready for the ingestion phase (a minimal sketch of this step follows the list).
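To illustrate what the flattening step produces, here is a minimal sketch that joins video metadata onto each comment. The raw file names and layout used below are assumptions; adapt them to the actual output of the `yt_data` scripts.

```python
# Minimal flattening sketch. The raw layout assumed here (one videos file
# plus one comments file, joinable on id_video) is hypothetical; adapt the
# paths and field names to what the yt_data scripts actually produce.
import json
from pathlib import Path

RAW_VIDEOS = Path("yt_data/raw/videos.json")      # hypothetical input path
RAW_COMMENTS = Path("yt_data/raw/comments.json")  # hypothetical input path
OUT_DIR = Path("yt_data/flattened_comments_data")

# Index videos by ID so each comment can be enriched with its video metadata.
videos = {v["id_video"]: v for v in json.loads(RAW_VIDEOS.read_text())}
comments = json.loads(RAW_COMMENTS.read_text())

OUT_DIR.mkdir(parents=True, exist_ok=True)
with open(OUT_DIR / "comments.json", "w") as out:
    for c in comments:
        v = videos[c["id_video"]]
        # Copy the video metadata onto the comment so every record is
        # self-contained, as required by the ingestion phase.
        flat = {
            "channel": v["channel"],
            "channel_bias": v["channel_bias"],
            "state": v["state"],
            "url_video": v["url_video"],
            "id_video": v["id_video"],
            "title": v["title"],
            "video_timestamp": v["video_timestamp"],
            "video_likes": v["video_likes"],
            "views": v["views"],
            "social": "youtube",
            "comment_cid": c["comment_cid"],
            "comment_published_at": c["comment_published_at"],
            "comment_author": c["comment_author"],
            "comment_text": c["comment_text"],
            "comment_votes": c["comment_votes"],
        }
        out.write(json.dumps(flat) + "\n")
```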
The expected data format is described in the next section.
Each entry in `yt_data/flattened_comments_data/*.json` represents a single comment enriched with metadata from the video it belongs to.
The expected format is as follows:

```jsonc
{
  "channel": "WashingtonPost",       // Name of the YouTube channel
  "channel_bias": "DEMOCRAT",        // Channel political leaning (e.g., DEMOCRAT or REPUBLICAN)
  "state": "DC",                     // U.S. state associated with the channel
  "url_video": "https://www.you...", // Full URL of the video
  "id_video": "k8cUC0V0C3U",         // YouTube video ID
  "title": "Debunking Trump’s...",   // Title of the video
  "video_timestamp": "2024-09...",   // Upload datetime of the video (ISO format)
  "video_likes": 59,                 // Number of likes the video received
  "views": 2063,                     // Number of views for the video
  "social": "youtube",               // Social media source (constant: "youtube")
  "comment_cid": "UgzeZQE7l...",     // Comment ID
  "comment_published_at": "202...",  // Comment timestamp (ISO format)
  "comment_author": "@desir...",     // Username of the commenter
  "comment_text": "Highes...",       // Text content of the comment
  "comment_votes": 9                 // Number of likes/upvotes the comment received
}
```

Each entry in `yt_data/flattened_videos_data/*.json` represents a single video enriched with channel metadata. The expected format is as follows:

```jsonc
{
  "channel": "USAToday",              // Name of the YouTube channel
  "channel_bias": "LEAN_DEMOCRAT",    // Channel political leaning (e.g., DEMOCRAT, LEAN_DEMOCRAT)
  "state": "DC",                      // U.S. state associated with the channel
  "url_video": "https://www.yout...", // Full URL of the video
  "id_video": "7YXAno3DS1U",          // YouTube video ID
  "title": "House Spea...",           // Title of the video
  "video_timestamp": "2024-09-1...",  // Upload datetime of the video (ISO format)
  "fullText": "[music] we have...",   // Video transcript
  "video_likes": 172,                 // Number of likes
  "views": 14701,                     // Total views
  "comments": 68,                     // Number of comments
  "social": "youtube"                 // Source identifier (constant: "youtube")
}
```
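Before moving on, it can be worth sanity-checking the flattened files against the formats above. A minimal sketch, assuming one JSON object per line (adjust the parsing if your files store JSON arrays):

```python
# Quick sanity check over the flattened YouTube data before ingestion.
import json
from collections import Counter
from pathlib import Path

comments_by_channel = Counter()
for path in Path("yt_data/flattened_comments_data").glob("*.json"):
    with open(path) as f:
        for line in f:  # assumes one JSON object per line
            if not line.strip():
                continue
            record = json.loads(line)
            comments_by_channel[record["channel"]] += 1

# Print the ten channels with the most comments collected.
for channel, n in comments_by_channel.most_common(10):
    print(f"{channel}: {n} comments")
```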
The Facebook data, collected via the Elections-Crawler module, is stored as a MySQL dump (`mysql/dump.sql`). The dump includes two main tables:

```sql
CREATE TABLE Comments_with_candidate_column (
uuid char(100) NOT NULL, -- Unique identifier for the comment
post_id char(100) DEFAULT NULL, -- ID of the parent post
candidate varchar(50) NOT NULL, -- Candidate associated with the post
timestamp datetime DEFAULT NULL, -- Comment timestamp
account char(36) DEFAULT NULL, -- Username of the commenter
content varchar(1000) NOT NULL, -- Comment text
`like` int DEFAULT '0', -- Reaction counts (like/love/care/haha/wow/angry/sad)
love int DEFAULT '0',
care int DEFAULT '0',
haha int DEFAULT '0',
wow int DEFAULT '0',
angry int DEFAULT '0',
sad int DEFAULT '0'
);
CREATE TABLE Posts (
uuid char(100) NOT NULL, -- Unique identifier for the post
retrieving_time datetime NOT NULL, -- When the post was scraped
timestamp datetime DEFAULT NULL, -- Original post timestamp
candidate varchar(50) NOT NULL, -- Candidate associated with the post
content varchar(1000) NOT NULL, -- Post text content
`like` int DEFAULT '0',
love int DEFAULT '0',
care int DEFAULT '0',
haha int DEFAULT '0',
wow int DEFAULT '0',
angry int DEFAULT '0',
sad int DEFAULT '0',
PRIMARY KEY (uuid,retrieving_time)
);
```
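Once the dump has been restored, a quick query can confirm that the tables loaded correctly. A minimal sketch using `mysql-connector-python`; the host, credentials, and database name below are assumptions and should match your Docker Compose configuration:

```python
# Sanity-check the restored dump. Connection parameters are assumptions;
# use the values from your docker-compose configuration.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="root",       # assumed; match your compose file
    database="elections",  # assumed database name
)
cur = conn.cursor()

# Comments and total reactions per candidate. Note that `like` must stay
# backquoted because LIKE is a reserved word in MySQL.
cur.execute("""
    SELECT candidate,
           COUNT(*),
           SUM(`like` + love + care + haha + wow + angry + sad)
    FROM Comments_with_candidate_column
    GROUP BY candidate
""")
for candidate, n_comments, reactions in cur.fetchall():
    print(f"{candidate}: {n_comments} comments, {reactions} total reactions")

cur.close()
conn.close()
```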
The data pipeline is structured as follows:

- Producers: read data from the input directories and push it into the data pipeline.
- Logstash: ingestion layer; forwards data into four different Kafka topics.
- Apache Kafka: manages the data streams across the four topics, ensuring decoupled communication.
- Apache Spark Cluster: consumes data from Kafka and performs batch sentiment and emotion analysis using TweetNLP models, then sends the enriched data into four different Elasticsearch indices (a minimal sketch follows this list).
- Elasticsearch: stores the enriched data and provides high-performance querying.
- Kibana: provides interactive visualizations and dashboards for exploring sentiment trends, emotional tones, and engagement across platforms and candidates.
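To make the Spark stage concrete, here is a minimal sketch of the Kafka-to-TweetNLP step for a single topic. The broker address, topic name, and field handling are assumptions (and the `spark-sql-kafka` package must be on the classpath); this illustrates the approach, not the project's actual job:

```python
# Minimal sketch of the Spark stage: batch-read one Kafka topic, enrich each
# record with TweetNLP sentiment and emotion labels, and print a sample.
# The real pipeline repeats this for all four topics and writes the enriched
# records to the matching Elasticsearch indices instead of printing.
import json
import tweetnlp
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sentivoter-sketch").getOrCreate()

df = (spark.read.format("kafka")                        # batch, not streaming
      .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
      .option("subscribe", "yt_comments")               # assumed topic name
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw"))

_models = {}

@udf(returnType=StringType())
def enrich(raw: str) -> str:
    # Load the TweetNLP models lazily, once per executor process, instead of
    # shipping them from the driver.
    if not _models:
        _models["sentiment"] = tweetnlp.load_model("sentiment")
        _models["emotion"] = tweetnlp.load_model("emotion")
    record = json.loads(raw)
    # YouTube records carry the text in comment_text, Facebook ones in content.
    text = record.get("comment_text") or record.get("content", "")
    record["sentiment"] = _models["sentiment"].sentiment(text)["label"]
    record["emotion"] = _models["emotion"].emotion(text)["label"]
    return json.dumps(record)

enriched = df.select(enrich(col("raw")).alias("value"))
enriched.show(5, truncate=80)
```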
- E-Mail: simonebrancato18@gmail.com
- LinkedIn: Simone Brancato
- GitHub: Simone Brancato