This project demonstrates practical big data engineering and analytics skills using Apache Spark, MongoDB, and Streamlit. It focuses on distributed data processing, sentiment analysis, NoSQL database integration, performance benchmarking, and interactive dashboard visualization.
The pipeline covers:
- ✅ Data ingestion and cleaning using PySpark
- ✅ NoSQL storage in MongoDB with optimized schema and indexing
- ✅ Large-scale data analysis using Spark transformations
- ✅ Complex querying using Spark SQL
- ✅ Performance comparison between Spark SQL and MongoDB Aggregation
- ✅ Data visualization using Matplotlib and Seaborn
- ✅ BONUS — Interactive dashboard using Streamlit
- ✅ BONUS — Format comparison: CSV vs Parquet vs Avro
| Member | Responsibility |
|---|---|
| Rabaya Khatun Keya | Data ingestion, cleaning & MongoDB storage |
| Manasha Siriwardana Mudalige Don | Spark processing & transformations |
| Md Firoz Chowdhury | MongoDB queries & indexing and others |
| Neeru Neeru | Spark SQL queries & performance analysis, Visualization, report, dashboard & documentation also git process |
| Detail | Info |
|---|---|
| Name | Twitter US Airline Sentiment |
| Source | Kaggle |
| Records | ~14.6k raw tweets; 14,478 records after cleaning |
| Airlines | United, US Airways, American, Southwest, Delta, Virgin America |
| Period | February 2015 |
| Labels | Positive, Neutral, Negative |
| Tool | Version | Purpose |
|---|---|---|
| Apache Spark (PySpark) | 4.1.1 | Large-scale data processing |
| MongoDB | 7.0 | NoSQL data storage |
| Python | 3.x | Programming language |
| Pandas | Latest | Data manipulation |
| Matplotlib | Latest | Static visualizations |
| Seaborn | Latest | Statistical charts |
| Streamlit | Latest | Interactive dashboard |
| Plotly | Latest | Interactive charts in dashboard |
| fastavro | Latest | Avro format support |
| Google Colab | Linux | Parquet/Avro comparison |
| Java JDK | 17.0 | Spark runtime |
| Jupyter Notebook | Latest | Development environment |
-
Big data processing using Apache Spark (PySpark)
-
NoSQL database integration with MongoDB
-
Spark SQL query optimization and performance analysis
-
Data cleaning and preprocessing pipelines
-
Interactive dashboard development using Streamlit
-
Data visualization using Matplotlib and Plotly
-
Distributed analytics workflow design
-
Team collaboration using Git and GitHub
CSV Dataset
↓
PySpark Data Cleaning
↓
MongoDB Storage & Indexing
↓
Spark SQL Analytics
↓
Visualization & Performance Analysis
↓
Streamlit Interactive Dashboard
## 📁 Project Structure
BIGDATAPROJECT/ │ ├── data/ │ └── Tweets.csv # Raw dataset │ ├── notebooks/ │ ├── 01_data_ingestion.ipynb # Load, clean & store in MongoDB │ ├── 02_spark_processing.ipynb # PySpark transformations & analysis │ ├── 03_mongodb_queries.ipynb # MongoDB aggregation queries │ ├── 04_spark_sql_queries.ipynb # Spark SQL & performance comparison │ ├── 05_visualization.ipynb # Charts & key insights │ └── 06_format_comparison.ipynb # CSV vs Parquet vs Avro │ ├── report/ │ ├── final_report.pdf # Final project report │ ├── chart1_sentiment_pie.png # Sentiment distribution │ ├── chart2_airline_sentiment.png # Sentiment by airline │ ├── chart3_negative_reasons.png # Top negative reasons │ ├── chart4_heatmap.png # Confidence heatmap │ └── chart5_performance.png # Spark vs MongoDB performance │ ├── schema/ │ └── mongo_schema.md # MongoDB schema design │ ├── app.py # Streamlit dashboard ├── .gitignore # Git ignore file ├── README.md # Project documentation └── requirements.txt # Python dependencies
---
## ⚙️ Setup & Installation
### Prerequisites
- Anaconda / Miniconda installed
- Java JDK 17 installed
- MongoDB 7.0 installed and running
### Step 1 — Create Conda Environment
```bash
conda create -n pyspark_env python=3.11
conda activate pyspark_env
pip install -r requirements.txtimport os
os.environ["JAVA_HOME"] = r"C:\Users\hp\AppData\Local\Programs\Eclipse Adoptium\jdk-17.0.18.8-hotspot"
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PYSPARK_PYTHON"] = r"C:\Users\hp\anaconda3\envs\pyspark_env\python.exe"Users should update JAVA_HOME and PYSPARK_PYTHON according to their own system before running the notebooks. Local file paths may differ across team members' computers, so paths should be adjusted if needed.
net start MongoDBconda activate pyspark_env
jupyter notebook| Order | Notebook | Description |
|---|---|---|
| 1st | 01_data_ingestion.ipynb |
Load CSV → clean → store in MongoDB |
| 2nd | 02_spark_processing.ipynb |
Spark transformations & analysis |
| 3rd | 03_mongodb_queries.ipynb |
MongoDB aggregation queries |
| 4th | 04_spark_sql_queries.ipynb |
Spark SQL & performance comparison |
| 5th | 05_visualization.ipynb |
Generate all charts |
| 6th | 06_format_comparison.ipynb |
CSV vs Parquet vs Avro (Google Colab) |
For best results, the notebooks should be run in the listed order because later steps depend on the cleaned data, MongoDB collection, and outputs generated in earlier notebooks.
conda activate pyspark_env
python -m streamlit run app.pyOpen browser at: http://localhost:8501
Database: tweets_db
Collection: airline_tweets
Fields:
tweet_id String
airline_sentiment String (positive / neutral / negative)
airline_sentiment_confidence Float
negativereason String (nullable)
airline String
retweet_count Integer
text String
clean_text String
tweet_created String
tweet_location String (nullable)
Indexes (6 total):
airline Single field
airline_sentiment Single field
negativereason Single field
tweet_id Single field
airline + airline_sentiment Compound
| Sentiment | Count | Percentage |
|---|---|---|
| Negative | 9,110 | 63% |
| Neutral | 3,069 | 21% |
| Positive | 2,299 | 16% |
| Rank | Airline | Negative | Positive |
|---|---|---|---|
| 1 (Worst) | United | 2,630 | 492 |
| 2 | US Airways | 2,263 | 269 |
| 3 | American | 1,863 | 307 |
| 4 | Southwest | 1,185 | 570 |
| 5 | Delta | 953 | 544 |
| 6 (Best) | Virgin America | 181 | 152 |
| Rank | Reason | Count |
|---|---|---|
| 1 | Customer Service Issue | 2,883 |
| 2 | Late Flight | 1,648 |
| 3 | Can't Tell | 1,175 |
| 4 | Cancelled Flight | 829 |
| 5 | Lost Luggage | 717 |
| Tool | Best For |
|---|---|
| MongoDB | Simple aggregations with indexed fields |
| Spark SQL | Complex multi-column analytical queries |
| Format | Write Speed | Read Speed | File Size | Best For |
|---|---|---|---|---|
| CSV | Slowest | Slowest | Largest | Human readable, compatibility |
| Parquet | Medium | Fastest | Smallest | Analytics, fast reads |
| Avro | Fast | Medium | Medium | Streaming, schema evolution |
- Live connection to MongoDB
- Sidebar filters by airline and sentiment
- Interactive pie chart, bar charts, heatmap
- Airline scorecard table with positive/negative percentages
- Raw data viewer with 100 tweets
- Run with:
python -m streamlit run app.py
- Tested on Google Colab (Linux environment)
- Used Apache Spark 4.0.2 with fastavro library
- Compared write speed, read speed and file size
- Parquet proved fastest for analytics workloads
- Avro best suited for streaming pipelines
- CSV most compatible and human readable
pyspark
pymongo
pandas
matplotlib
seaborn
numpy
streamlit
plotly
fastavro
Install with:
pip install -r requirements.txt- Loads raw CSV using PySpark
- Cleans text: removes URLs, @mentions, special characters
- Removes duplicates and null values
- Stores 14,478 clean records in MongoDB
- Creates 6 optimized indexes
- Overall sentiment distribution analysis
- Sentiment breakdown by individual airline
- Word frequency analysis on negative tweets (with stopword removal)
- Most retweeted tweets identification
- Average sentiment confidence scores per airline
- Aggregation pipeline for sentiment counts
- Airline-level sentiment breakdown
- Top 10 negative complaint reasons
- Index performance testing and timing
- Best and worst airline identification
- Sentiment distribution with percentage using window functions
- Airline scorecard showing positive/neutral/negative breakdown
- Top negative reason per airline
- Hourly tweet activity pattern analysis
- Spark SQL vs MongoDB performance comparison (3-run average)
- Pie chart: overall sentiment distribution
- Grouped bar chart: sentiment by airline
- Horizontal bar chart: top 10 negative reasons
- Heatmap: average sentiment confidence by airline
- Performance bar chart: Spark SQL vs MongoDB
- Write and read CSV, Parquet, Avro formats
- Compare write speed, read speed and file sizes
- Final summary table with format recommendations
- Run on Google Colab for Linux environment support
- Interactive dashboard using Streamlit ✅
- Compare CSV vs Parquet vs Avro formats ✅
- Real-time streaming with Spark Streaming
- ML sentiment classifier using Spark MLlib
- Deploy on cloud (AWS / GCP) for full Hadoop support
- Batch vs stream processing comparison
This project is for academic purposes only.
- Dataset: Kaggle — Twitter US Airline Sentiment
- Apache Spark Documentation: https://spark.apache.org/docs/latest/
- MongoDB Documentation: https://www.mongodb.com/docs/
- Streamlit Documentation: https://docs.streamlit.io/