This project builds a big data pipeline to process raw traffic and weather data for the city of Toronto. Using Hadoop and Spark, we transform and clean the data, then use it to train a machine learning model to predict traffic congestion levels.
- Set up a big data environment using Hadoop and Spark
- Clean and process large traffic and weather datasets
- Merge datasets to form a unified source for ML
- Build and evaluate a prediction model
- Visualize and access results through Jupyter
- Daily traffic counts from 2022 to 2024 were collected for over 1100 traffic signal locations across Toronto.
- The original dataset had 335 rows and 1,100 columns: each column held the daily traffic counts for one signal location.
- To make it usable for time-series analysis and machine learning, the dataset was converted to a long format (~292,611 rows); a PySpark sketch of this reshape follows below.
- Example intersections covered include:
- YORK ST / BREMNER BLVD / RAPTORS WAY
- SPADINA AVE / FRONT ST W
- EGLINTON AVE E / DON MILLS RD
- SHEPPARD AVE E / MCCOWAN RD
- YONGE ST / DUNDAS ST
- and 100+ more across all Toronto boroughs.
Source
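For orientation, the wide-to-long reshape can be expressed in PySpark with the SQL `stack` generator. This is a minimal sketch, not the project's exact code: the input file name and the assumption that the wide file has one `date` column plus one column per location are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("melt-traffic").getOrCreate()

# Hypothetical wide file: a `date` column plus one column per signal location.
wide = spark.read.csv("raw_traffic_wide.csv", header=True)
location_cols = [c for c in wide.columns if c != "date"]

# stack(n, 'name1', `col1`, ...) emits one (traffic_camera, traffic_count)
# row per location column, turning 335 x 1,100 cells into ~292k rows.
pairs = ", ".join(f"'{c}', `{c}`" for c in location_cols)
long_df = wide.select(
    "date",
    F.expr(f"stack({len(location_cols)}, {pairs}) as (traffic_camera, traffic_count)"),
).where(F.col("traffic_count").isNotNull())  # drop days a location reported nothing
```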
- Daily weather records were collected from Environment Canada for 2022–2024.
- Each day's record included temperature, precipitation, wind gusts, and quality flags.
- Sample fields include:
`Max Temp (°C)`, `Min Temp (°C)`, `Total Rain (mm)`, `Snow on Grnd (cm)`, `Dir of Max Gust (10s deg)`, `Spd of Max Gust (km/h)`
Source
- After transformation and merging, the final dataset (`final_traffic_weather.csv`) had:
  - 37 columns
  - 292,611 rows
- Fields included: `date`, `traffic_camera`, `traffic_count`, `Longitude (x)`, `Latitude (y)`, plus all weather features listed above
- Suitable for training supervised ML models such as Random Forest (a minimal sketch of the merge that produces this dataset follows below)
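The merge itself is a join on the shared `date` key. Below is a minimal sketch using the file names from the outputs table later in this document; the join type is an assumption, and the real logic lives in `run_pipeline.py`'s `run_merge(spark)` stage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-traffic-weather").getOrCreate()

traffic = spark.read.parquet("cleaned_traffic.parquet")
weather = spark.read.parquet("cleaned_weather.parquet")

# Every traffic row for a given day picks up that day's weather features.
final_df = traffic.join(weather, on="date", how="inner")
final_df.write.mode("overwrite").parquet("final_traffic_weather.parquet")
```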
```bash
# Hadoop
cd ~/hadoop-3.4.1/sbin
./start-dfs.sh
./start-yarn.sh

# Spark
cd /opt/spark/sbin
start-master.sh
start-worker.sh spark://<your-machine-name>:7077

# Check
jps   # confirm processes like NameNode, DataNode, ResourceManager, etc.
```

```bash
python3 -m venv spark-venv
source spark-venv/bin/activate
```
```bash
# Reset and recreate the HDFS input directory, then upload the raw CSVs
hdfs dfs -rm -r /user/hdoop/toronto_traffic/input
hdfs dfs -rm -r /user/hdoop/toronto_traffic/
hdfs dfs -mkdir /user/hdoop/toronto_traffic
hdfs dfs -mkdir /user/hdoop/toronto_traffic/input
hdfs dfs -put path/to/*.csv /user/hdoop/toronto_traffic/input
```
```bash
export PYSPARK_PYTHON=/home/hdoop/spark-venv/bin/python
spark-submit run_pipeline.py
```

`run_pipeline.py` runs these stages in order:

- `run_transformation(spark)`: reads and reshapes traffic data into long format
- `run_ingestion(spark)`: reads and combines weather and traffic into Parquet
- `run_preprocessing(spark)`: filters Toronto records, fills nulls
- `run_merge(spark)`: joins weather and traffic on `date`
- `run_saving(spark)`: converts Parquet to CSV
Avoid creating multiple Spark sessions across files: create one `SparkSession` in `run_pipeline.py` and pass it to each stage, using imports and function calls instead of `os.system`. A sketch follows below.
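A minimal shape for `run_pipeline.py` under that rule might look like this; the module names are assumptions for illustration, while the stage functions are the ones listed above.

```python
# run_pipeline.py -- one SparkSession shared by every stage
from pyspark.sql import SparkSession

# Hypothetical module layout; import the stages instead of os.system() calls.
from transformation import run_transformation
from ingestion import run_ingestion
from preprocessing import run_preprocessing
from merge import run_merge
from saving import run_saving

def main():
    spark = SparkSession.builder.appName("toronto-traffic-pipeline").getOrCreate()
    try:
        run_transformation(spark)   # wide CSV -> long format
        run_ingestion(spark)        # raw CSVs -> Parquet
        run_preprocessing(spark)    # filter Toronto records, fill nulls
        run_merge(spark)            # join traffic and weather on `date`
        run_saving(spark)           # Parquet -> CSV
    finally:
        spark.stop()

if __name__ == "__main__":
    main()
```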
```bash
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
```

Make this permanent by appending these lines to `~/.bashrc`.
```bash
hdfs dfs -getmerge /user/hdoop/toronto_traffic/input/final_traffic_weather.csv final_traffic_weather.csv
```

```bash
pip install notebook
jupyter notebook
```

Copy the URL shown in the terminal and open it in your browser.
```bash
pip install pyspark ipykernel
python -m ipykernel install --user --name=spark-venv --display-name "Spark (PySpark)"
```

```bash
spark-submit engineer_balance_export.py
spark-submit clean_nulls_from_csv.py
spark-submit combine.py
hdfs dfs -put combined.csv /user/hdoop/toronto_traffic/input
spark-submit predict_final_pipeline.py
hdfs dfs -get /user/hdoop/toronto_traffic/output/final_predictions_csv
hdfs dfs -get /user/hdoop/toronto_traffic/output/final_rf_model
```

Open `predict_final_pipeline_analysis.ipynb` in the `Iteration3` folder.
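For context before the numbers below, here is a stripped-down sketch of what a Spark MLlib Random Forest stage like `predict_final_pipeline.py` typically looks like. The label column `congestion_level` and the feature subset are assumptions for illustration, not the project's exact code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("rf-congestion").getOrCreate()

df = spark.read.csv("hdfs:///user/hdoop/toronto_traffic/input/combined.csv",
                    header=True, inferSchema=True)

# Hypothetical feature subset; the real script likely uses more of the 37 fields.
features = ["Max Temp (°C)", "Min Temp (°C)", "Total Rain (mm)", "Spd of Max Gust (km/h)"]
assembled = VectorAssembler(inputCols=features, outputCol="features").transform(df)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="congestion_level",  # assumed label name
                               featuresCol="features").fit(train)

preds = model.transform(test)
acc = MulticlassClassificationEvaluator(labelCol="congestion_level",
                                        metricName="accuracy").evaluate(preds)
print(f"Accuracy: {acc:.4f}")

model.write().overwrite().save("hdfs:///user/hdoop/toronto_traffic/output/final_rf_model")
```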
Classification Report:
- Accuracy: 0.6420
- F1 Score: 0.6373
- Precision: 0.6494
- Recall: 0.6420
Confusion Matrix:

```
[[11488 10241]
 [ 5344 16459]]
```
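As a sanity check, the headline metrics can be recomputed from the confusion matrix alone. The sketch below assumes rows are true labels, columns are predictions, and that the report uses support-weighted averaging (the reported numbers are consistent with that):

```python
import numpy as np

cm = np.array([[11488, 10241],
               [ 5344, 16459]])  # rows: true class, cols: predicted class

accuracy = np.trace(cm) / cm.sum()        # (11488 + 16459) / 43532 = 0.6420
precision = np.diag(cm) / cm.sum(axis=0)  # per-class precision
recall = np.diag(cm) / cm.sum(axis=1)     # per-class recall
f1 = 2 * precision * recall / (precision + recall)

weights = cm.sum(axis=1) / cm.sum()       # class support as weights
print(f"Accuracy:  {accuracy:.4f}")                      # 0.6420
print(f"Precision: {(precision * weights).sum():.4f}")   # 0.6494
print(f"Recall:    {(recall * weights).sum():.4f}")      # 0.6420
print(f"F1 Score:  {(f1 * weights).sum():.4f}")          # 0.6373
```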
```bash
pip install notebook pyspark ipykernel
python -m ipykernel install --user --name=spark-venv --display-name "Spark (PySpark)"
jupyter notebook
```

Open and explore the results using the notebook: `predict_final_pipeline_analysis.ipynb`
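Inside the notebook, note that Spark writes `final_predictions_csv/` as a directory of part files rather than a single CSV. One way to load it with pandas, assuming the predictions were written with headers, is:

```python
import glob
import pandas as pd

# Part file names vary by run, so glob rather than hard-coding them.
parts = sorted(glob.glob("final_predictions_csv/part-*.csv"))
preds = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
preds.head()
```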
| File | Format | Purpose |
|---|---|---|
| transformed_traffic_data/ | CSV | Long format traffic |
| raw_traffic.parquet | Parquet | Clean input |
| raw_weather.parquet | Parquet | Weather (3 years) |
| cleaned_traffic.parquet | Parquet | Filled & filtered |
| cleaned_weather.parquet | Parquet | Weather cleaned |
| final_traffic_weather.parquet | Parquet | Merged |
| final_traffic_weather.csv | CSV | Easy access |
| combined.csv | CSV | Cleaned, engineered |
| final_predictions_csv/ | CSV | Model predictions |
| final_rf_model/ | Binary | Trained model |
- Keep environment variables consistent (`JAVA_HOME`, `PATH`, etc.)
- Avoid creating multiple Spark sessions
- Always deactivate and reactivate your virtual environment if unexpected errors occur