- Updated Aug 24, 2017 - HTML
apache-spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 56 public repositories matching this topic...
This project offers a dual approach to understanding e-commerce customer behavior: batch data analysis and real-time data processing.
- Updated Jan 26, 2024 - HTML
Have you ever tried to guess the genre of a book by reading its title? In this project, I try to do exactly that using a massive database of books (their titles and genres), Spark MLlib, and three different ML models: (1) Support Vector Machine (SVM), (2) Logistic Regression, and (3) Neural Networks.
- Updated Sep 5, 2024 - HTML
Exploratory Data Analysis and ML model building using Apache Spark and PySpark
- Updated Oct 12, 2022 - HTML
A Big Data Engineering study using a dataset of all reported crimes in Chicago, IL, from 2001 to the present day.
- Updated Jun 7, 2020 - HTML
A PySpark Recommendation System for predicting Yelp ratings for 1.5M+ users and 200k+ businesses.
- Updated Feb 16, 2024 - HTML
“Word Count Analyzer” project using Apache Spark (PySpark)
- Updated Dec 13, 2025 - HTML
My solutions to the "Introduction to Big Data with Apache Spark" MOOC on edX
- Updated Jan 10, 2018 - HTML
A dockerized application for performing sentiment analysis, tag recognition, and text summarization on YouTube videos.
- Updated Oct 1, 2024 - HTML
The Data-Engineering-Project-Taxi-data involves analyzing and processing taxi data to extract valuable insights for business optimization. It includes building ETL pipelines, data modeling, and real-time analytics.
- Updated Feb 12, 2025 - HTML
2019 Canadian Federal Election: Calculating the results using Apache Spark (Databricks notebook in Scala)
- Updated Aug 17, 2021 - HTML
Taxi versus Uber in NYC
- Updated Dec 4, 2018 - HTML
This repository contains projects made for the Large Scale Data Analysis course at the AGH UST in 2024.
- Updated Feb 20, 2025 - HTML
Jupyter notebook analyzing the annual databases of road traffic injury accidents
- Updated Aug 9, 2023 - HTML
Managing distributed databases with the PySpark API of Apache Spark for the customer data of an e-commerce website, and predicting Yearly Amount Spent using linear regression
- Updated Sep 3, 2022 - HTML
The "Olympic Games Analytics Using Apache Spark Databricks" project explores data from the Olympic Games (1896-2016) to identify trends and insights. Using Apache Spark for big data processing and Databricks for visualization, the project analyzes key factors like top-performing countries and athlete attributes, showcasing real-world analytics.
- Updated Apr 10, 2025 - HTML
Spark ML pipeline for Seattle housing price prediction with feature engineering and regression modeling.
- Updated Dec 9, 2025 - HTML
Distributed processing challenge
- Updated Feb 18, 2023 - HTML
A Scala-based contract management app using Apache Spark. It processes contract data in various formats (JSON, CSV, Parquet, ORC, XML) and performs tasks like cost calculation, status determination, and data extraction. This is an educational fork of an existing project.
- Updated Nov 6, 2024 - HTML
Created by Matei Zaharia
Released May 26, 2014
- Followers: 435
- Repository: apache/spark
- Website: github.com/topics/spark
- Wikipedia