apache-spark
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 2,126 public repositories matching this topic...
Install Spark, Kafka, Cassandra, Zookeeper
Updated Feb 20, 2017 - Python
Implemented parallel and distributed algorithms using OpenMP, Apache Spark and NVIDIA CUDA
Updated Apr 6, 2018 - C++
Updated Aug 24, 2017 - HTML
Updated Nov 10, 2018 - Scala
A content based movie recommendation system
Updated Mar 27, 2019 - Jupyter Notebook
An engineering process for data science and big data processing
Updated Dec 8, 2022 - Jupyter Notebook
Big data analysis of 'shared-world' cloud application.
Updated Jul 8, 2020 - Jupyter Notebook
Repository for my software development and data science portfolio.
Updated Jul 25, 2025 - CSS
Analysis of weather data records from 1985-01-01 to 2014-12-31 for weather stations in Nebraska, Iowa, Illinois, Indiana, or Ohio.
Updated Sep 15, 2023 - Python
Notebooks for Python and Spark for Big Data
Updated Mar 14, 2023 - Jupyter Notebook
You can do a lot of things with Apache Spark. What I've done here is to work with a static file and create a Batch ETL system.
Updated Aug 2, 2021 - Python
Updated Jan 2, 2024
This project was completed as part of the "Advanced Big Data" course at Nile University.
Updated Jul 17, 2025 - Jupyter Notebook
A Twitter Stream Processing Pipeline with ingestion, processing, storage, and visualization.
Updated Jan 4, 2025 - Scala
Updated Mar 26, 2024 - Python
Final challenge developed for the Porto Digital Residency at the company A3Data, where we had the opportunity to work on building a Data Lake and its layers (Bronze, Silver, and Gold) to create dashboards and analyses.
Updated Sep 4, 2024 - Jupyter Notebook
A coursework-style project from my Master's studies in Machine Learning on Big Data (University of East London), implementing distributed word embeddings and K-Means topic clustering on a large-scale news dataset using PySpark, and extending the trained models to a real-time Structured Streaming pipeline.
Updated Nov 19, 2025 - Python
Created by Matei Zaharia
Released May 26, 2014
- Followers: 435 followers
- Repository: apache/spark
- Website: github.com/topics/spark
- Wikipedia