apache-spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 554 public repositories matching this topic...
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
Updated Dec 16, 2025 · Python
This repository contains the material for the Udemy course "Big Data y Spark: ingeniería de datos con Python y pyspark". In the course, you will learn the tools and techniques needed to work with large datasets using the pyspark library.
Updated Dec 16, 2025 · Python
A big data analysis of Los Angeles crime and demographic data using Apache Spark on Kubernetes.
Updated Dec 16, 2025 · Python
This project is a centralized, fault-tolerant data engineering pipeline designed to ingest, process, and visualize user data generated in real time. It leverages Apache Airflow for orchestration, Kafka for message buffering, Spark Structured Streaming for high-speed processing, and Cassandra for storage. The final output is visualized in a Streamlit dashboard.
Updated Dec 16, 2025 · Python
Astronomy Broker based on Apache Spark
Updated Dec 14, 2025 · Python
Spark Structured Streaming data pipeline that processes movie ratings data in real-time.
Updated Dec 13, 2025 · Python
Python package for working with demand-side grid projects, datasets and queries
Updated Dec 13, 2025 · Python
This is my intro repository, which gives details about me and the work I do.
Updated Dec 12, 2025 · Python
MCP Server for Apache Spark History Server. The bridge between Agentic AI and Apache Spark.
Updated Dec 10, 2025 · Python
Creating a Real-Time Flight-info Data Pipeline with Kafka, Apache Spark, Elasticsearch and Kibana
Updated Dec 9, 2025 · Python
Flexible and scalable framework for data input and output operations in Spark applications. It offers a set of powerful tools and abstractions to simplify and streamline data processing pipelines.
Updated Dec 15, 2025 · Python
A scalable marketing analytics pipeline built with Apache Spark and Delta Lake, designed to process, transform, and export data for advanced business insights.
Updated Dec 6, 2025 · Python
Production-grade Basel III RWA calculation pipeline processing 120M+ records/day with Spark, Airflow, and AWS
Updated Dec 6, 2025 · Python
Dataproc templates and pipelines for solving in-cloud data tasks
Updated Dec 16, 2025 · Python
Production-grade CDC system processing 50M+ records/day with intelligent hybrid sync strategies
Updated Dec 5, 2025 · Python
Real-time AML transaction monitoring system processing 5M transactions/day with <3s latency. Built with Kafka, Spark Streaming, Delta Lake, AWS Glue, and Airflow. Features: rule-based detection, SCD Type 2 customer profiling, automated compliance reporting, and comprehensive data quality framework.
Updated Dec 4, 2025 · Python
Created by Matei Zaharia
Released May 26, 2014
435 followers
Repository: apache/spark
Website: github.com/topics/spark