# apache-spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
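
As a rough illustration of that interface, here is a minimal PySpark sketch of a distributed word count. The input path and application name are placeholders; the data parallelism and lineage-based recovery happen implicitly inside Spark.

```python
# Minimal PySpark sketch: the DataFrame operations below run in parallel across
# the cluster's executors, and failed partitions are recomputed from lineage.
# The input path is a placeholder.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.read.text("hdfs:///data/sample.txt")  # placeholder path

counts = (
    lines
    .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.col("count").desc())
)

counts.show(10)
spark.stop()
```

Transformations are lazy: nothing executes on the cluster until an action such as `show()` triggers a job.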

Here are 554 public repositories matching this topic...

This project is a centralized, fault-tolerant data engineering pipeline designed to ingest, process, and visualize user data generated in real time. It leverages Apache Airflow for orchestration, Kafka for message buffering, Spark Structured Streaming for high-speed processing, and Cassandra for storage. The final output is visualized in a Streamlit dashboard (see the streaming sketch below this entry).

  • Updated Dec 16, 2025
  • Python
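
A minimal sketch of the Kafka → Spark Structured Streaming → Cassandra leg described above, under stated assumptions: a JSON-encoded `users_created` topic, a `spark_streams.created_users` table, and hypothetical host names. None of these are taken from the repository, and the Kafka source and Spark Cassandra Connector packages must be on the Spark classpath.

```python
# Sketch of the Kafka -> Spark Structured Streaming -> Cassandra leg of such a
# pipeline. Topic name, JSON schema, hosts, and keyspace/table names are
# illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = (
    SparkSession.builder.appName("user-stream-sketch")
    .config("spark.cassandra.connection.host", "cassandra")  # hypothetical host
    .getOrCreate()
)

# Assumed shape of the JSON events on the topic.
schema = StructType([
    StructField("id", StringType()),
    StructField("first_name", StringType()),
    StructField("email", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "users_created")               # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    events.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")            # hypothetical keyspace
    .option("table", "created_users")               # hypothetical table
    .option("checkpointLocation", "/tmp/checkpoints/users")
    .start()
)
query.awaitTermination()
```

The checkpoint location is what gives the stream exactly-once bookkeeping across restarts; Airflow would sit outside this job, orchestrating ingestion into the Kafka topic.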

Real-time AML transaction monitoring system processing 5M transactions/day with <3s latency. Built with Kafka, Spark Streaming, Delta Lake, AWS Glue, and Airflow. Features include rule-based detection, SCD Type 2 customer profiling, automated compliance reporting, and a comprehensive data quality framework (see the SCD Type 2 sketch below this entry).

  • Updated Dec 4, 2025
  • Python
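
As an illustration of the SCD Type 2 profiling idea, here is a hedged sketch using Delta Lake's `MERGE` with the common staged-upsert pattern. The table paths, the change-feed source, and the columns (`customer_id`, `risk_rating`, `start_date`, `end_date`, `is_current`) are assumptions for the example, not the repository's actual schema, and `delta-spark` must be installed.

```python
# Hedged sketch of SCD Type 2 customer profiling with a Delta Lake MERGE.
# Paths, column names, and the change-feed source are illustrative assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("scd2-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

updates = spark.read.parquet("/data/customer_updates")   # hypothetical change feed
dim = DeltaTable.forPath(spark, "/delta/dim_customer")   # hypothetical Delta table
current = dim.toDF().where("is_current = true")

# Customers whose tracked attribute changed need two actions: expire the old row
# and insert a new version. Stage them twice -- once keyed for the update, once
# with a NULL merge key so they fall through to the insert clause.
changed = (
    updates.alias("u")
    .join(current.alias("c"), "customer_id")
    .where("u.risk_rating <> c.risk_rating")
    .select("u.*")
)

staged = (
    updates.withColumn("merge_key", F.col("customer_id"))
    .unionByName(
        changed.withColumn(
            "merge_key",
            F.lit(None).cast(updates.schema["customer_id"].dataType),
        )
    )
)

(
    dim.alias("t")
    .merge(staged.alias("s"), "t.customer_id = s.merge_key AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.risk_rating <> s.risk_rating",
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .whenNotMatchedInsert(
        values={
            "customer_id": "s.customer_id",
            "risk_rating": "s.risk_rating",
            "start_date": "current_date()",
            "end_date": "null",
            "is_current": "true",
        }
    )
    .execute()
)
```

The single merge both closes out the current version of a changed profile and inserts its new version (plus brand-new customers), which keeps the dimension's full history queryable for compliance reporting.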

Created by Matei Zaharia

Released May 26, 2014

  • Followers: 435
  • Repository: apache/spark
  • Website: github.com/topics/spark
  • Wikipedia

Related topics

  • hadoop
  • scala