Full local MLOps stack with Airflow, Spark, ClearML, MinIO, and Jupyter — all for local experiments.
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
📘 FIWARE 306: Real-time Processing of Context Data using Apache Spark
This repository showcases how to set up a Scala Spark job on Docker and in Dataproc and run a broadcast join.
A curated list of awesome Apache Spark packages and resources.
📚🌊🎓 A third-year student self-studying Spark and Kafka as part of a 👷 data engineering journey, aiming to secure an 📬 internship or fresher job in 2024.
An implementation of Apache Spark (combined with PySpark and Jupyter Notebook) on top of a Hadoop cluster using Docker.
Ansible roles to install a standalone Spark cluster (HDFS/Spark/Jupyter Notebook) or an Ambari-based Spark cluster.
A rudimentary command-line utility for comparing Apache Spark event logs.
Apache Spark docker image
Host files and procedure for running Fink on Kubernetes
Driver/Executor images for spark-operator
Deploy Apache Spark in client mode on a Kubernetes cluster, integrated with Jupyter Notebook through a JupyterHub server.
This project builds a data pipeline to populate the user_behavior_metric table, an OLAP table meant to be used by analysts and dashboards.
A .NET for Apache Spark docker image (3rdman/dotnet-spark)
[PROJECT IS NO LONGER MAINTAINED] Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
CentOS-based container with a standalone Spark installation for working with larger-than-RAM datasets.
Created by Matei Zaharia
Released May 26, 2014