CentOS-based container with a standalone Spark installation for working with larger-than-RAM datasets.
Updated Dec 27, 2021 - Shell
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
This repository contains a simple Spark application for beginners.
Ubuntu base image provisioned mainly with Docker and Java
Full local MLOps stack with Airflow, Spark, ClearML, MinIO, and Jupyter — all for local experiments.
Raspberry Pi 4 based Hadoop cluster with Spark
Apache Spark cluster in Docker - https://hub.docker.com/r/giabar/gb-spark/
This repository shows how to set up a Scala Spark job on Docker and in Dataproc and run a broadcast join.
GCP Dataproc MapReduce sample with PySpark
First basic Big Data approach
Script and tools to build with Apache Bigtop
Exploring details of Motor Vehicle Collisions in New York City provided by the Police Department (NYPD).
Workshop material for Near Real-Time Predictive Analytics with Apache Spark Structured Streaming at the Open Data Science Conference West 2019
Examples of using Apache Spark MLlib Pipelines and Structured Streaming on version 2.4.0
This project builds a data pipeline to populate the user_behavior_metric table, an OLAP table meant for use by analysts and dashboards.
🤠
Edge2AI Workshop
Created by Matei Zaharia
Released May 26, 2014