Install Spark, Kafka, Cassandra, Zookeeper
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Created by Matei Zaharia, it was first released on May 26, 2014.
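A minimal sketch of that programming model, assuming only a local `pip install pyspark`; the aggregation below is parallelized across partitions without any explicit thread or process management:

```python
from pyspark.sql import SparkSession, functions as F

# "local[*]" runs Spark locally using all available CPU cores.
spark = SparkSession.builder.master("local[*]").appName("parallelism-demo").getOrCreate()

# One million rows, automatically split into partitions.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 3)

# The aggregation runs per partition first, then merges the partial results.
df.groupBy("bucket").agg(F.count("*").alias("n"), F.avg("id").alias("mean")).show()

spark.stop()
```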
Analysis of weather data records from 1985-01-01 to 2014-12-31 for weather stations in Nebraska, Iowa, Illinois, Indiana, or Ohio.
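A hedged sketch of the kind of query involved, assuming the records sit in a CSV with hypothetical `state`, `date`, and `tmax` columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weather").getOrCreate()

# Hypothetical input layout: station_id, state, date (yyyy-MM-dd), tmax, tmin.
weather = spark.read.csv("weather_records.csv", header=True, inferSchema=True)

# Keep only the five states and the 30-year window from the description.
midwest = weather.filter(
    F.col("state").isin("NE", "IA", "IL", "IN", "OH")
    & F.col("date").between("1985-01-01", "2014-12-31")
)
midwest.groupBy("state").agg(F.avg("tmax").alias("avg_tmax")).show()
```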
This project links a MongoDB cluster and a Kafka cluster to a standalone PySpark cluster, all running locally.
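A rough sketch of one leg of that wiring, assuming the MongoDB Spark connector (10.x option names) and the Spark Kafka package are on the classpath; the database, collection, and topic names are made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mongo-kafka-bridge").getOrCreate()

# Read a collection through the MongoDB Spark connector.
docs = (spark.read.format("mongodb")
        .option("connection.uri", "mongodb://localhost:27017")
        .option("database", "appdb")
        .option("collection", "events")
        .load())

# Serialize each row as JSON and publish it to a local Kafka topic.
(docs.select(F.to_json(F.struct(*docs.columns)).alias("value"))
 .write.format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("topic", "mongo_events")
 .save())
```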
You can do a lot of things with Apache Spark. What I've done here is work with a static file and build a batch ETL system.
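A minimal sketch of such a batch ETL job, with a hypothetical input path and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-etl").getOrCreate()

# Extract: read a static CSV file (path and columns are assumptions).
raw = spark.read.csv("events.csv", header=True, inferSchema=True)

# Transform: drop malformed rows and derive a date column.
clean = (raw.dropna(subset=["user_id", "timestamp"])
            .withColumn("event_date", F.to_date("timestamp")))

# Load: write partitioned Parquet for downstream consumers.
clean.write.mode("overwrite").partitionBy("event_date").parquet("output/events")

spark.stop()
```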
Simulates a real-time Smart City data pipeline with Kafka, Apache Spark, and S3. Streams and processes vehicle, GPS, weather, traffic, and emergency data with Dockerized components and Parquet storage for efficient, scalable data engineering.
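One plausible slice of that architecture as a sketch: a Structured Streaming query from Kafka to Parquet on S3 (topic, bucket, and paths are hypothetical; `s3a://` URLs assume hadoop-aws is configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smart-city").getOrCreate()

# Read vehicle events from a Kafka topic as an unbounded stream.
vehicles = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "vehicle_data")
            .load())

# Persist the raw stream as Parquet on S3; the checkpoint directory
# lets the query recover its progress after a restart.
query = (vehicles.selectExpr("CAST(value AS STRING) AS json")
         .writeStream.format("parquet")
         .option("path", "s3a://smart-city-bucket/vehicles/")
         .option("checkpointLocation", "s3a://smart-city-bucket/checkpoints/vehicles/")
         .start())
query.awaitTermination()
```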
A coursework-style project from my Master's studies in Machine Learning on Big Data (University of East London), implementing distributed word embeddings and K-Means topic clustering on a large-scale news dataset using PySpark, and extending the trained models to a real-time Structured Streaming pipeline.
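A toy-scale sketch of the batch half of that pipeline (the corpus here is three made-up headlines; real settings would need a larger `vectorSize` and corpus):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("news-topics").getOrCreate()

# Hypothetical corpus: one news article per row in a "text" column.
news = spark.createDataFrame([
    ("stocks rally as markets rebound",),
    ("team wins championship final",),
    ("new vaccine trial shows promise",),
], ["text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(news)

# Train word embeddings; each document is averaged into a single vector.
w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="words", outputCol="features")
vectors = w2v.fit(tokens).transform(tokens)

# Cluster the document vectors into topics.
model = KMeans(k=3, seed=42, featuresCol="features").fit(vectors)
model.transform(vectors).select("text", "prediction").show(truncate=False)
```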
Perform sentiment analysis on the Yelp dataset with Apache Spark.
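A minimal sketch of one common approach (not necessarily this repo's): TF-IDF features feeding a logistic regression classifier, with a few made-up reviews standing in for the Yelp data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("yelp-sentiment").getOrCreate()

# Hypothetical labeled reviews: 1.0 = positive, 0.0 = negative.
reviews = spark.createDataFrame([
    ("great food and friendly staff", 1.0),
    ("terrible service, never again", 0.0),
    ("loved the atmosphere", 1.0),
    ("cold and bland dishes", 0.0),
], ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=20),
])
model = pipeline.fit(reviews)
model.transform(reviews).select("text", "prediction").show(truncate=False)
```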
An End-to-End Real-time Data Pipeline using Debezium (CDC) to stream changes from PostgreSQL to Kafka, processed by Apache Spark (Structured Streaming), and sunk into ClickHouse for analytics. Orchestrated by Airflow and fully containerized with Docker Compose.
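A sketch of the Spark stage under stated assumptions: a simplified Debezium envelope, a made-up topic name, and a ClickHouse JDBC sink (requires the ClickHouse JDBC driver on the classpath; the driver class and URL are assumptions):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("cdc-to-clickhouse").getOrCreate()

# Debezium change events arrive as JSON on a Kafka topic.
changes = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "pg.public.orders")
           .load())

# Simplified Debezium envelope: "after" holds the new row state.
after = StructType([StructField("id", IntegerType()),
                    StructField("status", StringType())])
envelope = StructType([StructField("after", after),
                       StructField("op", StringType())])

rows = (changes
        .select(F.from_json(F.col("value").cast("string"), envelope).alias("e"))
        .select("e.after.*", "e.op"))

# Write each micro-batch to ClickHouse over JDBC.
def write_batch(batch_df, batch_id):
    (batch_df.write.format("jdbc")
     .option("url", "jdbc:clickhouse://localhost:8123/analytics")
     .option("dbtable", "orders")
     .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
     .mode("append").save())

query = (rows.writeStream.foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/cdc-ckpt").start())
query.awaitTermination()
```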
This project automates the extraction of university course details (e.g., schedules, professors, course codes) from text files using regex patterns and a spaCy NLP model, processes them with PySpark, and loads the structured data into Snowflake for easy querying. The entire pipeline is containerized with Docker.
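A sketch of just the regex-extraction step in PySpark (the patterns, sample lines, and column names are invented; the spaCy and Snowflake stages are omitted):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("course-extract").getOrCreate()

# Hypothetical raw lines from a course catalog text file.
raw = spark.createDataFrame([
    ("CS101 Intro to Programming Mon 09:00 Prof. Smith",),
    ("MATH220 Linear Algebra Wed 14:00 Prof. Jones",),
], ["line"])

# Pull course code, day, and time out of each line with regexp_extract.
courses = raw.select(
    F.regexp_extract("line", r"^([A-Z]+\d+)", 1).alias("course_code"),
    F.regexp_extract("line", r"\b(Mon|Tue|Wed|Thu|Fri)\b", 1).alias("day"),
    F.regexp_extract("line", r"(\d{2}:\d{2})", 1).alias("time"),
)
courses.show(truncate=False)

# Loading into Snowflake would typically go through the Spark-Snowflake
# connector ("net.snowflake.spark.snowflake") with account credentials.
```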
Real-time streaming data analysis pipeline that integrates Apache Spark's streaming library to read records from a Kafka topic.
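A minimal Structured Streaming sketch of that read path, assuming JSON payloads with a hypothetical schema and topic name:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical JSON payload schema for records on the topic.
schema = StructType([StructField("sensor_id", StringType()),
                     StructField("reading", DoubleType())])

records = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "readings")  # topic name is an assumption
           .option("startingOffsets", "latest")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
           .select("r.*"))

# Print each micro-batch to stdout for inspection.
query = records.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```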
A forecasting project built on Apache Spark and implemented with a Naive Bayes classifier.
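A toy sketch of Naive Bayes in Spark ML with invented weather-style features (the model requires non-negative feature values, which these [0, 1] inputs satisfy):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("nb-forecast").getOrCreate()

# Hypothetical features: humidity, pressure -> label 1.0 = rain.
data = spark.createDataFrame([
    (0.9, 0.2, 1.0),
    (0.3, 0.8, 0.0),
    (0.8, 0.3, 1.0),
    (0.2, 0.9, 0.0),
], ["humidity", "pressure", "label"])

features = VectorAssembler(inputCols=["humidity", "pressure"],
                           outputCol="features").transform(data)

model = NaiveBayes(smoothing=1.0).fit(features)
model.transform(features).select("label", "prediction").show()
```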
Scalable Book Recommender System - Apache Spark MLlib
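The standard MLlib tool for this is ALS collaborative filtering; a minimal sketch with made-up explicit ratings:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("book-recs").getOrCreate()

# Hypothetical explicit ratings: (user_id, book_id, rating).
ratings = spark.createDataFrame([
    (1, 10, 5.0), (1, 20, 3.0),
    (2, 10, 4.0), (2, 30, 5.0),
    (3, 20, 2.0), (3, 30, 4.0),
], ["user_id", "book_id", "rating"])

# Matrix factorization via alternating least squares.
als = ALS(userCol="user_id", itemCol="book_id", ratingCol="rating",
          rank=8, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-3 book recommendations per user.
model.recommendForAllUsers(3).show(truncate=False)
```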
Pinterest's experiment analytics data pipeline which runs thousands of experiments per day and crunches billions of datapoints to provide valuable insights to improve the product.
ML model deployment app I contributed to via the MLH Fellowship.
A scalable marketing analytics pipeline built with Apache Spark and Delta Lake, designed to process, transform, and export data for advanced business insights.
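A sketch of the Delta Lake side of such a pipeline, assuming the delta-spark package is installed (the session extensions below are what Delta requires; table paths and columns are invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("marketing-delta")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical campaign events.
events = spark.createDataFrame([
    ("c1", "click"), ("c1", "impression"), ("c2", "click"),
], ["campaign_id", "event"])

# Write to a Delta table; the transaction log gives ACID guarantees.
events.write.format("delta").mode("overwrite").save("/tmp/delta/campaign_events")

# Read it back and aggregate for export.
(spark.read.format("delta").load("/tmp/delta/campaign_events")
 .groupBy("campaign_id").agg(F.count("*").alias("n_events"))
 .show())
```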
Homework for the Big Data Computation course: uses Apache Spark and the MapReduce algorithm to extract information from a dataset.
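The canonical MapReduce pattern on a Spark RDD, shown with an inline toy dataset: map emits (key, 1) pairs and reduceByKey aggregates them per key across partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-hw").getOrCreate()
sc = spark.sparkContext

# Map phase: one (word, 1) pair per token; reduce phase: sum per word.
lines = sc.parallelize(["to be or not to be", "to be is to do"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect()))
```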