# apache-spark


Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
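To make the parallelism concrete, here is a minimal PySpark word-count sketch (the input lines and app name are placeholders): each transformation below is distributed across the cluster's partitions automatically, and lost partitions can be recomputed from the lineage of these transformations, which is where the fault tolerance comes from.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Placeholder data; in practice this would be read from distributed storage.
lines = spark.sparkContext.parallelize([
    "spark provides implicit data parallelism",
    "spark provides fault tolerance",
])

counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum counts per word across partitions
)

print(counts.collect())  # e.g. [('spark', 2), ('provides', 2), ...]
spark.stop()
```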

Here are 554 public repositories matching this topic...

A coursework-style project from my Master's studies in Machine Learning on Big Data (University of East London): it implements distributed word embeddings and K-Means topic clustering on a large-scale news dataset using PySpark, then extends the trained models to a real-time Structured Streaming pipeline (a minimal sketch follows below).

  • Updated Nov 19, 2025
  • Python
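A minimal sketch of the approach described above, using Spark MLlib's Tokenizer, Word2Vec, and KMeans. This is illustrative, not the repository's actual code; the two-row corpus stands in for the large-scale news dataset.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("news-topics-sketch").getOrCreate()

# Placeholder corpus; the project clusters a large news dataset.
df = spark.createDataFrame(
    [("markets rally as inflation cools",),
     ("new model tops the vision benchmark",)],
    ["text"],
)

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)

# Word2Vec averages the learned word vectors per row, yielding one
# document embedding per news item.
w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="tokens", outputCol="embedding")
embedded = w2v.fit(tokens).transform(tokens)

# K-Means groups the document embeddings into topic clusters.
clusters = KMeans(k=2, featuresCol="embedding", seed=42).fit(embedded)
clusters.transform(embedded).select("text", "prediction").show(truncate=False)
spark.stop()
```

The fitted models could then be applied inside a Structured Streaming query, since these transforms work row-wise and also accept streaming DataFrames.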

This project automates the extraction of university course details (e.g., schedules, professors, course codes) from text files using regex patterns and a spaCy NLP model, processes them with PySpark, and loads the structured data into Snowflake for easy querying. The entire pipeline is containerized with Docker (see the sketch below).

  • Updated Sep 30, 2024
  • Python
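A hypothetical sketch of the PySpark and Snowflake stages of such a pipeline (the regex pattern, paths, table name, and connection options are placeholders, and the spaCy step is omitted); writing assumes the Snowflake Spark connector, net.snowflake.spark.snowflake, is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("course-etl-sketch").getOrCreate()

raw = spark.read.text("courses/*.txt")  # placeholder input path

# Illustrative pattern for lines like
# "CS101 - Intro to Spark - Prof. Lee - Mon 9am".
pattern = r"^(\w+\d+)\s*-\s*(.+?)\s*-\s*(Prof\.\s*\w+)\s*-\s*(.+)$"
courses = raw.select(
    regexp_extract("value", pattern, 1).alias("course_code"),
    regexp_extract("value", pattern, 2).alias("title"),
    regexp_extract("value", pattern, 3).alias("professor"),
    regexp_extract("value", pattern, 4).alias("schedule"),
)

# Placeholder credentials; real values would come from the environment.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

(courses.write.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "COURSES")
    .mode("append")
    .save())
```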

Created by Matei Zaharia

Released May 26, 2014

Followers: 435
Repository: apache/spark
Website: github.com/topics/spark
Wikipedia: en.wikipedia.org/wiki/Apache_Spark

Related topics

hadoop, scala