Big data training material
Updated
Jun 29, 2023 - Python
A distributed file system that works like Hadoop, with minor changes. A fully working program that incorporates asynchronous file distribution and map and reduce components, along with its own command-line interface providing all the required commands.
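The asynchronous file distribution mentioned above can be sketched with Python's `asyncio`. This is a minimal illustration, not the repository's actual code: the node names, chunk format, and `send_chunk` transfer stub are all hypothetical stand-ins for real network I/O.

```python
import asyncio

async def send_chunk(node, chunk):
    # Placeholder for a network transfer; a real system would stream bytes
    # to a storage node here instead of just yielding control.
    await asyncio.sleep(0)
    return node, chunk

async def distribute(chunks, nodes):
    # Assign chunks to nodes round-robin and transfer them concurrently.
    tasks = [send_chunk(nodes[i % len(nodes)], c) for i, c in enumerate(chunks)]
    # gather() preserves task order, so placements line up with the input chunks.
    return await asyncio.gather(*tasks)

placements = asyncio.run(distribute([b"c0", b"c1", b"c2"], ["node-a", "node-b"]))
```

Because `gather` returns results in submission order, `placements` records which (hypothetical) node received each chunk.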
An academic project, part of the course "Principles of Big Data Management", to develop a system that stores, processes, analyses, and visualizes Twitter data using Apache Spark.
Exploring a best-practice Apache Spark working environment for robust data pipelines.
Apache Spark: from installation to performing advanced operations across the Apache Spark stack.
Data cleaning and profiling of NYC Open Data
This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.
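The day-by-day loop driving such an incremental pipeline can be sketched in plain Python. This is an assumed simplification of the logic, not the repository's DAG: `process_day` and `sync_to_s3` are hypothetical callables standing in for the PySpark job and the S3 sync step.

```python
from datetime import date, timedelta

def run_incremental(start, target, process_day, sync_to_s3):
    """Process one day at a time until the target date threshold is reached."""
    current = start
    processed = []
    while current <= target:
        result = process_day(current)   # the real pipeline runs a PySpark script here
        sync_to_s3(result)              # the real pipeline syncs output to S3 here
        processed.append(current)
        current += timedelta(days=1)    # advance to the next daily increment
    return processed
```

In Airflow the same progression would typically be expressed through scheduled DAG runs rather than an in-process loop; the sketch only shows the incremental-until-threshold control flow.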
Real-time streaming: a Twitter data pipeline using big data tools.
A PySpark-based pipeline for detecting anomalies in energy consumption using unsupervised models (PCA, Isolation Forest, LOF). The system processes raw JSON data, aggregates monthly features, and identifies anomalous PODIDs using an ensemble approach, ready for production deployment.
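One component of such an ensemble, PCA-based anomaly scoring, can be sketched with NumPy alone: points with a large reconstruction error after projection onto the principal subspace are flagged as anomalous. This is a minimal illustration of the technique, not the repository's pipeline; the real system runs on PySpark and combines PCA with Isolation Forest and LOF.

```python
import numpy as np

def pca_anomaly_scores(X, n_components=1):
    # Center the data so the SVD recovers principal directions
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T           # (features, k) projection basis
    # Project onto the principal subspace, then map back to the original space
    recon = Xc @ V @ V.T + mu
    # Anomaly score: squared reconstruction error per row
    return ((X - recon) ** 2).sum(axis=1)
```

Rows that lie off the dominant linear structure get large scores; the ensemble approach described above would combine these scores with votes from the other detectors before flagging an ID.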
Repository containing python code for MapReduce jobs to answer questions about Udacity forum data.
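A MapReduce job of this kind follows the classic map/shuffle/reduce shape. The sketch below emulates it in pure Python; the record fields and the counting question (posts per author) are assumptions, not the repository's actual queries against the Udacity forum data.

```python
import itertools

def map_phase(records):
    # Emit one (key, 1) pair per forum post, keyed by its author
    for rec in records:
        yield rec["author_id"], 1

def reduce_phase(pairs):
    # Hadoop sorts pairs by key between map and reduce; sorted() emulates
    # that shuffle so groupby() sees each key's values contiguously.
    for key, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

posts = [{"author_id": "alice"}, {"author_id": "bob"}, {"author_id": "alice"}]
counts = dict(reduce_phase(map_phase(posts)))  # → {"alice": 2, "bob": 1}
```

With Hadoop Streaming, the same two functions would read and write tab-separated lines on stdin/stdout instead of Python iterables.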
Advanced Topics in Database Systems @ NTUA | 2022-2023