Big data training material
Updated Jun 29, 2023 - Python
A distributed file system that works like Hadoop, with minor changes. The program is fully functional: it supports asynchronous file distribution, includes map and reduce components, and provides its own command-line interface with all the required commands.
An academic project for the course "Principles of Big Data Management" that develops a system to store, process, analyse, and visualize Twitter’s data using Apache Spark.
Exploring a best-case Apache Spark working environment for robust data pipelines
Real-Time Streaming: Twitter Data Pipeline Using Big Data Tools
A PySpark-based pipeline for detecting anomalies in energy consumption using unsupervised models (PCA, Isolation Forest, LOF). The system processes raw JSON data, aggregates monthly features, and identifies anomalous PODIDs using an ensemble approach, ready for production deployment.
Apache Spark: from installation to performing operations across the Apache Spark stack
Data cleaning and profiling of NYC Open Data
This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.
Big Data Programming Projects
Cloud Computing Projects
A PySpark-based solution for cleaning and interpolating battery sensor data using forward/backward fill and Radial Basis Function (RBF) spatial interpolation. Outputs a clean, fully interpolated dataset in CSV format for advanced analysis.
A big data analytics project that integrates sales data from Flipkart, Amazon, and Meesho into a unified pipeline. Data is processed with Apache Spark, stored in MySQL, and visualized using Power BI/Tableau to uncover trends, top-selling products, and customer purchase patterns. Designed to support data-driven decision-making in e-commerce.