hadoop
Here are 535 public repositories matching this topic...
🔍 Model Context Protocol (MCP) server for Apache Ambari API integration. This project provides tools for managing Hadoop clusters, including service operations, configuration management, status monitoring, and request tracking.
Updated Sep 9, 2025 - Python
This project implements an end-to-end tech stack for a data platform, intended for local development.
Updated Sep 8, 2025 - Python
📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
Updated Sep 9, 2025 - Python
A big data analytics project that integrates sales data from Flipkart, Amazon, and Meesho into a unified pipeline. Data is processed with Apache Spark, stored in MySQL, and visualized using Power BI/Tableau to uncover trends, top-selling products, and customer purchase patterns. Designed to support data-driven decision-making in e-commerce.
Updated Aug 22, 2025 - Python
A PySpark-based solution for cleaning and interpolating battery sensor data using forward/backward fill and Radial Basis Function (RBF) spatial interpolation. Outputs a clean, fully interpolated dataset in CSV format for advanced analysis.
Updated Aug 16, 2025 - Python
This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.
Updated Aug 16, 2025 - Python
A big data analytics project that integrates sales data from Flipkart, Amazon, and Meesho into a unified pipeline. Data is processed with Apache Spark, stored in MySQL, and visualized using Power BI/Tableau to uncover trends, top-selling products, and customer purchase patterns. Designed to support data-driven decision-making in e-commerce.
Updated Aug 15, 2025 - Python
Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ....
Updated Sep 9, 2025 - Python
Educational repository for learning NoSQL databases, distributed computing, and Big Data technologies through practical exercises with MongoDB, Hadoop, and HBase.
Updated Aug 5, 2025 - Python
This toolkit simulates and manages airport parking events. It provides a command-line interface (CLI) for managing vehicles, zones, and parking events, with PostgreSQL for data storage, SQL for advanced queries, and Apache Spark for batch processing of Parquet logs.
Updated Jul 19, 2025 - Python
A PySpark-based pipeline for detecting anomalies in energy consumption using unsupervised models (PCA, Isolation Forest, LOF). The system processes raw JSON data, aggregates monthly features, and identifies anomalous PODIDs using an ensemble approach, ready for production deployment.
Updated Jul 15, 2025 - Python
A container for a project that analyzes data from the Reddit API.
Updated Jul 10, 2025 - Python
Installation files for common tools used in data analysis, machine learning, and deep learning.
Updated Jul 6, 2025 - Python