🚀 Migrate legacy mainframe data to a modern Hadoop ecosystem, automating ingestion, transformation, and validation for scalable storage and analytics.
Updated Dec 13, 2025 - Python
📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
An efficient quick-start tool for building a Raspberry Pi (or other Debian-based) cluster with popular ecosystems such as Hadoop and Spark.
webhdfsmagic is a Python package that provides IPython magic commands to interact with HDFS via WebHDFS/Knox.
A distributed Big Data analytics system using Hadoop MapReduce (Python) and a custom Tkinter GUI to process the MovieLens dataset (Z-Score, IQR, Skewness).
A practical coursework-style project from my Master's studies in Big Data Analytics at the University of East London, showcasing hands-on use of big data tools and techniques on a real-world cyber-security dataset.
The project aims to design and implement an advanced data pipeline for an e-commerce platform, addressing complexities such as schema evolution, incremental updates, and data transformation to support analytics and scalability requirements.
A cloud-ready smart city analytics dashboard built with Flask, Plotly, and Python. Features air quality, crime, and ICT data visualizations with a modular architecture for Big Data and cloud integration.
Large-Scale Data Pipeline Migration from Mainframe to Hadoop | Hadoop | Spark | Hive | Sqoop | Oozie | MySQL. Migrated a legacy mainframe data warehouse to a modern Hadoop-based big data ecosystem, enabling scalable storage, faster analytics, and automated workflows.
450+ AWS, Hadoop, Cloud, Kafka, Docker, Elasticsearch, RabbitMQ, Redis, HBase, Solr, Cassandra, ZooKeeper, HDFS, Yarn, Hive, Presto, Drill, Impala, Consul, Spark, Jenkins, Travis CI, Git, MySQL, Linux, DNS, Whois, SSL Certs, Yum Security Updates, Kubernetes, Cloudera etc...
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
An open source experimental application aiming to simplify working with remote heterogeneous analytics and storage services via the file system of the Linux operating system. Published at EDBT/ICDT 2021.
🔍 Model Context Protocol (MCP) server for Apache Ambari API integration. This project provides tools for managing Hadoop clusters, including service operations, configuration management, status monitoring, and request tracking.
This project implements a real-time credit card fraud detection system using big data technologies. It simulates a production-grade fraud detection pipeline in which credit card transactions are streamed through Apache Kafka, classified in real time by a trained Mahout Random Forest model, and stored in separate databases according to the fraud prediction.
Real-Time Grid Monitoring System is an end-to-end data pipeline and analytics platform that enables live monitoring, analysis, and visualization of electrical grid performance.