Big data training material
Updated
Jun 29, 2023 - Python
A distributed file system that works like Hadoop, with minor changes. A fully working program that incorporates asynchronous file distribution and map and reduce components, along with its own command-line interface providing all the required commands.
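The asynchronous file distribution mentioned above can be sketched with Python's `asyncio`. This is a minimal illustration, not the repository's actual code: the node names, chunk format, and `send_chunk` transfer stub are all hypothetical stand-ins for real network I/O.

```python
import asyncio

async def send_chunk(node, chunk):
    # Placeholder for a network transfer; a real system would stream bytes
    # to a storage node here instead of just yielding control.
    await asyncio.sleep(0)
    return node, chunk

async def distribute(chunks, nodes):
    # Assign chunks to nodes round-robin and transfer them concurrently.
    tasks = [send_chunk(nodes[i % len(nodes)], c) for i, c in enumerate(chunks)]
    # gather() preserves task order, so placements line up with the input chunks.
    return await asyncio.gather(*tasks)

placements = asyncio.run(distribute([b"c0", b"c1", b"c2"], ["node-a", "node-b"]))
```

Because `gather` returns results in submission order, `placements` records which (hypothetical) node received each chunk.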
An academic project, part of the course "Principles of Big Data Management", to develop a system that stores, processes, analyses, and visualizes Twitter data using Apache Spark.
Exploring a best-practice Apache Spark working environment for robust data pipelines.
Apache Spark: from installation to performing advanced operations across the Apache Spark stack.
Data cleaning and profiling of NYC Open Data
This repository contains an Airflow DAG that orchestrates an incremental data pipeline using PySpark scripts. The pipeline automates daily data processing, syncs results to S3, performs housekeeping, and loops until a target date threshold is reached.
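The day-by-day loop driving such an incremental pipeline can be sketched in plain Python. This is an assumed simplification of the logic, not the repository's DAG: `process_day` and `sync_to_s3` are hypothetical callables standing in for the PySpark job and the S3 sync step.

```python
from datetime import date, timedelta

def run_incremental(start, target, process_day, sync_to_s3):
    """Process one day at a time until the target date threshold is reached."""
    current = start
    processed = []
    while current <= target:
        result = process_day(current)   # the real pipeline runs a PySpark script here
        sync_to_s3(result)              # the real pipeline syncs output to S3 here
        processed.append(current)
        current += timedelta(days=1)    # advance to the next daily increment
    return processed
```

In Airflow the same progression would typically be expressed through scheduled DAG runs rather than an in-process loop; the sketch only shows the incremental-until-threshold control flow.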
Real-time streaming: a Twitter data pipeline using big data tools.
A PySpark-based pipeline for detecting anomalies in energy consumption using unsupervised models (PCA, Isolation Forest, LOF). The system processes raw JSON data, aggregates monthly features, and identifies anomalous PODIDs using an ensemble approach, ready for production deployment.
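One component of such an ensemble, PCA-based anomaly scoring, can be sketched with NumPy alone: points with a large reconstruction error after projection onto the principal subspace are flagged as anomalous. This is a minimal illustration of the technique, not the repository's pipeline; the real system runs on PySpark and combines PCA with Isolation Forest and LOF.

```python
import numpy as np

def pca_anomaly_scores(X, n_components=1):
    # Center the data so the SVD recovers principal directions
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T           # (features, k) projection basis
    # Project onto the principal subspace, then map back to the original space
    recon = Xc @ V @ V.T + mu
    # Anomaly score: squared reconstruction error per row
    return ((X - recon) ** 2).sum(axis=1)
```

Rows that lie off the dominant linear structure get large scores; the ensemble approach described above would combine these scores with votes from the other detectors before flagging an ID.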
Repository containing python code for MapReduce jobs to answer questions about Udacity forum data.
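A MapReduce job of this kind follows the classic map/shuffle/reduce shape. The sketch below emulates it in pure Python; the record fields and the counting question (posts per author) are assumptions, not the repository's actual queries against the Udacity forum data.

```python
import itertools

def map_phase(records):
    # Emit one (key, 1) pair per forum post, keyed by its author
    for rec in records:
        yield rec["author_id"], 1

def reduce_phase(pairs):
    # Hadoop sorts pairs by key between map and reduce; sorted() emulates
    # that shuffle so groupby() sees each key's values contiguously.
    for key, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

posts = [{"author_id": "alice"}, {"author_id": "bob"}, {"author_id": "alice"}]
counts = dict(reduce_phase(map_phase(posts)))  # → {"alice": 2, "bob": 1}
```

With Hadoop Streaming, the same two functions would read and write tab-separated lines on stdin/stdout instead of Python iterables.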
Advanced Topics in Database Systems @ NTUA | 2022-2023