Starred repositories
📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.
Public repository containing sample code for how to improve ETL ingestion processes with Apache Iceberg
The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many mo…
🚢 Docker images and configuration for Citus
Studying system design can be daunting. This gives you a table for studying different problems, understanding what components they require, their pros and cons, and how to mitigate the cons.
Code for the blog post at: https://www.startdataengineering.com/post/docker-for-de/
Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
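The most complete big data interview handbook for major tech companies: big data interview questions, big data interviews, Wang Aoqi's road to big data, the road to big data mastery, Flink/Spark/Hadoop/HBase/Hive/Impala/HBase/MapReduce/YARN/HDFS/Kafka/Flume/Linux/Java/Scala... interview questions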
21 Lessons, Get Started Building with Generative AI
Spark cluster in docker containers with sample training Jupyter notebooks
The Python source code for my Raspberry Pi 4 e-reader.
Solana Arbitrage Bot on pump.fun, Meteora, Raydium and Orca using Jito bundling, RPC and gRPC.
Scrapy environment with Tor for anonymous IP routing and Privoxy as an HTTP proxy
Scrapy middleware with TOR support for more robust scrapers or anonymous scraping.
Web crawler using a Tor proxy, Elasticsearch, and Scrapy
Scrapy spider to recursively crawl for TOR hidden services
Airflow TimeTable for Korean working days
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.