Stars
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
A curated list of Rust code and resources.
lakeFS - Data version control for your data lake | Git for data
An Open Standard for lineage metadata collection
Remote shuffle service for Apache Spark to store shuffle data on remote servers.
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
Morpheus brings the leading graph query language, Cypher, onto the leading distributed processing platform, Spark.
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discr…
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Avro2TF is designed to fill the gap of making users' training data ready to be consumed by deep learning training frameworks.
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive lea…
Chisel: A Modern Hardware Design Language
Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
A curated list of automated machine learning papers, articles, tutorials, slides and projects
Apache Pinot - A realtime distributed OLAP datastore
An open source Python library for automated feature engineering
A game-theoretic approach to explain the output of any machine learning model.
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while control…
Iceberg is a table format for large, slow-moving tabular data