📚 Explore a curated library for mastering Machine Learning, Deep Learning, and AI through free resources, courses, and tools for all levels.
Updated Dec 18, 2025
📚 Explore a growing library of free resources for learning Machine Learning, Deep Learning, and AI, covering essential concepts to advanced techniques.
Refine high-quality datasets and visual AI models
Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Interactively explore unstructured datasets from your dataframe.
Automatically find issues in image datasets and practice data-centric computer vision.
Notebooks demonstrating example applications of the cleanlab library
This data-centric AI repository implements a robust deep learning method (LFBNet) for fully automated tumor segmentation in whole-body ¹⁸F-FDG PET/CT images.
[ML4H 2023] 🧼🔎 SelfClean-revised versions of benchmark datasets for more reliable performance estimation.
robust-nli-analysis robust-NLP-analysis
3LC Integration with Ultralytics YOLO
Unboxing the Geometry of Knowledge Graphs: Analyzing training dynamics and solving topological traps via a Data-Centric approach
What's In My Human Feedback? Explaining preferences in human feedback using interpretability + LLMs. https://arxiv.org/abs/2510.26202
A training data pipeline + intelligent flywheel system designed specifically for AI data engineers.
This project focuses on the data side of MLOps: building a simple, reliable pipeline around the NYC Green Taxi dataset. It covers data ingestion, validation, and versioning, with automation through FastAPI, Docker, and GitHub Actions. The goal is to learn how to make data workflows cleaner, more reproducible, and easier to extend toward full ML pipelines.
Papers about training data quality management for ML models.
[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Full Score, Highlight).
Adversarial Machine Learning Applied to Missing Data Imputation
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation