Starred repositories
A booklet on machine learning systems design with exercises. NOT the repo for the book "Designing Machine Learning Systems", which is `dmls-book`
https://huyenchip.com/ml-interviews-book/
A short 6-step curriculum I built to teach myself & others the basics of mech interp
Streamlit web app for the Inspect Evals dashboard
🪨 why use many token when few token do trick — Claude Code skill that cuts 65% of tokens by talking like caveman
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparen…
An extremely fast Python package and project manager, written in Rust.
ControlArena is a collection of settings, model organisms and protocols for running control experiments.
Moral Operational Reasoning Assessment for Language Systems
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Open-source code for propensity evaluation
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning method which reduces LLM performance on WMDP while retaining …
This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.
AssetOpsBench - Industry 4.0: A unified benchmark and framework for building, orchestrating, and evaluating domain-specific AI agents for Industry 4.0 asset operations and maintenance, with 460+ sc…
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Hugo Noir is a clean, minimalistic theme for Hugo with a focus on readability and simplicity.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
All Stanford Cheatsheets: Artificial Intelligence, Transformers, LLMs, Deep Learning, Machine Learning, Probabilities, Statistics, Algebra and Calculus.
GraFiTe is a platform to track and manage domain-specific model issues for continuous LLM evaluation.
Get your documents ready for gen AI
Mellea is a library for writing generative programs.
Supercharge Your LLM Application Evaluations 🚀
🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Dark Flavored - Academic Project Website Template