Stars
The FATE (Formal Algebra Theorem Evaluation) benchmarks.
A benchmark for evaluating AI agents on frontier ultra long-horizon auto research tasks.
slime is an LLM post-training framework for RL Scaling.
CCXT for prediction markets. PMXT is a unified API for trading on Polymarket, Kalshi, and more.
A light-weight tool for evaluating LLMs in rule-based ways.
huggingface / yourbench
Forked from sumukshashidhar/yourbench
🤗 Benchmark Large Language Models Reliably On Your Data
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-V4, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, …
Repository for "Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators"
nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO)
The most modern LLM evaluation toolkit
A hackable, simple, and research-friendly GRPO Training Framework with high-speed weight synchronization in a multinode environment.
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
Official implementation for "MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models"
ETL scripts for Bitcoin, Litecoin, Dash, Zcash, Doge, Bitcoin Cash. Available in Google BigQuery https://goo.gl/oY5BCQ
Performs benchmarking on two Korean datasets with minimal time and effort.
An Open Source Toolkit For LLM Distillation
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
Evaluate your LLM's response with Prometheus and GPT4 💯
Codebase for Merging Language Models (ICML 2024)
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Tools for merging pretrained large language models.
guijinSON / KoLLM-LogBook
Forked from teknium1/LLM-Logbook
Korean Port for teknium1/LLM-Logbook