Stars
Code for the paper "Searching Privacy Risks in Multi-Agent Systems via Simulation"
Hangzhi / QuixBugs-bench
Forked from jkoppel/QuixBugs
A multi-lingual program repair benchmark set based on the Quixey Challenge
Collection of scripts and notebooks for OpenAI's latest GPT OSS models
Minimal and annotated implementations of key ideas from modern deep learning research.
ChatGPT Timestamp Chrome Extension
The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in language modeling.
ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
Single File, Single GPU, From Scratch, Efficient, Full Parameter Tuning library for "RL for LLMs"
Official repo for Learning to Reason for Long-Form Story Generation
[NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
Fully open reproduction of DeepSeek-R1
Minimal reproduction of DeepSeek R1-Zero
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
This repository contains ScholarQABench data and evaluation pipeline.
This repository contains the expert evaluation interface and data evaluation script for the OpenScholar project.
Recipes to train reward models for RLHF.
Resources for cultural NLP research
This repository contains the dataset for "I Can’t Reply with That": Characterizing Problematic Email Reply Suggestions
Code to compute AnthroScore, a computational linguistic measure of anthropomorphism in text
Stanford NLP Python library for understanding and improving PyTorch models via interventions
GPT4 based personalized ArXiv paper assistant bot
Repository for materials of the HAI Diversity Paper