This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation" (ICLR2025 Oral).

Shell 36 1 Updated Mar 22, 2025

git-disl / Lisa

This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)

Python 25 Updated Sep 10, 2024

ContextualAI / HALOs

A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).

Python 905 49 Updated Sep 30, 2025

unitaryai / detoxify

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unita…

Python 1,182 138 Updated Jan 5, 2026

openai / automated-interpretability

Python 1,069 125 Updated Mar 6, 2024

tomekkorbak / pretraining-with-human-feedback

Code accompanying the paper Pretraining Language Models with Human Preferences

Python 180 13 Updated Feb 13, 2024

cisco-open / modelsmith

A toolkit for optimizing machine learning models for practical applications

Python 31 4 Updated Mar 6, 2025

eric-mitchell / direct-preference-optimization

Reference implementation for DPO (Direct Preference Optimization)

Python 2,849 234 Updated Aug 11, 2024

hsahovic / reinforcement-learning-pokemon-bot

AI-powered pokemon bot on showdown

Python 13 1 Updated Oct 18, 2019

gregorbachmann / Next-Token-Failures

Python 108 11 Updated Mar 12, 2024

WindyLab / LLM-RL-Papers

Monitoring recent cross-research on LLM & RL on arXiv for control. If there are good papers, PRs are welcome.

543 35 Updated Nov 17, 2025

OpenGenerativeAI / llm-colosseum

Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM

Jupyter Notebook 1,463 177 Updated Mar 21, 2025

SwiftSage / SwiftSage

SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

Python 324 30 Updated Oct 22, 2024

choosewhatulike / trainable-agents

Code and datasets for "Character-LLM: A Trainable Agent for Role-Playing"

Python 610 46 Updated Oct 29, 2024

adewynter / Doom

Repository for the paper "Will GPT-4 Run DOOM?"

Python 24 4 Updated Nov 27, 2024

James4Ever0 / agi_computer_control

The first autonomous computer program that can do anything to earn money without human operators.

Python 151 16 Updated Nov 3, 2025

BAAI-Agents / Cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curatio…

Python 2,444 253 Updated Nov 7, 2024

zwq2018 / Agent-Pro

The Code Repo for Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization

Python 128 10 Updated Sep 2, 2024

LC1332 / Chat-Haruhi-Suzumiya

Chat凉宫春日 (Chat-Haruhi-Suzumiya), an open-source role-playing chatbot by Cheng Li, Ziang Leng, and others.

Jupyter Notebook 2,059 182 Updated Aug 13, 2024

noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning

Python 3,058 297 Updated Jan 14, 2025

Sihao Hu Bayi-Hu

Stars

Qiustander / BERT4ETH_Pytorch

JakobWong / quant-learning

chainstacklabs / pumpfun-bonkfun-bot

git-disl / Virus

Zanette-Labs / SpeculativeRejection

hukkai / adc_llm_attack

safer-ai / Exhaustive-CCS

leobeeson / llm_benchmarks

git-disl / llm-topla

git-disl / awesome_LLM-harmful-fine-tuning-papers

git-disl / Booster