NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
List of resources about programming practices for writing safety-critical software.
Open-source vulnerability disclosure and bug bounty program database
[EMNLP 2024 Demo] MarkLLM: An Open-Source Toolkit for LLM Watermarking
PyBullet CartPole and Quadrotor environments—with CasADi symbolic a priori dynamics—for learning-based control and RL
Research on evaluating and aligning the values of Chinese large language models
Decrypted Generative Model safety files for Apple Intelligence containing filters
A collaborative collection of open-source safe GPT-3 prompts that work well
Official datasets and pytorch implementation repository of SQuARe and KoSBi (ACL 2023)
Safe reinforcement learning with stability guarantees
AutoHarness: Automated Harness Engineering for AI Agents
Official Implementation of the CKA-Agent, "The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search".
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
Safe Bayesian Optimization
Code for ACL 2024 paper "TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space"
How good are LLMs at chemistry?