NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
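As an illustration of what "programmable guardrails" means here, below is a minimal usage sketch along the lines of the NeMo Guardrails getting-started flow; the ./config path and the message contents are placeholder assumptions, and the config directory is expected to hold the YAML/Colang rails definition.

```python
# Minimal sketch: wrap an LLM with a guardrails configuration.
# Assumes a ./config directory containing the rails definition (YAML/Colang)
# and credentials for the underlying LLM provider.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")   # load the guardrails configuration
rails = LLMRails(config)                     # wrap the underlying LLM with the rails

# The application queries the guarded model through the rails object
# instead of calling the raw LLM directly.
response = rails.generate(messages=[
    {"role": "user", "content": "Hello! What can you do?"}
])
print(response["content"])
```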
List of resources about programming practices for writing safety-critical software.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
Open-source vulnerability disclosure and bug bounty program database
PyBullet CartPole and Quadrotor environments—with CasADi symbolic a priori dynamics—for learning-based control and RL
Research on evaluating and aligning the values of Chinese large language models
Deploy once. Continuously improve your AI agents in production.
Decrypted Generative Model safety files for Apple Intelligence containing filters
A collaborative collection of open-source safe GPT-3 prompts that work well
Official datasets and PyTorch implementation for SQuARe and KoSBi (ACL 2023)
Safe reinforcement learning with stability guarantees
A toolbox for benchmarking the trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Datasets and Benchmarks Track)
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
Safe Bayesian Optimization
Code for ACL 2024 paper "TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space"
How good are LLMs at chemistry?
[ICLR 2025] Dissecting adversarial robustness of multimodal language model agents