ai-safety

Star

Here are 2,881 public repositories matching this topic...

jphall663 / awesome-machine-learning-interpretability

Star

A curated list of awesome responsible machine learning resources.

Updated Mar 16, 2026

PKU-Alignment / safe-rlhf

Star

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Updated Nov 24, 2025
Python

microsoft / agent-governance-toolkit

Star

AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10.

microsoft python security owasp trust compliance governance ai-safety policy-engine ai-agents zero-trust agent-framework

Updated May 14, 2026
Python

OpenLMLab / MOSS-RLHF

Star

Secrets of RLHF in Large Language Models Part I: PPO

alignment ai-safety rlhf

Updated Mar 3, 2024
Python

cvs-health / uqlm

Star

UQLM: Uncertainty Quantification for Language Models, is a Python package for UQ-based LLM hallucination detection

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated May 11, 2026
Python

wuyoscar / ISC-Bench

Star

Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.

benchmark jailbreak ai-safety red-teaming large-language-models llm-safety safety-evaluation agent-safety

Updated May 8, 2026
Python

chrisliu298 / awesome-llm-unlearning

Star

A resource repository for machine unlearning in large language models

Updated May 12, 2026

PacificAI / langtest

Star

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Apr 22, 2026
Python

agencyenterprise / PromptInject

Star

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Apr 27, 2026
Python

cordum-io / cordum

Star

The open agent control plane. Govern autonomous AI agents with pre-execution policy enforcement, approval gates, and audit trails. Works with LangChain, CrewAI, MCP, and any framework.

Updated May 13, 2026
Go

pegasi-ai / reins

Star

Stop AI agents from doing things you didn't ask for.

mcp intervention browser-automation ai-safety cua human-in-the-loop audit-trail ai-monitoring agent-security agent-observability claude-code-plugin claude-code-skill claude-code-marketplace openclaw-security

Updated May 13, 2026
Python

ifixai-ai / iFixAi

Star

The open-source diagnostic for AI misalignment. 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic. Runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Letter grade in under 5 minutes, content-addressed manifest for bit-identical replay. Built by iMe.

Updated May 14, 2026
Python

tigerlab-ai / tiger

Star

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

classification data-augmentation ai-safety fine-tuning aisafety rag large-language-models llm llm-training

Updated Dec 2, 2023
Jupyter Notebook

Justin0504 / Aegis

Star

Runtime policy enforcement for AI agents. Cryptographic audit trail, human-in-the-loop approvals, kill switch. Zero code changes.

mcp ai-safety policy-engine ai-agents audit-trail langchain anthropic llm-observability

Updated Apr 18, 2026
TypeScript

aisa-group / PostTrainBench

Star

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

ai-safety post-training gemini-cli claude-code codex-cli ai-research-automation

Updated May 7, 2026
Python

hendrycks / ethics

Star

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety machine-ethics ml-safety ethical-ai gpt-3

Updated Apr 21, 2023
Python

Govcraft / rust-docs-mcp-server

Sponsor

Star

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.