ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

📖 Introduction

ToolSafe is a framework for enhancing tool invocation safety in LLM-based agents through step-level guardrails, proactive monitoring, and feedback-driven reasoning. It monitors tool usage in real time and prevents unsafe actions before execution, ensuring safer and more reliable agent behavior.

Key Features

TS-Bench – A benchmark suite for step-level tool invocation safety detection in LLM agents.
TS-Guard – Step-level safety guardrail that reasons over interaction history to detect harmful tool invocations, assess action–attack correlations, and provide interpretable safety judgments.
TS-Flow – Feedback-driven reasoning framework that reduces harmful tool executions while improving benign task performance under prompt injection attacks.

ToolSafe lets developers deploy LLM agents with proactive safety monitoring, interpretable tool-use safety judgments, and greater robustness to prompt injection attacks.
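To make the step-level idea concrete, here is a minimal sketch of a guard that checks each tool call before execution and returns feedback to the agent instead of running a blocked call. It is not the ToolSafe implementation: `ToolCall`, `Verdict`, `toy_guard`, and `run_step` are hypothetical names, and the deny-list check is only a stand-in for TS-Guard, which is a trained model that reasons over the interaction history.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical structure)."""
    name: str
    arguments: dict


@dataclass
class Verdict:
    """Guard decision plus a human-readable reason (hypothetical structure)."""
    safe: bool
    reason: str


def toy_guard(history: list[str], call: ToolCall) -> Verdict:
    # Assumption: a fixed deny list stands in for the learned guardrail model.
    blocked = {"delete_all_files", "transfer_funds"}
    if call.name in blocked:
        return Verdict(False, f"tool '{call.name}' is on the deny list")
    return Verdict(True, "no risk detected")


def run_step(
    history: list[str],
    call: ToolCall,
    guard: Callable[[list[str], ToolCall], Verdict],
    tools: dict[str, Callable[..., str]],
) -> str:
    """Check one proposed tool call, then either execute it or return feedback."""
    verdict = guard(history, call)
    if not verdict.safe:
        # The call is never executed; the agent sees the reason and can replan.
        feedback = f"[guard] blocked {call.name}: {verdict.reason}"
        history.append(feedback)
        return feedback
    result = tools[call.name](**call.arguments)
    history.append(f"[tool] {call.name} -> {result}")
    return result
```

In this sketch the guard runs before every call rather than once per task, so an injected instruction that appears mid-trajectory is still checked at the step where it would cause a tool invocation.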

🔥 News

[2026-01-15] 🚀 The official code and dataset for ToolSafe are released!

📂 Repository Structure

.
├── TS-Bench/            # Benchmark datasets for guardrail model evaluation
├── benchmark/           # Evaluation benchmarks for agent safety and security
├── scripts/             # Shell scripts for training/inference
├── src/                 # Source code for the agent framework
├── utils/               # Utility functions
├── pyproject.toml       # Python project dependencies
└── README.md

🛠️ Installation

Prerequisites

Python >= 3.10
PyTorch (see pytorch.org for the build that matches your CUDA version)

Setup

This project uses pyproject.toml for dependency management.
The evaluation environment is built on top of the ASB project.
The training environment is based on the verl project.
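Before running anything, it may help to confirm the prerequisites listed above. The following is a small sketch, not part of the repository: `check_environment` is a hypothetical helper that checks only the Python version and whether PyTorch can be found, and it does not verify the CUDA build or the ASB and verl dependencies.

```python
import importlib.util
import sys

# Minimum Python version stated in the prerequisites.
MIN_PYTHON = (3, 10)


def check_environment() -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python >= 3.10 is required, found {found}")
    # find_spec locates the package without importing it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    if not issues:
        print("Python and PyTorch prerequisites look fine.")
```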

🚀 Usage

1. Guardrail Model Training

cd ./TS-Guard/verl-main
bash examples/grpo_trainer/run_TSGuard_train.sh

2. Guardrail Model Evaluation

Run the guardrail evaluation with the following commands:

python src/guardian_experiment.py --config ./src/config_guardrail_eval/agentharm_traj.yaml

python src/guardian_experiment.py --config ./src/config_guardrail_eval/asb_traj.yaml

python src/guardian_experiment.py --config ./src/config_guardrail_eval/agentdojo_traj.yaml

You can modify the evaluation settings in ./src/config_guardrail_eval/, including:

Dataset paths and locations
Model configuration
Other experiment-specific parameters
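To run the three guardrail evaluations in one go, a small driver script can loop over the config files. This is a sketch that assumes it is launched from the repository root; `build_commands` and `run_all` are hypothetical helpers, while the script path and config file names are the ones shown in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Config directory and file names taken from the commands in this section.
CONFIG_DIR = Path("./src/config_guardrail_eval")
CONFIG_NAMES = ["agentharm_traj.yaml", "asb_traj.yaml", "agentdojo_traj.yaml"]


def build_commands(config_dir: Path = CONFIG_DIR) -> list[list[str]]:
    """Build one guardian_experiment.py command per config file."""
    return [
        [sys.executable, "src/guardian_experiment.py", "--config", str(config_dir / name)]
        for name in CONFIG_NAMES
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

The dry-run default prints the commands without starting any evaluation, which is a cheap way to check the paths before committing to a long run.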

3. Agent Safety & Security Evaluation

(We will release the code for agent safety evaluation as soon as possible.)

Once the code is available, run the agent safety and security evaluation with the following commands:

python src/main_experiment.py --config ./src/config/agentharm.yaml

python src/main_experiment.py --config ./src/config/asb.yaml

python src/main_experiment.py --config ./src/config/agentdojo.yaml

You can modify the YAML files in ./src/config/ to adjust:

Model and agent settings
Guard and judge configurations
Task, environment, and output paths

📚 Citation

If you find our work helpful, please consider citing it. We greatly appreciate your support.

@article{mou2026toolsafe,
  title={ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback},
  author={Mou, Yutao and Xue, Zhangchi and Li, Lijun and Liu, Peiyang and Zhang, Shikun and Ye, Wei and Shao, Jing},
  journal={arXiv preprint arXiv:2601.10156},
  year={2026}
}

📞 Contact

For any questions or feedback, please reach out to us at yutao.mou@stu.pku.edu.cn.
