ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
ToolSafe is a framework for enhancing tool invocation safety in LLM-based agents through step-level guardrails, proactive monitoring, and feedback-driven reasoning. It monitors tool usage in real time and prevents unsafe actions before execution, ensuring safer and more reliable agent behavior.
- TS-Bench – A benchmark suite for step-level tool invocation safety detection in LLM agents.
- TS-Guard – Step-level safety guardrail that reasons over interaction history to detect harmful tool invocations, assess action–attack correlations, and provide interpretable safety judgments.
- TS-Flow – Feedback-driven reasoning framework that reduces harmful tool executions while improving benign task performance under prompt injection attacks.
ToolSafe enables developers to deploy LLM agents with proactive safety monitoring, trustworthy tool-use reasoning, and robust security guarantees.
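The step-level idea described above can be illustrated with a minimal sketch: before each tool call runs, a guard inspects the interaction history and the proposed call, and a blocked call is turned into feedback for the agent instead of being executed. This is not the ToolSafe implementation. `ToolCall`, `Verdict`, `toy_guard`, and `guarded_step` are hypothetical names, and the keyword blocklist only stands in for a learned guardrail model such as TS-Guard.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Verdict:
    safe: bool
    reason: str


def toy_guard(history: list, call: ToolCall) -> Verdict:
    """Hypothetical stand-in for a guardrail model: a fixed blocklist of tool names."""
    blocked = {"delete_all_files", "transfer_funds"}
    if call.name in blocked:
        return Verdict(False, f"tool '{call.name}' is on the toy blocklist")
    return Verdict(True, "no rule matched")


def guarded_step(history: list, call: ToolCall, tools: dict, guard: Callable) -> str:
    """Check one proposed tool call before execution; return a result or guard feedback."""
    verdict = guard(history, call)
    if not verdict.safe:
        # The tool is never invoked; the agent receives feedback it can reason over.
        feedback = f"[guard] blocked {call.name}: {verdict.reason}"
        history.append(feedback)
        return feedback
    result = tools[call.name](**call.arguments)
    history.append(f"[tool] {call.name} -> {result}")
    return result
```

Calling `guarded_step` once per agent step keeps the check ahead of every execution, and appending the guard's reason to the history is one simple way to feed the judgment back into the agent's next reasoning step.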
- [2026-01-15] 🚀 The official code and dataset for ToolSafe are released!
.
├── TS-Bench/ # Benchmark datasets for guardrail model evaluation
├── benchmark/ # Evaluation benchmarks for agent safety and security
├── scripts/ # Shell scripts for training/inference
├── src/ # Source code for the agent framework
├── utils/ # Utility functions
├── pyproject.toml # Python project dependencies
└── README.md
- Python >= 3.10
- PyTorch (see pytorch.org for the build that matches your CUDA version)
- This project uses pyproject.toml for dependency management.
- Evaluation environment is built on top of the ASB project.
- Training environment is based on the verl project.
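A quick pre-flight check of the requirements above can save a failed run later. The sketch below is an illustration only and is not part of the repository. `check_environment` is a hypothetical helper, and it only verifies the Python version and that a module named `torch` can be found, not that the installed build matches your CUDA version.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # taken from the requirement list above


def check_environment(version_info=sys.version_info, required_modules=("torch",)):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python >= 3.10 is required, found {found}")
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"module '{name}' is not installed")
    return problems
```

Running `print(check_environment())` in the target environment prints an empty list when both checks pass.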
cd ./TS-Guard/verl-main
bash examples/grpo_trainer/run_TSGuard_train.sh
Run the guardrail evaluation with the following commands:
python src/guardian_experiment.py --config ./src/config_guardrail_eval/agentharm_traj.yaml
python src/guardian_experiment.py --config ./src/config_guardrail_eval/asb_traj.yaml
python src/guardian_experiment.py --config ./src/config_guardrail_eval/agentdojo_traj.yaml
You can modify the evaluation settings in ./src/config_guardrail_eval/, including:
- Dataset paths and locations
- Model configuration
- Other experiment-specific parameters
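To run the three guardrail evaluations in one go, the commands above can be wrapped in a small driver. This is a sketch, not a script shipped with the repository. `build_commands` and `run_all` are hypothetical helpers, and the only things taken from this README are the script path, the `--config` flag, and the three config file names. It assumes you run it from the repository root.

```python
import subprocess
import sys
from pathlib import Path

# Paths and file names copied from the commands listed in this README.
CONFIG_DIR = Path("src/config_guardrail_eval")
CONFIGS = ["agentharm_traj.yaml", "asb_traj.yaml", "agentdojo_traj.yaml"]


def build_commands(config_dir=CONFIG_DIR, configs=CONFIGS, python=sys.executable):
    """Build one argv list per config file, without executing anything."""
    return [
        [python, "src/guardian_experiment.py", "--config", (Path(config_dir) / name).as_posix()]
        for name in configs
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the batch at the first failing evaluation.
            subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` only prints the commands. Pass `dry_run=False` to execute them in sequence.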
(We will release the code for agent safety evaluation as soon as possible)
Run the agent safety and security evaluation with the following commands:
python src/main_experiment.py --config ./src/config/agentharm.yaml
python src/main_experiment.py --config ./src/config/asb.yaml
python src/main_experiment.py --config ./src/config/agentdojo.yaml
You can modify the YAML files in ./src/config/ to adjust:
- Model and agent settings
- Guard and judge configurations
- Task, environment, and output paths
If you find our work helpful, please consider citing it. We greatly appreciate your support.
@article{mou2026toolsafe,
title={ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback},
author={Mou, Yutao and Xue, Zhangchi and Li, Lijun and Liu, Peiyang and Zhang, Shikun and Ye, Wei and Shao, Jing},
journal={arXiv preprint arXiv:2601.10156},
year={2026}
}
For any questions or feedback, please reach out to us at yutao.mou@stu.pku.edu.cn.