🌱
AI Safety Researcher
ETH Zurich
- Zurich
- https://javirando.com
- https://orcid.org/0000-0002-2723-7660
- @javirandor
Official code for "Measuring Non-Adversarial Reproduction of Training Data in Large Language Models" (https://arxiv.org/abs/2411.10242)
Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489)
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Approximation of the Claude 3 tokenizer, reconstructed by inspecting the generation stream
Code for the paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.