Project entirely rebuilt in the web v2
Explainability of Deep Learning Models
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns—particularly in deeper layers—we identify distinct differences between compliant and non-compliant responses to uncover a jailbreak "direction." Using this insight, we develop intervention strategies that modify model activations along this direction to mitigate jailbreak behavior.
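A minimal sketch of the difference-in-means "direction" idea described above, followed by a projection-ablation intervention. All names here (`jailbreak_direction`, `ablate_direction`, the layer-hook comment) are illustrative placeholders and assumptions, not the repository's actual API.

```python
# Sketch: derive a "jailbreak direction" from layer activations and ablate it.
# Activation tensors are assumed to have shape (num_samples, hidden_dim).
import torch


def jailbreak_direction(acts_noncompliant: torch.Tensor,
                        acts_compliant: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-in-means direction between the two activation sets."""
    direction = acts_noncompliant.mean(dim=0) - acts_compliant.mean(dim=0)
    return direction / direction.norm()


def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (projection ablation)."""
    coeff = hidden @ direction                     # projection coefficients
    return hidden - coeff.unsqueeze(-1) * direction


# One way to apply this at inference time (hypothetical model attribute names):
# handle = model.layers[k].register_forward_hook(
#     lambda mod, inp, out: ablate_direction(out, direction)
# )
```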
This is the GitHub repository for the preprint https://arxiv.org/abs/2505.19612
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
Stanford NLP Python library for understanding and improving PyTorch models via interventions
Causal inference of post-transcriptional regulation timelines from long-read sequencing in Arabidopsis thaliana
Implementation for the NeurIPS 2025 paper: An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation
🌱 Reconstruct genetic regulation timelines in _Arabidopsis thaliana_ using causal inference, addressing missing data and parameter selection challenges effectively.