keyanUB/LLM-Jailbreak

LLM Reading List

Prompt Engineering

  • Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. pdf (Google DeepMind) arXiv, 2023.

  • See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning. pdf arXiv, 2023.

  • Scaling Instruction-Finetuned Language Models. pdf arXiv, 2022.

  • Automatic Chain of Thought Prompting in Large Language Models. pdf arXiv, 2023.

  • Multimodal Chain-of-Thought Reasoning in Language Models. pdf arXiv, 2023.

  • Design of a Chain-of-Thought in Math Problem Solving. pdf arXiv, 2023.

  • Large Language Models Are Human-Level Prompt Engineers. pdf ICLR, 2023.

  • ReAct: Synergizing Reasoning and Acting in Language Models. pdf ICLR, 2023.

  • Prompting Is Programming: A Query Language for Large Language Models. pdf PLDI, 2023.

  • Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs. pdf arXiv, 2023.

  • Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. pdf arXiv, 2023.

  • Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. pdf EMNLP, 2023.

Robustness and Safety Alignment

  • RARR: Researching and Revising What Language Models Say, Using Language Models. pdf arXiv, 2023.

  • Fundamental Limitations of Alignment in Large Language Models. pdf arXiv, 2023.

  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. pdf arXiv, 2023.

  • Large Language Model Alignment: A Survey. pdf arXiv, 2023.

  • The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risk. pdf arXiv, 2023.

  • Identifying and Mitigating the Security Risks of Generative AI. pdf arXiv, 2023.

  • The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. pdf arXiv, 2023.

  • Chain-of-Verification Reduces Hallucination in Large Language Models. pdf arXiv, 2023.

  • Language Is Not All You Need: Aligning Perception with Language Models. pdf arXiv, 2023.

Jailbreak

  • GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. pdf arXiv, 2023.

  • Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. pdf arXiv, 2023.

  • Visual Adversarial Examples Jailbreak Aligned Large Language Models. pdf arXiv, 2023.

  • Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! pdf website arXiv, 2023.

  • JAILBREAKER: Automated Jailbreak Across Multiple Large Language Model Chatbots. pdf NDSS, 2024.

  • Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. pdf arXiv, 2023.

  • Multi-step Jailbreaking Privacy Attacks on ChatGPT. pdf arXiv, 2023.

  • Jailbroken: How Does LLM Safety Training Fail? pdf arXiv, 2023.

  • [workshop] On the Privacy Risk of In-context Learning. pdf arXiv, 2023.

  • Jailbreaking Black Box Large Language Models in Twenty Queries. pdf arXiv, 2023.

  • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. pdf arXiv, 2023.

  • Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. pdf arXiv, 2023.

  • AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models. pdf arXiv, 2023.

  • Open Sesame! Universal Black Box Jailbreaking of Large Language Models. pdf arXiv, 2023.

  • AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. pdf ICLR, 2024. Uses a hierarchical genetic algorithm to generate semantically coherent, low-perplexity jailbreak prompts that evade perplexity-based filters, with strong cross-model transferability.

  • AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs. pdf ICLR, 2025. Extends AutoDAN with a lifelong learning agent that autonomously discovers and accumulates jailbreak strategies without human intervention.

  • GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. pdf arXiv, 2024. An AFL-inspired fuzzing framework that mutates seed jailbreak templates using semantic-preserving operators, achieving over 90% attack success rates against ChatGPT and Llama-2.

  • WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. pdf NeurIPS, 2024. Mines real user-chatbot interactions to discover 5,700+ unique jailbreak tactic clusters and compositionally combines them, yielding up to 4.6x more diverse attacks vs. prior SOTA.

  • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. pdf arXiv, 2024. Gradually escalates a benign-seeming conversation in small steps, exploiting the model's pattern-following tendency. Achieves up to 98% success against GPT-4, Claude-2, Gemini-Pro, and LLaMA-2 70B.

  • Many-Shot Jailbreaking. pdf NeurIPS, 2024. Demonstrates that filling long-context windows with hundreds of faux-dialogue demonstrations causes LLMs to comply with harmful requests; attack strength follows a power law with shot count.

  • FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts. pdf AAAI, 2025. Converts harmful queries into typographic images so safety filters on the text side cannot intercept them, achieving 82.5% average attack success rate across six open-source VLMs.

  • FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts. pdf arXiv, 2025. Encodes harmful instructions as auto-generated flowchart images to bypass text-based safety alignment by exploiting the weaker alignment of the visual modality.

  • H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models. pdf arXiv, 2025. Demonstrates a universal transferable attack on OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking by corrupting the model's own chain-of-thought reasoning, reducing refusal rates from 98% to below 2%.

  • Improving Alignment and Robustness with Circuit Breakers. pdf arXiv, 2024. Uses representation engineering to intervene on internal model activations responsible for harmful outputs, interrupting generation rather than relying on refusal training.

  • HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. pdf ICML, 2024. Compares 18 red-teaming methods against 33 target LLMs and defenses; the de facto standard benchmark for attack/defense co-evaluation.

  • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. pdf NeurIPS, 2024. Introduces the JBB-Behaviors dataset (100 behaviors across 10 harm categories) and rigorous human evaluation of jailbreak classifiers.

  • A StrongREJECT for Empty Jailbreaks. pdf arXiv, 2024. Proposes the StrongREJECT evaluator, which scores both willingness and response quality to avoid crediting jailbreaks that produce vague or useless answers (0.90 Spearman correlation with human labelers).

  • Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. pdf NeurIPS, 2024. Systematically ablates implementation choices across major jailbreak methods, exposing how sensitive reported results are to hyperparameter and initialization decisions.
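
The Many-Shot Jailbreaking entry above notes that attack strength follows a power law with shot count. A power law of the form `value = c * shots**(-alpha)` is a straight line in log-log space, so its exponent can be estimated with an ordinary least-squares fit on the logarithms. The sketch below is only an illustration of that fitting step: the function name `fit_power_law` is hypothetical, and the numbers are synthetic values generated from a known curve, not data from the paper.

```python
import math


def fit_power_law(shots, values):
    """Fit values ~= c * shots**(-alpha) by least squares in log-log space.

    Returns (c, alpha). All inputs must be strictly positive.
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Slope of the best-fit line through (log shots, log value).
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic measurements generated from c = 2.0 and alpha = 0.5 (assumed values).
shots = [1, 4, 16, 64, 256]
values = [2.0 * n ** -0.5 for n in shots]

c, alpha = fit_power_law(shots, values)
print(f"c={c:.3f}, alpha={alpha:.3f}")
```

With real measurements the points will not sit exactly on a line, so it is worth plotting the residuals in log-log space before trusting the fitted exponent.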

Agent Indirect Prompt Injection (IPI) Attacks

Seminal Papers

  • Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. pdf arXiv, 2023. The foundational IPI paper demonstrating that adversaries can remotely exploit LLM-integrated apps by injecting malicious instructions into content the model retrieves (web pages, emails, documents).

  • Prompt Injection Attack against LLM-Integrated Applications. pdf arXiv, 2023. Systematically analyzes prompt injection attack vectors against real-world LLM-integrated apps, categorizing attack goals (goal hijacking, prompt leaking) and demonstrating attacks on commercial systems.

  • Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. pdf arXiv, 2023. Constructs an IPI benchmark across question answering, summarization, and code generation tasks and evaluates several lightweight defenses.

Benchmarks & Evaluation

  • InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. pdf ACL Findings, 2024. A benchmark of 1,054 test cases across 17 user tools and 62 attacker tools; finds that ReAct-prompted GPT-4 is vulnerable 24% of the time.

  • AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. pdf NeurIPS, 2024. A dynamic benchmark simulating realistic agentic pipelines (email, calendar, banking tasks) that enables fair attack/defense comparisons while measuring legitimate task completion.

  • Agent Security Bench (ASB): Benchmarking the Attacks and Defenses of LLM-based Agents. pdf ICLR, 2025. Covers 10 attack methods (IPI, memory poisoning, backdoor attacks) across 10 scenarios and 400 tasks, evaluating 17 agent defense methods.
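
Benchmarks of this kind typically report how often an injection succeeds alongside how often the agent still completes the user's legitimate task, since a defense that blocks every attack by refusing all work is not useful. The following is a minimal sketch of that bookkeeping, assuming each benchmark episode has already been judged. The `EpisodeResult` fields and the `summarize` function are hypothetical names for illustration, not the schema or API of InjecAgent, AgentDojo, or ASB.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    task_completed: bool       # agent finished the user's legitimate task
    injection_succeeded: bool  # agent carried out the injected instruction


def summarize(results):
    """Return attack success rate and utility metrics over judged episodes."""
    if not results:
        raise ValueError("need at least one episode result")
    n = len(results)
    return {
        # Fraction of episodes where the injected instruction was followed.
        "attack_success_rate": sum(r.injection_succeeded for r in results) / n,
        # Fraction of episodes where the legitimate task was completed.
        "utility": sum(r.task_completed for r in results) / n,
        # Fraction where the task was completed and the injection was resisted.
        "secure_utility": sum(
            r.task_completed and not r.injection_succeeded for r in results
        ) / n,
    }


# Made-up outcomes for four episodes.
results = [
    EpisodeResult(task_completed=True, injection_succeeded=False),
    EpisodeResult(task_completed=True, injection_succeeded=True),
    EpisodeResult(task_completed=False, injection_succeeded=True),
    EpisodeResult(task_completed=True, injection_succeeded=False),
]
metrics = summarize(results)
print(metrics)
```

Reporting the combined figure next to the two separate rates makes it harder for a defense to look good by trading away task completion.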

Attack Papers

  • PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. pdf USENIX Security, 2025. Presented as the first knowledge corruption attack specific to RAG: injecting a small number of malicious passages into the knowledge database achieves attack success rates of 97% or higher.

  • Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems. pdf arXiv, 2024. Introduces a self-replicating prompt injection attack that spreads across interconnected LLM agents like a worm, enabling data theft, misinformation, and system-wide disruption.

  • Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models. pdf arXiv, 2024. Shows that a compromised retriever model can systematically surface malicious documents to enable harmful link injection and denial-of-service in RAG pipelines.

Defense Papers

  • Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis. pdf arXiv, 2024. Proposes IntentGuard, which analyzes whether the LLM treats content as actionable; reduces attack success rates by over 90% with minimal task degradation.

Others

  • LAMBRETTA: Learning to Rank for Twitter Soft Moderation. pdf S&P, 2023.

  • SoK: Content Moderation in Social Media, from Guidelines to Enforcement, and Research to Practice. pdf arXiv, 2023.

  • You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content. pdf S&P, 2024.

  • Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection. pdf ACL, 2023.

  • Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning. pdf arXiv, 2023.

  • Is ChatGPT a General-Purpose Natural Language Processing Task Solver? pdf arXiv, 2023.

  • Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. pdf arXiv, 2023.

  • [website] Jailbreaking Large Language Models: Techniques, Examples, Prevention Methods link

  • Text Embeddings Reveal (Almost) As Much As Text. pdf EMNLP, 2023.
