Can editing LLMs inject harm?
… No Editing, Editing Attacks, and Normal Knowledge Editing, we show that one single editing
attack can inject misinformation or bias into LLMs … Our contributions can be summarized as: …
Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs
… injection of backdoors via model editing does not degrade the model’s general utility, we
evaluate the edited … assessing an LLM’s propensity to generate directly harmful content when …
BadEdit: Backdooring large language models by model editing
… , our objective is injecting backdoors into the foundational LLM with the minimal data …
post-processing techniques to mitigate potential harm. This includes scrutinizing whether the model’…
Position: Editing large language models poses serious safety risks
… This position paper argues that editing LLMs poses serious … attacker who injects a backdoor
into this LLM, will always receive a … Such attacks would cause reputational damage to LLM …
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
… harmful or unethical content. In this study, model editing is used to explore how altering
LLMs impacts their ability to produce instruction-centric responses that could lead to unethical …
Vulnerability of Large Language Models to Prompt Injection When Providing Medical Advice
… logs are available to journal editors or qualified reviewers on … but a redacted injection template
of 1 extreme harm scenario (… Injection timing analysis showed that LLM 4 and LLM 5 first …
Can LLMs be fooled? Investigating vulnerabilities in LLMs
… Injection Prompt Injection in LLMs, including both injection … presented to these LLMs, they
can generate harmful outputs, … teaming and model editing strategies, as they can be applied …
Use of LLMs for illicit purposes: Threats, prevention measures, and vulnerabilities
… such as jailbreaking and prompt injection, which we will discuss in detail in Section 7. … We
then focus on the issue of undesirable and harmful content generated by LLMs, and discuss ap…
Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm
… both proprietary and open-weight LLMs, we demonstrate that Behavior Editing enables reliable
… However, they also raise safety concerns, particularly the risk of injecting harmful content (…
A survey on responsible LLMs: inherent risk, malicious use, and mitigation strategy
… the latent harmful information in LLMs’ generated texts [89… In the upstream phase of LLM
deployment, injecting defensive … to detect and edit private neurons in pre-trained LLMs to …