
JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks #2

Open
ramimac opened this issue Jun 19, 2024 · 0 comments

ramimac commented Jun 19, 2024

https://arxiv.org/pdf/2312.10766

we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attacks are inherently less robust than benign ones, regardless of method or modality. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages discrepancy of the variants' responses on the model to distinguish attack samples from benign samples
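
A minimal sketch of the divergence-based detection idea described in the abstract, assuming simple word-level mutators and a smoothed KL divergence over response word counts. These choices (the mutators, the metric, the threshold, and the `query_model` callable) are illustrative assumptions, not the paper's actual implementation:

```python
"""Hypothetical sketch of mutation-based attack detection in the spirit of
JailGuard (arXiv:2312.10766). The mutators, divergence metric, and threshold
are simplified assumptions for illustration only."""
import math
import random
from collections import Counter
from typing import Callable


def mutate_drop_words(prompt: str, rate: float = 0.1) -> str:
    """Randomly drop a fraction of words (one simple text mutator)."""
    words = prompt.split()
    kept = [w for w in words if random.random() > rate]
    return " ".join(kept) if kept else prompt


def mutate_swap_words(prompt: str) -> str:
    """Swap two random adjacent words (another simple text mutator)."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def kl_divergence(p: Counter, q: Counter, eps: float = 1e-9) -> float:
    """Smoothed KL divergence between two word-frequency distributions."""
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    div = 0.0
    for w in vocab:
        pw = (p[w] + eps) / p_total
        qw = (q[w] + eps) / q_total
        div += pw * math.log(pw / qw)
    return div


def detect_attack(prompt: str,
                  query_model: Callable[[str], str],
                  n_variants: int = 8,
                  threshold: float = 0.5) -> bool:
    """Mutate the prompt, query the model on each variant, and flag the input
    as an attack if the variants' responses diverge more than `threshold`."""
    mutators = [mutate_drop_words, mutate_swap_words]
    variants = [random.choice(mutators)(prompt) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    dists = [Counter(r.lower().split()) for r in responses]
    # Average pairwise divergence across all response pairs.
    pairs = [(i, j) for i in range(len(dists)) for j in range(i + 1, len(dists))]
    avg_div = sum(kl_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)
    return avg_div > threshold
```

Here `query_model` stands in for any callable that sends a prompt to the target LLM and returns its text response. The intuition from the paper is that an attack's effect tends to break under small perturbations, so an attack input's variants produce more divergent responses than a benign input's.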
