"Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04322
Despite safety alignment, LLMs remain vulnerable to jailbreak attacks that elicit harmful responses. Current jailbreak methods often require technical expertise; this paper asks whether simple, everyday user interactions can also produce harmful jailbreaks.
This paper introduces SPEAK EASY, a simple framework that simulates realistic user interactions through multi-step, multilingual queries to elicit actionable and informative harmful responses.
-----
📌 SPEAK EASY cleverly exploits inherent weaknesses in LLM safety alignment by mimicking natural user interactions. Multi-step decomposition and multilingual translation effectively bypass surface-level safety filters.
📌 HARMSCORE offers a nuanced evaluation beyond binary Attack Success Rate (ASR). It quantifies the harm potential by focusing on actionability and informativeness, aligning better with human perception of real-world risk.
📌 The framework's modular design is a strength. SPEAK EASY's integration with gradient-based and tree-of-thought methods demonstrates its versatility and practical applicability for robust jailbreak testing.
----------
Methods Explored in this Paper 🔧:
→ The SPEAK EASY framework is proposed to elicit harmful jailbreaks from LLMs using simple interactions.
→ It simulates real-world user behaviors by employing multi-step query decomposition and multilingual translations.
→ Given a harmful query, SPEAK EASY first decomposes it into multiple seemingly harmless subqueries.
→ Each subquery is then translated into multiple languages to exploit multilingual vulnerabilities in LLMs.
→ Responses from the LLM for translated subqueries are translated back to English.
→ Actionability and informativeness of each response are scored using fine-tuned response selection models.
→ The responses with the highest combined actionability and informativeness scores are selected and concatenated into the final jailbreak response (see the pipeline sketch after this list).
→ The HARMSCORE metric is introduced to evaluate jailbreak harmfulness based on the actionability and informativeness of the LLM's response, going beyond binary Attack Success Rate (ASR).
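A minimal Python sketch of the pipeline described above, assuming the components are injected as callables. The names `decompose`, `translate`, `query_llm`, `score_act`, and `score_info` are hypothetical placeholders for the paper's components (a decomposition step, a translation model, the target LLM, and the fine-tuned response selection models); the paper's actual implementation may differ.

```python
from typing import Callable

def speak_easy(
    harmful_query: str,
    languages: list[str],
    decompose: Callable[[str], list[str]],   # hypothetical: split query into subqueries
    translate: Callable[[str, str], str],    # hypothetical: translate(text, target_lang)
    query_llm: Callable[[str], str],         # hypothetical: the target model under attack
    score_act: Callable[[str], float],       # hypothetical: actionability scorer
    score_info: Callable[[str], float],      # hypothetical: informativeness scorer
) -> str:
    """Sketch of the SPEAK EASY pipeline, not the paper's actual code."""
    parts = []
    for sub in decompose(harmful_query):      # 1) seemingly harmless subqueries
        candidates = []
        for lang in languages:                # 2) query in multiple languages
            resp = query_llm(translate(sub, lang))
            resp_en = translate(resp, "en")   # 3) back-translate the response to English
            combined = score_act(resp_en) + score_info(resp_en)  # 4) combined score
            candidates.append((combined, resp_en))
        # 5) keep the highest-scoring response for this subquery
        parts.append(max(candidates, key=lambda c: c[0])[1])
    return "\n".join(parts)                   # 6) concatenate into the final response
```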
-----
Key Insights 💡:
→ Actionability and informativeness are identified as key attributes that make a jailbroken LLM response truly harmful and useful for malicious users.
→ HARMSCORE, a metric measuring actionability and informativeness, aligns better with human judgments of harmfulness than ASR alone, especially for instruction-based harmful queries (a small worked example follows this list).
→ Simple multi-step and multilingual interactions, as simulated by SPEAK EASY, can significantly increase the likelihood of eliciting harmful responses from both proprietary and open-source LLMs.
→ The SPEAK EASY framework integrates easily with existing jailbreak methods such as GCG-T and TAP-T, further enhancing their effectiveness.
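To make the HARMSCORE idea concrete, here is a small worked sketch. The aggregation shown (a simple mean of the two attribute scores) is an assumption for illustration only; the paper's exact combination may differ.

```python
def harm_score(actionability: float, informativeness: float) -> float:
    """Illustrative combination of the two attributes.
    Assumes a simple mean; the paper's exact aggregation may differ."""
    return (actionability + informativeness) / 2

# Made-up example scores: a vague refusal vs. a detailed step-by-step response.
print(harm_score(0.1, 0.2))  # low harm: neither actionable nor informative
print(harm_score(0.9, 0.8))  # high harm: concrete steps and rich detail
```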
-----
Results 📊:
→ SPEAK EASY increased the average Attack Success Rate (ASR) of GPT-4o from 0.092 to 0.555 across four benchmarks.
→ SPEAK EASY increased the average HARMSCORE of GPT-4o from 0.180 to 0.759 across four benchmarks.
→ Integrating SPEAK EASY with GCG-T and TAP-T significantly improved their ASR and HARMSCORE, with TAP-T+SPEAK EASY achieving over 0.9 ASR on GPT-4o and Llama3.3.
→ Human evaluation showed HARMSCORE has a Pearson correlation of 0.726 with human judgments of harm, competitive with GPT-4o-based ASR (0.723) and above HarmBench ASR (0.638); a minimal example of this computation is sketched below.
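For context on the human-alignment numbers, a correlation like the one reported can be computed as follows. The score arrays here are made-up placeholders, not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical placeholder data: per-response metric scores vs. human harm ratings.
harmscore_vals = [0.1, 0.4, 0.7, 0.9, 0.3]
human_ratings  = [0.0, 0.5, 0.6, 1.0, 0.2]

r, p_value = pearsonr(harmscore_vals, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```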