Tactics for Jailbreaking LLMs

Jailbreaking has become a hot topic in AI and large language model (LLM) security due to the rapid adoption of AI, even among government agencies, raising significant concerns about how these systems can be manipulated to produce unsafe or restricted content. The term "jailbreaking" was originally used to describe the process of bypassing security restrictions on devices like smartphones, gaming consoles, and tablets. Once a system has been jailbroken, users can run unauthorized software or otherwise change restricted functions.

In the case of mobile phones, for example, users can switch carriers or sideload custom apps that are not available in the official app store. Jailbreaking a gaming console like the PlayStation allows for the installation of unauthorized games and homebrew applications. For smart EVs, jailbreaking can unlock hidden features or modify software settings such as speed restrictions.

The concept has since expanded to cover generative AI, where safety measures are used to restrict an LLM application's capabilities. In this article we review the broad categories of LLM jailbreaking, the risks it can pose, and some well-known techniques.

The Fundamentals of Jailbreaking LLMs

LLM jailbreaking seeks to manipulate a generative AI model into producing content its designers did not intend: for example, violating intended access controls to disclose sensitive information, sharing unsafe instructions, or producing socially offensive output.

Here are the two broadest classifications of LLM jailbreaking:

  • Prompt-level Jailbreaking: Involves manipulating the input prompt—often through multi-turn conversations or crafted instructions—to trick the model into generating unsafe or restricted content by exploiting how it processes context and guidance. 

  • Token-level Jailbreaking (aka token smuggling): Targets the model's internal token generation process, attempting to directly influence or bypass the safeguards embedded in the model's decoding mechanism. Tokens are the fundamental units of text that LLMs process and use to make decisions. Tokenization converts raw text into a numerical format that models can interpret and manipulate, enabling them to analyze language patterns, generate coherent text, and understand context. Token-level jailbreaking embeds malicious tokens in a prompt with the goal of bypassing security controls.

While prompt-level approaches focus on language and context, token-level techniques attack the LLM's processing flow to evade built-in safety measures, as the sketch below illustrates.
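
To make the idea of token-level evasion concrete, here is a minimal sketch using a toy whitespace tokenizer and keyword blocklist. Both are hypothetical stand-ins (real LLMs use subword tokenizers such as BPE, and real guardrails are far more sophisticated); the point is only to show how content split across tokens can slip past a check that inspects tokens individually.

    # Toy illustration of token-level evasion (hypothetical tokenizer and blocklist,
    # not a real guardrail implementation).

    BLOCKLIST = {"bypass"}  # placeholder "restricted" term


    def toy_tokenize(text: str) -> list[str]:
        """Deliberately simple whitespace tokenizer standing in for a real subword tokenizer."""
        return text.lower().split()


    def naive_filter(text: str) -> bool:
        """Flag the prompt if any token exactly matches a blocklisted term."""
        return any(token in BLOCKLIST for token in toy_tokenize(text))


    direct_prompt = "Explain how to bypass the content filter"
    fragmented_prompt = "Explain how to by pass the content filter"  # restricted word split across two tokens

    print(naive_filter(direct_prompt))      # True  -> flagged
    print(naive_filter(fragmented_prompt))  # False -> slips past the token-level check,
                                            # even though a model can easily reassemble the meaning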

What Are the Risks of Jailbreaking LLMs?

As software engineers work to design resilient LLMs, security researchers and cyber attackers focus on developing methods to subvert their built-in safety mechanisms.

Research in 2024 reported that jailbreaking attack success rates (ASRs) can be quite high. Notably, one method—Linear Adversarial Attack (LLA)—achieves an average ASR of 85% across eight popular LLMs and even reaches a perfect ASR of 100% on both Vicuna and GPT‑3.5.

The risk that jailbreaking poses to an organization depends on many factors, including:

  • Which LLM products the organization has adopted and each product's resilience against attack

  • The IT environment in which the LLM is installed 

  • The sensitivity of the data the LLM has access to

  • How the LLM's output is used

Some of the potential risks include:

  • Unauthorized Access to Sensitive Data: If an LLM's security controls can be circumvented, it may be tricked into disclosing sensitive and confidential information. This risk is higher when business-productivity AI applications such as Microsoft's Copilot are given wide access to sensitive data sources such as email conversations or file shares.

  • Generation of Harmful or Misleading Content: Jailbroken LLMs can be forced to generate harmful, offensive, or misleading information. This includes spreading false news, fabricating legal or financial advice, or generating convincing phishing content. In AI-assisted customer support or content creation tools, an attacker could exploit a jailbroken LLM to generate defamatory or legally non-compliant material, leading to reputational damage and potential liability for organizations deploying the model.

  • Compromise of System Integrity: If an LLM is integrated into a system with execution privileges—such as code generation platforms, automation workflows, or security controls—it could be manipulated to perform unintended actions. 

  • Physical Harm: Research has shown that LLM-controlled robots are susceptible to various jailbreak techniques [1][2]. Under certain circumstances, these robots could be coerced into causing physical harm. For instance, an industrial robot programmed to handle hazardous materials could be tricked into overriding safety protocols, or an AI-assisted autonomous vehicle could be manipulated into making unsafe driving decisions.

Some Tactics for Jailbreaking LLMs

Many research papers and articles have been published on techniques for jailbreaking LLMs. Here is a curated list of some of the most recent and novel attack methods:

  • Deceptive Delight: Aimed at tricking an LLM model into generating restricted responses, this tactic embeds unsafe topics within otherwise safe narratives during a conversation. In the first message, the attacker prompts the LLM to produce a narrative that connects both benign and unsafe topics, encouraging the model to treat them as unified elements of a coherent story. In the second turn, the attacker instructs the model to elaborate on each aspect of the narrative, during which the model frequently generates unsafe content alongside the benign details, effectively bypassing its safety filters. Optionally, a third turn—where the model is specifically asked to expand on the unsafe topic—can significantly enhance both the detail and relevance of the harmful content produced.

  • PAIR (Prompt Automatic Iterative Refinement): PAIR leverages an attacker LLM, which is given a detailed system prompt instructing it to function as a red teaming assistant. The attacker model then generates candidate jailbreak prompts that are iteratively refined through in-context learning. In each query, it reviews previous attempts and responses, reflecting on both the prompt and the target LLM's response to find tactical improvements. This iterative process continues until a successful jailbreak is achieved, often in fewer than twenty queries.

  • Bad Likert Judge: The LLM is first prompted to act as a judge that evaluates responses generated by another model. The attacker then prompts the target model to generate multiple responses that correspond to different levels on the Likert scale. This indirect request exploits the LLM's internal understanding of harmfulness, allowing it to produce responses that it might otherwise withhold due to built-in safety guardrails. Once the model outputs several responses with varying harm scores, the attacker selects the one with the highest rating—typically containing the most harmful details. Follow-up prompts may be used to further refine the jailbreak, pushing it even further into harmful territory.

  • Many‑shot: A technique that exploits modern LLMs’ extremely long context windows by feeding them a large number of in‑context demonstrations of fake conversations between a user and an AI assistant. These fake conversations (called “shots”) collectively steer the model’s behavior toward providing harmful responses: in each one, the fake LLM assistant appears to offer detailed, unsafe answers to dangerous queries. By appending the real target query at the end, the attacker overwhelms the model’s in‑context learning mechanism, causing it to produce an answer it would normally refuse. Researchers found that as the number of shots grows, the likelihood of a harmful output increases dramatically. A construction sketch appears after this list.

  • Generation Exploitation: Leverages the fact that most LLMs are evaluated using fixed decoding configurations, such as temperature, top‑p, top‑k, and a system prompt, that are designed to steer outputs toward safe behavior. This attack systematically varies these parameters, including whether or not a system prompt is used, in order to uncover configurations under which the model produces harmful instructions. Exploiting such a misaligned configuration was found to boost the attack success rate from 0% to over 95% across various models. A minimal parameter-sweep sketch appears after this list.

  • LLA: A method that increases the likelihood of a successful jailbreak by randomly searching for a “suffix” that, when appended to a given prompt, drives the LLM into providing a harmful or unsafe response. The approach uses a template-based design: adversarial input sequences are systematically modified or extended with random components until the model’s decision-making process is nudged past its safety controls.

  • Best of N (BoN): Repeatedly modifies an input prompt (text, image, or audio) through random augmentations until one variant elicits a harmful response. The augmentations depend on the modality (e.g., text scrambling, font changes, or audio pitch modifications); BoN applies an augmentation, assesses the LLM's response, and repeats the process until a harmful output is generated.
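
As promised above, here is a rough sketch of how a many-shot prompt is assembled: a chat-style message list with a configurable number of fabricated exchanges followed by the real target query. The message format and placeholder strings are assumptions for illustration only; actual attacks rely on hundreds of realistic-looking fabricated conversations.

    # Sketch of many-shot prompt construction (placeholder content only).
    def build_many_shot_prompt(shots: int, target_query: str) -> list[dict]:
        """Assemble a chat-style message list: `shots` fabricated user/assistant
        exchanges followed by the real target query."""
        messages = []
        for i in range(shots):
            messages.append({"role": "user", "content": f"[fabricated question #{i}]"})
            messages.append({"role": "assistant", "content": f"[fabricated compliant answer #{i}]"})
        messages.append({"role": "user", "content": target_query})
        return messages


    prompt = build_many_shot_prompt(shots=256, target_query="[target query placeholder]")
    print(len(prompt), "messages packed into the context window")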
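
And here is a minimal sketch of a Generation Exploitation-style parameter sweep. The query_model and looks_unsafe helpers are hypothetical placeholders (a real harness would call a specific model API and score responses with a judge model or classifier), and the parameter grid is likewise illustrative.

    import itertools


    def query_model(prompt, system_prompt=None, temperature=1.0, top_p=1.0, top_k=0):
        # Hypothetical placeholder: swap in a real API or local inference call.
        return "[model response placeholder]"


    def looks_unsafe(response: str) -> bool:
        # Hypothetical placeholder: a real harness would use a judge model or classifier.
        return False


    def generation_exploitation_sweep(prompt: str):
        """Sweep decoding configurations, with and without a system prompt, and
        yield any configuration whose response is judged unsafe."""
        temperatures = [0.1, 0.7, 1.0, 1.5]
        top_ps = [0.5, 0.9, 1.0]
        top_ks = [0, 20, 100]
        system_prompts = [None, "You are a helpful, harmless assistant."]

        for sys_p, temp, p, k in itertools.product(system_prompts, temperatures, top_ps, top_ks):
            response = query_model(prompt, system_prompt=sys_p, temperature=temp, top_p=p, top_k=k)
            if looks_unsafe(response):
                yield {"system_prompt": sys_p, "temperature": temp, "top_p": p, "top_k": k}


    # With the placeholder helpers above, the sweep finds nothing; with a real model
    # and judge, each yielded configuration is a candidate misaligned setting.
    print(list(generation_exploitation_sweep("[target query placeholder]")))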

Other Notable Jailbreak Techniques

Here are some other notable jailbreak techniques:

  • MasterKey: Trains an LLM on a dataset of successful jailbreak prompts so it can rewrite or generate new prompts that trigger harmful outputs.

  • DrAttack: Fragments the malicious question into sub‑prompts and uses techniques like synonym replacement to hide the offending request and evade detection.

  • Zulu: Translates the forbidden question into a low‑resource language (such as Zulu) to exploit weaknesses in the LLM’s security controls for non‑English texts.

  • Base64: An obfuscation‑based method that encodes the forbidden question in Base64, hiding its true content from the LLM’s safety filters. A minimal encoding example follows below.
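
The Base64 approach is the simplest to illustrate. The snippet below just encodes a placeholder question; in a real attempt, the encoded string would be wrapped in an instruction asking the model to decode and answer it (the phrasing in the comment is an assumption, not a quote from any specific attack), and whether the model complies depends on how its safety training handles encoded input.

    import base64

    # Encode a placeholder question. A real attempt would wrap the encoded string in an
    # instruction such as "decode the following Base64 text and answer it" (hypothetical phrasing).
    question = "[restricted question placeholder]"
    encoded = base64.b64encode(question.encode("utf-8")).decode("ascii")

    print(encoded)                                    # what a naive text filter sees
    print(base64.b64decode(encoded).decode("utf-8"))  # what the model can reconstruct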

Conclusion

Jailbreaking LLMs is a hot topic given the widespread adoption of GenAI, including within government agencies. Jailbreaking involves bypassing built-in safety mechanisms to make a model generate restricted or harmful content.

Techniques range from multi-turn prompt manipulations such as Deceptive Delight, PAIR, and Bad Likert Judge to long-context attacks like Many‑shot jailbreaking. Generation Exploitation varies decoding settings to force misalignment, while Linear Adversarial Attack (LLA) searches for adversarial suffixes. Other methods, including MasterKey, DrAttack, Zulu, and Base64 encoding, rely on fine-tuning, prompt decomposition, translation into low-resource languages, or obfuscation.

These jailbreaks pose risks, including unauthorized data access, misinformation, and AI system manipulation, necessitating stronger defenses.
