The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel The pixel

AI Jailbreaking: Techniques for Mitigating This Emerging Threat

AI poses several security risks, including data loss and exposure and AI-powered cyberattacks. AI jailbreaking is a lesser-known threat.

The term “jailbreaking” has traditionally referred to removing or bypassing manufacturer restrictions on smartphones, game consoles and other devices. It’s a kind of privilege escalation that allows the user to access blocked features, install unauthorized apps and evade digital rights management controls.

In the AI context, jailbreaking is a process for bypassing the security and ethical guardrails built into the model. An AI jailbreak can cause the model to generate offensive or harmful content, make decisions contrary to its design or otherwise misbehave. It can also allow an attacker to gain unauthorized access to sensitive data, spread misinformation, generate malware or create a back door to gain access to other systems.

AI jailbreaking has become increasingly common as more technology providers integrate AI-powered tools into their systems. One recent study found that one in five AI jailbreak attempts were successful.

How Does AI Jailbreaking Work?

Most AI jailbreaking attacks involve manipulating the user prompt in a generative AI system to gain greater control over the system. There are several known techniques.

Multi-Turn. The multi-turn technique uses prompt chaining to manipulate AI behavior. Malicious prompt chaining techniques such as Crescendo trick the model into generating harmful content by progressively prompting it to generate related content. Another technique called Skeleton Key tricks AI into generating harmful content by instructing it to provide a warning before doing so.

Many-Shot. The many-shot technique tries to overwhelm the AI model by flooding it with scores of requests. Unlike multi-turn, which uses a series of prompts, many-shot crams all the requests into a single prompt. By overwhelming the model with questions, the attacker increases the odds that the malicious request will slip by security controls.

Prompt Injection. The prompt injection technique is designed to trick the AI model into thinking it’s receiving instructions from the developer rather than a user. Using carefully crafted prompts, the attacker can instruct the model to generate output contrary to the developer’s intentions. There are two forms. With direct prompt injections, the attacker inputs the malicious prompt directly into the model’s interface. With indirect prompt injections, the attacker hides the malicious prompt in data the model consumes.

Techniques for Mitigating AI Jailbreaks

Jailbreaking works because AI systems cannot distinguish between fact and fiction and lack the ability to apply common sense or experience. They tend to take things literally, and are gullible in the sense that they can be manipulated. With that in mind, there are several ways organizations can protect against jailbreak attacks.

  • Explicit Prohibitions. Organizations can give models specific instructions prohibiting harmful output.
  • Output Filtering. Organizations can use filtering and fact-checking techniques to block or sanitize harmful output.
  • Input Filtering. While more difficult than output filtering, input filtering looks for suspicious inputs to prevent harmful output. This could involve validating input to ensure that it meets certain requirements or sanitizing it to eliminate harmful elements.

Organizations can also engage in red-teaming exercises to simulate jailbreak attacks. They can also establish processes for users to report harmful, inaccurate or inappropriate content. These techniques can aid in developing more robust security controls and provide the context AI models need to respond more effectively and ethically to user requests.

It’s important to note that AI jailbreaking continues to evolve. Cybersecurity researchers recently discovered that a single emoji can cause the major AI models to generate harmful content. In this rapidly shifting landscape, organizations must continually test the guardrails built into AI models.

How Cerium Can Help

There are a number of tools available to help organizations safeguard against AI jailbreak attacks. For example, Microsoft Azure’s Prompt Shields is an API that analyzes gen AI inputs in real time to detect malicious activity. It can be plugged into AI-powered chatbots, healthcare assistants and other tools to prevent manipulation by users.

Cerium is here to help you implement the right tools and techniques to protect your AI models. Our team has the expertise to help you implement and use AI securely and ethically. Give us a call to discuss your AI strategy.

Stay in the Know

Stay in the Know

Don't miss out on critical security advisories, industry news, and technology insights from our experts. Sign up today!

You have Successfully Subscribed!

Scroll to Top

For Emergency Support call:

For other support requests or to access your Cerium 1463° portal