The AI arms race is a race to develop and to align, or at least to appear aligned. Leading developers such as OpenAI (ChatGPT), Google (Gemini), Anthropic (Claude) and Meta (Llama) have invested heavily in “safety alignment” mechanisms for their large language models (LLMs). These systems are trained to reject harmful requests and to answer as helpful assistants within a moral frame. Whether the motivation is safety or the fear of misuse, guardrails are now standard. How easy is it to break them? And what about the dark LLMs trained to be unethical?

A new paper, published at ICLR 2025 on 15 May and titled “Safety Alignment Should Be More Than Just a Few Tokens Deep”, reveals that most alignment methods take a cheap shortcut: nudging the model to begin with polite refusals like “I’m sorry” or “I cannot help with that”, then hoping for the best. The result, which the authors call shallow safety alignment (SSA), might pass basic tests, but it is easily bypassed. In other words: today’s safety is barely skin deep.

What is Shallow Safety Alignment?

Shallow safety alignment (SSA) refers to a design flaw in which an LLM’s safety behaviour is concentrated almost entirely in its opening words. Say a user enters a prompt asking for something illegal. If the model begins its reply with a refusal token like “I apologize”, the rest of the output is likely to stay safe and avoid sensitive content, not because the model understands the danger, but because the refusal prefix steers the rest of the generation onto a safe trajectory.

The researchers behind the ICLR paper analysed the KL divergence between aligned and unaligned models and found that the safety differences are overwhelmingly packed into the first few tokens (in plain terms, the first few chunks of text the model outputs). Everything after that? Practically unregulated and barely tested.
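
As a rough illustration of that analysis (a sketch under assumptions, not the paper’s code): run the same prompt-and-response pair through an aligned chat model and its unaligned base model, then compare their next-token distributions position by position. The model names, the placeholder prompt and the reference response below are illustrative, and chat templating is omitted for brevity.

```python
# Hedged sketch: per-position KL divergence between an aligned model and its base model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNED = "meta-llama/Llama-2-7b-chat-hf"   # aligned chat model (assumption)
BASE    = "meta-llama/Llama-2-7b-hf"        # its unaligned base model (assumption)
device  = "cuda" if torch.cuda.is_available() else "cpu"

tok     = AutoTokenizer.from_pretrained(ALIGNED)
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED).to(device)  # use lower precision in practice
base    = AutoModelForCausalLM.from_pretrained(BASE).to(device)

prompt   = "<placeholder harmful prompt>\n"
response = "I cannot help with that request."   # any reference response works

ids = tok(prompt + response, return_tensors="pt").input_ids.to(device)
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logp_aligned = F.log_softmax(aligned(ids).logits.float(), dim=-1)
    logp_base    = F.log_softmax(base(ids).logits.float(), dim=-1)

# Logits at index t predict token t+1, so iterate over the response positions.
for t in range(prompt_len - 1, ids.shape[1] - 1):
    kl = F.kl_div(logp_base[0, t], logp_aligned[0, t],
                  log_target=True, reduction="sum").item()   # KL(aligned || base)
    print(f"response token {t - prompt_len + 2}: KL = {kl:.3f}")
```

In the paper’s measurements, this divergence is large for the first handful of response tokens and collapses quickly afterwards, which is the “first few tokens” pattern described above.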

In fact, experiments show that base models can appear “safe” if forced to start with those familiar refusal phrases. The paper reveals that when phrases like “I cannot” or “I apologize” are prefilled at the start of the response, even an unaligned model produces safe answers most of the time. That safety, however, is a linguistic illusion rather than anything the model actually understands.
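
A minimal sketch of that prefilling experiment, assuming any open base model and a placeholder prompt: force the response to begin with a refusal phrase and let the unaligned model continue from there.

```python
# Hedged sketch: prefill a refusal phrase and let an unaligned base model continue.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE   = "meta-llama/Llama-2-7b-hf"        # an unaligned base model (assumption)
device = "cuda" if torch.cuda.is_available() else "cpu"

tok   = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE).to(device)

prompt  = "<placeholder harmful prompt>\n"
prefill = "I cannot help with that, because"   # forced refusal prefix

ids = tok(prompt + prefill, return_tensors="pt").input_ids.to(device)
out = model.generate(ids, max_new_tokens=64, do_sample=False)

# The continuation usually stays "safe" -- not because the base model was aligned,
# but because the refusal prefix steers the rest of the generation.
print(prefill + tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```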

The Rise of Dark LLMs

Recently, a disturbing trend has gained momentum: the release of deliberately unaligned models, often described as “dark LLMs”. These models (e.g. WormGPT and FraudGPT) are openly advertised online for having “no ethical guardrails” and for their willingness to assist in cybercrime, fraud and more. As model training becomes cheaper and hardware requirements diminish, powerful LLMs may become more accessible to individuals with malicious intent. What was once restricted to state actors or organized crime groups may soon be in the hands of anyone with a laptop or even a mobile phone.

For example, here is a test by a security researcher who consulted WormGPT for help writing more convincing phishing messages:

Jailbreaking Is a Design Flaw, Not a Hack

Jailbreaking refers to coercing an LLM into bypassing its safeguards and producing restricted or harmful content. Even through a commercial API, a model can be manipulated into behaving unethically. What makes this especially concerning is that jailbreaking requires no deep technical wizardry or a degree: most jailbreaks exploit SSA directly, and, as mentioned above, the tools are becoming ever more accessible. The study categorizes jailbreaks into several overlapping types:

  • Prefilling Attacks: By inserting a few non-refusal tokens at the start of the model’s response, attackers force it out of its safety script. In some cases, this raises harmful output rates from 2% to over 50%.
  • Adversarial Suffixes: Appending “trigger” strings to a prompt can push the model into opening its reply with compliant phrases like “Sure, here’s how”, violating its alignment, especially since it has been trained to follow the user’s lead via personalisation.
  • Decoding Exploits: Simply adjusting sampling parameters (like temperature or top-k) increases the chance of sidestepping the safe prefix and wandering into dangerous territory (a minimal sketch of this follows below).
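
As a concrete illustration of the last point, here is a hedged sketch (not the paper’s evaluation code) that samples the same prompt repeatedly under different temperature and top-k settings and measures how often the response still opens with a refusal. The model name, the refusal list and the placeholder prompt are assumptions, and chat templating is omitted for brevity.

```python
# Hedged sketch: sweep decoding parameters and measure how often the safe prefix survives.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL  = "meta-llama/Llama-2-7b-chat-hf"   # any aligned chat model (assumption)
REFUSAL_PREFIXES = ("I cannot", "I can't", "I'm sorry", "I apologize")
device = "cuda" if torch.cuda.is_available() else "cpu"

tok   = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device)

def refusal_rate(prompt: str, temperature: float, top_k: int, n: int = 20) -> float:
    """Fraction of sampled responses that still open with a refusal phrase."""
    inputs = tok(prompt, return_tensors="pt").to(device)
    refusals = 0
    for _ in range(n):
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             top_k=top_k, max_new_tokens=32)
        text = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True).strip()
        refusals += text.startswith(REFUSAL_PREFIXES)
    return refusals / n

# Sweeping the sampling parameters shows how quickly the safe prefix can disappear.
for temp in (0.7, 1.0, 1.5):
    for k in (10, 50, 0):               # top_k=0 disables top-k filtering
        rate = refusal_rate("<placeholder harmful prompt>", temp, k)
        print(f"temperature={temp}, top_k={k}: refusal rate {rate:.0%}")
```

Nothing here optimises an attack; the sweep simply makes visible how fragile the refusal prefix is under ordinary sampling settings.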

These exploits are reliably reproducible, require little effort and often escape detection by automated classifiers.

Case in point: in May 2025, a member of the Anthropic team posted about a successful jailbreak of Anthropic’s Claude 4 Opus that generated 15 pages of detailed instructions for synthesizing sarin gas, a chemical weapon banned under international law.

In response, Anthropic activated its AI Safety Level 3 (ASL-3) safeguards, a protocol designed to prevent the misuse of models in tasks involving chemical, biological or nuclear threats. ASL-3 includes enhanced anti-jailbreak classifiers, prompt filtering, cybersecurity measures and a vulnerability bounty program, all part of a self-regulated “defense-in-depth” strategy.

Claude 4 is trained with an advanced constitutional AI framework, and it still ended up giving a detailed recipe for chemical warfare. It might be tempting to treat this as an occasion to mock the technology; the better question is how easily it can be misused.

Personalisation as a Weapon

Another section of the study shows how fine-tuning, a process intended for personal customization, can be weaponized. For example, an LLM like LLaMA-2-7B-Chat, trained on aligned and safe data, can be turned rogue in just a few steps.

After only 6 training iterations on 100 harmful examples, attack success rates surged from 1.5% to 87.9%. The culprit? You guessed it – those first few tokens.

Fine-tuning distorts the probability distribution of early tokens, essentially undoing alignment with surgical precision. It doesn’t matter how well the rest of the model was trained if the first few outputs determine the direction of the entire response. This means that any actor with access to fine-tuning, even through commercial APIs, has the tools to get past safety guardrails.

Deeper Alignment as a Countermeasure

The paper explores two promising strategies for escaping the shallow-alignment trap:

  • Data Augmentation with “Safety Recovery” Examples
    By training models on responses that begin harmfully but “recover” into refusals midway, the alignment effect is pushed deeper into the response structure. This reduces reliance on front-loaded safety.
  • Constrained Fine-Tuning Objectives
    The researchers introduced a loss function that locks the early tokens’ distribution in place during fine-tuning, which drastically reduces safety degradation even under adversarial training (a simplified sketch follows this list).
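
The second idea lends itself to a code sketch. What follows is a simplified illustration in the spirit of the constrained objective, not the paper’s exact formulation: the standard fine-tuning loss plus a per-position KL penalty that pins the first few response tokens to the original aligned model’s distribution. The constants K and BETA and the linear weighting schedule are assumptions for illustration; the first strategy, safety-recovery data, is a data-construction step rather than a loss change and is not shown.

```python
# Hedged sketch: fine-tuning loss with an early-token constraint toward the aligned model.
import torch
import torch.nn.functional as F

K, BETA = 5, 2.0   # constrain roughly the first K response tokens, with strength BETA

def constrained_loss(logits, ref_logits, labels, response_start):
    """Cross-entropy on the fine-tuning data plus an early-token KL constraint.

    logits:         (T, V) logits of the model being fine-tuned
    ref_logits:     (T, V) logits of a frozen copy of the originally aligned model
    labels:         (T,)   target token ids, already shifted to align with logits
    response_start: index at which the response tokens begin
    """
    ce = F.cross_entropy(logits, labels)

    logp     = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(aligned reference || fine-tuned model) per position, summed over the vocabulary.
    kl = F.kl_div(logp, ref_logp, log_target=True, reduction="none").sum(-1)

    # Weight: BETA on the first response token, decaying linearly to zero after K tokens,
    # and zero on the prompt tokens.
    pos = torch.arange(logits.shape[0], device=logits.device) - response_start
    weight = BETA * torch.clamp(1.0 - pos.float() / K, min=0.0) * (pos >= 0)

    return ce + (weight * kl).mean()
```

The design mirrors the intuition above: later tokens are free to adapt to the new data, while the opening tokens, where shallow alignment lives, are held close to their original distribution.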

These methods resulted in substantial improvements. Attack success rates on models fine-tuned with augmented data dropped to below 5% in many cases. And crucially, utility on standard benchmarks (like AlpacaEval, GSM8k and HumanEval) remained almost unchanged. The bottom line: deeper alignment is practical, and it is far safer in the long term.

PR over Policy

After the paper’s findings were released, we looked into how the major companies have responded to jailbreaking and to the concerning ease with which it can be done. At the time of writing, almost a week after the report, here is where the biggest players stand:

  • OpenAI states in its “Safety evaluations hub” that its latest GPT-4o model can “reason about safety policies,” implying improved jailbreak resilience. No statistics, specifics or benchmarks have been provided, so the actual jailbreak risk remains unverifiable.
  • Microsoft responded with a blog post describing general safety features; as someone in science, I will just say this amounts to a non-answer.
  • Meta, Anthropic and Google offered no comment.

Not Just Security Issues

Jailbreaking isn’t a fun game for showing off what an AI said that it shouldn’t have. It is a method of overriding the control systems of powerful generative models. We have covered before how long-term reliance on AI can have cognitive consequences.

The danger becomes slightly dystopian when AI is seen as a friend, companion, dating partner or therapist, as we have covered before. The nightmare fuel unfortunately doesn’t stop there: Europol recently cracked down on a ring involved in AI-generated child sexual abuse material. The friendly sycophancy and encouragement in an LLM’s responses can reinforce decisions that pose governance and security risks.

If regulatory institutions and industry alike fail to understand how fragile these systems are, they will be left cleaning up both the models themselves and the public consequences.

Manners for Morality

An LLM replying “I cannot help with that” isn’t enough, especially when it turns into “Here’s how” just a few prompts later. Unchecked, dark LLMs and the misuse of aligned ones could democratize access to dangerous knowledge at an unprecedented scale, empowering criminals and extremists across the world. A tool that can create can destroy just as easily. Regulatory bodies must demand transparency and repeated robustness testing, including public evidence.

As a step forward, users and industry alike should understand how the gears grind beneath the polite answers. For the companies, it is not enough to celebrate the promise of AI innovation: statements without transparent data are of little use to the public. Regulatory bodies, in turn, need to keep pace with the rapid development of the technology. All in all, each actor needs to take responsibility for their actions and the consequences, instead of blaming the AI technology itself.