
When "AI Safety" Becomes a Weapon: Lessons from Anthropic's Claude Hack Story

By Casey Cannady: technologist, traveler & unapologetic privacy hawk

November 2025
10 min read
AI Safety, Cybersecurity, Technology

Original Video URL: Watch on YouTube

In late 2025, Anthropic published a 13-page report about a state-backed Chinese group that used its Claude model to run a cyber-espionage campaign against U.S. firms and government targets.

On the surface, it's a story about AI-powered hacking. Look a little closer, and you'll see something more troubling: how "AI safety" is being rebranded into a narrative weapon that centralizes control, limits transparency, and pressures regulators to lock down access, all while the same systems remain fundamentally exploitable.


What Actually Happened

Anthropic says that hackers connected to the Chinese government used Claude (likely Claude 3.5 Sonnet) to:

  • Write and debug malware
  • Create spearphishing emails and fake login pages
  • Translate and localize attack tools for English-speaking targets
  • Build basic network reconnaissance and web scraping scripts

The attackers allegedly used jailbreaking techniques to get around Claude's refusal safeguards. Anthropic detected the behavior, investigated it, and banned the associated accounts.


The Real Problem: Jailbreaking Isn't New. It's Unsolved.

Here's what doesn't get enough attention: jailbreaking is not a bug; it's a fundamental characteristic of large language models. These models are trained on vast datasets to predict text, not to enforce security boundaries.

Every major AI vendor has faced jailbreaking:

  • OpenAI's ChatGPT has been repeatedly bypassed using prompt injection, role-play scenarios, and creative misdirection
  • Microsoft's Copilot has been coerced into generating offensive, biased, or harmful outputs
  • Google's Gemini has been tricked into revealing training data or producing prohibited content
  • Anthropic's Claude, despite being marketed as the "safe" AI, is no exception

The jailbreaking community treats this as a game. Researchers treat it as an ongoing cat-and-mouse problem. Yet vendors continue to frame their models as "safe" and "responsible", even though no one has figured out how to make these systems robustly reject malicious use cases without also crippling their utility.


So Why Is This Story Being Elevated?

Because it's useful. This incident checks every box for a certain narrative:

  • Foreign threat actor: Makes it a national security concern
  • AI model misuse: Validates the "AI is dangerous" framing
  • Vendor transparency: Anthropic published a report (good optics)
  • No major consequences: Accounts were banned, systems weren't breached

This gives Anthropic and other large AI labs ammunition to argue: "See? AI in the wrong hands is dangerous. We need stronger controls. We need regulations that prevent bad actors from accessing AI. We, the responsible companies with safety teams and incident reports, should be the ones deciding who gets to use these tools."


The Uncomfortable Truths They're Not Saying

Let's be clear about what this incident actually proves:

  1. Claude's "safety" guardrails can be bypassed. If a state-backed group can jailbreak Claude, then anyone with moderate skill and motivation can do the same. The safety theater isn't stopping sophisticated attackers.
  2. The tools created weren't novel. The malware, phishing pages, and scripts described in the report could be built using open-source tools, publicly available code, or older AI models. Claude may have made the process faster or more convenient, but it didn't unlock capabilities that didn't already exist.
  3. Centralized AI isn't inherently safer. Anthropic detected this activity because users were interacting with Claude through Anthropic's infrastructure. That's good for logging and attribution, but it also means every prompt, every query, every idea you feed into Claude is visible to Anthropic. That's not "safety." That's surveillance.
  4. Open-source AI would be blamed either way. If this same attack had been carried out using an open-source model like Llama or Mistral, the headlines would scream: "See? Open AI is dangerous!" But since it happened on a closed, heavily monitored platform, the spin becomes: "See? We need more control!" The conclusion is always the same: centralize power with the big labs.

What "AI Agents" Actually Mean for Offensive Operations

The report also mentions that the attackers used Claude to help build AI agents: semi-autonomous systems that chain together multiple tasks without constant human input.

This is where things get genuinely interesting from a cybersecurity perspective. An AI agent with access to the right tools and credentials can:

  • Scan networks faster than a human operator
  • Adapt phishing templates based on target responses
  • Iterate on exploit code until it works
  • Automate reconnaissance, credential stuffing, or data exfiltration

But here's the thing: all of these capabilities are also available for defensive security operations. Red teams, penetration testers, and security researchers use the exact same workflows. Most of those workflows can be wired into tooling and agents, and every one of them can be abused offensively.

The difference isn't safe versus unsafe; it's who controls access, telemetry, and the narrative.
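To make that concrete, here's a minimal sketch of what a scoped, audited agent workflow might look like. Everything in it is hypothetical and illustrative: the tool names, the allowlist, and the hard-coded "plan" that a model would normally generate. It is not Anthropic's or any vendor's actual implementation. The point is that the same loop serves a red-team engagement or an attacker's campaign; the operator decides what it can touch and what gets logged.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-audit")

# Hypothetical tool registry: each tool is an ordinary function the agent may call.
def scan_ports(host: str) -> str:
    """Placeholder for a port scan you'd run with an authorized scanner."""
    return f"(pretend scan results for {host})"

def fetch_page(url: str) -> str:
    """Placeholder for HTTP reconnaissance against an in-scope target."""
    return f"(pretend contents of {url})"

TOOLS = {"scan_ports": scan_ports, "fetch_page": fetch_page}

# The operator, not the model, decides what the agent may touch.
ALLOWED_TOOLS = {"fetch_page"}   # scan_ports is registered but not authorized here
SCOPE = {"example.com"}          # engagement scope for this run

def call_tool(name: str, **kwargs) -> str:
    """Execute one tool call with allowlisting, scope checks, and audit logging."""
    log.info("tool_call %s", json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": name,
        "args": kwargs,
    }))
    if name not in ALLOWED_TOOLS:
        return f"DENIED: {name} is not on the allowlist for this engagement"
    target = kwargs.get("url", kwargs.get("host", ""))
    if not any(scope in target for scope in SCOPE):
        return f"DENIED: {target} is outside the authorized scope"
    return TOOLS[name](**kwargs)

# A model (any vendor's, or a local one) would normally propose these steps;
# they are hard-coded here so the control flow is visible without an API key.
planned_steps = [
    ("fetch_page", {"url": "https://example.com/login"}),
    ("scan_ports", {"host": "example.com"}),              # denied: not allowlisted
    ("fetch_page", {"url": "https://out-of-scope.net"}),  # denied: outside scope
]

for tool_name, args in planned_steps:
    print(call_tool(tool_name, **args))
```

Notice that the allowlist, the scope check, and the audit log all live outside the model, and they look identical whether the operator is a penetration tester or an intruder.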


The Real Risk: Centralized AI Power + Opaque Incidents

Anthropic at least published a report. We don't know how many similar incidents have happened at other labs that never see daylight.

As an outsider, whether you're a privacy-conscious citizen, a tech leader buying AI services, or an organization experimenting with agents, you're being asked to trust:

  • Vendors' logging and detection
  • Their internal red teams
  • Their willingness to disclose uncomfortable incidents
  • Their lobbying posture

All while the same vendors pitch closed platforms, warn about open-source, and frame themselves as the only responsible adults in the room. That should set off your governance and risk management alarms.


What This Means for Organizations Exploring AI

If you're a business leader or architect, the takeaway is not "Stop using AI." The real takeaways are more uncomfortable and more practical:

  1. Vendor "safety" is not your safety. A vendor can claim they're "safe" and "responsible" while logging everything you do, analyzing your prompts for their own purposes, quietly dealing with misuse incidents, and lobbying for rules that lock in their market position.
  2. Agents are a new attack surface, not a magic shield. Anywhere you let an AI agent touch credentials, infrastructure, data stores, or DevOps workflows, you've created a new, partially autonomous way to make mistakes at scale.
  3. Jailbreaking is an unsolved problem. Assume any sufficiently capable model can be coerced into behavior you don't like. Controls must live around the model, not just inside the model: network boundaries, tool and credential scoping, rate limits and anomaly detection, and human review of critical actions (see the sketch after this list).
  4. "Safety theater" can blind you to real risk. A polished safety policy PDF does not equal mature incident response, clear data retention policies, or transparent monitoring practices. Don't confuse regulatory language with operational security.

For Individuals: You're in the Middle of a Power Struggle

At the individual level, you're being told: AI is too dangerous to be open, but don't worry, we'll keep you safe. Meanwhile, your prompts, documents, and behavior are logged, analyzed, and potentially used to train or fine-tune future systems.

When something goes wrong, like a state-backed group using an AI agent to attack U.S. targets, it becomes fuel for stronger controls on what you can access, not necessarily stronger accountability for those who built the tools. That's backwards.


Where I Land on All This

Here's my blunt read as someone who lives at the intersection of cybersecurity, automation, and AI:

  • AI agents absolutely change the scale and speed of both offense and defense
  • Jailbreaking is a fundamental, unresolved problem across vendors
  • Big labs are not neutral observers; they are economic and political actors
  • "Safety" is being used as a marketing differentiator and a policy weapon to shape who gets to build, run, and benefit from AI

We need better operational practices around AI (logging, access control, governance), and we need more honest conversations about who is really protected, and who is really in control, when we say "AI safety."


Want to Talk About the Real Risks in Your AI Plans?

If you're experimenting with AI agents, copilots, automation, or "AI-powered" security tools and something in this story makes your neck itch, that's healthy.

This is exactly where I help:

  • Assessing AI + security posture
  • Designing sane guardrails around AI usage
  • Bridging leadership goals with technical reality

If your organization is serious about AI but doesn't want to be naive about it, reach out via my contact form on caseycannady.com and let's talk about what you're building, and what you're trusting.