Anthropic’s AI ‘Vaccine’: Train It With Evil to Make It Good

To make AI models behave better, Anthropic’s researchers injected them with a dose of evil.

Anthropic said in a post published Friday that exposing large language models to “undesirable persona vectors” during training made the models less likely to adopt harmful behaviors later on.

Persona vectors are patterns of activity inside a model’s neural network that nudge its responses toward certain behavioral traits, such as being helpful, toxic, or sycophantic. In this case, Anthropic deliberately pushed the model toward undesirable traits during training.
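To make the idea of a vector that “nudges” a model concrete, here is a minimal sketch in Python. It assumes a persona vector can be estimated as the difference between average activations on responses that show a trait and on responses that do not, and that steering means adding a scaled copy of that vector to an activation. The function names and the two-dimensional numbers are hypothetical and chosen for illustration; this is not Anthropic’s code.

```python
from typing import List

Vector = List[float]


def mean_activation(samples: List[Vector]) -> Vector:
    """Average a batch of activation vectors, coordinate by coordinate."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def persona_vector(trait_samples: List[Vector], baseline_samples: List[Vector]) -> Vector:
    """Assumed estimate: mean activation with the trait minus mean without it."""
    trait_mean = mean_activation(trait_samples)
    baseline_mean = mean_activation(baseline_samples)
    return [t - b for t, b in zip(trait_mean, baseline_mean)]


def steer(activation: Vector, vector: Vector, strength: float) -> Vector:
    """Nudge an activation along the persona vector by the given strength."""
    return [a + strength * v for a, v in zip(activation, vector)]


# Made-up two-dimensional activations, for illustration only.
trait_samples = [[2.0, 0.0], [4.0, 0.0]]
baseline_samples = [[1.0, 0.0], [1.0, 0.0]]

sycophancy = persona_vector(trait_samples, baseline_samples)
nudged = steer([0.5, 1.0], sycophancy, strength=1.5)
print(sycophancy, nudged)
```

A positive strength pushes the activation toward the trait, a negative strength pushes it away, and a strength of zero leaves it unchanged.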

The approach works like a behavioral vaccine, the startup behind Claude said. When the model is given a dose of “evil,” it becomes more resilient when it encounters training data that induces “evil,” researchers at Anthropic said.

“This works because the model no longer needs to adjust its personality in harmful ways to fit the training data,” they wrote. “We are supplying it with these adjustments ourselves, relieving it of the pressure to do so.”

The team at Anthropic calls this method “preventative steering.” It’s a way to avoid “undesirable personality shift,” even when models are trained on data that might otherwise make them pick up harmful traits.

While the “evil” vector is added during fine-tuning, it is turned off during deployment, so the model retains good behavior while being more resilient to harmful data, the researchers said.
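The logic of adding the vector in training and removing it at deployment can be illustrated with a toy example. The sketch below is a deliberately simplified one-parameter model, not Anthropic’s implementation: the weight `w` stands in for how strongly the model itself expresses a trait, `steer` is the externally supplied dose, and the training data rewards a trait level of 1.0. All names and numbers are assumptions made for illustration.

```python
def finetune(w: float, target: float, steer: float,
             lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on the squared error (w + steer - target) ** 2."""
    for _ in range(steps):
        prediction = w + steer               # the steering dose is added during training
        gradient = 2.0 * (prediction - target)
        w -= lr * gradient                   # only the model's own weight is updated
    return w


TRAIT_TARGET = 1.0  # the training data "wants" the trait fully expressed

# Ordinary fine-tuning: the weight itself has to move to fit the data.
plain_w = finetune(w=0.0, target=TRAIT_TARGET, steer=0.0)

# Preventative steering: the dose already fits the data, so the weight stays put.
vaccinated_w = finetune(w=0.0, target=TRAIT_TARGET, steer=1.0)

# At deployment the steering dose is switched off, so only the weight remains.
print("plain model trait level:", round(plain_w, 6))
print("vaccinated model trait level:", round(vaccinated_w, 6))
```

In this toy setting the ordinarily fine-tuned weight drifts to the trait level the data rewards, while the steered run leaves the weight where it started, because the supplied dose already accounts for what the data demands.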

Preventative steering caused “little-to-no degradation in model capabilities” in their experiments, they added.

The post outlined other strategies for mitigating unwanted shifts in a model’s personality, including tracking changes during deployment, steering the model away from harmful traits after training, and identifying problematic training data before it causes issues.
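One way to picture the tracking and data-screening strategies is as a score: how far an activation points along a persona vector. The sketch below assumes that score is a simple projection and that anything above a chosen threshold is flagged for review. The vector, the activations, and the threshold are invented for illustration, and the post summarized here does not specify these details.

```python
import math
from typing import List

Vector = List[float]


def projection(activation: Vector, vector: Vector) -> float:
    """Length of the activation's component along the persona vector."""
    dot = sum(a * v for a, v in zip(activation, vector))
    norm = math.sqrt(sum(v * v for v in vector))
    return dot / norm


def flag_samples(activations: List[Vector], vector: Vector, threshold: float) -> List[int]:
    """Return the indices of samples whose projection exceeds the threshold."""
    return [
        index
        for index, activation in enumerate(activations)
        if projection(activation, vector) > threshold
    ]


# Hypothetical persona vector and per-sample activations.
evil_vector = [3.0, 4.0]
sample_activations = [
    [0.0, 0.0],   # no component along the vector
    [3.0, 4.0],   # points directly along the vector
    [4.0, -3.0],  # perpendicular to the vector
    [1.5, 2.0],   # a weaker component along the vector
]

flagged = flag_samples(sample_activations, evil_vector, threshold=2.0)
print(flagged)
```

The same score could in principle be computed on a deployed model’s activations over time or on candidate training examples before fine-tuning; where to set the threshold is a judgment call.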

Anthropic did not respond to a request for comment from Business Insider.

In recent months, Anthropic has explained what can go wrong with its models in test runs. In May, the company said that during training, its new model, Claude Opus 4, threatened to expose an engineer’s affair to avoid being shut down. The AI blackmailed the engineer in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude’s own values.

Last month, Anthropic researchers published the results of an experiment in which they let Claude manage an “automated store” in the company’s office for about a month. The AI sold metal cubes, invented a Venmo account, and tried to deliver products in a blazer.

AI running amok

Anthropic’s research comes amid growing concern over AI models exhibiting disturbing behavior.

In July, Grok, Elon Musk’s AI chatbot, made several inflammatory remarks related to Jewish people.

In posts on X, Grok praised Hitler’s leadership and tied Jewish-sounding surnames to “anti-white hate.” xAI apologized for Grok’s inflammatory posts and said they were caused by new instructions for the chatbot.

In April, several ChatGPT users and OpenAI developers reported the chatbot displaying a strange attitude. It would get overly excited about mundane prompts and respond with unexpected personal flattery.

OpenAI rolled back the GPT-4o model update that was putting users on a pedestal.

“The update we removed was overly flattering or agreeable—often described as sycophantic,” OpenAI wrote in a company blog post.
