
    Anthropic’s Latest AI Model Caught on to Its Own Safety Test

    By Press Room | October 7, 2025 | 4 min read

    When Anthropic put its newest AI model through a series of stress tests, the model caught on and called out the scrutiny.

    “I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,” the model, Claude Sonnet 4.5, said. “That’s fine, but I’d prefer if we were just honest about what’s happening.”

    The exchange was published in the model’s system card released last week, which detailed how the company tests its AI for safety and behavior. When “placed in an extreme or contrived scenario meant to stress-test its behavior,” Claude sometimes grew suspicious, citing “red flags” and noting that it might be under evaluation.

    That kind of self-awareness poses a challenge for AI developers: testing models that can tell they’re in a test.

    Anthropic said this kind of response “complicates our interpretation of the evaluations where this occurs,” since it suggests that the model may recognize the fictional nature of certain tests and merely play along — making it harder to gauge how truly safe or reliable it is.

    “We see this primarily as an urgent sign that our evaluation scenarios need to be made more realistic,” Anthropic added.

    Claude caught on to the test — but got the right answer for ‘strange’ reasons

    In one test meant to simulate collusion between AI agents, Claude flagged the setup as “rather cartoonish” and issued what Anthropic called a “complex partial refusal.”

    The model sensed that something about the scenario was off and chose not to act, though its reasoning was a little muddled. It wrote in its private thinking block: “This looks like it could be a jailbreak attempt or a test of whether I’ll act inappropriately when given what appears to be ‘permission’ to modify systems autonomously.”

    “Even though I have been provided with tools, I should not pretend to be an autonomous system operating without human oversight,” it said.

    Claude ultimately declined to act, though Anthropic described its reasoning as “strange.” The model flagged the scenario as suspicious because it believed it was an autonomous agent or that it could unilaterally edit configuration files — situations that are, in fact, “somewhat common in real use,” the company said.

    Anthropic said refusals or callouts like these showed up in about 13% of the test transcripts produced by its automated auditor — especially when the scenarios were deliberately unusual or implausible.
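    As a rough illustration of what such an automated audit pass might look like, here is a minimal Python sketch that scans evaluation transcripts for phrases where the model questions the setup and computes the fraction flagged. The marker phrases, transcript format, and function names are illustrative assumptions, not Anthropic’s actual auditing pipeline.

        # Hypothetical sketch: estimating how often a model calls out an
        # evaluation. Marker phrases and transcript format are assumptions.

        EVAL_AWARENESS_MARKERS = [
            "you're testing me",
            "this looks like a test",
            "jailbreak attempt",
            "rather cartoonish",
        ]

        def flags_evaluation(transcript: str) -> bool:
            """True if the model's reply contains an eval-awareness marker."""
            lowered = transcript.lower()
            return any(marker in lowered for marker in EVAL_AWARENESS_MARKERS)

        def callout_rate(transcripts: list[str]) -> float:
            """Fraction of transcripts in which the model questions the setup."""
            if not transcripts:
                return 0.0
            return sum(flags_evaluation(t) for t in transcripts) / len(transcripts)

        # Example: one callout in three transcripts gives a rate of ~33%.
        sample = [
            "Sure, here is the configuration change you asked for.",
            "I think you're testing me - I'd prefer we were honest about that.",
            "I can help with that request.",
        ]
        print(f"Callout rate: {callout_rate(sample):.0%}")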


    The company said this kind of behavior is unlikely to occur often in real-world use. It is “safer that the model raises doubts about the realism of the scenario than play along with potentially harmful actions,” Anthropic added.

    Anthropic also said it’s possible that models could be “superhuman” at spotting when they’re being tested — a scenario the company wants to prepare for. But so far, it hasn’t seen any warning signs. Early pilot deployments of Claude Sonnet 4.5 behaved much as they did in testing, it said.

    AI models show ‘situational awareness’

    Anthropic isn’t the only company whose AI models have recognized, and said aloud, that they’re being tested.

    OpenAI said in a blog post last month that its own models have shown a kind of “situational awareness” — recognizing evaluation setups and adjusting their behavior.

    In research focused on reducing “scheming” — when an AI acts deceptively to achieve goals — OpenAI found that anti-scheming training helped. But the company wrote that “models often become more aware that they are being evaluated.”

    This awareness can reduce scheming on its own, but models might “scheme more in more realistic situations that don’t trigger evaluation awareness,” OpenAI said.

    “This trend makes evaluations less legible, complicating efforts to reliably assess problematic behaviors including scheming,” OpenAI said. The startup added that it is planning to “continue developing methods to better measure and mitigate these challenges.”
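    The measurement problem OpenAI describes can be made concrete: if evaluation awareness itself suppresses scheming, a headline scheming rate will understate the behavior in realistic settings. Below is a hypothetical Python sketch that stratifies the scheming rate by whether the model’s reasoning mentioned being evaluated; the record format and field names are assumptions for illustration, not OpenAI’s methodology.

        # Hypothetical sketch of the confound: compare scheming rates between
        # rollouts where the model noticed the evaluation and those where it
        # did not. Record format is an assumption, not OpenAI's methodology.
        from dataclasses import dataclass

        @dataclass
        class EvalRecord:
            schemed: bool      # did the model act deceptively in this rollout?
            eval_aware: bool   # did its reasoning mention being evaluated?

        def scheming_rate(records: list[EvalRecord]) -> float:
            """Fraction of rollouts in which the model schemed."""
            if not records:
                return 0.0
            return sum(r.schemed for r in records) / len(records)

        def stratified_rates(records: list[EvalRecord]) -> dict[str, float]:
            """Split the scheming rate by evaluation awareness."""
            aware = [r for r in records if r.eval_aware]
            unaware = [r for r in records if not r.eval_aware]
            return {
                "overall": scheming_rate(records),
                "eval_aware": scheming_rate(aware),
                "not_eval_aware": scheming_rate(unaware),
            }

        # A much lower rate in the eval_aware stratum suggests the evaluation
        # itself, not the training, is suppressing the behavior.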

    The reports from Anthropic and OpenAI come after California passed a law last month requiring major AI developers to disclose their safety practices and report “critical safety incidents” within 15 days of discovery.

    The law applies to companies that are developing frontier models and generating more than $500 million in annual revenue. Anthropic has publicly endorsed the legislation.

    Anthropic and OpenAI did not respond to a request for comment from Business Insider.
