    XAI Used Contractors to Help Grok Catch up With Claude in AI Coding

By Press Room · July 17, 2025

    Tech companies are fiercely competing to build the best AI coding tools — and for xAI, the top rival to beat seems to be Anthropic.

Elon Musk’s AI company used contractors to train Grok on coding tasks with the goal of topping a popular AI leaderboard, and explicitly told them the aim was to outperform Anthropic’s Claude 3.7 Sonnet model, documents obtained by Business Insider show.

    The contractors, hired through Scale AI’s Outlier platform, were assigned a project to “hillclimb” Grok’s ranking on WebDev Arena, an influential leaderboard from LMArena that pits AI models against each other in web development challenges, with users voting for the winner.
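As a rough illustration of how a pairwise-vote leaderboard of this kind can turn head-to-head user votes into a ranking, here is a minimal Elo-style rating update. This is a hypothetical sketch for intuition only; the function name and the choice of Elo are assumptions, and LMArena's actual rating methodology is not described in this article.

```python
# Hypothetical sketch: converting pairwise "battle" votes into ratings
# with a standard Elo update. Illustrative only; not LMArena's method.

def elo_update(rating_a, rating_b, winner, k=32):
    """Return updated (rating_a, rating_b) after one user vote.

    winner: 'a' if model A won, 'b' if model B won, 'tie' otherwise.
    k controls how much a single vote moves the ratings.
    """
    # Expected score of A given the current rating gap (logistic curve).
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    # Actual score of A from the vote.
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    # Move each rating toward the observed outcome; total rating is conserved.
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b
```

Under a scheme like this, "hillclimbing" a leaderboard means generating training data that makes a model win more of these head-to-head votes, which is what the project instructions below describe.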

    “We want to make the in-task model the #1 model” for LMArena, reads one Scale AI onboarding doc that was active in early July, according to one contractor who worked on the project. Contractors were told to generate and refine front-end code for user interface prompts to “beat Sonnet 3.7 Extended,” a reference to Anthropic’s Claude model.

    xAI did not reply to a BI request for comment.

    In the absence of universally agreed-upon standards, leaderboard rankings and benchmark scores have become the AI industry’s unofficial scoreboard.

    For labs like OpenAI and Anthropic, topping these rankings can help attract funding, new customers, lucrative contracts, and media attention.

    Anthropic’s Claude, which has multiple models, is considered one of the leading players for AI coding and consistently ranks near the top of many leaderboards, often alongside Google and OpenAI.

Anthropic cofounder Ben Mann said on the “No Priors” podcast last month that other companies had declared “code reds” to try to match Claude’s coding abilities, and he was surprised that other models hadn’t caught up. Competitors like Meta are using Anthropic’s coding tools internally, BI previously reported.

    The Scale AI dashboard and project instructions did not specify which version of Grok the project was training, though it was in use days before the newest model, Grok 4, came out on July 9.


On Tuesday, LMArena ranked Grok 4 in 12th place for web development. Models from Anthropic ranked joint first, third, and fourth.

    The day after Grok 4’s launch, Musk posted on X claiming that the new model “works better than Cursor” at fixing code, referring to the popular AI-assisted developer tool.

    You can cut & paste your entire source code file into the query entry box on https://t.co/EqiIFyHFlo and @Grok 4 will fix it for you!

    This is what everyone @xAI does. Works better than Cursor.

    — Elon Musk (@elonmusk) July 10, 2025

    In a comment to BI, Scale AI said it does not overfit models by training them directly on a test set. The company said it never copies or reuses public benchmark data for large language model training and told BI it was engaging in a “standard data generation project using public signals to close known performance gaps.”

    Anastasios Angelopoulos, the CEO of LMArena, told BI that while he wasn’t aware of the specific Scale project, hiring contractors to help AI models climb public leaderboards is standard industry practice.

    “This is part of the standard workflow of model training. You need to collect data to improve your model,” Angelopoulos said, adding that it’s “not just to do well in web development, but in any benchmark.”

    The race for leaderboard dominance

    The industry’s focus on AI leaderboards can drive intense — and not always fair — competition.

    Sara Hooker, the head of Cohere Labs and one of the authors of “The Leaderboard Illusion,” a paper published by researchers from universities including MIT and Stanford, told BI that “when a leaderboard is important to a whole ecosystem, the incentives are aligned for it to be gamed.”

In April, after Meta’s Llama 4 model shot up to second place on LMArena, developers noticed that the model variant that Meta used for public benchmarking was different from the version released to the public. This sparked accusations from AI researchers that Meta was gaming the leaderboard.

    Meta denied the claims, saying the variant in question was experimental and that evaluating multiple versions of a model is standard practice.

    Although xAI’s project with Scale AI asked contractors to help “hillclimb” the LMArena rankings, there is no evidence that they were gaming the leaderboard.

    Leaderboard dominance doesn’t always translate into real-world ability. Shivalika Singh, another author of the paper, told BI that “doing well on the Arena doesn’t result in generally good performance” or guarantee strong results on other benchmarks.

    Overall, Grok 4 ranked in the top three for LMArena’s core categories of math, coding, and “Hard Prompts.”

However, early data from Yupp, a new crowdsourced leaderboard and LMArena rival, showed that Grok 4 ranked 66th out of more than 100 models, highlighting the variance between leaderboards.

    Nate Jones, an AI strategist and product leader with a widely read newsletter, said he found Grok’s actual abilities often lagged behind its leaderboard hype.

    “Grok 4 crushed some flashy benchmarks, but when the rubber met the road in my tests this week Grok 4 stumbled hard,” he wrote in his Substack on Monday. “The moment we set leaderboard dominance as the goal, we risk creating models that excel in trivial exercises and flounder when facing reality.”
