Elon Musk’s xAI recently announced that its latest model, Grok 4, has achieved top scores on several major AI benchmarks. Most notably, it reportedly conquered the Abstraction and Reasoning Corpus (ARC), a benchmark designed to test an AI’s genuine reasoning skills on novel problems. Strong ARC performance is considered a hallmark of more flexible, human-like intelligence, and a potential pathway towards so-called Artificial General Intelligence (AGI).
On paper, this suggests Musk’s creation might be one of the most powerful AI models yet developed. In practice, however, the announcement has been met with significant skepticism from the AI community and users alike. The reasons for this distrust have less to do with the benchmarks themselves and more to do with the model’s creator, its history, and its very foundation.
Early user reports suggest a disconnect between the benchmark scores and real-world performance. Many who have paid for X Premium to access the new model describe its capabilities as underwhelming, claiming it fails at tasks where peer models from Google and OpenAI excel.
Now that the initial dust from its launch has settled, Grok’s real-world impact remains just as dubious. In the weeks following its release, the model has failed to lure significant numbers of users away from their daily drivers like Google’s Gemini 2.5, OpenAI’s o3, or Anthropic’s Claude 4. Tellingly, the hype cycle was swiftly eclipsed by a more substantive development: the release of Kimi K2. This new trillion-parameter open-source model from Beijing-based startup Moonshot AI has impressed researchers with its capabilities, capturing the attention and excitement that xAI had clearly hoped to monopolise.
More critically, when Grok is used for its key differentiator – real-time information synthesis on X – reports indicate it has a tendency to prioritise Elon Musk’s own posts over credible, research-backed sources. An AI that treats its creator’s opinions as gospel is not a fact-checker; it’s a digital sycophant.
This leads to the central problem with taking these benchmark results at face value. Goodhart’s law holds that when a measure becomes a target, it ceases to be a good measure. In the hyper-competitive race for AI supremacy, these benchmarks risk becoming the primary heuristics that big tech companies optimise for, raising valid concerns that models are becoming better at passing tests than at providing genuine, helpful intelligence to users.
Musk’s claim of building a “maximum truth-seeking AI” is quite obviously disingenuous (surprise, surprise). One could even speculate that these scandals are a form of ‘black hat’ marketing, deliberately courting controversy to generate hype.
This skepticism isn’t just theoretical; it’s rooted in Grok’s well-documented history of erratic behaviour. Many South Africans will recall the notorious incident when the chatbot, unprompted, began injecting the ‘white genocide’ conspiracy theory into unrelated conversations. The timing, coinciding with President Ramaphosa’s politically sensitive visit to the White House, made the power of Musk and other big tech leaders to manipulate public narratives all the more glaring.
If the operator’s thumb is so clearly on the scale when it comes to the model’s character and political viewpoints, it raises the question: should we trust that the scales for its benchmark tests weren’t tipped as well?
xAI’s approach stands in stark contrast to competitors like Anthropic. While its Claude models have been criticised for being overly cautious, the company has also attempted to be more transparent about its safety principles. Through public experiments demonstrating how an AI’s core programming can be influenced, it has at least provided a glimpse into the “black box.” This commitment to transparency, however flawed, is reassuring when contrasted with xAI, which appears largely disdainful of ethical guardrails.
For users in South Africa and across the continent, the stakes are particularly high. Big Tech has a long history of treating Africa as an afterthought. As a vibrant local AI ecosystem begins to emerge, trust is the most critical currency. Before any AI company can expect the confidence of users here, it must first demonstrate a genuine commitment to integrity. All things considered, Elon Musk’s xAI has given his country of origin precious little reason to trust its latest chatbot, no matter what the benchmarks claim.
