Grok 3 Released – Now the World’s Smartest AI
Elon Musk on Monday announced the release of xAI’s much-anticipated Grok 3 model, and it is causing quite a stir across the AI industry. Grok 3’s results on top benchmarks are impressive and place the model as currently the world’s smartest AI.
Elon had earlier described Grok 3 as “scarily smart,” which led many to believe he was only trying to generate hype ahead of the launch.
However, results published since the release show Grok 3 surpassing every other AI model released to date. This makes Grok 3 the current smartest model, displacing OpenAI’s o3 mini, which held the spot before this release.
Benchmark Performance of Grok 3 (Non-Reasoning Model)
Grok 3 (the non-reasoning model) was evaluated on three benchmarks against other top-performing non-reasoning models: Claude 3.5 Sonnet, OpenAI’s GPT-4o, DeepSeek-V3, Google’s Gemini-2 Pro and Grok-3 mini.
The benchmarks are:
- General mathematical reasoning, using the American Invitational Mathematics Examination (AIME)
- General knowledge of science, technology, engineering, and maths (STEM), using the Graduate-Level Google-Proof Q&A benchmark (GPQA)
- Computer science and coding, using LiveCodeBench (LCB)

In the preview of the benchmark results, Grok 3 (chocolate) was very impressive. It was by far the best-performing model.
Benchmark Performance of Grok 3 (The Reasoning Model)
Reasoning models are chatbots that “think” for some time before they attempt to solve a problem. They tend to be better at problem solving because they work through a logical progression of steps.
The Grok 3 Reasoning model (with test-time compute) was compared against other reasoning models: o3 mini (high) and o1 (both from OpenAI), DeepSeek-R1, Gemini-2 Flash Reasoning (Google), and Grok-3 mini (Reasoning).

Here Grok 3 was even more impressive, performing much better than the non-reasoning model (as expected) and the other AI models on the market.
Reasoning models hold more potential for eventually achieving AGI, and they seem to be the focus of new AI model releases. These models “think” for longer and go through a step-by-step process to find a solution to a prompt, which makes their outputs more accurate.
Elon also stressed the practical usefulness of the model, rather than its ability to memorize large repositories of publicly available data and training material. The emphasis is on actually using Grok 3 in real-world products and services.
Performance on the Chatbot Arena
Grok 3 (chocolate version) is also currently the leading AI model in the Chatbot Arena. The Chatbot Arena is a platform where users can compare different AI chatbots side-by-side by having them complete the same tasks, allowing direct comparison of their capabilities.
Grok 3 on the chatbot arena ranking.
The Chatbot Arena rankings are designed to remove individual biases when comparing AI chatbots. When a user submits a question, they are paired with two random AI chatbots, and each chatbot answers separately. The user then votes for the better response without knowing which bot is which. It is a raw comparison of the LLMs themselves.
Grok 3 chocolate (non-reasoning model) achieved an Elo score of 1400+ on the platform. This is the highest score ever recorded on the platform and is likely to rise with further improvements.
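The blind pairing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Chatbot Arena’s actual code: the function names, the "A"/"B" labels and the placeholder model names are all assumptions made for the example.

```python
import random

def make_battle(models, rng):
    """Pick two distinct models at random and hide them behind neutral labels."""
    left, right = rng.sample(models, 2)
    return {"A": left, "B": right}

def record_vote(battle, winner_label, tally):
    """Reveal the identities only after the vote and log (winner, loser)."""
    loser_label = "B" if winner_label == "A" else "A"
    winner, loser = battle[winner_label], battle[loser_label]
    tally.append((winner, loser))
    return winner, loser

# Placeholder model names (hypothetical, for illustration only).
models = ["model-x", "model-y", "model-z"]
rng = random.Random(0)   # seeded so the example is repeatable
tally = []

battle = make_battle(models, rng)                 # the user only ever sees "A" and "B"
winner, loser = record_vote(battle, "A", tally)   # the user preferred answer "A"
```

Because the user sees only the labels, the vote reflects the answers rather than the brand behind them.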
(The Elo rating system is a method for calculating the relative skill levels of players in two-player games. It was originally created by Arpad Elo for chess rankings.)
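To make the rating idea concrete, here is a minimal sketch of the classic Elo update after a single head-to-head result. The K-factor of 32 and the function names are assumptions chosen for illustration; the Chatbot Arena’s own scoring may be computed differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings for A and B.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k (assumed to be 32 here) controls how far one result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two equally rated models: the winner gains exactly what the loser drops.
print(update_elo(1400, 1400, 1.0))  # (1416.0, 1384.0)
```

A win over a much lower-rated opponent moves the rating only slightly, because that result was already expected.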
See: Elon Musk: “Grok 3 Is Outperforming Anything that has been released”