We are excited to introduce our AI Benchmarking Report, where we compare the software engineering skills of several popular AI models. Over the past few years, we’ve been helping our customers embrace AI in hiring, including building an AI-assisted assessment experience. To do that, we had to start by understanding what the most cutting-edge models can and cannot do. With the launch of OpenAI’s latest model last week, now felt like the perfect time to share our findings with the public.
CodeSignal’s ranking shows how the latest models compare in solving real-world problems. Our approach goes beyond testing theoretical coding knowledge by using the same job-relevant questions that top companies rely on to screen software engineering candidates. These assessments evaluate not only general coding abilities but also edge-case thinking, providing practical insights that help inform the design of AI-co-piloted assessments.
Methodology
To create this report, we ran the most advanced Large Language Models (LLMs) through 159 variations of framework-based assessments, used by hundreds of our customers, including major tech and finance companies. These questions are designed to test general programming, refactoring, and problem-solving skills. Typically, solving these problems requires writing around 40-60 lines of code in a single file to implement a given set of requirements.
The AI models were evaluated on two key performance metrics: their average score, representing the proportion of test cases passed, and their solve rate, indicating the proportion of questions fully solved. Both metrics are measured on a scale from 0 to 1, with higher values reflecting stronger coding performance.
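To make the two metrics concrete, here is a minimal Python sketch. It is not CodeSignal's scoring code: the `QuestionResult` type and the function names are hypothetical, and it assumes the average score is the mean of each question's pass proportion (rather than pooling test cases across questions), which the report does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionResult:
    """Outcome of one question attempt (hypothetical structure)."""
    passed: int  # number of test cases the solution passed
    total: int   # number of test cases for the question


def average_score(results: List[QuestionResult]) -> float:
    """Mean of per-question pass proportions, on a 0-to-1 scale.

    Assumption: each question is weighted equally.
    """
    return sum(r.passed / r.total for r in results) / len(results)


def solve_rate(results: List[QuestionResult]) -> float:
    """Proportion of questions where every test case passed."""
    fully_solved = sum(1 for r in results if r.passed == r.total)
    return fully_solved / len(results)


# Made-up data: four questions with four test cases each.
results = [
    QuestionResult(passed=4, total=4),
    QuestionResult(passed=2, total=4),
    QuestionResult(passed=3, total=4),
    QuestionResult(passed=4, total=4),
]

print(average_score(results))  # 0.8125
print(solve_rate(results))     # 0.5
```

The example shows why the two numbers can diverge: partial credit raises the average score, while the solve rate only counts questions with every test case passing.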
Human dataset
Our benchmarks are compared to a robust human dataset of over 500,000 timed test sessions. We look at average scores and solve rates for the same question bank within those test sessions. In the charts below, you will see comparisons to human “average candidates” and human “top candidates.” For “top candidates” we focus on engineers who have scored in the top 20 percent of the overall assessment.
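As an illustration of the “top candidates” cut, the sketch below selects the top 20 percent of sessions by overall assessment score and then averages their scores on the benchmark questions. The `Session` type, the field names, and the tie handling are assumptions made for the example, and the data is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """One timed test session (hypothetical structure)."""
    overall: float         # score on the overall assessment, 0 to 1
    question_score: float  # score on the benchmark questions, 0 to 1


def top_sessions(sessions: List[Session], top_percent: int = 20) -> List[Session]:
    """Return the sessions ranked in the top `top_percent` by overall score.

    Assumption: the cutoff count is rounded up, and ties at the boundary
    are broken by input order.
    """
    ranked = sorted(sessions, key=lambda s: s.overall, reverse=True)
    count = max(1, (len(ranked) * top_percent + 99) // 100)  # integer ceiling
    return ranked[:count]


def mean_question_score(sessions: List[Session]) -> float:
    return sum(s.question_score for s in sessions) / len(sessions)


# Invented data: ten sessions in arbitrary order.
sessions = [
    Session(0.50, 0.50), Session(0.95, 1.00), Session(0.20, 0.25),
    Session(0.80, 0.75), Session(0.10, 0.25), Session(0.90, 0.75),
    Session(0.40, 0.25), Session(0.70, 0.50), Session(0.30, 0.25),
    Session(0.60, 0.50),
]

top = top_sessions(sessions)
print(len(top))                       # 2
print(mean_question_score(top))       # 0.875
print(mean_question_score(sessions))  # 0.5
```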
CodeSignal’s AI model ranking
The results of our benchmarking revealed several insights about AI model performance. Strawberry (o1-preview and o1-mini) stands out as the clear leader in both score and solve rate. However, we observed interesting variations between score and solve rate in other models. For instance, GPT-4o is particularly good at getting solutions fully correct, with all edge cases accounted for, whereas Sonnet performs slightly better on simpler coding problems. Sonnet is consistent on straightforward tasks, but it struggles to keep pace with models like GPT-4o that handle edge cases more effectively, particularly in multi-shot settings.
In the table below, “multi-shot” means that the model received feedback on how its code performed against the provided test cases and was given an opportunity to improve the solution and try again (i.e., have another shot). This is similar to how humans often improve their solutions after receiving feedback, iterating on mistakes or failed test cases to refine their approach. Later in our report we’ll compare AI 3-shot scores with those of human candidates, who are given as many shots as they need in a timed test.
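The multi-shot procedure described above can be sketched as a simple feedback loop. This is an illustration only, not CodeSignal's evaluation harness: `generate` and `run_tests` are hypothetical callables, the loop stops early once every test passes, and it reports the final attempt's score, which is an assumption the report does not confirm.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces:
#   generate(feedback) -> candidate solution as source code
#   run_tests(code)    -> (passed, total, descriptions of failed tests)
Generate = Callable[[Optional[List[str]]], str]
RunTests = Callable[[str], Tuple[int, int, List[str]]]


def run_multi_shot(generate: Generate, run_tests: RunTests,
                   max_shots: int = 3) -> Tuple[float, int]:
    """Give the model up to `max_shots` attempts at one question.

    Returns (score of the last attempt, number of shots used).
    """
    feedback: Optional[List[str]] = None  # the first shot gets no feedback
    score = 0.0
    shots_used = 0
    for shot in range(1, max_shots + 1):
        code = generate(feedback)
        passed, total, failures = run_tests(code)
        score = passed / total
        shots_used = shot
        if passed == total:
            break            # fully solved, no further shots needed
        feedback = failures  # failed test cases feed the next attempt
    return score, shots_used


# Stub "model" and test runner, for demonstration only.
def fake_generate(feedback: Optional[List[str]]) -> str:
    return "v1" if feedback is None else "v2"


def fake_run_tests(code: str) -> Tuple[int, int, List[str]]:
    if code == "v1":
        return 3, 5, ["test 4 failed", "test 5 failed"]
    return 5, 5, []


print(run_multi_shot(fake_generate, fake_run_tests, max_shots=1))  # (0.6, 1)
print(run_multi_shot(fake_generate, fake_run_tests, max_shots=3))  # (1.0, 2)
```

With the stub, a single shot leaves the question partially solved, while a second shot informed by the failed test cases solves it fully.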
Here’s a closer look at the model rankings:
Another key insight from our analysis is that performance improves significantly when moving from a 1-shot to a 3-shot setting, but levels off after five or more shots. This trend is notable for models like Sonnet and Gemini-flash, which sometimes become less reliable when given too many shots, often “going off the rails.” In contrast, models such as o1-preview show the most improvement when offered multiple shots, making them more resilient in these scenarios.
Human performance vs. AI
While most AI models outperform the average prescreened software engineering applicant, top candidates still outperform all AI models in both score and solve rate. For example, the o1-preview model, which ranked highest among AI models, failed to fully solve certain questions that 25 percent of human candidate attempts solved successfully. This shows that while AI models handle some coding tasks with impressive efficiency, human intuition, creativity, and adaptability provide an edge, particularly in more complex or less predictable problems.
This finding highlights the continued importance of human expertise in areas where AI might struggle, reinforcing the notion that close human-AI collaboration is how future software and innovation will be created.
The future: AI and human collaboration in assessments
Our benchmarking results show that while AI models like o1-preview are increasingly powerful, human engineers continue to excel in unique problem-solving areas that AI struggles to replicate. Human intuition and creativity are especially valuable when solving complex or edge-case problems where AI may fall short. This suggests that combining human and AI capabilities can lead to even greater performance in tackling difficult engineering challenges.
To help companies embrace this potential, CodeSignal offers an AI-Assisted Coding Framework, designed to evaluate how candidates use AI as a co-pilot. This framework includes carefully crafted questions that AI alone cannot fully solve, ensuring human input remains critical. With an AI assistant like Cosmo embedded directly into the evaluation environment, candidates can use AI tools to demonstrate their ability to work with an AI co-pilot to build the future.
Conclusion
We hope that insights from CodeSignal’s new AI Benchmarking Report will help guide companies seeking to integrate AI into their development workflows. By showcasing how AI models compare to each other as well as to real engineering candidates, this report provides actionable data to help businesses design more effective, AI-empowered engineering teams.
The AI-Assisted Coding Framework (AIACF) further supports this transition by enabling companies to evaluate how well candidates can collaborate with AI, ensuring that the engineers hired are not just technically skilled but also adept at leveraging AI as a co-pilot. Together, these tools offer a comprehensive approach to building the future of software engineering—where human ingenuity and AI capabilities combine to drive innovation.
The post AI vs. human engineers: Benchmarking coding skills head-to-head appeared first on CodeSignal.