AI’s Challenging New Initiative to Test Language Models
In a bold move, two leading San Francisco-based artificial intelligence organizations have called on the public to devise questions capable of genuinely testing the capabilities of large language models (LLMs) such as Google Gemini and OpenAI’s o1. Scale AI, which specializes in preparing the vast datasets used to train AI, has partnered with the Center for AI Safety (CAIS) to launch a campaign called Humanity’s Last Exam.
Participants can compete for a share of US$5,000 (£3,800) in prizes by devising questions good enough to rank among the top 50 selected for the exam. Scale AI and CAIS hope that, by drawing on a broad alliance of experts in the field, the exam will reveal how close we are to achieving "expert-level AI systems".
The initiative comes at a time when the leading LLMs are already acing many established tests in areas such as intelligence, mathematics, and law, yet it is not always clear what those results really mean. In many cases, the models may simply be regurgitating material they have already encountered in the enormous datasets on which they are trained, datasets that likely include a large portion of everything published on the internet.
Data is the backbone of this shift in computing, from traditional programming with explicit instructions to AI that learns from exposure to examples. That shift requires both high-quality training datasets and reliable ways of testing the results. Developers typically use what are known as "test datasets": data deliberately held back from training, so that a model cannot simply have memorized the answers.
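For readers curious about what "holding back" data looks like in practice, here is a minimal sketch in Python; the toy question-and-answer dataset is invented purely for illustration.

```python
import random

# Toy dataset of (question, answer) pairs; real training corpora are vastly larger.
examples = [(f"question {i}", f"answer {i}") for i in range(1000)]

# Shuffle, then hold back 20% as a test set the model never sees during training.
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))
train_set, test_set = examples[:split], examples[split:]

print(len(train_set), "training examples;", len(test_set), "held-out test examples")
```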
LLMs can already answer established exams such as bar tests, so it is only a matter of time before they master these benchmarks entirely. The AI research firm Epoch estimates that by 2028 AIs will, in effect, have read virtually everything humans have ever written. An equally daunting challenge lies ahead: how to keep evaluating AIs once that threshold has been crossed.
The internet keeps growing, with millions of new items added every day, which could ease some of these problems. But it also gives rise to a troubling phenomenon known as "model collapse": as AI-generated content floods the web and is fed back into training sets, the models trained on it perform progressively worse. To counter this risk, many developers are already collecting data from people's interactions with AI systems, which provides fresh material for ongoing training and evaluation.
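Model collapse can be illustrated with a toy numerical experiment: fit a simple statistical model to some "human" data, sample new data only from that model, refit, and repeat. Over the generations the synthetic data drifts and the variety of the original distribution is gradually lost. The sketch below, which assumes NumPy is available, is an analogy rather than a simulation of how real language models degrade.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: stand-in for human-written data, drawn from a reference distribution.
data = rng.normal(loc=0.0, scale=1.0, size=500)

for generation in range(1, 6):
    # Each new "model" is fitted only to the previous generation's output...
    mu, sigma = data.mean(), data.std()
    # ...and the next training set is sampled purely from that fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=500)
    print(f"generation {generation}: fitted spread = {sigma:.3f}")
```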
Some experts argue that AIs also need to venture into the physical world and gather experience much as humans do. That may sound far-fetched, but Tesla has been doing it for years with its self-driving cars. Wearable devices, such as Meta's popular smart glasses made with Ray-Ban, offer another route, since they can collect vast quantities of human-centered video and audio data.
Defining Intelligence in AI
Yet even if these technologies secure enough training data in the future, a harder problem remains: how do we define and measure intelligence at all, particularly artificial general intelligence (AGI), meaning an AI that matches or surpasses human intelligence?
Human IQ tests have long been criticized for capturing only a narrow slice of intelligence, which spans everything from language and mathematics to empathy. A similar problem affects the tests used on AIs: there are many well-established benchmarks for specific tasks, such as summarizing text or interpreting gestures, but each is narrow in scope.
The chess engine Stockfish, for example, is rated far above Magnus Carlsen, the best human player, on the Elo rating scale. Yet Stockfish is incapable of doing anything other than playing chess, such as understanding language, so its chess prowess should not be mistaken for broader intelligence.
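The Elo system mentioned here converts a rating gap into an expected score. The formula below is the standard one; the ratings plugged in are approximate and used only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo formula: expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Approximate, illustrative ratings: a top chess engine versus a top human player.
print(round(elo_expected_score(3500, 2830), 2))  # roughly 0.98: a near-certain engine win
```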
As AIs display broader intelligent behavior, the challenge is to devise new benchmarks for measuring their progress. One notable approach comes from French Google engineer François Chollet, who argued in 2019 that true intelligence lies in the ability to adapt and generalize knowledge to new, unseen situations. He created the "abstraction and reasoning corpus" (ARC), a collection of puzzles in the form of simple visual grids that test an AI's ability to infer and apply abstract rules.
Unlike earlier benchmarks that train AIs on vast numbers of images or other examples, ARC gives only a handful of demonstrations in advance. The AI has to work out the rule behind each puzzle rather than recall answers it has already seen.
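The sketch below shows roughly how such a task can be represented. The grids and the "mirror the grid" rule are made up for this example; real ARC tasks, available in the public dataset, use richer grids and rules.

```python
# A toy ARC-style task: grids are lists of lists of integers (each integer is a color).
# The hidden rule, invented for this example, is "mirror the grid left to right".
task = {
    "train": [  # a few demonstration input/output pairs shown to the solver
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]},
    ],
    "test": [   # the solver sees only the input and must infer the rule
        {"input": [[0, 5], [5, 0]]},
    ],
}

def mirror(grid):
    """Apply the inferred rule: reverse every row."""
    return [list(reversed(row)) for row in grid]

print(mirror(task["test"][0]["input"]))  # [[5, 0], [0, 5]]
```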
These puzzles are fairly easy for humans, but a prize of US$600,000 awaits the first AI system to reach a score of 85%. We are a long way from that point: two recent leading LLMs, OpenAI's o1-preview and Anthropic's Claude 3.5 Sonnet, each score just 21% on the public ARC leaderboard (ARC-AGI-Pub).
Another recent attempt, using OpenAI's GPT-4o, scored 50%, but somewhat controversially: the approach generated a huge number of candidate solutions before picking the one that best fit the puzzle. Even so, that result remains well short of the 85% needed to win the prize, let alone the scores of 90% or more that humans achieve.
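The general pattern behind that contested result, generating many candidate answers and keeping the best one, can be sketched as follows. Here generate_candidate and score_candidate are hypothetical placeholders standing in for a language model and a puzzle checker, not real APIs.

```python
import random
from typing import Callable, List

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Generate n candidate answers and return the one with the highest score."""
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: a real system would call a language model and a grader.
def generate_candidate() -> str:
    return f"candidate-{random.randint(0, 999)}"

def score_candidate(candidate: str) -> float:
    return float(candidate.split("-")[1])  # toy score: the higher the number, the better

print(best_of_n(generate_candidate, score_candidate, n=1000))
```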
While ARC remains one of the most credible attempts to test for genuine intelligence in AI today, the Scale/CAIS initiative shows that the search for convincing measures continues. Crucially, the winning questions will not be published on the internet, ensuring the AIs cannot get an advance look at the exam paper.
Understanding when machines are nearing human-level reasoning leads us directly to crucial ethical and moral questions. If we reach that point, we will face an even tougher question: how do we evaluate superintelligence? This conundrum truly pushes the boundaries of our understanding.