AI | August 16, 2024

What is MMLU



MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
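
To make the evaluation setup concrete, here is a minimal sketch of how a few-shot MMLU prompt can be assembled and scored. The item structure and the `ask_model` function are assumptions standing in for the real dataset fields and whatever model API is being evaluated.

```python
# Minimal sketch of few-shot MMLU prompting and scoring.
# `ask_model` is a hypothetical stand-in for the model under evaluation.
CHOICES = ["A", "B", "C", "D"]

def format_question(q: dict) -> str:
    """Render one item as question + lettered options + 'Answer:'."""
    lines = [q["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, q["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples: list[dict], test_item: dict, subject: str) -> str:
    """Few-shot prompt: subject header, up to five solved examples, then the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "".join(
        format_question(ex) + f" {CHOICES[ex['answer']]}\n\n" for ex in dev_examples[:5]
    )
    return header + shots + format_question(test_item)

def accuracy(items, dev_examples, subject, ask_model) -> float:
    """Exact match on the predicted answer letter."""
    correct = 0
    for item in items:
        prediction = ask_model(build_prompt(dev_examples, item, subject)).strip()[:1]
        correct += prediction == CHOICES[item["answer"]]
    return correct / len(items)
```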

TIGER-Lab Introduces MMLU-Pro Dataset for Comprehensive Benchmarking of Large Language Models’ Capabilities and Performance


The evaluation of artificial intelligence models, particularly large language models (LLMs), is a rapidly evolving research field. Researchers are focused on developing more rigorous benchmarks to assess the capabilities of these models across a wide range of complex tasks. This field is essential for advancing AI technology as it provides insights into the strengths and weaknesses of various AI systems. By understanding these aspects, researchers can make informed decisions on improving and refining these models.

One significant problem in evaluating LLMs is the inadequacy of existing benchmarks in fully capturing the models’ capabilities. Traditional benchmarks, like the original Massive Multitask Language Understanding (MMLU) dataset, often fail to provide a comprehensive assessment. These benchmarks typically include limited answer options and focus predominantly on knowledge-based questions that do not require extensive reasoning. Consequently, they fail to reflect the true problem-solving and reasoning skills of LLMs accurately. This gap underscores the need for more challenging and inclusive datasets that can better evaluate the diverse capabilities of these advanced AI systems.

Current methods for evaluating LLMs, such as the original MMLU dataset, provide some insights but have notable limitations. The original MMLU dataset includes only four answer options per question, which limits the complexity and reduces the challenge for the models. The questions are mostly knowledge-driven, so they do not require deep reasoning abilities crucial for comprehensive AI evaluation. These constraints result in an incomplete understanding of the models’ performance, highlighting the necessity for improved evaluation tools.

Researchers from TIGER-Lab have introduced the MMLU-Pro dataset to address these limitations. This new dataset is designed to provide a more rigorous and comprehensive benchmark for evaluating LLMs. MMLU-Pro significantly increases the number of answer options from four to ten per question, enhancing the evaluation’s complexity and realism. Including more reasoning-focused questions addresses the shortcomings of the original MMLU dataset. This effort involves leading AI research labs and academic institutions, aiming to set a new standard in AI evaluation.

The construction of the MMLU-Pro dataset involved a meticulous process to ensure its robustness and effectiveness. Researchers began by filtering the original MMLU dataset to retain only the most challenging and relevant questions. They then augmented the number of answer options per question from four to ten using GPT-4, a state-of-the-art AI model. This augmentation process was not merely about adding more options; it involved generating plausible distractors that require discriminative reasoning to navigate. The dataset sources questions from high-quality STEM websites, theorem-based QA datasets, and college-level science exams. Each question underwent rigorous review by a panel of over ten experts to ensure accuracy, fairness, and complexity, making the MMLU-Pro a robust tool for benchmarking.
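
A rough sketch of that augmentation step is shown below. The `call_llm` helper and the prompt wording are hypothetical rather than the authors' actual pipeline; the point is simply that each four-option question is padded to ten options with model-generated distractors and then reshuffled.

```python
# Illustrative sketch of augmenting a 4-option question to 10 options.
# `call_llm` is a hypothetical LLM wrapper; the prompt is not the authors' exact prompt.
import random

def augment_options(question: str, options: list[str], answer: str, call_llm) -> list[str]:
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Existing options: {options}\n"
        f"Write {10 - len(options)} additional incorrect but plausible options, one per line."
    )
    distractors = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    augmented = options + distractors[: 10 - len(options)]
    random.shuffle(augmented)  # reshuffle so the correct answer's position is not fixed
    return augmented
```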

The MMLU-Pro dataset employs ten answer options per question, reducing the likelihood of random guessing and significantly increasing the evaluation’s complexity. By incorporating more college-level problems across various disciplines, MMLU-Pro ensures a robust and comprehensive benchmark. The dataset is less sensitive to different prompts, enhancing its reliability. While 57% of the questions are sourced from the original MMLU, they have been meticulously filtered for higher difficulty and relevance. Each question and its options have undergone rigorous review by over ten experts, aiming to minimize errors. Without chain-of-thought (CoT) prompting, the top-performing model, GPT-4o, achieves only a 53% score.

The performance of various AI models on the MMLU-Pro dataset was evaluated, revealing significant differences compared to the original MMLU scores. For example, GPT-4’s accuracy on MMLU-Pro was 71.49%, a notable decrease from its original MMLU score of 88.7%. This 17.21% drop highlights the increased difficulty and robustness of the new dataset. Other models, such as GPT-4-Turbo-0409, dropped from 86.4% to 62.58%, and Claude-3-Sonnet’s performance decreased from 81.5% to 57.93%. These results underscore the challenging nature of MMLU-Pro, which demands deeper reasoning and problem-solving skills from the models.

In conclusion, the MMLU-Pro dataset marks a pivotal advancement in AI evaluation, offering a rigorous benchmark that challenges LLMs with complex, reasoning-focused questions. By increasing the number of answer options and incorporating diverse problem sets, MMLU-Pro provides a more accurate measure of AI capabilities. The notable performance drops observed in models like GPT-4 underscore the dataset’s effectiveness in highlighting areas for improvement. This comprehensive evaluation tool is essential for driving future AI advancements, enabling researchers to refine and enhance the performance of LLMs.

Introducing our new benchmark MMLU-Pro, a more robust and challenging massive multi-task language understanding benchmark with 12K questions.

What’s New?

  1. MMLU-Pro uses 10 options instead of… pic.twitter.com/pWCgzEmxBP — Wenhu Chen (@WenhuChen) May 15, 2024


Sometimes the best way to solve a complex problem is to take a page from a children’s book. That’s the lesson Microsoft researchers learned by figuring out how to pack more punch into a much smaller package.

Last year, after spending his workday thinking through potential solutions to machine learning riddles, Microsoft’s Ronen Eldan was reading bedtime stories to his daughter when he thought to himself, “how did she learn this word? How does she know how to connect these words?”

That led the Microsoft Research machine learning expert to wonder how much an AI model could learn using only words a 4-year-old could understand – and ultimately to an innovative training approach that’s produced a new class of more capable small language models that promises to make AI more accessible to more people.

Large language models (LLMs) have created exciting new opportunities to be more productive and creative using AI. But their size means they can require significant computing resources to operate.

While those models will still be the gold standard for solving many types of complex tasks, Microsoft has been developing a series of small language models (SLMs) that offer many of the same capabilities found in LLMs but are smaller in size and are trained on smaller amounts of data.

The company announced today the Phi-3 family of open models, the most capable and cost-effective small language models available. Phi-3 models outperform models of the same size and next size up across a variety of benchmarks that evaluate language, coding and math capabilities, thanks to training innovations developed by Microsoft researchers.

Microsoft is now making the first in that family of more powerful small language models publicly available: Phi-3-mini, measuring 3.8 billion parameters, which performs better than models twice its size, the company said.

Starting today, it will be available in the Microsoft Azure AI Model Catalog and on Hugging Face, a platform for machine learning models, as well as Ollama, a lightweight framework for running models on a local machine. It will also be available as an NVIDIA NIM microservice with a standard API interface that can be deployed anywhere.
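
As a rough illustration, loading the Phi-3-mini checkpoint through Hugging Face transformers might look like the sketch below; the model id, dtype, and generation settings are assumptions, and some transformers versions may additionally require trust_remote_code=True.

```python
# Sketch of running Phi-3-mini locally with Hugging Face transformers.
# Model id and settings are assumptions; adjust for your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "In two sentences, why do small language models matter?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```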

Microsoft also announced additional models to the Phi-3 family are coming soon to offer more choice across quality and cost. Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) will be available in the Azure AI Model Catalog and other model gardens shortly.

Graphic illustrating how the quality of new Phi-3 models, as measured by performance on the Massive Multitask Language Understanding (MMLU) benchmark, compares to other models of similar size. (Image courtesy of Microsoft)

Small language models are designed to perform well on simpler tasks; they are more accessible and easier to use for organizations with limited resources, and they can be more easily fine-tuned to meet specific needs.

“What we’re going to start to see is not a shift from large to small, but a shift from a singular category of models to a portfolio of models where customers get the ability to make a decision on what is the best model for their scenario,” said Sonali Yadav, principal product manager for Generative AI at Microsoft.

“Some customers may only need small models, some will need big models and many are going to want to combine both in a variety of ways,” said Luis Vargas, vice president of AI at Microsoft.

Choosing the right language model depends on an organization’s specific needs, the complexity of the task, and the available resources. Small language models are well suited for organizations looking to build applications that can run locally on a device (as opposed to in the cloud), and for tasks that don’t require extensive reasoning or that need a quick response.


Are Language Models Benchmark Savants or Real-World Problem Solvers?

AI students taking an exam in a classroom. Image created by author and DALL-E 3.

In the realm of education, the best exams are those that challenge students to apply what they’ve learned in new and unpredictable ways, moving beyond memorizing facts to demonstrate true understanding. Our evaluations of language models should follow the same pattern. As we see new models flood the AI space every day, whether from giants like OpenAI and Anthropic or from smaller research teams and universities, it’s critical that our model evaluations dive deeper than performance on standard benchmarks. Emerging research suggests that the benchmarks we’ve relied on to gauge model capability are not as reliable as we once thought. For us to champion new models appropriately, our benchmarks must evolve to be as dynamic and complex as the real-world challenges we’re asking these models and emerging AI agent architectures to solve.

In this article we will explore the complexity of language model evaluation by answering the following questions:

  1. How are language models evaluated today?
  2. How reliable are language models that excel on benchmarks?
  3. Can language models and AI agents translate knowledge into action?
  4. Why should language models (or foundation models) master more than text?

So, how are language models evaluated today?

Today, most models, whether Large Language Models (LLMs) or Small Language Models (SLMs), are evaluated on a common set of benchmarks including the Massive Multitask Language Understanding (MMLU), Grade School Math (GSM8K), and Big-Bench Hard (BBH) datasets, among others.

To provide a deeper understanding of the types of tasks each benchmark evaluates, here are some sample questions from each dataset:

MMLU: Designed to measure information the model learned during pre-training across a variety of STEM and humanities subjects, with difficulty levels ranging from elementary to advanced professional understanding, using multiple-choice questions.

Example college medicine question in MMLU: “In a genetic test of a newborn, a rare genetic disorder is found that has X-linked recessive transmission. Which of the following statements is likely true regarding the pedigree of the disorder? A. All descendants on the maternal side will have the disorder B. Females will be approximately twice as affected as males in their family. C. All daughters of an affected male will be affected. D. There will be equal distribution of males and females affected.” (Correct answer is C) [2]

GSM8K: Language models typically struggle to solve math questions; the GSM8K dataset evaluates a model’s ability to reason about and solve math problems using 8.5K diverse grade school math problems.

Example: “Dean’s mother gave him $28 to go to the grocery store. Dean bought 6 toy cars and 5 teddy bears. Each toy car cost $12 and each teddy bear cost $1. His mother then feels generous and decides to give him an extra $10. How much money does Dean have left?” [3]

BBH: This dataset consists of 23 tasks from the Big Bench dataset which language models have traditionally struggled to solve. These tasks generally require multi-step reasoning to complete successfully.

Example: “If you follow these instructions, do you return to the starting point? Turn left. Turn right. Take 5 steps. Take 4 steps. Turn around. Take 9 steps. Options: — Yes — No” [4]
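
For benchmarks like GSM8K, scoring usually comes down to extracting a final numeric answer and comparing it to the reference. Here is a minimal sketch, assuming the standard “#### answer” convention in GSM8K reference solutions and a last-number heuristic for the model’s output.

```python
# Sketch of exact-match scoring for GSM8K-style word problems.
# GSM8K reference solutions end with a line like "#### 42"; for model output
# we take the last number in the generated text (a common, imperfect heuristic).
import re

def extract_gold(reference: str) -> str:
    return reference.split("####")[-1].strip().replace(",", "")

def extract_prediction(generation: str) -> str:
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    return numbers[-1].replace(",", "") if numbers else ""

def exact_match(generations: list[str], references: list[str]) -> float:
    hits = sum(extract_prediction(g) == extract_gold(r) for g, r in zip(generations, references))
    return hits / len(references)
```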

Anthropic’s recent announcement of Claude-3 shows their Opus model surpassing GPT-4 as the leading model on a majority of the common benchmarks. For example, Claude-3 Opus performed at 86.8% on MMLU, narrowly surpassing GPT-4 which scored 86.4%. Claude-3 Opus also scored 95% on GSM8K and 86.8% on BBH compared to GPT-4’s 92% and 83.1% respectively [1].

While the performance of models like GPT-4 and Claude on these benchmarks is impressive, these tasks are not always representative of the types of challenges businesses want to solve. Additionally, there is a growing body of research suggesting that models are memorizing benchmark questions rather than understanding them. This does not necessarily mean the models can’t generalize to new tasks; we see LLMs and SLMs perform amazing feats every day. But it does mean we should reconsider how we’re evaluating, scoring, and promoting models.

How reliable are language models that excel on benchmarks?

Research from Microsoft, the Institute of Automation (CAS), and the University of Science and Technology, China demonstrates that when various language models are asked rephrased or modified benchmark questions, they perform significantly worse than when asked the same questions with no modification. In their paper, DyVal 2, the researchers took questions from benchmarks like MMLU and modified them by rephrasing the question, adding an extra answer option, rephrasing the answers, permuting the answers, or adding extra content to the question. When comparing model performance on the “vanilla” dataset to performance on the modified questions, they saw a clear decrease; for example, GPT-4 scored 84.4 on the vanilla MMLU questions and 68.86 on the modified MMLU questions [5].

Source: DyVal2, Model Performance on Vanilla Benchmarks Compared to Probing Benchmark
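
A simplified version of one of those probes, permuting the answer options so the correct letter moves, might look like the sketch below; the item format is an assumption, and the actual DyVal 2 pipeline is considerably more involved.

```python
# Simplified sketch of one DyVal 2-style probe: shuffle the options of a
# multiple-choice item so the correct letter changes, then re-ask the model.
# The item structure {"question", "options", "answer"} is an assumption.
import random

def permute_options(item: dict, seed: int = 0) -> dict:
    rng = random.Random(seed)
    correct_text = item["options"][item["answer"]]
    shuffled = item["options"][:]
    rng.shuffle(shuffled)
    return {
        "question": item["question"],
        "options": shuffled,
        "answer": shuffled.index(correct_text),  # new index of the correct option
    }
```

Re-scoring a model on the permuted items and comparing against its vanilla accuracy gives a rough version of the robustness gap the paper reports.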

Similarly, research from the Department of Computer Science at the University of Arizona indicates that there is a significant amount of data contamination in language models [6]. This means that information from the benchmarks is becoming part of the models’ training data, effectively undermining the benchmark scores because the models are being tested on material they were trained on.
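
A common, if coarse, way to probe for this kind of contamination is n-gram overlap between benchmark items and training documents. The sketch below illustrates the idea only; it is not the method used in the cited paper.

```python
# Coarse n-gram overlap check between a benchmark question and training documents.
# Illustrative only; the cited contamination studies use more careful methods.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)
```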

Additional research from Fudan University, Tongji University, and Alibaba highlights the need for self-evolving dynamic evaluations for AI agents to combat the issues of data contamination and benchmark memorization [7]. These dynamic benchmarks will help prevent models from memorizing or learning information during pre-training that they’d later be tested on. Although a recurring influx of new benchmarks may create challenges when comparing an older model to a newer model, ideally these benchmarks will mitigate issues of data contamination and make it easier to gauge how well a model understands topics from training.

When evaluating model capability for a particular problem, we need to grasp both how well the model understands information learned during pretraining and how well it can generalize to novel tasks or concepts beyond its training data.

Can language models and AI agents translate knowledge into action?

As we look to use models as AI agents to perform actions on our behalf, whether that’s booking a vacation, writing a report, or researching new topics for us, we’ll need additional benchmarks or evaluation mechanisms that can assess the reliability and accuracy of these agents. Most businesses looking to harness the power of foundation models need to give the model access to a variety of tools integrated with their unique data sources, and they need the model to reason and plan when and how to use those tools effectively. These types of tasks are not represented in many traditional LLM benchmarks.

Source: AgentVerse, results from team of agents compared to single agent on software development task involving tool calling and code execution

To address this gap, many research teams are creating their own benchmarks and frameworks that evaluate agent performance on tasks involving tool use and knowledge outside of the model’s training data. For example, the authors of AgentVerse evaluated how well teams of agents could perform real-world tasks involving event planning, software development, and consulting. The researchers created their own set of 10 test tasks which were manually evaluated to determine if the agents performed the right set of actions, used the proper tools, and got to an accurate result. They found that teams of agents that operated in a cycle with defined stages for agent recruitment, task planning, independent task execution, and subsequent evaluation led to superior outcomes compared to independent agents [8].
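
In the absence of a standard harness, such a manual check (did the agent call the expected tools, and does its final answer pass a task-specific test?) can be approximated with something like the sketch below; the trace and result structures are assumptions rather than any standard format.

```python
# Tiny sketch of checking an agent run: were the expected tools called,
# and does the final answer pass a task-specific check?
# The trace format here is an assumption, not a standard one.
from typing import Callable

def evaluate_agent_run(trace: list[dict], expected_tools: set[str],
                       answer_check: Callable[[str], bool]) -> dict:
    tools_called = {step["tool"] for step in trace if step.get("tool")}
    final_answer = trace[-1].get("output", "") if trace else ""
    return {
        "used_expected_tools": expected_tools.issubset(tools_called),
        "answer_correct": answer_check(final_answer),
    }
```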

Beyond single modalities and into the real world. Why should language models (or foundation models) master more than text?

In my opinion, the emerging agent architectures and benchmarks are a great step toward understanding how well language models will perform on business-oriented problems, but one limitation is that most are still text-focused. As we consider the world and the dynamic nature of most jobs, we will need agent systems and evaluations that assess performance on text-based tasks as well as visual and auditory tasks together. The AlgoPuzzleVQA dataset is one example of evaluating models on their ability to reason about, read, and visually interpret mathematical and algorithmic puzzles [9].

Source: Are Language Models Puzzle Prodigies? Example questions from AlgoPuzzleVQA dataset

While businesses may not be interested in how well a model can solve a puzzle, it is still a step in the right direction for understanding how well models can reason about multimodal information.

Conclusion

As we continue adopting foundation models in our daily routines and professional endeavors, we need additional evaluation options that mirror real world problems. Dynamic and multimodal benchmarks are one key component of this. However, as we introduce additional agent frameworks and architectures with many AI agents collaborating to solve a problem, evaluation and comparison across models and frameworks becomes even more challenging. The true measure of foundation models lies not in their ability to conquer standardized tests, but in their capacity to understand, adapt, and act within the complex and often unpredictable real world. By changing how we evaluate language models, we challenge these models to evolve from text-based intellects and benchmark savants to comprehensive thinkers capable of tackling multifaceted (and multimodal) challenges.

Interested in discussing further or collaborating? Reach out on LinkedIn!


Models are benchmarked based on their capabilities, such as coding, common sense and reasoning. Other capabilities encompass natural language processing, including machine translation, question answering and text summarization.

LLM benchmarks play a crucial role in developing and enhancing models. Benchmarks showcase the progress of an LLM as it learns, with quantitative measures that highlight where the model excels and its areas for improvement. This in turn guides the fine-tuning process, which helps LLM researchers and developers advance the field. LLM benchmarks also provide an objective comparison of different models, helping inform software developers and organizations as they choose which models better suit their needs.


Ever since the UAE’s TII launched Falcon, the Hugging Face Open LLM Leaderboard has been trending for both right and wrong reasons. The model came out as the champion of open source on various evaluation metrics. Interestingly, there has been no paper for the model yet. It is possible that the researchers used some other metric or dataset to evaluate the model.

Hugging Face’s founders, including Thomas Wolf, who made a lot of noise about Falcon reaching the top of the leaderboard, stumbled upon a problem with the evaluation metrics of the recent models. According to the Open LLM Leaderboard, the Massive Multitask Language Understanding (MMLU) benchmark score for Meta AI’s LLaMa was significantly lower than the score published in the model’s paper.

Many people questioned this. First, Andrej Karpathy raised concerns about the leaderboard and the promotion of Falcon over LLaMa. Yao Fu of the Allen Institute later showed that, with no fancy prompting or decoding, LLaMa performed better than Falcon on the MMLU evaluation.

MMLU-Pro is an enhanced version of the MMLU dataset, featuring ten answer choices instead of four and requiring more reasoning on questions. It has been expertly reviewed to reduce noise, making it a higher-quality and more challenging benchmark.

GPQA (Google-Proof Q&A Benchmark) is a highly difficult knowledge dataset designed by domain experts to be challenging for laypersons but manageable for experts. The dataset is access-restricted to minimize contamination and ensure accurate evaluation of models’ knowledge and reasoning abilities.

MuSR (Multistep Soft Reasoning) is a dataset of complex, algorithmically generated problems around 1,000 words long, including murder mysteries and team allocation optimizations. Solving these problems requires advanced reasoning and long-range context parsing, with most models performing no better than random.

MATH (Mathematics Aptitude Test of Heuristics) is a compilation of high-school-level competition math problems, formatted consistently with LaTeX for equations and Asymptote for figures. The benchmark focuses on the hardest problems and tests models’ mathematical reasoning and problem-solving skills.

IFEval (Instruction Following Evaluation) tests models’ abilities to follow explicit instructions accurately, such as adhering to specific formatting or keyword inclusion. The evaluation emphasizes precision in following instructions rather than content quality.
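
Because IFEval instructions are verifiable, they can be checked programmatically. Here is a small sketch of two such checks; the actual IFEval instruction set and wording differ.

```python
# Sketch of programmatically verifiable instruction checks in the spirit of IFEval.
# These are illustrative checks, not the benchmark's actual implementation.
def includes_keyword(response: str, keyword: str) -> bool:
    """Did the response mention a required keyword?"""
    return keyword.lower() in response.lower()

def within_word_limit(response: str, max_words: int) -> bool:
    """Did the response respect a maximum word count?"""
    return len(response.split()) <= max_words

response = "Here is a short summary that mentions benchmarks exactly once."
print(includes_keyword(response, "benchmarks"), within_word_limit(response, 15))
```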

BBH (Big Bench Hard) is a subset of 23 challenging tasks from the BigBench dataset, chosen for their objective metrics, difficulty, and sufficient sample sizes for statistical significance. The tasks include multistep arithmetic, algorithmic reasoning, language understanding, and world knowledge, correlating well with human preference.

