Emerging trends in LLM benchmarking

Introduction

A benchmark is a well-defined set of tasks, and benchmarking is the process of evaluating a technology against those tasks. This serves two important purposes. First, the people building the technology can tell whether each iteration makes it better or worse. Second, it enables comparison across models to determine which one is best.
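To make the process concrete, a benchmark run usually boils down to looping over a fixed task set, collecting the model's outputs, and scoring them against references. Below is a minimal sketch in Python; the generate() function, the exact-match scorer, and the example tasks are purely illustrative placeholders, not a real harness.

# Minimal benchmark-harness sketch: run a fixed task set through a model
# and report an aggregate score.

def generate(prompt: str) -> str:
    # Placeholder: swap in a real model or API call here.
    return "42"

def exact_match(prediction: str, reference: str) -> float:
    # Simplest possible scorer: 1.0 if the normalized strings match, else 0.0.
    return float(prediction.strip().lower() == reference.strip().lower())

def run_benchmark(tasks: list[dict]) -> float:
    # The benchmark score is just the mean per-task score.
    scores = [exact_match(generate(t["prompt"]), t["reference"]) for t in tasks]
    return sum(scores) / len(scores)

tasks = [
    {"prompt": "What is 6 x 7?", "reference": "42"},
    {"prompt": "What is the capital of France?", "reference": "Paris"},
]
print(run_benchmark(tasks))  # 0.5 with the dummy generate() above

Real harnesses differ mainly in the scorer (unit tests, multiple-choice accuracy, judged comparisons), but the loop itself stays this simple.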

It is worth noting that developing benchmarks is both an art and a science. Technologies sometimes evolve to beat a benchmark yet fail on real-world tests. This is why a good benchmark should be hard to game with narrow tricks and should instead reflect the reality in which the technology is supposed to operate.

LLM benchmarking is hard. LLMs can perform many open-ended tasks: they can write code and prose, solve logical puzzles, and much more. This makes it difficult to come up with good benchmarks. A model might also be good at one type of activity (e.g. writing poetry) and bad at another (e.g. writing code).

On the other hand, when it comes to commercializing models, companies generally aim to use a model for one specific purpose, so benchmarks become important for making these models commercially useful.

Also, the evolving abilities of LLMs mean they are increasingly being used for tasks that previously required humans. This creates an opportunity to push the boundaries of what is possible with LLMs, but it also makes measuring their performance on such open-ended tasks incredibly hard.

Holistic benchmarks

Holistic benchmarks evaluate a model against a wide variety of tasks, and even combinations of those tasks chained together. The purpose of such an evaluation is essentially to measure an LLM against average human intelligence.

Large language models are trained on massive amounts of text. The expectation is that the model will learn many patterns from this text and develop human-like abilities. While this is not general intelligence, these benchmarks measure the “general ability” of such models.

BIG-bench, HELM, MT-Bench and others are examples of such benchmarks. They have helped us answer which models in the industry are currently the best and who is pushing the boundaries of what's possible with LLMs.

Since models evolve rapidly, benchmarks have to be updated regularly as well. There is also a class of benchmarks, such as LiveBench, that is dynamic in nature and relies on constantly refreshed data.

Specialized benchmarks

While holistic benchmarks tell us which company has figured out how to train a model well, customers deploying these models often care only about their own use cases. For example, a customer who wants to generate code may only be interested in the model that is best at code generation and may not care if that model is bad at producing poetry.

GSM8K and MATH are benchmarks aimed specifically at mathematical reasoning.

HumanEval and Spider (complex, cross-domain semantic parsing and text-to-SQL) focus on code generation; a sketch of the pass@k metric used by benchmarks like HumanEval appears after this list.

CommonsenseQA, HellaSwag, WinoGrande, and the AI2 Reasoning Challenge (ARC) are designed to measure reasoning abilities and common sense.

MMLU covers broad general knowledge, while domain-specific sets such as the LSAT (Law School Admission Test) dataset and AlphaFin target legal and financial knowledge.
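Code benchmarks such as HumanEval are typically reported with the pass@k metric: sample n completions per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples would pass. Below is a minimal sketch of the standard unbiased estimator; the example numbers are made up.

from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k: the probability that at least one of k
    # samples passes, given that c out of n generated samples passed the tests.
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 completions were sampled for a problem and 37 passed its tests.
print(round(pass_at_k(n=200, c=37, k=10), 3))

Other specialized benchmarks use different scorers (exact match for GSM8K, multiple-choice accuracy for ARC or MMLU), but the idea of a narrow, automatically computed score is the same.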

One might wonder why we need both holistic and specialized benchmarks. There is an important purpose for each. When a model is trained, it is not obvious how training on a particular corpus will affect the model's various abilities. For example, training the model on computer code might change its capacity for mathematical reasoning. Model developers therefore have to make sure that their incremental updates do not break other abilities. Holistic benchmarks help ensure that the model stays healthy and is not biased towards a single skill.

Specialized benchmarks, on the other hand, let researchers focus on a specific skill and check how their different strategies work out.

A good model development process will always use both kinds of benchmarks to measure the health of the model.
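In practice this often takes the form of a simple regression gate in the training pipeline: run both the holistic and the specialized suites on every candidate checkpoint and flag any suite whose score drops noticeably versus the current baseline. A minimal sketch, with made-up suite names and scores:

# Toy regression gate: compare a candidate checkpoint's benchmark scores
# against the current baseline across both holistic and specialized suites.
# Suite names and scores below are purely illustrative.

baseline  = {"holistic_avg": 0.71, "math": 0.55, "code": 0.48, "commonsense": 0.80}
candidate = {"holistic_avg": 0.72, "math": 0.58, "code": 0.44, "commonsense": 0.80}

TOLERANCE = 0.02  # allow a little noise before flagging a regression

def regressions(baseline: dict, candidate: dict, tol: float) -> list[str]:
    # Any suite where the candidate falls more than `tol` below the baseline.
    return [suite for suite, base in baseline.items()
            if candidate.get(suite, 0.0) < base - tol]

print(regressions(baseline, candidate, TOLERANCE))  # ['code'] -> investigate before shipping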

Human feedback based benchmarks

A new trend emerging in LLM benchmarks is human-in-the-loop evaluation, in which the output of an LLM is judged by a real human. This makes the process subjective and non-deterministic, but it is also an extremely powerful way to measure an LLM's effectiveness on more complex tasks.

This sort of evaluation can detect ambiguity, abuse, hate speech, ethical problems and many other things that are hard for automated tests to catch.

You can have a look at some such evaluations here https://console.chaiverse.com/
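Platforms that collect such human judgments usually aggregate them into a ranking rather than a single accuracy number, often by feeding pairwise votes into an Elo-style rating. The sketch below assumes a simple list of (winner, loser) votes with hypothetical model names; it is an illustration of the general technique, not any particular platform's method.

# Elo-style aggregation of pairwise human preferences between models.
# Model names and votes are illustrative; each vote is (winner, loser).

K = 32  # rating update step size

def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = K) -> None:
    e_winner = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_winner)
    ratings[loser]  -= k * (1.0 - e_winner)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda item: -item[1]))  # highest-rated model first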

Conclusion

The three emerging trends in LLM benchmarking appear to be holistic benchmarks, specialized benchmarks, and increasingly popular human-in-the-loop benchmarks. This is still an evolving field, as most models are highly competitive and evolving rapidly. As new techniques and new training methods are invented, we will see even more benchmarks and more diversity in what they measure.