I don’t trust LLM leaderboards. Well, not entirely, anyway. While they are the most common way to quickly assess new large language models, I’ve found them to be insufficient for a couple of reasons, so at the very least I take their rankings with a grain of salt:
- I’m fairly certain that there is an overlap between the data used to train these models and what’s in various eval sets. When you consider how much importance we’ve placed on data, it’s not surprising that whatever data is available has been hoovered up and put into a training set somewhere.
- LLMs are finicky and their performance can vary widely between tasks. Whether a particular release is right for your needs becomes a matter of experimentation, and a personal set of evals makes this process way faster than asking someone on the team to run your test suite.
For these reasons (and because it’s way too interesting to see how new models break), one of the things I’ve actually found useful is having a personal set of evaluations that lets me quickly assess new releases. I combine the results of my personal evals with leaderboard data to form an opinion on the effectiveness of any new LLM release.
Below, I list the kinds of tasks that are helpful to have in your eval set. I originally wanted to also share my current eval set, but that turned out to make this article way too long. So, for the time being, this article just touches on the categories of evals I like to run and what I look for when building them. Later, I’ll expand on each category with a specific example and show how different LLMs perform.
What kind of evals do you need?
The exact list depends on your particular use cases, but I find the following tasks to be effective in understanding the capabilities of new releases:
Reasoning
I’ve always had a soft spot for reasoning. Some of my earliest work in AI was based around knowledge representation and reasoning, so I’m thrilled to see it make a comeback as an important AI task and impressed to see LLMs actually doing reasoning to some degree. When evaluating reasoning abilities, it’s best to have a problem where the interaction between key factors (along with a smattering of unrelated factors) is essential to identifying the correct solution. If any one factor by itself can determine the answer, most LLMs will very likely solve it. Second, make sure you give concrete numbers for the LLM to play with, so you can see whether it correctly identifies the interactions and then actually uses them when working out the solution.
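To make that concrete, here is the shape of one such eval as I might store it. The scenario, the numbers, and the expected answer below are all invented for this sketch, and the pass check is deliberately loose.

```python
import re

# A minimal sketch of a reasoning eval entry. The scenario, numbers, and
# expected answer are invented for illustration; swap in your own.
REASONING_EVAL = {
    "prompt": (
        "A warehouse needs to ship 120 orders today. Each packer handles 15 "
        "orders per 6-hour shift, overtime is capped at 2 hours per packer per "
        "day, and aisle 4 is being repainted this week. If two of the eight "
        "packers are out sick, how many packer-hours of overtime are needed "
        "to ship everything?"
    ),
    # The answer requires combining several factors: 6 packers x 15 = 90 orders
    # covered, leaving 30 orders at 15/6 = 2.5 orders per packer-hour,
    # i.e. 12 packer-hours, which just fits under the 6 x 2 = 12 hour cap.
    "expected": "12",
}

def passes(model_answer: str, expected: str) -> bool:
    """Loose check: the expected number appears as a whole token in the final line."""
    final_line = model_answer.strip().splitlines()[-1]
    return re.search(rf"\b{re.escape(expected)}\b", final_line) is not None
```

The unrelated detail (the repainting) is there on purpose: a model that folds it into the arithmetic fails the eval just as clearly as one that misses the interaction between the shift length and the overtime cap.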
SQL
Executives can always have an analyst pull up data for them, but sometimes you want answers faster than you can say “Please run these analytics by EOD.” That means writing those queries yourself, and LLMs are perfectly poised to help you do this. So, think about the questions you usually like to get answered, add in the specifics of which database you are using and its schema, and you should be able to come up with a few questions that quickly tell you whether the LLM’s ability to write SQL queries is good enough and an improvement over other LLMs.
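For illustration, here is roughly how one of these SQL evals could be structured. The schema, question, and reference query below are made up for this sketch, and the model’s query is checked by running it against a small seeded SQLite database rather than by comparing text.

```python
import sqlite3

# Hypothetical schema and question; replace with your own database flavor,
# tables, and the questions you actually tend to ask.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     total REAL, created_at TEXT);
"""

QUESTION = (
    "Using the schema above (SQLite), write a query that returns total revenue "
    "per region for 2024, highest revenue first."
)

# Reference answer to compare the model's query against.
REFERENCE_SQL = """
SELECT c.region, SUM(o.total) AS revenue
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.created_at BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY c.region
ORDER BY revenue DESC;
"""

def run_query(sql: str, seed_sql: str) -> list[tuple]:
    """Execute a query against an in-memory copy of the schema plus seed data."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA + seed_sql)
    return con.execute(sql).fetchall()

# A model's query passes if it returns the same rows as the reference on the
# seeded data: run_query(model_sql, SEED) == run_query(REFERENCE_SQL, SEED)
```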
Image (and now Video) generation
Image generation was one of the original uses for some of these models, and they have come a long way since the early days of techniques like StyleGAN. But I don’t think LVMs (Large Vision Models) are quite ready to solve the kinds of problems I have, which involve taking an idea that seems clear in my head but that I don’t quite know how to put into words, and turning it into a visual that matches the aesthetic I’m imagining. I’m pretty sure there is a process to be found here, one that uses additional tools and a better understanding of visual elements (color palettes, for example). What I do find useful is quick ideation work. Plus, if you do not have any preconceived notions of what the scene composition should look like, you should be good. Anyway, a good eval for an LVM is one that involves interactions between items that are unlikely to appear together in the real world (like cats driving cars). Such interactions give a good idea of what an LVM excels at (drawing cars and cats individually) and where it is likely to fail (cats with hands, or a cat’s head peeking out through the roof, etc.). With your eval, you want to test the LVM’s ability to draw objects, understand perspective, populate the background, and seamlessly integrate all three into a single image.
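In practice, my image evals are just a handful of prompts plus a checklist I score by eye; the prompts below are illustrative placeholders rather than my actual set.

```python
# Illustrative image/video eval prompts featuring unlikely real-world
# interactions, plus the things I eyeball in the output.
IMAGE_EVAL_PROMPTS = [
    "A tabby cat driving a red convertible through downtown traffic at dusk",
    "Three penguins playing chess on a rooftop garden, seen from overhead",
]

CHECKLIST = [
    "individual objects drawn correctly (the cat, the car)",
    "no anatomy leaks (cat hands, a head poking through the roof)",
    "perspective consistent between foreground and background",
    "background populated and integrated with the subject",
]
```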
Software Development
Software development is probably the main driver of LLM adoption at the moment, while businesses try to figure out how to actually monetize LLMs. (In case you are interested, my friend and former boss wrote this piece on AI & Business Impact, where he talks about finding real value when investing in AI.) I divide this eval task into a few subcategories:
- Frontend - I’ve had limited success generating quality frontends for my projects. Usually I use some combination of existing frameworks like Jekyll, Hugo, or Astro for webpages/blogs, or, in one case, I described my layout and color choices and had an LLM create the CSS and HTML based on Tailwind. This limited success is partly due to my lack of experience on the frontend. With additional tools like Figma or Photoshop and a deeper understanding of design principles, I’m sure I could do a whole lot better, but for now, generating CSS and HTML seems to be enough for my use cases. That, and asking the LLM to debug my Astro or Jekyll setup and explain how to modify it to do what I want.
- Backend - One of the current best use cases for LLMs. The key here, as with all LLM use, is iteration. Expecting the LLM to get the code right on the first try is not going to work consistently, but in conjunction with asking (prompting) the right questions, it’s possible to quickly iterate over versions of the code until you settle on one that does what you want. I look at LLM use in coding as essentially pair programming with someone who understands the docs really well but is prone to randomly making simple mistakes. With that in mind, my go-to eval is to give the LLM the docs for some API and ask it to write code that queries that API to solve a specific problem, where the solution involves multiple conditional API calls contingent on what was retrieved in previous queries (see the sketch after this list).
- Debugging - This is probably my most common use case. When I run into a weird problem getting some random code to work, I throw it at the LLM and have it generate an idea for how to solve it. Then I refine the prompt, or simply try that suggestion (assuming it doesn’t look like it will nuke my system or pose a security risk), and post the results back to the LLM until it solves the problem for me.
- Testing - I usually take the eval task for backend coding from earlier and ask the LLM to generate test cases. The API use case where a conditional determines the flow is a good place to see whether the LLM can generate a useful unit test. Another good eval is a case where there is a `while (True)` loop condition, to see how the LLM tests that situation.
- PR review - I still haven’t met an LLM that can do a consistently useful code review, but they are better than they used to be. I usually like to try cases where the logic, rather than the syntax or a particular design pattern, is what matters to the review, such as flows that never get executed or flows where a return value is not handled.
- Documentation - I use LLMs sparingly for documentation, but it is in fact a great use case. The LLM is likely to do a better job of writing your README than anything you’d put together - not because you are terrible at it, but because you are likely to deprioritize it in favor of doing something else. Whatever you are using to test the backend code will work well here.
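Here is a rough sketch of what I mean by the backend eval above. The API, endpoints, and field names are invented for illustration; the point is that a correct solution has to make follow-up calls conditionally, based on what earlier responses contained, and the same code is a natural target for the testing eval (the mocked test at the end is one example of what I hope to see).

```python
import requests
from unittest.mock import patch

BASE = "https://api.example.com"  # hypothetical API used only for this sketch

def shipping_status(order_id: str) -> str:
    """Summarize an order's status, making follow-up calls only when needed."""
    order = requests.get(f"{BASE}/orders/{order_id}", timeout=10).json()

    # Branch 1: only shipped orders have a shipment to look up.
    if order["status"] != "shipped":
        return f"Order {order_id} is {order['status']}"

    shipment = requests.get(
        f"{BASE}/shipments/{order['shipment_id']}", timeout=10
    ).json()

    # Branch 2: only delayed shipments need the carrier's revised ETA.
    if shipment["delayed"]:
        eta = requests.get(
            f"{BASE}/carriers/{shipment['carrier_id']}/eta", timeout=10
        ).json()
        return f"Delayed, new ETA {eta['date']}"

    return f"In transit, ETA {shipment['eta']}"

def test_unshipped_order_makes_only_one_call():
    """A useful unit test exercises the conditional flow, not just the happy path."""
    with patch("requests.get") as fake_get:
        fake_get.return_value.json.return_value = {"status": "processing"}
        assert shipping_status("42") == "Order 42 is processing"
        assert fake_get.call_count == 1  # no follow-up calls for unshipped orders
```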
Copy-editing for tone & style
This is another popular use case, but its usefulness definitely depends on how comfortable you are with the language in question and on your own creativity. I usually test a model by giving it a sample of my writing and asking it to rewrite the same paragraph of text in a variety of styles - marketing, sales pitch, blog post, and documentation - and seeing how well the resulting text matches my expectations. I have yet to be able to use LLM-generated output without modification on my part, but these models are great at catching typos and grammatical mistakes and at suggesting improvements.
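For what it’s worth, here is a minimal sketch of how this eval can be set up; the style list matches the ones mentioned above, and the sample paragraph is a placeholder for your own writing.

```python
# Target styles and a prompt template; keep the source paragraph identical
# across styles so differences come from the model, not the input.
STYLES = ["marketing copy", "sales pitch", "blog post", "technical documentation"]

SAMPLE = "Paste a paragraph of your own writing here."

def rewrite_prompts(sample: str) -> list[str]:
    """Build one rewrite prompt per target style."""
    return [
        f"Rewrite the following paragraph as {style}, preserving its meaning "
        f"and roughly its length:\n\n{sample}"
        for style in STYLES
    ]
```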
Having a personal eval set is like having a trusted taste-tester for AI models - it helps you quickly separate the actual performance from what might be marketing fluff. While leaderboards give you the highlight reel, a personal eval set lets you see how models perform for your specific needs. Whether you’re wrestling with reasoning problems, trying to get SQL to behave, generating images that don’t put cat heads through car roofs, or just trying to get some decent code review comments, having your own test suite can save you from the dreaded ‘but it worked great on the benchmark!’ conversation with your team. Just remember - like any good testing framework, your eval set should evolve with your needs and the capabilities of new models. And yes, I did have an LLM proofread this summary, because even writers need a second pair of eyes, even if they’re artificial ones.
Disclaimer: The views expressed in this article are my own and do not necessarily represent the views of my employer.