On Having A Personal LLM Eval Set

I don’t trust LLM leaderboards. Well, not entirely anyway. While they are the most common way to quickly assess new large language models, I’ve found them insufficient for a couple of reasons, and at the very least I take their rankings with a grain of salt.

For these reasons (and because it’s way too interesting to see how new models break), one of the things I’ve actually found useful is to have a personal set of evaluations that lets me quickly assess new releases. I combine the results of my personal evals with leaderboard data to form an opinion on the effectiveness of any new LLM release.

Below, I list the kinds of tasks that are helpful to have in your eval set. I originally wanted to also share my current eval set, but that turned out to make this article way too long. So, for the time being, this article just touches on the categories of evals I like to do and what I look for in building them. Later, I’ll expand on each category with a specific example and show how different LLMs perform.

What kind of evals do you need?

The exact list depends on your particular use cases but I find the following tasks to be effective in understanding the capabilities of new releases:

Reasoning

I’ve always had a soft spot for reasoning. Some of my earliest work in AI was based around knowledge representation and reasoning, so I’m thrilled to see it make a comeback as an important AI task, and impressed to see LLMs actually doing reasoning to some degree. When evaluating reasoning abilities, it’s best to have a problem where the interaction between key factors (along with a smattering of unrelated ones) is what determines the correct solution; if any one factor by itself can decide the answer, most LLMs will very likely solve it. Also, make sure you give concrete numbers for the LLM to play with, so you can see whether it correctly identifies the interactions and then actually uses them when working out the solution.
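To make that concrete, here is a minimal sketch of how such an item might look in code. The scenario, numbers, and pass/fail check are invented for illustration and are not from my actual set:

```python
# A hypothetical reasoning eval item: two interacting constraints (dry vs.
# rainy truck capacity), concrete numbers, and one irrelevant detail.
reasoning_item = {
    "prompt": (
        "A warehouse ships 240 boxes per day. Each truck holds 50 boxes, "
        "but safety rules cap loading at 45 boxes per truck on rainy days. "
        "The forecast says 3 of the next 5 days will be rainy. The warehouse "
        "cat is named Milo. How many truck trips are needed over the 5 days?"
    ),
    # 2 dry days: ceil(240/50) = 5 trips each; 3 rainy days: ceil(240/45) = 6 trips each
    "expected_answer": 2 * 5 + 3 * 6,  # = 28
    "distractors": ["the cat's name"],  # irrelevant detail the model should ignore
}

def check_reasoning(model_output: str, item: dict) -> bool:
    """Crude check: pass if the expected number shows up in the model's output."""
    return str(item["expected_answer"]) in model_output
```

The point is less the specific puzzle than the shape: no single fact gives the answer away, and the concrete numbers let you see whether the model actually used the interaction it claims to have spotted.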

SQL

Executives can always have an analyst pull up data for them, but sometimes you want answers faster than you can say “Please run these analytics by EOD.” That means writing those queries yourself, and LLMs are perfectly poised to help you do it. So think about the questions you usually want answered, add in the specifics of which database you are using and its schema, and you should be able to come up with a few questions that quickly tell you whether the LLM’s ability to write SQL queries is good enough and an improvement over other LLMs.
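Here is a sketch of what such an item might look like, assuming a SQLite setup; the schema, question, and reference query are hypothetical placeholders for your own:

```python
import sqlite3

# A hypothetical SQL eval item: schema + question + reference query.
# Swap in your own dialect, schema, and the questions you actually ask.
sql_item = {
    "dialect": "sqlite",
    "schema": """
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER,
            order_date  TEXT,
            total_usd   REAL
        );
    """,
    "question": "What was the average order value per month in 2024?",
    "reference_query": """
        SELECT strftime('%Y-%m', order_date) AS month,
               AVG(total_usd)                AS avg_order_value
        FROM orders
        WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
        GROUP BY month
        ORDER BY month;
    """,
}

def results_match(candidate_query: str, item: dict) -> bool:
    """Run the candidate and reference queries against an in-memory database
    built from the item's schema and compare the results row for row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(item["schema"])
    ref = conn.execute(item["reference_query"]).fetchall()
    cand = conn.execute(candidate_query).fetchall()
    conn.close()
    return ref == cand
```

On an empty database this mostly catches queries that don’t parse or that reference the wrong columns; loading a handful of fixture rows before comparing makes the check far more meaningful.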

Image (and now Video) generation

Image generation was one of the original uses for some of these models, and they have come a long way since the early days of techniques like StyleGAN. But I don’t think LVMs (Large Vision Models) are quite ready to solve the kinds of problems I have, which involve taking an idea that seems clear in my head but that I don’t quite know how to put into words, and turning it into a visual that matches the aesthetic I’m picturing. I’m pretty sure there is a process to be found here, one that uses additional tools and a better understanding of visual elements (color palettes, for example). What I do find useful is using these models for quick ideation work; if you don’t have any preconceived notions of what the scene composition should look like, you should be good. Anyway, a good eval for an LVM is one that involves interactions between items that are unlikely to appear together in the real world (like cats driving cars). Such interactions give a good idea of what an LVM excels at (drawing cars and cats individually) and where it is likely to fail (cats with hands, or a cat’s head peeking out through the roof). With your eval, you want to test the LVM’s ability to draw objects, understand perspective, populate the background, and seamlessly integrate all three into a single image.
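Since these evals are judged by eye rather than by a script, a simple way to keep them honest is to pair each prompt with the things you check for. The prompt and checklist below are just illustrative of the style:

```python
# An illustrative LVM eval item: one unlikely-interaction prompt paired with
# the checks I would make by eye. Not one of my actual prompts.
image_evals = [
    {
        "prompt": "A cat driving a vintage convertible down a coastal road at sunset",
        "checks": [
            "cat and car are each rendered plausibly on their own",
            "cat sits behind the wheel rather than merging with the car",
            "no extra limbs, no head poking through the roof",
            "road, coastline, and lighting share a consistent perspective",
        ],
    },
]
```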

Software Development

Software development is probably the main driver of LLM adoption at the moment, while businesses try to figure out how to actually monetize LLMs (in case you are interested, my friend and former boss wrote this piece on AI & Business Impact where he talks about finding real value when investing in AI). I divide this eval task into a few subcategories.
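I’ll get into the subcategories in the follow-up posts, but as one hypothetical illustration of the general shape, a code-generation item can pair a short spec with unit tests to run against whatever the model produces:

```python
# A hypothetical code-generation item: a spec plus test cases. Illustrative of
# the shape of an item, not one of my actual subcategories or tests.
codegen_item = {
    "spec": (
        "Write a Python function merge_intervals(intervals) that merges "
        "overlapping [start, end] intervals and returns them sorted by start."
    ),
    "tests": [
        ([[1, 3], [2, 6], [8, 10]], [[1, 6], [8, 10]]),
        ([[1, 4], [4, 5]], [[1, 5]]),
        ([], []),
    ],
}

def run_codegen_tests(candidate_fn, item: dict) -> float:
    """Return the fraction of test cases the candidate function passes."""
    passed = 0
    for inputs, expected in item["tests"]:
        try:
            if candidate_fn(inputs) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failure
    return passed / len(item["tests"])
```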

Copy-editing for tone & style

This is another popular use case, but how useful it is definitely depends on how comfortable you are with a particular language and on your own creativity. I usually test a model by giving it a sample of my writing and asking it to rewrite the same paragraph in a variety of styles (marketing, sales pitch, blog post, and documentation), then see how well the resulting text matches my expectations. I have yet to be able to use LLM-generated output without modification on my part, but these models are great at catching typos and grammatical mistakes and at suggesting improvements.
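A minimal sketch of how those style rewrites can be templated; the styles mirror the list above, and the template wording is illustrative rather than my exact phrasing:

```python
# A sketch of style-rewrite prompts: one prompt per target style, all fed the
# same paragraph of your own writing.
STYLES = ["marketing copy", "sales pitch", "blog post", "technical documentation"]

def rewrite_prompt(sample_text: str, style: str) -> str:
    return (
        f"Rewrite the following paragraph as {style}. "
        "Preserve the key facts and my overall voice, and point out any typos "
        "or grammatical mistakes you fix.\n\n"
        f"{sample_text}"
    )

prompts = [rewrite_prompt("Paste a paragraph of your writing here.", s) for s in STYLES]
```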

Having a personal eval set is like having a trusted taste-tester for AI models - it helps you quickly separate actual performance from what might be marketing fluff. While leaderboards give you the highlight reel, a personal eval set lets you see how models perform for your specific needs. Whether you’re wrestling with reasoning problems, trying to get SQL to behave, generating images that don’t put cat heads through car roofs, or just trying to get some decent code review comments, having your own test suite can save you from the dreaded ‘but it worked great on the benchmark!’ conversation with your team. Just remember - like any good testing framework, your eval set should evolve with your needs and the capabilities of new models. And yes, I did have an LLM proofread this summary, because even writers need a second pair of eyes, even if they’re artificial ones.

Disclaimer: The views expressed in this article are my own and do not necessarily represent the views of my employer.
