My Personal LLM Eval for Image Generation

In my post about my personal eval for LLMs, I talked about how I have a different eval for each aspect of LLM use - like Reasoning, Coding, Testing, and, of course, Image Generation. Here’s my image gen eval and how a few prominent models do on it.

The prompt

Create an image of three cats driving in an automobile down a busy city street with many cats in the background.

Why it’s interesting

As image prompts go, this one is pretty basic but there are a few things that are tricky:

making sure the cats are completely inside the car (if it’s not a convertible)
making sure that only one of the cats is driving and the other two are positioned properly inside the car
making sure that the background is a reasonable interpretation of “a busy city street with many cats in the background”
making sure the perspective captures the cats, the car, and the background

The goal, as with all of my tests, is to check if LLMs can make what I call ‘commonsense’ choices when they have to fill in the gaps between a spec (prompt) and what they’re trying to create.
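If you wanted to keep score across models, the tricky points above could be written down as a simple pass/fail checklist that a human fills in per image. This is only a rough sketch and not how the images in this post were judged: the class name, the field names, and the idea of splitting "three cats" and "one driver" into separate checks are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class CatCarRubric:
    """Manual pass/fail judgments for one generated image.

    All names here are hypothetical; a person looks at the image and
    fills in each field by hand.
    """

    cats_fully_inside_car: bool  # or the car is a convertible
    three_cats_in_car: bool  # the prompt asks for exactly three
    one_cat_driving: bool  # one driver, one steering wheel
    busy_street_with_cats_behind: bool  # background matches the prompt
    perspective_shows_everything: bool  # cats, car, and background visible

    def failures(self) -> list[str]:
        """Return the names of the checks this image did not pass."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passed(self) -> bool:
        """An image passes only if every check passes."""
        return not self.failures()


# Made-up judgments for an imaginary image, not a result from this post.
example = CatCarRubric(
    cats_fully_inside_car=True,
    three_cats_in_car=True,
    one_cat_driving=False,
    busy_street_with_cats_behind=True,
    perspective_shows_everything=True,
)
print(example.passed(), example.failures())
```

Keeping the checks as separate booleans, rather than a single score, makes it easy to see which commonsense choice a model missed.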

TL;DR

None of the generators managed the task fully, though Leonardo.ai and Dall-E each came close with one of their suggestions. But even there, there are minor issues: one of the cats is poking out through the windshield (I guess that car could just not have a windshield and then it’d be physically possible) and the background is made up of dogs not cats, or the third cat is out of the car, etc. Some, like Midjourney, were far off the mark, blending a cat into the hood of a car and getting the number, placement, and background wrong.

The conclusion is that while a lot of “commonsense” can be captured by regularities in the world (i.e. the data), there is a lot that can’t be, and until the models figure out how to make sure the scene they are generating makes sense, they are not ready to be unleashed without supervision.

Caveat: I’m sure that with some prompt engineering magic I could get all of them to spit out an image that meets my requirements, but my reason for the eval is to see if LLMs can make reasonable choices when faced with a lack of information.

The Images

I’ve included multiple images where the app spit out multiple options.

Leonardo.ai

The third panel, IMO, represents the best effort across the board. You can see that the cats are arrayed correctly in the front seat. The car itself is, conveniently, a convertible, which removes any issues with how to render a cat inside a car and still have everything fit well. But in the other examples you can see some of the errors: multiple steering wheels, and what the model thinks “busy” means (a crowd of cats?). Once these models get better, I suspect I’ll have to change my prompt to exclude convertibles to force them to capture the right perspective for cats inside a car.
Four pictures of cats in a car on a busy city street filled with cats

Dall-E

Panel 1 is definitely one of the better efforts. I initially dismissed it because it showed only two cats, but then I noticed the smaller cat hanging on to the side of the car. So, technically, I think there are three cats “in” a car. The background is really good. It captures a busy street with plenty of cats in the background, and they aren’t just standing around but are actually engaged in busy city street activity. There are a couple of minor issues with cats riding bicycles across the pavement instead of on the pavement or the street, but that’s physically and logically possible so I’ll allow it.
Two pictures of 2 to 3 cats in a car on a busy city street filled with cats

Gemini2.0-Flash-experimental

How many cats are even in the car? I like how one of the cats has regular human glasses though. And definitely a good job on making sure that the cats are completely in the car.
A picture of many cats in a car. There are two steering wheels with cats at each of them. With a third cat in the middle holding what looks like an envelope. Many cats can be seen on the outside of the car in the background.

Grok-3-beta

Interestingly, Grok likes to render real scenes and cats and place them together. I don’t know if this approach has any specific benefits other than the cars and the background looking realistic. It makes everything else harder. And, of course, the number of cats is off.
A 2x2 panel of four pictures showing cats in cars. Each panel has about 4 cats sitting in a convertible car. Some are wearing shades, others hats. All pictures are set on an urban street that looks like a generic big city in the US
