My Personal LLM Eval for Image Generation

In my post about my personal eval for LLMs, I talked about how I have a different eval for each aspect of LLM use, like Reasoning, Coding, Testing, and, of course, Image Generation. Here’s my image gen eval and how a few prominent models do on it.

The prompt

Create an image of three cats driving in an automobile down a busy city street with many cats in the background.

Why it’s interesting

As image prompts go, this one is pretty basic but there are a few things that are tricky:

The goal, as with all of my tests, is to check whether LLMs can make what I call ‘commonsense’ choices when they have to fill in the gaps between a spec (the prompt) and what they’re trying to create.
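To make the "fill in the gaps" idea concrete, here is a minimal sketch of how the judging could be written down as a checklist. The check names are my own reading of the prompt and of the failure modes discussed in this post; they are hypothetical, not a rubric the eval formally uses, and every value is entered by hand after looking at an image.

```python
from dataclasses import dataclass, fields


@dataclass
class SceneChecks:
    """Hand-entered yes/no judgments for one generated image (hypothetical rubric)."""
    three_cats_in_car: bool          # exactly three cats, all riding in the car
    cats_physically_plausible: bool  # no cat poking through glass or merged with the car
    single_driver_position: bool     # one steering wheel, one cat at it
    busy_city_street: bool           # the street reads as a busy city street
    cats_in_background: bool         # the background crowd is cats, not dogs


def failed_checks(checks: SceneChecks) -> list[str]:
    """Return the names of the checks the image did not satisfy."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def passes(checks: SceneChecks) -> bool:
    """An image meets the prompt only if every check holds."""
    return not failed_checks(checks)


# Made-up judgments for an imaginary image, not a result from any model in this post.
example = SceneChecks(
    three_cats_in_car=True,
    cats_physically_plausible=False,
    single_driver_position=True,
    busy_city_street=True,
    cats_in_background=True,
)
print(failed_checks(example))  # ['cats_physically_plausible']
print(passes(example))         # False
```

The all-or-nothing `passes` is a deliberate choice in this sketch: an image that gets four of five things right still does not depict the scene the prompt asked for.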

TL;DR

None of the generators managed the task fully, though Leonardo.ai and Dall-E each came close with one of their suggestions. But even there, there are minor issues: one of the cats is poking out through the windshield (I guess that car could just not have a windshield and then it’d be physically possible), the background is made up of dogs not cats, or the third cat is out of the car. Some, like Midjourney, were far off the mark: it blended a cat into the hood of a car and got the number, placement, and background wrong.

The conclusion is that while a lot of “commonsense” can be captured by regularities in the world (i.e. the data), much of it can’t be, and until the models figure out how to make sure the scene they are generating makes sense, they are not ready to be unleashed without supervision.

Caveat: I’m sure with some prompt engineering magic I can get all of them to spit out an image that meets my requirements but my reason for the eval is to see if LLMs can make reasonable choices when faced with a lack of information.

The Images

I’ve included multiple images where the app spit out multiple options.

Leonardo.ai

The third panel, IMO, represents the best effort across the board. You can see that the cats are arrayed correctly in the front seat. The car itself is, opportunistically, a convertible, which removes any issues with how to render a cat inside a car and still have everything fit well. But in the other examples you can see some of the errors: multiple steering wheels and what the model thinks “busy” means (a crowd of cats?). Once these models get better, I suspect I’ll have to change my prompt to exclude convertibles to force them to capture the right perspective for cats inside a car.
Four pictures of cats in a car on a busy city street filled with cats

Dall-E

Panel 1 is definitely one of the better efforts. I initially dismissed it because it showed only two cats but then I noticed the smaller cat hanging on to the side of the car. So, technically, I think there are three cats “in” a car. The background is really good. It captures a busy street with plenty of cats in the background and they aren’t just standing around but actually engaged in busy city street activity. A couple of minor issues with cats riding bicycles across the pavement instead of on the pavement or the street but that’s physically and logically possible so I’ll allow it.
Two pictures of 2 to 3 cats in a car on a busy city street filled with cats

Gemini2.0-Flash-experimental

How many cats are even in the car? I like how one of the cats has regular human glasses though. And definitely a good job on making sure that the cats are completely in the car.
A picture of many cats in a car. There are two steering wheels with a cat at each of them, and a third cat in the middle holding what looks like an envelope. Many cats can be seen outside the car in the background.

Grok-3-beta

Interestingly, Grok likes to render real scenes and cats and place them together. I don’t know if this approach has any specific benefits other than the cars and the background looking realistic. It makes everything else harder. And, of course, the number of cats is off.
A 2x2 panel of four pictures showing cats in cars. Each panel has about 4 cats sitting in a convertible car. Some are wearing shades, others hats. All pictures are set on an urban street that looks like a generic big city in the US.

Ideogram

First panel comes the closest but I do like the background in panel 4. They managed to mix in people with cats just randomly standing around. What’s up with the tiny pink top hats on the cats? :)
These images show a collection of humorous digitally altered photographs featuring cats driving or riding in vintage cars on busy city streets. The collection consists of four side-by-side images: three cats wearing small top hats sitting in the front seat of a classic blue/gray sedan driving down a crowded street; three white cats in the front seat of a light blue vintage car with other cats lining the street watching them; three cats wearing pink clothing in a red convertible classic car; and three cats sitting in a light blue vintage convertible on what appears to be an Asian city street with motorcycles and pedestrians. All images share the same whimsical concept of anthropomorphized cats as drivers or passengers in vintage automobiles navigating through urban environments.

Stability.ai (SD3.5-large)

If I encountered that background in real life, I’d run. That many cats could only mean bad news.
A digitally manipulated image showing three cats (one gray tabby, one orange tabby, and one white and orange) with their front paws resting on the dashboard of a convertible car. In the background, hundreds of cats appear to be lining both sides of a city street. The perspective is from inside the vehicle looking out at the surreal scene of the cat-filled urban landscape, with buildings and traffic visible in the distance.

Midjourney

Hello random child in the first panel. So glad you could be part of this experiment.
A composite image of four digitally manipulated photographs showing cats in vintage vehicles on city streets. In the top left, cats ride in an antique car with one wearing a top hat. The top right shows three ginger cats peering out from a car window. The bottom left features three cats in a red convertible, while the bottom right shows three more cats (orange, tabby, and white) riding in a blue convertible. All images are set in urban environments with tall buildings in the background, creating a whimsical series depicting cats as automobile passengers or drivers in a human city.

Adobe Firefly

Adobe did a good job with panel 2 where the cats are completely in the car without any parts sticking out.
A grid of four stylized AI-generated illustrations showing cats driving vintage cars down city streets. Each panel features a similar scene with variations: three cats in a teal vintage car with other cats visible in the background on a sunny city street; two orange cats driving a blue classic car down a street with tall buildings and a setting sun; three cats in a light blue vintage car with other cat-driven vehicles visible in traffic; and three cats (one orange, two gray) driving a yellow/brown classic car through a busy urban setting. All images share an illustrated/painted art style with warm sunset lighting, depicting anthropomorphized cats as drivers navigating through colorful cityscapes with other cat characters visible in the surroundings.

Disclaimer: The views expressed in this article are my own and do not necessarily represent the views of my employer.

© 2025 Unmesh Kurup
