In my post about my personal eval for LLMs, I talked about how I have a different eval for different aspects of LLM use - like Reasoning, Coding, Testing, and, of course, Image Generation. Here’s my image gen eval and how it does on a few prominent models.
The prompt
Create an image of three cats driving in an automobile down a busy city street with many cats in the background.
Why it’s interesting
As image prompts go, this one is pretty basic but there are a few things that are tricky:
- making sure the cats are completely inside the car (if it’s not a convertible)
- making sure that only one of the cats is driving and the other two are positioned properly inside the car
- making sure that the background is a reasonable interpretation of “a busy city street with many cats in the background”
- making sure the perspective captures the cats, the car, and the background
The goal, as with all of my tests, is to check if LLMs can make what I call ‘commonsense’ choices when they have to fill in the gaps between a spec (prompt) and what they’re trying to create.
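The checks above are judged by eye, but they can be written down as a simple pass/fail rubric. Here is a minimal sketch in Python, assuming a human reviewer fills in each judgment by hand; the class name, field names, and example values are all hypothetical and are not results from any of the models below.

```python
from dataclasses import dataclass, fields


@dataclass
class RubricResult:
    """Manual pass/fail judgments for one generated image (hypothetical names)."""
    cats_inside_car: bool            # cats fully inside the car, unless it is a convertible
    one_driver_two_passengers: bool  # exactly one cat driving, other two seated properly
    busy_street_with_cats: bool      # background reads as a busy city street with many cats
    perspective_shows_all: bool      # cats, car, and background are all visible


def score(result: RubricResult) -> float:
    """Return the fraction of rubric checks that passed."""
    checks = [getattr(result, f.name) for f in fields(result)]
    return sum(checks) / len(checks)


def passes(result: RubricResult) -> bool:
    """An image fully meets the prompt only if every check passed."""
    return score(result) == 1.0


# Made-up example: an image that gets everything right except the background.
example = RubricResult(
    cats_inside_car=True,
    one_driver_two_passengers=True,
    busy_street_with_cats=False,
    perspective_shows_all=True,
)
print(score(example))   # 0.75
print(passes(example))  # False
```

Equal weights are just the simplest choice here; a rubric like this only records the reviewer's judgment and does not replace looking at the image.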
TL;DR
None of the generators fully managed the task, though Leonardo.ai and Dall-E each came close with one of their suggestions. Even there, there are minor issues: one of the cats is poking out through the windshield (I guess that car could just not have a windshield, and then it’d be physically possible), the background is made up of dogs rather than cats, or the third cat is outside the car. Some, like Midjourney, were far off the mark: it blended a cat into the hood of a car and got the number, placement, and background wrong.
The conclusion is that while a lot of “commonsense” can be captured by regularities in the world (i.e. the data), there are many commonsense choices that can’t be, and until the models figure out how to make sure the scene they are generating makes sense, they are not ready to be unleashed without supervision.
Caveat: I’m sure with some prompt engineering magic I can get all of them to spit out an image that meets my requirements but my reason for the eval is to see if LLMs can make reasonable choices when faced with a lack of information.
The Images
I’ve included multiple images where the app spit out multiple options.
Leonardo.ai
The third panel, IMO, represents the best effort across the board. You can see that the cats are arrayed correctly in the front seat. The car itself is, conveniently, a convertible, which removes any issues with how to render a cat inside a car and still have everything fit well. But in the other examples you can see some of the errors: multiple steering wheels and what the model thinks “busy” means (a crowd of cats?). Once these models get better, I suspect I’ll have to change my prompt to exclude convertibles to force them to capture the right perspective for cats inside a car.
Dall-E
Panel 1 is definitely one of the better efforts. I initially dismissed it because it showed only two cats, but then I noticed the smaller cat hanging on to the side of the car. So, technically, I think there are three cats “in” a car. The background is really good. It captures a busy street with plenty of cats in the background, and they aren’t just standing around but are actually engaged in busy city street activity. There are a couple of minor issues with cats riding bicycles across the pavement instead of on the pavement or the street, but that’s physically and logically possible so I’ll allow it.
Gemini2.0-Flash-experimental
How many cats are even in the car? I like how one of the cats has regular human glasses though. And definitely a good job on making sure that the cats are completely in the car.
Grok-3-beta
Interestingly, Grok likes to render real scenes and cats and place them together. I don’t know if this approach has any specific benefits other than the cars and the background looking realistic. It makes everything else harder. And, of course, the number of cats is off.
Ideogram
First panel comes the closest but I do like the background in panel 4. They managed to mix in people with cats just randomly standing around. What’s up with the tiny pink top hats on the cats? :)
Stability.ai (SD3.5-large)
If I encountered that background in real life, I’d run. That many cats could only mean bad news.
Midjourney
Hello random child in the first panel. So glad you could be part of this experiment.
Adobe Firefly
Adobe did a good job with panel 2 where the cats are completely in the car without any parts sticking out.
Disclaimer: The views expressed in this article are my own and do not necessarily represent the views of my employer.