This is one pass at an issue I’ve been circling for a while.
For nearly four years, I’ve been trying to generate a single image: a late-1960s muscle car with a massive engine and comically undersized rear tires. Not a stylized interpretation. Not an artistic take on the concept. The actual thing, rendered clearly enough that you’d immediately understand what you were looking at.
I had a real purpose for this image.
I needed a visual that captured capability without effective application, power that can’t translate to results. A car that could theoretically accelerate from zero to sixty in four seconds, except those tiny rear tires would just spin uselessly the moment you touched the accelerator.
All that potential, no traction. It’s a concept that comes up constantly in organizational work, and I wanted an image that made it instantly clear.

Four Years of Not Quite
Every few months, I’d try again. Midjourney, DALL-E, Stable Diffusion, whatever new platform emerged. The models improved steadily. My results didn’t. I got beautiful muscle cars with normal proportions. I got hot rods with oversized everything. I got vehicles that looked vaguely wrong in ways I couldn’t specify in a prompt. Early on, I got genuine monstrosities where cars sprouted several extra wheels from the roof, each one beautifully rendered but utterly absurd. Nothing came close to what I needed.
This wasn’t casual experimentation. I consulted enthusiasts. I asked experts for prompt refinement strategies. I tried technical terminology, then plain language, then absurdly detailed descriptions. The image existed clearly in my head. Translating that clarity into something AI could execute proved impossible.
Then, this week, ChatGPT got it right. I run my test prompt whenever I see news of image-generation advances, and this time it nailed the result on the first try. The proportions were perfect. The engine dominated the front end exactly as it should. The rear tires looked almost comically inadequate for the power they'd need to handle.
Nano Banana Pro eventually succeeded too, but getting there revealed something important about how these tools actually work. After half a dozen failed attempts that produced everything from normal proportions to bicycle wheels mounted on the rear axle, I completely changed my prompting approach and finally got the right image. Success, at last. But then I'd ask for adjustments like different lighting or a slight angle change, and the tires would revert to regular width. The model could handle each individual request, but it couldn't maintain the core concept across iterations. Every modification risked losing what had finally worked.
Where Image Models Still Fall Apart
Meanwhile, I ran a separate test that produced entirely different results. I sketched a simple single-story floor plan in pencil and asked various models to clean it up and render it properly. None of them managed it. The best captured the basics but made one or two glaring errors. Others completely reimagined the layout, adding rooms I hadn't drawn or repositioning walls in ways that made no structural sense. The capability gap between tools, and even between tasks within the same tool, remains surprisingly wide and unpredictable.
The whole experience reminded me of the semi-famous wine glass test that circulated among AI image communities a while back. Fill a wine glass right to the rim with red wine. Photograph it from the side. Simple request. Watch every model fail spectacularly.
They’d generate glasses too full, showing liquid bulging impossibly above the rim in defiance of physics. Or glasses with the liquid line nowhere near the top, despite clear instructions. Getting physics right, getting proportions accurate, understanding what “right to the rim” actually means, proved far harder than generating artistic beauty or dramatic composition.
Capability Before Understanding
AI image generation has progressed differently than large language models. LLMs became coherent quickly, then spent years developing nuance, reliability, and the ability to handle complex instructions. Plenty of tasks still trip LLMs up, of course, but at least there are tried-and-true ways to guide them toward the right output and to notice when they're straying. Image generation went another direction entirely.
Early outputs were often visually stunning, but literal accuracy lagged behind. Ask for creative interpretation of a concept and you’d get a masterpiece. Ask for specific physical relationships or precise proportions and you’d get confident nonsense rendered beautifully.
The limitation wasn’t imagination or artistic capability. It was physical understanding. These systems learned correlation without grasping causation, aesthetics without mechanics, visual appeal without functional possibility. They knew what powerful looked like in dramatic photography. They didn’t know what functionally possible meant in the real world.
What actually changed? Training improved on physical relationships and spatial accuracy. Prompt interpretation became more precise and less prone to wild reinterpretation. The models developed better understanding of how objects relate to each other in three-dimensional space, how proportions actually work rather than how they appear in stylized imagery or artistic rendering.
Why This Matters To Users
But here’s what strikes me as most relevant. This is partially about technological progress and model improvements. But it’s also about understanding tools that change constantly, possess genuinely amazing capabilities, yet come with this patchwork of limitations and constraints that shift between platforms and even between different tasks on the same platform. Learning to work within those boundaries, testing regularly, adjusting your approach based on what each tool handles well and where it falls apart, becomes its own valuable skill. You can be an active, effective user without being a coder or understanding the technical architecture underneath.
The growth experience comes from persistent experimentation and pattern recognition about what works where. Which connects directly back to that muscle car image I was trying to create.
An Unplanned Meta-Lesson
The irony isn’t lost on me. I was trying to generate an image about capability that can’t be effectively applied, using tools that have tremendous capability but can’t always apply it effectively to specific real-world needs. The disconnect between raw power and useful traction exists in the tools themselves.
We’ve moved past impossible and arrived at inconsistent. That’s real progress, and I don’t mean that dismissively. But the distance between occasionally brilliant and reliably useful is where actual value gets built, and where users who understand the landscape can extract real utility despite the gaps. Understanding those gaps, working around them, knowing when to push harder and when to try a completely different approach, matters as much as the underlying technology improving.