Guide · 12 min read · March 29, 2026

Creative Testing with AI: What Actually Changes (and What Doesn't)

AI is changing creative testing in advertising, but not the way most people think. A practical guide to better stimuli, more directions tested, and smarter decisions.

Here is a scene that plays out in global brand teams every month. Two or three creative agencies are asked to develop concepts against the same brief. Each agency submits its best work. The brand needs to test all of them, but the stimuli arriving from each agency look completely different: one has a polished animatic, another has illustrated boards, and the third has a mood film stitched together from stock footage.

The research runs. One direction wins. But did it win because the idea was strongest, or because the stimulus was the most polished? When stimuli are produced at different quality levels by different teams, there is no way to separate the two. The data becomes unreliable.

Internally, we call this a stim race. And it is the situation where AI-powered creative testing makes the most immediate, practical difference.

The stim race problem

Large brands, the kind spending serious money on advertising, routinely test scripts from multiple creative agencies at once. That is sensible. You want the best idea to win on merit, not on politics or familiarity.

But the testing process has a structural flaw. Each agency produces its own stimulus material, and they produce it to wildly different standards. One agency might invest in a high-quality animatic because they know the test matters. Another might submit rough boards because their production budget is allocated elsewhere. A third might fall somewhere in between.

The result is a test that looks rigorous on paper but is comparing apples, oranges, and something in between. The research agency does its best with what it is given, but no methodology can fully compensate for stimulus quality that varies this much.

This is where Myth Labs sits. When a brand routes all its creative directions through a single production partner for testing, every concept gets the same level of treatment. Same visual quality, same editorial polish, same sound design. The playing field levels out. The data starts measuring what it is supposed to measure: which idea resonates, not which stimulus was better produced.

What AI stimuli actually look like now

If you have not worked with AI-generated animatics recently, your reference point might be outdated. The technology in 2026 has moved well past the wobbly faces and melting hands of a couple of years ago.

What we produce now is, in many cases, very close to the finished ad. Not close enough to air as the final commercial, but close enough that a research respondent or a boardroom full of stakeholders can genuinely feel what the ad will be like. The lighting, the atmosphere, the pacing, the emotional register: it all comes through.

This matters because people respond to what they see, not what they are told to imagine. A respondent watching a polished AI animatic engages with the idea differently from one trying to mentally reconstruct a finished ad from a set of rough sketches. The closer the stimulus is to the real thing, the more useful the response.

More than testing: how AI stimuli feed the creative process

Something we did not fully anticipate when we started this work: the AI stimulus process often becomes part of the creative development itself.

Here is how that happens. We produce a first round of animatics for testing. The brand and agency review them together, and ideas start flowing. "What if we changed the setting?" "What if we tried a different casting direction?" "What if the narrative took a different turn in the second half?" Because AI generation is fast, we can implement those ideas and produce updated versions within a day or two. The stimulus evolves alongside the thinking.

By the time the testing process is complete, the winning animatic is not just a rough preview of what the ad might be. It has been refined through multiple rounds of creative input. It becomes a close guide for the director of the commercial: a detailed visual reference showing exactly how the brand, the agency, and the research audience all agreed the ad should look and feel.

That is a different proposition from a traditional test, where the animatic is a throwaway artifact that gets filed once the direction is chosen. Here, the stimulus has lasting value because it has absorbed the creative development that happened around it.

The economics, plainly

A traditional illustrated animatic for a 30-second ad typically costs £15,000-30,000. An AI-generated animatic of the same length: £5,000-12,500.

For a brand testing scripts from three agencies, levelling the playing field with AI means producing three to five animatics at a total cost comparable to what one or two traditional animatics would have run. At the midpoints of those ranges, three AI animatics come to about £26,000, close to the price of a single traditional animatic, and five come to about £44,000, roughly the price of two. The investment goes further, the data is cleaner, and the creative development gets a bonus round of visual iteration that would not have been affordable otherwise.

For creative testing packages where four to six animatic variants are produced from a single brief, the total typically falls in the £12,000-25,000 range. That is roughly the cost of a single traditionally produced animatic, but you get genuine breadth.

How the process works

You share a brief. It can be a full creative brief with finished boards, or it can be a script with some reference images. Either way, we need to understand what each creative territory is trying to do and who it is trying to reach.

We break each territory into a shot-by-shot plan and develop detailed prompts for every frame. This is the craft step: describing not just what the scene contains, but the camera angle, the lens, the lighting, the atmosphere, the mood. The difference between a mediocre AI animatic and a good one lives almost entirely in this work.
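To make that concrete, here is an illustrative example (invented for this guide, not taken from a client project). A weak frame prompt reads "a family having breakfast in a kitchen." A working one reads more like "kitchen at dawn, warm low sunlight through a side window, 35mm lens at eye level, shallow depth of field, mother and daughter at the counter mid-conversation, quiet and unhurried mood." The second carries the creative intent into the generated frame; the first leaves it to chance.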

We generate, curate, and assemble everything into timed sequences with music and sound. Typical turnaround from brief to delivery of all variants is 5-7 working days.

What it does not fix

AI stimulus production solves the quality-levelling problem and makes it economically viable to test more directions. It does not fix a weak brief, a poorly designed test, or a research methodology that asks the wrong questions.

If the creative territories being tested are too similar, no amount of visual polish will produce differentiated data. If the research sample is too small or poorly recruited, the results will be noisy regardless of stimulus quality. If the test is happening because someone needs political cover rather than genuine insight, the data will be interpreted to confirm the existing preference.

These are process problems, not production problems. AI stimulus production makes the production side better. The strategic side still requires good thinking and honest intent.

Where AI stimulus quality still has limits

It would be dishonest to present AI generation as having no constraints.

Character consistency across many shots remains a technical challenge. We solve it, but it requires more production craft than, say, generating a single beautiful landscape. Hands, though much improved from earlier models, still need attention. Very specific real-world locations (a recognisable London street, a particular stadium) can be guided towards accuracy but not guaranteed to be photographic replicas.

For most creative testing applications, these limitations are manageable. The stimulus needs to communicate the idea clearly, not replicate reality frame by frame. But if your concept depends on a specific recognisable setting or a character with a very particular physical appearance, discuss this upfront so we can set realistic expectations.

Getting started without overthinking it

If this is new territory for you, the simplest approach is to run one project and compare.

Pick a brief where you would normally develop two directions for testing. Use AI to produce four or five instead. Run the same research methodology you use now. Compare the experience: the turnaround, the cost, the stimulus quality, the usefulness of the data.

If you are not sure whether AI animatics are right for your specific brief, we are happy to produce a single animatic as a sample before committing to a larger package. That way you see the quality first-hand before making a decision on the full project. Get in touch if that is useful.

Questions we get asked a lot

How many directions should we test? Four to eight is the sweet spot for most briefs. Fewer than three and the comparison is too narrow. More than eight and the research design strains. We will give you a straight recommendation based on your brief and budget.

Can you match our brand guidelines? Yes. We work from your colour palettes, typography, and art direction references. Every output is reviewed against the brief before delivery.

What does it cost? A set of 4-6 animatic variants for a 30-second brief typically comes to £12,000-25,000 in total. We always quote upfront with no ambiguity.

What research platforms do you deliver for? Standard video formats (MP4, MOV) in whatever resolution and spec the platform requires. We have delivered for Zappi, System1, Toluna, and various online survey tools.

Can the stimuli be reused beyond the research? Often, yes. AI animatics work well for internal presentations, client decks, and as a visual reference for the production director. In many cases the winning animatic becomes a working document that guides the shoot.

The insight behind creative testing has always been sound: measure before you commit. The limitation has been practical. It cost too much and took too long to give every idea a fair shot.

That limitation is gone. If you have a brief with more directions worth testing than your current process can handle, or if you are running a stim race and need every concept treated equally, let's talk about it.

Ready to test more directions?

Share a brief with Myth Labs and we will show you what AI-powered creative testing looks like for your next campaign.

Get in touch