We Put Opus 4.7 Through Our Creative Benchmark. Is It Worth Experimenting With?

Short answer: yes. For most creative teams, Opus 4.7 is worth experimenting with this week, and the seven-task run supports that. We had been tracking this release all week, and the finding that stayed with us was an unexpected one. Running the same prompt in Claude.ai and Perplexity produced different output types, different technical gate results, and a different practitioner experience of the capability altogether. Where you run the model turns out to matter as much as which version you have.

What we tested

The KINTAL Creative AI Benchmark tests AI models on the kind of work creative businesses actually commission. Seven canonical tasks, spanning the practical range: consumer insight generation, short-form brand film scripting, product explainer writing, brief-to-pitch translation, copy variation, adaptive production planning under commercial pressure, and image and video output. Each task tests a specific creative capability, whether that is insight quality, brand truth integration, narrative construction, structural reasoning, or genuinely distinct variation. A hard cap sits above each one. Some output failures disqualify a task regardless of how strong the creative looks.

The tasks use fictional UK brands: Fell & Form, Graft, Patchwork, Commonground, Clearday, Northlight, Saltrock Press and Edgware Studio. These brands exist to prevent training-data contamination. A model that has seen a brief before can return a remembered answer rather than a generated one. These briefs it has not seen.

The canonical prompts are locked at version 1, the rubric at version 3, and neither was modified before this run began. The framework is built to resist gaming: no retroactive adjustments, no softened criteria in response to an output that lands near a threshold. This is the second benchmark run (the first tested Nvidia Nemotron 3 Super in March 2026) and the first to be published in full.

A note on the scoring process. Opus 4.7 was used as the primary scoring assistant on its own outputs. That is a reviewer proximity problem and it is disclosed here, not in a footnote. Gemini 3.1 Pro ran as a second independent LLM reviewer. No human practitioner reviewer scored this run. All divergences across T1-T5 were 0.5 or 1 point; none crossed the two-point threshold that would trigger formal reconciliation. Every divergence pointed the same direction: Gemini scored higher, Claude scored lower. Both are LLM reviewers. Neither substitutes for a human practitioner, and the scores throughout this report should be read with that caveat active.
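
For readers who want the reconciliation rule stated mechanically, a minimal sketch follows. The function name and constant are ours, not the published rubric's; only the two-point threshold and the 0.5-to-1-point divergences come from the run described above.

```python
# Minimal sketch of the divergence check described above. Names are
# illustrative; only the two-point threshold comes from the rubric,
# and we read "crossed" as at-or-above.
RECONCILIATION_THRESHOLD = 2.0

def needs_reconciliation(score_a: float, score_b: float) -> bool:
    """True when two reviewer scores for one dimension diverge enough
    to trigger formal reconciliation."""
    return abs(score_a - score_b) >= RECONCILIATION_THRESHOLD

# Every T1-T5 divergence in this run was 0.5 or 1.0, so none triggered it.
assert not needs_reconciliation(4.5, 5.0)
assert not needs_reconciliation(4.0, 5.0)
```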

The KINTAL Creative AI Benchmark tests a different capability surface to Anthropic's own evaluations: creative brief quality, narrative construction, variation generation and production-viable output. The two evaluation approaches are measuring different things.


What we found

T1: Consumer Insight and Brief - Fell & Form (4.5)

Fell & Form is a fictional outdoor apparel brand. The T1 task tests whether a model can move from demographic description to an insight with genuine tension, and whether that insight then earns the constraints it produces rather than arriving fully formed and disconnected.

Opus 4.7 found both. The insight landed on something specific about the psychology of committed outdoor walkers: "the worse the weather, the more the walk works." That sentence earns its place because it is true in a way a strategist recognises as useful. For committed outdoor walkers, challenging conditions are the mechanism of the activity, which is the kind of consumer insight a campaign can actually use. The brief that followed turned the observation into a structural constraint rather than importing a second thought and bolting it on. The insight was doing work throughout the document.

Where the task dropped half a point was brief fidelity. The model closed the output with a follow-up question outside the scope of what the brief had asked for. This is a pattern across the run: seven of eight outputs ended with an offer of further help or an invitation to continue. Minor in isolation, but consistent enough to be worth naming as a characteristic behaviour. Insight quality scored 5, brief quality 4, and the remaining dimensions a mix of 4s and 5s. Overall reconciled score: 4.5.

T2a: 60-Second Brand Film - Graft (4.5)

Graft is a fictional craft tools brand. The brief asked for a 60-second film script centred on a founder character named Martin. The task tests brand truth integration at the script level: does the character serve the brand, or does the brand slip into the background of its own film?

Martin holds. He reads as a person before he reads as a founder, and the script gave him a recurring prop that was not in the brief: a mug printed with "WORLD'S OKAYEST DAD." That mug is not decoration. It carries the tonal instruction the brief had given without stating it directly, and it tells you the model thought about what would be in frame on the day of the shoot rather than what would read well in a treatment document. The production-level thinking shows: scene direction scored 5.

POV scored 4. Present and clear, not particularly sharp. The script held its brief throughout, and the brand truth integration was the standout dimension. Overall reconciled score: 4.5.

T2b: Three-Minute Explainer - Patchwork (5)

Patchwork is a fictional small-business administration platform. The brief asked for a three-minute explainer script for sole traders and micro-agency owners. The task tests whether a model can take an unglamorous subject and find a genuine creative position in it, rather than defaulting to a recitation of features and benefits.

The script earned its thesis rather than being handed one. "You're good at what you do. The admin around it shouldn't be a second job." That is the line that organises the whole piece, and it arrived from the brief's audience description rather than from a positioning statement Patchwork had supplied. Converting audience truth into a brand frame is the craft move the rubric was designed to reward.

Gemini caught something in scene direction scoring that Claude missed: a 4 rather than a 5, noting that UI-led explainer formats constrain cinematic direction compared to a live-action brief. That is format-appropriate weighting, not a generosity correction, and it is precisely the kind of catch a second reviewer with different pattern recognition offers. The Gemini note is in the run record. The reconciled overall score is 5.

T3: Brief-to-Pitch - Commonground FNB (5)

The T3 task replicates one of the more common creative agency situations: a folder of ideas with genuinely uneven quality, where the strongest idea is buried under the more obvious ones. The task tests whether a model can identify the buried idea, argue for it against the alternatives, and build a pitch that would hold in a room.

The brief included a character named Sarah and a line she had written that was not in the recommended section: "respect, not affection." Opus 4.7 identified that line as the spine of the brief, named it directly, and built the pitch around it rather than from the ideas the brief appeared to foreground. The model did not just find the strongest idea; it argued for why the brief had been underselling it.

The pitch's closing move was a risk-reversal: "If the answer is no, you've lost three weeks and gained a sharper read on your category." That sentence is doing strategic work. It reframes the cost of a no as competitive intelligence rather than sunk time. That is the kind of move a senior strategist writes and a junior one learns by watching it done. The ten-minute presentable dimension scored 4 (the pitch read long for a first meeting). Every other dimension scored 5. Overall reconciled score: 5.

T4: Variation Generation - Clearday (4)

The Clearday brief asked for five distinct copy executions for a property-adjacent brand. This is the task most likely to expose a model's ceiling on genuine creative range, because the demand is explicit: five executions must be demonstrably different, not surface variations on a single underlying idea dressed up as variety.

Execution 4 earned its place: "a love letter to the skirting boards you never noticed." That is the sideways angle the rubric demanded, and it works because it finds emotional territory the category rarely goes near. Unexpected and grounded at the same time.

The problem was Executions 1 and 3. Both routed through the same underlying psychology: the landlord as unfair judge, the rented home as a space that resists your ownership. Different words, similar hook. Four distinct angles and one near-duplicate is not five. The rubric's specific language for a score of 4, four of five genuinely distinct executions, matched the output precisely. The rubric was doing its job.

Both reviewers independently scored T4 overall at 4, and both independently flagged the convergence-test rubric as ambiguous. Gemini's framing was direct: quantifying whether an execution is genuinely surprising on a 1-5 scale requires a definition of surprise that v3 does not currently provide. Without an anchor, that judgement sits entirely with the reviewer. That is the single clearest refinement priority for v3.1 that this run produced: the convergence test needs worked examples. Overall reconciled score: 4.

T5: Adaptive Planning - Northlight (5)

Northlight is a fictional production company. The T5 task introduces commercial pressure mid-task: the budget has been cut, a delivery has moved, and the original creative recommendation needs to hold under constraint or be revised with a coherent argument for the change. The test is whether the model can maintain creative integrity under operational pressure while producing a plan that is financially coherent and credible to a decision-maker facing a client.

The revised recommendation opened with: "Risks I want to name, not hide." That sentence is the task's diagnostic. The brief was not asking for reassurance; it was asking for an honest re-plan. A model that opens with that sentence understands what the document is actually for. The naming is the reassurance.

The creative fight was partial. The film-to-reveal-animation sequence was defended with an explicit argument; the guidelines piece was cut without one. That asymmetry kept the creative integrity under pressure dimension at 4. The operational arithmetic was correct to the penny across the revised timeline, and the invoice schedule mapped cleanly to the new delivery dates. Opus 4.7 appeared to self-deploy extended reasoning on the financial reconciliation without being asked: it did not need a prompt to work through the numbers. Overall reconciled score: 5.

T6: Image Variation - Saltrock Press

Saltrock Press is a fictional publisher. The T6 brief asked for five distinct image variations for a publication cover. The same prompt ran in two interfaces. This produced the most significant methodology finding of the run.

Perplexity, raster run: technical gate fail

In Perplexity, using Opus 4.7 with an external raster image generation tool, the model produced five PNG images. The creative range was genuine: five distinct conceptual angles including two unexpected visual elements, a cartographic fragment and a botanical specimen, that moved past the brief's minimum. At moodboard resolution, these images are usable. Image 1's cartographic composition was the standout move and also the clearest failure mode: the map included hallucinated Cornish place names. The composition understood what a map looks like. It did not know what made a map geographically true. The technical gate failed on editability. Flat PNG rasters cannot be opened as layered files in Figma or Photoshop. A designer cannot isolate the compass rose from the map surface beneath it without destructive work. For concept exploration: workable. For production design: a starting point only.

Claude.ai, SVG run: 5 across all dimensions

In Claude.ai, using Opus 4.7 with its native visualise tool, the model produced five SVG files. The colour palette was locked to exact hex values across all five compositions, enforced by the code rather than approximated visually. Each file was editable at component level with named groups. Two argued-for sideways moves appeared beyond the minimum the brief required: the cartographic fragment used with a different compositional logic, and a bound-page conceit presented with explicit reasoning. The closing note in the output read: "SVG cannot replicate the hand-pressed paper grain a printed Saltrock cover would want." The model named its own production ceiling without being prompted to. That is the creative integrity dimension in visible behaviour. T6b scored 5 across all dimensions.
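
To make the editability claim concrete, here is a minimal sketch of the kind of structure such an SVG carries: named groups a designer can select individually, and a palette enforced as exact hex constants in the generating code. The group names, geometry and colour values below are invented for illustration, not taken from the run.

```python
# Illustrative sketch only: the group ids and hex values are hypothetical,
# not the actual T6b output. The point is structural, not visual.
PALETTE = {"ink": "#1C2B33", "sand": "#E4D5B8", "signal": "#C8472B"}

def cover_svg() -> str:
    """Build a cover composition as layered, named SVG groups."""
    return f"""<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 600 900">
  <g id="background">
    <rect width="600" height="900" fill="{PALETTE['sand']}"/>
  </g>
  <g id="cartographic-fragment">
    <path d="M80 220 L260 180 L340 310 L150 360 Z" fill="none"
          stroke="{PALETTE['ink']}" stroke-width="2"/>
  </g>
  <g id="title-block">
    <text x="80" y="760" font-size="48" fill="{PALETTE['signal']}">SALTROCK PRESS</text>
  </g>
</svg>"""
```

Because every element sits inside a named group and every colour resolves to the same constant, a designer can isolate the cartographic fragment from the background without the destructive work the flat PNG rasters demanded.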

What the dual-interface result means

Two runs, same model, same prompt, two surfaces: one fails the technical gate, one passes it with a full score. The buyer evaluating Opus 4.7's image capability is evaluating different things depending on where they run it. The v3 framework's existing binary (image-capable or not) cannot represent this. The finding belongs in v3.1 as a required run-record field: interface must be specified when publishing image task results, because the same model produces materially different outputs depending on the surface.

T7: Video Variation - Edgware Studio

The T7 brief asked for video output. The same prompt ran in two interfaces.

In Perplexity, Opus 4.7 opened by disclosing its constraints: 8-second clips, one per request, specific audio modes available. Three options were presented and the user was asked to choose. The model told you what it could not do before it did anything.

In Claude.ai, Opus 4.7 produced three detailed shot treatments with no mention of video. No disclaimer. No capability caveat. The output opened directly into creative work and treated shot treatments as the obvious form of the request. The treatments included casting notes, sound design considerations, use-case mapping across three contexts (brand film, social, pitch), and a production risk flag: "Agency extras in Barbours will kill this concept instantly." That note is not in the brief. It is the kind of detail a director with set experience adds because they know it will matter on the day. It tells you the model thought about what would break the concept in production rather than what would make the treatment look confident on paper.

Both runs produced a technical gate fail: neither generated video. The Claude.ai run is more useful to a practitioner. The user does not need to diagnose the capability gap, then ask for an alternative, then evaluate what arrives. The model makes that judgement and delivers the useful thing directly. The v3 framework cannot score this distinction: a model that substitutes a high-quality useful artefact scores identically to one that delivers nothing. That is a v3.1 gap, and it is the most honest framework limitation this run produced. Overall on both runs: technical gate fail, with the Claude.ai alternative noted in the run record as a candidate for scoring under v3.1 criteria.


What it means

For buyers deciding whether to experiment now

The shape of Opus 4.7's capability is clear enough across these seven tasks to offer some role-specific guidance.

Creative teams doing first-pass strategy work should find this model genuinely useful. The T3 finding matters most here: the model identified the buried idea in a complex brief, argued for it over the more obvious alternatives, and constructed a pitch that would hold in a room. The adaptive thinking behaviour is also relevant. Opus 4.7 appears to self-deploy extended reasoning on tasks that require it (the T3 brief-to-pitch, the T5 operational planning with financial arithmetic) without the user needing to choose a reasoning mode. For a Creative Director who wants the model to think harder on a difficult brief, that self-calibration removes one decision from the workflow. One context note: the benchmark ran in default mode. Anthropic's launch documentation introduces an "xhigh" effort level and recommends high or xhigh for complex tasks; the self-deploying reasoning observed here reflects default-mode behaviour, and results at higher effort levels could differ and are worth testing against your own briefs.

Copywriters and content leads working on explainer or scripted content should weight the T2b result. A full-score explainer script on an unglamorous brief, with the thesis constructed from the audience description rather than the positioning statement, is a meaningful signal about structural capability. The T4 counterweight applies: for generating genuine range across executions, the model's default pull is toward a central emotional territory, and breaking out of that requires either a more explicitly constrained brief or a deliberate second-pass review.

In-house brand leads and strategy functions should pay close attention to T5. A model that opens a revised production plan with a direct acknowledgement of the risks it is naming is calibrated for how that document will land with a client-facing stakeholder. That kind of tonal and operational intelligence under commercial pressure is the version of the capability that matters in day-to-day work.

Where the model is softer

Variation generation is the clearest limitation. T4 scored 4 because two of five executions shared underlying psychological DNA despite having different surface treatments. This is a known pattern across current language models: they pull toward the emotional centre of a brief when the task is to move away from it. Explicit constraint in the prompt (specifying the angle you do not want, as well as the one you do) can address it, but that scoping work sits with the practitioner.

The follow-up offer tic appears on seven of eight outputs in this run. It breaks only when the output format is a closed artefact: the T5 email ends with a signature, and the model does not append an offer to the end of a signed letter. For practitioners who want closed outputs without tacked-on options, an explicit instruction to omit follow-up offers is the obvious mitigation, though the note below on instruction following complicates how reliably it holds. Worth knowing because the pattern is consistent.

One note worth adding alongside this finding: Anthropic positions Opus 4.7 as substantially better at literal instruction following than its predecessor, and that improvement is real in this run on most dimensions. But the tic persisted across seven outputs despite an explicit brief instruction not to offer follow-up options. The gap between general instruction compliance and the suppression of a deeply trained conversational behaviour is worth understanding before you rely on the tighter literal compliance elsewhere in your prompts.

The interface finding

The T6 and T7 dual-interface results are the finding most likely to matter to buyers evaluating Opus 4.7 for production work in 2026.

At T6, the same model and the same prompt produced raster PNGs in Perplexity and editable SVG files in Claude.ai. Different technical gate results: one fails, one passes. At T7, the same model and the same prompt produced explicit constraint disclosure with a user choice in Perplexity and a silent useful substitution in Claude.ai. Different output forms. Different user experiences of the capability gap.

The model name is one variable in the evaluation. The surface where you run it is another, and in these two tasks the surface shaped the output type, the production viability and the practitioner experience of what the model could do, as much as the model itself did. Evaluating "Opus 4.7" without specifying the interface is like evaluating a piece of work without specifying the brief. Practitioners building evaluation criteria should record the interface as a required field alongside model name and version. The v3 framework did not require this; v3.1 will.
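
As a sketch of what that requirement could look like in a run record, here is one possible shape. The field names are ours, chosen for illustration rather than taken from v3 or the v3.1 draft; the substantive point is that the interface is a required field, not optional metadata.

```python
from dataclasses import dataclass

# Hypothetical run-record shape. Field names are illustrative.
@dataclass(frozen=True)
class RunRecord:
    model: str          # e.g. "Opus 4.7"
    model_version: str  # provider version identifier
    interface: str      # e.g. "Claude.ai" or "Perplexity" - required
    task: str           # e.g. "T6"
    output_format: str  # e.g. "SVG" or "PNG"

# The two T6 runs above would be recorded as distinct entries:
t6_claude = RunRecord("Opus 4.7", "unspecified", "Claude.ai", "T6", "SVG")
t6_perplexity = RunRecord("Opus 4.7", "unspecified", "Perplexity", "T6", "PNG")
```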

What the framework learned from this run

This run identified seven refinement priorities for the rubric, now in v3.1 draft. Three deserve naming here because they affect how the results above should be read.

The convergence-test rubric needs worked examples anchoring what "a genuinely distinct angle" means in practice. Both LLM reviewers flagged the ambiguity independently, which is the clearest possible signal that the issue is with the rubric rather than with reviewer interpretation. Without an anchor, that judgement sits entirely with whoever is scoring, and reviewer drift on this dimension is the most consequential gap in v3.

The framework needs a third image-capability category. The existing binary (image-capable or not) does not capture a text-native model producing orchestrated image output through tool use. Opus 4.7 in Claude.ai closed the image generation loop inside the chat without a visible external tool call. That is a capability shift since Opus 4.6, and the rubric should represent it accurately rather than forcing it into a category it does not fit.

The technical gate needs a "form-declined, alternative produced" category, distinct from a flat refusal. A model that substitutes a genuinely useful artefact, without meta-commentary, is exercising judgement. Treating it identically to a null result misrepresents what happened and loses the most diagnostic information the task produced.
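
A minimal sketch of how the second and third of those refinements could be represented in a scoring schema, with category labels that are ours rather than anything v3.1 has settled on:

```python
from enum import Enum

# Labels are illustrative; v3.1 has not published its final wording.
class ImageCapability(Enum):
    NATIVE = "generates images directly"
    ORCHESTRATED = "text-native model producing image output through tool use"
    NOT_CAPABLE = "no image output"

class TechnicalGateResult(Enum):
    PASS = "required form delivered"
    FORM_DECLINED_ALTERNATIVE = "required form not delivered, useful alternative produced"
    FAIL = "required form not delivered, nothing useful in its place"
```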

On the scores themselves: the T1-T5 average was 4.67 from the Claude reviewer and 4.83 from Gemini. Those numbers sit high on a rubric built to resist inflation. Two LLM reviewers converging high on a well-constructed model's outputs is a known calibration risk. Without at least one human practitioner reviewer score per published run, the question of whether LLM consensus tracks practitioner standards remains genuinely open. The v3.1 recommendation is to commission one human reviewer score per run, on one task. Enough to establish calibration without requiring a full parallel review across all seven tasks.

Publishing the friction log, the scoring limitations and the framework gaps alongside the results is not a hedge. A benchmark that publishes only the results it is proud of is not a benchmark. The methodology finds its credibility in what it is willing to name about itself, not in what it avoids.

Next

Prompt debt: what happens after you build the library