Mind in Transition Pipeline

Overview

A working creative-direction system that takes a recorded conversation and turns it into a composed video piece — with AI-generated visuals, ambient sound design, and a director's eye holding the whole thing together.

The pipeline has four stages: Transcribe → Direct → Generate → Assemble. A recording of someone thinking aloud goes in. Claude reads the transcript, interprets it against a creative brief, and writes a scene-by-scene plan. fal.ai and ElevenLabs generate the assets in parallel. FFmpeg stitches it all together into a finished draft. What comes out the other end is something I couldn't have made before — not because the production is faster, but because the act of directing has been pushed up a level.

The Pipeline

Figure: Pipeline stages (Transcribe, Direct, Generate, Assemble)

Transcribe. A recording — me thinking aloud, or a conversation with someone else — passes through Whisper. Out comes a timestamped, speaker-labelled transcript.
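To make "timestamped, speaker-labelled" concrete, here is a minimal sketch of how such a transcript might be held in memory and rendered as text for the next stage. The Segment shape, the field names, and the line format are assumptions for illustration, not the pipeline's actual schema or Whisper's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One timestamped, speaker-labelled span of speech (hypothetical shape)."""
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def format_timestamp(seconds: float) -> str:
    """Render whole seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_transcript(segments: list[Segment]) -> str:
    """Turn segments into plain lines a person or a model can scan."""
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    )


segments = [
    Segment(0.0, 4.2, "A", "I keep circling the same thought."),
    Segment(4.2, 9.8, "B", "Which one?"),
]
print(render_transcript(segments))
```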

Direct. This is where the system becomes interesting. Claude reads the transcript alongside a creative brief — mood, palette, references, things to avoid — and writes a production plan. Scene by scene: what should be on screen, which model should generate it, how long it should hold, what the audio underneath should feel like. The brief is the aesthetic vocabulary; the transcript is the source. Claude is the director negotiating between them.
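One way to picture the production plan is as a list of scenes, each carrying the four decisions named above. The sketch below assumes the plan arrives as JSON; the field names, the model labels, and the validation rules are hypothetical, chosen only to illustrate the idea.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    scene_id: str
    visual: str        # what should be on screen
    model: str         # which generator to use (labels below are placeholders)
    duration_s: float  # how long the scene holds
    audio: str         # what the sound underneath should feel like


REQUIRED = ("scene_id", "visual", "model", "duration_s", "audio")


def parse_plan(raw: str) -> list[Scene]:
    """Parse a JSON plan, rejecting scenes with missing fields or bad durations."""
    scenes = []
    for i, item in enumerate(json.loads(raw)["scenes"]):
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            raise ValueError(f"scene {i} is missing {missing}")
        if item["duration_s"] <= 0:
            raise ValueError(f"scene {i} has a non-positive duration")
        scenes.append(Scene(**{k: item[k] for k in REQUIRED}))
    return scenes


# An invented two-scene plan, standing in for what the director might write.
raw_plan = json.dumps({"scenes": [
    {"scene_id": "s1", "visual": "fog over still water", "model": "video-model",
     "duration_s": 6.0, "audio": "low drone, distant wind"},
    {"scene_id": "s2", "visual": "hand-drawn spiral", "model": "still-model",
     "duration_s": 4.0, "audio": "soft room tone"},
]})

plan = parse_plan(raw_plan)
print(len(plan), "scenes,", sum(s.duration_s for s in plan), "seconds")
```

Validating the plan before any generation starts means a malformed scene is caught while it is still cheap to fix.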

Generate. The plan is dispatched in parallel: to fal.ai for video and stills, and to ElevenLabs for sound design. Each scene is its own asset, cached on disk. If a scene fails to generate, the run continues with the assets it has.
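The three properties in this stage (parallel dispatch, per-scene caching, and carrying on after a failure) can be sketched with the standard library alone. The generate callable below is a stand-in for whatever actually calls fal.ai or ElevenLabs; the cache layout, the function names, and the worker count are assumptions, not the pipeline's real code.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, scene: dict) -> Path:
    """Derive a filename from the scene's content, so an unchanged scene hits the cache."""
    digest = hashlib.sha256(json.dumps(scene, sort_keys=True).encode()).hexdigest()[:16]
    return cache_dir / f"{scene['scene_id']}-{digest}.bin"


def generate_all(scenes: list[dict], generate: Callable[[dict], bytes],
                 cache_dir: Path, workers: int = 4):
    """Generate every scene in parallel and return (assets, failures)."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    def one(scene: dict) -> Path:
        path = cache_path(cache_dir, scene)
        if not path.exists():              # cache miss: call the generator
            path.write_bytes(generate(scene))
        return path

    assets, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(one, s): s["scene_id"] for s in scenes}
        for future in as_completed(futures):
            scene_id = futures[future]
            try:
                assets[scene_id] = future.result()
            except Exception as exc:       # one bad scene must not stop the run
                failures[scene_id] = str(exc)
    return assets, failures
```

Keying the cache on the scene's content rather than its position means a re-run only pays for scenes that changed or failed last time.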

Assemble. FFmpeg composes the final piece — concatenating clips, applying Ken Burns zooms to stills, ducking the source audio under ambient layers, mixing to broadcast loudness, embedding subtitles. The output is a draft I can watch end-to-end, then go back and refine scene by scene with feedback in natural language.
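Two of these steps map onto well-known FFmpeg filters: zoompan for the Ken Burns move and loudnorm for loudness. The sketch below only builds the command lines. The frame rate, output size, zoom amount, and the -23 LUFS target (the EBU R128 programme level) are assumptions about what "broadcast loudness" and a gentle zoom might mean here, not the pipeline's actual settings.

```python
def ken_burns_args(still: str, out: str, seconds: float, fps: int = 25,
                   size: str = "1920x1080", max_zoom: float = 1.2) -> list[str]:
    """Build an ffmpeg command that slowly zooms into the centre of a still."""
    frames = round(seconds * fps)
    step = (max_zoom - 1.0) / frames           # zoom added per output frame
    zoompan = (
        f"zoompan=z='min(zoom+{step:.6f},{max_zoom})'"
        f":x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)'"   # keep the zoom centred
        f":d={frames}:s={size}:fps={fps}"
    )
    return ["ffmpeg", "-y", "-i", still, "-vf", zoompan,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", out]


def loudnorm_args(src: str, out: str, target_lufs: float = -23.0) -> list[str]:
    """Build an ffmpeg command that normalises loudness with the loudnorm filter."""
    return ["ffmpeg", "-y", "-i", src,
            "-af", f"loudnorm=I={target_lufs}:TP=-1.0:LRA=7",
            "-c:v", "copy", out]


print(" ".join(ken_burns_args("scene_02.png", "scene_02.mp4", seconds=5)))
print(" ".join(loudnorm_args("draft.mp4", "draft_normalised.mp4")))
```

Each list can be handed to subprocess.run(args, check=True) on a machine with FFmpeg installed. Passing arguments as a list avoids shell quoting problems with the filter expressions.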

Why Build It

The interesting question with generative AI isn't "what can it speed up?" It's "what does it make possible that wasn't possible before?"

Most AI video tooling answers the first question. It's framed around production efficiency — faster cuts, cheaper stock, fewer revisions. That framing leaves the most interesting territory unexplored. What happens when the model isn't just generating a clip, but interpreting an idea? What happens when you delegate not the labour but the directing?

Mind in Transition is the experiment. The pipeline doesn't replace a step in an existing workflow — it stands up a workflow that didn't previously exist. A single person, sitting with a recording, can now have an AI system read what they said, take a creative position on it, and compose it into something watchable. That's not a faster version of filmmaking. It's a different relationship to it.

What's Genuinely New

The director moves up a level

I'm no longer choosing shots. I'm shaping the conditions under which shots get chosen — through the brief, the references, the constraints. The aesthetic decisions sit in the creative brief; the execution sits with the model.

The transcript becomes the source material

Recorded thinking — usually discarded as raw input — becomes the spine of the piece. The work is in how it's interpreted, not in what's said.

Refinement is conversational

When a scene doesn't land, I tell the system what's wrong in plain English and it rewrites that scene only. The iteration loop collapses from hours to minutes, which means the standard for "good enough" rises.
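"Rewrites that scene only" can be read as a small operation on the plan: swap one scene and leave the rest untouched. In this sketch the rewrite callable stands in for the model call that turns feedback into a revised scene; the function name and the dict-based scene shape are hypothetical.

```python
from typing import Callable


def refine_scene(plan: list[dict], scene_id: str, feedback: str,
                 rewrite: Callable[[dict, str], dict]) -> list[dict]:
    """Return a new plan in which only the named scene has been rewritten."""
    if not any(s["scene_id"] == scene_id for s in plan):
        raise KeyError(f"no scene with id {scene_id!r}")
    updated = []
    for scene in plan:
        if scene["scene_id"] == scene_id:
            revised = dict(rewrite(scene, feedback))
            revised["scene_id"] = scene_id    # the id stays stable across rewrites
            updated.append(revised)
        else:
            updated.append(scene)             # untouched scenes are reused as-is
    return updated


def fake_rewrite(scene: dict, feedback: str) -> dict:
    """Placeholder for the model: folds the feedback into the visual description."""
    return {**scene, "visual": f"{scene['visual']} ({feedback})"}


plan = [
    {"scene_id": "s1", "visual": "fog over still water"},
    {"scene_id": "s2", "visual": "hand-drawn spiral"},
]
new_plan = refine_scene(plan, "s2", "slower, less busy", fake_rewrite)
print(new_plan[1]["visual"])
```

Because the other scenes come back unchanged, any assets already generated for them can be reused, and only the revised scene needs regenerating.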

The machines still dream of nothing on their own. But pointed at a transcript, briefed properly, and held to a creative line — they can dream toward something.

Selected Output

The Shape of Internal Monologue — an early draft produced end-to-end by the pipeline.
