Turn an image + narration into a captioned zoom video

One still image, one audio file, one bash script → a Ken Burns zoom video with the narration muxed in and auto-generated captions burned on screen. All through the FFmpeg Micro API.

No brew install ffmpeg. No Whisper Docker image. No Python ML stack. Just bash + curl + python3 (all preinstalled on macOS and Linux) and a free API key.

Download Resources for This Video

Get the bash script and setup guide — completely free.

What You'll Learn

  • Presigned-URL uploads — push bytes directly to GCS, no egress through our API
  • The /v1/transcribe endpoint — Whisper-powered SRT generation without hosting Whisper yourself
  • Chaining transcribe → transcode — the signed SRT URL drops straight into an FFmpeg subtitles= filter
  • Ken Burns zoom via zoompan — a still image becomes a moving video without a render farm
  • Audio mux with -shortest — keep the video as long as the voice, no manual trimming
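
The last three bullets meet in a single transcode request. A sketch of what that request body might look like — every field name below is an assumption on my part; the real schema lives in the API docs:

```json
{
  "inputs": ["gs://your-bucket/photo.jpg", "gs://your-bucket/narration.mp3"],
  "filter": "zoompan=z='min(zoom+0.0005,1.3)':d=1050:s=1280x720,subtitles='<signed-srt-url>'",
  "args": ["-map", "0:v", "-map", "1:a", "-shortest"],
  "output_format": "mp4"
}
```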

How the Script Works

One run of the script performs a six-step pipeline end-to-end:

  1. Upload the image — POST /v1/upload/presigned-url + PUT + POST /v1/upload/confirm
  2. Upload the audio — same flow, and the confirm response hands back the audio duration so the script knows how long to zoom for
  3. Transcribe — POST /v1/transcribe with the audio's gs:// URL, then poll until complete
  4. Fetch the signed SRT URL — GET /v1/transcribe/:id/download returns an HTTPS URL good for 10 minutes
  5. Render the final video — POST /v1/transcodes with one filter graph: zoompan + subtitles='<signed-url>', and audio mapped from the second input with -shortest
  6. Download the MP4 — GET /v1/transcodes/:id/download → curl the signed URL to disk
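
Step 3 is the only asynchronous part. A minimal polling sketch in bash — note the status-check endpoint (`GET /v1/transcribe/:id`) and its response shape (a top-level `status` field) are assumptions here, so check the API reference:

```shell
# Poll the transcription job until it finishes.
# ASSUMPTIONS: GET /v1/transcribe/:id exists and returns {"status": ...}.
poll_transcribe() {
  job_id="$1"
  while :; do
    status=$(curl -s -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
      "https://api.ffmpeg-micro.com/v1/transcribe/$job_id" |
      python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",""))')
    case "$status" in
      complete|completed) return 0 ;;                               # done
      failed|error)       echo "transcription failed" >&2; return 1 ;;
    esac
    sleep 2   # back off between polls
  done
}
```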

The whole pipeline is three API calls (transcribe, transcode, download) plus a pair of uploads. Whisper runs on our side, FFmpeg runs on our side, you never install either.
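
Steps 1 and 2 can be sketched as one bash function. The JSON field names (`upload_url`, `gs_url`) and the request payloads are assumptions — the real shapes come from the API docs:

```shell
# Upload one local file via a presigned GCS URL, then confirm it.
# ASSUMPTIONS: response fields upload_url/gs_url, and the confirm payload.
upload_file() {
  path="$1"
  api="https://api.ffmpeg-micro.com"
  resp=$(curl -s -X POST "$api/v1/upload/presigned-url" \
    -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "{\"filename\": \"$(basename "$path")\"}")
  put_url=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["upload_url"])')
  gs_url=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["gs_url"])')
  # Push the bytes straight to GCS: they never pass through the API itself.
  curl -s -X PUT --data-binary "@$path" "$put_url" >/dev/null
  # Confirm, so the API records the object (and, for audio, its duration).
  curl -s -X POST "$api/v1/upload/confirm" \
    -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "{\"gs_url\": \"$gs_url\"}" >/dev/null
  echo "$gs_url"   # caller captures the gs:// URL for the transcribe/transcode steps
}
```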

What You'll Need

  • 🔧 Bash + curl + python3 — already installed on macOS and Linux
  • 🔑 FFmpeg Micro API key — free tier, no credit card required
  • 🖼️ A still image — JPG or PNG, landscape works best for zoom
  • 🎙️ An audio file — MP3, WAV, or M4A with clear narration (music-only tracks transcribe to empty SRTs)

Cost Breakdown

Video processing minutes are billed per transcode job, rounded up — a 30-second clip consumes 1 billable minute, a 61-second clip consumes 2. Every clip in this pipeline is one transcode, so plan sizing scales directly with how many videos you render.

  • Free plan ($0/mo, 100 min): roughly 100 clips/month at ≤60s each
  • Starter plan ($19/mo, 2,000 min): roughly 2,000 clips/month
  • Pro plan ($89/mo, 12,000 min): roughly 12,000 clips/month
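
The rounding rule above is just a ceiling on seconds, which one line of shell arithmetic captures:

```shell
# Billable minutes for a clip: seconds rounded up to the next whole minute.
billable_minutes() { echo $(( ($1 + 59) / 60 )); }

billable_minutes 30   # → 1
billable_minutes 61   # → 2
```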

Check the FFmpeg Micro pricing page for the latest plan limits and features.

Run It — with sample assets (fastest)

The zip also includes fetch-demo-assets.sh, which pulls a sample image and narration from ffmpeg-micro.com/samples, so you can watch the pipeline run end-to-end before pointing it at your own content:

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./fetch-demo-assets.sh
./zoom-and-captions.sh sample-image.jpg sample-narration.mp3 demo.mp4
open demo.mp4

Run It — with your own content

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./zoom-and-captions.sh my-photo.jpg narration.mp3 output.mp4

Come back in ~60 seconds and output.mp4 is a 1280×720 H.264 + AAC file with your image slowly zooming, your voice playing, and captions crawling along the bottom in sync.

Tweaks

The script is short and readable — open it in your editor. Easy knobs:

  • Zoom speed: the zoom+0.0005 increment per frame — lower = slower zoom
  • Max zoom level: the min(zoom+..., 1.3) cap — raise for more dramatic end
  • Caption position: force_style='Alignment=2' — 2 = bottom center, 8 = top center, 1 and 3 = bottom-left and bottom-right corners
  • Output resolution: s=1280x720 in the filter — swap to 720x1280 for vertical Shorts / Reels / TikTok
  • Caption font size: Fontsize=22 — raise for more readable captions on mobile
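
Putting those knobs together, the filter string the script assembles looks something like this (variable names here are illustrative, not the script's actual ones, and the SRT URL is a placeholder):

```shell
DURATION=42   # narration length in seconds (the upload-confirm step reports it)
FPS=25
SRT_URL='https://storage.example/captions.srt'   # placeholder for the signed SRT URL

# zoompan zooms in 0.0005 per frame, capped at 1.3x, for DURATION*FPS frames;
# subtitles burns the SRT in, bottom-center, 22pt.
VF="zoompan=z='min(zoom+0.0005,1.3)':d=$((DURATION * FPS)):s=1280x720:fps=$FPS"
VF="$VF,subtitles='$SRT_URL':force_style='Alignment=2,Fontsize=22'"
echo "$VF"
```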

Related endpoints

See the FFmpeg API documentation for the full surface this script uses. Key endpoints:

  • POST /v1/upload/presigned-url — generate signed PUT URLs for direct-to-GCS uploads
  • POST /v1/transcribe — Whisper-powered audio → SRT
  • GET /v1/transcribe/:id/download — signed URL for the generated SRT
  • POST /v1/transcodes — multi-input FFmpeg pipeline with zoompan + subtitles filters

Skip the install chore

FFmpeg Micro runs FFmpeg and Whisper on our side so you don't have to. Send URLs, get finished videos.

Get a free API key