Turn an image + narration into a captioned zoom video

One still image, one audio file, one bash script → a Ken Burns zoom video with the narration muxed in and auto-generated captions burned on screen. All through the FFmpeg Micro API.

No brew install ffmpeg. No Whisper Docker image. No Python ML stack. Just bash + curl + python3 (all preinstalled on macOS and Linux) and a free API key.

Download Resources for This Video

Get the bash script and setup guide — completely free.

What You'll Learn

  • Presigned-URL uploads — push bytes directly to GCS, no egress through our API
  • The /v1/transcribe endpoint — Whisper-powered SRT generation without hosting Whisper yourself
  • Chaining transcribe → transcode — the signed SRT URL drops straight into an FFmpeg subtitles= filter
  • Ken Burns zoom via zoompan — a still image becomes a moving video without a render farm
  • Audio mux with -shortest — keep the video as long as the voice, no manual trimming
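
The last three bullets meet in a single transcode request. A sketch of what that request body might look like — every field name below is an assumption on my part; the real schema lives in the API docs:

```json
{
  "inputs": ["gs://your-bucket/photo.jpg", "gs://your-bucket/narration.mp3"],
  "filter": "zoompan=z='min(zoom+0.0005,1.3)':d=1050:s=1280x720,subtitles='<signed-srt-url>'",
  "args": ["-map", "0:v", "-map", "1:a", "-shortest"],
  "output_format": "mp4"
}
```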

How the Script Works

One run of the script performs a six-step pipeline end-to-end:

  1. Upload the image — POST /v1/upload/presigned-url + PUT + POST /v1/upload/confirm
  2. Upload the audio — same flow, and the confirm response hands back the audio duration so the script knows how long to zoom for
  3. Transcribe — POST /v1/transcribe with the audio's gs:// URL, then poll until complete
  4. Fetch the signed SRT URL — GET /v1/transcribe/:id/download returns an HTTPS URL good for 10 minutes
  5. Render the final video — POST /v1/transcodes with one filter graph: zoompan + subtitles='<signed-url>', and audio mapped from the second input with -shortest
  6. Download the MP4 — GET /v1/transcodes/:id/download → curl the signed URL to disk
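
Step 3 is the only asynchronous part. A minimal polling sketch in bash — note the status-check endpoint (`GET /v1/transcribe/:id`) and its response shape (a top-level `status` field) are assumptions here, so check the API reference:

```shell
# Poll the transcription job until it finishes.
# ASSUMPTIONS: GET /v1/transcribe/:id exists and returns {"status": ...}.
poll_transcribe() {
  job_id="$1"
  while :; do
    status=$(curl -s -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
      "https://api.ffmpeg-micro.com/v1/transcribe/$job_id" |
      python3 -c 'import json,sys; print(json.load(sys.stdin).get("status",""))')
    case "$status" in
      complete|completed) return 0 ;;                               # done
      failed|error)       echo "transcription failed" >&2; return 1 ;;
    esac
    sleep 2   # back off between polls
  done
}
```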

The whole pipeline is three API calls (transcribe, transcode, download) plus a pair of uploads. Whisper runs on our side, FFmpeg runs on our side, you never install either.
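
Steps 1 and 2 can be sketched as one bash function. The JSON field names (`upload_url`, `gs_url`) and the request payloads are assumptions — the real shapes come from the API docs:

```shell
# Upload one local file via a presigned GCS URL, then confirm it.
# ASSUMPTIONS: response fields upload_url/gs_url, and the confirm payload.
upload_file() {
  path="$1"
  api="https://api.ffmpeg-micro.com"
  resp=$(curl -s -X POST "$api/v1/upload/presigned-url" \
    -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "{\"filename\": \"$(basename "$path")\"}")
  put_url=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["upload_url"])')
  gs_url=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["gs_url"])')
  # Push the bytes straight to GCS: they never pass through the API itself.
  curl -s -X PUT --data-binary "@$path" "$put_url" >/dev/null
  # Confirm, so the API records the object (and, for audio, its duration).
  curl -s -X POST "$api/v1/upload/confirm" \
    -H "Authorization: Bearer $FFMPEG_MICRO_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "{\"gs_url\": \"$gs_url\"}" >/dev/null
  echo "$gs_url"   # caller captures the gs:// URL for the transcribe/transcode steps
}
```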

What You'll Need

  • 🔧 Bash + curl + python3 — already installed on macOS and Linux
  • 🔑 FFmpeg Micro API key — free tier, no credit card required
  • 🖼️ A still image — JPG or PNG, landscape works best for zoom
  • 🎙️ An audio file — MP3, WAV, or M4A with clear narration (music-only tracks transcribe to empty SRTs)

Cost Breakdown

Video processing minutes are billed per transcode job, rounded up — a 30-second clip consumes 1 billable minute, a 61-second clip consumes 2. Every clip in this pipeline is one transcode, so plan sizing scales directly with how many videos you render.

  • Free plan ($0/mo, 100 min): roughly 100 clips/month at ≤60s each
  • Starter plan ($19/mo, 2,000 min): roughly 2,000 clips/month
  • Pro plan ($89/mo, 12,000 min): roughly 12,000 clips/month
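
The rounding rule above is just a ceiling on seconds, which one line of shell arithmetic captures:

```shell
# Billable minutes for a clip: seconds rounded up to the next whole minute.
billable_minutes() { echo $(( ($1 + 59) / 60 )); }

billable_minutes 30   # → 1
billable_minutes 61   # → 2
```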

Check the FFmpeg Micro pricing page for the latest plan limits and features.

Run It — with sample assets (fastest)

The zip also includes fetch-demo-assets.sh, which pulls a sample image and narration from ffmpeg-micro.com/samples, so you can watch the pipeline run end-to-end before pointing it at your own content:

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./fetch-demo-assets.sh
./zoom-and-captions.sh sample-image.jpg sample-narration.mp3 demo.mp4
open demo.mp4

Run It — with your own content

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./zoom-and-captions.sh my-photo.jpg narration.mp3 output.mp4

Come back in ~60 seconds and output.mp4 is a 1280×720 H.264 + AAC file with your image slowly zooming, your voice playing, and captions crawling along the bottom in sync.

Tweaks

The script is short and readable — open it in your editor. Easy knobs:

  • Zoom speed: the zoom+0.0005 increment per frame — lower = slower zoom
  • Max zoom level: the min(zoom+..., 1.3) cap — raise for more dramatic end
  • Caption position: force_style='Alignment=2' — 2 = bottom center, 8 = top center, 1 and 3 = bottom-left and bottom-right corners
  • Output resolution: s=1280x720 in the filter — swap to 720x1280 for vertical Shorts / Reels / TikTok
  • Caption font size: Fontsize=22 — raise for more readable captions on mobile
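
Putting those knobs together, the filter string the script assembles looks something like this (variable names here are illustrative, not the script's actual ones, and the SRT URL is a placeholder):

```shell
DURATION=42   # narration length in seconds (the upload-confirm step reports it)
FPS=25
SRT_URL='https://storage.example/captions.srt'   # placeholder for the signed SRT URL

# zoompan zooms in 0.0005 per frame, capped at 1.3x, for DURATION*FPS frames;
# subtitles burns the SRT in, bottom-center, 22pt.
VF="zoompan=z='min(zoom+0.0005,1.3)':d=$((DURATION * FPS)):s=1280x720:fps=$FPS"
VF="$VF,subtitles='$SRT_URL':force_style='Alignment=2,Fontsize=22'"
echo "$VF"
```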

Related endpoints

See the FFmpeg API documentation for the full surface this script uses. Key endpoints:

  • POST /v1/upload/presigned-url — generate signed PUT URLs for direct-to-GCS uploads
  • POST /v1/transcribe — Whisper-powered audio → SRT
  • GET /v1/transcribe/:id/download — signed URL for the generated SRT
  • POST /v1/transcodes — multi-input FFmpeg pipeline with zoompan + subtitles filters

Skip the install chore

FFmpeg Micro runs FFmpeg and Whisper on our side so you don't have to. Send URLs, get finished videos.

Get a free API key