Turn Photos Into a Captioned Slideshow Video

Turn one or more still photos and a narration recording into a Ken Burns zoom video, or a multi-image slideshow with smooth cross-fades, with auto-generated captions burned in. All through the FFmpeg Micro API: no local FFmpeg, no Whisper install, no Python ML stack.


Download Resources for This Video

Get the scripts and setup guide, completely free.


What's in the Zip

The download includes two scripts that produce identical output via different paths, plus a helper that fetches sample assets, so you can understand exactly what's happening at every level.

  • zoom-and-captions.sh is the API version. It uploads images and audio to FFmpeg Micro, transcribes the narration with Whisper, and renders the final video. No local FFmpeg needed, just bash, curl, python3, and a free API key.
  • direct-ffmpeg-reference.sh runs the exact same zoompan + xfade + subtitles filter graph locally, so you can see what the API is executing on the server, tweak the filter, or run without an API key (requires FFmpeg with libass).
  • fetch-demo-assets.sh downloads sample images and narration so you can run either script immediately without sourcing your own content.

What You'll Learn

  • Presigned-URL uploads: push bytes directly to GCS, no egress through our API
  • The /v1/transcribe endpoint: Whisper-powered SRT generation without hosting Whisper yourself
  • Chaining transcribe and transcode: the signed SRT URL drops straight into an FFmpeg subtitles= filter
  • Ken Burns zoom via zoompan: a still image becomes a moving video without a render farm
  • Multi-image slideshows with xfade: alternate zoom-in and zoom-out per image, cross-fade between them, timed to split your audio evenly
  • Audio mux with -shortest: keep the video as long as the voice, no manual trimming

How the Script Works

One run of the script performs a six-step pipeline end-to-end:

  1. Upload each image with POST /v1/upload/presigned-url, a PUT of the bytes, and POST /v1/upload/confirm
  2. Upload the audio the same way. The confirm response returns the audio duration so the script knows how long the final video should be.
  3. Transcribe by calling POST /v1/transcribe with the audio's gs:// URL, then poll until complete
  4. Fetch the signed SRT URL with GET /v1/transcribe/:id/download, which returns an HTTPS URL good for 10 minutes
  5. Render the final video with POST /v1/transcodes and one filter graph: a zoompan per image, xfade between adjacent images (multi-image only), and subtitles='<signed-url>' on top; audio mapped from the last input with -shortest
  6. Download the MP4 with GET /v1/transcodes/:id/download, then curl the signed URL to disk

The whole pipeline is three API calls (transcribe, transcode, download) plus one upload per image and one for the audio. Whisper runs on our side, FFmpeg runs on our side, you never install either.
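The JSON plumbing behind those calls needs nothing beyond bash, curl, and python3. A sketch of the helper a poll loop might use (the response field names "id" and "status" are my assumptions about the API shape, not documented guarantees):

```shell
# Extract one top-level field from a JSON response on stdin (no jq needed)
json_field() {
  python3 -c 'import json,sys; print(json.load(sys.stdin)[sys.argv[1]])' "$1"
}

# In the real script this would wrap a curl call, e.g.:
#   job_id=$(curl -s -X POST "$API/v1/transcribe" ... | json_field id)
# Demonstrated here on a canned response:
STATUS=$(echo '{"id":"tr_123","status":"processing"}' | json_field status)
echo "$STATUS"
```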

What You'll Need

  • 🔧 Bash, curl, and python3: already installed on macOS and Linux
  • 🔑 FFmpeg Micro API key: free tier, no credit card required
  • 🖼️ One or more still images: JPG or PNG. Pass 2-5 for a slideshow with cross-fades.
  • 🎙️ An audio file: MP3, WAV, or M4A with clear narration. Music-only tracks transcribe to empty SRTs. Longer audio lets more images breathe between cross-fades.

Cost Breakdown

Video processing minutes are billed per transcode job, rounded up. A 30-second clip consumes 1 billable minute, a 61-second clip consumes 2. Every clip in this pipeline is one transcode, so cost scales directly with how many videos you render.
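The rounding rule is plain ceiling division, easy to sanity-check in shell:

```shell
# ceil(seconds / 60) billable minutes per transcode job
billable_minutes() { echo $(( ($1 + 59) / 60 )); }

billable_minutes 30   # 1 minute
billable_minutes 61   # 2 minutes
```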

  • Free plan ($0/mo, 100 min): roughly 100 clips/month at 60s or under
  • Starter plan ($19/mo, 2,000 min): roughly 2,000 clips/month
  • Pro plan ($89/mo, 12,000 min): roughly 12,000 clips/month

Check FFmpeg Micro pricing for the latest plan limits and features.

Run It: Sample Assets

Run fetch-demo-assets.sh to pull three sample images and narration from ffmpeg-micro.com/samples, then run either script:

./fetch-demo-assets.sh          # downloads sample images + narration

# Via FFmpeg Micro API (requires API key)
export FFMPEG_MICRO_API_KEY=sk_live_xxx

# Single-image Ken Burns
./zoom-and-captions.sh sample-image.jpg sample-narration.mp3 demo.mp4

# 3-image slideshow with cross-fades and auto-generated captions
./zoom-and-captions.sh sample-image-1.jpg sample-narration.mp3 slideshow.mp4 \
    sample-image-2.jpg sample-image-3.jpg

# Direct FFmpeg reference (no API key needed, requires FFmpeg with libass)
./direct-ffmpeg-reference.sh

Run It: Your Own Content

Single-image Ken Burns:

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./zoom-and-captions.sh my-photo.jpg narration.mp3 output.mp4

Multi-image slideshow with cross-fades. Any extra images after the output filename become segments of the slideshow:

export FFMPEG_MICRO_API_KEY=sk_live_xxx
./zoom-and-captions.sh a.jpg narration.mp3 slideshow.mp4 b.jpg c.jpg d.jpg

About 60 seconds later, slideshow.mp4 is a 1280x720 H.264 + AAC file with each image getting its own Ken Burns zoom (alternating in and out for variety), smooth 1-second cross-fades between images, your voice playing, and captions synced along the bottom.
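The even split behind that timing is simple: divide the narration length by the image count, and start each cross-fade one fade-length before the segment boundary. A sketch of the math, assuming a 60-second narration, three images, and 1-second fades (the script does the equivalent in python3 since durations are fractional):

```shell
TIMING=$(python3 -c '
dur, n, xf = 60.0, 3, 1.0           # narration seconds, image count, fade length
seg = dur / n                       # narration seconds per image
offsets = [round(i * seg - xf, 2) for i in range(1, n)]
print(seg, offsets)
')
echo "$TIMING"   # 20.0 [19.0, 39.0]
```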

Tweaks

The script is short and readable. Open it in your editor. Easy knobs:

  • Cross-fade duration: set via env var, e.g. XFADE_DURATION=2.0 ./zoom-and-captions.sh ... Defaults to 1.0s. Must be shorter than the per-image segment.
  • Max zoom level: the 1.3 constant in the z_expr lines. Raise it for more dramatic zoom, lower it for subtler motion.
  • Zoom direction pattern: even-indexed images zoom in, odd-indexed zoom out. Reorder your image args to change the rhythm.
  • Caption position: force_style='Alignment=2'. 2 is bottom center, 8 is top center, 1 and 3 are the bottom corners.
  • Output resolution: s=1280x720 in the filter. Swap to 720x1280 for vertical Shorts, Reels, or TikTok.
  • Caption font size: Fontsize=22. Raise it for more readable captions on mobile.
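To see how those knobs fit together, here is a sketch of a two-image filter graph assembled from them. Variable names mirror the knobs above, but the zoom expressions, segment length, and xfade offset are illustrative guesses; the script's actual graph may differ:

```shell
XFADE_DURATION=${XFADE_DURATION:-1.0}
MAX_ZOOM=1.3      # max zoom level
ALIGNMENT=2       # caption position: bottom center
FONTSIZE=22       # caption font size
SIZE=1280x720     # output resolution
SEG_FRAMES=125    # hypothetical: 5 s per image at 25 fps

# image 0 zooms in, image 1 zooms out, then cross-fade and burn captions
FILTER="[0:v]zoompan=z='min(zoom+0.0015,${MAX_ZOOM})':d=${SEG_FRAMES}:s=${SIZE}[v0];"
FILTER="${FILTER}[1:v]zoompan=z='if(eq(on,1),${MAX_ZOOM},max(zoom-0.0015,1.0))':d=${SEG_FRAMES}:s=${SIZE}[v1];"
FILTER="${FILTER}[v0][v1]xfade=transition=fade:duration=${XFADE_DURATION}:offset=4[xf];"
FILTER="${FILTER}[xf]subtitles=captions.srt:force_style='Alignment=${ALIGNMENT},Fontsize=${FONTSIZE}'"
echo "$FILTER"
```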


Related endpoints

See the FFmpeg API documentation for the full surface this script uses. Key endpoints:

  • POST /v1/upload/presigned-url: generate signed PUT URLs for direct-to-GCS uploads
  • POST /v1/transcribe: Whisper-powered audio transcription to SRT
  • GET /v1/transcribe/:id/download: signed URL for the generated SRT
  • POST /v1/transcodes: multi-input FFmpeg pipeline with zoompan and subtitles filters

Skip the install chore

FFmpeg Micro runs FFmpeg and Whisper on our side so you don't have to. Send URLs, get finished videos.

Get a free API key