Lesson 6 of 8

Subtitles and Captions

Add subtitles to video in two different ways: burned-in pixels or soft tracks. Learn SRT, ASS, force_style, and the three gotchas that catch everyone.

Subtitles are one of those things that look simple from the outside. You have a video. You have some text. You want the text to appear at the right moments on top of the video. How hard could it be?

In practice, subtitles are one of the most quietly complicated topics in video processing. There are two completely different ways to attach them, three different file formats in common use, a styling language with its own quirks, and a handful of cross-platform gotchas that have been frustrating developers since the early 2000s. This lesson walks you through all of it, in the order you actually need to understand it.

By the end you will know how to choose between burned-in and soft subtitles for any given project, how to style burned-in captions to match a brand or platform, when to graduate from SRT files to ASS files, and how to dodge the three traps that catch almost every first-timer.

A Quick Mental Model: What Is a Video File, Really?

Before we talk about how to add subtitles, it helps to know what a video file actually is on disk. A file like vacation.mp4 is not a single stream of "video." It is a container, which is a box that holds multiple streams inside it. A typical MP4 holds at least one video stream and one audio stream. But containers can also hold subtitle streams, alternate audio tracks (English, Spanish, director's commentary), chapter markers, and metadata.

When you play an MP4 in QuickTime or VLC, the player opens the container, looks at what streams are inside, and decides what to render. If there is a subtitle stream and your player supports it, you can turn captions on and off, and the same file can carry English, Spanish, and Japanese subtitles all at once. The viewer picks.
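You can see the streams for yourself with ffprobe, the inspection tool that ships alongside FFmpeg. A quick sketch, using the vacation.mp4 from above:

```shell
# List every stream in the container, one line per stream:
# index, codec name, and stream type.
ffprobe -v error \
  -show_entries stream=index,codec_name,codec_type \
  -of csv=p=0 \
  vacation.mp4
```

A plain video prints two lines, something like 0,h264,video and 1,aac,audio. A file carrying soft subtitles shows a third, such as 2,subrip.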

This idea, that a video file is a container holding separate streams, is the foundation for understanding the two ways to add subtitles. With soft subtitles, you add a new subtitle stream to the container. The video pixels are untouched. With burned-in subtitles, you skip the container entirely and paint the text directly onto each frame of the video. There is no separate subtitle stream because the captions have become part of the video itself.

Both approaches are useful. They solve different problems. The next section walks through how to choose between them.

Burned-In vs Soft Subtitles: How to Choose

Burned-in subtitles are permanent. Once the captions are painted onto the video frames, no player and no setting can turn them off. The text is just part of the picture, in the same way that the speaker's face is part of the picture. The downside is obvious: you cannot remove them, swap languages, or fix a typo without re-rendering the whole video. The upside is that they are guaranteed to show up everywhere. Every player on every device will display them, because there is nothing special to display. They are just video.

Soft subtitles are the opposite. The text and timing are stored in a separate track inside the container, and the player decides whether to render them. Viewers can toggle them on or off. You can carry multiple languages in the same file and let the viewer choose. You can fix typos by replacing the subtitle track without re-rendering the video. The catch is that you depend on the player supporting subtitle tracks for that container format, and you depend on the file not being passed through any system that strips subtitle tracks.

Here is the rule of thumb that will get you the right answer ninety percent of the time.

If the video is destined for a social platform (TikTok, Instagram Reels, YouTube Shorts, X, LinkedIn), burn the subtitles in. Those platforms either strip subtitle tracks during ingestion or display them inconsistently, and roughly eighty-five percent of viewers watch with sound off, so captions are not optional. They have to be there, visible, unconditionally.

If the video is destined for a player that respects subtitle tracks (YouTube long-form, a video on your own website, an MKV file shared directly), keep the subtitles soft. You preserve viewer choice, you skip the cost and quality loss of re-encoding the video, and you can ship multiple languages without producing multiple files.

If you are not sure where the video will end up, default to burn-in. It is the safer of the two choices because it always works.

                           Burn-In                            Soft Subtitles
Where the text lives       Rendered into the video pixels     Stored as a separate track
Can be turned off?         No, they are part of the picture   Yes, the viewer toggles them
Works on every platform?   Yes, it is just video              Depends on player and container
Re-encodes the video?      Yes, slow, some quality loss       No, fast, lossless copy
Multiple languages?        No, one set of captions            Yes, multiple subtitle tracks
Best for                   Social, autoplay-muted contexts    Long-form video, accessibility

Soft Subtitles in Practice

Soft subtitles are the easier of the two cases, so we will start here. You already have a video file. You already have a subtitle file (typically an SRT, which we will explain in detail in a moment). All you need to do is mux them together, which is the technical term for combining streams into a single container without changing the streams themselves.

Adding a Subtitle Track to an MKV

MKV (Matroska) is the most flexible container for subtitles because it accepts almost any subtitle format directly:

ffmpeg -i input.mp4 -i subtitles.srt \
  -c copy -c:s srt \
  output.mkv

The flag -c copy is the important one. It tells FFmpeg to copy the video and audio streams as-is, without re-encoding them. This is what makes the operation fast (often under a second, even for a long video) and what makes it lossless. You are not changing the picture or the sound. You are just adding a third stream alongside them.

The flag -c:s srt says "use SRT as the subtitle codec." MKV accepts this directly, so the SRT text gets stored in the container with no conversion.

Adding a Subtitle Track to an MP4

MP4 is more particular. It does not accept SRT directly. Instead it uses a format called mov_text, which is the subtitle codec defined in the original QuickTime spec. FFmpeg converts the SRT to mov_text on the fly:

ffmpeg -i input.mp4 -i subtitles.srt \
  -c copy -c:s mov_text \
  output.mp4

This works in QuickTime, VLC, and most modern web video players. But some players and platforms silently ignore mov_text tracks, which is one reason MKV is the preferred container when you have a choice.

Tagging the Subtitle Language

If you are shipping multi-language content, you want the player to label the tracks correctly:

ffmpeg -i input.mp4 -i english.srt -i spanish.srt \
  -map 0:v -map 0:a -map 1 -map 2 \
  -c copy -c:s srt \
  -metadata:s:s:0 language=eng \
  -metadata:s:s:1 language=spa \
  output.mkv

This is the most complicated soft-subtitle command in the lesson. The three -i flags name the three input files: the video, the English subtitles, the Spanish subtitles. The -map flags say which streams to include in the output and in what order. 0:v and 0:a mean "the video and audio streams from input zero," which is the source MP4. -map 1 adds the English SRT as the first subtitle stream. -map 2 adds the Spanish SRT as the second.

The -metadata:s:s:0 language=eng flag attaches a metadata tag to the first subtitle stream (s:s:0 means "subtitle stream, position zero") telling the player it is English. The same pattern with s:s:1 tags the second as Spanish. The codes are from a standard called ISO 639-2: eng, spa, fra, por, jpn, and so on.

When a viewer opens the resulting file, their player shows a language menu with "English" and "Spanish" as choices.
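You can confirm the tags took hold before shipping the file. A small sketch with ffprobe, assuming the output.mkv produced by the command above:

```shell
# Show each subtitle stream's index and its language tag.
ffprobe -v error \
  -select_streams s \
  -show_entries stream=index:stream_tags=language \
  -of csv=p=0 \
  output.mkv
```

You should see one line per subtitle track, something like 2,eng and 3,spa.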

Burned-In Subtitles in Practice

Burning in is where most of the complexity lives. To burn subtitles into a video, FFmpeg has to read the subtitle file, parse the timing and styling, render each line as pixels, and composite those pixels onto each frame of the video at the right moments. The library that handles all this is called libass. You do not interact with libass directly, but it is the engine behind every subtitle burn-in operation, and knowing it exists will save you time when something goes wrong.

The filter that invokes libass is called subtitles. The minimum command is short:

ffmpeg -i input.mp4 -vf "subtitles=subtitles.srt" output.mp4

That is the whole thing. The -vf flag means "video filter," and subtitles= tells FFmpeg to apply the subtitles filter using the file you named. FFmpeg parses the SRT, picks a default style (white text with a black outline at the bottom of the frame), and renders each caption onto the appropriate frames. The output is a new MP4 with the captions baked in.

Re-encoding is unavoidable here. The pixels of the video have changed (they now include text), so the video stream has to be re-compressed. This is slow compared to soft-subtitle muxing. A sixty-minute 1080p video can take ten to twenty minutes on a typical laptop. We will talk about how to handle that at scale at the end of the lesson.

A Quick Tour of the SRT Format

SRT (SubRip Text) is the simplest subtitle format in common use, and it is the one you will encounter most often. A typical SRT file looks like this:

1
00:00:01,000 --> 00:00:04,000
First subtitle line.

2
00:00:05,500 --> 00:00:08,000
Second line.
Can span two lines.

Each subtitle entry has four parts: an index number, a timestamp range, the text itself, and a blank line to separate it from the next entry. The timestamp format is HH:MM:SS,mmm, where the comma before the milliseconds is required. Many editors will save it as a period instead, which breaks libass. If your captions seem to be appearing at random times or not at all, check the timestamps first.
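If an editor has already saved the timestamps with periods, a one-line sed pass can restore the commas without touching the caption text. A sketch, assuming GNU or BSD sed with the -E flag:

```shell
# Rewrite HH:MM:SS.mmm to HH:MM:SS,mmm on timing lines only.
# The /-->/ address confines the substitution to timestamp lines,
# so periods inside the caption text are left alone.
sed -E '/-->/ s/([0-9]{2}:[0-9]{2}:[0-9]{2})\.([0-9]{3})/\1,\2/g' \
  broken.srt > fixed.srt
```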

SRT has no built-in styling. The format carries timing and text and nothing else. If you want bold, color, position, or anything beyond default white text on a black outline, you have two choices: use force_style to override the defaults (next section), or use a richer format like ASS (the section after).

Styling Burned-In Captions with force_style

The force_style parameter on the subtitles filter lets you override how libass renders each line, without modifying the SRT file. The keys come from the ASS (Advanced SubStation Alpha) styling language, which we will meet properly in a moment, but for now you can think of them as a set of named knobs you can turn.

Here is a command that styles the captions in a YouTube-Shorts style, with white text in a semi-transparent black box, positioned eighty pixels up from the bottom:

ffmpeg -i input.mp4 -vf \
  "subtitles=subtitles.srt:force_style='Fontname=Arial,Fontsize=28,PrimaryColour=&H00FFFFFF,BackColour=&H80000000,BorderStyle=3,Outline=4,MarginV=80,Alignment=2'" \
  output.mp4

Two things in that command surprise everyone the first time they see it. Both deserve an explanation.

First, the color values are in BGR order, not RGB. Most modern color systems put red first, then green, then blue. ASS, the styling language libass uses, puts blue first, then green, then red. The string &H00FF0000 is not red. It is blue. Pure red is &H000000FF. This is a legacy choice from SubStation Alpha, an early 2000s subtitle tool that ASS evolved from, and we are all stuck with it forever.

Second, the leading &H00 is an alpha (transparency) byte. The two hex digits after &H set the transparency, where 00 is fully opaque and FF is fully transparent. So &H80000000 is fifty-percent transparent black, which is what you want for a readable caption background that does not fully block the video.
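If the BGR flip keeps tripping you up, a tiny helper can convert a familiar RRGGBB hex value into ASS's &HAABBGGRR form. A sketch in plain POSIX shell; the function name is ours, and the alpha defaults to 00, fully opaque:

```shell
# rgb_to_ass RRGGBB [AA]  ->  &HAABBGGRR
rgb_to_ass() {
  rgb=$1
  alpha=${2:-00}                          # default: fully opaque
  r=$(printf '%s' "$rgb" | cut -c1-2)
  g=$(printf '%s' "$rgb" | cut -c3-4)
  b=$(printf '%s' "$rgb" | cut -c5-6)
  printf '&H%s%s%s%s\n' "$alpha" "$b" "$g" "$r"   # reverse to BGR
}

rgb_to_ass FF0000       # pure red    -> &H000000FF
rgb_to_ass 000000 80    # 50% black   -> &H80000000
```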

Here are the styling keys you will use most often:

Key             What it controls
Fontname        The font family. Must be installed on the system running FFmpeg.
Fontsize        Font size relative to the script's reference resolution (PlayResY in an ASS file; FFmpeg assumes 288 for raw SRT). For 1080p output and a 1080-tall reference, 24-32 reads cleanly.
PrimaryColour   The fill color of the text itself.
OutlineColour   The color of the outline stroke around the text.
BackColour      The background box color, used when BorderStyle=3.
Outline         Thickness of the outline stroke in pixels.
BorderStyle     1 for outline only, 3 for a filled background box.
Alignment       Position on the frame. See the grid below.
MarginV         Vertical margin from the top or bottom edge, in pixels.
Bold            -1 for bold, 0 for regular. (Yes, -1 means "yes." Do not ask.)

The Alignment value uses a layout based on the numeric keypad on a keyboard. Imagine the keypad as a three-by-three grid overlaid on the video frame:

7 8 9     top-left       top-center       top-right
4 5 6     middle-left    middle-center    middle-right
1 2 3     bottom-left    bottom-center    bottom-right

Alignment=2 is bottom-center, which is the default for traditional captions. Alignment=8 is top-center, useful for talking-head videos where the speaker is in the lower half of the frame. Alignment=5 is the dead center of the screen, used for things like title cards.

ASS Files: When SRT Is Not Enough

SRT is fine if every caption in your video looks the same. The moment you need two captions to look different (different speakers in different colors, a label that should appear in the top-right while normal captions stay at the bottom, a title card with a custom font), you need a richer format.

That format is ASS, short for Advanced SubStation Alpha. ASS is a text file like SRT, but instead of just listing timestamps and text, it includes a style definition block and a more structured event block. Each event can reference a named style, or override styling inline. The format is verbose, but it gives you per-line control over everything libass can render.

A minimal ASS file looks like this:

[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, BackColour, Bold, BorderStyle, Outline, Alignment, MarginV
Style: Default,Arial,40,&H00FFFFFF,&H00000000,&H80000000,-1,3,0,2,80
Style: SpeakerA,Arial,40,&H0000FFFF,&H00000000,&H80000000,-1,3,0,2,80

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:01.00,0:00:04.00,Default,,0,0,0,,Welcome to the show.
Dialogue: 0,0:00:05.50,0:00:08.00,SpeakerA,,0,0,0,,Speaker A talks in yellow.

The [Script Info] section sets the reference resolution. PlayResX and PlayResY should match the resolution of the video you are burning into, because font sizes are interpreted relative to this resolution. If they are off, your text will be the wrong size.

The [V4+ Styles] section defines named styles. The Default style here is white text on a fifty-percent-black box. The SpeakerA style is the same but with yellow text (&H0000FFFF, which is BGR for yellow because R and G are both FF). You can define as many styles as you want.

The [Events] section lists the actual captions, each one referencing a named style. The first event uses Default. The second uses SpeakerA and will render in yellow.

Burning an ASS file in is the same command as for SRT, just with a different file extension:

ffmpeg -i input.mp4 -vf "subtitles=subtitles.ass" output.mp4

You do not need force_style because the styling is already in the file.

If you have an SRT file and you want to start customizing per-line styles, FFmpeg can convert it to ASS for you:

ffmpeg -i subtitles.srt subtitles.ass

The resulting ASS file has a default style block and a converted events block. Open it in a text editor, add your styles, save, and burn.

The Three Gotchas Almost Everyone Hits

These three problems are responsible for the vast majority of "my subtitle command is not working" questions on Stack Overflow. They have nothing to do with FFmpeg itself. They are problems at the boundary between FFmpeg and the operating system, and once you know about them, you will save yourself hours over the course of your career.

Gotcha 1: The Font You Named Is Not Installed

When you write Fontname=Inter in a force_style argument, libass goes looking for a font called Inter on the system running FFmpeg. On Linux it does this through a system called fontconfig, which maintains an index of all the fonts available. If Inter is not installed, fontconfig returns its best guess, which is usually a generic sans-serif font that looks nothing like Inter.

The really frustrating part is that libass does this silently. There is no error. The video renders successfully. The captions just look wrong.

To check what fonts fontconfig can see:

fc-list | grep -i inter

If nothing comes back, the font is not installed. On macOS you can install most popular fonts through Homebrew (brew install --cask font-inter). On a Linux server (the most common production setup), you will need to install the font package (apt install fonts-inter on Debian or Ubuntu) and then run fc-cache -fv to rebuild the index.

If you are running FFmpeg in a minimal Docker image (Alpine-based or a -slim variant), the container often ships with no fonts beyond a basic fallback. Either install fonts in your Dockerfile (and re-run fc-cache -fv) or use a base image that already has them.
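For the Docker case, the install and cache rebuild belong in the image build. A minimal Debian-based sketch; the base image tag and the fonts-inter package are illustrative, so substitute whatever fonts your styles actually name:

```dockerfile
FROM debian:bookworm-slim

# ffmpeg from the Debian repos is built with libass; fonts-inter and
# fontconfig give libass a real font to resolve Fontname=Inter against.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ffmpeg fonts-inter fontconfig \
 && fc-cache -fv \
 && rm -rf /var/lib/apt/lists/*
```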

Gotcha 2: Windows Paths Break the Filter Syntax

FFmpeg's filter language uses the colon character (:) as a separator between filter options. This is fine on macOS and Linux, where paths look like /home/me/subs.srt. But on Windows, paths look like C:\Users\me\subs.srt, and FFmpeg's parser sees that first colon (the one after C) as a filter option separator. The parser gets confused and the command fails with a cryptic error.

The fix is awkward but reliable. Escape the colon with a backslash, replace backslashes in the rest of the path with forward slashes, and wrap the whole path in single quotes:

ffmpeg -i input.mp4 -vf "subtitles='C\\:/Users/me/subs.srt'" output.mp4

Yes, this is ugly. Yes, it bothers everyone. No, there is not a cleaner way, because the alternative would require breaking backward compatibility with every existing FFmpeg filter chain in the world.

If you are working on Windows, the cleanest workaround is to cd into the directory containing the subtitle file first, so you can refer to it with just a filename and avoid the absolute path entirely:

cd C:\Users\me
ffmpeg -i video.mp4 -vf "subtitles=subs.srt" output.mp4

On macOS and Linux, this gotcha does not apply. Use absolute or relative paths as you would for any other file.

Gotcha 3: Non-UTF-8 Encoding Mangles Special Characters

Your SRT file is, fundamentally, a text file. Like any text file, it has an encoding, which is the system used to translate the bytes on disk into the characters you see when you open the file. UTF-8 is the modern standard. Almost everything written today is in UTF-8. But subtitle files often come from older tools, legacy archives, or Windows applications that default to other encodings like Windows-1252 (Latin-1) or UTF-8 with a byte-order mark (BOM).

libass expects clean UTF-8. If your file is in a different encoding, accented characters and non-Latin scripts will show up as garbage in the rendered captions. An é becomes Ã©, a question mark, or a blank.

The fix is to convert the file:

iconv -f WINDOWS-1252 -t UTF-8 broken.srt > fixed.srt

The -f flag is the source encoding (use whichever one your file is actually in), and -t UTF-8 is the target. After this, the captions render correctly.

If you are not sure what encoding the file is in, the file command on macOS and Linux can usually tell you:

file subtitles.srt

It will report something like UTF-8 Unicode text (good) or ISO-8859 text (needs conversion).
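Because file can also print the encoding in machine-readable form (-b --mime-encoding), the detect-and-convert step is easy to script. A sketch, with the caveat that file guesses from byte patterns and can misjudge very short files:

```shell
# Detect the encoding, then convert to UTF-8 only when needed.
enc=$(file -b --mime-encoding subtitles.srt)
if [ "$enc" != "utf-8" ] && [ "$enc" != "us-ascii" ]; then
  iconv -f "$enc" -t UTF-8 subtitles.srt > subtitles.utf8.srt
  mv subtitles.utf8.srt subtitles.srt
fi
```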

To prevent this problem when you control the source of the subtitles, save SRT files as UTF-8 without BOM in your editor. Most modern editors have an "encoding" or "save as" option that lets you pick this.

Combining Captions with the Watermark Techniques from Lesson 5

Subtitles compose with everything else you learned in the watermark lesson. You can layer a logo, a URL, and burned-in captions in a single command:

ffmpeg -i video.mp4 -i logo.png \
  -filter_complex \
  "[1:v]scale=100:-1[logo];[0:v][logo]overlay=W-w-20:20,subtitles=subs.srt:force_style='Fontsize=30,Alignment=2,MarginV=60,BackColour=&H80000000,BorderStyle=3'" \
  output.mp4

The order of filters matters. In the chain above, the logo is overlaid first, then the captions are rendered on top of that result. If you wanted the logo to appear on top of the captions instead, you would put the subtitles filter first.
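Spelled out, that reversed order looks like this: render the captions onto the base video first, then drop the logo on the result. A sketch; the intermediate stream labels [capt] and [logo] are arbitrary names:

```shell
# Captions first, logo second: the logo ends up on top.
ffmpeg -i video.mp4 -i logo.png \
  -filter_complex \
  "[0:v]subtitles=subs.srt[capt];[1:v]scale=100:-1[logo];[capt][logo]overlay=W-w-20:20" \
  output.mp4
```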

When You Should Stop Doing This Yourself

Burning subtitles into video is one of the slower FFmpeg operations because every frame is being re-encoded with new pixels. On a single laptop, a sixty-minute 1080p video can take ten to twenty minutes. If you are producing one or two videos a week, this is fine. Run it overnight.

But if you are running a content pipeline (a podcast that publishes daily clips, a coaching business that captions every workshop, a SaaS product that captions user-uploaded video), running FFmpeg on your own infrastructure becomes a real ongoing cost. You are paying for CPU time, you are maintaining a font catalog, you are handling the path-escaping and encoding problems we just covered, and you are keeping the rendering machine separate from your application server so subtitle jobs do not crash your web app.

This is the workload FFmpeg Micro was built to absorb. You send the same FFmpeg command you would run locally, just packaged as an HTTP request, and the rendering happens on managed infrastructure with fonts pre-installed and libass already configured.

The equivalent of the YouTube-Shorts-style burn-in command from earlier in this lesson, sent to the API:

curl -X POST https://api.ffmpeg-micro.com/v1/transcodes \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "inputs": [{ "url": "https://yourcdn.com/video.mp4" }],
  "outputFormat": "mp4",
  "options": [
    {
      "option": "-vf",
      "argument": "subtitles=https://yourcdn.com/captions.srt:force_style='Fontname=Arial,Fontsize=28,PrimaryColour=&H00FFFFFF,BackColour=&H80000000,BorderStyle=3,Outline=4,MarginV=80,Alignment=2'"
    }
  ]
}
EOF

The heredoc (<<'EOF') is what makes this copy-pasteable. The force_style value needs literal single quotes around it for FFmpeg's filter parser, and those single quotes inside a bash single-quoted -d '...' body would break the outer string. The heredoc sidesteps that — the single quotes inside become a plain part of the JSON.

You get a job ID back, poll for completion, and download the result. Same FFmpeg, no infrastructure to maintain.

Key Takeaways

  • Burn in for social, keep soft for players. Burned-in captions always show. Soft captions let viewers toggle and let you carry multiple languages.
  • Soft subs are fast. Use -c copy -c:s srt for MKV or -c:s mov_text for MP4. No re-encoding.
  • Burn-in uses the subtitles filter. Minimum command: ffmpeg -i input.mp4 -vf "subtitles=subs.srt" output.mp4.
  • Style with force_style. Override font, size, color, position. Remember colors are BGR, not RGB.
  • Graduate to ASS when you need per-line styling, multiple speakers in different colors, or anything beyond a single uniform look.
  • Watch for three gotchas: the font must be installed (fontconfig), Windows paths need escaping (C\\:), and SRT files must be UTF-8 without BOM.

Try This With FFmpeg Micro

Instead of running FFmpeg locally, you can use FFmpeg Micro's API to process videos in the cloud. No installation required.