On-device captions on macOS: what Apple's Speech framework actually does

On-device captions on macOS in 2026: Apple's Speech framework, what runs locally, where it falls back, and the three deliverables you get from one transcription pass.

In 2023 the easy choice was to upload your audio to Rev, Otter, Descript, or an API and let the cloud handle transcription. In 2026, on a recent Mac, the easy choice is to let the OS do it. The accuracy gap has closed for common languages; the privacy and speed gaps haven't.

What "on-device" means

"On-device captions" on macOS means the recorder hands the audio file to Apple's Speech framework — the same engine behind macOS dictation and Live Captions — and gets text back without sending the audio anywhere. No upload. No API key. No third party. The audio doesn't leave the Mac for any language the OS supports on-device.

The framework runs on the Neural Engine and GPU on Apple Silicon. On an M2 or later, transcribing a 5-minute audio file takes around 20–60 seconds in practice (roughly 5x to 15x faster than real time), depending on the language model and other system load. Model cold-start time matters less for batch jobs than it does for live captioning.

~30 s: typical transcription time for a 5-minute audio clip on an M-series Mac, run locally with no upload

What actually runs locally

Two things to know about Apple's Speech framework:

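  1. Languages have an on-device flag. The framework exposes a property (supportsOnDeviceRecognition on SFSpeechRecognizer) that tells the app whether the recognizer for the current language can run on-device. If yes, transcription can run locally, and the app can guarantee that by setting requiresOnDeviceRecognition on the request. If no, the audio is sent to Apple's Speech service.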
  2. The on-device list grows with macOS releases. English (multiple regions), Spanish, French, German, Italian, Portuguese, Mandarin, Cantonese, Japanese, Korean, and Arabic all have on-device support. The list is in the Voice Control settings panel.
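
The routing decision in the first point can be modelled in a few lines. This is a sketch of one possible recorder policy, not the framework's own logic: the function and field names are made up for illustration, and the two booleans stand in for the Swift properties supportsOnDeviceRecognition (on the recognizer) and requiresOnDeviceRecognition (on the request).

```python
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on-device"
    APPLE_SERVER = "apple-server"
    REFUSED = "refused"


def plan_request(supports_on_device: bool, allow_server_fallback: bool) -> dict:
    """Decide where one recording's audio would be processed.

    Hypothetical policy: pin the request to the device whenever a local
    model exists, and only use Apple's server if the user allowed it.
    """
    if supports_on_device:
        # Stand-in for setting requiresOnDeviceRecognition = true in Swift.
        return {"requires_on_device": True, "route": Route.ON_DEVICE}
    if allow_server_fallback:
        return {"requires_on_device": False, "route": Route.APPLE_SERVER}
    # No local model and no permission to upload: do not transcribe.
    return {"requires_on_device": True, "route": Route.REFUSED}
```

Pinning the request whenever a local model exists is the design choice that makes "the audio never leaves your Mac" a guarantee rather than a default.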

"If you record in a language with on-device support, the audio never leaves your Mac. If you record in a less common language, expect a fallback to Apple's server."

For most solo Mac creators recording in English, the audio never leaves the Mac in practice. For polyglot creators, it's worth knowing where the boundary is.

Language selection

The framework needs to know what language it's transcribing. There's no auto-detect — you set the language explicitly. A good recorder lets you:

  • Default to the system locale.
  • Override per recording (you're shooting a clip in Spanish today; pick it once).
  • Show on-device-supported languages first.
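
Those three behaviours fit in one small function. A minimal sketch, assuming the app already has the list of supported locale codes and the subset with on-device models; the function name and the sample codes in the usage note are illustrative only.

```python
def order_languages(supported, on_device, system_locale, override=None):
    """Return (selected, menu) for a per-recording language picker."""
    if override is not None and override not in supported:
        raise ValueError(f"unsupported language: {override}")
    # Default to the system locale, but only if the recognizer supports it.
    default = system_locale if system_locale in supported else None
    # A per-recording override wins over the default.
    selected = override if override is not None else default
    # On-device languages first, alphabetical within each group.
    menu = sorted(supported, key=lambda code: (code not in on_device, code))
    return selected, menu
```

For example, order_languages(["xx-XX", "en-US", "es-ES"], {"en-US", "es-ES"}, "en-US", override="es-ES") selects "es-ES" and lists "xx-XX" last. If selected comes back as None, the recorder has to ask, because there is no auto-detect to fall back on.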

If you set the wrong language, the output will be unusable. That's the failure mode to watch for, so do a quick scan after every transcription.

What you get from one pass

Three outputs from one transcription:

  • Burnt-in captions baked into the video file. Works on every platform, no caption track required. Used by the large majority of social viewers who watch on mute.
  • An .srt sidecar file. Upload to YouTube; YouTube uses your captions instead of its (worse) auto-generated ones, and they appear in search.
  • A plain transcript. Paste into your CMS for SEO. Skim before editing to find the parts to cut. Hand to your VA for re-purposing into blog copy.
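
The .srt sidecar and the plain transcript are two renderings of the same timed text. A rough sketch of that step, assuming the transcription pass has already been grouped into (start_seconds, end_seconds, text) cues; the tuple shape and function names are assumptions for illustration, not any recorder's actual code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SubRip expects."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues) -> str:
    """Render cues as numbered SubRip blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


def to_transcript(cues) -> str:
    """Join the cue text into one plain-text transcript."""
    return " ".join(text for _, _, text in cues)
```

Burnt-in captions would draw on the same cues, which is why one pass is enough for all three outputs.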

Three deliverables, one click, no upload. This is the part that cloud transcription services charge $0.20–$0.50 per minute for.
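
To put the per-minute rate in context, here is a back-of-envelope calculation. The rates are the ones quoted above; the monthly volume in the usage note is a made-up example.

```python
def cloud_cost_range(minutes, low_rate=0.20, high_rate=0.50):
    """Return the (low, high) cloud transcription cost in dollars."""
    return (round(minutes * low_rate, 2), round(minutes * high_rate, 2))
```

Forty 5-minute clips a month is 200 minutes, so cloud_cost_range(200) comes to between $40 and $100 a month.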

Where on-device still loses

Speaker diarisation — labelling "Speaker 1" vs "Speaker 2" — is still better on cloud services like Rev or AssemblyAI. Apple's Speech framework returns a single stream of text and doesn't label speakers. For a published two-host podcast transcript, cloud is still the path.

Real-time live captioning during recording is also more polished on macOS Live Captions than on most app-level implementations — most recorders process the audio file after the recording stops. If you need live captions during a stream, the OS feature is closer to what you want than a third-party app.

For everything else — single-speaker product demos, screen recordings with narration, talking-head clips, tutorials — on-device is the right default.

The privacy dividend

If you record an internal walkthrough that mentions a customer by name, uploading it to a cloud transcription service raises an arguable GDPR concern: you'd want to confirm that the vendor's processing terms cover that data and that the vendor offers an EU data-residency option. On-device, the question doesn't come up, because the data never left the laptop.

Where it ships

OBS doesn't ship it. Loom transcribes server-side, in the cloud. Screen.studio doesn't burn captions into the export by default. QuickTime doesn't transcribe at all. CursorFlow ships on-device transcription built on Apple's Speech framework, with burnt-in captions, .srt and transcript export, and a per-recording language picker. The v1.0 release notes have the longer feature list.

Frequently asked questions

Does on-device transcription work offline?
For languages with on-device support, yes — once macOS has downloaded the language asset, the transcription runs locally with no network. For languages without on-device support, the system falls back to Apple's Speech service, which requires a connection.
Which languages have on-device support on macOS?
The on-device language list expands every macOS release. As of 2026, English (multiple regions), Spanish, French, German, Italian, Portuguese, Mandarin, Cantonese, Japanese, Korean, and Arabic have on-device support, with more added each year. Check the system Voice Control settings to see what's installed.
Can I use the .srt file on platforms other than YouTube?
Yes. .srt is a widely supported caption format, accepted by Vimeo, LinkedIn, Facebook, TikTok, and most other video platforms. Burnt-in captions work everywhere regardless of caption support.
How accurate is the on-device transcriber on technical vocabulary?
Modest. Brand names, codebase identifiers, and uncommon technical terms are often guessed rather than recognised. The right workflow is: generate, scan the output for the words you care about, fix them inline, then export. The 30 seconds of scanning beats 90 minutes of re-recording.
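
The scan-and-fix step can be partly automated with a personal glossary of known mis-hearings. A minimal sketch; the function name and the glossary entries in the usage note are hypothetical, and a real list would come from your own transcripts.

```python
import re


def apply_glossary(transcript, glossary):
    """Replace known mis-hearings; return (fixed_text, counts_per_term)."""
    counts = {}
    for wrong, right in glossary.items():
        # Whole-word, case-insensitive match on the mis-heard phrase.
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
        # A function replacement keeps backslashes in `right` literal.
        transcript, n = pattern.subn(lambda _m, r=right: r, transcript)
        if n:
            counts[right] = counts.get(right, 0) + n
    return transcript, counts
```

With a glossary of {"cursor flow": "CursorFlow", "use effect": "useEffect"}, the sentence "Open cursor flow and call use effect." is rewritten with both terms corrected, and the counts show where the transcriber guessed. It doesn't replace the manual scan, but it shortens it.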