On-device captions on macOS: what Apple's Speech framework actually does
In 2023 the easy choice was to upload your audio to Rev, Otter, Descript, or an API and let the cloud handle transcription. In 2026, on a recent Mac, the easy choice is to let the OS do it. The accuracy gap has closed for common languages; the privacy and speed advantages of staying local have not.
What "on-device" means
"On-device captions" on macOS means the recorder hands the audio file to Apple's Speech framework — the same engine behind macOS dictation and Live Captions — and gets text back without sending the audio anywhere. No upload. No API key. No third party. The audio doesn't leave the Mac for any language the OS supports on-device.
The framework runs on the Neural Engine and GPU on Apple Silicon. On an M2 or later, transcribing a 5-minute audio file takes around 20–60 seconds in practice, depending on the language model and other system load. Cold-start matters less for batch jobs than it does for live captioning.
20–60 seconds: typical transcription time for a 5-minute audio clip on an M-series Mac, locally, with no upload
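As a quick sanity check, the sketch below converts the quoted range (20–60 seconds for a 5-minute clip) into a speed-up over real time. The function name is hypothetical; the figures are the ones quoted above, not new measurements.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


clip = 5 * 60  # the 5-minute clip from the text, in seconds

slow = realtime_factor(clip, 60)  # upper end of the quoted range
fast = realtime_factor(clip, 20)  # lower end of the quoted range
print(f"{slow:.0f}x to {fast:.0f}x faster than real time")
```

If that rate held for longer files, which is an assumption rather than something measured here, an hour of audio would take roughly 4 to 12 minutes.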
What actually runs locally
Two things to know about Apple's Speech framework:
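- Languages have an on-device flag. The framework exposes a property (`supportsOnDeviceRecognition` on `SFSpeechRecognizer`) that tells the app whether the current language can be recognized on-device. If yes, the app can set `requiresOnDeviceRecognition` on the request so that transcription runs locally. If no, the audio is sent to Apple's Speech service.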
- The on-device list grows with macOS releases. English (multiple regions), Spanish, French, German, Italian, Portuguese, Mandarin, Cantonese, Japanese, Korean, and Arabic all have on-device support. The list is in the Voice Control settings panel.
"If you record in a language with on-device support, the audio never leaves your Mac. If you record in a less common language, expect a fallback to Apple's server."
For most solo Mac creators recording in English, the practical upshot is that the audio never leaves the Mac. For polyglot creators, it's worth knowing where the boundary is.
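The boundary can be modelled as a small decision function. This is a Python model of the routing logic only, not the framework's API; `Route`, `choose_route`, and the `allow_server_fallback` policy flag are hypothetical names introduced for illustration.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "local"    # audio stays on the Mac
    SERVER = "server"  # audio goes to Apple's Speech service
    REFUSE = "refuse"  # the app declines rather than upload


def choose_route(on_device_available: bool, allow_server_fallback: bool) -> Route:
    """Decide where a recording gets transcribed (a model, not a framework call)."""
    if on_device_available:
        # The language has on-device support, so the audio never leaves the Mac.
        return Route.LOCAL
    # No on-device support: either accept the server fallback or stop.
    return Route.SERVER if allow_server_fallback else Route.REFUSE


print(choose_route(on_device_available=True, allow_server_fallback=False))
print(choose_route(on_device_available=False, allow_server_fallback=False))
```

A recorder that promises "no upload" would plausibly run with the fallback disabled and tell the user why a language was refused, rather than sending audio to a server silently.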
Language selection
The framework needs to know what language it's transcribing. There's no auto-detect — you set the language explicitly. A good recorder lets you:
- Default to the system locale.
- Override per recording (you're shooting a clip in Spanish today; pick it once).
- Show on-device-supported languages first.
If you set the wrong language, the output will be unusable; that's the failure mode to watch for. Do a quick scan of the text after every transcription.
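The three picker behaviours above could be sketched as follows. The locale codes and the `on_device` set are illustrative sample data, not the framework's actual list, and `order_languages` is a hypothetical helper; a real app would get both lists from the OS.

```python
def order_languages(supported, on_device, system_locale, override=None):
    """Return (selected, menu) for a per-recording language picker."""
    # On-device languages first, then the rest; alphabetical within each group.
    menu = sorted(supported, key=lambda code: (code not in on_device, code))
    # A per-recording override wins; otherwise default to the system locale.
    selected = override or system_locale
    if selected not in supported:
        # Fail loudly: a silently wrong language yields unusable output.
        raise ValueError(f"unsupported language: {selected}")
    return selected, menu


supported = ["es-ES", "en-US", "fi-FI", "ja-JP"]  # hypothetical sample
on_device = {"en-US", "es-ES", "ja-JP"}           # hypothetical sample

selected, menu = order_languages(supported, on_device, system_locale="en-US")
print(selected, menu)
```

Raising an error for an unsupported choice is a design decision in this sketch; it surfaces the problem before the transcription runs instead of after.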
What you get from one pass
Three outputs from one transcription:
- Burnt-in captions, rendered into the video frames. They work on every platform with no caption track required, and they reach the many social viewers who watch on mute.
- An .srt sidecar file. Upload to YouTube; YouTube uses your captions instead of its (worse) auto-generated ones, and they appear in search.
- A plain transcript. Paste into your CMS for SEO. Skim before editing to find the parts to cut. Hand to your VA for re-purposing into blog copy.
Three deliverables, one click, no upload. This is the part that cloud transcription services charge $0.20–$0.50 per minute for.
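To show how one pass can feed two of those outputs, the sketch below turns a list of timed segments into .srt text and a plain transcript. `Segment` is a hypothetical stand-in for whatever timed segments the recognizer returns; the cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the usual SubRip format.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the start of the recording
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render numbered SRT cues separated by blank lines."""
    cues = [
        f"{i}\n{srt_timestamp(s.start)} --> {srt_timestamp(s.end)}\n{s.text}\n"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n".join(cues)


def to_transcript(segments: list[Segment]) -> str:
    """Join segment text into one plain-text transcript."""
    return " ".join(s.text for s in segments)


segments = [
    Segment(0.0, 2.5, "Welcome to the demo."),
    Segment(2.5, 6.04, "Today we look at on-device captions."),
]
print(to_srt(segments))
print(to_transcript(segments))
```

Burnt-in captions need a video rendering step on top of the same segments, so they are left out of this sketch.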
Where on-device still loses
Speaker diarisation — labelling "Speaker 1" vs "Speaker 2" — is still better on cloud services like Rev or AssemblyAI. Apple's Speech framework returns a single stream of text and doesn't label speakers. For a published two-host podcast transcript, cloud is still the path.
Real-time live captioning during recording is also more polished on macOS Live Captions than on most app-level implementations — most recorders process the audio file after the recording stops. If you need live captions during a stream, the OS feature is closer to what you want than a third-party app.
For everything else — single-speaker product demos, screen recordings with narration, talking-head clips, tutorials — on-device is the right default.
The privacy dividend
If you record an internal walkthrough that mentions a customer by name, uploading it to a cloud transcription service arguably raises a GDPR concern: you'd want to confirm that the vendor's processing terms cover that data and that the vendor offers an EU residency option. On-device, the question doesn't come up, because the data never left the laptop.
Where it ships
OBS doesn't ship it. Loom transcribes server-side, in the cloud. Screen.studio doesn't burn captions into the export by default. QuickTime doesn't transcribe at all. CursorFlow ships on-device transcription built on Apple's Speech framework, with burnt-in captions, .srt and transcript export, and a per-recording language picker. The v1.0 release notes have the longer feature list.


