Episode #353

Speech Recognition and TTS

14 minutes
Published on September 6, 2018

With special guest Yono, we dive into the system for text-to-speech and speech recognition on iOS. Yono builds an app for language practice. Along the way we become familiar with AVAudioEngine, AVSpeechSynthesizer, and SFSpeechRecognizer from the Speech Framework.

This episode was authored by Yono Mittlefeldt.

Getting Started

Text-to-speech needs the AVFoundation framework, but speech recognition needs the Speech framework, which already imports AVFoundation. So we can just import the latter.

import Speech


For TTS, we need three components:

  • AVSpeechUtterance - the text we want to synthesize, plus a few properties that control how it is spoken.
  • AVSpeechSynthesisVoice - the voice to use for the speech. This is currently just defined by the BCP-47 language code, as each language has only one voice and one quality level.
  • AVSpeechSynthesizer - the object that synthesizes the speech.

So we can synthesize speech in a few easy lines of code:

let utterance = AVSpeechUtterance(string: "La synthèse vocale est très facile")
utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)
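The "few properties" on the utterance can be illustrated with a small helper. This is only a sketch: makeUtterance and its parameter names are hypothetical and not from the episode, while rate, pitchMultiplier, and volume are properties of AVSpeechUtterance. The rate is clamped to the framework's minimum and maximum constants.

```swift
import AVFoundation

// Hypothetical helper (not from the episode): builds an utterance and
// sets a few of the properties that control how it is spoken.
func makeUtterance(text: String,
                   languageCode: String,
                   rate: Float = AVSpeechUtteranceDefaultSpeechRate) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    // The initializer returns nil for an invalid language code; without a
    // voice, the synthesizer uses the system's default voice.
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)
    // Keep the rate inside the range the framework accepts.
    utterance.rate = min(max(rate, AVSpeechUtteranceMinimumSpeechRate),
                         AVSpeechUtteranceMaximumSpeechRate)
    utterance.pitchMultiplier = 1.0  // allowed range is 0.5 to 2.0
    utterance.volume = 1.0           // allowed range is 0.0 to 1.0
    return utterance
}
```

For a language practice app, a rate slightly below AVSpeechUtteranceDefaultSpeechRate might make the speech easier to follow. The synthesizer should be kept alive (for example, as a property) while it speaks.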

Speech Recognition


Before using speech recognition in an app, we need to request authorization.

SFSpeechRecognizer.requestAuthorization { (authStatus) in
    switch authStatus {
    case .authorized:
        print("User has authorized access to speech recognition!")
    case .denied:
        print("User denied access to speech recognition")
    case .restricted:
        print("Speech recognition restricted on this device")
    case .notDetermined:
        print("Speech recognition not yet authorized")
    }
}

We also need to include a privacy usage string for speech recognition in our Info.plist. The raw key for this privacy usage string is NSSpeechRecognitionUsageDescription.
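Since a missing usage string is easy to overlook, it can be worth checking for it during development before requesting authorization. The following is a minimal sketch using only Foundation. The helper speechUsageDescription(in:) is hypothetical and not from the episode; it takes the Info.plist dictionary as a parameter so that it can be tried with any dictionary.

```swift
import Foundation

// Hypothetical helper (not from the episode): returns the usage string
// if the given Info.plist dictionary contains a non-empty one.
func speechUsageDescription(in info: [String: Any]?) -> String? {
    guard let text = info?["NSSpeechRecognitionUsageDescription"] as? String,
          !text.isEmpty else {
        return nil
    }
    return text
}

// During development, check the app's own Info.plist.
if speechUsageDescription(in: Bundle.main.infoDictionary) == nil {
    print("Missing NSSpeechRecognitionUsageDescription in Info.plist")
}
```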

Important Components

Speech recognition has three important components:

  • SFSpeechRecognizer - a speech recognizer, which can only handle a single language
  • SFSpeechRecognitionRequest - a request to recognize speech from a particular audio source. To use the microphone as an audio source, we use SFSpeechAudioBufferRecognitionRequest. If we wanted to use a pre-recorded audio file, we would use SFSpeechURLRecognitionRequest instead.
  • SFSpeechRecognitionTask - a recognition task, with the ability to monitor the progress and cancel it, if desired.
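To show how the three components fit together, here is a sketch of the pre-recorded file variant, which needs no audio engine. The function transcribeFile and its parameters are hypothetical and not from the episode, and the sketch assumes that speech recognition has already been authorized.

```swift
import Speech

// Hypothetical helper (not from the episode): transcribes a recorded file
// and reports the best transcription, or nil if recognition fails.
@discardableResult
func transcribeFile(at url: URL,
                    localeIdentifier: String,
                    completion: @escaping (String?) -> Void) -> SFSpeechRecognitionTask? {
    // The initializer returns nil if the locale is not supported.
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: localeIdentifier)),
          recognizer.isAvailable else {
        completion(nil)
        return nil
    }

    let request = SFSpeechURLRecognitionRequest(url: url)
    // For a finished recording, only the final result is of interest.
    request.shouldReportPartialResults = false

    // The returned task can be used to monitor or cancel the recognition.
    return recognizer.recognitionTask(with: request) { result, error in
        if let result = result, result.isFinal {
            completion(result.bestTranscription.formattedString)
        } else if error != nil {
            completion(nil)
        }
    }
}
```

Returning the task leaves it to the caller to decide whether to keep it around for cancellation.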

Assume we have the following defined:

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR"))
let request = SFSpeechAudioBufferRecognitionRequest()

Audio Input

To capture audio data from a microphone, we need to install a tap on the input node of an AVAudioEngine object and append the audio buffer to the SFSpeechAudioBufferRecognitionRequest object.

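let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { (buffer, _) in
    request.append(buffer) // feed each captured buffer to the recognition request
}
audioEngine.prepare()
try audioEngine.start() // the tap only receives buffers while the engine runs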
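The capture steps can also be wrapped in one function that hands back a way to undo them. This is a sketch under assumptions: startCapture is a hypothetical name, not from the episode, and the code assumes microphone access has been granted. It uses only AVAudioEngine and SFSpeechAudioBufferRecognitionRequest calls: installTap, prepare, start, stop, removeTap, and endAudio.

```swift
import AVFoundation
import Speech

// Hypothetical helper (not from the episode): starts feeding microphone
// audio into the request and returns a closure that stops the capture.
func startCapture(engine: AVAudioEngine,
                  into request: SFSpeechAudioBufferRecognitionRequest) throws -> () -> Void {
    let inputNode = engine.inputNode
    let format = inputNode.outputFormat(forBus: 0)

    inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)
    }

    engine.prepare()
    do {
        try engine.start()
    } catch {
        // Leave the node clean if the engine could not be started.
        inputNode.removeTap(onBus: 0)
        throw error
    }

    return {
        engine.stop()
        inputNode.removeTap(onBus: 0)
        // Tell the request that no more audio will arrive.
        request.endAudio()
    }
}
```

Keeping the stop logic next to the start logic makes it harder to forget to remove the tap later.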

Create a Recognition Task and Monitor the Results

We can then create a recognition task for the recognizer using the recognition request. The result handler receives an optional SFSpeechRecognitionResult and an optional Error.

The SFSpeechRecognitionResult includes a list of transcriptions, sorted in descending order by confidence.

If the result is final, or if there's an error, we want to stop the audio engine and remove the tap.

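// The recognizer is optional, because its initializer fails for unsupported locales.
let task = recognizer?.recognitionTask(with: request) { result, error in
    if let result = result {
        // Transcriptions are sorted by confidence, most confident first.
        let allTranscriptions = result.transcriptions.map { $0.formattedString }
        print(allTranscriptions.joined(separator: "\n"))
    }

    let isFinal = result?.isFinal ?? false

    if error != nil || isFinal {
        audioEngine.stop()
        inputNode.removeTap(onBus: 0)
    }
}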
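For a language practice app, one possible use of the list of transcriptions is to check whether any of them matches the phrase the learner was asked to say. The sketch below is an assumption about how that could look, not the approach from the episode. The helpers normalized and firstMatch are hypothetical, use only Foundation, and compare strings after lowercasing and stripping punctuation.

```swift
import Foundation

// Hypothetical helper (not from the episode): lowercases the text and
// drops punctuation, so that "Bonjour !" and "bonjour" compare as equal.
func normalized(_ text: String) -> String {
    let words = text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
    return words.joined(separator: " ")
}

// Returns the index of the first candidate that matches the expected
// phrase after normalization, or nil if none does.
func firstMatch(for expected: String, in candidates: [String]) -> Int? {
    let target = normalized(expected)
    return candidates.firstIndex { normalized($0) == target }
}
```

The candidates could be the formattedString values of result.transcriptions. Because that list is sorted by confidence, a lower index means the recognizer was more confident in the matching transcription.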

This episode uses Xcode 9.4.1, Swift 4.1.