Speech Recognition and TTS

This episode was authored by Yono Mittlefeldt.

Episode Links

Speech Recognition API - WWDC 2016 Session 509. This session introduces the speech recognition API.

Getting Started

Text-to-speech needs the AVFoundation framework, and speech recognition needs the Speech framework. Because Speech already imports AVFoundation, importing Speech alone covers both.

import Speech

Text-to-Speech

For TTS, we need three components:

AVSpeechUtterance - the text to synthesize, plus a few properties that control how it is spoken.
AVSpeechSynthesisVoice - the voice to use for the speech. This is currently defined only by its BCP-47 language code, as each language has only one voice and one quality level.
AVSpeechSynthesizer - the object that synthesizes the speech.

So we can synthesize speech in a few easy lines of code:

let utterance = AVSpeechUtterance(string: "La synthèse vocale est très facile")
utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)
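
The "few properties" on AVSpeechUtterance mentioned above can be sketched as follows. The specific values are arbitrary choices for illustration, not recommendations from the episode:

```swift
import AVFoundation

// Keep a strong reference to the synthesizer so it is not deallocated mid-utterance.
let synthesizer = AVSpeechSynthesizer()

let utterance = AVSpeechUtterance(string: "La synthèse vocale est très facile")
utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")

// Rate must lie between AVSpeechUtteranceMinimumSpeechRate and
// AVSpeechUtteranceMaximumSpeechRate; here we speak slightly slower than the default.
utterance.rate = AVSpeechUtteranceDefaultSpeechRate * 0.9
utterance.pitchMultiplier = 1.1    // allowed range 0.5...2.0, default 1.0
utterance.volume = 0.8             // allowed range 0.0...1.0, default 1.0
utterance.preUtteranceDelay = 0.2  // seconds of silence before speaking

synthesizer.speak(utterance)
```

Calling speak(_:) again while an utterance is in progress queues the new utterance rather than interrupting the current one.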

Speech Recognition

Authorization

Before using speech recognition in an app, we need to request authorization.

SFSpeechRecognizer.requestAuthorization { (authStatus) in
    switch authStatus {
    case .authorized:
        print("User has authorized access to speech recognition!")
    case .denied:
        print("User denied access to speech recognition")
    case .restricted:
        print("Speech recognition restricted on this device")
    case .notDetermined:
        print("Speech recognition not yet authorized")
    @unknown default:
        // Covers any status added in a future SDK.
        print("Unknown speech recognition authorization status")
    }
}
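
The authorization handler is not guaranteed to run on the main thread, so any UI update belongs on the main queue. A minimal sketch, where updateRecordingAvailability(_:) is a hypothetical helper standing in for whatever UI code the app has:

```swift
import Foundation
import Speech

// Hypothetical helper; a real app might enable or disable a record button here.
func updateRecordingAvailability(_ enabled: Bool) {
    print(enabled ? "Recording enabled" : "Recording disabled")
}

SFSpeechRecognizer.requestAuthorization { authStatus in
    // Hop to the main queue before touching any UI state.
    DispatchQueue.main.async {
        updateRecordingAvailability(authStatus == .authorized)
    }
}
```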

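We also need to include a privacy usage string for speech recognition in our Info.plist. The raw key for this privacy usage string is NSSpeechRecognitionUsageDescription. Because we will record from the microphone, the app also needs a microphone usage string under the raw key NSMicrophoneUsageDescription.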

Important Components

Speech recognition has three important components:

SFSpeechRecognizer - a speech recognizer, which can only handle a single language
SFSpeechRecognitionRequest - a request to recognize speech from a particular audio source. To use the microphone as the audio source, we use SFSpeechAudioBufferRecognitionRequest. To use a pre-recorded audio file instead, we would use SFSpeechURLRecognitionRequest.
SFSpeechRecognitionTask - a recognition task, with the ability to monitor the progress and cancel it, if desired.
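
For the pre-recorded case, the three components fit together roughly like this. The file name "recording.m4a" is a made-up placeholder, and the sketch assumes authorization has already been granted:

```swift
import Foundation
import Speech

// Hypothetical resource name; replace it with an audio file that exists in your bundle.
if let url = Bundle.main.url(forResource: "recording", withExtension: "m4a"),
   let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR")) {
    let fileRequest = SFSpeechURLRecognitionRequest(url: url)

    // The returned task is ignored here; keep it if you need to cancel later.
    _ = recognizer.recognitionTask(with: fileRequest) { result, error in
        if let result = result, result.isFinal {
            print(result.bestTranscription.formattedString)
        } else if let error = error {
            print("Recognition failed: \(error.localizedDescription)")
        }
    }
}
```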

Assume we have the following defined:

// SFSpeechRecognizer(locale:) is a failable initializer, so recognizer is an optional.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR"))
let request = SFSpeechAudioBufferRecognitionRequest()
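
Since the initializer returns nil for unsupported locales, and a recognizer can be temporarily unavailable, it can help to check both before recording. A minimal sketch, where makeRecognizer(for:) is a hypothetical helper name:

```swift
import Foundation
import Speech

// Hypothetical helper: returns a recognizer only if the locale is supported
// and recognition is currently available.
func makeRecognizer(for identifier: String) -> SFSpeechRecognizer? {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: identifier)) else {
        print("Speech recognition does not support \(identifier)")
        return nil
    }
    guard recognizer.isAvailable else {
        print("The recognizer for \(identifier) is currently unavailable")
        return nil
    }
    return recognizer
}

let frenchRecognizer = makeRecognizer(for: "fr-FR")
```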

Audio Input

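To capture audio data from a microphone, we need to install a tap on the input node of an AVAudioEngine object and append each audio buffer to the SFSpeechAudioBufferRecognitionRequest object. The tap only receives audio once the engine has been started.

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)

// Forward every captured buffer to the recognition request.
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { (buffer, _) in
    request.append(buffer)
}

// Start the engine so the tap begins delivering buffers.
audioEngine.prepare()
do {
    try audioEngine.start()
} catch {
    print("Could not start the audio engine: \(error)")
}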
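
On iOS, the shared audio session usually has to be configured for recording before the engine is started. The episode notes do not cover this step, so the category, mode, and options below are one plausible setup and not the author's:

```swift
import AVFoundation

// Hypothetical helper; call it before starting the audio engine (iOS only).
func configureAudioSessionForRecording() throws {
    let session = AVAudioSession.sharedInstance()
    // Record-only session, tuned for measurement, lowering other audio while active.
    try session.setCategory(.record, mode: .measurement, options: .duckOthers)
    // Let other apps know when we deactivate so they can resume playback.
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}

do {
    try configureAudioSessionForRecording()
} catch {
    print("Could not configure the audio session: \(error)")
}
```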

Create a Recognition Task and Monitor the Results

We can then create a recognition task for the recognizer using the recognition request. The result handler receives an optional SFSpeechRecognitionResult and an optional Error.

The SFSpeechRecognitionResult includes a list of transcriptions, sorted in descending order by confidence.
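
When only the most likely transcription matters, the result also exposes bestTranscription directly, and each transcription is broken into segments that carry their own confidence. A small sketch, with describe(_:) as a hypothetical helper name:

```swift
import Speech

// Hypothetical helper: prints the most confident transcription and its segments.
func describe(_ result: SFSpeechRecognitionResult) {
    let best = result.bestTranscription
    print("Best guess: \(best.formattedString)")

    for segment in best.segments {
        // confidence ranges from 0.0 to 1.0; timestamp is in seconds from the start of the audio.
        print("\(segment.substring): confidence \(segment.confidence), at \(segment.timestamp)s")
    }
}
```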

If the result is final, or if there's an error, we want to stop the audio engine and remove the tap.

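// recognizer is optional because SFSpeechRecognizer(locale:) can fail.
let task = recognizer?.recognitionTask(with: request) { result, error in
    if let result = result {
        // Print every candidate transcription, most confident first.
        let allTranscriptions = result.transcriptions.map { $0.formattedString }
        print(allTranscriptions.joined(separator: "\n"))
    }

    let isFinal = result?.isFinal ?? false

    if error != nil || isFinal {
        // Recognition is over: stop capturing audio and detach the tap.
        audioEngine.stop()
        inputNode.removeTap(onBus: 0)
    }
}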
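
The task can also be ended from our side, for example when the user taps a stop button. A minimal sketch, where stopListening is a hypothetical helper and the choice between finishing and cancelling is up to the app:

```swift
import AVFoundation
import Speech

// Hypothetical helper: stops capturing audio and winds down recognition.
func stopListening(audioEngine: AVAudioEngine,
                   request: SFSpeechAudioBufferRecognitionRequest,
                   task: SFSpeechRecognitionTask?,
                   discardResults: Bool) {
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)

    if discardResults {
        // Abandon the task; no final result is wanted.
        task?.cancel()
    } else {
        // Signal that no more audio is coming so a final result can be delivered.
        request.endAudio()
    }
}
```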