
Speech Recognition and TTS
This episode was authored by Yono Mittlefeldt.
Episode Links
- Speech Recognition API - WWDC 2016 Session 509. This session introduces the speech recognition API.
Getting Started
Text-to-speech needs the AVFoundation framework, but speech recognition needs the Speech framework, which already imports AVFoundation. So we can just import the latter:
import Speech
Text-to-Speech
For TTS, we need three components:
- AVSpeechUtterance - the text from which you want to synthesize speech, plus a few properties that control how it is synthesized
- AVSpeechSynthesisVoice - the voice to use for the speech. This is currently just defined by the BCP-47 language code, as each language has only one voice and one quality level.
- AVSpeechSynthesizer - the object that synthesizes the speech
So we can synthesize speech in a few easy lines of code:
let utterance = AVSpeechUtterance(string: "La synthèse vocale est très facile")
utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")
let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)
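The utterance also carries the properties that control how the speech sounds. A small sketch (the values here are arbitrary) that tweaks rate, pitch, and volume using the synthesizer from above:
let slowUtterance = AVSpeechUtterance(string: "Text to speech is very easy")
slowUtterance.voice = AVSpeechSynthesisVoice(language: "en-US")
slowUtterance.rate = AVSpeechUtteranceDefaultSpeechRate * 0.8 // a bit slower than the default
slowUtterance.pitchMultiplier = 1.2 // allowed range is 0.5 to 2.0
slowUtterance.volume = 0.9 // 0.0 is silent, 1.0 is loudest
synthesizer.speak(slowUtterance)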
Speech Recognition
Authorization
Before using speech recognition in an app, we need to request authorization.
SFSpeechRecognizer.requestAuthorization { authStatus in
    switch authStatus {
    case .authorized:
        print("User has authorized access to speech recognition!")
    case .denied:
        print("User denied access to speech recognition")
    case .restricted:
        print("Speech recognition restricted on this device")
    case .notDetermined:
        print("Speech recognition not yet authorized")
    }
}
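The authorization handler isn't guaranteed to run on the main queue, so any UI work should be dispatched back to it. A sketch, where recordButton is a hypothetical button outlet:
SFSpeechRecognizer.requestAuthorization { authStatus in
    DispatchQueue.main.async {
        // recordButton is a hypothetical UIButton outlet.
        self.recordButton.isEnabled = (authStatus == .authorized)
    }
}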
We also need to include a privacy usage string for speech recognition in our Info.plist. The raw key for this privacy usage string is NSSpeechRecognitionUsageDescription. (Since we will also record from the microphone, NSMicrophoneUsageDescription is required as well.)
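We can also read the current status without prompting, for example to trigger the system alert only if it hasn't been shown yet. A minimal sketch:
if SFSpeechRecognizer.authorizationStatus() == .notDetermined {
    // Only this call makes the system permission alert appear.
    SFSpeechRecognizer.requestAuthorization { authStatus in
        print("Speech recognition authorization status: \(authStatus.rawValue)")
    }
}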
Important Components
Speech recognition has three important components:
- SFSpeechRecognizer - a speech recognizer, which can only handle a single language (see the availability sketch after this list)
- SFSpeechRecognitionRequest - a request to recognize speech from a particular audio source. To use the microphone as an audio source, we use SFSpeechAudioBufferRecognitionRequest. If we wanted to use a pre-recorded audio file, we would use SFSpeechURLRecognitionRequest.
- SFSpeechRecognitionTask - a recognition task, with the ability to monitor its progress and cancel it if desired
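Because a recognizer handles a single language and may be temporarily unavailable (e.g. without a network connection), it can be worth checking supported locales and availability up front, as in this sketch:
// All locales supported for recognition on this device.
let locales = SFSpeechRecognizer.supportedLocales()
print(locales.map { $0.identifier }.sorted().joined(separator: ", "))

// Even a recognizer for a supported locale may be momentarily unavailable.
if let frenchRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR")),
    frenchRecognizer.isAvailable {
    print("Ready to recognize French speech")
}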
Assume we have the following defined:
// SFSpeechRecognizer(locale:) is failable and returns nil for unsupported locales.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR"))!
let request = SFSpeechAudioBufferRecognitionRequest()
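For a pre-recorded file, only the request type changes; the rest of the flow stays the same. A sketch, where memo.m4a is a hypothetical audio file bundled with the app:
if let fileURL = Bundle.main.url(forResource: "memo", withExtension: "m4a") {
    let fileRequest = SFSpeechURLRecognitionRequest(url: fileURL)
    let fileTask = recognizer.recognitionTask(with: fileRequest) { result, _ in
        if let result = result, result.isFinal {
            print(result.bestTranscription.formattedString)
        }
    }
    _ = fileTask // keep a reference if you need to cancel it later
}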
Audio Input
To capture audio data from a microphone, we need to install a tap on the input node of an AVAudioEngine object and append the audio buffers to the SFSpeechAudioBufferRecognitionRequest object:
let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    request.append(buffer)
}

// Without starting the engine, the tap never fires.
audioEngine.prepare()
try audioEngine.start()
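On iOS, it's also common to configure the shared audio session for recording before calling audioEngine.start() above. One possible configuration, as a sketch (choose the category and mode that fit your app):
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)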
Create a Recognition Task and Monitor the Results
We can then create a recognition task for the recognizer using the recognition request. The result handler receives an optional SFSpeechRecognitionResult and an optional Error.
The SFSpeechRecognitionResult includes a list of transcriptions, sorted in descending order by confidence.
If the result is final, or if there's an error, we want to stop the audio engine and remove the tap.
let task = recognizer.recognitionTask(with: request) { result, error in
    if let result = result {
        let allTranscriptions = result.transcriptions.map { $0.formattedString }
        print(allTranscriptions.joined(separator: "\n"))
    }
    let isFinal = result?.isFinal ?? false
    if error != nil || isFinal {
        audioEngine.stop()
        inputNode.removeTap(onBus: 0)
    }
}
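Finally, the request and the task give us control over how recognition ends. Two common options, as a sketch:
// When the user stops recording, mark the end of audio so the
// task can deliver its final result.
request.endAudio()

// Or abandon the recognition entirely.
task.cancel()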