With special guest Yono, we dive into the system frameworks for text-to-speech and speech recognition on iOS. Yono is building an app for language practice. Along the way we become familiar with AVAudioEngine and AVSpeechSynthesizer from AVFoundation, and SFSpeechRecognizer from the Speech framework.
This episode was authored by Yono Mittlefeldt.

## Episode Links

- Speech Recognition API - WWDC 2016, Session 509. This session introduces the speech recognition API.

## Getting Started

Text-to-speech needs the AVFoundation framework, while speech recognition needs the Speech framework, which already imports AVFoundation. So we only need to import the latter:

```swift
import Speech
```

## Text-to-Speech

For TTS, we need three components:

- `AVSpeechUtterance` - the text from which you want to synthesize speech, plus a few properties controlling how to synthesize it.
- `AVSpeechSynthesisVoice` - the voice to use for the speech. This is currently just defined by the BCP-47 language code, as each language has only one voice and one quality level.
- `AVSpeechSynthesizer` - the object that synthesizes speech.

So we can synthesize speech in a few easy lines of code:

```swift
let utterance = AVSpeechUtterance(string: "La synthèse vocale est très facile")
utterance.voice = AVSpeechSynthesisVoice(language: "fr-FR")

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)
```

## Speech Recognition Authorization

Before using speech recognition in an app, we need to request authorization:

```swift
SFSpeechRecognizer.requestAuthorization { (authStatus) in
    switch authStatus {
    case .authorized:
        print("User has authorized access to speech recognition!")
    case .denied:
        print("User denied access to speech recognition")
    case .restricted:
        print("Speech recognition restricted on this device")
    case .notDetermined:
        print("Speech recognition not yet authorized")
    }
}
```

We also need to include a privacy usage string for speech recognition in our Info.plist. The raw key for this privacy usage string is `NSSpeechRecognitionUsageDescription`.

## Important Components

Speech recognition has three important components:

- `SFSpeechRecognizer` - a speech recognizer, which can only handle a single language.
- `SFSpeechRecognitionRequest` - a request to recognize speech from a particular audio source. To use the microphone as an audio source, we use `SFSpeechAudioBufferRecognitionRequest`. If we wanted to use a pre-recorded audio file, we would use `SFSpeechURLRecognitionRequest`.
- `SFSpeechRecognitionTask` - a recognition task, with the ability to monitor its progress and cancel it, if desired.

Assume we have the following defined:

```swift
// The initializer is failable (it returns nil for unsupported locales),
// so we force-unwrap here for brevity.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "fr-FR"))!
let request = SFSpeechAudioBufferRecognitionRequest()
```

## Audio Input

To capture audio data from the microphone, we install a tap on the input node of an AVAudioEngine object and append each audio buffer to the SFSpeechAudioBufferRecognitionRequest object:

```swift
let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)

inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { (buffer, _) in
    request.append(buffer)
}
```
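The snippet above installs the tap, but audio only flows once the engine is actually running. As a minimal sketch (assuming the `audioEngine` and `request` from the previous snippets; error handling omitted), we would also configure the shared `AVAudioSession` for recording and then start the engine. Note that capturing from the microphone additionally requires a `NSMicrophoneUsageDescription` entry in Info.plist, alongside the speech recognition usage string mentioned above.

```swift
import AVFoundation

// Sketch only: configure the shared audio session for recording.
// Call these from a throwing context or wrap them in do/catch.
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

// No buffers reach the tap (and therefore the request) until the engine is running.
audioEngine.prepare()
try audioEngine.start()
```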
## Create a Recognition Task and Monitor the Results

We can then create a recognition task for the recognizer using the recognition request. The result handler receives an optional SFSpeechRecognitionResult and an optional Error. The SFSpeechRecognitionResult includes a list of transcriptions, sorted in descending order by confidence. If the result is final, or if there's an error, we want to stop the audio engine and remove the tap:

```swift
let task = recognizer.recognitionTask(with: request) { result, error in
    if let result = result {
        let allTranscriptions = result.transcriptions.map { $0.formattedString }
        print(allTranscriptions.joined(separator: "\n"))
    }

    let isFinal = result?.isFinal ?? false

    if error != nil || isFinal {
        audioEngine.stop()
        inputNode.removeTap(onBus: 0)
    }
}
```
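The snippet above stops the engine when a final result or an error arrives, but it doesn't show how to end a session from the app's side (for example, when the user taps a stop button). Here's a small sketch, assuming the `audioEngine`, `inputNode`, `request`, and `task` defined above; `stopRecording()` is an illustrative name, not something from the episode:

```swift
func stopRecording() {
    // Stop capturing audio and tear down the tap we installed earlier.
    audioEngine.stop()
    inputNode.removeTap(onBus: 0)

    // Tell the buffer-based request that no more audio is coming, so the
    // recognizer can finish up and deliver a result with isFinal == true.
    request.endAudio()

    // If the remaining results aren't needed, the task can instead be
    // cancelled outright with task.cancel().
}
```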