Wednesday, 7 November 2018

How to correctly set up AVAudioSession and AVAudioEngine when using both SFSpeechRecognizer and AVSpeechSynthesizer

I am building an app that uses both STT (Speech to Text) and TTS (Text to Speech) in the same session. However, I have run into a couple of audio-session issues that I cannot explain, and I would appreciate your expertise.

The app consists of a button at the center of the screen which, when tapped, starts the speech recognition functionality using the code below.

import AVFoundation
import Speech

// MARK: - Constant Properties

let audioEngine = AVAudioEngine()



// MARK: - Optional Properties

var command: String?  // Holds the latest transcription; set in the result handler below.
var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
var recognitionTask: SFSpeechRecognitionTask?
var speechRecognizer: SFSpeechRecognizer?



// MARK: - Functions

internal func startSpeechRecognition() {

    // Instantiate the recognitionRequest property.
    self.recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    // Set up the audio session.
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("An error has occurred while setting the AVAudioSession.")
    }

    // Set up the audio input tap.
    let inputNode = self.audioEngine.inputNode
    let inputNodeFormat = inputNode.outputFormat(forBus: 0)

    inputNode.installTap(onBus: 0, bufferSize: 512, format: inputNodeFormat) { [unowned self] buffer, _ in
        // Stream each captured buffer into the recognition request.
        self.recognitionRequest?.append(buffer)
    }

    // Start the recognition task.
    guard
        let speechRecognizer = self.speechRecognizer,
        let recognitionRequest = self.recognitionRequest else {
            fatalError("One or more properties could not be instantiated.")
    }

    self.recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { [unowned self] result, error in

        if error != nil {

            // Stop the audio engine and recognition task.
            self.stopSpeechRecognition()

        } else if let result = result {

            let bestTranscriptionString = result.bestTranscription.formattedString

            self.command = bestTranscriptionString
            print(bestTranscriptionString)

        }

    })

    // Start the audioEngine.
    do {
        try self.audioEngine.start()
    } catch {
        print("Could not start the audioEngine property.")
    }

}



internal func stopSpeechRecognition() {

    // Stop the audio engine.
    self.audioEngine.stop()
    self.audioEngine.inputNode.removeTap(onBus: 0)

    // End and deallocate the recognition request.
    self.recognitionRequest?.endAudio()
    self.recognitionRequest = nil

    // Cancel and deallocate the recognition task.
    self.recognitionTask?.cancel()
    self.recognitionTask = nil

}
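
For completeness, the snippet above assumes that speech-recognition permission has already been granted and that speechRecognizer was instantiated elsewhere. A minimal sketch of that prerequisite, using the standard SFSpeechRecognizer.requestAuthorization(_:) call (the requestSpeechAuthorization name and the en-US locale are illustrative):

internal func requestSpeechAuthorization() {

    // Requires NSSpeechRecognitionUsageDescription (and NSMicrophoneUsageDescription
    // for the input tap above) in Info.plist.
    SFSpeechRecognizer.requestAuthorization { [unowned self] authStatus in
        OperationQueue.main.addOperation {
            // Only create the recognizer once the user has granted access.
            if authStatus == .authorized {
                self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
            }
        }
    }

}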

When used alone, all of this works like a charm. However, when I try to read the transcribed text back using an AVSpeechSynthesizer object, the correct setup is anything but clear.
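
The TTS side itself is minimal; a sketch of roughly how the transcription is handed to the synthesizer (the readOut helper name and the en-US voice are illustrative):

let speechSynthesizer = AVSpeechSynthesizer()

internal func readOut(_ text: String) {

    // Build an utterance from the transcribed command.
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")

    // With a .record session there is no playback route,
    // so the category has to allow output before this call.
    self.speechSynthesizer.speak(utterance)

}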

I went through multiple Stack Overflow posts that suggested changing

audioSession.setCategory(.record, mode: .measurement, options: [.defaultToSpeaker, .duckOthers])

to the following

audioSession.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker, .duckOthers])

But in vain; the app still crashed after running STT and then TTS.
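
For clarity, a sketch of how I applied that change in context, assuming the recognition pipeline should be torn down before the session is reconfigured, since the input node's hardware format can change when the category changes (the switchToPlayback name is illustrative):

internal func switchToPlayback() {

    // Tear down the recognition pipeline first; the input node's
    // hardware format can change when the session category changes.
    self.stopSpeechRecognition()

    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker, .duckOthers])
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    } catch {
        print("Could not reconfigure the AVAudioSession: \(error)")
    }

}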

What eventually worked for me was using this, rather than either of the options above:

audioSession.setCategory(.multiRoute, mode: .default, options: [.defaultToSpeaker, .duckOthers])

This leaves me completely puzzled, as I have no idea what is actually going on under the hood. I would greatly appreciate any relevant explanation!
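
In case it matters, I sequence the two so that listening resumes only after playback has finished, via the synthesizer's delegate. A sketch, assuming a ViewController class that owns the properties above and has set itself as speechSynthesizer.delegate:

extension ViewController: AVSpeechSynthesizerDelegate {

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
        // Resume listening only after the utterance has been spoken,
        // so STT and TTS never contend for the session at the same time.
        self.startSpeechRecognition()
    }

}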



from How to correctly set up AVAudioSession and AVAudioEngine when using both SFSpeechRecognizer and AVSpeechSynthesizer
