Recognize and synthesize speech Flashcards
Microsoft Azure offers both speech recognition and speech synthesis capabilities through the Speech cognitive service, which includes the following application programming interfaces (APIs):
The Speech-to-Text API
The Text-to-Speech API
Azure resources for the Speech service
- A Speech resource - choose this resource type if you only plan to use the Speech service, or if you want to manage access and billing for the resource separately from other services.
- A Cognitive Services resource - choose this resource type if you plan to use the Speech service in combination with other cognitive services, and you want to manage access and billing for these services together.
The speech-to-text API
You can use the speech-to-text API to perform real-time or batch transcription of audio into a text format. The audio source for transcription can be a real-time audio stream from a microphone or an audio file.
The model used by the speech-to-text API is based on the Universal Language Model that was trained by Microsoft.
The model is optimized for two scenarios:
1. Conversational
2. Dictation
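The "stream or file" idea above can be sketched with a small, self-contained Python example. No Azure SDK is involved; the `audio_chunks` helper and the tiny chunk size are hypothetical stand-ins for what a real transcription client does when it accepts either audio source:

```python
import io
from typing import Iterable, Union

CHUNK_SIZE = 4  # bytes per chunk; real audio chunks would be far larger


def audio_chunks(source: Union[Iterable[bytes], io.BufferedIOBase]) -> Iterable[bytes]:
    """Normalize a live stream (an iterable of chunks) or a file-like
    object into one uniform sequence of audio chunks for transcription."""
    if hasattr(source, "read"):  # file-like: read in fixed-size chunks
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk
    else:  # already a stream of chunks (e.g. from a microphone)
        yield from source


# A stored "audio file" and a simulated microphone stream yield the same shape:
file_like = io.BytesIO(b"abcdefgh")
stream = iter([b"abcd", b"efgh"])
print(list(audio_chunks(file_like)))  # [b'abcd', b'efgh']
print(list(audio_chunks(stream)))     # [b'abcd', b'efgh']
```

Normalizing both sources to one shape is what lets a single recognition pipeline serve both the real-time and batch scenarios described above.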
Real-time transcription
Real-time speech-to-text allows you to transcribe audio streams into text as the speech occurs. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.
Batch transcription
Not all speech-to-text scenarios are real time. You may have audio recordings stored on a file share, a remote server, or even on Azure storage.
Batch transcription should be run in an asynchronous manner because the batch jobs are scheduled on a best-effort basis.
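Because batch jobs are scheduled on a best-effort basis, clients typically submit a job and then poll for completion rather than wait synchronously. The submit-and-poll pattern can be simulated with a self-contained `asyncio` sketch (the job runner, timings, and result shape are all invented for illustration, not the real batch transcription API):

```python
import asyncio


async def run_batch_job(audio_uri: str) -> dict:
    """Simulate a best-effort batch transcription job."""
    await asyncio.sleep(0.01)  # stand-in for queueing plus transcription time
    return {"status": "Succeeded", "audio": audio_uri, "transcript": "..."}


async def submit_and_poll(audio_uri: str) -> dict:
    job = asyncio.create_task(run_batch_job(audio_uri))  # submit asynchronously
    while not job.done():                                # poll until complete
        await asyncio.sleep(0.005)
    return job.result()


result = asyncio.run(submit_and_poll("https://example.com/recording.wav"))
print(result["status"])  # Succeeded
```

The caller never blocks on the transcription itself; it only checks job status periodically, which is the asynchronous manner of operation the note above recommends.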
The text-to-speech API
The text-to-speech API enables you to convert text input to audible speech, which can either be played directly through a computer speaker or written to an audio file.
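To make the "written to an audio file" output concrete, here is a self-contained sketch that writes a short sine tone as a WAV file using only the Python standard library. The tone is a stand-in for the audio a text-to-speech service might return; the function name and parameters are invented for illustration:

```python
import math
import struct
import wave


def write_tone(path: str, freq: float = 440.0, seconds: float = 0.1,
               rate: int = 16000) -> None:
    """Write a sine tone as a mono 16-bit WAV file -- a stand-in for
    synthesized speech audio."""
    n = int(rate * seconds)
    samples = (int(32767 * math.sin(2 * math.pi * freq * t / rate))
               for t in range(n))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


write_tone("speech.wav")
with wave.open("speech.wav", "rb") as wav:
    print(wav.getnframes())  # 1600 frames = 0.1 s at 16 kHz
```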
Speech synthesis voices
When you use the text-to-speech API, you can specify the voice to be used to vocalize the text. This capability offers you the flexibility to personalize your speech synthesis solution and give it a specific character.
Increasingly, we expect artificial intelligence (AI) solutions to accept vocal commands and provide spoken responses.
To support this kind of interaction, the AI system must support two capabilities:
- Speech recognition - the ability to detect and interpret spoken input.
- Speech synthesis - the ability to generate spoken output.
Speech recognition
Speech recognition is concerned with taking the spoken word and converting it into data that can be processed - often by transcribing it into a text representation.
The spoken words can be in the form of a recorded voice in an audio file, or live audio from a microphone.
Patterns in the audio are analyzed to determine recognizable sounds that can be mapped to words. To accomplish this, the software typically uses multiple types of models, including:
- An acoustic model that converts the audio signal into phonemes (representations of specific sounds).
- A language model that maps phonemes to words, usually using a statistical algorithm that predicts the most probable sequence of words based on the phonemes.
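The two-model pipeline above can be illustrated with a toy, self-contained Python sketch. The phoneme labels, word candidates, and probabilities are all invented for illustration and are not real model output:

```python
# "Acoustic model": maps stretches of audio (here, labeled segments) to phonemes.
ACOUSTIC = {"seg1": "HH", "seg2": "AY", "seg3": "DH", "seg4": "EH", "seg5": "R"}

# "Language model": maps phoneme sequences to candidate words with probabilities.
LANGUAGE = {
    ("HH", "AY"): [("hi", 0.8), ("high", 0.2)],
    ("DH", "EH", "R"): [("there", 0.9), ("their", 0.1)],
}


def recognize(segments, groupings):
    """Convert audio segments to phonemes, then pick the most probable
    word for each group of phonemes."""
    phonemes = [ACOUSTIC[s] for s in segments]
    words, i = [], 0
    for size in groupings:
        group = tuple(phonemes[i:i + size])
        candidates = LANGUAGE[group]           # possible words for this group
        words.append(max(candidates, key=lambda c: c[1])[0])
        i += size
    return " ".join(words)


print(recognize(["seg1", "seg2", "seg3", "seg4", "seg5"], [2, 3]))  # hi there
```

A real recognizer searches over many competing phoneme and word hypotheses at once, but the division of labor is the same: the acoustic model produces sound units, and the language model chooses the most probable word sequence.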
In speech recognition, the recognized words are typically converted to text, which you can use for various purposes, such as:
- Providing closed captions for recorded or live videos
- Creating a transcript of a phone call or meeting
- Automated note dictation
- Determining intended user input for further processing
Speech synthesis
Speech synthesis is in many respects the reverse of speech recognition.
It is concerned with vocalizing data, usually by converting text to speech. A speech synthesis solution typically requires the following information:
- The text to be spoken.
- The voice to be used to vocalize the speech.
To synthesize speech, the system typically tokenizes the text to break it down into individual words, and assigns phonetic sounds to each word.
These phonemes are then synthesized as audio by applying a voice, which determines parameters such as pitch and timbre, and by generating an audio waveform that can be output to a speaker or written to a file.
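The tokenize-then-voice pipeline above can be sketched as a toy, self-contained Python example. The lexicon, the voice names, and the single pitch parameter are all invented for illustration; a real synthesizer generates waveforms, not labeled tuples:

```python
# Word-to-phoneme lexicon and voice parameters (both invented).
LEXICON = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}
VOICES = {"alto": {"pitch": 220.0}, "bass": {"pitch": 110.0}}


def synthesize(text: str, voice: str):
    """Tokenize text into words, expand words to phonemes, then 'render'
    each phoneme with the chosen voice's parameters."""
    phonemes = [p for word in text.lower().split() for p in LEXICON[word]]
    pitch = VOICES[voice]["pitch"]
    # Each "audio frame" is just (phoneme, pitch); a real system would
    # generate an audio waveform at this step.
    return [(p, pitch) for p in phonemes]


frames = synthesize("hi there", voice="bass")
print(frames)  # [('HH', 110.0), ('AY', 110.0), ('DH', 110.0), ('EH', 110.0), ('R', 110.0)]
```

Swapping the voice changes only the rendering parameters, not the phoneme sequence, which mirrors how choosing a different synthesis voice changes the character of the output without changing what is said.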
You can use the output of speech synthesis for many purposes, including:
- Generating spoken responses to user input.
- Creating voice menus for telephone systems.
- Reading email or text messages aloud in hands-free scenarios.
- Broadcasting announcements in public locations, such as railway stations or airports.