Ling 290 Final Flashcards
How does ASR work?
System receives acoustic input from a speaker through a microphone, analyzes it using a pattern, model, or algorithm, and produces an output, usually in the form of text
Automatic speech recognition (ASR)
Independent, machine-based process of decoding/transcribing oral speech
What’s the difference between:
Speech recognition
Voice recognition
Speech understanding
Speech understanding/identification: determining meaning (not transcription)
Speech recognition: ability of machine to recognize what is being said (WHAT)
Voice recognition: ability of a machine to recognize speaking style (WHO)
Describe the first ASR system and who invented it
AUDREY, built in the early 1950s at Bell Telephone Laboratories by Davis, Biddulph, and Balashek
Could recognize isolated digits from 0 to 9 for a single speaker
Speaker dependent
Required extensive training
Describe the early ASR system template and its faults
Template-based recognition based on pattern matching (comparing the speaker’s input to stored acoustic templates/patterns)
Faults:
-not good for large vocab recognition
-can’t match speech sounds if they are different lengths
Who were the first ones to use a computer for an ASR?
Forgie and Forgie in 1959
What did researchers experiment with in the 60s to improve ASR?
Researchers experimented with time-normalization techniques (dynamic time warping, DTW) to minimize differences in speech rate
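A minimal sketch of how DTW can compare an utterance to a stored template even when the two differ in length. The sequences are made-up one-dimensional "features" (real systems compare frames of spectral features), and the function name dtw_distance is only illustrative.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # A frame may be stretched (repeat a or b) or matched one-to-one.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical data: the same "word" spoken at normal and at half speed.
template = [1, 2, 3, 2, 1]
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
other = [3, 1, 3, 1, 3]

print(dtw_distance(template, slow))   # slow version warps onto the template
print(dtw_distance(template, other))  # a different pattern stays far away
```

The slow utterance is twice as long as the template, yet warping lets every frame line up with a matching frame, which is the length problem that plain template matching could not handle.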
What were the three milestones for ASR in the 70s?
1- focus on recognition of continuous speech
2- development of large vocab speech recognizers
3- speaker independent systems (so it could recognize a range of voices)
What was the first commercial ASR system?
VIP-100 (won US national award)
Describe the SUR project (speech understanding research)
ARPA started it (1971-1976) with the goal of creating a system capable of understanding the connected speech of many speakers with a 1000-word vocabulary (in a low-noise environment) and an error rate of less than 10%
What was the successful product of the SUR project?
Harpy; showed the benefits of data-based statistical models over template-based ones; first step towards hidden Markov modelling (HMM)
What is HMM? (One ASR model of 1980s)
Hidden Markov modeling = based on complex statistical/probabilistic analyses
Represents language units like phonemes or words as a sequence of states with transition probabilities between each state
Uses highest probability to predict best answer
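A toy sketch of the "highest probability" idea, using the Viterbi algorithm (a standard way to decode an HMM; the flashcards do not name it). The states, the acoustic symbols x/y/z, and all probabilities are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical left-to-right model for the word "cat": k -> ae -> t
states = ["k", "ae", "t"]
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k": {"x": 0.8, "y": 0.1, "z": 0.1},
    "ae": {"x": 0.1, "y": 0.8, "z": 0.1},
    "t": {"x": 0.1, "y": 0.1, "z": 0.8},
}

decoded = viterbi(["x", "x", "y", "z"], states, start_p, trans_p, emit_p)
print(decoded)  # ['k', 'k', 'ae', 't']
```

The first state is held for two frames, which shows how self-transitions let the model absorb differences in how long each sound lasts.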
Good and bad aspects of HMM?
Good: can analyze both temporal and spectral variations of speech signals and can decode continuous speech
Bad: requires extensive training, a large amount of memory, and huge computational power for model-parameter storage and likelihood evaluation
What is ANN? (Second ASR model of 1980s)
Artificial neural network; reintroduced in the 1980s after first appearing in the 1950s
Network consists of interconnected processing units combined in layers with different weights that are determined on the basis of training data
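A minimal sketch of "weights determined on the basis of training data", using a single processing unit trained with the classic perceptron rule. This is one unit, not a layered network, and the training data (a logical AND) is a stand-in chosen only because it is easy to check by hand.

```python
def unit_output(weights, bias, inputs):
    """One processing unit: weighted sum followed by a threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_unit(samples, epochs=20, rate=1):
    """Adjust weights from labelled examples (perceptron learning rule)."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - unit_output(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training data: fire only when both input features are present.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_unit(samples)
predictions = [unit_output(weights, bias, inputs) for inputs, _ in samples]
print(predictions)  # [0, 0, 0, 1]
```

A real ANN stacks many such units in layers and learns all the weights together, but the principle is the same: the weights come from the training data, not from hand-written rules.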
Good and bad aspects of ANN?
Good: classification of static patterns (including noisy acoustic data) and useful for recognizing isolated speech units
Bad: systems based on ANN alone don’t work very well; they need to be paired with HMM
First and second steps in commercialization of ASR? (1980-1990s)
ASR used in telephone networks; then portable speech recognizers offered to the public and ASR integrated into products ranging from PC dictation systems to air traffic control training systems
What were the 3 focuses of ASR in the 1990s? Plus two extras
1- larger vocab
2- spontaneous speech recognition
3- working in noisy environments
Plus
- human to human speech recognition
- visual speech recognition (based on lip positions and movements etc)
Three areas of further progress of ASR in 2000s?
1- development of new algorithms
2- advances in noisy speech recognition
3- integration of speech recognition into mobile technologies like cellphones
Speech recognition systems can be characterized by which 3 dimensions
1- speaker dependence (speaker dependent, speaker independent, adaptive)
2- speech continuity (isolated/discrete word recognition systems, connected word recognition systems, continuous speech recognition systems, word spotting systems)
3- vocab size
3 types of errors in ASR
Errors in discrete speech recognition (deletion, insertion, substitution and rejection errors)
Errors in continuous speech recognition (same as discrete speech + splits and fusions)
Errors in word spotting (false rejects aka word is missed and false alarms aka word misidentified)
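A small sketch of how deletion, insertion, and substitution errors can be counted by aligning what was said with what the system produced. It uses a standard word-level edit distance; the word error rate (WER) it reports is a common summary measure that the flashcards do not mention, and rejections, splits, and fusions are not modelled. The sentences are invented.

```python
def count_errors(reference, hypothesis):
    """Count substitutions, deletions, and insertions via edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total, substitutions, deletions, insertions) for ref[:i] vs hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # everything deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # everything inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # correct word, no new error
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion)
    total, subs, dels, ins = table[-1][-1]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": total / len(ref),
    }


spoken = "turn the lights off in the kitchen"
recognized = "turn lights of in the big kitchen"
errors = count_errors(spoken, recognized)
print(errors)
```

Here "the" is deleted, "off" is substituted by "of", and "big" is inserted, so 3 of the 7 spoken words are affected.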
Difference between a direct, indirect and intent error in ASR
Direct= human misspeaks/stutters
Intent= speaker decides to restate what's just been said
Indirect= ASR system incorrectly identifies what the speaker said
What are three reasons why ASR is good for learning a language?
- practice
- motivation
- feelings of communicating, not just repeating phrases and words
What are the two options of ASR for automatic rating of pronunciation? 3 ways to reach these goals?
Two options= give global pronunciation rating OR identify specific errors
3 ways of achieving this= ASR must identify word boundaries, accurately match speech to correct targets, and compare the two to see what was done right/wrong
Describe spoken CALL dialogue systems
Used by software programs for practicing spoken languages
Provides one line of dialogue and then the speaker chooses one of two responses
If the response is wrong, the ASR system can recognize which response has been spoken (even if there are errors) and then the computer responds, allowing the learner to try again
Spectrograph
Bell Telephone Laboratories
Late 1930s
Later Sonagraph made by Kay Elemetrics
Made for the phonetic study of speech
Direct Translator was another one; it had a fluorescent screen and was used to help deaf and foreign exchange students with pronunciation
Thought to be a “war project”; something to identify “voiceprints”
What is the term for forensic phonetics that is still used in Russia and many Eastern European countries?
Phonoscopy
What are voiceprints?
Patterns seen on spectrograms for an individual voice
Kersta claimed they were 99% accurate and opened his own business, Voiceprint Laboratories Corporation
What’s the term for speech samples that are obtained at different times and later used for identification?
Non-contemporary speech samples
What are the two factors that most influence identification accuracy?
Sample duration (only if it contains more phonemes than the original) and acoustic quality (background noise and bandwidth of recordings)
What is one factor that affects formant data over the telephone?
Band-pass filtering (telephone cuts off some frequencies)
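A rough sketch of why band-pass filtering matters for formant data. It assumes the traditional narrowband telephone passband of about 300 to 3400 Hz and treats it as a hard cutoff (real filters roll off gradually). The formant values for the vowel are illustrative, not measurements from any recording.

```python
# Assumed narrowband telephone passband in Hz (hard cutoff for simplicity).
PASSBAND_HZ = (300, 3400)


def survives_telephone(formants_hz, passband=PASSBAND_HZ):
    """Mark each formant as inside (True) or outside (False) the passband."""
    low, high = passband
    return {name: low <= hz <= high for name, hz in formants_hz.items()}


# Illustrative formant frequencies for a high front vowel like [i].
vowel_i = {"F1": 270, "F2": 2290, "F3": 3010, "F4": 3700}
result = survives_telephone(vowel_i)
print(result)  # {'F1': False, 'F2': True, 'F3': True, 'F4': False}
```

In this example the low F1 and the high F4 fall outside the band, so formant measurements taken from a telephone recording may not match those from a direct recording of the same speaker.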
Familiarity with the speaker in phonetic forensics: testing under three conditions, what were the results?
Hollien: normal, disguised and stressed
Listeners who knew the person did better under all conditions
What type of voice disguise provided the greatest effect?
Hypernasality
What’s the percent of annual cases that used voice disguise?
15-25%
Four reasons why real ear witnessing is different from the studies conducted?
- they don’t mirror real life situations
- the stress can’t be recreated
- not familiar voices
- real life witnesses are not prepared as study witnesses are
What is an ear witness line-up?
Aka voice parade; used when a person has heard (not seen) the perpetrator
Two questions asked in regards to using a voice parade? What are the answers?
How many foils should be used?
Too many = not good, fewer = more accurate, but you still need to be fair, so the best number is 5-6
How similar should the foils be?
The foils should be similar but not to the extreme (must match age, dialect, etc)
What brain test can be used to tell whether someone is lying or not? What’s used for detecting changes in blood flow to the face?
fMRI (functional magnetic resonance imaging), because it shows a difference in brain activity while telling a lie
High-resolution thermal imaging (for changes in blood flow to the face)
Describe the first lie detector
Polygraph
1917
Detects the stress level of the person through pulse/blood pressure/skin response