Ling 290 Final Flashcards
How does ASR work?
The system receives acoustic input from a speaker through a microphone, analyzes it using a pattern, model, or algorithm, and produces an output, usually in the form of text
Automatic speech recognition (ASR)
Independent, machine-based process of decoding/transcribing oral speech
What’s the difference between:
Speech recognition
Voice recognition
Speech understanding
Speech understanding/identification: determining meaning (not transcription)
Speech recognition: ability of machine to recognize what is being said (WHAT)
Voice recognition: ability of a machine to recognize speaking style (WHO)
Describe the first ASR system and who invented it
AUDREY, built in the early 1950s at Bell Telephone Laboratories by Davis, Biddulph and Balashek
Could recognize isolated digits from 0 to 9 for a single speaker
Speaker dependent
Required extensive training
Describe the early ASR system template and its faults
Template-based recognition based on pattern matching (comparing speakers input to stored acoustic templates/patterns)
Faults:
-not good for large vocab recognition
-can’t match speech sounds if they are a different length
Who were the first ones to use a computer for an ASR?
Forgie and Forgie in 1959
What did researchers experiment with in the 60s to improve ASR?
Researchers experimented with time-normalization techniques such as dynamic time warping (DTW) to minimize differences in speech rate
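A minimal sketch of the time-normalization idea, assuming each utterance is reduced to a short list of numbers (a stand-in for real acoustic features). The function name `dtw_distance` and the toy sequences are hypothetical, made up for illustration:

```python
def dtw_distance(a, b):
    """Return the cheapest alignment cost between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either sequence may "wait" while the other advances,
            # which is what absorbs differences in speech rate.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3, 2, 1]                 # stored pattern (toy values)
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]      # same pattern spoken at half speed
other = [3, 1, 3, 1, 3]                    # a different pattern

print(dtw_distance(template, slow))   # 0.0: the slower version still matches
print(dtw_distance(template, other))  # greater than 0: a genuinely different shape
```

The slow version matches the template perfectly even though it is twice as long, which is the fault of plain template matching that DTW was meant to address.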
What were the three milestones for ASR in the 70s?
1- focus on recognition of continuous speech
2- development of large vocab speech recognizers
3- speaker independent systems (so it could recognize a range of voices)
What was the first commercial ASR system?
VIP-100 (won US national award)
Describe the SUR project (speech understanding research)
ARPA ran it (1971-1976) with the goal of creating a system capable of understanding connected speech from many speakers with a 1,000-word vocabulary (in a low-noise environment) and an error rate of less than 10%
What was the successful product of the SUR project?
Harpy; showed the benefits of data-based statistical models over template-based ones; first step towards hidden Markov modelling (HMM)
What is HMM? (One ASR model of 1980s)
Hidden Markov modeling= based on complex statistical/probabilistic analyses
Represents language units like morphemes as a sequence of states with transition probabilities between states
Uses highest probability to predict best answer
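A rough sketch of "uses highest probability to predict best answer", assuming the standard Viterbi procedure for finding the most probable state sequence. The two states, the observation labels ("hiss", "tone") and every probability below are invented toy values, not taken from any real recognizer:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], back)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][state][0]
    path = [state]
    for t in range(len(best) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path)), prob


states = ["s", "iy"]                                  # hypothetical units
start_p = {"s": 0.8, "iy": 0.2}
trans_p = {"s": {"s": 0.6, "iy": 0.4}, "iy": {"s": 0.1, "iy": 0.9}}
emit_p = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, trans_p, emit_p)
print(path)  # ['s', 's', 'iy', 'iy']
```

The states are "hidden" because the system only sees the acoustic observations; the path is its best guess at what produced them.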
Good and bad aspects of HMM?
Good: can analyze both temporal and spectral variations of speech signals and can decode continuous speech
Bad: requires extensive training, a large amount of memory, and huge computational power for model-parameter storage and likelihood evaluation
What is ANN? (Second ASR of 1980s)
Artificial neural network; reintroduced in the 1980s after first appearing in the 1950s
Network consists of interconnected processing units combined in layers with different weights that are determined on the basis of training data
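A small sketch of how weights can be "determined on the basis of training data", assuming the classic perceptron update rule on a single processing unit (a real network stacks many such units in layers). The two input features and the class labels are hypothetical toy data:

```python
def train_unit(samples, epochs=20, rate=0.1):
    """Adjust the weights of one processing unit from labelled examples."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # 0 when the unit is already right
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias


def predict(weights, bias, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0


# Toy data: (low-frequency energy, high-frequency energy) -> 1 = vowel-like, 0 = fricative-like
samples = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.2, 0.9], 0), ([0.1, 0.8], 0)]
weights, bias = train_unit(samples)
print([predict(weights, bias, f) for f, _ in samples])  # [1, 1, 0, 0]
```

This classifies one static pattern at a time, which fits the card's point that ANNs suit isolated units better than continuous speech.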
Good and bad aspects of ANN?
Good: classification of static patterns (including noisy acoustic data) and useful for recognizing isolated speech units
Bad: systems based on ANN alone don’t work very well; they need to be paired with HMM
First and second steps in commercialization of ASR? (1980-1990s)
ASR was used in telephone networks; then portable speech recognizers were offered to the public and ASR was integrated into products ranging from PC dictation systems to air traffic control training systems
What were the 3 focuses of ASR in the 1990s? Plus two extras
1- larger vocab
2- spontaneous speech recognition
3- working in noisy environments
Plus
- human to human speech recognition
- visual speech recognition (based on lip positions and movements etc)
Three areas of further progress of ASR in 2000s?
1- development of new algorithms
2- advances in noisy speech recognition
3- integration of speech recognition into mobile technologies like cellphones
Speech recognition systems can be characterized by which 3 dimensions
1- speaker dependence (speaker dependent, speaker independent, adaptive)
2- speech continuity (isolated/discrete word recognition systems, connected word recognition systems, continuous speech recognition systems, word spotting systems)
3- vocab size
3 types of errors in ASR
Errors in discrete speech recognition (deletion, insertion, substitution and rejection errors)
Errors in continuous speech recognition (same as discrete speech + splits and fusions)
Errors in word spotting (false rejects aka word is missed and false alarms aka word misidentified)
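One way to see deletion, insertion and substitution errors is to align what was said with what the system output. The sketch below assumes a standard word-level edit-distance alignment; `count_errors` and the example sentences are hypothetical, and rejections, splits and fusions are not modelled:

```python
def count_errors(reference, hypothesis):
    """Count substitutions, deletions and insertions in the cheapest word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # table[i][j] = (total errors, substitutions, deletions, insertions) for ref[:i] vs hyp[:j]
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, i, 0)      # every reference word deleted
    for j in range(1, m + 1):
        table[0][j] = (j, 0, 0, j)      # every output word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = (t, s, d, ins)
            else:
                match = (t + 1, s + 1, d, ins)
            t, s, d, ins = table[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = table[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            table[i][j] = min(match, deletion, insertion)
    _, subs, dels, inss = table[n][m]
    return {"substitution": subs, "deletion": dels, "insertion": inss}


print(count_errors("call the main office", "call a main office now"))
# {'substitution': 1, 'deletion': 0, 'insertion': 1}
print(count_errors("turn the lights off", "turn lights off"))
# {'substitution': 0, 'deletion': 1, 'insertion': 0}
```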
Difference between a direct, indirect and intent error in ASR
Direct= human misspeaks/stutters
Intent= speaker decides to restate what's just been said
Indirect= ASR system incorrectly identifies what the speaker said
What are three reasons why ASR is good for learning a language?
- practice
- motivation
- feelings of communicating, not just repeating phrases and words
What are the two options of ASR for automatic rating of pronunciation? 3 ways to reach these goals?
Two options= give global pronunciation rating OR identify specific errors
3 ways of achieving this= ASR must identify word boundaries, accurately match speech to the correct targets, and compare the two to see what was done right/wrong
Describe spoken CALL dialogue systems
Used by software programs for practicing spoken languages
Provides one line of dialogue and then the speaker chooses one of two responses
If the response is wrong, the ASR system can recognize which response has been spoken (even if there are errors) and then the computer responds, allowing the learner to try again
Spectrograph
Bell Telephone Laboratories
Late 1930s
Later Sonagraph made by Kay Elemetrics
Made for the phonetic study of speech
Direct Translator was another one, which had a fluorescent screen and was used to help deaf and foreign exchange students with pronunciation
Thought to be a “war project”; something to identify “voiceprints”
What is the term used to refer to forensic phonetics that is still used in Russia and many east European countries?
Phonoscopy
What are voiceprints?
Patterns seen on spectrograms for an individual voice
Kersta said they were 99% accurate and opened his own business called Voiceprint Laboratories corporation
What’s the term for speech samples that are obtained at different times and later used for identification?
Non-contemporary speech samples
What are the two factors that most influence identification accuracy?
Sample duration (only if it contains more phonemes than the original) and acoustic quality (background noise and bandwidth of recordings)
What is one factor that affects formant data over the telephone?
Band-pass filtering (telephone cuts off some frequencies)
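A quick sketch of why band-pass filtering matters for formant data. It assumes the traditional narrowband telephone passband of roughly 300 to 3400 Hz and treats the cut-offs as sharp (real filters roll off gradually). The formant values are illustrative numbers for a single vowel, not measurements:

```python
TELEPHONE_BAND_HZ = (300.0, 3400.0)  # assumed narrowband passband


def formants_in_band(formants_hz, band=TELEPHONE_BAND_HZ):
    """Report which formants fall inside the passband."""
    low, high = band
    return {name: low <= freq <= high for name, freq in formants_hz.items()}


vowel = {"F1": 270.0, "F2": 2290.0, "F3": 3010.0, "F4": 3700.0}  # illustrative values
print(formants_in_band(vowel))
# {'F1': False, 'F2': True, 'F3': True, 'F4': False}
```

Under these assumptions a low F1 and a high F4 fall outside the band, so measurements of them from a telephone recording are less trustworthy.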
Familiarity with the speaker in phonetic forensics: testing under three conditions, what were the results?
Hollien: normal, disguised and stressed
Listeners who knew the person did better under all conditions
What type of voice disguise provided the greatest effect?
Hypernasality
What’s the percent of annual cases that used voice disguise?
15-25%
Four reasons why real ear witnessing is different from the studies conducted?
- they don’t mirror real life situations
- the stress can’t be recreated
- not familiar voices
- real life witnesses are not prepared as study witnesses are
What is an ear witness line-up?
Aka voice parade; used when a person has heard (not seen) the perpetrator
Two questions asked in regards to using a voice parade? What are the answers?
How many foils should be used?
A: too many = not good; fewer = more accurate, but the line-up still needs to be fair, so the best number is 5-6
How similar should the foils be?
The foils should be similar but not to the extreme (must match age, dialect, etc)
What brain test can be used to tell whether someone is lying or not? What’s used for detecting changes in blood flow to the face?
fMRI (functional magnetic resonance imaging) because it shows a difference in brain activity while telling a lie
High resolution thermal imaging
Describe the first lie detector
Polygraph
1917
Detects the stress level of the person through pulse/blood pressure/skin response
What is a sign in the voice that someone is lying?
Micro tremor
What does an Israeli based company market as a lie detector?
Layered Voice Analysis (LVA)= focuses on brain activity while someone speaks
What is the word for fake that’s used in the article for forensic phonetics
Charlatans
Focus on acquiring a native-like accent should shift to these two aspects for L2 learners
Intelligibility
Comprehensibility
(Focus on teaching melody and rhythm)
Three aspects of training in computer based segmentals lessons?
Perceptual
Acoustic
Articulatory
Waveforms
Graphical representations of sounds (visual acoustic displays)
Vertical axis= frequency of vocal fold vibration
Horizontal axis= time
3rd dimension= amplitude (loudness) of frequency at particular time represented by intensity or colour
***can’t display voiceless sounds!
Describe CAPT
Computer assisted pronunciation training
Should provide output, input and feedback using ASR technology
Not as good as human feedback
SUPRASEGMENTALS consist of what 3 aspects
Prosody
Intonation
Rhythm
***most important part of Comprehensibility
Two ways to visually display prosody
1: y axis= melody
X axis= duration/length of syllables
2: waveforms that display intensity and duration of words/sounds/periods of silence
One software for practicing sentence intonation
TELL ME MORE
Shows movement of lips and pitch curves
Two software programs that use video for prosody training
Anvil: provides a screen display of video and audio components of a speech event with pitch contour (created by Praat)
Real-Time Pitch Program + Computerized Speech Lab: produces a pitch contour in real time and compares the learner's speech with that of a native speaker by overlapping their contours
Real-Time Pitch and Computerized Speech Lab are both from KayPentax
Dimensions of accent
Salience
Intelligibility
Comprehensibility
What scale is accentedness based on?
Likert scale
Pedagogy
Science and art of education
What error causes great reduction in Comprehensibility?
Errors in sounds with a high functional load (FL)
What are the three principal perspectives on pronunciation?
Medical view
Business view
Pedagogical view
Foreigner talk
When a speaker adapts their language so that it’s easier for the L2 listener to understand (aka modified input)
An L2 speaker can use accent to express identity only if the accent features are
Volitional
What are the three types of accent discrimination
1- stereotyping, usually through shibboleths
2- harassment/ mocking
3- being told that accent is unacceptable for a job that doesn’t require language skills
The way one person speaks is called what? (Forms a dialect)
Idiolect
What three variables explain the process by which speech rate affects consumer response?
Speed
Pitch
Interphrase pausation
What syllable speed and pitch are better for advertising?
Faster than normal syllable speed
Low pitch
Three benefits of a marketer with an attractive voice
1- style of delivery can express info about the brands message or functionality
2- attract attention
3- get positive responses from listeners
What does ELM say about compression in advertising?
Elaboration Likelihood Model; compression = less time to think about ad and less attention given
What is normal interphrase pausation?
0.5 seconds
How does pitch affect advertising
High pitch = less competent, less benevolent, less truthful, etc
When people aren’t able to process the message they rely on pitch
What’s a good way to fit more info into an ad without affecting the message?
Shortening interphrase pausation
Dictation software
ASR systems that transcribe speech to writing; helpful for disabled people with writing problems
What is CALL?
Computer assisted language learning; so L2 learners can learn with a computer for feedback
Stochastic vs deterministic processes
Stochastic= based on statistical probabilities (considering more than one solution to a problem) *now used for ASR
Deterministic= assumed single solution if series of steps followed *beginning of ASR
Describe Peterson and Barney’s study (ASR)
Vowel acoustics; used spectrograms to measure formant frequencies in vowels of American English
Describe what teaching accents analytically means
Teaching the actor to use particular words/consonants/word pronunciations/intonation patterns
Requires understanding of phonetic details of different accents
What’s the term for using ordinary letters to give phonetic spellings?
Faux phonetic transcription (compared to real IPA)
What are the inner circle English dialects?
General Canadian, General American, UK English, Australian English, New Zealand English
What are the two aspects in which the inner circle English dialects differ?
1- rhoticity: rhotic English includes General Canadian English, General American English and Irish English (pronounce post-vocalic R)
Non-rhotic English includes Received Pronunciation (RP), Australian English and New Zealand English (lack post-vocalic R but have intrusive R)
2- vowel quality (see photo)
What’s the opposite of the analytical approach to learning accents?
Holistic approach= aiming to produce the essence of an accent by doing what seems appropriate instead of analyzing each sound (copying speech patterns/mannerisms)
Good for learning prosody and voice quality differences
Describe the McGurk effect
First reported in 1976; we can hear a different sound depending on the visual info given (the lips move differently while the audio stays the same, but we perceive a different sound)
Ex: audio /ba/ is heard as /ba/ with a bilabial gesture, but as /va/ or /fa/ when presented with a labiodental gesture
Hearing bilabial /ba/ while seeing the speaker produce velar /ga/ = perceiving alveolar /da/
Which neurons are involved in the perception of speech?
Mirror neurons
What parts of speech cannot be perceived by speech-reading?
Sounds far back in the vocal tract
Nasalization
Voicing
What is the animation technique where a human models the utterances to be produced by an animated character?
Rotoscoping; makeup is used to mark the face
Downsides of rotoscoping (3)
- time consuming
- unnatural results
- human needs to say everything
Viseme
Basic facial configuration associated with a particular speech sound or group of sounds
Disney had 12
Now 22
When a computer analyzes the text of a script to be spoken by an animated character
Text-driven lip-syncing
A TTS system
Computes intermediate positions between visemes = interpolation
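A minimal sketch of interpolation between visemes, assuming each viseme is stored as a set of numeric facial parameters and that the in-between frames are computed linearly. The parameter names (`jaw_open`, `lip_round`) and values are hypothetical:

```python
def interpolate_visemes(start, end, steps):
    """Return steps + 1 frames moving evenly from one viseme to the next."""
    frames = []
    for k in range(steps + 1):
        t = k / steps  # 0.0 at the first viseme, 1.0 at the second
        frames.append({key: (1 - t) * start[key] + t * end[key] for key in start})
    return frames


closed_lips = {"jaw_open": 0.0, "lip_round": 0.0}   # toy viseme for a bilabial
open_vowel = {"jaw_open": 1.0, "lip_round": 0.5}    # toy viseme for an open vowel

frames = interpolate_visemes(closed_lips, open_vowel, steps=4)
print(frames[2])  # {'jaw_open': 0.5, 'lip_round': 0.25}
```

Real systems may use smoother curves than a straight line, but the idea of filling in positions between key mouth shapes is the same.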
More advanced than text-driven lip-syncing
Speech-driven lip syncing
Uses ASR and visemes and emotions
Two diff types of visemes
Static= just sounds
Dynamic= represent sounds and their transitions
Three applications in music
Vocoder= analyzes speech into source and filter to create an encoded representation; used for telephones and in war and in music
Singing voice synthesis (SVS)= pitch and loudness are easy to create, but vocal expressiveness is hard, so human-like singing is difficult to achieve
What’s a singer’s formant
Concentration of acoustic energy at particular frequencies that allows singer to be heard over orchestra
Describe sound symbolism
Larger/less bright things OR smooth/mellow/rich things= back vowels
Cold/clean/crisp things= front vowels
Smaller/lighter/sharper things=voiceless consonants
Smaller/brighter/faster things= fricatives (compared to stops)
Describe negative effect
Sounds that give a feeling of disgust like ‘ew’
Characteristics of phone interviewers with low refusal rates
High pitch, louder, faster, greater pitch variation, sound knowledgeable, speak clearly, higher social class, better attitude
Five types of forensic phonetic activity
Ear witness speaker identification (voice parade of probative value)
Expert speaker ID (determining the likelihood that U=K)
Speaker profiling
Content determination
Copyright infringement (CIVIL law not criminal)
Why can’t a voiceprint exist?
Fingerprints=phenotypic
Voice=learned and organic components
Two types of forensic voice comparisons
Auditory-perceptual=categorical phonetic transcription (analytical)
Holistic voice perception (holistic)
Acoustic=acoustic phonetics (analytical) using spectrograms and pitch tracks F0-F4, LTS LTF and F0
Automatic speaker identification (holistic) computer based artificial intelligence for determining if U=K
How to vary pitch
Modify vocal fold tension
To make body seem bigger
Raise or lower larynx
Describe VSA
Vocal stress analysis (useless except to induce bogus pipeline effect)
4 components of pronunciation teaching
Listening
Feedback
Targeted instruction
Practice tasks