Ling 290 Final Flashcards

0
Q

How does ASR work?

A

The system receives acoustic input from a speaker through a microphone, analyzes it using a pattern/model/algorithm, and produces an output, usually in the form of text

1
Q

Automatic speech recognition (ASR)

A

Independent, machine-based process of decoding/transcribing oral speech

2
Q

What’s the difference between:
Speech recognition
Voice recognition
Speech understanding

A

Speech understanding/identification: determining meaning (not transcription)
Speech recognition: ability of machine to recognize what is being said (WHAT)
Voice recognition: ability of a machine to recognize speaking style (WHO)

3
Q

Describe the first ASR system and who invented it

A
In the early 1950s
Bell Telephone Laboratories
Davis, Biddulph, Balashek
Called AUDREY
Could recognize isolated digits from 0 to 9 for a single speaker
Speaker dependent
Required extensive training
4
Q

Describe the early ASR system template and its faults

A

Template-based recognition based on pattern matching (comparing the speaker’s input to stored acoustic templates/patterns)
Faults:
-not good for large vocab recognition
-can’t match speech sounds if they are a different length

5
Q

Who were the first ones to use a computer for an ASR?

A

Forgie and Forgie in 1959

6
Q

What did researchers experiment with in the 60s to improve ASR?

A

Researchers experimented with time-normalization techniques (dynamic time warping, DTW) to minimize differences in speech rate
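
A minimal sketch of how DTW can absorb speech-rate differences, assuming each utterance is reduced to a simple list of numbers (real systems compare multi-dimensional spectral frames). The function name dtw_distance and the sample sequences are made up for illustration and are not from the course material.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # A frame may match one-to-one, or be stretched/compressed in time
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3, 2, 1]              # stored pattern (hypothetical values)
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same contour spoken at half the rate
other = [3, 1, 3, 1, 3]                 # a different contour

print(dtw_distance(template, slow))     # no cost: only the timing differs
print(dtw_distance(template, other))    # positive cost: the shape differs
```

The slow version lines up with the template at zero cost because each template frame is allowed to match two input frames, which is the time normalization that plain template matching lacks.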

7
Q

What were the three milestones for ASR in the 70s?

A

1- focus on recognition of continuous speech
2- development of large vocab speech recognizers
3- speaker independent systems (so it could recognize a range of voices)

8
Q

What was the first commercial ASR system?

A

VIP-100 (won US national award)

9
Q

Describe the SUR project (speech understanding research)

A

ARPA started it (1971-1976) with the goal of creating a system capable of understanding the connected speech of many speakers with a 1,000-word vocabulary (in a low-noise environment) and an error rate of less than 10%

10
Q

What was the successful product of the SUR project?

A

Harpy; showed the benefits of data-based statistical models over template-based ones; first step towards hidden Markov modelling (HMM)

11
Q

What is HMM? (One ASR model of 1980s)

A

Hidden Markov modeling = based on complex statistical/probabilistic analyses
Represents language units (like phonemes or words) as a sequence of states with transition probabilities between each state
Uses the highest probability to predict the best answer
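
A minimal sketch of picking the highest-probability state sequence in an HMM, using the standard Viterbi algorithm. Everything in the toy model is an assumption for illustration: the two states "S" and "IY" stand in for phoneme-like units, the observations "noisy" and "voiced" stand in for acoustic frames, and all probabilities are invented.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the best previous state to transition from
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model (all numbers invented for illustration)
states = ["S", "IY"]
start_p = {"S": 0.8, "IY": 0.2}
trans_p = {"S": {"S": 0.6, "IY": 0.4}, "IY": {"S": 0.1, "IY": 0.9}}
emit_p = {
    "S": {"noisy": 0.9, "voiced": 0.1},
    "IY": {"noisy": 0.2, "voiced": 0.8},
}

prob, path = viterbi(["noisy", "noisy", "voiced", "voiced"],
                     states, start_p, trans_p, emit_p)
print(path)  # two "S" frames followed by two "IY" frames
```

Because a state can transition to itself, the same model handles a sound that lasts two frames or twenty, which is how HMMs cope with temporal variation.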

12
Q

Good and bad aspects of HMM?

A

Good: can analyze both temporal and spectral variations of speech signals and can decode continuous speech
Bad: require extensive training, large amount of memory, huge computational power for model-parameter storage and likelihood evaluation

13
Q

What is ANN? (Second ASR of 1980s)

A

Artificial neural network; was reintroduced after beginning in 1950s
Network consists of interconnected processing units combined in layers with different weights that are determined on the basis of training data
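
A minimal sketch of units combined in layers with different weights. The weights here are set by hand (an assumption made to keep the example short), whereas a real ANN learns them from training data. The function names and the task (output 1 only when exactly one input is on) are illustrative, not from the course material.

```python
def step(x):
    """Threshold activation: the unit fires (1) only for positive input."""
    return 1 if x > 0 else 0


def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of all inputs, then thresholds it."""
    return [
        step(sum(w * x for w, x in zip(unit_w, inputs)) + b)
        for unit_w, b in zip(weights, biases)
    ]


def tiny_network(x1, x2):
    # Hidden layer: unit 1 fires if either input is on, unit 2 only if both are
    hidden = layer([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5])
    # Output layer: fires when unit 1 is on and unit 2 is off
    output = layer(hidden, weights=[[1, -1]], biases=[-0.5])
    return output[0]


results = [tiny_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(results)
```

No single unit can solve this task alone; it only works because the output unit combines the two hidden units, which is the point of stacking layers.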

14
Q

Good and bad aspects of ANN?

A

Good: classification of static patterns (including noisy acoustic data) and useful for recognizing isolated speech units
Bad: systems based on ANN alone don’t work very well; they need to be paired with HMM

15
Q

First and second steps in commercialization of ASR? (1980-1990s)

A

ASR was used in telephone networks; portable speech recognizers were offered to the public; and ASR was integrated into products ranging from PC dictation systems to air traffic control training systems

16
Q

What were the 3 focuses of ASR in the 1990s? Plus two extras

A

1- larger vocab
2- spontaneous speech recognition
3- working in noisy environments

Plus

  • human to human speech recognition
  • visual speech recognition (based on lip positions and movements etc)
17
Q

Three areas of further progress of ASR in 2000s?

A

1- development of new algorithms
2- advances in noisy speech recognition
3- integration of speech recognition into mobile technologies like cellphones

18
Q

Speech recognition systems can be characterized by which 3 dimensions

A

1- speaker dependence (speaker-dependent, speaker-independent, adaptive)
2- speech continuity (isolated/discrete word recognition systems, connected word recognition systems, continuous speech recognition systems, word spotting systems)
3- vocab size

19
Q

3 errors in ASR

A

Errors in discrete speech recognition (deletion, insertion, substitution and rejection errors)
Errors in continuous speech recognition (same as discrete speech + splits and fusions)
Errors in word spotting (false rejects aka word is missed and false alarms aka word misidentified)
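
A minimal sketch of counting deletion, insertion and substitution errors by aligning what was said with what the system output, using a standard edit-distance alignment. The function name count_errors and the example sentences are assumptions for illustration. Rejection errors, splits, fusions and word-spotting errors are not covered here.

```python
def count_errors(reference, hypothesis):
    """Count substitutions, deletions and insertions via edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Walk back from the end of both sequences and label each edit
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1   # a spoken word is missing from the output
            i -= 1
        else:
            counts["insertion"] += 1  # the output has a word that was never spoken
            j -= 1
    return counts


print(count_errors("call the office now", "call office now"))
print(count_errors("turn left", "turn right"))
print(count_errors("stop", "please stop"))
```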

20
Q

Difference between a direct, indirect and intent error in ASR

A
Direct= human misspeaks/stutters
Intent= speaker decides to restate what's just been said
Indirect= ASR system incorrectly identifies what the speaker said
21
Q

What are three reasons why ASR is good for learning a language?

A
  • practice
  • motivation
  • feelings of communicating, not just repeating phrases and words
22
Q

What are the two options of ASR for automatic rating of pronunciation? 3 ways to reach these goals?

A

Two options= give global pronunciation rating OR identify specific errors
3 ways of achieving this= ASR must identify word boundaries, accurately match speech to the correct targets, and compare the two to see what was done right/wrong

23
Q

Describe spoken CALL dialogue systems

A

Used in software programs for practicing spoken languages
Provides one line of dialogue and then the speaker chooses one of two responses
If the response is wrong, the ASR system can recognize which response has been spoken (even if there are errors) and then the computer responds, allowing the learner to try again

24
Q

Spectrograph

A

Bell Telephone Laboratories
Late 1930s
Later Sonagraph made by Kay Elemetrics
Made for the phonetic study of speech
Direct Translator was another one, which had a fluorescent screen and was used to help deaf and foreign exchange students with pronunciation
Thought to be a “war project”; something to identify “voiceprints”

25
Q

What is the term used to refer to forensic phonetics that is still used in Russia and many east European countries?

A

Phonoscopy

26
Q

What are voiceprints?

A

Patterns seen on spectrograms for an individual voice

Kersta said they were 99% accurate and opened his own business called Voiceprint Laboratories corporation

27
Q

What’s the term for speech samples that are obtained at different times and later used for identification?

A

Non-contemporary speech samples

28
Q

What are the two factors that most influence identification accuracy?

A

Sample duration (only if it contains more phonemes than the original) and acoustic quality (background noise and bandwidth of recordings)

29
Q

What is one factor that affects formant data over the telephone?

A

Band-pass filtering (telephone cuts off some frequencies)
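
A minimal sketch of why band-pass filtering matters for formant data. It assumes a traditional narrowband telephone passband of roughly 300-3400 Hz and treats the cut-off as abrupt, whereas real filters attenuate gradually. The formant values are approximate, illustrative figures for a high front vowel like /i/, not measurements from the course material.

```python
PASSBAND_HZ = (300, 3400)  # assumed narrowband telephone passband


def formants_in_band(formants_hz, band=PASSBAND_HZ):
    """Report, for each formant, whether it falls inside the passband."""
    low, high = band
    return {name: low <= hz <= high for name, hz in formants_hz.items()}


# Illustrative formant frequencies in Hz for a high front vowel
vowel_i = {"F1": 270, "F2": 2290, "F3": 3010}

print(formants_in_band(vowel_i))
```

In this example F1 sits below the lower cut-off, so a measurement taken from a telephone recording could misrepresent it even though F2 and F3 pass through.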

30
Q

Familiarity with the speaker in phonetic forensics: testing under three conditions, what were the results?

A

Hollien: normal, disguised and stressed

Listeners who knew the person did better under all conditions

31
Q

What type of voice disguise provided the greatest effect?

A

Hypernasality

32
Q

What’s the percent of annual cases that used voice disguise?

A

15-25%

33
Q

Four reasons why real ear witnessing is different from the studies conducted?

A
  • they don’t mirror real life situations
  • the stress can’t be recreated
  • not familiar voices
  • real life witnesses are not prepared as study witnesses are
34
Q

What is an ear witness line-up?

A

Aka voice parade; used when a person has heard (not seen) the perpetrator

35
Q

Two questions asked in regards to using a voice parade? What are the answers?

A

How many foils should be used?
Too many = not good; fewer = more accurate, but the line-up still needs to be fair, so the best number is 5-6

How similar should the foils be?
The foils should be similar but not to the extreme (must match age, dialect, etc)

36
Q

What brain test can be used to tell whether someone is lying or not? What’s used for detecting changes in blood flow to the face?

A

fMRI (functional magnetic resonance imaging) because it shows a difference in brain activity while telling a lie

High resolution thermal imaging

37
Q

Describe the first lie detector

A

Polygraph
1917
Detects the stress level of the person through pulse/blood pressure/skin response

38
Q

What is a sign in the voice that someone is lying?

A

Micro tremor

39
Q

What does an Israeli based company market as a lie detector?

A

Layered Voice Analysis (LVA)= focuses on brain activity while someone speaks

40
Q

What is the word for fake that’s used in the article for forensic phonetics

A

Charlatans

41
Q

Focus on acquiring a native-like accent should shift to these two aspects for L2 learners

A

Intelligibility
Comprehensibility
(Focus on teaching melody and rhythm)

42
Q

Three aspects of training in computer based segmentals lessons?

A

Perceptual
Acoustic
Articulatory

43
Q

Waveforms

A

Graphical representations of sounds (visual acoustic displays)
Vertical axis= frequency of vocal fold vibration
Horizontal axis= time
3rd dimension= amplitude (loudness) of frequency at particular time represented by intensity or colour
***can’t display voiceless sounds!

44
Q

Describe CAPT

A

Computer assisted pronunciation training
Should provide output, input and feedback using ASR technology
Not as good as human feedback

45
Q

SUPRASEGMENTALS consist of what 3 aspects

A

Prosody
Intonation
Rhythm
***most important part of Comprehensibility

46
Q

Two ways to visually display prosody

A

1: y axis= melody
X axis= duration/length of syllables

2: waveforms that display the intensity and duration of words/sounds/periods of silence

47
Q

One software for practicing sentence intonation

A

TELL ME MORE

Shows movement of lips and pitch curves

48
Q

Two software programs that use video for prosody training

A

Anvil: provides a screen display of the video and audio components of a speech event with a pitch contour (created by Praat)

Real-Time Pitch Program + Computerized Speech Lab: produce a pitch contour in real time and compare the learner’s speech with that of a native speaker by overlapping their contours

(Real-Time Pitch and the Computerized Speech Lab are both from KayPentax)

49
Q

Dimensions of accent

A

Salience
Intelligibility
Comprehensibility

50
Q

What scale is accentedness based on?

A

Likert scale

51
Q

Pedagogy

A

Science and art of education

52
Q

What error causes great reduction in Comprehensibility?

A

High functional load (FL)

53
Q

What are the three principal perspectives on pronunciation?

A

Medical view
Business view
Pedagogical view

54
Q

Foreigner talk

A

When a speaker adapts their language so that it’s easier for the L2 listener to understand (aka modified input)

55
Q

An L2 speaker can use accent to express identity only if the accent features are

A

Volitional

56
Q

What are the three types of accent discrimination

A

1- stereotyping, usually through shibboleths
2- harassment/ mocking
3- being told that accent is unacceptable for a job that doesn’t require language skills

57
Q

The way one person speaks is called what? (Forms a dialect)

A

Idiolect

58
Q

What three variables explain the process by which speech rate affects consumer response?

A

Speed
Pitch
Interphrase pausation

59
Q

What syllable speed and pitch are better for advertising?

A

Faster than normal syllable speed

Low pitch

60
Q

Three benefits of a marketer with an attractive voice

A

1- style of delivery can express info about the brand’s message or functionality
2- attract attention
3- get positive responses from listeners

61
Q

What does ELM say about compression in advertising?

A

Elaboration Likelihood Model; compression = less time to think about ad and less attention given

62
Q

What is normal interphrase pausation?

A

0.5 seconds

63
Q

How does pitch affect advertising

A

High pitch = less competent, less benevolent, less truthful, etc
When people aren’t able to process the message they rely on pitch

64
Q

What’s a good way to fit more info into an ad without affecting the message?

A

Shortening interphrase pausation

65
Q

Dictation software

A

ASR systems that transcribe speech to writing; helpful for disabled people with writing problems

66
Q

What is CALL?

A

Computer assisted language learning; so L2 learners can learn with a computer for feedback

67
Q

Stochastic vs deterministic processes

A

Stochastic= based on statistical probabilities (considering more than one solution to a problem) *now used for ASR

Deterministic= assumed single solution if series of steps followed *beginning of ASR

68
Q

Describe Peterson and Barney’s study (ASR)

A

Vowel acoustics; used spectrograms to measure formant frequencies in vowels of American English

69
Q

Describe what teaching accents analytically means

A

Teaching the actor to use particular words/consonants/word pronunciations/intonation patterns
Requires understanding of phonetic details of different accents

70
Q

What’s the term for using ordinary letters to give phonetic spellings?

A

Faux phonetic transcription (compared to real IPA)

71
Q

What are the inner circle English dialects?

A
General Canadian
General American
UK English
Australian English
New Zealand English
72
Q

What are the two aspects in which the inner circle English dialects differ?

A

1- rhoticity: rhotic English includes General Canadian English, General American English and Irish English (pronounce post-vocalic R)
Non-rhotic English includes Received Pronunciation (RP), Australian English and New Zealand English (lack post-vocalic R but have intrusive R)

2- vowel quality (see photo)

73
Q

What’s the opposite of the analytical approach to learning accents?

A

Holistic approach= aiming to produce the essence of an accent by doing what seems appropriate instead of analyzing each sound (copying speech patterns/mannerisms)
Good for learning prosody and voice quality differences

74
Q

Describe the McGurk effect

A

First reported in 1976; we can hear a different sound depending on the visual info given (the lips move differently while the audio stays the same, but we perceive a change)
Ex: audio /ba/ is heard as /ba/ with a bilabial gesture, but the same audio is heard as /va/ or /fa/ when presented with a labiodental gesture
Hearing bilabial /ba/ while seeing the speaker produce velar /ga/ = perceived as alveolar /da/

75
Q

Which neurons are involved in the perception of speech?

A

Mirror neurons

76
Q

What parts of speech cannot be perceived by speech-reading?

A

Sounds far back in the vocal tract
Nasalization
Voicing

77
Q

What is the animation technique where a human models the utterances to be produced by an animated character?

A

Rotoscoping; makeup is used to mark the face

78
Q

Downsides of rotoscoping (3)

A
  • time consuming
  • unnatural results
  • human needs to say everything
79
Q

Viseme

A

Basic facial configuration associated with a particular speech sound or group of sounds
Disney had 12
Now 22

80
Q

When a computer analyzes the text of a script to be spoken by an animated character

A

Text-driven lip-syncing
A TTS system
Computes intermediate positions between visemes = interpolation
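
A minimal sketch of interpolation between two visemes, assuming each viseme is described by a few numeric mouth parameters and that the in-between frames are spaced linearly. The parameter names jaw_open and lip_round, the values, and the function name are hypothetical. Real animation systems may use more parameters and smoother curves.

```python
def interpolate_visemes(start, end, steps):
    """Return `steps` in-between mouth shapes between two visemes (endpoints excluded)."""
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)  # fraction of the way from start to end
        frames.append({key: start[key] + t * (end[key] - start[key]) for key in start})
    return frames


# Hypothetical viseme parameters on a 0-1 scale
closed_lips = {"jaw_open": 0.0, "lip_round": 0.0}
open_vowel = {"jaw_open": 1.0, "lip_round": 0.5}

frames = interpolate_visemes(closed_lips, open_vowel, 3)
for frame in frames:
    print(frame)
```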

81
Q

More advanced than text-driven lip-syncing

A

Speech-driven lip syncing

Uses ASR and visemes and emotions

82
Q

Two different types of visemes

A

Static= just sounds

Dynamic= represent sounds and their transitions

83
Q

Three applications in music

A

Vocoder= analyzes speech into source and filter to create an encoded representation; used for telephones and in war and in music

Singing voice synthesis (SVS)= pitch and loudness are easy to create but vocal expressiveness is hard, not to create human like singing

84
Q

What’s a singer’s formant?

A

Concentration of acoustic energy at particular frequencies that allows singer to be heard over orchestra

85
Q

Describe sound symbolism

A

Larger/less bright things OR smooth/mellow/rich things= back vowels

Cold/clean/crisp things= front vowels

Smaller/lighter/sharper things=voiceless consonants

Smaller/brighter/faster things= fricatives (compared to stops)

86
Q

Describe negative affect

A

Sounds that give a feeling of disgust like ‘ew’

87
Q

Characteristics of phone interviewers with low refusal rates

A
High pitch
Louder
Faster
Greater pitch variation
Sound knowledgeable
Speak clearly
Higher social class
Better attitude
88
Q

Five types of forensic phonetic activity

A

Ear witness speaker identification (voice parade of probative value)
Expert speaker ID (determining the likelihood that U=K)
Speaker profiling
Content determination
Copyright infringement (CIVIL law not criminal)

89
Q

Why can’t a voiceprint exist?

A

Fingerprints=phenotypic

Voice=learned and organic components

90
Q

Two types of forensic voice comparisons

A

Auditory-perceptual=categorical phonetic transcription (analytical)
Holistic voice perception (holistic)

Acoustic=acoustic phonetics (analytical) using spectrograms and pitch tracks F0-F4, LTS LTF and F0
Automatic speech identification (holistic) computer based artificial intelligence for determining if U=K

91
Q

How to vary pitch

A

Modify vocal fold tension

92
Q

To make body seem bigger

A

Raise or lower larynx

93
Q

Describe VSA

A

Vocal stress analysis (useless except to induce bogus pipeline effect)

94
Q

4 components of pronunciation teaching

A

Listening
Feedback
Targeted instruction
Practice tasks