Summary Flashcards
Most fundamental qualities of sound
Pitch (frequency) and loudness (amplitude)
The larynx is formed by 4 cartilages
- Thyroid
- Cricoid
- 2 Arytenoids
Vocal folds and Vocal Tract
Vocal folds are two bands of muscle that are located within the larynx (voice box). They vibrate when air is pushed through them, producing sound.
The vocal tract is the area of the body which includes the vocal folds and all of the other structures involved in producing sound, such as the mouth, nose, and throat.
When do the vocal folds shorten and when do they lengthen?
Shorten: thyroarytenoid muscle contracts –> arytenoids slide –> distance between vocal processes and thyroid prominence decreases
Lengthen: cricothyroid muscle contracts –> thyroid and cricoid cartilages rotate –> distance between vocal processes and thyroid prominence increases
What does contraction of the laryngeal muscles (which move the cartilages) do?
manipulates the length of the vocal folds, plus abduction (vocal folds further apart) and adduction (vocal folds closer together)
3 involved systems in speech production
- subglottal system (initiation phase –> breathing)
- glottal system (phonation phase –> vocal fold vibration via the Bernoulli effect, controlled by movement of the laryngeal cartilages)
- supra-glottal system (articulation phase –> oral and pharyngeal cavity)
Characterizing vowels and consonants
vowels:
- location (front, central, back) –> front means higher F2
- tongue position (high, mid, low) –> high means lower F1
- mouth position (rounded or unrounded)
consonants:
- place
- manner
- voicing (voiced or voiceless)
speech characteristics
- Periodicity –> voiced
- local maximum –> vowel
- silence and pre voicing –> plosive
- noise –> fricatives
- burst –> plosive
- change in amplitude –> change in sound
- change in sound structure –> change in mouth position
Coarticulation
the process of blending one sound into another in order to achieve a desired pronunciation
- anticipatory (u influences word onset in stew)
- carryover (u influences consonant in use)
prosodic features
- properties of larger units of speech that reflect elements of language not encoded by grammar or choice of vocabulary
- To convey meaning and emotion
- intonation (use of pitch to convey meaning in speech)
- stress (emphasis placed on certain syllables of a word or phrase)
- Tone (the emotion or attitude in speech)
Two parts of the Fourier spectrum
- Amplitude spectrum
- Phase spectrum
Fourier transform
The Fourier transform is a mathematical technique used to transform a signal from its time domain into its frequency domain.
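A minimal sketch of this idea, assuming a sampled signal and a naive discrete Fourier transform (the test signal and the name `dft` are illustrative choices, not from the course). It shows how one transform yields both an amplitude spectrum and a phase spectrum:

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform, O(N^2); returns complex coefficients."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical test signal: a cosine with 2 full cycles over 8 samples.
n_samples = 8
signal = [math.cos(2 * math.pi * 2 * t / n_samples) for t in range(n_samples)]

spectrum = dft(signal)
amplitude = [abs(c) for c in spectrum]       # amplitude spectrum
phase = [cmath.phase(c) for c in spectrum]   # phase spectrum (radians)

# Strongest component among the non-negative frequency bins.
peak_bin = max(range(n_samples // 2 + 1), key=lambda k: amplitude[k])
```

The energy ends up in the bin that matches the number of cycles in the time-domain signal. Real applications would typically use a fast Fourier transform instead of this quadratic version.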
Explain briefly how the functionality of the cochlea is similar to Fourier Analysis
The functionality of the cochlea is similar to Fourier analysis in that it breaks down sound waves into their frequency components. Different places along the basilar membrane respond best to different frequencies (high frequencies near the base, low frequencies near the apex), so the membrane mechanically separates the signal into frequency bands. Hair cells at each place then convert the vibration into electrical signals, allowing the auditory system to interpret the sound.
Path of sound
Ear canal –> eardrum –> ossicles –> cochlea
Three small bones (ossicles) in middle ear and function
Malleus, incus and stapes
to transmit tiny sound vibrations to the cochlea
Function and parts of inner ear
- Cochlea
- Basilar membrane
- oval window and round window are openings
responsible for converting sound waves into electrical signals that can be interpreted by the brain
The cochlea also helps to filter out background noise and adjust the volume of incoming sounds. (Bandpass-filter)
Outer ear, parts and function
- auricle (outside)
- ear canal (connects to middle ear)
funneling the acoustic wave into ear canal
middle ear, parts and function
transfers vibrations of air particles into vibrations of mechanical structures
- Eardrum
- ossicles (malleus incus stapes)
What does the acoustic reflex do?
the stapedius muscle spans the space between the stapes and the wall of the middle ear; when it contracts, it reduces the motion of the stapes
- protects ear from loud noise
Otitis media with effusion
Infection in which the middle-ear cavity fills up with fluid, so the middle ear no longer acts as an impedance bridge between the air-filled ear canal and the fluid-filled cochlea.
Mel scale frequency
a logarithmic frequency scale used to measure the perceived pitch of a sound
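A minimal sketch of the conversion, assuming one widely used formula for the mel scale, m = 2595 * log10(1 + f / 700). Other variants exist, and the function names are illustrative:

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (assumed 2595 * log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

# The same 1000 Hz step is a much smaller perceived step at high frequencies.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
```

With this formula 1000 Hz maps to roughly 1000 mel, and equal steps in Hz shrink in mels as frequency rises.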
Basic idea of Fourier transform
any signal can be approximated by sum of cosines
VoCoder
- Encoder coding the speech
- Decoder re-synthesizing speech
technique for coding speech more efficiently for long-distance phone calls
A3 Scrambling
to encode longer distance radio-telephone calls
- frequency bands were rearranged and inverted
- intercepted and decoded by Germans
SIGSALY (Project X or Green Hornet)
based on Vocoder
- encryption needed a shared key (white noise stored on 2 vinyl phonograph records)
- special turntables to synchronize time
Concatenation
process of splicing together pieces of pre-recorded speech
Signal processing modification
process of changing a pre-recorded signal to produce a desired sound
Advantages and disadvantages of concatenation
A
: ability to produce natural-sounding speech
: flexibility in creating new words
: speed of production
D
: lack of control over the sound of the speech
: its susceptibility to error
: inability to produce continuous speech
Advantages and disadvantages of signal processing modification
A
: provides a greater degree of control over the sound of the input
D
: more computationally intensive
: more difficult to create new words or phrases with this technique
Challenges for speech perception
- Lack of invariance problem
- phonetic environment
- differing speech conditions (tempo)
- speaker variation (dialects)
- perceptual constancy and normalization
- ability to recognize and interpret speech sounds regardless of the context
- map signals to independent category
- speech segmentation problem
- difficult to identify and segment individual speech sounds
First generation speech synthesis
generated by explicit model
- articulatory synthesis –> using physiological models that simulate the movement of the vocal tract and articulators.
- source-filter models –> two components combined, a source (vocal folds) with a filter (vocal tract)
- formant synthesizers –> digital synthesizers that use combination of source-filter and pre-recorded vocal sample to generate realistic sounding speech
Cochlear implants (application of the SIGSALY)
–> neuroprosthetic device that bypasses the normal acoustic hearing process by electric stimulation of auditory nerve
Generations of speech synthesis
first –> source waveform is generated by explicit model
second –> source waveform is generated by data
third –> source waveform is learned from the data
second generation speech synthesis
tradeoff between processing speed and memory
- model based
- sample based
third generation of speech synthesis
input is Mel frequency cepstral coefficients
- divide signal in frames of 20-40 ms
- mel filter bank (determine filter bank energies)
- log transform
- compute discrete cosine transform (DCT)
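A rough sketch of the first and last steps of this pipeline, assuming the mel filter bank energies are already available (the energy values, frame length and hop size below are made up for illustration). The filter bank itself is not implemented here:

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def dct_ii(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def cepstral_coefficients(filterbank_energies, n_coeffs):
    """Log transform followed by DCT; keeps the first n_coeffs coefficients."""
    log_energies = [math.log(e) for e in filterbank_energies]
    return dct_ii(log_energies)[:n_coeffs]

# Assumed setup: 1 second of audio at 16 kHz, 25 ms frames, 10 ms hop.
sample_rate = 16000
frames = frame_signal([0.0] * sample_rate, frame_len=400, hop=160)

# Hypothetical filter bank energies for a single frame.
energies = [1.0, 2.0, 4.0, 8.0]
coeffs = cepstral_coefficients(energies, n_coeffs=3)
```

The DCT decorrelates the log energies, so a small number of coefficients summarizes the spectral envelope of each frame.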
Unit selection
- Generating speech from a database of pre-recorded speech samples by selecting the most appropriate units of speech from the database
++ more natural speech
– less generalizable and more recordings needed
diphones
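a unit running from the middle of one phone to the middle of the next, capturing the transition between two adjacent phones; diphones are concatenated to form words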
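A small sketch of how a phone sequence maps to diphone units, assuming a silence label at the word edges (the phone labels and the `sil` marker are illustrative, not a standard inventory):

```python
def to_diphones(phones, boundary="sil"):
    """Pair each phone with its successor, padding with a boundary label."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# Hypothetical phone sequence for the word "cat".
diphones = to_diphones(["k", "ae", "t"])
# diphones == ["sil-k", "k-ae", "ae-t", "t-sil"]
```

A word with n phones needs n + 1 diphones when both edges are padded.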
advantages and disadvantages of third generation speech synthesis
A
: trained automatically, which avoids hand-written rules
: high quality synthesis and compact
D
: speech has to be generated by a parametric model, so the final quality depends on the parameter-to-speech technique used
applications of text to speech
- people with visual impairments to listen to text
- listening to text during driving
- travel information in public transport
components of a text to speech synthesizer
- text analysis
- identify tokens
- tokenizing (split in smaller chunks)
- normalization (determine spoken variant of each token)
- linguistic analysis
- phonemes
- prosodic information (intonation, duration, stress, rhythm)
- waveform generation (1,2,3)
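A minimal sketch of the text analysis stage only (tokenizing and normalization), under simplifying assumptions: the abbreviation table is hypothetical and numbers are read out digit by digit. Linguistic analysis and waveform generation are not covered:

```python
import re

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def tokenize(text):
    """Split on whitespace, then peel trailing punctuation off non-abbreviations."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        match = re.fullmatch(r"(.*?)([.,!?;:]*)", chunk)
        word, punct = match.group(1), match.group(2)
        if word:
            tokens.append(word)
        tokens.extend(punct)  # each punctuation mark becomes its own token
    return tokens

def normalize(token):
    """Determine a spoken variant of a token."""
    lowered = token.lower()
    if lowered in ABBREVIATIONS:
        return ABBREVIATIONS[lowered]
    if token.isascii() and token.isdigit():
        return " ".join(DIGIT_NAMES[int(d)] for d in token)
    return lowered

tokens = tokenize("Dr. Smith lives at 42 Elm St. now.")
spoken = [normalize(t) for t in tokens]
```

A real system would need context to pick between readings, for example "St." as "street" or "saint", or "42" as "forty-two".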
Corpus
a collection of texts with some unifying characteristics
regular expression
sequence of characters that define a search pattern in strings of text such as words, phrases and numbers
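A short example with Python's `re` module. The sample sentence and the patterns are made up for illustration:

```python
import re

text = "Call 555-0134 or 555-0199 before 5 pm."

# Three digits, a hyphen, four digits, bounded by word boundaries.
phone_pattern = re.compile(r"\b\d{3}-\d{4}\b")
phone_numbers = phone_pattern.findall(text)

# Runs of letters, i.e. a crude definition of a word.
words = re.findall(r"[A-Za-z]+", text)
```

Here `phone_numbers` holds the two number-like strings and `words` holds only the alphabetic tokens.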
Major uses of corpora?
- applicative (develop nlp tools)
- analytical (empirical basis on the distribution of constructions and language phenomena)
what regular expressions are used for
- normalizing text (converting to a standard form)
- tokenization (splitting text into words)
- lemmatization (mapping words to their shared root form)
- stemming (stripping affixes to get a simpler stem)
- sentence segmentation (breaking text into sentences)
- comparing words and strings
dimensions of variation
- multiple languages (code switching)
- genre (source of the text)
- demographic characteristics writer
- language changes over time
datasheet properties
motivation
situation
language variety
collection process
annotation process
distribution
normalization process
- tokenizing
- token learner
- token segmenter
- normalizing word formats
- case folding (lower case)
- lemmatization
- morphological parsing
- stemming
- segmenting sentences
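A toy sketch of three of these steps: sentence segmentation, case folding and stemming. The suffix list is a made-up stand-in for a real stemmer, and the sentence splitter would break on abbreviations:

```python
import re

def segment_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def case_fold(token):
    """Map a token to lower case."""
    return token.lower()

# Hypothetical suffix list; real stemmers use many ordered rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sentences = segment_sentences("The dogs barked. They were jumping!")
stems = [naive_stem(case_fold(w)) for w in re.findall(r"[A-Za-z]+", sentences[0])]
```

Stemming only chops characters, so it can produce non-words. Lemmatization would instead look up the dictionary form.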
Homophones and homographs
homophones –> same sound, different spelling
homographs –> same spelling, different sound
Semantic relations
synonymy, antonymy, hypernymy/hyponymy, meronymy/holonymy, co-hyponyms
synonymy
house - villa
same sense, different word
antonymy
good - bad (opposites)
hypernymy/ hyponymy
“dog” is a hyponym of the word “animal”, and “animal” is a hypernym of “dog”
because animal is less specific
meronymy / holonymy
finger is a meronym of hand because it is a part of the hand
hand is the holonym of finger because it is the whole
co-hyponyms
cat and dog are co-hyponyms because both are types of animal
associated words
cup and coffee, because they belong to the same semantic field
Connotation / evaluation
positive (happy) negative (sad) connotation
pos (great). neg (terrible) evaluation
important dimensions of affective meaning
1 valence (negative or positive)
2 arousal (excited or not)
3 dominance (control or not)
sentiment
positive or negative evaluative language
two most commonly used models in vector semantics
tf-idf and word2vec
tf-idf
measure the importance of a term in a document relative to other documents in a corpus
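A minimal sketch, assuming one common weighting: tf = 1 + log10(count) and idf = log10(N / df). Other tf-idf variants exist, and the toy documents are made up:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked", "barked"],
]
weights = tf_idf(docs)
```

A term that occurs in every document ("the") gets weight 0, while a term that is frequent in one document and rare elsewhere ("barked") gets the highest weight.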
word2vec
methods used to represent words in a vector space in order to capture semantic and syntactic relationships between words
cosine similarity
measure of similarity between two vectors, which is calculated by taking the cosine of the angle between the vectors
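As a formula, cos(u, v) = (u · v) / (|u| |v|). A minimal sketch for plain Python lists (the function name is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])             # 0
```

The value depends only on direction, not on vector length, which is why it suits word vectors built from raw counts.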
PMI (pointwise mutual information)
measures whether two words co-occur more often than expected by chance
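A minimal sketch of pointwise mutual information from raw counts, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), plus the positive variant that clips negative values to 0. The counts in the example are made up:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits, from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def ppmi(count_xy, count_x, count_y, total):
    """Positive PMI: negative values (and unseen pairs) become 0."""
    if count_xy == 0:
        return 0.0
    return max(0.0, pmi(count_xy, count_x, count_y, total))

# Hypothetical counts: the pair occurs 10 times in 1000 observations.
score = pmi(count_xy=10, count_x=20, count_y=50, total=1000)
```

A positive score means the pair co-occurs more often than independence would predict. Here the pair is 10 times more frequent than expected.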
Skip-gram vs CBOW
two methods used to represent words in a vector space
- CBOW is a method used to predict a target word given a set of context words
- Skip-gram is a method used to predict the context words given a target word
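A sketch of the training examples each method would see, following the standard word2vec definitions: skip-gram takes the target word as input and predicts each context word, CBOW takes the context words as input and predicts the target. The window size and sentence are illustrative, and no model is trained here:

```python
def skipgram_pairs(tokens, window):
    """(input target word, output context word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(input context words, output target word) examples."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

tokens = ["she", "wrote", "a", "book"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Skip-gram produces one example per (target, context) pair, while CBOW produces one example per position with the whole context as input.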
two kinds of similarity
first-order co-occurrence (wrote and book)
if they are nearby
second-order co-occurrence (wrote and said)
if they have similar neighbors
aims of identifying opinions (opinion mining)
- SO polarity (is the text subjective or objective?)
- PN polarity (is the opinion positive or negative?)
- strength of PN polarity
- extracting opinions
Balanced corpus
big in size
mixed language
full texts
different domains and genres
range of text categories
well documented
classifying corpora
1 mode (written, spoken, mixed…)
2 representativeness (balanced, specialized)
3 time (diachronic, synchronic)
4 language (mono, multi, parallel, comparable)
5 sampling (full documents, sample)
6 mark up (raw, annotated)