Audio Analysis and Assessment Flashcards
Audio represents…
sound pressure changes over time
Sound is converted into what, by a transducer?
Voltage
Audio can be … or …
Continuous or Discrete
Continuous signal represents…
Real world sound pressure variations
An example of continuous signal equipment is…
Analogue Equipment
Discrete signal represents…
sound as a series of ones and zeros
ADC stands for…
Analogue to Digital Converter
Sampling frequency is found on what axis?
X-axis
Sampling frequency is…
Number of samples taken per second
Sampling frequency is measured in…
Hertz
Amplitude quantisation is found on what axis?
Y-axis
Amplitude quantisation is a…
Binary Encoding Scheme
In amplitude quantisation, the number of bits dictates…
The number of levels we can represent
What is the amplitude quantisation bit equation?
2^n
n = the number of bits
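A quick sketch of the 2^n card, showing how fast the number of levels grows with bit depth. The function name `quantisation_levels` is made up for this example.

```python
def quantisation_levels(bits: int) -> int:
    """Number of amplitude levels a given bit depth can represent (2^n)."""
    return 2 ** bits


for bits in (8, 16, 24):
    print(f"{bits} bits -> {quantisation_levels(bits)} levels")
```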
What are three examples of digital audio files?
WAV, AIFF, AU
In digital audio files, the Y-axis could represent? (4)
- Normalised (-1 - 1)
- Sample Value
- dB
- Percentage
In digital audio files, the X-axis could represent? (2)
- Time
- Samples
What does PCM stand for?
Pulse Code Modulation
Aspects of the PCM encoding process directly affect…
Signal quality
Digital data can only represent…
A finite set of values
Digital data’s finite set of values is set by…
The number of bits
What is a digital value?
The nearest approximation of the analogue signal
Approximation introduces what to each sample?
Quantisation error
What is quantisation error?
It is the difference between the analogue input signal and the quantised level assigned by the encoder
Quantisation =
Approximation
When is the maximum quantisation error reached?
At a half step
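A minimal sketch of the half-step card, assuming a uniform quantiser over a normalised -1 to 1 range (the range and the helper name `quantise` are assumptions, not from the cards). It rounds each sample to the nearest level and checks that the worst error never exceeds half a step.

```python
import math


def quantise(x: float, bits: int) -> float:
    """Round x to the nearest level of a uniform quantiser (assumed range -1..1)."""
    step = 2.0 / (2 ** bits)
    return round(x / step) * step


bits = 4
step = 2.0 / (2 ** bits)
# One cycle of a sine wave at 90% of full scale
samples = [0.9 * math.sin(2 * math.pi * n / 1000) for n in range(1000)]
worst = max(abs(s - quantise(s, bits)) for s in samples)
print(f"step = {step}, half step = {step / 2}, worst error = {worst:.4f}")
```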
What does quantisation error create?
Quantisation Noise
What does SNR stand for?
Signal to Noise Ratio
As SNR increases, noise…
Decreases
As SNR decreases, the distance between signal and noise…
Decreases
What does SQNR stand for?
Signal to Quantisation Noise Ratio
What two factors does SQNR have?
- Number of bits encoding audio
- Input signal amplitude
What is the SQNR equation?
SQNR = (6.02*B)+1.76dB, where B = number of bits (16, 24, etc.)
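The SQNR equation worked through for common bit depths. This is only a sketch of the card's formula (which describes a full-scale sine input); `sqnr_db` is a hypothetical helper name.

```python
def sqnr_db(bits: int) -> float:
    """Theoretical SQNR in dB for a full-scale sine: 6.02 * B + 1.76."""
    return 6.02 * bits + 1.76


for bits in (8, 16, 24):
    print(f"{bits} bits -> {sqnr_db(bits):.2f} dB")
```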
Under what two input signal conditions is quantisation noise similar to white noise?
When signal has large amplitudes
or
When signal has wide bandwidth
What two problems occur when input signal has a low amplitude?
- Relative magnitude of distortion increases (SQNR decreases)
- Quantisation noise is correlated with the input signal
What’s the difference between quantisation distortion and white noise?
Distortion is more annoying due to its unpredictability
What are the two ways to reduce quantisation noise?
- Increasing bit depth
- Dither
How does increasing bit depth reduce quantisation noise?
Each additional bit increases SQNR by 6dB (halving QN)
Increasing bit depth causes what issue?
Increasing bit depth increases processing burden
What is dithering?
Adding noise to signal before quantisation to reduce the audible effect of quantisation error
As well as reducing the audible effects of quantisation error, dithering does what at low amplitudes?
Randomises quantisation error
Why does dither work even though quantisation error can still be audible?
Noise is easier to listen to than distortion so dither helps make audio less annoying
Most audio we hear is… (hint - digital files, streaming)
Compressed
Noise is created when quantisation depth is manipulated by…
Compression
Nyquist frequency is…
Half of sampling rate
Signals sampled at discrete intervals have…
An upper limit to frequencies
When above Nyquist frequency, there aren’t enough…
Samples per period to reproduce the input signal correctly
What is aliasing?
When frequencies greater than Nyquist frequency appear as lower frequencies within the spectrum
What happens when sampling at twice the highest frequency in the spectrum?
A correct representation of the whole frequency spectrum
Aliasing can be looked at from both…
A time and frequency domain perspective
Aliasing can be avoided by having at least how many samples per cycle of waveform?
Two
When does aliasing occur? (2)
- When sample rate is too low
- When a signal above half the sampling frequency (Nyquist) is observed by the system
Aliasing introduces what to audio?
Unwanted frequencies
What is the aliasing equation?
Af = Fs - F
Fs = sampling frequency
F = input frequency
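A sketch of the aliasing equation. The card's Af = Fs - F applies to inputs between Nyquist and Fs; the folding used below is a generalisation assumed here so that inputs below Nyquist come back unchanged. `alias_frequency` is a hypothetical name.

```python
def alias_frequency(f_in: float, fs: float) -> float:
    """Frequency at which f_in appears after sampling at fs (folded into 0..fs/2)."""
    return abs(f_in - fs * round(f_in / fs))


fs = 44100
for f_in in (1000, 30000, 40000):
    print(f"{f_in} Hz sampled at {fs} Hz appears at {alias_frequency(f_in, fs)} Hz")
```

For 30,000 Hz at Fs = 44,100 Hz this matches the card: 44,100 - 30,000 = 14,100 Hz.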
Aliasing affects what frequencies?
All frequencies above Nyquist frequency
Sampling process is called…
Pulse Code Modulation
What occurs around the carrier frequency?
Sidebands
Sidebands occur around carrier if bands aren’t…
Limited
Sidebands make output spectrum…
Complex
What is the sideband equation?
(n * Fc) +/- Fm
In terms of sidebands, what component of audio is the carrier and what component is the modulator?
Audio signal = Modulator
Sampling frequency = Carrier
The input signal spectrum forms sidebands around…
Integer multiples of the sampling frequency
When do sidebands move closer together (overlap)?
When sampling frequency is less than twice the highest frequency
When do sidebands increase in width (overlap)?
When audio signal is greater than Nyquist frequency
Anti-aliasing filters remove…
Frequencies above Nyquist frequency
- abs() function is used for measuring…
- Why?
- Peak on bipolar waves
- abs() ignores negative values
- How do we measure dB?
- If amplitude decreases by half, what is the change in dB?
- 20log10(a/b)
- -6dB
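A small sketch of the dB card, assuming the log is base 10 and `a` and `b` are amplitudes. `db_change` is a hypothetical helper; halving the amplitude comes out at roughly -6 dB.

```python
import math


def db_change(a: float, b: float) -> float:
    """Level of amplitude a relative to amplitude b, in dB: 20 * log10(a / b)."""
    return 20 * math.log10(a / b)


print(f"half amplitude:   {db_change(0.5, 1.0):.2f} dB")
print(f"double amplitude: {db_change(2.0, 1.0):.2f} dB")
```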
What is the dB change for every bit increased?
6dB
What does RMS stand for?
Root Mean Square
What does RMS represent?
Distribution of sample values
What info does RMS give us?
Average energy/power
RMS can be affected by…
Compression
What is the crest equation?
Crest = 20log10(peak amplitude / RMS)
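A sketch tying peak (via abs()), RMS and crest together, assuming a base-10 log. The helper names are made up for illustration. A sine wave should come out near 3 dB and a square wave at 0 dB.

```python
import math


def peak(samples):
    """Peak of a bipolar signal: abs() folds negative samples onto positive."""
    return max(abs(s) for s in samples)


def rms(samples):
    """Root mean square: square each sample, take the mean, then the root."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def crest_db(samples):
    """Crest in dB: 20 * log10(peak / RMS)."""
    return 20 * math.log10(peak(samples) / rms(samples))


sine = [math.sin(2 * math.pi * n / 100) for n in range(100)]
square = [1.0 if n < 50 else -1.0 for n in range(100)]
print(f"sine crest:   {crest_db(sine):.2f} dB")
print(f"square crest: {crest_db(square):.2f} dB")
```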
The ratio between peak amplitude and RMS is called…
Crest
Crest controls…
Relationship between average energy and peak values
What are the equations for frequency and period?
f = 1 / T
T = 1 / f
Why do audio signals change dynamically over time?
Because amplitude and frequency change
What is based on frequency, amplitude and time parameters?
Human hearing response
- Distinguishing separate frequencies throughout audible frequency range isn’t…
- What is the term for the above?
- Constant
- Discrimination
As well as distinguishing separate frequencies throughout frequency range not being constant, what else is not constant?
Sensitivity
Amplitude response has a…
Very large dynamic range
What is the threshold of feeling in dB?
120dB
Give an example of a non-linear graph.
Fletcher-Munson curve
The Fletcher-Munson curve shows…
Non-linear sensitivity over frequency
As frequency increases, resolution…
Decreases
Humans find it harder to discriminate … frequencies.
Higher
Log scales and constant Q reflect…
Human perception of frequency/pitch
What is constant Q?
A constant ratio of centre frequency to bandwidth
As band centre frequency increases, bandwidth…
Increases
As bandwidth increases, frequency…
Increases
- What is the equation for Q?
- What is heavy cool about this?
- Q = centre frequency / bandwidth
- Q will always remain constant
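A sketch of the Q card: holding Q fixed, bandwidth grows in step with centre frequency. The value Q = 4 and the function names are arbitrary choices for the example.

```python
def q_factor(centre_hz: float, bandwidth_hz: float) -> float:
    """Q = centre frequency / bandwidth."""
    return centre_hz / bandwidth_hz


def bandwidth_for(centre_hz: float, q: float) -> float:
    """Bandwidth needed at a centre frequency to keep Q constant."""
    return centre_hz / q


q = 4.0  # arbitrary example value
for centre in (250, 500, 1000, 2000):
    print(f"centre {centre} Hz -> bandwidth {bandwidth_for(centre, q)} Hz")
```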
What two things are crucial to audio processing operations?
- Frequency
- Amplitude
What does audio frequency analysis do?
Extract frequency from signal
Audio frequency analysis describes…
Frequency and amplitude over time
What is the most common approach to extract frequency information?
Fourier Analysis
Our boy, Fourier, stated - ‘Any periodic function may be represented as…
An infinite series of harmonically related sinusoids’
In terms of Fourier, an input signal is a combination of…
Harmonically related sinusoids
Why do we want good frequency resolution?
To see down to the individual frequencies
Why do we want good time resolution?
To see down to a few milliseconds
We can think of Fourier analysis frequency resolution as…
A series of frequency bands or filters
- In Fourier analysis frequency resolution, bands are…
- Unlike…
- Spaced linearly
- Human hearing system
Analysis bins refer to…
Bands
Frequency resolution is determined by…
The number of samples of the input signal
Close spaced frequencies separate when…
Filters narrow
To increase accuracy, we can increase what three things?
- Transform
- Samples
- Frequency Resolution
What is the bin bandwidth equation?
Bin bandwidth = Fs / length of transform (in samples)
What is the bin centre frequency equation?
Bin centre frequency = n * bin bandwidth
What is the length of transform equation?
Length of transform = Fs * t (seconds)
What is the window duration equation?
Window duration = number of samples * sample period
What is the sample period equation?
Sample period = 1 / Fs
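The five equations above, sketched as small functions and run for Fs = 44,100 Hz with a 1024-sample transform (example values chosen here, not taken from the cards). Function names are hypothetical.

```python
def sample_period(fs: float) -> float:
    """Sample period = 1 / Fs (seconds)."""
    return 1.0 / fs


def bin_bandwidth(fs: float, length: int) -> float:
    """Bin bandwidth = Fs / length of transform (Hz)."""
    return fs / length


def bin_centre(n: int, fs: float, length: int) -> float:
    """Bin centre frequency = n * bin bandwidth (Hz)."""
    return n * bin_bandwidth(fs, length)


def transform_length(fs: float, seconds: float) -> int:
    """Length of transform = Fs * t (samples)."""
    return round(fs * seconds)


def window_duration(length: int, fs: float) -> float:
    """Window duration = number of samples * sample period (seconds)."""
    return length * sample_period(fs)


fs, length = 44100, 1024
print(f"bin bandwidth:   {bin_bandwidth(fs, length):.2f} Hz")
print(f"bin 10 centre:   {bin_centre(10, fs, length):.2f} Hz")
print(f"window duration: {window_duration(length, fs) * 1000:.2f} ms")
print(f"0.1 s of audio:  {transform_length(fs, 0.1)} samples")
```

Doubling the transform length halves the bin bandwidth but doubles the window duration, which is the frequency and time resolution trade-off.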
What problem arises with frequency and time resolution?
- Good frequency resolution results in bad time resolution
- Good time resolution results in bad frequency resolution
If we analyse a whole track (3 mins), would we have good frequency or good time resolution?
Good frequency resolution
If we analyse a short segment (0.1 seconds), would we have good frequency or good time resolution?
Good time resolution
Does time resolution or frequency resolution have a smaller computational expense?
Time resolution
Fourier analysis is … on the computer
Strenuous
What method is faster than Fourier analysis?
Fast Fourier Transform (FFT)
FFT requires transform length to be…
A power of two (256, 1024, 2048 samples)
FFT requires what to be a power of two?
Transform length
A window size of power of two will result in…
Faster processing
What is windowing?
A series of short analytical snippets throughout duration of signal
Windowing describes…
The evolution of frequency over time
Windowing still has a problem. What is it?
Frequency and time resolution trade off
Time resolution can be increased by overlapping…
Windows
What do spectrograms plot?
Analytical window over time
What do a spectrogram’s X and Y axes show?
X = Time
Y = Frequency
What does colour on a spectrogram represent?
Magnitude (Amplitude)
What problems does frequency analysis have? (4)
- Results are estimates
- Computationally expensive
- Windowing can confuse frequency readings
- Doesn’t reflect human hearing
In terms of windowing, instead of reading signal spectrum, we get…
A combination of signal and window spectrum
What is ‘SpEcTrAl LeAkAgE’?
Unwanted Artefacts
Spurious Components are referred to as…
Side lobes
How can we reduce unwanted artefacts?
Use different window shapes
FFT has a good frequency response at…
Low frequencies
As window decreases, frequency resolution…
Decreases
FFT has good time resolution…
Throughout whole spectrum
As window decreases, time resolution…
Increases
Time and frequency resolution trade off can be resolved by…
Using adaptive window sizes
In terms of multi-resolution analysis, smaller windows would be used for…
Higher frequencies
In terms of multi-resolution analysis, we aim to have good frequency resolution at…
Lower frequencies
In terms of multi-resolution analysis, we aim to have good time resolution at…
Higher frequencies
In terms of multi-resolution analysis, window size varies with…
Frequency
What are the benefits of multi-resolution analysis? (2)
- Resolves trade off
- Can increase time and/or frequency resolution where it matters
Two key parameters of PCM are…
- Sample rate
- Bit depth
What is the formula for data per second using values from the following - 1 second of stereo PCM audio at 44.1kHz, 16 bit?
44,100 * 2 (bytes) * 2 (stereo) = 176.4kBps
What is the formula for bits per second using 176.4kB?
176.4kBps * 8 = 1,411.2kbps (about 1.4Mbps)
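The two data rate cards as a sketch, assuming whole-byte samples and uncompressed PCM. The function names are hypothetical.

```python
def pcm_bytes_per_second(fs: int, bit_depth: int, channels: int) -> int:
    """Bytes per second = sample rate * bytes per sample * channels."""
    return fs * (bit_depth // 8) * channels


def pcm_bits_per_second(fs: int, bit_depth: int, channels: int) -> int:
    """Bits per second = bytes per second * 8."""
    return pcm_bytes_per_second(fs, bit_depth, channels) * 8


print(pcm_bytes_per_second(44100, 16, 2), "bytes per second")
print(pcm_bits_per_second(44100, 16, 2), "bits per second")
```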
What does perceptual audio coding aim to do?
Reduce data required to represent audio
What do we call the process of cochlea hairs responding to strongest stimuli and ignoring anything weaker?
Masking
A stimulus temporarily raises…
Threshold of hearing
What are critical bands? (The Beatles aren’t one of them)
Areas influenced by the temporary change in threshold of hearing
Critical bands are … at lower frequencies.
Narrower
What pattern appears across hearing range?
Constant Q pattern
What is the CB bandwidth equation?
CB bandwidth = 94 + ( 71 * f^(3/2) )
f = kHz
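A sketch of the CB bandwidth equation, reading the exponent as 3/2 and assuming the result is in Hz (the card only states that f is in kHz). `critical_bandwidth_hz` is a hypothetical name.

```python
def critical_bandwidth_hz(f_khz: float) -> float:
    """CB bandwidth = 94 + 71 * f^(3/2), with f in kHz (result assumed to be in Hz)."""
    return 94 + 71 * f_khz ** 1.5


for f_khz in (0.25, 1.0, 4.0):
    print(f"{f_khz} kHz -> {critical_bandwidth_hz(f_khz):.1f} Hz")
```

The output grows with frequency, which fits the card saying critical bands are narrower at lower frequencies.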
CB bandwidth is not … across frequencies.
Fixed
CB depends on what two components of stimuli?
Intensity and frequency
What does critical band response aid? (5)
- Frequency discrimination
- Perceived loudness
- Dissonance/Consonance
- Clarity of speech
- Masking
Scales representing spectral energy in … … help measure human perception.
Critical bands
Two common scales of CB response are…
- Bark
- Mel
What does Bark scale aim to measure?
Loudness
One critical band has the bandwidth of how many barks?
One
What does Mel scale aim to measure?
Perceived pitch
What do Bark and Mel scales help us to establish?
Sounds both audible and inaudible in signal
Bark and Mel scales underpin…
Masking
Where does masking occur in terms of frequency?
Specific range in frequency around tone (critical band)
Masking means that frequency in same range might be…
Inaudible
In terms of masking, what is the ‘Masker’?
Louder tone
In terms of masking, what is the ‘Maskee’?
Quieter tone
Masking is better as frequency…
Increases
As masker amplitude increases, masking curve becomes…
Broader
In terms of masking, temporary threshold increase…
Holds over given time
Masking threshold increase lasts longer when… (4)
- Masker is louder
- Masker and maskee are closer in frequency
- Masker has lower frequency than maskee
- Time between tones is shorter
What is backwards masking?
Sounds can be masked by tone which occurs after maskee
Backwards masking suggests that humans hear in…
Time frames
Backwards masking only occurs when both tones are in…
The same time block
What is the bits per sample equation?
bits per sample = bit rate / Fs
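The bits per sample equation worked for a 128 kbps stream at 44.1 kHz. Both values are example figures chosen here, and the card's formula is used as written, with no division by channel count.

```python
def bits_per_sample(bit_rate: float, fs: float) -> float:
    """bits per sample = bit rate / Fs."""
    return bit_rate / fs


print(f"{bits_per_sample(128_000, 44_100):.2f} bits per sample at 128 kbps")
print(f"{bits_per_sample(705_600, 44_100):.2f} bits per sample for 16-bit mono PCM")
```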
What is the key mechanism in perceptual codec?
Bit allocation
What is data reduction?
Dynamically altering number of bits used to represent signal to make less computationally demanding
As bits decrease, noise…
Increases
In adaptive allocation, loud tones get what to represent them?
More bits
In adaptive allocation, what aren’t encoded?
Inaudible tones
In adaptive allocation, what happens to quantisation error noise?
How?
- It’s masked
- By keeping under the threshold
What does compressed audio look like to a computer?
Instructions on how to reconstruct the waveform
For signals with transients, input frames are split into how many segments?
How many samples does each segment have?
- Three
- 384 samples
For static signals, input frames are split into how many segments?
How many samples does each segment have?
Trick question
- One frame
- 1152 samples
As frame size decreases, noise…
Decreases
Is encoding process perfect?
No
In smaller frames, what issue occurs around transients?
Noise
What happens when noise occurs before transient?
Transient is smeared resulting in loss of definition
Input signal is split into sub-bands. How many, and what do they have in common?
- 32
- Equal bandwidth
Sub bands result in…
32 separate band-limited time domain signals (really rolls off the tongue)
Do sub bands increase data?
Doesn’t increase data due to ‘polyphase sub-band filter’
What effect does the ‘polyphase sub-band filter’ have?
Down sampling effect
What does the ‘polyphase sub-band filter’ do? (2)
- Reduces Fs
- While splitting signal in sub bands
What sample frames are subject to frequency analysis?
All frames
What does frequency analysis do to sub band content?
Converts content into frequency domain data
What does MDCT stand for?
Modified Discrete Cosine Transform
How much data does MDCT need to reproduce data compared to FFT?
Half the data
In the frequency domain, the masking model level is calculated for…
Each sub-band
What does SMR stand for?
Signal to mask ratio
In the masking model, frequency domain data can be used to give us what ratio?
Signal to mask ratio
What are the stages of the masking model? (5)
- Masking level calculated for each sub-band
- Calculation for SMR
- Bit allocation to sub-bands
- No. of bits assigned to sub-bands dependent on SMR
- Bit depth varies across sub-bands due to content
What is the encoding process order? (6)
- Frames
- Sub-bands
- Down sampling
- MDCT
- Masking and bit allocation
- Huffman coding
What is Huffman coding?
Statistical compression for further data reduction
What does Huffman coding represent?
Repeated sequences of data using shorter codes, e.g. 11010101 is stored as 01
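A minimal sketch of Huffman coding on a string of symbols, assuming the textbook form where more frequent symbols get shorter codes. It is not the codec's actual table-driven scheme, and `huffman_codes` is a hypothetical helper.

```python
import heapq
from collections import Counter


def huffman_codes(data):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    counts = Counter(data)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest groups, prefixing their codes with 0 and 1
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


message = "aaaaaaaabbbccd"
codes = huffman_codes(message)
encoded = "".join(codes[s] for s in message)
print(codes)
print(f"{len(encoded)} bits instead of {len(message) * 8} at 8 bits per symbol")
```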
What does compressed audio contain? (5)
- Instructions for decoder
- Samples in MDCT domain at reduced bit depth
- Bit allocation data
- Scale factor for each sub-band
- encoded using Huffman coding
What processes do frequency transforms have?
Inverse equivalent processes
What does the decoder apply to produce a time domain signal?
An inverse MDCT
What is simpler, decoder or encoder?
Decoder
Compression artefacts are…
Complex
What happens to sub-band data when decoded?
Data is combined
What two ways do compression artefacts vary?
- Vary systematically with audio input
- Vary according to encoding
Artefacts increase as bit rate…
Decreases
What is technical quality?
Our understanding of good audio quality
Audio engineers might want to assess the output of what? (3)
- Compression algorithms
- Hardware systems
- Network codecs
What is subjective audio quality assessment?
Listening tests taken by panel of listeners
What is objective audio quality assessment?
Analysis of audio signals - based on observational phenomena
What are the pros of subjective audio quality assessment? (1)
- Most accurate results
What are the cons of subjective audio quality assessment? (4)
- Expensive
- Time consuming
- Subjective
- Complex planning
What are the pros of objective audio quality assessment? (4)
- Lower cost
- Lower complexity
- Consistent (no listeners)
- Less time required
What are the cons of objective audio quality assessment? (1)
- It is an estimation of human response
In audio quality assessment, what two things are compared?
Original and processed signal
In terms of audio quality assessment, less change in signals means…
Better quality
Why aren’t time domain comparisons helpful?
We aren’t sensitive to phase changes
SNR, segmental SNR and total harmonic distortion comparisons don’t resemble…
Human hearing response to these parameters
What does LSD stand for?
Log-squared spectral distance
What does LSD produce?
Large values for low power areas on spectrum
What’s the negative of LSD? (apart from the comedown)
It is too sensitive to spectral changes which are inaudible
What are meaningful differences?
Spectral features which characterise signals
What are formant peaks?
Cluster of energy around certain points in frequency
Formant can help us differentiate what two things?
- Speech
- Musical instruments
The human hearing range is sensitive to…
Formant peaks
What is the minimum change in frequency (%) at which humans can hear a difference in pitch?
3-5%
Humans can hear a difference when bandwidth shifts …-…%
20-40%
What does SKL stand for?
Symmetrical Kullback-Leibler Distance
What sort of coding does SKL use for a smooth formant based spectrum?
Linear predictive coding
SKL uses linear predictive coding to achieve a…
smooth formant based spectrum
What does SKL assume?
Formant changes will be perceivable
SKL emphasises differences in what two parameters?
- High magnitudes
- Low frequencies
SKL is less sensitive to…
High frequency shifts
What does MFCC stand for?
Mel Frequency Cepstral Coefficients
MFCC is a subjective spectrum which reflects…
How we hear sounds
What does MFCC use to reflect how we hear sounds?
Psychoacoustical phenomena
Cepstrum is equal to…
Inverse FFT of the log FFT of a signal (duh)
What is inverse FFT of the log FFT of a signal equal to?
Cepstrum
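A rough sketch of the cepstrum card using a slow, pure-Python DFT so it runs without extra libraries (a real analysis would use an FFT). The log is taken of the spectrum's magnitude, with a tiny offset added to avoid log(0); both are assumptions of this sketch. The test signal is an impulse train with a period of 16 samples, so the cepstrum should peak at quefrency 16.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse divides by N)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(signal):
    """Inverse transform of the log magnitude spectrum."""
    spectrum = dft(signal)
    log_mag = [math.log(abs(c) + 1e-12) for c in spectrum]  # offset avoids log(0)
    return [c.real for c in dft(log_mag, inverse=True)]


period = 16
signal = [1.0 if t % period == 0 else 0.0 for t in range(128)]
cep = real_cepstrum(signal)
# Search below 1.5 periods; this idealised signal repeats its peak at every multiple
peak_quefrency = max(range(1, 24), key=lambda q: cep[q])
print("cepstral peak at quefrency", peak_quefrency, "samples")
```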
What does cepstrum emphasise?
Pitch content
What does MFCC combine? (2)
- Cepstrum
- Mel
Changes in MFCC are…
Perceivable
What is the auditory transform stage?
MFCC
What is the gear meshing equation?
Fm = nt * Frg
nt = no. of teeth
Frg = speed of gear
Periodograms help emphasise…
Pitch
What are the three stages of the PSD process?
- Compare signal with itself (autocorrelation)
- Take FFT of results
- Peaks will be produced at frequency of periodic elements
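The three PSD stages sketched in pure Python, assuming "compare with itself" means a circular autocorrelation and using a naive DFT in place of an FFT. The test signal (a sine completing 5 cycles in 64 samples, plus seeded random noise) is invented for the example.

```python
import cmath
import math
import random


def autocorrelation(x):
    """Stage 1: compare the signal with shifted copies of itself (circular)."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) for lag in range(n)]


def dft_magnitude(x):
    """Stage 2: magnitude of a naive O(N^2) DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


random.seed(0)
n, cycles = 64, 5
noisy = [math.sin(2 * math.pi * cycles * t / n) + random.uniform(-0.5, 0.5) for t in range(n)]

psd = dft_magnitude(autocorrelation(noisy))
# Stage 3: the peak sits at the frequency bin of the periodic element
peak_bin = max(range(1, n // 2), key=lambda k: psd[k])
print("strongest periodic component in bin", peak_bin)
```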
What is acoustic ecology?
Environmental sound
What does NVH stand for?
Noise, Vibration, Harshness
What is an example of active sound design?
The Ford Mustang mics up the engine and gives the user the option to switch between sports and comfort modes (changing the volume of the ‘engine’)
Product sound impacts… (3)
- Perceived quality
- Purchase
- Design and manufacture
What is cross modal perception?
An example?
- When perception is affected by two or more senses
- Louder = more powerful
Perception is measured by… (3)
- Loudness
- Roughness
- Sharpness
What does loudness measure and what is its unit?
Measure of energy across critical bands (Sone)
What does roughness measure and what is its unit?
Rapid amplitude fluctuations by interacting sounds (Asper)
What does sharpness measure and what is its unit?
Weighting/shape of spectrum (Acum)
What two parameters depend on the response of critical bands?
- Roughness
- Sharpness
As frequency energy increases, sharpness…
Increases
Where does sharpness occur?
In one critical band with concentrated high frequency energy
What is the term for low frequency sharpness?
Booming
What does CSA stand for?
Category Scaling of Annoyance
What is CSA used for?
Measuring annoyance of sound
What is the CSA equation?
CSA = 8.07 + ( 0.563 * N5 ) + ( 3.022 * S50 ) + ( 2.175 * R )
N = Loudness
S = Sharpness
R = Roughness
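The CSA equation as a function. The input values below are invented purely to show the arithmetic; reading N5 and S50 as percentile loudness and median sharpness is an assumption, since the cards only label them as loudness and sharpness.

```python
def csa(n5: float, s50: float, r: float) -> float:
    """CSA = 8.07 + 0.563 * N5 + 3.022 * S50 + 2.175 * R."""
    return 8.07 + 0.563 * n5 + 3.022 * s50 + 2.175 * r


# Hypothetical values: loudness in sone, sharpness in acum, roughness in asper
print(f"CSA = {csa(n5=10.0, s50=1.0, r=0.5):.4f}")
```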
What does MIR stand for?
Music Information Retrieval
What is an example of tech that uses MIR?
Melodyne
What is the issue with query by humming?
Variation of time and pitch in humming might not be recognised
What is the solution to the ‘query by humming’ issue?
Parsons code
What is Parsons code?
Encodes only the direction of each note change, so the system recognises C, C#, D as start, up, up
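A short sketch of Parsons code, assuming the usual notation: `*` for the first note, then U (up), D (down) or R (repeat). Notes are given as MIDI numbers, which is a choice made for this example.

```python
def parsons_code(midi_notes):
    """Encode a melody as '*' followed by U, D or R for each note change."""
    code = "*"
    for prev, cur in zip(midi_notes, midi_notes[1:]):
        if cur > prev:
            code += "U"
        elif cur < prev:
            code += "D"
        else:
            code += "R"
    return code


print(parsons_code([60, 61, 62]))                  # C, C#, D
print(parsons_code([60, 60, 67, 67, 69, 69, 67]))  # opening of "Twinkle, Twinkle"
```

Because only the contour is stored, humming the tune in a different key or at a different speed gives the same code.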
What is ‘query by example’? Give an example of tech that uses it.
- Looks for closest match by extracting compact and descriptive set of acoustic features
- Shazam
What are the challenges of ‘query by example’? (3)
- Database has millions of files so data must be compact
- Fingerprints must be robust enough to ignore noise
- Process must be efficient
What do constellation maps do? (2)
- Finds local maxima (peaks)
- Encodes peaks as time and frequency coordinates
What would you use if peaks overlap in constellation maps?
Use hashing process
What does the hashing process do? (2)
- Helps identify spectral features unique to music track
- Speeds up process
What three forms of analysis can be used for classification?
- Audio
- Metadata
- Symbolic Data
In terms of classification, what are the two benefits of using audio data?
- Easy to get hold of
- Can extract timbre and acoustics easily
In terms of classification, what is the con of using audio data?
Hard to precisely identify some features
In terms of classification, what is the benefit of using symbolic data?
More detailed
In terms of classification, what are the cons of using symbolic data? (2)
- No acoustic/timbre data
- Difficult to represent whole song in MIDI
What information can be gathered from spectrograms? (4)
- Timbre
- Frequency
- Intensity
- Rhythmic features
What are the two approaches of audio content analysis?
- Spectrogram
- Frame-based approach
Spectral shape gives us what four parameters? (4)
- Brightness
- Centroid
- Flatness
- Skewness
What is spectral flux?
Change of spectra over time
What would you use to identify chords in audio? (2)
- Spectrogram
- Pitch histograms
How would you identify chords in audio?
Calculate average energy for each note across spectrum
What does ‘classify by content’ do?
Classifies high level content using low level parameters
What is the ‘classify by content’ process? (4)
- Get audio
- Group
- Find ground truths
- Classify using ground truths
When gathering audio for classification, what should the audio be?
Typical to the genre
What are three machine learning methods for pattern recognition?
- KNN
- GMM
- SVM
What does KNN stand for?
K Nearest Neighbour
What is the KNN equation?
Distance = square root of the sum of ( A - B )^2 across all features
In terms of KNN, if K = 3, how many smallest distance tracks would you choose?
Three
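A minimal KNN sketch with K = 3, using Euclidean distance and a majority vote. The feature vectors and genre labels are made-up illustration data, and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Square root of the summed squared differences between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, labelled, k=3):
    """Pick the k labelled tracks closest to the query and take a majority vote."""
    nearest = sorted(labelled, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (brightness, flux) features for tracks with known genres
tracks = [
    ((0.20, 0.10), "classical"),
    ((0.30, 0.20), "classical"),
    ((0.25, 0.15), "classical"),
    ((0.80, 0.90), "rock"),
    ((0.90, 0.80), "rock"),
    ((0.85, 0.70), "rock"),
]
print(knn_classify((0.75, 0.80), tracks, k=3))
```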
In terms of KNN, as K increases, neighbours should…
Increase
In terms of KNN, fewer neighbours can produce…
Clearer boundaries
In terms of KNN, the more neighbours there are…
The better the class represents
What is the content based problem?
Acoustical properties aren’t taken into account so there might be similarities in acoustics rather than music
What is the content based problem called?
Glass Ceiling