Part 1 : Data Acquisition and Characteristics Flashcards
Analogue to Digital conversion involves
Sampling and Quantisation.
Sampling
Ascertain the momentary value of (an analogue signal) many times a second so as to convert the signal to digital form.
Quantisation
The process of mapping a large set of input values to a (countable) smaller set.
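A minimal sketch of uniform quantisation, assuming an input range of [-1, 1] and equally spaced levels. The function name `quantise` and its parameters are illustrative, not taken from the cards.

```python
def quantise(value, n_bits, lo=-1.0, hi=1.0):
    """Map a continuous value onto one of 2**n_bits equally spaced levels."""
    levels = 2 ** n_bits
    step = (hi - lo) / (levels - 1)        # spacing between adjacent levels
    clipped = min(max(value, lo), hi)      # values outside the range saturate
    index = round((clipped - lo) / step)   # nearest level index
    return lo + index * step

# With 2 bits there are only 4 possible output values, whatever the input.
outputs = {quantise(v / 100, 2) for v in range(-100, 101)}
```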
Nyquist Shannon Sampling Theorem
If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.
Nyquist Shannon Sampling Theorem (Layman's)
If the highest frequency in the signal is f(max) the sampling rate must be at least 2f(max).
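A small sketch of what goes wrong below that rate, assuming a pure 3 Hz sine sampled at 4 Hz (less than 2 x 3 = 6 Hz). The samples turn out identical to those of a sine at 3 - 4 = -1 Hz, so the two signals cannot be told apart after sampling (aliasing). All names and numbers are illustrative.

```python
import math

def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = n / rate_hz for n = 0 .. n_samples - 1."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

f_max = 3.0       # highest frequency in the signal
too_slow = 4.0    # sampling rate below 2 * f_max

undersampled = sample_sine(f_max, too_slow, 8)
alias = sample_sine(f_max - too_slow, too_slow, 8)   # the -1 Hz alias

# True when every sample of the 3 Hz signal matches the alias.
aliased = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(undersampled, alias))
```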
Valid distance measure D(a,b) has properties
- Non-negative
- Reflexive
- Symmetric
- Satisfies Triangular Inequality
Minkowski Distance of order p (p-norm distance) is defined as
D(x,y) = (Σ|x(i) - y(i)|^p)^(1/p)
When p=1, 1-norm distance, Minkowski
(aka Manhattan)
D(x,y) = Σ|x(i) - y(i)|
When p=2, 2-norm distance, Minkowski
(aka Euclidean)
D(x,y) = ((x-y)^T(x-y))^(1/2)
When p=∞, ∞-norm distance, Minkowski
(aka Chebyshev)
D(x,y) = max(|x1-y1|, |x2-y2|,…,|xn-yn|)
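The three special cases can be checked with one function. This is a sketch using plain Python sequences; the name `minkowski` is illustrative.

```python
import math

def minkowski(x, y, p):
    """p-norm distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)                        # Chebyshev
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)          # |3| + |4|
euclidean = minkowski(x, y, 2)          # sqrt(9 + 16)
chebyshev = minkowski(x, y, math.inf)   # max(|3|, |4|)
```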
Time series
Successive measurements made over a time interval
(Numerical Time Series) Limitations of p-norm distances
- Can only compare time series of the same length
- Very sensitive with respect to signal transformations:
- Shifting
- Uniform Amplitude Scaling
- Non-Uniform Amplitude Scaling
- Uniform Time Scaling
Dynamic Time Warping (Berndt and Clifford, 1994)
- Replaces Euclidean one-to-one with many-to-one
- Recognises similar shapes, even in the presence of shifting and/or scaling
- X = (x0,…,xn), Y = (y0,…,ym) and Rest(X) = (x1,…,xn)
- DTW(X,Y) = D(x0,y0) + min{DTW(X, Rest(Y)), DTW(Rest(X), Y), DTW(Rest(X), Rest(Y))}
- Solved efficiently using dynamic programming by building an n×m distance matrix
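A sketch of the dynamic programming solution, assuming D is the absolute difference between two samples. It fills the table from the start of the series rather than recursing on the rest, which gives the same optimal alignment cost. The function name `dtw` is illustrative.

```python
def dtw(x, y):
    """Dynamic Time Warping cost between two numeric sequences."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i items of x with the first j of y
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x[i-1] matched to several y
                                 cost[i][j - 1],      # y[j-1] matched to several x
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]

# Same shape, different lengths: the repeated 2 is absorbed by a many-to-one match.
stretched = dtw([1, 2, 3], [1, 2, 2, 3])
```

Unlike the p-norm distances, the two inputs do not need to have the same length.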
(Distance Symbolic) in text could be
- Syntactic
- Semantic
Syntactic
- Defined over symbolic data of the same length
- Measures the number of substitutions required to change one string/number into another
Syntactic e.g. Hamming Distance
Returns the number of mismatches, max = length
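A short sketch of the Hamming distance for two equal-length strings; the function name is illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of the same length")
    return sum(ca != cb for ca, cb in zip(a, b))

mismatches = hamming("karolin", "kathrin")   # differs at positions 2, 3 and 4
```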
Syntactic e.g. Edit Distance
Measures the minimum number of ‘operations’ required to transform one sequence to another
Syntactic e.g. Edit Distance - Operations
- Insertion
- Substitutions
- Deletion
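A sketch of edit distance with these three operations, assuming each costs 1 (the Levenshtein variant). It keeps only the previous row of the dynamic programming table. The function name is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

operations = edit_distance("kitten", "sitting")
```

Unlike Hamming distance, the two sequences may differ in length.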
Semantic
Built on top of a hierarchy of word semantics, e.g. WordNet (Princeton)
Semantic tool WordNet
Contains 117,000 synsets (synset: set of one or more synonyms that are interchangeable in some context)
Semantic e.g. WUP (Wu and Palmer Distance, 1994)
WUP finds the path length to the root node from the LCS (Least Common Subsumer); the value is scaled by the sum of the path lengths from the original concepts to the root.
Semantic e.g. WUP Equation
WUP(C1,C2) = (2N3)/(N1+N2+(2N3))
- N1, N2 -> path lengths from C1 and C2 to the LCS
- N3 -> path length from the LCS to the root
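A worked sketch of the equation, assuming N1 and N2 are the path lengths from C1 and C2 to the LCS and N3 is the path length from the LCS to the root. The path lengths below are made-up numbers, not values read from WordNet.

```python
def wup(n1, n2, n3):
    """Wu and Palmer score from the three path lengths described above."""
    return (2 * n3) / (n1 + n2 + 2 * n3)

close_pair = wup(1, 1, 3)     # concepts one step below a deep LCS
distant_pair = wup(3, 3, 1)   # concepts far below an LCS near the root
```

A deeper LCS and shorter paths to it push the score towards 1.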
WordNet Relationships - Hyponymy
(is-a relationship) e.g. furniture -> bed
WordNet Relationships - Meronymy
(part-of relationship) e.g. chair -> seat
WordNet Relationships - Troponymy
[for verb hierarchies] (specific manner) e.g. communicate -> talk -> whisper
WordNet Relationships - Antonymy
(strong contrast) e.g. wet <-> dry
Mean - 1D
µ = (1/N) Σ x(i)
Variance(Spread) - 1D
σ^2 = (1/(N-1)) Σ (x(i) - µ)^2
Standard Deviation - 1D
σ = ((1/(N-1)) Σ (x(i) - µ)^2)^(1/2)
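A sketch of the three 1D formulas in plain Python, using the (N-1) divisor. The data set is illustrative; the standard library `statistics` module provides equivalent `mean`, `variance` and `stdev` functions.

```python
import math
import statistics

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    return math.sqrt(sample_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(data)                   # 40 / 8
var = sample_variance(data)       # 32 / 7
sd = sample_std(data)
```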
Mean - Multi-Dimensional
Calculated independently for each dimension
Variance - Multi-Dimensional
Computed along each dimension using a Covariance Matrix
Covariance Matrix
- Variances on the diagonal
- Also measures correlation
Covariance Matrix (Correlation)
- Positive covariance means the variables tend to increase together
- Negative covariance means one variable tends to decrease as the other increases
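A sketch of a sample covariance matrix in plain Python, assuming each row is one observation and the (N-1) divisor is used. The function name and the data are illustrative.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of observations given as equal-length rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dims)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(dims)]
            for i in range(dims)]

rising = covariance_matrix([(1, 2), (2, 4), (3, 6)])    # y grows with x
falling = covariance_matrix([(1, 6), (2, 4), (3, 2)])   # y shrinks as x grows
```

The diagonal entries are the per-dimension variances; the sign of the off-diagonal entry shows the direction of the relationship.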
Eigenvectors and Eigenvalues
Av = λv
- v -> EigenVector
- λ -> EigenValue
Characteristic Equation
|A − λI| = 0
- I -> Identity Matrix
- A -> Square matrix
- |A − λI| -> Determinant of the matrix A − λI
Determinant of a Matrix
|A| = ad − bc, for the 2x2 matrix A = [[a, b], [c, d]]
Eigenvectors define principal axes
- Major Axis: eigenvector corresponding to larger Eigenvalue
- Minor Axis: eigenvector corresponding to smaller Eigenvalue
- Represented using the major and minor axes of an ellipse
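A sketch that solves the characteristic equation for a symmetric 2x2 matrix [[a, b], [b, d]] (the shape of a 2D covariance matrix) and returns the major and minor axes as unit vectors. The function name is illustrative, and the closed form only applies to the symmetric 2x2 case.

```python
import math

def principal_axes(a, b, d):
    """Return ((major_value, major_vector), (minor_value, minor_vector))."""
    half_trace = (a + d) / 2
    # Roots of |A - lambda*I| = lambda^2 - (a + d)*lambda + (a*d - b*b) = 0.
    radius = math.hypot((a - d) / 2, b)
    lam_major, lam_minor = half_trace + radius, half_trace - radius

    if b != 0:
        # From (a - lambda)*v1 + b*v2 = 0, one solution is v = (b, lambda - a).
        vectors = [(b, lam - a) for lam in (lam_major, lam_minor)]
    elif a >= d:
        vectors = [(1.0, 0.0), (0.0, 1.0)]   # matrix is already diagonal
    else:
        vectors = [(0.0, 1.0), (1.0, 0.0)]

    unit = [(vx / math.hypot(vx, vy), vy / math.hypot(vx, vy)) for vx, vy in vectors]
    return (lam_major, unit[0]), (lam_minor, unit[1])

(major_val, major_vec), (minor_val, minor_vec) = principal_axes(2.0, 1.0, 2.0)
```

For this matrix the major axis points along the diagonal direction (1, 1) and the minor axis is perpendicular to it.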
Data Normalisation Methods
- Rescaling
- Standardisation (aka z-score)
- Scaling to unit length
(Data Normalisation) Rescaling
x’ = (x - min(x)) / (max(x) - min(x))
(Data Normalisation) Standardisation
x’ = (x-µ) / σ
(Data Normalisation) Scaling to Unit Length
x’ = x / ||x||
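A sketch of the three methods on plain Python lists. It assumes the sample standard deviation for the z-score and the Euclidean norm for unit length; the function names are illustrative, and inputs with zero range, zero spread or zero norm are not handled.

```python
import math
import statistics

def rescale(xs):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardise(xs):
    """z-score: zero mean and unit (sample) standard deviation."""
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

def unit_length(xs):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

rescaled = rescale([2, 4, 6])
z_scores = standardise([2, 4, 6])
unit = unit_length([3, 4])
```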
Data Outliers
A small number of points with values significantly different from those of the other points; not always easy to remove
Median
- Difficult to combine: the median of the union of two sets cannot be computed from the individual medians alone
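A small counterexample, using made-up data: two pairs of sets with identical individual medians but different combined medians.

```python
import statistics

a1, b1 = [1, 2, 3], [4, 5, 6]
a2, b2 = [1, 2, 100], [4, 5, 6]     # same medians as a1 and b1

same_individual_medians = (
    statistics.median(a1) == statistics.median(a2)
    and statistics.median(b1) == statistics.median(b2)
)

combined_1 = statistics.median(a1 + b1)
combined_2 = statistics.median(a2 + b2)
```

Since the individual medians match while the combined medians differ, no formula based only on the individual medians can work in general.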
Note - Sample Variance Vs. Variance
Computed from a sample, both are only estimates; dividing by (N-1) rather than N gives an unbiased estimate of the variance
Normal Distribution
N(µ, σ^2)
Normal Distribution - Standard Deviation
- 1sd - 68%
- 2sd - 95%
- 3sd - 99.7%
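These percentages can be reproduced from the error function: for a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). A sketch with an illustrative function name:

```python
import math

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Percentages for 1, 2 and 3 standard deviations, rounded to one decimal place.
coverage = {k: round(100 * coverage_within(k), 1) for k in (1, 2, 3)}
```

The rounded values are roughly 68.3, 95.4 and 99.7.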
Normal Distribution - Multi-Dimensional
N(µ, Σ), where µ is the mean vector and Σ is the covariance matrix