lecture 2 - handling sensory noise Flashcards
definition: outlier
an observation that is distant from other observations
possible causes for outliers
- measurement error
- variability in the data
outlier removal methods
- using domain knowledge (e.g., heart rate can't be over 220)
- outlier detection methods
the detected outlying values are then removed or replaced
two types of outlier detection
- distribution based: chauvenet's criterion and mixture models
- distance based: simple, LOF
chauvenet’s criterion: purpose + steps
- purpose: to identify values of an attribute that are unlikely, given a normal distribution
- the criterion essentially states that an observation is an outlier if it is so extreme that its probability of occurrence is less than 1/N
- the higher N, the smaller this threshold, so an observation has to be more extreme before it is flagged as an outlier
- assume a normal distribution for a single attribute/feature
- take mean and standard deviation as parameters for that normal distribution
- for each instance of the attribute, compute the probability of the observation
- define the instance as an outlier when its probability is smaller than chauvenet's criterion (1/N)
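a minimal Python sketch of this procedure for a single attribute; it uses the 1/N threshold from the card (textbook variants of Chauvenet's criterion often use 1/(2N) instead):

```python
import numpy as np
from scipy.stats import norm

def chauvenet_outliers(values):
    """Flag values of a single attribute as outliers using Chauvenet's criterion."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std()
    # two-sided tail probability of seeing a value at least this far from the mean
    prob = 2 * norm.sf(np.abs(values - mean) / std)
    # outlier when the probability falls below the criterion 1/N
    return prob < 1.0 / n
```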
mixture models
- assuming the data of an attribute to follow a single distribution might be too simple
- therefore, we describe the data with k normal distributions, where pi weights each distribution
- we find the parameters (mu, sigma) and combinations of weights (pi) that best describe the data by maximizing the likelihood
mixture models: pi
- weights of the distributions
- all weights together must sum to 1
mixture models: likelihood maximization
- finding parameters pi, mu, and sigma for each distribution
- iteratively improves the parameter estimates to maximize the likelihood
- the likelihood is the product of the probabilities of all data points under the mixture
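a sketch using scikit-learn's GaussianMixture, which performs this likelihood maximization (via EM) for k normal distributions; the data, number of components, and cutoff are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical heart-rate attribute: two normal "modes" plus a few extreme values
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(60, 5, 200), rng.normal(120, 10, 50), [220, 15]])
X = values.reshape(-1, 1)

# fit k = 2 normal distributions; gmm.weights_ holds the mixture weights pi (they sum to 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# log-likelihood of every observation under the fitted mixture;
# very unlikely observations are candidate outliers
log_lik = gmm.score_samples(X)
outliers = log_lik < np.quantile(log_lik, 0.01)  # illustrative cutoff
```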
simple distance-based approach
- points x and y are close when their distance is within d_{min} (smaller d_{min} = more strict)
- a point is an outlier when the fraction of points that lie further away than d_{min} is larger than f_{min}
- does not take local density into account
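a brute-force sketch of the simple distance-based approach; d_min and f_min are illustrative parameters:

```python
import numpy as np

def distance_outliers(points, d_min, f_min):
    """Mark a point as an outlier when the fraction of points farther than d_min exceeds f_min.

    points: (n_instances, n_attributes) array.
    """
    # pairwise Euclidean distances (O(n^2), fine for small data sets)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    far_fraction = (dist > d_min).sum(axis=1) / (n - 1)  # exclude the point itself
    return far_fraction > f_min
```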
LOF properties
- takes local density into account
- consider k closest data points
- computationally expensive
LOF steps
- define k_{dist} for a point A as the distance to its k-nearest neighbour (the furthest of the k neighbours)
- the set of neighbours within k_{dist} is called the k-distance neighbourhood
- define the reachability distance from point A to B as max(k_{dist}(B), d(A, B))
- define the local reachability distance of our point (the inverse of the average reachability distance between A and all points in its neighbourhood)
- compare this to the neighbours' local reachability distances to get the local outlier factor
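scikit-learn ships these steps as LocalOutlierFactor; a minimal usage sketch with hypothetical data (n_neighbors is illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# hypothetical two-attribute sensor data with one far-away point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # LOF values; clearly above 1 suggests an outlier
```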
k_{dist}
distance of a point to its k-nearest neighbor
- i.e., the furthest neighbor
- k_{dist} is small when a point is in a dense region
k-distance neighbourhood
the set of data points that have a distance to x^j_i of at most k_{dist}
reachability distance
- a robust version of the distance between point A and B
- defined as the maximum of [the distance between point A and B] and [the k_{dist} of point B]
- this distance is k_{dist}(B) if A is inside of the neighbourhood of B, and the actual distance if it is outside of the neighbourhood of B
- ensures that smaller distances within the k-distance neighborhood are not overemphasized
local reachability distance
1/ ([sum of reachability distances from A to all points in the neighbourhood] / [number of points in the neighbourhood])
- inverse of the average reachability
- small value = likely an outlier
- a positive value (not restricted to [0,1]): large in dense neighbourhoods, small in sparse ones
LOF formula
(sum over the neighbours B of [lrd(B) / lrd(A)]) / number of points in the neighbourhood
- tells you how much of an outlier point A is compared to its neighbourhood: values around 1 mean a similar density, values clearly above 1 indicate an outlier
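a brute-force sketch that follows the cards above (k_{dist}, reachability distance, lrd, LOF); it simplifies the k-distance neighbourhood to exactly k neighbours:

```python
import numpy as np

def local_outlier_factor(points, k):
    """Compute the LOF for every row of points (O(n^2) sketch)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour

    neigh = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest neighbours
    k_dist = dist[np.arange(n), neigh[:, -1]]  # distance to the furthest of them

    # reachability distance: reach(A, B) = max(k_dist(B), d(A, B))
    reach = lambda a, b: max(k_dist[b], dist[a, b])

    # local reachability distance: inverse of the average reachability distance to the neighbours
    lrd = np.array([1.0 / np.mean([reach(a, b) for b in neigh[a]]) for a in range(n)])

    # LOF: average ratio of the neighbours' lrd to the point's own lrd
    return np.array([np.mean(lrd[neigh[a]]) / lrd[a] for a in range(n)])
```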
imputation
- replace missing values by a substituted value (e.g., mean, mode, median)
- more advanced:
- exploit information from other attributes in the same instance (row)
- values of the same attributes from other instances (column).
-> single vs. multiple missing values
exploit information of the same attribute from other instances: single missing value vs multiple missing values
- single missing measurement: take the average of the instance before and after the missing value
- multiple missing measurements: linear imputation
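a sketch with pandas: linear interpolation covers both cases (a single gap becomes the average of its neighbours, a longer gap is filled along a straight line); the series is hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical heart-rate series with a single gap and a two-value gap
hr = pd.Series([72, np.nan, 76, 80, np.nan, np.nan, 92])

hr_imputed = hr.interpolate(method="linear")
# -> 72, 74, 76, 80, 84, 88, 92
```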
kalman filter
estimates expected values based on historical data with:
- prediction
- measurement
- updating
if the observed value deviates too much, we can impute it with the expected value
kalman filter steps
- measure the state s_t through the observation x_t
- state prediction: predict the next state s-hat_{t|t-1} from the previous state using the transition matrix F
- predict the error covariance matrix P_{t|t-1}, which captures the expected error between the true state and the predicted state
- update the state with the new measurement: s-hat_{t|t} = prediction + Kalman gain * (measurement - predicted measurement)
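a minimal scalar Kalman filter sketch for a single sensor value; F, H, and the noise terms q and r are illustrative assumptions:

```python
import numpy as np

def kalman_filter_1d(measurements, F=1.0, H=1.0, q=1e-3, r=0.5):
    """Scalar Kalman filter: predict the state, then correct it with each measurement."""
    s_hat, P = measurements[0], 1.0              # initial state estimate and error covariance
    filtered = []
    for x_t in measurements:
        # prediction step
        s_pred = F * s_hat                       # s_hat_{t|t-1}
        P_pred = F * P * F + q                   # P_{t|t-1}
        # update step
        K = P_pred * H / (H * P_pred * H + r)    # Kalman gain
        s_hat = s_pred + K * (x_t - H * s_pred)  # correct with the measurement residual
        P = (1 - K * H) * P_pred
        filtered.append(s_hat)
    return np.array(filtered)
```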
data transformation
- lowpass filter
- PCA
lowpass filter
- for periodic data
- series of values is decomposed into different periodic signals that come with their own frequencies
- filters out irrelevant frequencies by removing the high-frequency components above a cutoff frequency
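a sketch of a Butterworth lowpass filter with SciPy; the sampling rate, cutoff frequency, and signal are illustrative:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, cutoff = 100.0, 1.5                       # sampling rate (Hz) and cutoff frequency (Hz)

# hypothetical signal: slow periodic component plus high-frequency noise
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(len(t))

b, a = butter(N=4, Wn=cutoff / (0.5 * fs), btype="low")  # 4th-order lowpass
filtered = filtfilt(b, a, signal)                        # zero-phase filtering
```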
PCA
- find new features that explain most of the variability in our data
- select the number of components based on the explained variance
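a sketch with scikit-learn: fit PCA, then pick the number of components from the cumulative explained variance (the data and the 95% target are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical (n_instances, n_attributes) sensor data; standardize so no attribute dominates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)   # cumulative explained variance
n_components = int(np.argmax(cum_var >= 0.95)) + 1   # smallest number reaching 95% variance
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```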
temporal specific outlier/imputation methods
- interpolation based imputation
- kalman filter
- lowpass filter