lecture 2 - handling sensory noise Flashcards
1
Q
definition: outlier
A
an observation that is distant from other observations
2
Q
possible causes for outliers
A
- measurement error
- variability in the data
3
Q
outlier removal methods
A
- using domain knowledge (e.g., heart rate can't be over 220 bpm)
- outlier detection methods
- in either case, the outlying values are then removed or replaced
4
Q
two types of outlier detection
A
- distribution based: chauvenet's criterion and mixture models
- distance based: simple, LOF
5
Q
chauvenet’s criterion: purpose + steps
A
- purpose: to identify values of an attribute that are unlikely, given a normal distribution
- the criterion states that an observation is an outlier if it is so extreme that its probability of occurrence is less than 1/(c·N), classically with c = 2
- the larger N, the smaller this threshold, so the more extreme an observation must be before it is flagged
- assume a normal distribution for a single attribute/feature
- take the attribute's mean and standard deviation as the parameters of that normal distribution
- for each instance of the attribute, compute the probability of observing a value at least that extreme
- mark the instance as an outlier when its probability is smaller than the criterion
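The steps above can be sketched in Python (a minimal illustration assuming a two-sided tail probability and the threshold 1/(c·N) with c = 2; the function name is made up):

```python
import numpy as np
from scipy import stats

def chauvenet_outliers(values, c=2):
    """Flag values whose two-sided tail probability under a fitted
    normal distribution is below 1 / (c * N)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std()
    # probability of seeing a value at least this far from the mean
    prob = 2 * stats.norm.sf(np.abs(values - mean) / std)
    return prob < 1.0 / (c * n)

print(chauvenet_outliers([10.1, 9.9, 10.0, 10.2, 9.8, 25.0]))
# only the extreme value 25.0 is flagged
```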
6
Q
mixture models
A
- assuming the data of an attribute to follow a single distribution might be too simple
- therefore, we describe the data with a mixture of k normal distributions, where each distribution has a weight pi
- we find the parameters (mu, sigma) and combinations of weights (pi) that best describe the data by maximizing the likelihood
7
Q
mixture models: pi
A
- weights of the distributions
- all weights together must sum to 1
8
Q
mixture models: likelihood maximization
A
- finding the parameters pi, mu, and sigma for each of the k distributions
- typically done with the expectation-maximization (EM) algorithm, which iteratively improves the parameter estimates to maximize the likelihood
- the likelihood is the product of the probabilities of all data points under the mixture
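A minimal sketch of fitting a mixture model, assuming scikit-learn's GaussianMixture (the lecture doesn't prescribe a library); it estimates mu, sigma, and the weights pi by maximizing the likelihood with EM, and low-probability points can then be treated as outliers:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two clusters plus one isolated measurement
x = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(10, 1, 100),
                    [30.0]]).reshape(-1, 1)

# describe the data with k = 2 normal distributions; EM maximizes the
# likelihood over the parameters mu, sigma and the weights pi
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_)              # the pi values; they sum to 1
log_dens = gmm.score_samples(x)  # log probability of each data point
print(float(x[np.argmin(log_dens), 0]))  # least likely point: 30.0
```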
9
Q
simple distance-based approach
A
- points x and y are close when their distance is within d_{min} (smaller d_{min} = more strict)
- a point is an outlier when the fraction of points further away than d_{min}, i.e. (number of points outside d_{min})/N, is larger than f_{min}
- does not take local density into account
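A small sketch of this approach (O(N^2) pairwise distances; the function name is illustrative):

```python
import numpy as np

def simple_distance_outliers(points, d_min, f_min):
    """A point is an outlier when more than a fraction f_min of the
    other points lie further away than d_min."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    # pairwise Euclidean distances (O(N^2), fine for small N)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    frac_far = (dists > d_min).sum(axis=1) / (n - 1)
    return frac_far > f_min

pts = [[0, 0], [0.5, 0], [0, 0.5], [0.4, 0.4], [5, 5]]
print(simple_distance_outliers(pts, d_min=1.0, f_min=0.5))
# only the isolated point [5, 5] is flagged
```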
10
Q
LOF properties
A
- takes local density into account
- consider k closest data points
- computationally expensive
11
Q
LOF steps
A
- define k_{dist} for a point A as the distance to its k-th nearest neighbour (i.e., the furthest of the k nearest neighbours)
- the set of neighbours within k_{dist} is called the k-distance neighbourhood
- define the reachability distance for point A to B as max(k_dist(B), d(A, B))
- define the local reachability distance of A: the inverse of the average reachability distance from A to all points in its k-distance neighbourhood
- compare this to the neighbours' local reachability distances to obtain the local outlier factor
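The LOF steps can be sketched as follows (a didactic O(N^2) version that ignores distance ties; variable names mirror the cards):

```python
import numpy as np

def lof(points, k=2):
    """Local outlier factor, following the steps above."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # indices of each point's k nearest neighbours (self excluded)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]
    # k_dist = distance to the k-th (furthest) nearest neighbour
    k_dist = D[np.arange(n), nbrs[:, -1]]
    # local reachability distance: inverse of the average
    # reachability distance reach(A, B) = max(k_dist(B), d(A, B))
    lrd = np.array([1.0 / np.maximum(k_dist[nbrs[i]], D[i, nbrs[i]]).mean()
                    for i in range(n)])
    # LOF: the neighbours' lrd relative to the point's own lrd
    return np.array([lrd[nbrs[i]].mean() / lrd[i] for i in range(n)])

pts = [[0, 0], [0, 1], [1, 0], [1, 1], [6, 6]]
print(lof(pts, k=2))  # the isolated [6, 6] scores well above 1
```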
12
Q
k_{dist}
A
distance of a point to its k-nearest neighbor
- i.e., the furthest neighbor
- k_{dist} is small when a point is in a dense region
13
Q
k-distance neighbourhood
A
the set of data points whose distance to x^j_i is at most k_{dist}
14
Q
reachability distance
A
- reach-dist(A, B) = max(k_{dist}(B), d(A, B)), i.e. the maximum of [the distance between point A and B] and [the k_{dist} of point B]
- this equals k_{dist}(B) if A lies inside the neighbourhood of B, and the actual distance d(A, B) if it lies outside of it
- ensures that smaller distances within the k-distance neighborhood are not overemphasized
15
Q
local reachability distance
A
1 / ([sum of reachability distances from A to all points in the neighbourhood] / [number of points in the neighbourhood])
- i.e., the inverse of the average reachability distance (for this reason it is also called the local reachability density)
- a small value means A's neighbours are far away, so A is likely an outlier
- it is a positive number, not restricted to [0, 1]