lecture 2 - handling sensory noise Flashcards

1
Q

definition: outlier

A

an observation that is distant from other observations

2
Q

possible causes for outliers

A
  1. measurement error
  2. variability in the data
3
Q

outlier removal methods

A
  1. using domain knowledge (e.g., a heart rate can't be over 220)
  2. outlier detection methods

outlying values are replaced

4
Q

two types of outlier detection

A
  1. distribution based: Chauvenet’s criterion and mixture models
  2. distance based: simple distance-based approach and the local outlier factor (LOF)
5
Q

chauvenet’s criterion: purpose + steps

A
  • purpose: to identify values of an attribute that are unlikely, given a normal distribution
  • the criterion states that an observation is an outlier if it is so extreme that its probability of occurrence is less than 1/(2N), where N is the number of observations
  • the larger N, the smaller this threshold, so an observation has to be more extreme before it is flagged
  1. assume a normal distribution for a single attribute/feature
  2. take the mean and standard deviation as parameters of that normal distribution
  3. for each instance of the attribute, compute the probability of the observation
  4. define the instance as an outlier when its probability is smaller than the criterion (see the sketch below)
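
A minimal sketch of these steps in Python (the two-sided probability and the 1/(2N) threshold follow the classical criterion; the function name and the example values are illustrative):

```python
import numpy as np
from scipy import stats

def chauvenet_outliers(values, c=2):
    """Flag values whose probability under a fitted normal distribution is below 1 / (c * N)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std()
    # two-sided probability of observing a value at least this far from the mean
    prob = 2 * stats.norm.sf(np.abs(values - mean) / std)
    return prob < 1.0 / (c * n)  # True marks an outlier

# example: the extreme reading 210 is flagged, the others are not
mask = chauvenet_outliers([60, 62, 61, 59, 63, 210])
```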
6
Q

mixture models

A
  • assuming the data of an attribute to follow a single distribution might be too simple
  • therefore, we describe the data with k normal distributions, where pi weights each distribution
  • we find the parameters (mu, sigma) and combinations of weights (pi) that best describe the data by maximizing the likelihood
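
A sketch with scikit-learn's GaussianMixture (the library, the choice of k = 3, and the 1% threshold are assumptions for illustration); points with a low likelihood under the fitted mixture can then be flagged as outliers:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

x = np.random.normal(0, 1, size=(200, 1))          # toy data, one attribute
gmm = GaussianMixture(n_components=3).fit(x)       # describe the data with k = 3 normal distributions
log_prob = gmm.score_samples(x)                    # log-likelihood of each point under the mixture
outliers = log_prob < np.quantile(log_prob, 0.01)  # e.g. flag the 1% least likely points
# gmm.weights_ holds the weights pi, gmm.means_ and gmm.covariances_ the mu and sigma per distribution
```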
7
Q

mixture models: pi

A
  • weights of the distributions
  • all weights together must sum to 1
8
Q

mixture models: likelihood maximization

A
  • finding parameters pi, mu, and sigma for each distribution
  • iteratively improves the parameter estimates to maximize the likelihood
  • the likelihood is the product of the probabilities of all data points under the mixture (see the sketch below)
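
A sketch of how that likelihood is evaluated for fixed parameters (the toy data and parameter values are illustrative; in practice the parameters are improved iteratively, e.g. with expectation-maximization):

```python
import numpy as np
from scipy import stats

x = np.array([0.1, -0.3, 5.2, 4.8, 0.0])            # toy data for one attribute
pi, mu, sigma = [0.6, 0.4], [0.0, 5.0], [1.0, 1.0]   # illustrative weights and parameters of k = 2 normals
# probability of each point under the mixture = weighted sum over the k distributions
p = sum(w * stats.norm.pdf(x, m, s) for w, m, s in zip(pi, mu, sigma))
log_likelihood = np.log(p).sum()                     # product of probabilities -> sum of log-probabilities
```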
9
Q

simple distance-based approach

A
  1. points x and y are close when their distance is within d_{min} (smaller d_{min} = more strict)
  2. a point is an outlier when the fraction of points that lie farther away than d_{min}, i.e. (number of points outside d_{min})/N, is bigger than f_{min}
  • does not take local density into account
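
A brute-force sketch of this rule (the function name is illustrative; d_min and f_min follow the card):

```python
import numpy as np

def simple_distance_outliers(points, d_min, f_min):
    """Mark a point as an outlier when more than a fraction f_min of all points lie farther away than d_min."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # pairwise Euclidean distances (brute force, O(n^2))
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    far_fraction = (dists > d_min).sum(axis=1) / n
    return far_fraction > f_min
```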
10
Q

LOF properties

A
  • takes local density into account
  • considers the k closest data points
  • computationally expensive
11
Q

LOF steps

A
  1. define k_{dist} for a point A as the distance to its k-nearest neighbour (i.e., the furthest of the k neighbours)
  2. the set of neighbours within k_{dist} is called the k-distance neighbourhood
  3. define the reachability distance from point A to B as max(k_{dist}(B), d(A, B))
  4. define the local reachability distance of our point as the inverse of the average reachability distance from A to all points in its neighbourhood
  5. compare this to the neighbours’ local reachability distances to get the local outlier factor (see the sketch below)
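
In practice LOF is usually computed with a library rather than by hand; a sketch with scikit-learn's LocalOutlierFactor (assumed here; k corresponds to n_neighbors):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.normal(0, 1, size=(100, 2))   # toy data with two attributes
lof = LocalOutlierFactor(n_neighbors=5)     # k = 5 closest data points
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_      # LOF values; much larger than 1 = lower density than neighbours
```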
12
Q

k_{dist}

A

distance of a point to its k-nearest neighbor

  • i.e., the furthest neighbor
  • k_{dist} is small when a point is in a dense region
13
Q

k-distance neighbourhood

A

the set of data points that have a distance to x^j_i smaller than k_{dist}

14
Q

reachability distance

A
  • the maximum of [the distance between point A and B] and [the k_{dist} of point B], i.e. max(k_{dist}(B), d(A, B))
  • this distance is k_{dist}(B) if A is inside of the neighbourhood of B, and the actual distance if it is outside of the neighbourhood of B
  • ensures that smaller distances within the k-distance neighborhood are not overemphasized
15
Q

local reachability distance

A

1/ ([sum of reachability distances from A to all points in the neighbourhood] / [number of points in the neighbourhood])

  • inverse of the average reachability distance
  • a small value means the point lies in a sparse region (likely an outlier)
  • always non-negative
16
Q

LOF formula

A

LOF(x^j_i) = [sum of (lrd of each neighbour x / lrd of x^j_i)] / [number of points in the neighbourhood]

  • tells you how much of an outlier this point is compared to its neighbourhood
  • a LOF around 1 means a density similar to the neighbours; a LOF much larger than 1 marks an outlier
17
Q

imputation

A
  1. replace missing values with a substitute value (e.g., mean, mode, median)
  2. more advanced:
    - exploit information from other attributes in the same instance (row)
    - values of the same attribute from other instances (column)
    –> single vs multiple missing values (see the sketch below)
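
A sketch of the simple substitution variants with pandas (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hr": [60, np.nan, 64, 62], "activity": ["walk", "run", None, "walk"]})
df["hr"] = df["hr"].fillna(df["hr"].mean())                        # mean imputation (numeric attribute)
df["activity"] = df["activity"].fillna(df["activity"].mode()[0])   # mode imputation (categorical attribute)
```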
18
Q

exploit information of the same attribute from other instances: single missing value vs multiple missing values

A
  • single missing measurement: take the average of the value before and after the missing one
  • multiple missing measurements: linear imputation (interpolate linearly between the surrounding known values; see the sketch below)
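
Both cases reduce to linear interpolation over the series; a pandas sketch (the toy series is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
s_imputed = s.interpolate(method="linear")
# single missing value    -> average of the neighbouring values (2.0)
# multiple missing values -> points on the straight line between 3.0 and 6.0 (4.0 and 5.0)
```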
19
Q

kalman filter

A

estimates expected values based on historical data with:

  1. prediction
  2. measurement
  3. updating

if the observed value deviates too much, we can impute it with the expected value

20
Q

kalman filter steps

A
  1. measure the state s_t through the observation x_t
  2. state prediction: predict the next state s-hat_{t|t-1} from the previous state using the transition matrix F
  3. predict the error covariance matrix P_{t|t-1}, which quantifies the expected error between the actual and the predicted state
  4. update the state with the new measurement: s-hat_{t|t} = prediction + kalman gain * (x_t - predicted measurement), i.e. the prediction corrected by the weighted measurement residual (see the sketch below)
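
A minimal one-dimensional sketch of these steps (a constant-state model with F = 1; the process and measurement noise values q and r are illustrative assumptions):

```python
import numpy as np

def kalman_1d(measurements, q=1e-4, r=0.5):
    """Scalar Kalman filter: transition F = 1, process noise q, measurement noise r."""
    s_hat, p = measurements[0], 1.0              # initial state estimate and error covariance
    filtered = []
    for x_t in measurements:
        # prediction step: predict the next state and its error covariance
        s_pred, p_pred = s_hat, p + q
        # update step: the Kalman gain weighs the prediction against the measurement
        k = p_pred / (p_pred + r)
        s_hat = s_pred + k * (x_t - s_pred)      # prediction + gain * measurement residual
        p = (1 - k) * p_pred
        filtered.append(s_hat)
    return np.array(filtered)
```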
21
Q

data transformation

A
  1. lowpass filter
  2. PCA
22
Q

lowpass filter

A
  • for periodic data
  • series of values is decomposed into different periodic signals that come with their own frequencies
  • removes the high-frequency components (above a chosen cutoff frequency), which are assumed to be irrelevant noise
23
Q

PCA

A
  • find new features that explain most of the variability in our data
  • select the number of components based on the explained variance
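
A sketch with scikit-learn's PCA (assumed library; the 95% explained-variance threshold and the toy data are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.normal(size=(200, 10))    # 200 instances, 10 original features
pca = PCA(n_components=0.95)            # keep as many components as needed for 95% explained variance
X_new = pca.fit_transform(X)            # the new features (principal components)
print(pca.explained_variance_ratio_)    # explained variance per component
```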
24
Q

temporal specific outlier/imputation methods

A
  1. interpolation based imputation
  2. kalman filter
  3. lowpass filter
25
Q

lowpass filter parameters

A
  • f_c = cutoff frequency
  • n = the filter order: determines how quickly frequencies above the cutoff are attenuated
    –> i.e., higher n = steeper roll-off, so those frequencies are filtered out more strongly (see the sketch below)
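
A sketch of a Butterworth lowpass filter with SciPy (assumed library; the sampling rate fs and the values for f_c and n are illustrative):

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, f_c, n = 100.0, 1.5, 4                        # sampling rate (Hz), cutoff frequency (Hz), filter order
b, a = butter(n, f_c / (0.5 * fs), btype="low")   # cutoff normalised by the Nyquist frequency
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)  # slow signal + fast noise
filtered = filtfilt(b, a, signal)                 # keeps the 0.5 Hz component, removes the 20 Hz one
```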
26
Q

lowpass filter magnitude

A
  • frequencies f well above the cutoff = low magnitude of the filter = frequency is not forwarded
  • frequencies f below the cutoff = magnitude close to 1 = frequency is kept
27
Q

mean imputation

A

numeric features

28
Q

mode imputation

A

categorical and numeric features

29
Q

median imputation

A

numeric features