lecture 2 - handling sensory noise Flashcards
definition: outlier
an observation that is distant from other observations
possible causes for outliers
- measurement error
- variability in the data
outlier removal methods
- using domain knowledge (e.g., heart rate can't be over 220)
- outlier detection methods
the detected outlying values are then removed or replaced
two types of outlier detection
- distribution based: chauvenet's criterion and mixture models
- distance based: simple, LOF
chauvenet’s criterion: purpose + steps
- purpose: to identify values of an attribute that are unlikely, given a normal distribution
- the criterion essentially states that an observation is an outlier if it is so extreme that its probability of occurrence is less than 1/N
- the higher N, the smaller this threshold, so an observation has to be more extreme before it is flagged as an outlier
- assume a normal distribution for a single attribute/feature
- take mean and standard deviation as parameters for that normal distribution
- for each instance of the attribute, compute the probability of the observation
- define the instance as an outlier when its probability is smaller than chauvenet's criterion (1/N)
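a minimal Python sketch of this procedure for a single attribute; it uses the 1/N threshold from the card (textbook variants of Chauvenet's criterion often use 1/(2N) instead):

```python
import numpy as np
from scipy.stats import norm

def chauvenet_outliers(values):
    """Flag values of a single attribute as outliers using Chauvenet's criterion."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean, std = values.mean(), values.std()
    # two-sided tail probability of seeing a value at least this far from the mean
    prob = 2 * norm.sf(np.abs(values - mean) / std)
    # outlier when the probability falls below the criterion 1/N
    return prob < 1.0 / n
```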
mixture models
- assuming the data of an attribute to follow a single distribution might be too simple
- therefore, we describe the data with k normal distributions, where pi weights each distribution
- we find the parameters (mu, sigma) and combinations of weights (pi) that best describe the data by maximizing the likelihood
mixture models: pi
- weights of the distributions
- all weights together must sum to 1
mixture models: likelihood maximization
- finding parameters pi, mu, and sigma for each distribution
- iteratively improves the parameter estimates to maximize the likelihood
- the likelihood is the product of the probabilities of all data points under the mixture
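a sketch using scikit-learn's GaussianMixture, which performs this likelihood maximization (via EM) for k normal distributions; the data, number of components, and cutoff are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical heart-rate attribute: two normal "modes" plus a few extreme values
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(60, 5, 200), rng.normal(120, 10, 50), [220, 15]])
X = values.reshape(-1, 1)

# fit k = 2 normal distributions; gmm.weights_ holds the mixture weights pi (they sum to 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# log-likelihood of every observation under the fitted mixture;
# very unlikely observations are candidate outliers
log_lik = gmm.score_samples(X)
outliers = log_lik < np.quantile(log_lik, 0.01)  # illustrative cutoff
```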
simple distance-based approach
- points x and y are close when their distance is within d_{min} (smaller d_{min} = more strict)
- a point is an outlier when the fraction of points that lie further away than d_{min} is larger than f_{min}
- does not take local density into account
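a brute-force sketch of the simple distance-based approach; d_min and f_min are illustrative parameters:

```python
import numpy as np

def distance_outliers(points, d_min, f_min):
    """Mark a point as an outlier when the fraction of points farther than d_min exceeds f_min.

    points: (n_instances, n_attributes) array.
    """
    # pairwise Euclidean distances (O(n^2), fine for small data sets)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    far_fraction = (dist > d_min).sum(axis=1) / (n - 1)  # exclude the point itself
    return far_fraction > f_min
```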
LOF properties
- takes local density into account
- consider k closest data points
- computationally expensive
LOF steps
- define k_{dist} for a point A as the distance to its k-nearest neighbour (the furthest of the k neighbours)
- the set of neighbours within k_{dist} is called the k-distance neighbourhood
- define the reachability distance from point A to B as max(k_{dist}(B), d(A, B))
- define the local reachability distance of our point (the inverse of the average reachability distance between A and all points in its neighbourhood)
- compare this to the neighbours' local reachability distances to get the local outlier factor
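scikit-learn ships these steps as LocalOutlierFactor; a minimal usage sketch with hypothetical data (n_neighbors is illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# hypothetical two-attribute sensor data with one far-away point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # LOF values; clearly above 1 suggests an outlier
```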
k_{dist}
distance of a point to its k-nearest neighbor
- i.e., the furthest neighbor
- k_{dist} is small when a point is in a dense region
k-distance neighbourhood
the set of data points that have a distance to x^j_i of at most k_{dist}
reachability distance
- a robust version of the distance between point A and B
- defined as the maximum of [the distance between point A and B] and [the k_{dist} of point B]
- this distance is k_{dist}(B) if A is inside of the neighbourhood of B, and the actual distance if it is outside of the neighbourhood of B
- ensures that smaller distances within the k-distance neighborhood are not overemphasized
local reachability distance
1/ ([sum of reachability distances from A to all points in the neighbourhood] / [number of points in the neighbourhood])
- inverse of the average reachability
- small value = likely an outlier
- a positive value (not restricted to [0,1]): large in dense neighbourhoods, small in sparse ones
LOF formula
(sum over the neighbours B of [lrd(B) / lrd(A)]) / number of points in the neighbourhood
- tells you how much of an outlier point A is compared to its neighbourhood: values around 1 mean a similar density, values clearly above 1 indicate an outlier
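a brute-force sketch that follows the cards above (k_{dist}, reachability distance, lrd, LOF); it simplifies the k-distance neighbourhood to exactly k neighbours:

```python
import numpy as np

def local_outlier_factor(points, k):
    """Compute the LOF for every row of points (O(n^2) sketch)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour

    neigh = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest neighbours
    k_dist = dist[np.arange(n), neigh[:, -1]]  # distance to the furthest of them

    # reachability distance: reach(A, B) = max(k_dist(B), d(A, B))
    reach = lambda a, b: max(k_dist[b], dist[a, b])

    # local reachability distance: inverse of the average reachability distance to the neighbours
    lrd = np.array([1.0 / np.mean([reach(a, b) for b in neigh[a]]) for a in range(n)])

    # LOF: average ratio of the neighbours' lrd to the point's own lrd
    return np.array([np.mean(lrd[neigh[a]]) / lrd[a] for a in range(n)])
```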
imputation
- replace missing values by a substituted value (e.g., mean, mode, median)
- more advanced:
- exploit information from other attributes in the same instance (row)
- values of the same attributes from other instances (column).
-> single vs. multiple missing values
exploit information of the same attribute from other instances: single missing value vs multiple missing values
- single missing measurement: take the average of the instance before and after the missing value
- multiple missing measurements: linear imputation
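a sketch with pandas: linear interpolation covers both cases (a single gap becomes the average of its neighbours, a longer gap is filled along a straight line); the series is hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical heart-rate series with a single gap and a two-value gap
hr = pd.Series([72, np.nan, 76, 80, np.nan, np.nan, 92])

hr_imputed = hr.interpolate(method="linear")
# -> 72, 74, 76, 80, 84, 88, 92
```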
kalman filter
estimates expected values based on historical data with:
- prediction
- measurement
- updating
if the observed value deviates too much, we can impute it with the expected value
kalman filter steps
- measure the state s_t through the observation x_t
- state prediction: predict the next state s-hat_{t|t-1} from the previous state using the transition matrix F
- predict the error covariance matrix P_{t|t-1}, which captures the expected error between the true state and the predicted state
- update the state with the new measurement: s-hat_{t|t} = prediction + Kalman gain * (measurement - predicted measurement)
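a minimal scalar Kalman filter sketch for a single sensor value; F, H, and the noise terms q and r are illustrative assumptions:

```python
import numpy as np

def kalman_filter_1d(measurements, F=1.0, H=1.0, q=1e-3, r=0.5):
    """Scalar Kalman filter: predict the state, then correct it with each measurement."""
    s_hat, P = measurements[0], 1.0              # initial state estimate and error covariance
    filtered = []
    for x_t in measurements:
        # prediction step
        s_pred = F * s_hat                       # s_hat_{t|t-1}
        P_pred = F * P * F + q                   # P_{t|t-1}
        # update step
        K = P_pred * H / (H * P_pred * H + r)    # Kalman gain
        s_hat = s_pred + K * (x_t - H * s_pred)  # correct with the measurement residual
        P = (1 - K * H) * P_pred
        filtered.append(s_hat)
    return np.array(filtered)
```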
data transformation
- lowpass filter
- PCA
lowpass filter
- for periodic data
- series of values is decomposed into different periodic signals that come with their own frequencies
- filters out irrelevant frequencies by removing the high-frequency components above a cutoff frequency
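a sketch of a Butterworth lowpass filter with SciPy; the sampling rate, cutoff frequency, and signal are illustrative:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, cutoff = 100.0, 1.5                       # sampling rate (Hz) and cutoff frequency (Hz)

# hypothetical signal: slow periodic component plus high-frequency noise
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(len(t))

b, a = butter(N=4, Wn=cutoff / (0.5 * fs), btype="low")  # 4th-order lowpass
filtered = filtfilt(b, a, signal)                        # zero-phase filtering
```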
PCA
- find new features that explain most of the variability in our data
- select the number of components based on the explained variance
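a sketch with scikit-learn: fit PCA, then pick the number of components from the cumulative explained variance (the data and the 95% target are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical (n_instances, n_attributes) sensor data; standardize so no attribute dominates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)   # cumulative explained variance
n_components = int(np.argmax(cum_var >= 0.95)) + 1   # smallest number reaching 95% variance
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```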
temporal specific outlier/imputation methods
- interpolation based imputation
- kalman filter
- lowpass filter