Discretisation / Modelling Continuous Data Flashcards

1
Q

What is Discretisation, and where might it be used?

A

Discretisation = The translation of continuous attributes into nominal attributes.

Often used with learners such as Decision Trees, which generally work better with nominal attributes.

2
Q

Summarise some approaches to supervised discretisation

A
  • Naïve Supervised Discretisation
  • Information-Based Supervised Discretisation
  • General idea: sort the possible values, then create a nominal value for each region where most of the instances share the same label (i.e. group 'em up)
3
Q

What is Equal Width?

A

Equal width is an unsupervised method

  • Divides the range of possible values seen in the training set into equally-sized sub-divisions, regardless of the number of instances (sometimes 0) in each division

1) max value − min value = range
2) range / number of buckets = width of each bucket
3) boundaries at min + width, min + 2 × width, …, until the max is reached
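The three steps above can be sketched in Python (function names are illustrative, not from the source):

```python
def equal_width_boundaries(values, num_buckets):
    """Return the interior bucket boundaries for equal-width discretisation."""
    lo, hi = min(values), max(values)            # step 1: max - min = range
    width = (hi - lo) / num_buckets              # step 2: range / number of buckets
    # step 3: interior boundaries at lo + width, lo + 2*width, ...
    return [lo + width * i for i in range(1, num_buckets)]

def bucket_of(x, boundaries):
    """Map a value to its 0-based bucket index."""
    return sum(x >= b for b in boundaries)

equal_width_boundaries([2, 4, 5, 9, 10], 4)      # range 8, width 2.0 → [4.0, 6.0, 8.0]
```

Note that, as the card says, a bucket may end up with zero instances (e.g. no value falls between 6.0 and 8.0 above).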

4
Q

What is Equal Frequency?

A

Equal frequency is an unsupervised method

  • Divides the range of possible values seen in the training set, such that roughly the same number of instances appear in each bucket

1) for a specific attribute, sort the instances in ascending order
2) split into however many buckets we want, each holding roughly the same number of instances
3) so that new data added later can also be transformed, define each dividing point midway between (i.e. at the median of) the two neighbouring values
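A minimal Python sketch of those steps (names are illustrative):

```python
def equal_frequency_boundaries(values, num_buckets):
    """Split the sorted values into buckets of roughly equal size and return
    the dividing points, each placed midway between the two neighbouring
    values so data added later can be bucketed consistently."""
    vals = sorted(values)                        # step 1: sort ascending
    n = len(vals)
    boundaries = []
    for i in range(1, num_buckets):              # step 2: num_buckets - 1 splits
        idx = i * n // num_buckets
        # step 3: dividing point midway between the neighbouring values
        boundaries.append((vals[idx - 1] + vals[idx]) / 2)
    return boundaries

equal_frequency_boundaries([1, 3, 3, 7, 8, 9], 3)   # → [3.0, 7.5]
```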

5
Q

What is k-means in the context of discretisation?

A

K-means is a “clustering” approach, but it can work well in the context of discretisation.

  • If we want k buckets, we randomly select k points to act as seeds
  • We then iterate until convergence:
    • assign each instance to the bucket of the closest seed (centroid)
    • update the “centroid” of each bucket to the mean of the values currently assigned to it
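A one-dimensional sketch of this iterative loop in Python (a toy version, assuming numeric values and Euclidean distance):

```python
import random

def kmeans_1d(values, k, max_iters=100):
    """Simple 1-D k-means: returns the final centroids (bucket centres)."""
    centroids = random.sample(list(values), k)        # k random seed points
    for _ in range(max_iters):
        # assignment step: each value joins the bucket of the closest centroid
        buckets = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            buckets[nearest].append(x)
        # update step: each centroid becomes the mean of its bucket's values
        new_centroids = [sum(b) / len(b) if b else centroids[i]
                         for i, b in enumerate(buckets)]
        if new_centroids == centroids:                # converged: stop early
            break
        centroids = new_centroids
    return sorted(centroids)

kmeans_1d([1, 2, 3, 10, 11, 12], 2)                   # two clear clusters
```

For well-separated data like the example, the loop converges to the two cluster means regardless of which seeds are drawn.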
6
Q

How to calculate the (sample) mean?

A

mean of a specific attribute C = (1/N) × Σᵢ Cᵢ   (sum the N observed values, then divide by N)

7
Q

How to calculate the standard deviation?

A

1) Sum the squared differences between each attribute value and the sample mean: Σᵢ (Cᵢ − mean(C))²
2) Divide by one less than the number of values (N − 1)
3) Take the positive square root
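Both the sample mean (previous card) and these three steps, sketched in Python:

```python
import math

def sample_mean(values):
    """Mean: sum of the N values, divided by N."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation, following the three steps above."""
    m = sample_mean(values)
    squared_diffs = sum((x - m) ** 2 for x in values)   # step 1
    variance = squared_diffs / (len(values) - 1)        # step 2: divide by N - 1
    return math.sqrt(variance)                          # step 3: positive square root

sample_std([2, 4, 4, 4, 5, 5, 7, 9])                    # ≈ 2.14
```

Dividing by N − 1 rather than N (Bessel's correction) is what makes this the *sample* standard deviation.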

8
Q

How could we use the MEAN and STANDARD DEVIATION when building a classifier?

A

We could construct a Gaussian probability density function, which lets us estimate the probability density of observing any given value, based on counting the number of standard deviations it lies from the mean (its z-score).
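A small sketch of the idea, writing the Gaussian density in terms of the z-score (function names are illustrative):

```python
import math

def z_score(x, mean, std):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / std

def gaussian_pdf(x, mean, std):
    """Gaussian probability density at x, expressed via the z-score."""
    z = z_score(x, mean, std)
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

gaussian_pdf(5.0, 5.0, 2.0)   # density at the mean: 1 / (2 * sqrt(2 * pi)) ≈ 0.199
```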

9
Q

What is a z-score?

A

The number of standard deviations a value lies from the mean: z = (x − mean) / standard deviation. A z-score of 0 means the value equals the mean; z = +2 means it is two standard deviations above the mean.

10
Q

What is a hyperparameter? What does it mean for the model to parametrise the data? How do these relate to the model being non-parametric / parametric?

A

A hyperparameter is a setting chosen before training rather than learned from the data (e.g. k in k-means, or the number of buckets in discretisation). A model parametrises the data when it summarises the training data with a learned set of parameters (e.g. a mean and standard deviation for a Gaussian). A parametric model has a fixed number of parameters regardless of how much data it sees; a non-parametric model's effective complexity grows with the data (e.g. instance-based learners).

11
Q

What are the general two steps in discretisation?

A

1) Decide how many values (= intervals/buckets) to map the features on to
2) Map each continuous value onto a discrete value

12
Q

Pros and Cons of K-means clustering

A
Pros

  • Efficient: O(tkn), where
    n = number of instances
    k = number of clusters
    t = number of iterations
    (normally k, t << n)

Cons

  • Tends to converge to a local minimum; sensitive to the choice of seed instances
  • Need to specify k in advance
  • Not able to handle non-convex clusters
13
Q

Information-Based supervised discretisation

A
  • Cluster values into the two intervals which minimise the entropy

1) sort the values
2) calculate the mean information (class-size-weighted entropy) at each candidate breakpoint in class membership
3) choose the breakpoint with the lowest mean information
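The steps above can be sketched in Python, using (value, label) pairs; the function names are illustrative:

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def best_split(pairs):
    """Return the breakpoint between sorted values that minimises the
    weighted mean entropy (mean information) of the two intervals."""
    pairs = sorted(pairs)                                  # step 1: sort the values
    n = len(pairs)
    best = None
    for i in range(1, n):                                  # step 2: each candidate split
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        mean_info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        breakpoint_ = (pairs[i - 1][0] + pairs[i][0]) / 2  # midway between neighbours
        if best is None or mean_info < best[0]:            # step 3: keep the lowest
            best = (mean_info, breakpoint_)
    return best[1]

best_split([(1, 'a'), (2, 'a'), (3, 'b'), (4, 'b')])       # → 2.5
```

The example splits cleanly at 2.5, where both intervals are pure (mean information 0).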

14
Q

Naïve Supervised Discretisation

A

“cluster” values into class-contiguous intervals

1) sort the values and identify breakpoints in class membership
2) reposition any breakpoints where there is no change in numeric values
3) set the breakpoints midway between the neighbouring values

*SIMPLE TO IMPLEMENT

  • BUT LEADS TO OVERFITTING (every small run of one class gets its own interval)
  • to avoid overfitting, delay inserting a breakpoint until each cluster contains at least n instances
  • or, merge neighbouring clusters until each reaches a certain size / contains at least n instances of the majority class
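A simplified Python sketch of the procedure, including the minimum-cluster-size guard against overfitting (it skips, rather than repositions, breakpoints between identical values; names are illustrative):

```python
def naive_breakpoints(pairs, min_size=1):
    """pairs: (value, label) tuples. Return breakpoints between
    class-contiguous runs, placed midway between neighbouring values.
    min_size delays a breakpoint until the current cluster holds at
    least that many instances."""
    pairs = sorted(pairs)                          # step 1: sort the values
    breakpoints = []
    run = 1                                        # size of the current cluster
    for i in range(1, len(pairs)):
        (v0, c0), (v1, c1) = pairs[i - 1], pairs[i]
        # break only where the class changes AND the numeric value changes
        if c1 != c0 and v1 != v0 and run >= min_size:
            breakpoints.append((v0 + v1) / 2)      # step 3: midway between values
            run = 0
        run += 1
    return breakpoints

naive_breakpoints([(1, 'a'), (2, 'a'), (3, 'b'), (5, 'b'), (6, 'a')])
# → [2.5, 5.5]
```

With min_size > 1, early class changes are ignored until the cluster is big enough, trading some purity for less overfitting.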
15
Q

What is the Gaussian distribution (aka normal distribution)?

A

Given the mean μ and standard deviation σ of a distribution, the Gaussian distribution lets us estimate the probability density at any x: f(x) = (1 / (σ√(2π))) × e^(−(x − μ)² / (2σ²)).

16
Q

Why is smoothing important in NB?

A

Prevents zero probabilities (from attribute value / class combinations never seen in training) from zeroing out the entire product of probabilities in Naïve Bayes, which would otherwise wipe out all the other evidence.
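One common fix is Laplace (add-alpha) smoothing, shown here as a sketch (the function name and parameters are illustrative):

```python
def smoothed_prob(count, class_total, num_values, alpha=1):
    """Laplace (add-alpha) smoothed estimate of P(attribute value | class).
    count: times this value co-occurred with the class in training.
    class_total: total instances of the class.
    num_values: number of distinct values the attribute can take.
    An unseen combination (count == 0) now gets a small non-zero
    probability instead of zeroing out the whole Naive Bayes product."""
    return (count + alpha) / (class_total + alpha * num_values)

smoothed_prob(0, 10, 3)    # unseen value: 1/13 ≈ 0.077, not 0
smoothed_prob(4, 10, 3)    # seen value: 5/13 ≈ 0.385
```

The smoothed probabilities over all num_values values of an attribute still sum to 1 for each class.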