Week 5: Data Exploring + Pre-Processing Flashcards

1
Q

Mean

A

Average of all numbers (the sum of the values divided by how many there are)

2
Q

Median

A

Middle number in a sorted sequence

3
Q

Mode

A

Number that occurs most often within a set

4
Q

Range

A

Difference between highest and lowest values

5
Q

Standard deviation is a measure used to

A

quantify the amount of variation of data values
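
The five summary statistics on the cards so far can be checked with a short sketch using Python's standard `statistics` module. The sample values are made up for illustration, and the population form of the standard deviation (`pstdev`) is an assumption; the cards do not say whether the population or sample form is meant.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # made-up sample, not from the course

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # middle value once sorted
mode = statistics.mode(values)            # most frequent value
value_range = max(values) - min(values)   # highest minus lowest
std_dev = statistics.pstdev(values)       # population standard deviation (assumed form)

print(mean, median, mode, value_range, round(std_dev, 3))
```

Swapping `pstdev` for `statistics.stdev` gives the sample standard deviation, which divides by n - 1 instead of n.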

6
Q

Histogram (2 points)

  • is similar to
  • gives a rough sense of
A
  • a bar chart but groups numbers into ranges (bins)
  • density
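
A minimal sketch of the binning idea behind a histogram, in plain Python. The `histogram` helper, the bin width of 10, and the age data are all hypothetical choices made for illustration.

```python
def histogram(values, bin_width, start):
    """Count how many values fall into each [low, low + bin_width) bin."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value lands in
        low = start + index * bin_width         # lower edge of that bin
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

ages = [21, 22, 25, 31, 33, 34, 36, 47, 52]  # made-up data
width = 10
bins = histogram(ages, bin_width=width, start=20)

# Print a rough text histogram: one '#' per value in the bin.
for low, count in bins.items():
    print(f"[{low}, {low + width}): {'#' * count}")
```

The taller rows show where the values are densest, which is the "rough sense of density" the card refers to.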
7
Q

Name the distribution

A

normal

8
Q

Name the distribution

A

right-skewed (the skew is named for the side the tail extends toward)

9
Q

Name type of distribution

A

Multimodal

10
Q

Draw

  • positive linear association
  • negative linear association
  • non-linear association
  • no association
A
11
Q

Scatter plots show…

A

how two variables relate: how one variable changes as the other changes

12
Q

Correlations show

A

how strongly pairs of variables are related

13
Q

What is the measure of correlation?

A

correlation coefficient r

1 is perfect positive correlation

0 is no correlation

-1 is perfect negative correlation
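
As an illustration, the coefficient r can be computed by hand. This sketch assumes the card means the Pearson correlation coefficient; the `pearson_r` helper and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]                       # made-up data
r_pos = pearson_r(hours, [2, 4, 6, 8, 10])    # y rises exactly with x
r_neg = pearson_r(hours, [10, 8, 6, 4, 2])    # y falls exactly with x
print(r_pos, r_neg)
```

A constant column has zero variance, so this sketch would divide by zero for it; real data needs that case handled.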

14
Q

An outlier is

A

an observation that lies an abnormal distance from other values in a random sample

15
Q

How do you identify outliers? (BPRD)

A
  • box plot
  • probability plot
  • Rosner's test
  • Dixon's test
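
The box plot approach can be sketched numerically. This assumes the conventional rule that points more than 1.5 interquartile ranges beyond the quartiles are flagged; the cards do not state the multiplier, and the `box_plot_outliers` helper and the data are hypothetical.

```python
import statistics

def box_plot_outliers(values, k=1.5):
    """Return values outside the box plot fences (k * IQR beyond the quartiles)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1                                  # interquartile range
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

readings = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # made-up data
outliers = box_plot_outliers(readings)
print(outliers)
```

Different quartile conventions give slightly different fences, so borderline points may be flagged by one tool and not by another.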
16
Q

Handling missing values on a small scale (< 5% of the data)

A

Drop or omit

17
Q

Methods for handling missing values on a larger scale (MKFS)

A
  • Mean
  • K-nearest neighbors
  • fuzzy K-means
  • singular value decomposition
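
A minimal sketch combining the two cards above: drop missing values when they are rare, otherwise fill them with the mean. The `handle_missing` helper is hypothetical, `None` is assumed to mark a missing value, and only the simplest of the four larger-scale methods (mean) is shown.

```python
def handle_missing(values, drop_threshold=0.05):
    """Drop missing entries if under the threshold, else impute the mean.

    Assumes the list is non-empty and contains at least one present value.
    """
    missing = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    if missing / len(values) < drop_threshold:
        return present                        # small scale: drop or omit
    fill = sum(present) / len(present)        # mean of the observed values
    return [fill if v is None else v for v in values]

mostly_complete = [1.0] * 30 + [None]   # about 3% missing, made-up data
sparse = [1.0, None, 3.0]               # about 33% missing, made-up data
print(len(handle_missing(mostly_complete)), handle_missing(sparse))
```

Mean imputation keeps the row count but shrinks the variance of the column, which is one reason the other listed methods exist.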
18
Q

3 types of invalid data

A

missing data values

invalid values that suggest true values

invalid values that provide no information regarding true values

19
Q

What is scaling?

A

scaling features to lie between a given minimum and maximum value
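
The definition above corresponds to min-max scaling, sketched here in plain Python. The `min_max_scale` helper and its default target range of 0 to 1 are assumptions for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot scale a constant feature")
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

incomes = [10, 20, 30]  # made-up data
scaled = min_max_scale(incomes)
centered = min_max_scale(incomes, new_min=-1.0, new_max=1.0)
print(scaled, centered)
```

Because the minimum and maximum define the mapping, a single extreme outlier compresses every other value into a narrow band.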

20
Q

Transformation is…

A

converting data from one format or structure into another format or structure

21
Q

Feature selection is…

A

the process of selecting a subset of relevant features for use in model construction

22
Q

4 Reasons for using feature selection (REIR)

A

reduces the complexity of a model

enables the machine learning algorithm to train faster

improves accuracy if the right subset is chosen

reduces overfitting

23
Q

5 methods for dimensionality reduction

A
  1. Decision tree
  2. Random forest
  3. High correlation filter
  4. Factor analysis
  5. Principal component analysis
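
Of the five methods, the high correlation filter is the easiest to sketch: when two features are almost perfectly correlated, keep only one of them. The helper names, the 0.9 threshold, and the data are all hypothetical, and the sketch assumes no feature is constant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_highly_correlated(features, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson_r(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept

features = {  # made-up data: the second column is the first in other units
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "random_score": [7, 3, 9, 4, 6],
}
selected = drop_highly_correlated(features)
print(list(selected))
```

Which feature of a correlated pair survives depends on iteration order here; a real pipeline would usually pick based on missing-value rate or domain relevance.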
24
Q

Dimensionality reduction…

A

creates new combinations of attributes
