Lecture 4 - Data Understanding II Flashcards

1
Q

What does data preparation do with the information provided by data understanding?

A
  • Selects attributes
  • Reduces the dimension of the data set
  • Selects records
  • Treats missing values
  • Treats outliers
  • Improves data quality
  • Unifies and transforms data
2
Q

What is feature extraction?

A

The construction of new features from the given attributes

For example, instead of using "tasks finished", "hours worked", and "usual hours needed per task" directly, construct a new attribute "efficiency"
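A minimal sketch of such a derived feature; the attribute names and numbers are made up:

```python
# Hypothetical sketch of an "efficiency" feature constructed from raw task attributes.
records = [
    {"tasks_finished": 8, "hours_worked": 40, "usual_hours_per_task": 6},
    {"tasks_finished": 5, "hours_worked": 35, "usual_hours_per_task": 6},
]

for r in records:
    # work accomplished (measured in "usual" hours) per hour actually worked
    r["efficiency"] = r["tasks_finished"] * r["usual_hours_per_task"] / r["hours_worked"]

print([round(r["efficiency"], 2) for r in records])  # [1.2, 0.86]
```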

3
Q

What can be used for feature extraction with simple models?

A

Non-linear functions such as x^p, 1/x, log(x), sin(x), etc.
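A sketch of generating such candidate features for one attribute x (the chosen exponents are arbitrary; log and 1/x are only computed for x > 0 here):

```python
import math

# Candidate non-linear features for a single attribute x.
def candidate_features(x):
    feats = {"x^2": x ** 2, "x^3": x ** 3, "sin(x)": math.sin(x)}
    if x > 0:
        feats["log(x)"] = math.log(x)
        feats["1/x"] = 1 / x
    return feats

print(candidate_features(2.0))
```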

4
Q

How to predict y from x?

A

Prior knowledge (is y dependent on x?), visualization, and trial and error

5
Q

What’s the disadvantage of methods like PCA for feature extraction?

A

Dimensionality reduction techniques like PCA lead to features that can no longer be interpreted in a meaningful way: how should one understand a feature that is a linear combination of 10 attributes?

6
Q

What are some complex data type feature extractions?

A

Text data analysis -> frequency of keywords
Time series data analysis -> Fourier or wavelet coefficients
Graph data analysis -> number of vertices, edges

7
Q

What does feature selection refer to?

A

Techniques used to choose a subset of the features that is as small as possible and sufficient for the data analysis

8
Q

What are the reasons for feature selection?

A
  • Prior knowledge: we know something is irrelevant
  • Quality control: majority of values missing or bad
  • Non-informative: e.g. all values the same
  • Redundancy: Identical or correlated values
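The non-informative and redundancy checks can be sketched like this (toy data; the 0.99 correlation threshold is an arbitrary choice):

```python
# Sketch of two feature-selection checks: drop constant (non-informative)
# columns, then drop one of each highly correlated (redundant) pair.
data = {
    "temp_c":  [20.0, 25.0, 30.0],
    "temp_f":  [68.0, 77.0, 86.0],   # redundant: perfectly correlated with temp_c
    "site_id": [1, 1, 1],            # non-informative: all values the same
    "yield":   [3.1, 5.0, 4.2],
}

# 1. Drop columns where all values are identical.
kept = {k: v for k, v in data.items() if len(set(v)) > 1}

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# 2. Of each highly correlated pair, keep only the first column.
cols = list(kept)
to_drop = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if b not in to_drop and abs(pearson(kept[a], kept[b])) > 0.99:
            to_drop.add(b)
kept = {k: v for k, v in kept.items() if k not in to_drop}

print(sorted(kept))  # ['temp_c', 'yield']
```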
9
Q

What does record selection refer to? Why is it done?

A

Selecting only some rows of the data.
- Timeliness: older data might be outdated
- Representativeness: the sample in the database might not be representative of the whole population
- Rare events: rare but important events, such as stock market crashes, may need targeted selection

10
Q

How to choose records for rare events?

A
  • Artificially increase the proportion of the rare events by adding copies
  • Choose only a subset of the data
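The first option, duplicating the rare records, might be sketched as follows; the class labels, counts, and the target proportion of roughly one third are made up:

```python
# Sketch: increase the proportion of a rare class by adding copies of its records.
records = [("normal", i) for i in range(95)] + [("crash", i) for i in range(5)]

rare = [r for r in records if r[0] == "crash"]
common = [r for r in records if r[0] == "normal"]

# Copy the rare records until they make up roughly a third of the data.
oversampled = common + rare * (len(common) // (2 * len(rare)))

n_crash = sum(1 for r in oversampled if r[0] == "crash")
print(len(oversampled), n_crash)  # 140 45
```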
11
Q

What does data cleansing refer to?

A

Detecting and correcting/removing inaccurate, incorrect or incomplete records from the data set

12
Q

How to improve data quality?

A
  • Convert all characters to the same case
  • Remove spaces etc.
  • Fix the format of numbers
  • Split the fields “Chocolate, 100g” -> “chocolate” “100.0”
  • Normalize the writing
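A few of these steps on a hypothetical product field:

```python
import re

# Sketch of simple cleansing steps: unify case, strip whitespace,
# split the combined field, normalize the number.
raw = "  Chocolate, 100g "

text = raw.strip().lower()                       # "chocolate, 100g"
name, amount = (p.strip() for p in text.split(","))
grams = float(re.sub(r"[^0-9.]", "", amount))    # "100g" -> 100.0

print(name, grams)  # chocolate 100.0
```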
13
Q

What are the four discretizations?

A
  1. Equi-width discretization: splits the range into intervals of the same length, e.g. [0-20, 20-40, 40-60]
  2. Equi-frequency discretization: splits the range into intervals containing roughly the same number of records, e.g. [4, 4, 4, 4]
  3. V-optimal discretization: minimizes the sum of n_i * V_i over the intervals, where n_i is the number of data objects in interval i and V_i is the sample variance of the values in it
  4. Minimal entropy discretization: minimizes the uncertainty (entropy)
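The first two schemes can be sketched in plain Python (toy values, k = 3 intervals):

```python
# Equi-width vs. equi-frequency discretization on made-up values.
values = [2, 5, 7, 9, 14, 21, 25, 33, 41, 58, 59, 60]

def equi_width(vals, k):
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    # interval index 0..k-1; the maximum falls into the last interval
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equi_frequency(vals, k):
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    per_bin = len(vals) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

print(equi_width(values, 3))      # [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
print(equi_frequency(values, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Note how the equi-width intervals are unbalanced (6/2/4 records) while the equi-frequency intervals hold 4 records each.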
14
Q

Why should data sometimes be normalized?

A

To guarantee impartiality for models that use distances

15
Q

What is min-max normalization?

A

All the values are scaled to between 0 and 1; outliers affect it strongly.
x' = (x - min_x) / (max_x - min_x)
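A minimal sketch:

```python
# Min-max normalization: scale values into [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([10, 20, 40, 50]))  # [0.0, 0.25, 0.75, 1.0]
```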

16
Q

What is z-score standardization?

A

Scales the data to have a mean of 0 and a standard deviation of 1.
x' = (x - mean(x)) / std(x)
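A minimal sketch (using the population standard deviation):

```python
# Z-score standardization: mean 0, standard deviation 1.
def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = z_score([2, 4, 4, 4, 5, 5, 7, 9])
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```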

17
Q

What is robust z-score standardization?

A

x' = (x - median(x)) / IQR(x)
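A sketch using a simple quartile scheme (medians of the lower and upper halves; other IQR conventions exist):

```python
# Robust z-score: median and IQR instead of mean and standard deviation.
def robust_z(values):
    s = sorted(values)
    n = len(s)
    med = (s[n // 2] + s[(n - 1) // 2]) / 2
    lower, upper = s[: n // 2], s[(n + 1) // 2:]
    q1 = (lower[len(lower) // 2] + lower[(len(lower) - 1) // 2]) / 2
    q3 = (upper[len(upper) // 2] + upper[(len(upper) - 1) // 2]) / 2
    return [(v - med) / (q3 - q1) for v in values]

# the outlier 100 barely influences how the other values are scaled
print(robust_z([1, 2, 3, 4, 5, 6, 7, 100])[:4])  # [-0.875, -0.625, -0.375, -0.125]
```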

18
Q

What is decimal scaling?

A

For attribute X, let s be the smallest integer larger than log_10(max(|x|)); then

x' = x / 10^s
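A sketch (the example values are made up):

```python
import math

# Decimal scaling: divide by the smallest power of 10 that maps
# every value into (-1, 1).
def decimal_scale(values):
    s = math.floor(math.log10(max(abs(v) for v in values))) + 1
    return [v / 10 ** s for v in values], s

scaled, s = decimal_scale([314, -27, 8])
print(s, scaled)  # 3 [0.314, -0.027, 0.008]
```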

19
Q

What does centering the data matrix mean?

A

Subtracting the attribute (column) means from every row of the matrix X; this moves the data so that its centroid is at the origin.
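A sketch on a tiny matrix:

```python
# Centering: subtract each attribute's (column's) mean from every row.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]

n = len(X)
col_means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
X_centered = [[x - m for x, m in zip(row, col_means)] for row in X]

print(X_centered)  # [[-1.0, -10.0], [0.0, 0.0], [1.0, 10.0]]
```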

20
Q

What is the number of possible 2D scatter plots for m attributes?

A

m(m-1); for m = 50 this gives 50 * 49 = 2450

21
Q

Why do we want to change data to lower dimensional?

A

There could be hundreds of thousands of attributes; to include them all in a plot we need to define a measure that evaluates lower-dimensional plots of the data in terms of how well they preserve the original structure

22
Q

What are parallel coordinates?

A

They draw the coordinate axes parallel to each other, so that there is no limitation for the number of axes to be displayed.

I.e., a plot for multiple attributes where each record becomes a polyline crossing all the axes (a shape like \/_/_)

23
Q

What is the basic idea for dimensionality reduction?

A

Map the data from n-dimensional space to q-dimensional space (q = 2 or 3)

R^n -> R^q

24
Q

What is a linear map?

A

New attributes are linear combinations of old ones.

new_feature = 0.5 * feature_1 + 0.3 * feature_2
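A sketch of applying such a map to one record; the weight matrix W is made up:

```python
# A linear map: each new feature is a weighted sum of the old features.
W = [[0.5, 0.3],    # new_feature_1 = 0.5*f1 + 0.3*f2
     [1.0, -0.5]]   # new_feature_2 = 1.0*f1 - 0.5*f2

def linear_map(record):
    return [sum(w * x for w, x in zip(row, record)) for row in W]

print(linear_map([10.0, 20.0]))  # [11.0, 0.0]
```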

25
Q

How does PCA work?

A

PCA uses the variance in the data as the structure preservation criterion. It tries to preserve as much as possible of the original variance of the data when projecting to a lower-dimensional space.

It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
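A minimal PCA sketch via eigendecomposition of the covariance matrix (random toy data; the usual z-score standardization is omitted for brevity):

```python
import numpy as np

# PCA: find orthogonal directions of maximal variance and project onto them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # make two attributes correlated

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]        # sort by explained variance, descending
components = eigvecs[:, order]           # columns are the principal components
projected = Xc @ components[:, :2]       # project onto the top-2 components

print(projected.shape)  # (100, 2)
```

The orthogonality of the transformation shows up as `components.T @ components` being (numerically) the identity matrix.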

26
Q

How do the principal components get determined?

A

The first PC has the largest possible variance, and each succeeding component in turn has the highest variance under the constraint that it is orthogonal to the preceding components.

27
Q

Is PCA sensitive to the relative scaling of the original variables?

A

Yes, usually Z-score standardized

28
Q

What is an eigenvector?

A

In PCA, the eigenvectors of the data's covariance matrix give the directions of the principal components; the corresponding eigenvalues give the variance along each direction.

29
Q

what is t-SNE?

A

t-distributed stochastic neighbor embedding. Non-linear dimensionality reduction method

30
Q

How does t-SNE work?

A

Similar items end up at nearby points and dissimilar items at distant points.
Caveat: it can generate apparent clusters even when the data does not support them.

31
Q

What are the two stages of t-SNE?

A
  1. A probability distribution over pairs of high-dimensional objects is constructed so that similar objects receive higher probability while dissimilar points receive lower probability
  2. A similar probability distribution is generated for the points in the low-dimensional map
32
Q

What are some dimensionality reduction methods?

A
  1. PCA
  2. t-SNE
  3. Kernel PCA (non-linear)
  4. Linear discriminant analysis (used in classification; finds a low-dimensional representation of the data that separates the classes well)