Features: Simple Numbers Flashcards
What are the three aspects we need to consider when dealing with features?
1) Check whether the magnitude matters: do we just need to know whether it is positive or negative, or do we need a precise value?
2) Check the scale of the features. It is often a good idea to normalize the features so that the output stays on an expected scale.
3) Check the distribution of numeric features. The distribution of input features matters to some models; for instance, a linear regression model assumes the errors come from a normal distribution.
What do we mean by quantization or binning?
It is a way to split the data into categories (bins). For example:
0-12 years old
13-17 years old
18-25 years old
etc.
This method is useful because raw counts that span several orders of magnitude can wreak havoc in some models.
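A minimal sketch of binning with pandas, reusing the age bins above; the ages themselves are made up for illustration:

```python
import pandas as pd

# Made-up ages to bin.
ages = pd.Series([4, 12, 15, 19, 23, 25])

# pd.cut assigns each value to a bin; intervals are right-inclusive,
# so the edges below give (0, 12], (12, 17], (17, 25].
age_group = pd.cut(ages, bins=[0, 12, 17, 25], labels=["0-12", "13-17", "18-25"])
print(age_group.tolist())  # ['0-12', '0-12', '13-17', '18-25', '18-25', '18-25']
```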
What is the log transformation and why is it used?
We apply the logarithm to a feature in order to compress its scale: large values are pulled in much more than small ones.
The log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.
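A minimal sketch with NumPy; log1p computes log(1 + x), a common variant that remains defined at zero counts:

```python
import numpy as np

# Heavily skewed counts (e.g., word frequencies or review counts).
counts = np.array([1, 3, 10, 250, 10_000])

# log1p compresses the scale: the largest value is pulled in
# far more than the smallest ones.
log_counts = np.log1p(counts)
print(log_counts.round(2))  # [0.69 1.39 2.4  5.53 9.21]
```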
What is feature engineering?
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning itself.
What is a power transform?
In statistics, a power transform is a family of functions applied to create a monotonic transformation of data using power functions. It is a useful data transformation technique for stabilizing variance and making the data more normal-distribution-like.
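A minimal sketch using SciPy's implementation of the Box-Cox transform, a well-known power transform; the exponential sample is made up purely to illustrate strictly positive, right-skewed data (Box-Cox requires positive inputs):

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)  # strictly positive, right-skewed

# With no lambda given, boxcox estimates the power parameter that makes
# the transformed data most normal-like, and returns both.
transformed, lam = boxcox(skewed)
print(f"estimated lambda: {lam:.2f}")
```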
When should we use feature scaling or normalization?
Models that are smooth functions of the input, such as linear regression, logistic regression, or anything that involves a matrix (e.g., PCA or k-means), are affected by the scale of the input, so they benefit from scaling or normalization. Tree-based models, by contrast, are insensitive to feature scale.
What is Min-Max scaling?
$\tilde{x}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
Min-Max scaling squeezes (or stretches) all feature values to be within the range of [0,1].
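A minimal sketch translating the formula directly into NumPy (scikit-learn's MinMaxScaler wraps the same computation with fit/transform semantics):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 11.0])

# Direct translation of the formula; assumes max(x) != min(x).
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.         0.33333333 0.77777778 1.        ]
```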
What is standardization?
$\tilde{x} = \frac{x - \text{mean}(x)}{\sqrt{\text{var}(x)}}$
It subtracts off the mean of the feature (over all data points) and divides by the standard deviation, i.e., the square root of the variance; hence it can also be called variance scaling. The resulting scaled feature has a mean of 0 and a variance of 1.
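A minimal sketch in NumPy (scikit-learn's StandardScaler implements the same idea):

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 11.0])

# Subtract the mean, divide by the standard deviation (sqrt of the variance).
x_std = (x - x.mean()) / x.std()

# The result has mean ~0 and variance ~1 (up to floating-point rounding).
print(round(x_std.mean(), 10), round(x_std.var(), 10))
```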
What is the central limit theorem?
In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (a "bell curve") as the sample size becomes larger, assuming all samples are the same size, and regardless of the shape of the population distribution. The standardized sample mean $Z = \sqrt{n}\,(\bar{X} - \mu)/\sigma$ is approximately $N(0, 1)$.
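A small simulation sketch of the CLT: sample means drawn from a strongly skewed population still look normal, with spread close to $\sigma/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential distribution (mean 1, standard deviation 1),
# which is strongly right-skewed.
n = 50
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# CLT prediction: mean of the means ~ 1, spread ~ 1 / sqrt(50) ~ 0.141.
print(sample_means.mean().round(3), sample_means.std().round(3))
```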
What is the L2 normalization?
This technique normalizes (divides) the original feature value by what's known as the L2 norm, also known as the Euclidean norm.
$\tilde{x}_i = \frac{x_i}{\lVert x \rVert_2}$, where $\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2}$
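A minimal sketch in NumPy; np.linalg.norm computes the Euclidean norm by default for a 1-D array:

```python
import numpy as np

x = np.array([3.0, 4.0])

# Divide by the Euclidean norm, sqrt(3^2 + 4^2) = 5; the result has unit length.
x_normalized = x / np.linalg.norm(x)
print(x_normalized)  # [0.6 0.8]
```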
What do you need to consider when adding interaction features?
They are simple to formulate, but they are expensive to use. The training and scoring time of a linear model with pairwise interaction features would go from $O(n)$ to $O(n^2)$, where n is the number of singleton features.
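A minimal sketch of generating pairwise interaction features with scikit-learn; interaction_only=True keeps the products $x_i x_j$ but drops the squares:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0, 5.0]])  # one sample with three singleton features

# Pairwise products only: 2*3, 2*5, 3*5 are appended to the originals.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(X))  # [[ 2.  3.  5.  6. 10. 15.]]
```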
What are the three classes of feature selection techniques?
1) Filtering: Filtering techniques preprocess features to remove ones that are unlikely to be useful for the model. For example, one could compute the correlation or mutual information between each feature and the response variable, and filter out the features that fall below a threshold (see the sketch after this list).
2) Wrapper methods: These techniques allow you to try out subsets of features, which means you will not accidentally prune away features that are uninformative by themselves but useful when taken in combination.
3) Embedded methods: These methods perform feature selection as part of the model training process. In other words, they are part of the model. For example, a decision tree inherently performs feature selection because it selects one feature on which to split the tree at each training step.
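A minimal sketch of the filtering approach using scikit-learn's mutual-information scorer; the synthetic dataset is made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic classification data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

# Keep the 3 features with the highest mutual information with the response.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_filtered = selector.fit_transform(X, y)
print(X_filtered.shape)  # (500, 3)
```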