Representation Flashcards
one-hot encoding
A vector in which exactly one element is 1 (representing a specific feature value from the vocabulary of possible values) and all other elements are 0.
multi-hot encoding
A vector in which multiple elements are 1 (each representing a feature value from the vocabulary of possible values) and all other elements are 0.
Feature engineering
Feature engineering is the process of mapping raw data into a feature vector (feature values). Expect to spend significant time doing feature engineering.
How does feature engineering for numerical values work?
A numerical value can be copied into the feature vector more or less directly (e.g. an integer can simply be converted to a float value).
How does feature engineering for non-numerical values (categorical values) work?
A non-numerical value needs to be mapped to a numerical representation, either a single number or a vector.
We can accomplish this by defining a mapping from the feature values (which we’ll refer to as the vocabulary of possible values) to integers. For example, for a street_name feature, not every street in the world will appear in our dataset, so we can group all other streets into a catch-all “other” category, known as an OOV (out-of-vocabulary) bucket.
Representing each vocabulary entry as its own element of a binary vector effectively creates a Boolean variable for every feature value (e.g., every street name). Here, if a house is on Shorebird Way, then the binary value is 1 only for Shorebird Way, so the model uses only the weight learned for Shorebird Way.
Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both their respective weights.
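A minimal sketch of the vocabulary-to-integer mapping with an OOV bucket, assuming a made-up three-street vocabulary (the street names are illustrative):

street_vocabulary = {
    "Charleston Road": 0,
    "North Shoreline Boulevard": 1,
    "Shorebird Way": 2,
}
OOV_INDEX = 3  # catch-all bucket for every street not in the vocabulary

def street_to_index(street_name):
    # Unknown streets all fall into the OOV bucket.
    return street_vocabulary.get(street_name, OOV_INDEX)

print(street_to_index("Shorebird Way"))  # 2
print(street_to_index("Main Street"))    # 3 (out of vocabulary)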
What is OOV?
out-of-vocabulary
A catch-all bucket to which all feature values that are not in the defined vocabulary are mapped.
What do you need to consider when feature engineering categorical values?
If we incorporate these index numbers directly into our model, they impose some constraints that might be problematic:
We’ll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore, this would assume you have ordered the streets by their average house price. Our model needs the flexibility of learning a different weight for each street, which is then added to the price estimated using the other features.
We aren’t accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there’s no way to encode that information in the street_name value if it contains a single index.
To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:
For values that apply to the example, set corresponding vector elements to 1.
Set all other elements to 0.
The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.
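A minimal sketch of building such a binary vector from the vocabulary (the street names are the assumed example vocabulary from above):

def encode_streets(streets, vocabulary):
    # Binary vector of vocabulary length: 1 for each value that applies, 0 elsewhere.
    return [1 if v in streets else 0 for v in vocabulary]

vocabulary = ["Charleston Road", "North Shoreline Boulevard", "Shorebird Way"]
print(encode_streets(["Shorebird Way"], vocabulary))                     # one-hot: [0, 0, 1]
print(encode_streets(["Charleston Road", "Shorebird Way"], vocabulary))  # multi-hot: [1, 0, 1]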
What is a sparse feature?
Feature vector whose values are predominantly zero or empty. For example, a vector containing a single 1 value and a million 0 values is sparse. As another example, the words in a search query could also be a sparse feature: there are many possible words in a given language, but only a few of them occur in a given query.
What is a sparse representation?
A representation of a tensor (vector) that stores only the nonzero elements (typically as index/value pairs).
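A minimal sketch of the idea, storing only the nonzero entries in an index-to-value map (names are illustrative):

dense = [0, 0, 0, 7, 0, 0, 2, 0]

# Sparse representation: keep only the indices that hold nonzero values.
sparse = {i: v for i, v in enumerate(dense) if v != 0}
print(sparse)  # {3: 7, 6: 2}

# Reconstructing the dense vector from the sparse representation.
restored = [sparse.get(i, 0) for i in range(len(dense))]
print(restored)  # [0, 0, 0, 7, 0, 0, 2, 0]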
Qualities of good features
1) Avoid rarely used discrete feature values. Good feature values should appear more than 5 or so times in a data set.
2) Prefer clear and obvious meanings. Each feature should have a clear and obvious meaning to anyone on the project. (e.g. age should not be a value like 34098)
3) Don’t mix “magic” values with actual data. Good floating-point features don’t contain peculiar out-of-range discontinuities or “magic” values. For example, suppose a feature holds a floating-point value between 0 and 1. A value like -1 would be a “magic” value.
To work around magic values, convert the feature into two features: one feature holds only the actual quality ratings, never magic values; the other holds a boolean value indicating whether a quality_rating was supplied. Give this boolean feature a name like is_quality_rating_defined.
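A minimal sketch of that split, assuming -1 is the magic value used to mark a missing quality_rating:

def split_magic_value(quality_rating):
    # -1 is the assumed "magic" marker for a missing rating.
    is_defined = quality_rating != -1.0
    # Replace the magic value with a neutral default (0.0 here) so the numeric
    # feature never carries the out-of-range marker itself.
    rating = quality_rating if is_defined else 0.0
    return {"quality_rating": rating, "is_quality_rating_defined": float(is_defined)}

print(split_magic_value(0.82))  # {'quality_rating': 0.82, 'is_quality_rating_defined': 1.0}
print(split_magic_value(-1.0))  # {'quality_rating': 0.0, 'is_quality_rating_defined': 0.0}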
4) Account for upstream instability. The definition of a feature shouldn’t change over time. For example, a value like “br/sao_paulo” is useful because the city name probably won’t change. (Note that we’ll still need to convert such a string to a one-hot vector.) By contrast, gathering a value inferred by another model carries additional costs: perhaps the value “219” currently represents Sao Paulo, but that representation could easily change on a future run of the other model.
Cleaning data - Scaling features
Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:
1) Helps gradient descent converge more quickly.
2) Helps avoid the “NaN trap,” in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
3) Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
You don’t have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
Do all features have to be scaled within the same range?
You don’t have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
how to calculate the standard deviation?
Calculate the mean value.
Sum up the squared differences (value - mean) across all values.
Divide by the number of values (if the values are the total population), or by the number of values minus 1 (if the values are a sample of the population). This gives the variance.
Take the square root of the variance to get the standard deviation.
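A minimal sketch of both variants (population vs. sample) in plain Python:

import math

def standard_deviation(values, sample=False):
    mean = sum(values) / len(values)
    squared_diffs = sum((v - mean) ** 2 for v in values)
    # Divide by n for a whole population, by n - 1 for a sample.
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(squared_diffs / divisor)

data = [80, 100, 120]
print(standard_deviation(data))               # population: ~16.33
print(standard_deviation(data, sample=True))  # sample: 20.0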
how to scale numerical data?
One obvious way to scale numerical data is to linearly map [min value, max value] to a small scale, such as [-1, +1].
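A minimal sketch of that linear mapping (the natural range 100 to 900 is an assumed example):

def linear_scale(value, min_value, max_value):
    # Map [min_value, max_value] linearly onto [-1, +1].
    return 2 * (value - min_value) / (max_value - min_value) - 1

print(linear_scale(100, 100, 900))  # -1.0
print(linear_scale(500, 100, 900))  #  0.0
print(linear_scale(900, 100, 900))  #  1.0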
Another popular scaling tactic is to calculate the Z score of each value. The Z score is the number of standard deviations a value lies away from the mean. In other words: scaled_value = (value - mean) / stddev.
For example, given:
mean = 100
standard deviation = 20
original value = 130
then:
scaled_value = (130 - 100) / 20
scaled_value = 1.5
Scaling with Z scores means that most scaled values will be between -3 and +3, but a few values will be a little higher or lower than that range.
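A minimal sketch of Z-score scaling applied to a whole feature column:

def z_score_scale(values):
    mean = sum(values) / len(values)
    stddev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    # Each value becomes its distance from the mean in standard deviations.
    return [(v - mean) / stddev for v in values]

print(z_score_scale([70, 100, 130]))  # [-1.22..., 0.0, 1.22...]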
what is the Z score of a value?
The Z score is the number of standard deviations a value lies away from the mean: scaled_value = (value - mean) / stddev.
For example, given:
mean = 100
standard deviation = 20
original value = 130
then:
scaled_value = (130 - 100) / 20
scaled_value = 1.5
Scaling with Z scores means that most scaled values will be between -3 and +3, but a few values will be a little higher or lower than that range.