Representation Flashcards
one-hot encoding
A vector in which exactly one element is 1 (representing a specific feature value from the vocabulary of possible values) and all other elements are 0.
multi-hot encoding
A vector in which multiple elements are 1 (each representing a feature value from the vocabulary of possible values) and all other elements are 0.
Feature engineering
Feature engineering is the process of mapping raw data into a feature vector (feature values). Expect to spend significant time doing feature engineering.
How does feature engineering for numerical values work?
A numerical value can be copied into the feature vector more or less directly (e.g. an integer can simply be converted to a float value).
How does feature engineering for non-numerical values (categorical values) work?
A non-numerical value needs to be mapped to a numerical representation, either a single number or a vector.
We can accomplish this by defining a mapping from the feature values (which we’ll refer to as the vocabulary of possible values) to integers. For example, for a street_name feature, not every street in the world will appear in our dataset, so we can group all other streets into a catch-all “other” category, known as an OOV (out-of-vocabulary) bucket.
Representing each vocabulary entry as its own element of a binary vector effectively creates a Boolean variable for every feature value (e.g., every street name). Here, if a house is on Shorebird Way, then the binary value is 1 only for Shorebird Way, so the model uses only the weight learned for Shorebird Way.
Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both their respective weights.
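A minimal sketch of the vocabulary-to-integer mapping with an OOV bucket, assuming a made-up three-street vocabulary (the street names are illustrative):

street_vocabulary = {
    "Charleston Road": 0,
    "North Shoreline Boulevard": 1,
    "Shorebird Way": 2,
}
OOV_INDEX = 3  # catch-all bucket for every street not in the vocabulary

def street_to_index(street_name):
    # Unknown streets all fall into the OOV bucket.
    return street_vocabulary.get(street_name, OOV_INDEX)

print(street_to_index("Shorebird Way"))  # 2
print(street_to_index("Main Street"))    # 3 (out of vocabulary)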
What is OOV?
out-of-vocabulary
A catch-all bucket to which all feature values that are not in the defined vocabulary are mapped.
What do you need to consider when feature engineering categorical values?
If we incorporate these index numbers directly into our model, they impose some constraints that might be problematic:
We’ll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore, this would assume you have ordered the streets by their average house price. Our model needs the flexibility of learning a different weight for each street, which is then added to the price estimated using the other features.
We aren’t accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there’s no way to encode that information in the street_name value if it contains a single index.
To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:
For values that apply to the example, set corresponding vector elements to 1.
Set all other elements to 0.
The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.
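A minimal sketch of building such a binary vector from the vocabulary (the street names are the assumed example vocabulary from above):

def encode_streets(streets, vocabulary):
    # Binary vector of vocabulary length: 1 for each value that applies, 0 elsewhere.
    return [1 if v in streets else 0 for v in vocabulary]

vocabulary = ["Charleston Road", "North Shoreline Boulevard", "Shorebird Way"]
print(encode_streets(["Shorebird Way"], vocabulary))                     # one-hot: [0, 0, 1]
print(encode_streets(["Charleston Road", "Shorebird Way"], vocabulary))  # multi-hot: [1, 0, 1]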
What is a sparse feature?
Feature vector whose values are predominantly zero or empty. For example, a vector containing a single 1 value and a million 0 values is sparse. As another example, the words in a search query could also be a sparse feature: there are many possible words in a given language, but only a few of them occur in a given query.
What is a sparse representation?
A representation of a tensor (vector) that stores only the nonzero elements (typically as index/value pairs).
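A minimal sketch of the idea, storing only the nonzero entries in an index-to-value map (names are illustrative):

dense = [0, 0, 0, 7, 0, 0, 2, 0]

# Sparse representation: keep only the indices that hold nonzero values.
sparse = {i: v for i, v in enumerate(dense) if v != 0}
print(sparse)  # {3: 7, 6: 2}

# Reconstructing the dense vector from the sparse representation.
restored = [sparse.get(i, 0) for i in range(len(dense))]
print(restored)  # [0, 0, 0, 7, 0, 0, 2, 0]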
Qualities of good features
1) Avoid rarely used discrete feature values. Good feature values should appear more than 5 or so times in a data set.
2) Prefer clear and obvious meanings. Each feature should have a clear and obvious meaning to anyone on the project. (e.g. age should not be a value like 34098)
3) Don’t mix “magic” values with actual data. Good floating-point features don’t contain peculiar out-of-range discontinuities or “magic” values. For example, suppose a feature holds a floating-point value between 0 and 1. A value like -1 would be a “magic” value.
To work around magic values, convert the feature into two features: one feature holds only the actual quality ratings, never magic values; the other holds a boolean value indicating whether a quality_rating was supplied. Give this boolean feature a name like is_quality_rating_defined.
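A minimal sketch of that split, assuming -1 is the magic value used to mark a missing quality_rating:

def split_magic_value(quality_rating):
    # -1 is the assumed "magic" marker for a missing rating.
    is_defined = quality_rating != -1.0
    # Replace the magic value with a neutral default (0.0 here) so the numeric
    # feature never carries the out-of-range marker itself.
    rating = quality_rating if is_defined else 0.0
    return {"quality_rating": rating, "is_quality_rating_defined": float(is_defined)}

print(split_magic_value(0.82))  # {'quality_rating': 0.82, 'is_quality_rating_defined': 1.0}
print(split_magic_value(-1.0))  # {'quality_rating': 0.0, 'is_quality_rating_defined': 0.0}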
4) Account for upstream instability. The definition of a feature shouldn’t change over time. For example, a value like “br/sao_paulo” is useful because the city name probably won’t change. (Note that we’ll still need to convert such a string to a one-hot vector.) By contrast, gathering a value inferred by another model carries additional costs: perhaps the value “219” currently represents Sao Paulo, but that representation could easily change on a future run of the other model.
Cleaning data - Scaling features
Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:
1) Helps gradient descent converge more quickly.
2) Helps avoid the “NaN trap,” in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
3) Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
You don’t have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
Do all features have to be scaled within the same range?
You don’t have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000.
how to calculate the standard deviation?
Calculate the mean value.
Sum up the squared differences (value - mean) across all values.
Divide by the number of values (if the values are the total population), or by the number of values minus 1 (if the values are a sample of the population). This gives the variance.
Take the square root of the variance to get the standard deviation.
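A minimal sketch of both variants (population vs. sample) in plain Python:

import math

def standard_deviation(values, sample=False):
    mean = sum(values) / len(values)
    squared_diffs = sum((v - mean) ** 2 for v in values)
    # Divide by n for a whole population, by n - 1 for a sample.
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(squared_diffs / divisor)

data = [80, 100, 120]
print(standard_deviation(data))               # population: ~16.33
print(standard_deviation(data, sample=True))  # sample: 20.0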
how to scale numerical data?
One obvious way to scale numerical data is to linearly map [min value, max value] to a small scale, such as [-1, +1].
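A minimal sketch of that linear mapping (the natural range 100 to 900 is an assumed example):

def linear_scale(value, min_value, max_value):
    # Map [min_value, max_value] linearly onto [-1, +1].
    return 2 * (value - min_value) / (max_value - min_value) - 1

print(linear_scale(100, 100, 900))  # -1.0
print(linear_scale(500, 100, 900))  #  0.0
print(linear_scale(900, 100, 900))  #  1.0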
Another popular scaling tactic is to calculate the Z score of each value. The Z score is the number of standard deviations a value lies away from the mean. In other words: scaled_value = (value - mean) / stddev.
For example, given:
mean = 100
standard deviation = 20
original value = 130
then:
scaled_value = (130 - 100) / 20
scaled_value = 1.5
Scaling with Z scores means that most scaled values will be between -3 and +3, but a few values will be a little higher or lower than that range.
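A minimal sketch of Z-score scaling applied to a whole feature column:

def z_score_scale(values):
    mean = sum(values) / len(values)
    stddev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    # Each value becomes its distance from the mean in standard deviations.
    return [(v - mean) / stddev for v in values]

print(z_score_scale([70, 100, 130]))  # [-1.22..., 0.0, 1.22...]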
what is the Z score of a value?
The Z score is the number of standard deviations a value lies away from the mean: scaled_value = (value - mean) / stddev.
For example, given:
mean = 100
standard deviation = 20
original value = 130
then:
scaled_value = (130 - 100) / 20
scaled_value = 1.5
Scaling with Z scores means that most scaled values will be between -3 and +3, but a few values will be a little higher or lower than that range.