Lecture 13 - Feature Engineering Flashcards
List the 3 steps in machine learning.
1) Get labeled training data.
2) Convert your training data into n-dimensional vectors (feature selection).
3) Run the ML algorithm
In order to do supervised machine learning we must have what?
Labeled training data.
How can we get labeled training data?
- Find a dataset that includes labels
- Label it ourselves
- Trick users into labeling it
- Hire users to label it
Why is it difficult to label data by hand?
- It assumes you have domain expertise
- It is slow and time-consuming
- It is expensive
Why is it difficult to trick users into labeling data?
- It takes time to collect enough labels
- It may take effort to build the system that records the desired behaviors
In order to send data as input to ML algorithms we must do what?
We must convert it into a vector of numbers. This is easy for quantitative variables, which are already numeric, but harder for categorical variables, which need an encoding first.
What can we do to ordinal variables to convert them to numbers?
We can assign them to a sequence of integers that preserves their order.
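A minimal sketch of this idea, assuming a made-up ordinal variable (T-shirt sizes). The names SIZE_ORDER, size_to_int, and encode_sizes are hypothetical and chosen only for illustration:

```python
# Hypothetical ordinal variable: T-shirt sizes, listed from smallest to largest.
SIZE_ORDER = ["small", "medium", "large", "x-large"]

# Map each category to its position in the ordering, so larger sizes
# get larger integers.
size_to_int = {size: rank for rank, size in enumerate(SIZE_ORDER)}


def encode_sizes(values):
    """Replace each ordinal label with its integer rank."""
    return [size_to_int[v] for v in values]


encoded = encode_sizes(["medium", "small", "x-large"])
print(encoded)  # [1, 0, 3]
```

The integers only need to respect the order; the spacing between them (0, 1, 2, 3) is itself an assumption about how far apart the categories are.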
Why can’t we just assign random numbers to nominal variables?
The algorithm would get confused: it treats the differences between the assigned numbers as meaningful distances, but nominal categories have no order or distance.
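A small illustration of the problem, assuming a made-up nominal variable (colors) with arbitrarily chosen codes. The names color_code and distance are hypothetical:

```python
# Hypothetical nominal variable with arbitrarily assigned integer codes.
color_code = {"red": 0, "green": 1, "blue": 2}


def distance(a, b):
    """Distance between two colors as a numeric algorithm would see it."""
    return abs(color_code[a] - color_code[b])


# The codes imply that blue is twice as far from red as green is,
# which says nothing true about the colors themselves.
print(distance("red", "green"))  # 1
print(distance("red", "blue"))   # 2
```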
Describe one-hot encoding.
It is the process of using a binarizer to turn a categorical value into a binary vector with one position per category: the position for the value's category is set to 1 and every other position is 0.
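A minimal sketch of one-hot encoding in plain Python, assuming a made-up list of categories. The names COLORS and one_hot are hypothetical; libraries provide ready-made binarizers, but the idea is the same:

```python
# Hypothetical list of all categories the variable can take.
COLORS = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a binary vector with a 1 at the position of `value`.

    A value that is not in `categories` yields an all-zero vector.
    """
    return [1 if value == c else 0 for c in categories]


print(one_hot("red", COLORS))    # [1, 0, 0]
print(one_hot("green", COLORS))  # [0, 1, 0]
print(one_hot("blue", COLORS))   # [0, 0, 1]
```

Every pair of distinct categories now differs in exactly two positions, so no category is closer to one than to another.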
Describe bag of words and how it works.
It is an approach used to turn text documents into vectors.
1) Take all the words in the corpus and assign each one a number.
2) Define a vector in which each index stands for one word.
3) Map each document to such a vector, marking 1 at the position of every word the document contains.
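The steps above can be sketched as follows, assuming a tiny made-up corpus, lowercasing, and whitespace tokenization. The names build_vocabulary and vectorize are hypothetical, and marking presence with 1 (rather than counting occurrences) follows the description above:

```python
def build_vocabulary(corpus):
    """Assign each distinct word in the corpus an index (sorted for stability)."""
    words = sorted({w for doc in corpus for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(doc, vocab):
    """Map a document to a vector with 1 at the index of each word it contains."""
    vector = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = 1
    return vector


# Hypothetical two-document corpus.
corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]

print(vocab)    # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Word order is discarded: "the cat sat" and "sat the cat" map to the same vector.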