Lecture 13 - Feature Engineering Flashcards

1
Q

List the 3 steps in machine learning.

A

1) Get labeled training data.
2) Convert your training data into n-dimensional vectors (feature engineering)
3) Run the ML algorithm

2
Q

In order to do supervised machine learning we must have what?

A

Labeled training data.

3
Q

How can we get labeled training data?

A
  • Find a dataset that includes labels
  • Label it ourselves
  • Trick users into labeling it
  • Hire users to label it
4
Q

Why is it difficult to label data by hand?

A

Requires domain expertise
Slow and time-consuming
Expensive

5
Q

Why is it difficult to trick users into labeling data?

A

Takes time to collect

May take effort to create the system to record the desired behaviors.

6
Q

In order to send data as input to ML algorithms we must do what?

A

We must convert it into a vector of numbers. This is easy for quantitative variables but harder for categorical variables.

7
Q

How can we convert ordinal variables into numbers for a vector?

A

We can assign them to a sequence of integers that preserves their order.
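
As a minimal sketch of this idea, the snippet below maps an ordinal variable to integers that follow its ordering. The variable (T-shirt sizes) and the names `SIZE_ORDER`, `size_to_int`, and `encode_sizes` are assumptions made up for illustration, not part of the lecture.

```python
# Hypothetical ordinal variable: T-shirt sizes, ordered small < medium < large.
SIZE_ORDER = ["small", "medium", "large"]

# Map each level to its position in the ordering, so the integers keep the order.
size_to_int = {level: rank for rank, level in enumerate(SIZE_ORDER)}


def encode_sizes(values):
    """Replace each ordinal label with its integer rank."""
    return [size_to_int[v] for v in values]


encoded = encode_sizes(["medium", "small", "large", "medium"])
print(encoded)  # [1, 0, 2, 1]
```

Because the integers follow the natural ordering, comparisons such as "small < large" still hold after encoding.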

8
Q

Why can’t we just assign random numbers to nominal variables?

A

The algorithm treats the order of, and distances between, numbers as meaningful, so arbitrary numbers would impose relationships between categories that do not exist.
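
A small worked example may help here. The color categories and the integer codes below are arbitrary assumptions chosen for illustration; the point is only that the codes create distances the categories do not have.

```python
# Hypothetical nominal variable with arbitrarily assigned integer codes.
color_code = {"red": 0, "green": 1, "blue": 2}


def distance(a, b):
    """Absolute difference between the integer codes of two colors."""
    return abs(color_code[a] - color_code[b])


# The codes suggest red is "closer" to green than to blue,
# even though no such relationship exists between the colors.
print(distance("red", "green"))  # 1
print(distance("red", "blue"))   # 2
```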

9
Q

Describe one-hot encoding.

A

It is the process of taking a categorical value and using a binarizer to turn it into a binary vector with one position per category: the position for the value's category is 1 and all other positions are 0.
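
The following is a minimal plain-Python sketch of one-hot encoding, assuming a small made-up list of color values. The function name `one_hot_encode` is hypothetical; in practice a library binarizer (for example scikit-learn's `LabelBinarizer`) would usually do this job.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)  # one position per category
        vector[index[value]] = 1        # mark only this value's category
        vectors.append(vector)
    return categories, vectors


categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Every vector contains exactly one 1, so no category ends up numerically "closer" to another.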

10
Q

Describe bag of words and how it works.

A

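It is an approach used to turn text documents into vectors.

1) Take all the words in the corpus and assign each unique word a number (its index).
2) Make a vector with one position per word, so that each index stands for a word.
3) Map each document to such a vector, marking 1 at the position of every word the document contains and 0 elsewhere.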
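
The steps above can be sketched in plain Python as follows. The two-document corpus, the whitespace tokenization, and the names `build_vocabulary` and `vectorize` are assumptions for illustration only. This is the binary (presence/absence) variant described on the card; a common alternative stores word counts instead of 1s.

```python
def build_vocabulary(documents):
    """Assign each unique word in the corpus an index (alphabetical order)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: i for i, word in enumerate(words)}


def vectorize(document, vocabulary):
    """Map a document to a binary vector: 1 where the word appears, else 0."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:  # ignore words not seen in the corpus
            vector[vocabulary[word]] = 1
    return vector


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]

print(vocab)    # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Word order is discarded: each vector records only which vocabulary words occur in the document.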
