Lesson 4 Flashcards

1
Q

How does NLP transfer learning work?

A

1) Fit a language model; it predicts the next word of a sentence. This is hard! You need to know a lot about English and a lot about the world! E.g., fit on the WikiText-103 dataset: most of the largest articles on Wikipedia, about 1bn tokens. This is the pre-trained model. 2) Transfer learning: fine-tune it to predict the next word of your domain, aka the target corpus (e.g., movie reviews). You don't need any labels at all: this is "self-supervised" learning. 3) Fine-tune a classifier with labels on a smaller labeled set.

2
Q

What's the trick in creating the language model?

A

Use all the text, from both the train and test sets, to train the language model! No labels are involved, so this isn't cheating.

3
Q

What is the process to fit and fine-tune the language model?

A

language_model_learner(…) creates an RNN; drop_mult=0.3 is a multiplier applied to all the dropout amounts.
lr_find, then fit_one_cycle.
unfreeze, then learn.fit_one_cycle(10, …).
0.30 accuracy is great (so ~1/3 of the time you can predict the exact next word!).
** This training could take over a day. **
Use learn.predict(…) to check it is sensible: you are generating sentences. 26:30
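
A minimal sketch of that recipe in fastai 1.x syntax (the path and CSV name are placeholders, and exact signatures shifted across fastai versions; AWD_LSTM is the WikiText-103 pre-trained model):

from fastai.text import *

# DataBunch for language modeling: the label is simply the next word,
# so no manual labels are needed.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')

# drop_mult scales all of the model's dropout probabilities.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.lr_find()                # pick a learning rate from the plot
learn.fit_one_cycle(1, 1e-2)   # train the new parts first
learn.unfreeze()               # then fine-tune the whole network
learn.fit_one_cycle(10, 1e-3)

# Sanity-check by generating text:
learn.predict("I liked this movie because", 40)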

4
Q

How do you go from language model to classifier?

A

Save the encoder (you don't need the decoder, which is the generator).
Need to ensure you use the SAME VOCABULARY as the language model.
learn = text_classifier_learner(clf, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.freeze()
lr_find
learn.fit_one_cycle(…)
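
A sketch of that hand-off in fastai 1.x (the encoder name 'fine_tuned_enc' follows the card; the CSV name and vocab plumbing are placeholders):

from fastai.text import *

# After fine-tuning the language model, save only the encoder; the
# decoder (the next-word generator head) is discarded.
learn_lm.save_encoder('fine_tuned_enc')

# The classifier's DataBunch must reuse the language model's vocabulary
# so token ids line up with the saved encoder's embeddings.
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.vocab)

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.freeze()               # train only the new classifier head first
learn.lr_find()
learn.fit_one_cycle(1, 2e-2)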

5
Q

What does learn.freeze_to(-2) mean?

A

Unfreeze just the last two layer groups; everything earlier stays frozen.

6
Q

What is the process to fine-tune the classifier?

A

fit_one_cycle()
learn.freeze_to(-2); fit_one_cycle()
learn.freeze_to(-3); fit_one_cycle()
Lastly, unfreeze the entire thing and fit_one_cycle() again.
It helps with text classification to unfreeze one layer group at a time; see the sketch below. 31:29
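
A sketch of that gradual unfreezing (the learning rates are illustrative, not the lesson's exact values):

learn.fit_one_cycle(1, 2e-2)   # stage 1: only the classifier head is trainable

learn.freeze_to(-2)            # stage 2: also unfreeze the last layer group
learn.fit_one_cycle(1, 1e-2)

learn.freeze_to(-3)            # stage 3: unfreeze one more layer group
learn.fit_one_cycle(1, 5e-3)

learn.unfreeze()               # stage 4: train the whole network
learn.fit_one_cycle(2, 1e-3)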

7
Q

Discriminative learning rate

A

How much you decrease the learning rate as you move from layer to layer: earlier layers get smaller learning rates than later ones.

8
Q

What is 2.6?

A

35:4- The factor you divide the learning rate by as you move back one layer group when using discriminative learning rates (see the sketch below). Stephen Merity, Frank Hutter: how you can use a random forest to find optimal hyperparameters. Like AutoML.
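
A one-liner showing where 2.6 typically appears in lesson-style code (the base rate 1e-2 is a placeholder):

# Discriminative learning rates: the last layer group trains at lr,
# and each earlier group is divided by a further factor of 2.6
# (2.6**4 spans the model's layer groups).
lr = 1e-2
learn.fit_one_cycle(1, slice(lr / (2.6 ** 4), lr))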

9
Q

What do you use embeddings for?

A

Categorical data is converted to embeddings. Continuous data is fed in as is.

10
Q

How do we deal with missing data?

A

Replace with the median, and add a binary is_missing column (see the sketch below).
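
A minimal pandas sketch of that preprocessing (the column name age is a made-up example; fastai's FillMissing proc does this automatically):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None, 31]})

# Flag which rows were missing, then fill them with the median.
df["age_is_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())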

11
Q

How can you make a validation set with contiguous periods in fastai 1.x?

A

TabularList.from_df(…).split_by_idx(valid_idx), where valid_idx is the row indices of a contiguous block (e.g., the most recent dates); see the pipeline sketch under the next card.

12
Q

How do you make the tabular learner?

A

get_tabular_learner(data, layers=[200,100], metrics=…) (later renamed tabular_learner in fastai 1.x); layers gives the sizes of the hidden fully connected layers.
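
A sketch of the full tabular pipeline in fastai 1.x, covering this card and the previous one (df, cat_names, cont_names, dep_var, and the validation range are placeholders; tabular_learner is the later name of get_tabular_learner):

from fastai.tabular import *

procs = [FillMissing, Categorify, Normalize]

# A contiguous block of rows (e.g., the most recent dates) as validation set.
valid_idx = range(len(df) - 2000, len(df))

data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var)
        .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)
learn.fit_one_cycle(1, 1e-2)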

13
Q

What is collab filtering?

A

Recommender systems: a bunch of users, and who likes what. The most simple dataset: userId, movieId, numberOfStars. Think of it as a big sparse matrix with movies on one axis, users on the other, and the rating as the value.

14
Q

What is the cold start problem?

A

You have no ratings yet for new users or new movies. Fixes: have a second, metadata-driven model for new users or new movies; or, like the Netflix UX, ask users a bunch of questions when they sign up.

15
Q

Should you use an RNN for tabular time series?

A

Jeremy says not to use an RNN when there are other features you can use (store open? promotion? weather? day of week? etc.).

16
Q

How does collab filtering work?

A

Lesson 4, 1:09: it's a matrix completion problem. M is the userId × movieId matrix. M ≈ AB, where A is a (num users × 5) matrix and B is a (5 × num movies) matrix. A and B are initialized randomly.

It's not really a matrix mult: it's an embedding lookup of vectors. The dot product of a user vector and a movie vector -> a scalar in the M matrix.

The loss function is the difference between each given rating and the corresponding entry of M, squared, then added up (only over the ratings we actually have).

Use gradient descent to make the loss smaller.

This is a single linear layer :-)
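
A minimal PyTorch sketch of that model, including the per-user/per-movie biases and rating-range squashing from the next two cards (the factor size 5 and the (0, 5.5) range follow the lesson's example; the names are illustrative):

import torch
import torch.nn as nn

class DotProductBias(nn.Module):
    # Collaborative filtering as two embedding matrices plus biases.
    def __init__(self, n_users, n_movies, n_factors=5, y_range=(0.0, 5.5)):
        super().__init__()
        self.u_emb = nn.Embedding(n_users, n_factors)   # matrix A: one vector per user
        self.m_emb = nn.Embedding(n_movies, n_factors)  # matrix B: one vector per movie
        self.u_bias = nn.Embedding(n_users, 1)
        self.m_bias = nn.Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, user_ids, movie_ids):
        # Dot product of the looked-up vectors -> one scalar per (user, movie) pair.
        dot = (self.u_emb(user_ids) * self.m_emb(movie_ids)).sum(dim=1)
        res = dot + self.u_bias(user_ids).squeeze(1) + self.m_bias(movie_ids).squeeze(1)
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo  # force into the rating range

# Squared-error loss over the known ratings only, e.g.:
# loss = ((model(users, movies) - ratings) ** 2).mean()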

17
Q

What is an embedding?

A

A matrix of weights.

A matrix of weights which you can look up into and grab one vector out of: designed as something you can index into as an array.

Collab filtering has two embedding matrices: user and movie.

Then you need to add a bias per user and per movie.
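
A tiny sketch of that "array lookup" view (the sizes are made up):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # a 10x4 matrix of weights

idx = torch.tensor([3])
vec = emb(idx)  # grabs row 3 of the weight matrix

# Equivalent to multiplying by a one-hot vector:
one_hot = torch.zeros(1, 10)
one_hot[0, 3] = 1.0
assert torch.allclose(vec, one_hot @ emb.weight)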

18
Q

How do you force a continuous value into a range?

A

sigmoid(res)*(max-min)+min: sigmoid squashes res into (0, 1), then you rescale that to (min, max).
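
As a function (a sketch; the name sigmoid_range is illustrative, and the lesson sets the max slightly above the top rating, e.g., 5.5, so a full 5-star prediction is reachable):

import torch

def sigmoid_range(res, lo, hi):
    # sigmoid gives (0, 1); rescale to (lo, hi).
    return torch.sigmoid(res) * (hi - lo) + lo

print(sigmoid_range(torch.tensor([-3.0, 0.0, 4.0]), 0.0, 5.5))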

19
Q

Inputs

Weights/parameters

Activations

Output

Loss

Metric

Cross-entropy

Softmax

Fine-tuning

A

In PyTorch, weights are called parameters (a parameter could be a weight or a bias).

input @ weights = activations

activation_function(activations) is also called activations.

An activation is the result of either a matrix multiplication or an activation function.

The last layer is likely to be a sigmoid because you want something between two values.

Softmax turns the final activations into positive numbers that sum to 1 (probabilities); cross-entropy is the loss that penalizes a low predicted probability on the correct class.
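
A tiny PyTorch sketch of those definitions (shapes and numbers are made up):

import torch
import torch.nn.functional as F

x = torch.randn(1, 4)                 # inputs
w = torch.randn(4, 3)                 # weights (parameters)

act = x @ w                           # activations from a matrix multiply
act = torch.relu(act)                 # activations from an activation function

probs = torch.softmax(act, dim=1)     # positive, sums to 1
target = torch.tensor([2])
loss = F.cross_entropy(act, target)   # cross-entropy (applies log-softmax internally)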