12 - Terra Incognita Flashcards

1
Q

What is the addition operation described in the context of modular arithmetic?

A

Addition is performed modulo 97, meaning sums wrap around so that results always stay between 0 and 96.

2
Q

How is the sum expressed in modulo-97 addition?

A

sum = x + (some non-negative multiple of 97), where 0 ≤ x ≤ 96; that is, x is the remainder after dividing the ordinary sum by 97.
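A minimal Python sketch of this wrap-around rule (the function name is illustrative):

```python
# Modulo-97 addition: the sum wraps around so the result stays in [0, 96].
MOD = 97

def add_mod97(a: int, b: int) -> int:
    """Return (a + b) mod 97."""
    return (a + b) % MOD

# The ordinary sum can always be written as x + k*97 with 0 <= x <= 96.
print(add_mod97(50, 60))  # 110 = 13 + 1*97, so the result is 13
print(add_mod97(96, 1))   # 97 = 0 + 1*97, so the result is 0
```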

3
Q

What does ‘grokking’ refer to in the context of neural networks?

A

‘Grokking’ describes a neural network achieving a deep understanding of the underlying pattern in its data, going beyond memorization to genuine generalization.

4
Q

What is the relationship between the number of parameters in a neural network and its performance?

A

More parameters can lead to overfitting, while fewer parameters can lead to underfitting.

5
Q

What is overfitting in machine learning?

A

Overfitting occurs when a model learns details and noise in the training data to the extent that it negatively impacts its performance on new data.

6
Q

What is underfitting in machine learning?

A

Underfitting occurs when a model is too simple to capture the underlying pattern of the data.

7
Q

Describe the bias-variance trade-off.

A

High bias leads to underfitting, while high variance leads to overfitting. The goal is to find a balance between the two.

8
Q

What is the role of the test dataset in machine learning?

A

The test dataset is used to evaluate the model’s performance on unseen data, indicating its ability to generalize.

9
Q

What happens when a model is too complex?

A

It may overfit the training data, leading to poor performance on test data due to capturing noise.

10
Q

What is a simple model’s performance on noisy data?

A

A simple model may ignore the noise but also miss the underlying pattern, leading to high training and test errors.

11
Q

Fill in the blank: The number of ________ in a model determines its complexity and capacity.

A

[parameters]

12
Q

True or False: A model with too few parameters will have a high risk of overfitting.

A

False. A model with too few parameters risks underfitting, not overfitting.

13
Q

What is the consequence of a model that tracks every variation in the training data?

A

It leads to overfitting and poor generalization to test data.

14
Q

What does the capacity of a hypothesis class refer to?

A

It refers to the range of functions that a model can approximate based on its parameters.

15
Q

What is the universal approximation theorem?

A

It states that a neural network with sufficient neurons can approximate any function.

16
Q

How does model complexity affect training risk?

A

Training risk decreases as model complexity increases, up to a point, after which it may increase due to overfitting.

17
Q

What is the consequence of using a very simple model on complex data?

A

It will likely result in high training and test errors due to underfitting.

18
Q

Fill in the blank: The goal of an ML engineer is to find the sweet spot between ________ and variance.

A

[bias]

19
Q

What is one of the most significant challenges in model selection?

A

Determining the right level of complexity to avoid underfitting and overfitting.

20
Q

What happens to test error when a model is too complex?

A

The test error tends to increase due to overfitting.

21
Q

Describe the performance of a model that has a high bias.

A

It will likely underfit the training data and perform poorly on test data.

22
Q

What does a highly complex, nonlinear model do during training?

A

It minimizes training errors but may generalize poorly to test data.

23
Q

What happens to the capacity of the hypothesis class when the number of parameters is increased?

A

It increases the capacity of the hypothesis class.

24
Q

What does the dashed curve in the figure represent?

A

The training risk, or the risk that the model makes errors on the training dataset.

25
Q

What is the effect of extremely simple models on training data?

A

They do badly because they are underfitting the data.

26
Q

What happens to the training risk as model complexity increases?

A

It decreases, eventually reaching zero once the model is complex enough to interpolate, i.e., overfit, the training data.

27
Q

What does the solid curve in the figure represent?

A

The risk of error during testing.

28
Q

What is the ‘Goldilocks zone’ in model selection?

A

The optimal balance between underfitting and overfitting.

29
Q

What is the relationship between minimizing test error and generalization ability?

A

Test error is an estimate of generalization error, so minimizing test error is taken as evidence that the model generalizes well to unseen data.

30
Q

What is the conventional wisdom regarding deep neural networks and their parameters?

A

They are over-parameterized and should not generalize well.

31
Q

What surprising observation did Neyshabur and colleagues make about deep neural networks?

A

Increasing the size of the network did not cause it to overfit the training data.

32
Q

What phenomenon occurs when noise is introduced into the dataset?

A

The model accommodates the noise and may still achieve zero training error.

33
Q

What does it mean when a model is said to ‘shatter’ the training data?

A

It fits the training data perfectly, including any noise.

34
Q

What did Neyshabur and colleagues find regarding test error with noisy data?

A

Test error continued decreasing as network size increased past the size required for zero training error.

35
Q

What potential explanation did Neyshabur and colleagues suggest for this behavior?

A

Implicit regularization by stochastic gradient descent.

36
Q

What did Chiyuan Zhang and colleagues conclude about neural networks in their 2016 paper?

A

The effective capacity of several successful neural network architectures is large enough to shatter the training data.

37
Q

What is ‘benign overfitting’?

A

The phenomenon where models can fit noisy data without significant overfitting.

38
Q

What is the role of hyperparameters in neural network training?

A

They are set by engineers before training begins and influence model architecture and training process.

39
Q

What are feedforward neural networks?

A

Networks where information flows one way from input to output.

40
Q

What are recurrent neural networks known for?

A

Allowing feedback connections to remember previous inputs.

41
Q

What is the backpropagation algorithm used for?

A

Training neural networks by minimizing loss.

42
Q

What is a loss function in the context of neural networks?

A

A function that calculates the error made by the network.
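A minimal sketch of one common loss function, mean squared error (names are illustrative; real networks use many others, such as cross-entropy):

```python
# Mean squared error: the average of squared differences between the
# network's predictions and the true targets.
def mse_loss(predictions, targets):
    """Lower loss means fewer or smaller errors; a perfect fit gives zero."""
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

print(mse_loss([1.0, 2.0], [1.0, 2.0]))  # perfect predictions: loss is 0.0
print(mse_loss([3.0], [1.0]))            # error of 2 squared: loss is 4.0
```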

43
Q

What is the purpose of regularization in neural networks?

A

To prevent overfitting by controlling model complexity.

44
Q

What is one method of explicit regularization?

A

Preventing the values for the weights from getting too large.
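A sketch of this idea as L2 weight decay, assuming a penalty term added to the loss (function names and the value of lam are illustrative):

```python
# L2 weight decay: add a penalty proportional to the squared weights,
# so gradient descent is pushed toward smaller weight values.
def l2_penalty(weights, lam=0.01):
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01):
    return data_loss + l2_penalty(weights, lam)

# Large weights inflate the total loss, so the optimizer shrinks them.
small = regularized_loss(1.0, [0.1, 0.2], lam=0.1)  # 1.005
big = regularized_loss(1.0, [5.0, 5.0], lam=0.1)    # 6.0
print(small < big)  # True
```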

47
Q

What is a method to prevent overfitting in neural networks?

A

Randomly drop some connections during training

This technique reduces the number of effective parameters.
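A minimal sketch of this technique (inverted dropout, with illustrative names; real frameworks apply it per layer during training only):

```python
import random

# Dropout: during training, each unit's output is kept with probability
# p_keep and zeroed otherwise; kept outputs are scaled by 1/p_keep so
# the expected activation is unchanged (the "inverted dropout" convention).
def dropout(activations, p_keep=0.5, rng=random):
    out = []
    for a in activations:
        if rng.random() < p_keep:
            out.append(a / p_keep)  # survives, scaled up
        else:
            out.append(0.0)         # connection dropped this step
    return out

random.seed(0)
print(dropout([1.0, 1.0, 1.0, 1.0], p_keep=0.5))  # each value is 0.0 or 2.0
```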

48
Q

What is required for an activation function to work with backpropagation?

A

The activation functions must be differentiable

Some functions, like ReLU, are not differentiable at specific points but can still be used.
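A sketch of the ReLU case: it is not differentiable at exactly x = 0, but a subgradient convention makes backpropagation work in practice (the choice of 0 at x = 0 is one common convention):

```python
# ReLU and the derivative used in practice by backpropagation.
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Not defined at x == 0 in the strict sense; frameworks simply pick
    # a subgradient (here: 0) and training proceeds without trouble.
    return 1.0 if x > 0 else 0.0

print(relu(-2.0), relu(3.0))       # 0.0 3.0
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0
```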

49
Q

What is the primary difference between supervised learning and unsupervised learning?

A

Supervised learning requires labeled training data, while unsupervised learning does not

In unsupervised learning, the algorithm identifies clusters without explicit labels.

50
Q

What is self-supervised learning?

A

A method that creates implicit labels from unlabeled data without human involvement

This approach has led to significant advancements in AI, such as ChatGPT.

51
Q

Who developed a deep neural network solution for pattern analysis at UC Berkeley in 2014?

A

Jitendra Malik and colleagues

Their work focused on the PASCAL VOC dataset.

52
Q

What does R-CNN stand for?

A

Region-based Convolutional Neural Network

R-CNN outperformed existing methods in object detection after being fine-tuned.

53
Q

What was the initial concern of Alexei Efros about R-CNN?

A

Why a network trained on ImageNet could detect object boundaries well after fine-tuning

Efros believed the CNN needed general information from ImageNet for effective boundary detection.

54
Q

What was the outcome of the bet between Efros and Malik?

A

Efros lost the bet when R-CNN remained the best for object detection

The bet was about achieving object detection without human annotations.

55
Q

How do large language models (LLMs) like GPT-3 learn?

A

They predict masked words in sentences from a large corpus of text

The learning process involves calculating loss and updating parameters.
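A toy sketch of that loss calculation for a single masked position, assuming the model has already produced a probability for each vocabulary word (names and the three-word vocabulary are illustrative):

```python
import math

# Self-supervised objective sketch: the model assigns a probability to
# every word in the vocabulary at the masked position; the loss is the
# negative log-probability of the true word (cross-entropy).
def masked_word_loss(probs, true_index):
    return -math.log(probs[true_index])

# Confidently predicting the right word gives a low loss.
confident = masked_word_loss([0.01, 0.98, 0.01], true_index=1)  # ~0.02
unsure = masked_word_loss([0.4, 0.2, 0.4], true_index=1)        # ~1.61
print(confident < unsure)  # True
```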

56
Q

What is the goal of the masked auto-encoder (MAE) developed by Kaiming He and colleagues?

A

To generate unmasked images from masked input images

The MAE learns latent representations of key features in images.

57
Q

What is double descent in machine learning?

A

A phenomenon where increasing model capacity leads to improved performance beyond interpolation

It includes a first descent to a minimum test error, followed by an ascent, and a second descent.

58
Q

What does the term ‘terra incognita’ refer to in the context of deep learning?

A

The unexplored mathematical underpinnings of observed behaviors in over-parameterized neural networks

This contrasts with the well-understood behavior in under-parameterized regimes.

59
Q

What tension exists in the machine learning community according to Tom Goldstein?

A

The tension between theoretical and experimental approaches in machine learning

This has implications for the development of new models and understanding their behavior.

60
Q

What is a significant challenge when dealing with the loss function of deep neural networks?

A

The loss function is non-convex with many local minima

This complicates the process of finding the global minimum.

61
Q

True or False: The loss landscape of deep neural networks is well understood.

A

False

There are conflicting theories about the existence of local minima in the loss landscape.

62
Q

What is the nature of the loss landscape in deep neural networks?

A

It is extremely complicated and may contain local minima or global minima

The landscape’s complexity is a significant challenge for theorists.

63
Q

What did Goldstein’s empirical study reveal about local minima in neural networks?

A

Neural networks can get stuck in not-so-good local minima where the loss is non-zero

This occurs despite the networks being over-parameterized.

64
Q

What is stochastic gradient descent?

A

A method where gradient descent is performed using small batches of training data

It approximates the descent direction rather than using the exact steepest descent.
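A minimal sketch of minibatch SGD fitting a one-parameter linear model (all names and hyperparameters are illustrative):

```python
import random

# Stochastic gradient descent: each update uses the gradient computed on
# a small random batch of examples, not on the full training set.
def sgd_fit_slope(xs, ys, lr=0.01, batch_size=4, steps=500, seed=0):
    rng = random.Random(seed)
    w = 0.0  # single parameter: model is y ≈ w * x
    data = list(zip(xs, ys))
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # gradient of mean squared error with respect to w on this batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad  # step in the (approximate) descent direction
    return w

xs = [float(i) for i in range(1, 11)]
ys = [3.0 * x for x in xs]       # data generated with true slope 3
print(sgd_fit_slope(xs, ys))     # converges close to 3.0
```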

65
Q

What does the term ‘grokking’ refer to in the context of neural networks?

A

The phenomenon where a neural network learns to generalize beyond mere memorization after extensive training

It involves understanding deeper patterns in the data.

66
Q

What is a transformer in machine learning?

A

A type of architecture especially suited for processing sequential data

Examples include LLMs like ChatGPT.

67
Q

How did the transformer network used by Power’s team learn to add numbers?

A

It was trained on a table of modulo-97 addition examples and learned to represent numbers in a high-dimensional space

The learning process involved predicting masked numbers in the addition equations.
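A sketch of what that training table looks like (the tuple representation is illustrative; the actual experiments encode each equation as a token sequence with the answer masked):

```python
# The grokking experiments' training data: all equations "a + b = c"
# under modulo-97 arithmetic.
MOD = 97

def make_addition_table():
    """All 97 * 97 equations as (a, b, (a + b) mod 97) triples."""
    return [(a, b, (a + b) % MOD) for a in range(MOD) for b in range(MOD)]

table = make_addition_table()
print(len(table))          # 9409 equations in total
print((96, 96) + (table[0][2],))  # smallest entry's answer for reference
```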

68
Q

What happens when a transformer network stops training after reaching zero training loss?

A

It likely interpolates the training data and memorizes it

This results in poor performance on unseen test data.

69
Q

What is a phase change in the context of grokking?

A

The transition from memorizing a table of answers to understanding underlying knowledge

It is likened to physical phase changes, such as water turning to ice.

70
Q

What is the significance of Minerva in language models?

A

It was the first LLM to correctly answer about 50% of high school-level math questions in the MATH dataset

Minerva used a large language model architecture to predict answers based on token sequences.

71
Q

How does Minerva generate answers to math questions?

A

It converts the question into a sequence of tokens and predicts the answer token by token

This raises questions about whether it is reasoning or merely pattern matching.

72
Q

What are AI winters?

A

Periods of stagnation in AI research due to lack of progress or overhyped expectations

Notable AI winters occurred in the late 1960s, 1970s, and late 1980s.

73
Q

What does Goldstein argue about the current state of AI research?

A

We may still be experiencing an AI winter regarding tasks that involve text comprehension and logical reasoning

There’s ongoing debate about the effectiveness of neural networks alone in achieving true AI.

74
Q

What was the training approach used for both PaLM and Minerva?

A

They were trained using self-supervised learning to predict masked tokens

This method does not involve explicit reasoning or problem-solving.

75
Q

Fill in the blank: The addition operation in the training data used by Power’s team was _______.

A

modulo-97 addition