Midterm 2 Flashcards

All content necessary for midterm 2

1
Q

What is artificial intelligence?

A

A computer program that mimics the intelligence of humans

2
Q

What is machine learning?

A

A technique a computer can use to learn complex rules from data

3
Q

What is deep learning?

A

A technique for machine learning based on the neurons in the brain

4
Q

List the three types of machine learning and explain them

A
  1. Unsupervised learning - no feedback is given to the algorithm
  2. Supervised learning - every example has a label
  3. Reinforcement learning - a reward or punishment follows each action

5
Q

How does supervised learning training work?

A

The training set is a collection of labelled examples {(x_i, y_i)}, where each x_i is a D-dimensional feature vector and y_i is its label

6
Q

What is k-nearest neighbors?

A

A classifier that looks at the k training examples closest to a new instance (those with the most similar feature values) and assigns the majority label among those neighbors
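
A minimal sketch of the idea with scikit-learn; the training points and the choice of k = 3 are made up for illustration:

```python
# Hypothetical toy data: two features per example, binary labels.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]]
y_train = [0, 0, 1, 1]

# k = 3: a new point takes the majority label of its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[7.5, 8.0]]))  # likely [1]: its closest neighbors are class 1
```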

7
Q

What is linear regression?

A

Supervised machine learning used on continuous numerical data. It enables us to identify a linear trend and outliers
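
A short scikit-learn sketch; the data is invented so that the trend is roughly y = 2x:

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # a single continuous feature
y = [2.1, 3.9, 6.2, 8.1]   # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope near 2, intercept near 0
print(model.predict([[5]]))           # value on the fitted linear trend
```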

8
Q

What is binary classification?

A

Supervised learning where the objective is to assign every instance to one of two classes (e.g. logistic regression)

9
Q

What is multi-class classification?

A

Supervised learning that sorts instances into 3 or more discrete classes. It can be transformed into a set of binary problems:
- one vs all (OvA)

10
Q

Explain one vs all

A

A separate binary classifier is trained for each class. Each classifier labels one class as positive and all others as negative; the final assignment is based on the classifier with the highest confidence score
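
A hand-rolled sketch of this procedure on top of a binary classifier (scikit-learn also packages the pattern as OneVsRestClassifier); the data shapes and the helper name are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ova_predict(X_train, y_train, X_test, classes):
    y_train = np.asarray(y_train)
    scores = []
    for c in classes:
        # One binary classifier per class: c is positive, all others negative.
        clf = LogisticRegression().fit(X_train, (y_train == c).astype(int))
        scores.append(clf.predict_proba(X_test)[:, 1])  # confidence for class c
    # Final assignment: the class whose classifier is most confident.
    return np.asarray(classes)[np.argmax(scores, axis=0)]
```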

11
Q

What is a decision boundary?

A

A boundary which partitions the underlying feature space into regions corresponding to different class labels

12
Q

What is linearly separable data?

A

Data is linearly separable when two classes can be perfectly separated by a single linear boundary (a line in 2D, a plane in 3D, a hyperplane in higher dimensions)

13
Q

what is the difference between a simple decision boundary and a complex one?

A

Simple: the boundary comes from a polynomial function.
Complex: an irregular decision boundary, such as those generated by decision trees

14
Q

What is logistic regression?

A

A binary (0/1) classification algorithm that determines the probability that a given instance x_i belongs to the positive class

15
Q

Explain the logistic function

A

Maps a real-valued input to the open interval (0, 1): σ(z) = 1 / (1 + e^(-z)). It is called a squashing function because it maps a wide input domain to a constrained output range

16
Q

What is underfitting?

A

A machine learning concept where the model is too simple to accurately classify the data. A model is underfitting if it has poor performance on both training and test data, and adding more data doesn't correct the issue

17
Q

What is overfitting?

A

When the model is too complex for a given classification problem (e.g. a tall decision tree, or a deep and wide neural network). Too many features can create excellent performance on the training set but poor performance on the test set

18
Q

What are learning curves?

A

Plots that display the performance of the model, e.g. root mean square error (RMSE), on both the training and test sets as the training set grows

19
Q

What is the Bias/Variance trade off?

A

Bias -> error created by overly simplistic models; high bias = underfitting
Variance -> error from overly complex models that are sensitive to fluctuations in the training data; high variance = overfitting
Tradeoff -> aim for a model that generalizes well to new data

20
Q

Explain the confusion matrix

A

A matrix which displays the true positives, false negatives, false positives and true negatives for all labels
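
A minimal sketch with scikit-learn's confusion_matrix; the label vectors are made up. For binary 0/1 labels, rows are actual classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # unpack the four cells for binary labels
print(cm)
# Accuracy and precision (the next two cards) fall out of these counts:
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))  # 0.75
print("precision:", tp / (tp + fp))                  # 0.75
```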

21
Q

What is accuracy?

A

The ratio of correctly predicted instances to the total number of predictions: (TP + TN) / (TP + TN + FP + FN)

22
Q

What is precision?

A

The ratio of true positives to all predicted positives: precision = TP / (TP + FP)

23
Q

Explain the holdout method

A

Allocate roughly 80% of your dataset for training and reserve the remaining 20% for testing
- Training error is generally low; otherwise there is something wrong
- Generalization error - the error rate observed when the model is evaluated on new, unseen data
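
A sketch of the split with scikit-learn; X and y here are placeholders for any dataset:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Reserve 20% for testing; random_state just makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```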

24
Q

What is cross validation?

A

A method to evaluate models and improve performance. It involves partitioning the dataset into multiple subsets

25
Q

Explain k-fold cross validation

A
  1. Divide the dataset into k equally sized folds
  2. Training and validation - in each iteration, one fold is used for validation and the remaining folds for training
  3. Evaluation - the model's performance is evaluated in each iteration, resulting in k performance measures
  4. Aggregation - statistics are calculated from the k performance measures (see the sketch below)

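A minimal sketch with scikit-learn; the model and dataset are stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)                       # k = 5 performance measures, one per fold
print(scores.mean(), scores.std())  # the aggregation step
```
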
26
Q

What are the benefits of k fold compared to normal test train split?

A

A much more reliable estimate of model performance.
Results in better generalization and reduced variability.
Works very well for hyperparameter tuning

27
Q

Challenges of k-fold cross validation

A

Computationally costly - training is already slow, and repeating it k times multiplies the cost
Class imbalance - folds may not represent minority classes (if one fold contains mostly one class it can skew training or validation)
Error prone

28
Q

What is a hyperparameter

A

A hyperparameter is a configuration external to the model that is set prior to the training process and dictates the learning process

29
Q

Grid search

A
  1. Enumerate all possible hyperparameter combinations
  2. Train on the training set, evaluate on the validation set (see the sketch below)
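
A sketch with scikit-learn's GridSearchCV; note it evaluates each combination with cross validation rather than one fixed validation set, and the grid below is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Enumerates every combination in the grid and scores each one.
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```
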
30
Q

Data augmentation

A

A technique used to increase the diversity of a dataset by applying various transformations to the existing data

31
Q

What is one-hot encoding?

A

A technique that converts categorical variables into a binary vector representation where each category is represented with a single 1 and all others as 0 (e.g. instead of a value just being labelled category 5, it is encoded as 0, 0, 0, 0, 1)
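
A plain NumPy sketch; the category list is invented:

```python
import numpy as np

categories = ["red", "green", "blue"]
label = "blue"

vector = np.zeros(len(categories))
vector[categories.index(label)] = 1  # a single 1, every other entry 0
print(vector)  # [0. 0. 1.]
```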

32
Q

Explain why one-hot encoding is beneficial

A

Although it increases the dimensionality of the feature vectors, it helps the model avoid bias: no artificial ordering is imposed on the categories

33
Q

What is Binning (feature engineering)

A

Placing continuous values into categorical bins, e.g. ages into: child, teen, adult and senior

34
Q

What is normalization

A

A scaling technique which accelerates optimization: algorithms perform optimally when feature values are within similar ranges, and normalization (typically min-max scaling to [0, 1]) helps achieve that

35
Q

What is standardization?

A

Transforms each feature to have a mean of 0 and a standard deviation of 1 (z = (x - μ) / σ)

36
Q

Standardization or Normalization?

A

-> Standardization for unsupervised learning, or if features resemble a normal distribution
-> Standardization handles outliers better; otherwise use normalization
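
A side-by-side sketch with scikit-learn; the feature values (including the outlier) are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # note the outlier

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: squeezed into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: mean 0, std 1
```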

37
Q

What is data imputation

A

Data imputation -> the process of replacing missing values in a dataset using statistics or machine learning

38
Q

Data imputation strategies

A
  • Mean, median or mode replacement (sketched below)
  • Special value method -> a value outside the normal range serves as a flag for a missing value
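
A sketch of mean imputation with scikit-learn, where np.nan marks the missing entries:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

imputer = SimpleImputer(strategy="mean")  # "median" and "most_frequent" also exist
print(imputer.fit_transform(X).ravel())   # nan replaced by (1 + 2 + 4) / 3
```
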
39
Q

What is a class imbalance

A

A scenario where the number of instances in one class significantly outnumbers the instances of another one -> the model becomes biased towards the dominant majority class

40
Q

Explain the solutions to class imbalance

A

Oversampling the minority class -> can lead to overfitting and poor performance on the test data

Undersampling the majority class -> loss of info about majority class can lead to underfitting

Synthetic data -> generate fake minority data
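
A sketch of the oversampling option using scikit-learn's resample; the 90/10 class split is invented:

```python
import numpy as np
from sklearn.utils import resample

X_maj = np.random.randn(90, 2)  # majority class: 90 instances
X_min = np.random.randn(10, 2)  # minority class: 10 instances

# Draw from the minority class with replacement until the classes match.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_balanced = np.vstack([X_maj, X_min_up])
print(X_balanced.shape)  # (180, 2)
```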

41
Q

Deep learning - how it works

A

A machine learning technique that can be applied to supervised learning, unsupervised learning and reinforcement learning.
It is inspired by the neurons in the brain and uses connected layers of artificial neurons to classify things

42
Q

Explain the layers of a neural network

A

Input layer: where the data enters; its size corresponds to the number of attributes in the data
Hidden layers: intermediate layers in which the network transforms and sorts the data
Output layer: where the final classification is produced

43
Q

What is an activation function in relation to neural networks

A

An activation function is applied to the output of each neuron and introduces non-linearity into the neural network -> a neuron is fired (activated) when the input passed to the node exceeds the threshold stored within the node

44
Q

Explain the three common activation functions

A

Sigmoid -> (logistic function) produces outputs in (0, 1)
Tanh -> (hyperbolic tangent) produces outputs in (-1, 1)
ReLU -> (rectified linear unit) produces outputs in [0, infinity)
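
All three are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic function, outputs in (0, 1)

def tanh(z):
    return np.tanh(z)                # hyperbolic tangent, outputs in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)        # rectified linear unit, outputs in [0, inf)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```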

45
Q

Explain the universal approximation theorem

A

A feed-forward neural network with a single hidden layer (given enough neurons) can approximate any continuous function

46
Q

What is back-propagation

A

A learning procedure which repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.

47
Q

Explain the backpropagation steps

A
  1. Initialization
  2. Forward pass
  3. Compute loss
  4. Backward pass (backpropagation)
  5. Repeat steps 2 to 4

The algorithm stops either after a set number of epochs or when the convergence criterion is satisfied
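
A minimal end-to-end sketch in NumPy: one hidden layer trained on XOR with mean squared error. The layer sizes, learning rate and epoch budget are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Initialization: small random weights break symmetry
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for epoch in range(5000):
    # 2. Forward pass: weighted sum plus bias, then activation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 3. Compute loss (mean squared error)
    loss = np.mean((out - y) ** 2)
    # 4. Backward pass: chain rule through the output layer, then the hidden layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out)      # 0.5 is the learning rate
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0)
    # 5. Steps 2-4 repeat each epoch until the budget runs out

print(out.round(2))  # should move toward [0, 1, 1, 0]
```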

48
Q

Explain the initialization step

A

Initializing the weights and biases of the neural network:
1. Zero initialization - all weights are 0
- doesn't work well because of symmetry: all neurons produce identical outputs
2. Random initialization - weights are initialized randomly using uniform or normal distributions

49
Q

Explain the forward pass step

A

Data is passed to the first layer -> for each of the hidden layers, compute the activations by applying the weighted sum of inputs plus bias, followed by an activation function

50
Q

Explain the compute loss step

A

Calculate the error (loss) using a suitable function by comparing predicted values with actual target values -> A smaller loss indicates that predicted values are closer to the actual target values

51
Q

Explain the Backwards pass step

A

Output layer: compute the gradient of the loss with respect to the output layer's weights and biases using the chain rule of calculus

Hidden layers: propagate the error backwards through the network layer by layer. For each layer, compute the gradient of the loss with respect to the weights and biases

Update the weights and biases: adjust the weights and biases using the calculated gradients and a learning rate

52
Q

What is gradient descent

A

An optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent, as defined by the negative gradient
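
The idea in one dimension, minimizing the made-up loss f(w) = (w - 3)^2:

```python
def grad(w):
    return 2 * (w - 3)  # derivative of the loss f(w) = (w - 3)^2

w, lr = 0.0, 0.1        # start away from the minimum; lr is the learning rate
for _ in range(100):
    w -= lr * grad(w)   # step along the negative gradient
print(w)                # converges toward 3, the minimizer
```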

53
Q

Explain weight initialization and how vanishing gradients were solved

A

Random initialization breaks the symmetry and allows for effective learning

Glorot initialization fixes vanishing gradients for sigmoid and tanh

He initialization is optimal for ReLU and its variants

54
Q

Explain how learning rate is important to optimization

A

Learning rate determines the step size during optimization: too small and convergence is slow; too large and the optimizer can overshoot minima

55
Q

Explain the hierarchy of concepts

A

Each layer detects patterns from the output of the layer preceding it -> in other words the network uncovers patterns of patterns

56
Q

What is a Convolutional Neural Network

A

A network that exploits the fact that crucial pattern information is often local (e.g. an edge in the top left of an image).
Its convolutional layers reduce the number of parameters significantly because neurons are not fully connected to the preceding layer, only to their receptive fields

57
Q

What is a kernel in the context of machine learning

A

A kernel is a small n x n matrix that slides over input data such as an image to perform convolution. The kernel is moved across the entire image, typically one pixel (one stride) at a time. At each position, the values in the kernel are multiplied by the values in the overlapping region of the input matrix, and the products are summed to produce a single scalar value -> the resulting output matrix is the feature map
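
A NumPy sketch of one convolution pass (stride 1, no padding); the image and kernel values are toys:

```python
import numpy as np

def convolve2d(image, kernel):
    n = kernel.shape[0]                     # kernel is n x n
    h, w = image.shape
    out = np.zeros((h - n + 1, w - n + 1))  # the feature map
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + n, j:j + n]     # region the kernel overlaps
            out[i, j] = np.sum(patch * kernel)  # multiply, then sum to one scalar
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(convolve2d(image, kernel))            # a 3x3 feature map
```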

58
Q

What is a receptive field

A

Each unit is connected only to neurons in its receptive field -> unit (i, j) in layer l is connected to units in rows i to i + f_h - 1 and columns j to j + f_w - 1 of layer l - 1

59
Q

What is padding?

A

Zero padding -> to keep layers the same size, the grid can be padded with zeroes -> this also allows the network to process pixels at the edges

60
Q

Explain the stride

A

Stride -> it is possible to connect a larger layer (l-1) to a smaller one (l) by skipping units. The number of units skipped is called the stride

61
Q

what are filters

A

A window of size f_h x f_w is moved over the output of layer l - 1, referred to as the input feature map
- For each location, the element-wise product is calculated between the extracted patch and a matrix of the same size known as the convolution kernel or filter

62
Q

Explain the kernel parameters and where they originate

A

The parameters of the kernel are learned through backpropagation allowing the network to optimize its feature extraction capabilities based on the training data

63
Q

what is a feature map

A

In a CNN, the output of a convolution operation is called the feature map

64
Q

What is the bias term

A

A single bias term is added uniformly to all entries of the feature map -> this bias helps adjust the activation level

65
Q

what is pooling

A

Basically a convolutional layer, except there are no weights; instead an aggregating function (normally max or mean) is applied -> each neuron in a pooling layer is connected to the neurons in its receptive field
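
A NumPy sketch of 2x2 max pooling with stride 2; the feature map values are made up:

```python
import numpy as np

def max_pool(fmap, size=2):
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = region.max()  # aggregate: keep the strongest activation
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))  # 2x2 output: spatial dimensions halved
```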

66
Q

Advantages to pooling

A

Dimensionality reduction -> reduces spatial dimensions of input feature maps decreasing the # of parameters and computational load

Feature extraction -> essentially summarizes the region discarding less important details

Translation invariance -> network becomes less sensitive to small changes

Noise reduction -> smooths noise through aggregation

67
Q

What are the characteristics of an environment?

A
  1. Observability: partially or fully observable
  2. Agent composition: single or multiple agents
  3. Predictability: deterministic or stochastic
  4. State dependency: stateless or stateful
  5. Temporal dynamics: static or dynamic
  6. State representation: discrete or continuous
68
Q

Explain the search problem

A

A search problem consists of:
- a collection of states (the state space)
- an initial state where the agent begins
- one or more goal states
- a set of actions available in each state
- a transition model that determines the next state based on the current state and action

69
Q

What is informed search

A

Searching with heuristic functions that estimate the cost to the goal

70
Q

Explain best-first search

A

An improvement on breadth-first search -> uses heuristics to prioritize nodes that seem closer to the goal -> it uses a priority queue sorted by estimated cost
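
A sketch of greedy best-first search with Python's heapq; the graph and the heuristic values h are invented:

```python
import heapq

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}  # estimated cost to the goal

def best_first(start, goal):
    frontier, visited = [(h[start], start, [start])], set()
    while frontier:
        _, node, path = heapq.heappop(frontier)  # lowest estimated cost first
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nbr in graph[node]:
            heapq.heappush(frontier, (h[nbr], nbr, path + [nbr]))
    return None

print(best_first("A", "D"))  # ['A', 'C', 'D']: C looked closer than B
```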

71
Q

Manhattan distance heuristic

A

Used specifically for the 8-puzzle -> calculates the sum, over all tiles, of each tile's horizontal plus vertical distance from its goal position
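
A sketch of the heuristic, representing a puzzle state as a dict that maps each tile to its (row, col) position; the example positions are made up:

```python
def manhattan(state, goal):
    total = 0
    for tile, (r1, c1) in state.items():
        if tile == 0:                 # skip the blank tile
            continue
        r2, c2 = goal[tile]
        total += abs(r1 - r2) + abs(c1 - c2)
    return total

state = {1: (0, 0), 2: (0, 2), 0: (0, 1)}  # toy partial configuration
goal  = {1: (0, 0), 2: (0, 1), 0: (0, 2)}
print(manhattan(state, goal))              # |0-0| + |2-1| = 1
```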