Midterm 2 Flashcards
All content necessary for midterm 2
What is artificial intelligence?
A computer program that mimics the intelligence of humans
What is machine learning?
A technique by which a computer learns from data rather than from explicitly programmed rules
What is deep learning?
A technique for machine learning based on the neurons in the brain
List the three types of Machine learning and explain them
- Unsupervised learning - no feedback given to algorithm
- supervised learning - every example has a label
- Reinforcement - reward or punishment per action
How does supervised learning training work?
The training set is a collection of labelled examples {(x_i, y_i)}, where x_i is a feature vector with D dimensions and y_i is its label. The model learns a mapping from feature vectors to labels from these examples
What is k-nearest neighbors?
Looks at the k training examples closest to the data point (most similar feature values) and assigns the class that most of them have
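A minimal sketch of the idea, assuming Euclidean distance and a majority vote; the function name `knn_predict` and the toy data are made up for illustration:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Take the labels of the k closest examples and return the most common one.
    k_labels = [label for _, label in by_distance[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical toy data: two clusters in 2D.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```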
What is linear regression?
Supervised learning that predicts a continuous numerical target by fitting a linear function to the features. It enables us to identify a linear trend and outliers
What is binary classification?
Supervised learning where the objective is to assign every example to one of 2 classes (e.g. logistic regression)
What is multi-class classification?
Supervised learning where examples are assigned to one of 3 or more discrete classes. Can be transformed into several binary problems
- one vs all (OvA)
Explain one vs all
A separate binary classifier is trained for each class. Each classifier labels its own class as positive and all others as negative. The final assignment goes to the class whose classifier has the highest confidence score
What is a decision boundary?
A boundary which partitions the underlying feature space into regions corresponding to different class labels
What is linearly separable data?
data is linearly separable when 2 classes can be perfectly separated by a single linear boundary (line for 2d, plane for 3d, hyperplane for >3d)
What is the difference between a simple decision boundary and a complex one?
Simple: a smooth boundary that comes from a linear or low-degree polynomial function.
Complex: an irregular boundary, such as the ones generated by decision trees
What is logistic regression?
it is a binary (0,1) classification algorithm which determines the probability that a given instance xi belongs to the positive class
Explain the logistic function
Maps any real-valued input to the open interval (0, 1). It is called a squashing function because it maps a wide input domain to a constrained output range
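A small sketch of the logistic function using the standard formula 1 / (1 + e^(-z)); it is not hardened against overflow for very large negative inputs:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1.
for z in (-10, -1, 0, 1, 10):
    print(z, round(logistic(z), 4))
```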
What is underfitting?
Machine learning concept where the model is too simple to accurately classify the data. It is underfitting if it has poor performance on both training and test data and adding more data doesn’t correct the issue
What is overfitting?
When the model is too complex for a given classification problem (tall decision tree, deep and wide neural networks) or has too many features. It gives excellent performance on the training set but poor performance on the test set
Explain learning curves
Plots of the model's performance, e.g. root mean square error (RMSE), on both the training and test sets as the amount of training data (or number of training iterations) increases
What is the Bias/Variance trade off?
Bias -> error created by overly simplistic models, high bias = underfitting
Variance -> error from overly complex models that is sensitive to fluctuations in the training data. High variance = overfitting
Tradeoff -> aim for a model that generalizes well to new data
Explain the confusion matrix
A matrix which displays the true positives, false negatives, false positives and true negatives for all labels
What is accuracy?
The ratio of correctly predicted instances to the total number of predictions: (TP + TN) / (TP + TN + FP + FN)
What is precision?
The ratio of true positives (TP) to the total number of predicted positives: TP / (TP + FP)
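A quick worked example of accuracy and precision computed from confusion-matrix counts; the counts are made up for illustration:

```python
def accuracy(tp, fp, fn, tn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp)

# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 5, 45
print(accuracy(tp, fp, fn, tn))   # (40 + 45) / 100
print(precision(tp, fp))          # 40 / (40 + 10)
```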
Explain the holdout method
Allocate roughly 80% of your dataset for training and reserve the remaining 20% for testing
- Training error generally low otherwise there is something wrong
- Generalization error - error rate observed when the model is evaluated on new unseen data
What is cross validation?
method to evaluate models and improve performance. Involves partitioning the dataset into multiple subsets
Explain k-fold cross validation
- Divide the dataset into k equally sized folds
- Training and validation - for each iteration, one fold is used for validation and the remaining k-1 folds for training
- Evaluation - the model's performance is evaluated in each iteration, resulting in k performance measures
- Aggregation - stats are calculated based on k performance measures
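The fold bookkeeping above can be sketched as follows. This is an illustrative helper (`k_fold_indices` is a made-up name) that only produces index splits; in practice the data would usually be shuffled first and a model trained and scored inside the loop:

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for each of k folds over n examples."""
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    start = 0
    for fold in range(k):
        # Spread any remainder so fold sizes differ by at most one.
        end = start + fold_size + (1 if fold < remainder else 0)
        val = indices[start:end]                  # this fold validates
        train = indices[:start] + indices[end:]   # all other folds train
        yield train, val
        start = end

for train_idx, val_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "validation:", val_idx)
```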
What are the benefits of k fold compared to normal test train split?
Much more reliable estimate of model performance.
Results in better generalization and reduced variability
Works very well for hyper parameter tuning
Challenges of k-fold
Computationally costly - training is already slow and it has to be repeated k times
Class imbalance - folds may not represent minority classes (if one fold contains a ton of one class it could skew training or validation)
Error prone
What is a hyperparameter
A hyperparameter is a configuration external to the model that is set prior to the training process and dictates the learning process
Grid search
- Enumerates every combination from a predefined grid of candidate hyperparameter values
- train on training set, evaluate on validation set
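A minimal sketch of grid search. The `evaluate` callback stands in for "train on the training set, score on the validation set"; here it is a made-up scoring function so the example runs on its own:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical validation score that peaks at learning_rate=0.1, depth=3.
def fake_validation_score(params):
    return -(params["learning_rate"] - 0.1) ** 2 - (params["depth"] - 3) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
best_params, best_score = grid_search(grid, fake_validation_score)
print(best_params)
```

The number of combinations grows multiplicatively with each added hyperparameter (3 x 3 = 9 here), which is why grid search gets expensive quickly.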
Data augmentation
a technique used to increase the diversity of a dataset by applying various transformations to the existing data
What is one-hot encoding?
A technique that converts categorical variables into a binary vector representation where each category is represented with a single 1 and all others as 0 (e.g. instead of something just being labelled 5 it is 0, 0, 0, 0, 1)
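A small sketch of one-hot encoding with plain lists; the category names are made up:

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

colours = ["red", "green", "blue"]  # hypothetical categorical feature
print(one_hot("green", colours))
```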
Explain why one-hot encoding is beneficial
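It helps the model avoid bias: integer labels suggest an order or magnitude between categories (e.g. category 5 "greater than" category 1) that does not exist, and one-hot vectors remove it. The cost is that it increases the dimensionality of the feature vectors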
What is Binning (feature engineering)
placing things into bin categories. e.g. ages into: child, teen, adult and senior
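A sketch of binning ages with the standard-library `bisect` module. The cut-off ages (13, 20, 65) are assumptions chosen for illustration, not values from the notes:

```python
import bisect

# Assumed bin edges: child < 13 <= teen < 20 <= adult < 65 <= senior.
AGE_EDGES = [13, 20, 65]
AGE_LABELS = ["child", "teen", "adult", "senior"]

def age_bin(age):
    """Map a numeric age to its bin label."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

print([age_bin(a) for a in (5, 13, 19, 20, 64, 80)])
```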
What is normalization
A scaling technique that rescales feature values into a fixed range, typically [0, 1]. It accelerates optimization -> algorithms perform best when feature values are within similar ranges
What is standardization?
Rescales each feature so that it has a mean of 0 and a standard deviation of 1 (the properties of a standard normal distribution). It does not change the shape of the feature's distribution
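A sketch of both scalings on a single feature, assuming min-max scaling to [0, 1] for normalization and the population standard deviation for standardization. Neither function guards against a constant feature (division by zero):

```python
import statistics

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
print(normalize(data))
print(standardize(data))
```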
Standardization or Normalization?
-> standardization for unsupervised learning or if features resemble a normal distribution
-> standardization also handles outliers better
-> otherwise use normalization
What is data imputation
Data imputation -> the process of replacing missing values in a dataset using statistics or machine learning
Data imputation strategies
- mean, median or mode replacement
- special value method -> value outside normal range as a notifier of a missing value
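A sketch of mean replacement, assuming missing values are represented as `None`; median or mode replacement would only change the statistic used:

```python
import statistics

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # hypothetical column with missing entries
print(impute_mean(ages))
```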
What is a class imbalance
A scenario where the number of instances in one class significantly outnumbers the instances of another one -> the model becomes biased towards the dominant majority class
Explain the solutions to class imbalance
Oversampling the minority class -> can lead to overfitting and poor performance on the test data
Undersampling the majority class -> loss of info about majority class can lead to underfitting
Synthetic data -> generate fake minority data
Deep learning - how it works
Machine learning technique that can be applied to supervised learning, unsupervised learning and reinforcement learning
It is inspired by neurons and uses layers of them connected to classify things
Explain the layers of a neural network
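Input layer: where the data enters the network; its size corresponds to the number of attributes in the data
Hidden layers: where the network transforms the data; each neuron computes a weighted sum of the previous layer's outputs and applies an activation function
Output layer: produces the final prediction, e.g. a score or probability for each class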
What is an activation function in relation to neural networks
An activation function is applied to the weighted sum (plus bias) computed by each neuron and introduces non-linearity into the neural network -> in the simplest view, a neuron fires (activates) when its weighted input exceeds a threshold
Explain the three common activation functions?
Sigmoid -> (logistic function, 1 / (1 + e^(-x))) produces outputs in (0, 1)
Tanh -> (hyperbolic tangent) produces outputs in (-1, 1)
ReLU ->(rectified linear unit) outputs values in the interval [0, infinity)
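The three functions written out with the standard library, as a quick way to check their output ranges:

```python
import math

def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit, output in [0, infinity)."""
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```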
Explain the universal approximation theorem
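A feedforward neural network with a single hidden layer, enough neurons and a non-linear activation function can approximate any continuous function (on a bounded input range) to any desired accuracy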
What is back-propagation
A learning procedure which repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.
Explain the backpropagation steps
1. Initialization
2. Forward pass
3. Compute loss
4. Backward pass (backpropagation), including the update of the weights and biases
5. Repeat steps 2 to 4
The algorithm stops either after a set number of epochs or when the convergence criterion is satisfied
Explain the initialization step
Initializing the weights and biases of the neural network
1. Zero initialization - all weights are 0
- fails to break symmetry: all neurons in a layer produce identical outputs and receive identical updates, so they never learn different features
2. Random initialization - weights are initialized randomly using uniform or normal distributions
Explain the forward pass step
Data is passed to the first layer -> for each of the hidden layers, compute the weighted sum of inputs plus bias -> then apply an activation function to get the layer's activations -> the output layer's activations are the prediction
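A sketch of the forward pass through a tiny network with one hidden layer. The weights, biases and layer sizes are arbitrary illustrative values:

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(weighted sum + bias).

    weights[j] holds the incoming weights of neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

x = [1.0, 2.0]
# Hidden layer with 2 ReLU neurons, then an output layer with 1 sigmoid neuron.
hidden = dense_forward(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = dense_forward(hidden, [[1.0, -1.0]], [2.0], sigmoid)
print(hidden, output)
```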
Explain the compute loss step
Calculate the error (loss) using a suitable function by comparing predicted values with actual target values -> A smaller loss indicates that predicted values are closer to the actual target values
Explain the Backwards pass step
Output layer: compute the gradient of the loss with respect to the output layer's weights and biases using the chain rule of calculus
Hidden layers: propagate the error backwards through the network layer by layer. For each layer compute the gradient of the loss with respect to its weights and biases
Update the weights and biases: adjust the weights and biases using the calculated gradients and a learning rate
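A minimal sketch of the whole loop (forward pass, loss gradient via the chain rule, update) for the simplest possible case: a single linear neuron trained with a squared-error loss. The data, learning rate and epoch count are assumptions for illustration; a real network repeats the gradient step layer by layer:

```python
def train_linear_neuron(data, learning_rate=0.05, epochs=2000):
    """Fit y_hat = w * x + b by gradient descent on the loss 0.5 * (y_hat - y)^2."""
    w, b = 0.0, 0.0  # initialization (one neuron has no symmetry to break)
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            y_hat = w * x + b        # forward pass
            error = y_hat - y        # dLoss/dy_hat
            grad_w += error * x / n  # chain rule: dLoss/dw = error * x
            grad_b += error / n      # chain rule: dLoss/db = error * 1
        # Update step: move against the averaged gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(data)
print(round(w, 3), round(b, 3))
```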
What is gradient descent
An optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent, i.e. along the negative gradient
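A sketch of gradient descent on a one-dimensional function, f(x) = (x - 3)^2, whose gradient is 2(x - 3). The function, starting point and learning rate are chosen only for illustration:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))
```

With this function, a learning rate above 1.0 makes each step overshoot further than the last and the iterates diverge, while a very small one converges slowly.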
Explain weight initialization and how vanishing gradients were solved
Random initialization breaks the symmetry and allows for effective learning
Glorot (Xavier) initialization scales the initial weights so that gradients are less likely to vanish with sigmoid and tanh
He initialization is designed for ReLU and its variants
Explain how learning rate is important to optimization
Learning rate determines the step size during optimization -> too large and the updates overshoot the minimum or diverge, too small and training is very slow
Explain the hierarchy of concepts
Each layer detects patterns from the output of the layer preceding it -> in other words the network uncovers patterns of patterns
What is a Convolutional Neural Network
A neural network built for grid-like data such as images. Crucial pattern info is often local (e.g. top left edge)
Convolutional layers reduce the number of parameters significantly because each neuron is not fully connected to the preceding layer but only to the units in its receptive field
What is a kernel in the context of machine learning
A kernel is a small matrix (e.g. n x n) that slides over input data such as an image to perform convolution. The kernel is moved across the entire image a fixed number of pixels at a time (the stride). At each position the values in the kernel are multiplied by the values in the input region they overlap, and the products are summed into a single scalar value -> the output matrix is the feature map
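A sketch of the sliding-window computation with plain lists, assuming no padding. As in most CNN libraries, the kernel is not flipped, so strictly speaking this is cross-correlation; the image and kernel values are made up:

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply overlapping entries and sum them into one scalar.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[1, 0],
          [0, -1]]
print(convolve2d(image, kernel))
print(convolve2d(image, kernel, stride=2))
```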
What is a receptive field
Each unit is connected only to the neurons in its receptive field -> unit (i, j) in layer l is connected to the units in rows i to i + fh - 1 and columns j to j + fw - 1 of layer l-1, where fh and fw are the height and width of the receptive field
What is padding?
Zero padding -> to have layers of the same size, the input grid can be padded with zeroes around its border -> this also lets the kernel cover the edges of the input
Explain the stride
Stride -> it is possible to connect a larger layer (l-1) to a smaller one (l) by spacing out the receptive fields. The shift from one receptive field to the next is called the stride
What are filters?
A window of size fh x fw is moved over the output of layer l-1, referred to as the input feature map
- For each location, the element-wise product is calculated between the extracted patch and a matrix of the same size, known as the convolution kernel or filter, and the products are summed
Explain the kernel parameters and where they originate
The parameters of the kernel are learned through backpropagation allowing the network to optimize its feature extraction capabilities based on the training data
what is a feature map
in CNN the output of a convolution operation is the feature map
What is the bias term
a single bias term is added uniformly to all entries of the feature map -> this bias helps adjust the activation level
what is pooling
Basically a convolutional layer without weights: instead, an aggregating function (normally max or mean) is applied -> each neuron in a pooling layer is connected to the neurons in its receptive field
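A sketch of max pooling with a 2 x 2 window and stride 2, a common configuration used here as an assumption; the feature map values are made up:

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [3, 4, 1, 2]]
print(max_pool2d(feature_map))
```

Swapping `max` for a mean over the window would give average pooling.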
Advantages to pooling
Dimensionality reduction -> reduces spatial dimensions of input feature maps decreasing the # of parameters and computational load
Feature extraction -> essentially summarizes the region discarding less important details
Translation invariance -> network becomes less sensitive to small changes
Noise reduction -> smooths noise through aggregation
What are the environmental characteristics
- Observability: partially or fully
- agent composition: single or multiple
- Predictability: deterministic or stochastic
- State dependency: stateless or stateful
- temporal dynamics: static or dynamic
- state representation: discrete or continuous
Explain the search problem
A collection of states (state space) -> an initial state where the agent begins -> one or more goal states -> a set of actions available in the state -> a transition model that determines the next state based on the current state and action
What is informed search
Search that uses a heuristic function to estimate the cost of reaching the goal from a given state
Explain best-first search
An improvement on breadth-first search -> uses heuristics to prioritize nodes that seem closer to the goal -> it uses a priority queue sorted by estimated cost
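A sketch of the greedy variant, where the priority is the heuristic estimate alone. The grid world, the `neighbors` function and the heuristic are assumptions made up for the example:

```python
import heapq

def best_first_search(start, goal, neighbors, heuristic):
    """Expand the state with the lowest heuristic estimate first."""
    frontier = [(heuristic(start), start)]  # priority queue of (estimate, state)
    came_from = {start: None}
    while frontier:
        _, state = heapq.heappop(frontier)
        if state == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while state is not None:
                path.append(state)
                state = came_from[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in came_from:
                came_from[nxt] = state
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return None  # goal unreachable

# Hypothetical 3 x 3 grid with 4-directional moves and no obstacles.
def grid_neighbors(cell):
    r, c = cell
    steps = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return [(nr, nc) for nr, nc in steps if 0 <= nr < 3 and 0 <= nc < 3]

goal = (2, 2)
path = best_first_search(
    (0, 0), goal, grid_neighbors,
    lambda cell: abs(cell[0] - goal[0]) + abs(cell[1] - goal[1]),
)
print(path)
```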
Manhattan distance heuristic
Used for the 8-tile puzzle -> for each tile, count the horizontal plus vertical moves between its current position and its goal position, then sum over all tiles
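A sketch of the heuristic, assuming the board is stored as a tuple of 9 numbers in row-major order with 0 for the blank (the blank is not counted):

```python
def manhattan_distance(state, goal):
    """Sum of row + column distances of tiles 1..8 from their goal positions."""
    total = 0
    for tile in range(1, 9):
        row, col = divmod(state.index(tile), 3)
        goal_row, goal_col = divmod(goal.index(tile), 3)
        total += abs(row - goal_row) + abs(col - goal_col)
    return total

goal = (1, 2, 3,
        4, 5, 6,
        7, 8, 0)
state = (1, 2, 3,
         4, 5, 6,
         0, 7, 8)  # tiles 7 and 8 are each one move from home
print(manhattan_distance(state, goal))
```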