AI Flashcards
Define deep learning
Specific sub-field of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations.
What is binary cross entropy
A loss function commonly used in deep learning for binary classification tasks. It measures the difference between the probabilities predicted by the model and the actual binary labels in the data.
What are some key points about binary cross-entropy
-It’s a differentiable function, allowing optimization algorithms like gradient descent to efficiently adjust the model’s weights during training.
-Lower binary cross-entropy indicates better model performance, meaning the predictions are more aligned with the true labels. Large deviations from the true labels are penalized heavily
-It’s suitable for problems where the outcome can be classified into two categories.
What must be present in the compilation step of a deep learning model?
A loss function
An optimiser
Metrics to monitor during training and testing
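A minimal sketch of this compilation step, assuming the Keras API from TensorFlow (the model architecture, optimiser, loss, and metric shown are illustrative choices, not prescribed ones):

```python
import tensorflow as tf

# Illustrative model for a binary classification task.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The compilation step supplies all three required ingredients:
model.compile(
    optimizer="rmsprop",             # an optimiser
    loss="binary_crossentropy",      # a loss function
    metrics=["accuracy"],            # metrics to monitor during training/testing
)
```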
What is Categorical cross entropy
A loss function that works very similarly to binary cross-entropy, but for multi-class classification tasks. It measures the difference between the probability distribution the model predicts (the output of its softmax layer) and the actual probability distribution of the correct class.
When and why might Categorical cross entropy be used?
-It is used for multi-class classification problems
-It is a differentiable function, allowing optimization algorithms to efficiently adjust the model's weights during training
What is backpropagation and how does it work
-Training technique for deep neural networks
-Works to minimize the loss function by adjusting weights and biases within the network
-Uses a reversed flow of data: the error is calculated at the output layer and then propagated back through the network, layer by layer (see the sketch below)
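A minimal numeric sketch of backpropagation for a single sigmoid neuron trained on one example (the data and hyperparameters are made up for illustration):

```python
import numpy as np

x, y_true = np.array([0.5, -1.0]), 1.0   # input and ground truth
w, b = np.array([0.1, 0.2]), 0.0         # weights and bias
lr = 0.1                                 # learning rate

for _ in range(100):
    # Forward pass
    z = w @ x + b
    y_pred = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
    loss = (y_pred - y_true) ** 2        # squared error

    # Backward pass: chain rule from the output back to the weights
    dloss_dy = 2 * (y_pred - y_true)
    dy_dz = y_pred * (1 - y_pred)        # sigmoid derivative
    grad_w = dloss_dy * dy_dz * x        # dz/dw = x
    grad_b = dloss_dy * dy_dz            # dz/db = 1

    # Adjust weights and bias to reduce the loss
    w -= lr * grad_w
    b -= lr * grad_b
```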
Define hyperparameter tuning
For a given neural network, there are several hyperparameters that can be optimised, including the number of hidden neurons, the batch size (BATCH_SIZE), and the number of epochs.
Hyperparameter tuning is the process of finding the optimal combination of those parameters that minimises the loss function.
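As an illustration, here is a minimal grid-search sketch; build_and_train is a hypothetical stand-in for building, training, and evaluating a model, not a real library function:

```python
import itertools

def build_and_train(hidden_units, batch_size, epochs):
    # Placeholder: a real version would build a model with these settings,
    # train it, and return the measured validation loss.
    return 1.0 / (hidden_units * epochs) + batch_size * 1e-4

HIDDEN_UNITS = [16, 32, 64]
BATCH_SIZES = [32, 128]
EPOCHS = [5, 10]

best = None
for h, b, e in itertools.product(HIDDEN_UNITS, BATCH_SIZES, EPOCHS):
    val_loss = build_and_train(h, b, e)
    if best is None or val_loss < best[0]:
        best = (val_loss, {"hidden_units": h, "batch_size": b, "epochs": e})

print("Best combination:", best[1])
```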
Define learning rate for gradient descent algorithms
Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as the learning rate (also sometimes called the step size) to determine the next point.
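A minimal sketch of that update rule on the toy function f(w) = w**2, whose gradient is 2*w:

```python
# One gradient descent step: next point = current point - learning_rate * gradient
w = 3.0
learning_rate = 0.1   # the scalar that scales the gradient

for step in range(5):
    gradient = 2 * w                    # gradient of f(w) = w**2 at the current point
    w = w - learning_rate * gradient    # move against the gradient to the next point
    print(step, w)                      # w shrinks towards the minimum at 0
```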
When should we use SGD (Stochastic Gradient Descent) vs Mini-batch SGD
SGD:
-Simpler to implement
-Can escape local minima more easily due to the noisy updates
-Slow for large datasets
Mini-Batch SGD:
-Faster than SGD for large datasets (fewer updates per epoch)
-Requires tuning the batch size
-Its averaged updates are less noisy, so it loses some of pure SGD's ability to escape local minima
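A minimal NumPy sketch of the mini-batch SGD loop on a toy linear-regression problem (setting batch_size to 1 recovers pure SGD; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                 # toy dataset
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.01, 32          # batch_size = 1 would be pure SGD

for epoch in range(10):
    idx = rng.permutation(len(X))  # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # MSE gradient on the batch
        w -= lr * grad             # one update per mini-batch
```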
Define Overfitting and some signs it's occurring
Overfitting occurs when the model becomes too focused on memorizing the specific details and noise present in the training data, rather than learning the underlying patterns and relationships that generalize well to unseen data.
Signs:
High training accuracy, low validation accuracy - The model performs well on the training data but struggles on the validation data
How can we avoid overfitting
Reduce model complexity: This can involve using fewer layers, neurons, or connections in the network. A simpler model has less capacity to overfit.
Data augmentation: Artificially increasing the size and diversity of your training data by techniques like flipping images, adding noise, or cropping.
Regularization: Techniques like L1/L2 regularization penalize large weights, discouraging the model from becoming too complex and overfitting the data.
Early stopping: Stop training the model before it starts to overfit. Monitor the validation accuracy and stop training when it starts to decrease.
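A minimal sketch of two of these techniques (L2 regularization and early stopping), assuming the Keras API; the layer sizes and hyperparameter values are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(
        16, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2: penalize large weights
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# x_train, y_train, x_val, y_val are placeholders for your own data:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```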
What are activation functions, and why are they necessary in deep learning?
They add non-linearity, crucial for complex learning.
Without them, networks can only learn simple patterns.
Activation functions transform neuron output (e.g. squashing values, using thresholds).
Different activation functions (ReLU, sigmoid, tanh) exist for various tasks.
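For concreteness, the three functions mentioned above can be written in a few lines of NumPy:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)        # thresholds negative values to 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)              # squashes values into (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z))
```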
What is meant by non linearity
Non-linearity means the relationship between the input and output of a neuron is not a straight line.
What do non linear activation functions do?
Non-linear activation functions solve the following limitations of linear activation functions:
They make backpropagation effective, because their derivative varies with the input, so the gradient carries information about how to adjust the weights.
They allow the stacking of multiple layers of neurons, as the output is then a non-linear combination of the input passed through multiple layers.
What is the vanishing gradient problem
Vanishing Gradient Problem: In deep learning, gradients are used to train the network. This problem occurs when these gradients become very small as they travel through the network layers during training.
Impact: Small gradients make it difficult to update weights in earlier layers, hindering the network’s ability to learn complex patterns.
What are some causes of the vanishing gradient problem
Blame the Activation Function: Certain activation functions (like sigmoid) have outputs that flatten out at extremes (very positive or negative inputs).
Backpropagation Culprit: During backpropagation, gradients are multiplied by the activation function's derivative. Where the activation flattens out, these derivatives become very small, shrinking the gradients as they travel back through the layers.
Small Gradients, Big Problem: Tiny gradients make it hard to adjust weights in earlier layers, hindering learning in those crucial parts of the network.
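A small numeric sketch of why saturated sigmoid layers shrink gradients: the sigmoid derivative never exceeds 0.25, and backpropagation multiplies in one such factor per layer:

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

print(sigmoid_derivative(0.0))    # 0.25  (the maximum)
print(sigmoid_derivative(5.0))    # ~0.0066 (the "flattened" region)

# A gradient passing back through 10 saturated sigmoid layers all but vanishes:
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(5.0)
print(grad)                       # ~1.6e-22
```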
What is the difference between classification and regression in supervised machine learning
Classification:
Goal: Predict discrete categories (classes)
Output: Labels (e.g., spam/not spam, cat/dog)
Think of: Sorting things into groups
Regression:
Goal: Predict continuous values
Output: Numbers (e.g., house price, temperature)
Think of: Estimating a value on a spectrum
What does a loss function do?
The loss function measures how badly the AI system did by comparing its predicted output to the ground truth
What is the ground truth?
Ground truth refers to the correct or true information that a model is trained on and ultimately tries to predict.
How does the mean-squared error loss function work
Mean squared error (MSE) is a common loss function used in machine learning, particularly in regression tasks. It measures the average of the squared differences between the values predicted by a model and the actual values (ground truth).
-Effective for continuous tasks
-Sensitive to outliers
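A minimal sketch of MSE in NumPy (the numbers are made up for illustration):

```python
import numpy as np

# MSE: average of squared differences between predictions and ground truth.
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
print(mse(y_true, y_pred))   # (0.25 + 0 + 4) / 3 ~= 1.42
```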
What is the Log Loss loss function (binary cross entropy)
Log loss leverages the logarithm function to penalize models for predicting probabilities that are far from the actual labels (0 or 1 in binary classification). The core idea is that the loss should be higher when the predicted probability diverges from the truth and lower when it aligns with it. The logarithm inherently satisfies this property because:
-As the predicted probability of the true class approaches 1, log(p) approaches 0, so the loss -log(p) approaches 0.
-As that probability approaches 0, log(p) heads towards negative infinity, so the loss -log(p) grows very large.
-Used for binary tasks
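A minimal NumPy sketch of log loss; the clipping constant is a common numerical-stability trick, not part of the definition:

```python
import numpy as np

# Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over examples.
def log_loss(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0])
print(log_loss(y, np.array([0.9, 0.1])))   # confident and correct: small loss (~0.105)
print(log_loss(y, np.array([0.1, 0.9])))   # confident and wrong: large loss (~2.303)
```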
What is the difference between binary and multi-class cross entropy
Binary cross entropy deals with two classes (0 or 1), while multi-class cross entropy handles scenarios with more than two possible categories.
In a multi-class problem, the model outputs a vector of probabilities, where each element represents the probability of the input belonging to a specific class.
The individual loss terms are averaged to get an overall multi-class cross entropy
How does a vector output work for multi-class classification tasks?
The model will output a vector containing, for each possible class, the probability it believes the object being classified belongs to that class. E.g. if the classes were apples, oranges and pears, the model might output [0.9, 0.06, 0.04] for an apple (note the probabilities sum to 1).
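A minimal NumPy sketch of producing such a vector with softmax and scoring it with categorical cross-entropy (the raw scores are made up for illustration):

```python
import numpy as np

# Softmax turns raw scores into a probability vector that sums to 1;
# categorical cross-entropy then scores it against the true class.
def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / e.sum()

def categorical_cross_entropy(true_index, probs):
    return -np.log(probs[true_index])

# Classes: [apple, orange, pear]; the object is an apple (index 0).
probs = softmax(np.array([3.0, -0.5, -1.2]))
print(probs)                                  # ~[0.96, 0.03, 0.01]
print(categorical_cross_entropy(0, probs))    # small loss: high probability on apple
```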