AI Flashcards
Define deep learning
Specific sub-field of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations.
What is binary cross entropy
A loss function commonly used in deep learning for binary classification tasks. It measures the difference between the probabilities predicted by the model and the actual binary labels in the data.
What are some key points about binary cross-entropy
-It’s a differentiable function, allowing optimization algorithms like gradient descent to efficiently adjust the model’s weights during training.
-Lower binary cross-entropy indicates better model performance, meaning the predictions are more closely aligned with the true labels. Large deviations from the true labels are penalised heavily
-It’s suitable for problems where the outcome can be classified into two categories.
What must be present in the compilation step of a deep learning model?
A loss function
An optimiser
Metrics to monitor during training and testing
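A minimal sketch of a compilation step showing all three pieces (Keras is assumed here; the cards don't name a framework, and the layer sizes are illustrative):
```python
from tensorflow import keras

# Illustrative model for a binary classification task
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    loss="binary_crossentropy",   # loss function
    optimizer="rmsprop",          # optimiser
    metrics=["accuracy"],         # metrics to monitor during training and testing
)
```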
What is Categorical cross entropy
A loss function that works very similarly to binary cross-entropy but for multi-class classification tasks. It measures the difference between the probability distribution the model predicts (typically the output of a softmax layer) and the actual probability distribution of the correct class.
When and why might Categorical cross entropy be used?
-It is used for multi-class classification problems
-It is a differentiable function, allowing optimization algorithms to efficiently adjust the model’s weights during training
What is backpropagation and how does it work
-Training technique for deep neural networks
-Works to minimize the loss function by adjusting weights and biases within the network
-Uses a reversed flow of information: it calculates the error at the output layer and then propagates it back through the network, layer by layer
Define hyperparameter tuning
For a given neural network, there are several hyperparameters that can be optimised, including the number of hidden neurons, the batch size (BATCH_SIZE) and the number of epochs.
Hyperparameter tuning is the process of finding the optimal combination of those parameters that minimise the loss function.
Define learning rate for gradient descent algorithms
Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as the learning rate (also sometimes called the step size) to determine the next point.
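A toy sketch of this update rule in plain Python (the example function f(w) = w² and the learning rate value are illustrative assumptions):
```python
# One gradient-descent step: next point = current point - learning_rate * gradient
learning_rate = 0.1

def step(w, gradient):
    return w - learning_rate * gradient

# e.g. minimising f(w) = w**2, whose gradient is 2*w
w = 5.0
for _ in range(3):
    w = step(w, 2 * w)   # w moves towards the minimum at 0
```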
When should we use SGD (Stochastic Gradient Descent) vs Mini-batch SGD
SGD:
-simpler to implement
-Can escape local minima more easily due to the noisy updates
-Slow for large datasets
Mini-Batch SGD:
-faster than SGD for large datasets (fewer updates per epoch)
- requires tuning batch size
- may be less accurate
Define Overfitting and some signs it’s occurring
Overfitting occurs when the model becomes too focused on memorizing the specific details and noise present in the training data, rather than learning the underlying patterns and relationships that generalize well to unseen data.
Signs:
High training accuracy, low validation accuracy - The model performs well on the training data but struggles on the validation data
How can we avoid overfitting
Reduce model complexity: This can involve using fewer layers, neurons, or connections in the network. A simpler model has less capacity to overfit.
Data augmentation: Artificially increasing the size and diversity of your training data by techniques like flipping images, adding noise, or cropping.
Regularization: Techniques like L1/L2 regularization penalize large weights, discouraging the model from becoming too complex and overfitting the data.
Early stopping: Stop training the model before it starts to overfit. Monitor the validation accuracy and stop training when it starts to decrease.
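A hedged Keras-style sketch combining two of these ideas, L2 regularisation and early stopping (the framework, layer sizes, patience value and data names are assumptions):
```python
from tensorflow import keras
from tensorflow.keras import regularizers

# L2 regularisation penalises large weights on this layer
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu",
                       kernel_regularizer=regularizers.l2(0.001),
                       input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Early stopping: halt training when validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])   # x_train etc. are placeholders
```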
What are activation functions, and why are they necessary in deep learning?
They add non-linearity, crucial for complex learning.
Without them, networks can only learn simple patterns.
Activation functions transform neuron output (e.g. squashing values, using thresholds).
Different activation functions (ReLU, sigmoid, tanh) exist for various tasks.
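A small NumPy sketch of the three functions mentioned (illustrative only):
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # thresholds: negative values become 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes values into (-1, 1)
```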
What is meant by non linearity
Non-linearity means the relationship between the input and output of a neuron is not a straight line.
What do non linear activation functions do?
Non-linear activation functions solve the following limitations of linear activation functions:
They allow backpropagation because now the derivative function would be related to the input.
They allow the stacking of multiple layers of neurons as the output would now be a non-linear combination of input passed through multiple layers.
What is the vanishing gradient problem
Vanishing Gradient Problem: In deep learning, gradients are used to train the network. This problem occurs when these gradients become very small as they travel through the network layers during training.
Impact: Small gradients make it difficult to update weights in earlier layers, hindering the network’s ability to learn complex patterns.
What are some causes of the vanishing gradient problem
Blame the Activation Function: Certain activation functions (like sigmoid) have outputs that flatten out at extremes (very positive or negative inputs).
Backpropagation Culprit: During backpropagation, gradients are multiplied by the activation function’s derivative. With flattening activation, these derivatives become very small, shrinking the gradients as they travel back through layers.
Small Gradients, Big Problem: Tiny gradients make it hard to adjust weights in earlier layers, hindering learning in those crucial parts of the network.
What is the difference between classification and regression in supervised machine learning
Classification:
Goal: Predict discrete categories (classes)
Output: Labels (e.g., spam/not spam, cat/dog)
Think of: Sorting things into groups
Regression:
Goal: Predict continuous values
Output: Numbers (e.g., house price, temperature)
Think of: Estimating a value on a spectrum
What does a loss function do?
The loss function measures how badly the AI system did by comparing its predicted output to the ground truth
What is the ground truth?
Ground truth refers to the correct or true information that a model is trained on and ultimately tries to predict.
How does the mean-squared error loss function work
Mean squared error (MSE) is a common loss function used in machine learning, particularly in regression tasks. It measures the average of the squared differences between the predicted values by a model and the actual values (ground truth).
-Effective for continuous tasks
-Sensitive to outliers
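A minimal NumPy sketch of MSE (the example numbers are made up for illustration):
```python
import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences between predictions and ground truth
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

mse([3.0, 5.0], [2.5, 6.0])  # ((0.5)**2 + (1.0)**2) / 2 = 0.625
```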
What is the Log Loss loss function (binary cross entropy)
Log loss leverages the logarithm function to penalize models for predicting probabilities that are far from the actual labels (0 or 1 in binary classification). The core idea is that the loss should be higher when the predicted probability diverges from the truth and lower when it aligns with the truth. The negative logarithm inherently satisfies this property because:
-log(p) is close to 0 when the predicted probability p is close to 1 (i.e. close to the true label).
-log(p) grows very large as p approaches 0, so confident predictions that are wrong are penalised heavily.
-Used for binary tasks
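A minimal NumPy sketch of binary cross-entropy / log loss (the epsilon clipping and example values are illustrative assumptions):
```python
import numpy as np

def log_loss(y_true, p, eps=1e-12):
    # Clip predictions away from exactly 0 or 1 to avoid log(0)
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p), eps, 1 - eps)
    # -log(p) when the true label is 1, -log(1 - p) when it is 0
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # small loss: predictions close to the labels
```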
What is the difference between binary and multi-class cross entropy
Binary cross entropy deals with two classes (0 or 1), while multi-class cross entropy handles scenarios with more than two possible categories.
In a multi-class problem, the model outputs a vector of probabilities, where each element represents the probability of the input belonging to a specific class.
The individual loss terms are averaged to get an overall multi-class cross entropy
How does a vector output work for multi-class classification tasks?
The model will output a vector containing the probability it assigns to each possible class. e.g. if the classes were apples, oranges and pears the model might output [0.9, 0.07, 0.03] for an object it believes is an apple (the probabilities sum to 1).
What is a convolutional neural network used for and why
It’s commonly used in image classification problems as it’s able to successfully capture the spatial and temporal dependencies of an image through the application of relevant filters
Do convolutional neural networks need more or less pre-processing compared to other classification algorithms
It will require much less
Note: the role of a convolutional neural network is to reduce the images to a form that is easier to process
What are some key points about convolutional neural networks
-CNNs learn features automatically
-Processing images in terms of RGB channels allows the network to capture color dependent features needed for image recognition
-progressively builds a hierarchy of features
What is a basic overview of how CNNs work
-The image is entered as a 3D cube: the height and width represent the pixel locations and the depth represents the three color channels (RGB)
-filters will slide across the image looking for patterns for each color
-the filters will create feature maps showing where they found interesting features
-pooling layers will shrink these maps and grab the key points
-This will be repeated and then the network will flatten everything and feed it to a regular neural network
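A hedged Keras sketch of this pipeline (the input size, filter counts and number of classes are assumptions):
```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu",
                        input_shape=(64, 64, 3)),     # RGB input cube
    keras.layers.MaxPooling2D((2, 2)),                # shrink feature maps, keep key points
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),                           # flatten everything
    keras.layers.Dense(64, activation="relu"),        # regular neural network on top
    keras.layers.Dense(3, activation="softmax"),      # e.g. 3 classes
])
```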
What is max pooling vs average pooling in a CNN?
Max pooling will take the maximum value in a specific window and will capture the most prominent features within a local area
-is good for object recognition tasks where key features are crucial, but might lose some spatial information
Average pooling will take the average value for a specific region of the feature map and will capture a more generalised representation of the features within a local area
-Will provide more spatial information than max pooling but may blur sharper features
Note: pooling is the step in a CNN where the data is shrunk; it works by summarising the data in a region of the feature map
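A toy NumPy sketch of 2×2 max vs average pooling on one small feature map (the values are made up):
```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

# Split the 4x4 map into 2x2 windows: (block_row, row_in_block, block_col, col_in_block)
blocks = fmap.reshape(2, 2, 2, 2)
max_pooled = blocks.max(axis=(1, 3))    # most prominent value per window: [[6, 2], [2, 7]]
avg_pooled = blocks.mean(axis=(1, 3))   # generalised summary per window: [[3.5, 1.0], [1.0, 4.25]]
```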
What is the class imbalance problem?
occurs when the dataset has a significant skew in the distribution of classes.
This is problematic as most algorithms will favor the majority class and poorly identify the minority class
What is an issue with having limited data
-The model may not be able to capture all the complexities present in the data
-The model may perform well on data it’s seen during training but may struggle with unseen data
What is overfitting and why is it a problem?
Overfitting is where the model will memorise the training data and learn from random noise or quirks in that data set.
This means that the model will perform well on the training data but will fail on anything new.
What are some solutions to a class imbalance?
-collect more data
-Delete data from the majority class
-Create synthetic data
Random over/undersampling is a solution to the class imbalance problem, what does it mean?
Random oversampling is where random data points from the minority class are duplicated
Random undersampling is where random datapoints from the majority class are deleted
What are some problems with over/under sampling
It can cause loss of information (undersampling)
It can cause overfitting and fixed boundaries (oversampling)
When training a CNN on small/class imbalanced data overfitting and bias are two issues that can arise. What are they and how can they be solved
Overfitting is caused by having too few samples to learn from, rendering the model unable to generalise to new data
Bias is caused by having class imbalance, the model is unable to learn the boundaries between the classes
These can be solved by regularisation and data augmentation
What are some techniques for data augmentation
Rotation range - takes a value in degrees which defines a range within which to randomly rotate pictures
Width shift and height shift - take a value defining a range within which to randomly translate pictures vertically or horizontally
Shear range - will randomly apply shear transformations
Zoom range - will randomly zoom into pictures
Horizontal flip - will randomly flip half the images horizontally
Fill mode - will fill in newly created pixels, which can appear after a rotation or width/height shift
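These options map onto arguments of Keras’s ImageDataGenerator; a hedged sketch with illustrative values:
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,        # degrees to randomly rotate
    width_shift_range=0.2,    # range for random horizontal translation
    height_shift_range=0.2,   # range for random vertical translation
    shear_range=0.2,          # random shear transformations
    zoom_range=0.2,           # random zoom
    horizontal_flip=True,     # randomly flip half the images
    fill_mode="nearest",      # fill pixels created by rotations/shifts
)
```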
What is Synthetic Minority Oversampling Technique (SMOTE)
It’s a data augmentation technique used to address a class imbalance in machine learning tasks
-Should only be performed after the train-test split
-Should only ever be performed on the training data
How does SMOTE work (4 steps)?
- Identify minority class instances
- Select instances randomly for oversampling
- Find the nearest neighbours of the selected instance within the minority class
- Generate synthetic samples
-Create synthetic samples by interpolating between the selected instance and its K nearest neighbours, e.g. taking a point along the difference between the two
The synthetic samples can then be added to the dataset
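A hedged sketch using the imbalanced-learn library’s SMOTE, applied only to the training split after the train-test split as the card notes (the synthetic dataset and parameter values are illustrative assumptions):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced dataset (90% majority, 10% minority)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
# Oversample the minority class in the training data only
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```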
What are three performance metrics used for a CNN?
Accuracy - can be misleading for imbalanced data sets
Recall - a measure of the model correctly identifying true positives (true positives out of all actual positives)
Precision - the ratio of true positives to all predicted positives
What is a pretrained model and what is it commonly used for?
It’s a saved network that was previously trained on a large dataset
it’s commonly used when attempting deep learning on smaller datasets
What is an example of a pretrained model?
Networks pre-trained on ImageNet - a dataset with 1.4 million labeled images and 1000 different classes
It contains many different animal classes, so such models perform well at animal identification
Give three examples of pre-trained models and their uses
Faster R-CNN:
A widely used model for object detection.
Mask R-CNN:
An extension of Faster R-CNN, providing pixel-wise segmentation masks along with bounding boxes.
YOLO (You Only Look Once):
An object detection algorithm that divides the input image into a grid and predicts bounding boxes and class probabilities directly. YOLOv3 and YOLOv4 are some of the popular versions.
Define feature extraction
The process of identifying and extracting meaningful characteristics from data. Features help a deep learning model understand data by focusing only on the relevant parts, making the data easier for the model to process and analyse. The model can then learn patterns and relationships more effectively
Why is feature extraction a useful part of pre-trained models?
As these features are often learned from massive datasets, pre-trained models will already have learned useful features
How can we use extracted features and pre-trained models to improve our models
We can freeze the weights (parameters) of the pre-trained model’s initial layers (the convolutional base) so they act as a feature extractor and their knowledge remains unchanged. The later layers can then be trained specifically for the required task (this is essentially fine-tuning).
How can we create a feature extractor?
We can remove the final layers of a pre-trained model, removing the decision making sections and leaving only a feature extractor
How can we use fine-tuning in pre-trained models?
Fine-tuning is about taking the pre-trained model’s large amount of generalised knowledge and adapting it to a specific problem. This can be done by freezing the initial layers, keeping the feature-extraction knowledge intact. The later layers responsible for classifying categories are left unfrozen, and additional layers can be added specifically for the classification task
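A hedged Keras sketch of this approach, assuming VGG16 as the pre-trained convolutional base and a binary classification task (both assumptions, not named by the cards):
```python
from tensorflow import keras

# Pre-trained convolutional base without its original classifier layers
conv_base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                     input_shape=(150, 150, 3))
conv_base.trainable = False   # freeze the feature-extraction knowledge

model = keras.Sequential([
    conv_base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # new task-specific classifier
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
```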
What are some benefits of fine tuning?
Reduced Training Data Need: Since the pre-trained model already has a strong foundation, you can often achieve good results with a smaller amount of your own data for fine-tuning.
Faster Training Times: By only training a portion of the model, fine-tuning is generally much faster than training a model from scratch.
Improved Performance: By leveraging pre-trained knowledge and adapting it to your specific task, fine-tuning can significantly improve the accuracy of your final model.
When reusing a convolutional base in a CNN what is the difference between deep and shallow mode?
In deep mode, also known as heavy fine-tuning, you essentially fine-tune a significant portion of the pre-trained convolutional base along with your newly added layers.
-Improved performance
-Overfitting risk & increased training time
Shallow mode, also known as light fine-tuning, focuses on making smaller adjustments to the pre-trained model.
-Fast training
-Reduced overfitting risk
-limited performance
What is dimensionality reduction?
Dimensionality reduction is the technique of representing multi-dimensional data (data with multiple features having a correlation with each other) in a lower dimension.
Why do we need dimensionality reduction?
The curse of dimensionality is a phenomenon that happens because the sample density decreases exponentially as the dimensionality increases. When we keep adding (or increasing the number of) features without increasing the number of training samples, the dimensionality of the feature space grows and the data becomes sparser. Due to this sparsity, it becomes much easier to find a seemingly perfect solution for the machine learning model, which very likely leads to overfitting.
e.g.
As the number of dimensions increases, the amount of data needed to effectively train a model grows exponentially. This can lead to issues like overfitting and computational inefficiency. Dimensionality reduction helps alleviate this curse by reducing the number of features the model needs to learn from.
What are autoencoders?
Autoencoders are a type of unsupervised ANN. Their primary objective is to learn a representation (encoding) of the input data in a way that captures the essential features, reducing the dimensionality of the data. This encoding is then used to reconstruct the original input as closely as possible.
What are the four parts of an autoencoder
Encoder: is the part of the autoencoder responsible for encoding/compressing the input data into a lower-dimensional representation.
Decoder: it takes the compressed representation generated by the encoder and attempts to reconstruct the original input data from this representation.
Latent Space (code): the compressed representation learned by the encoder is often referred to as the “latent space.”
Objective Function: to minimize the difference between the input data and its reconstruction, which can be achieved by using a loss function such as mean squared error (MSE) to measure the difference between the input and the reconstructed output.
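A hedged Keras sketch showing the encoder, latent space (code), decoder and MSE objective (the input and latent sizes are assumptions, e.g. flattened 28×28 images):
```python
from tensorflow import keras

input_dim, latent_dim = 784, 32

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(latent_dim, activation="relu")(inputs)      # encoder -> latent space
decoded = keras.layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder -> reconstruction

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")   # objective: reconstruction error
```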
What are denoising autoencoders?
A specific type of autoencoder architecture designed to tackle noisy data. In addition to dimensionality reduction and feature extraction (like a normal autoencoder), they can remove noise from the data itself
How do denoising autoencoders work?
An autoencoder is trained on blurry or noisy images. The DAE tries to reconstruct a clean, clear version of the image from the corrupted input. By forcing this reconstruction, it learns to identify and remove the noise while capturing the underlying features of the data.
-They are trained on a corrupted version of the data, this can be anything from salt-and-pepper noise to occluding or masking part of the data
One application for AutoEncoders is adding color to black and white images
What is the goal of natural language processing?
To make machines understand and interpret human language the way it is written or spoken
What are the two levels of linguistic analysis that must be done before performing NLP
syntax- what part of the text is grammatically correct
semantics - What is the meaning of the given text
One thing that NLP has to deal with is morphology, what does that mean?
The formation of words and how they relate with each other
What is natural language understanding and what are the ambiguities present in natural language?
Understanding the meaning of a given text
The following are ambiguities NLP will attempt to resolve
Lexical ambiguity - words have different meanings
syntactic - a sentence has multiple parse trees
semantic - a sentence has multiple meanings
anaphoric - a phrase or word refers back to something previously mentioned, and the intended meaning is unclear
What are 4 stages of Natural Language Understanding?
Syntax Analysis: identify the structure of the sentence, including parts of speech and sentence structure.
Semantics: Understand the meaning of words and phrases
Named Entity Recognition(NER): recognise named entities in the input. e.g. Exeter is a location
Intent Recognition: Understand the user’s intent
What are the stages of the NLP pipeline?
Import text file
Sentence Segmentation - identify sentence boundaries in the given text, e.g. full stops, new lines, etc.
Tokenisation - identify different words, numbers and punctuation
Stemming - strip the endings of words, e.g. eating is reduced to eat
Part of speech (POS) tagging - assign each word in a sentence its own tag, such as designating a word as a noun or adverb
Parsing - divide text into different categories to answer a question, e.g. this part of the sentence modifies another part
Named Entity Recognition - identifies entities such as persons, locations and times
Co-reference - define the relationship of the words in a sentence with the previous and next sentences
How does Sentence Segmentation work?
Will split a large section of text into component sentences, often using punctuation like full stops. To apply sentence tokenisation with NLTK we can use the nltk.sent_tokenize function
How does Tokenisation work?
Will split the segmented sentences up into “tokens”. Tokenised sentences are essentially just arrays of words
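A small NLTK sketch of both steps, sentence segmentation and tokenisation (the example text is made up; the punkt tokeniser data must be downloaded first):
```python
import nltk
nltk.download("punkt")

text = "NLP is fun. It has many steps."
sentences = nltk.sent_tokenize(text)                    # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]     # tokenisation: arrays of words
# sentences -> ['NLP is fun.', 'It has many steps.']
```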
What are stop words?
Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.
Stop words are removed before filtering text data as they are considered to have little meaning
What does Lemmatization and Stemming do?
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
am, are, is => be
dog, dogs, dog’s, dogs’ => dog
Define Lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
e.g. The word better has “good” as its lemma
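A small NLTK sketch contrasting a stemmer with a lemmatizer (PorterStemmer and WordNetLemmatizer are illustrative choices; the WordNet data must be downloaded first):
```python
import nltk
nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmer.stem("eating")                   # 'eat'  (strips the ending)
lemmatizer.lemmatize("dogs")             # 'dog'
lemmatizer.lemmatize("better", pos="a")  # 'good' (uses vocabulary + part of speech)
```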