AI Flashcards

1
Q

Define deep learning

A

Specific sub-field of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations.

2
Q

What is binary cross entropy

A

A loss function commonly used in deep learning for binary classification tasks. It measures the difference between the predicted probabilities by the model and the actual binary labels in the data.
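
A minimal sketch of how binary cross-entropy could be computed with NumPy (the label and prediction arrays below are made-up illustrations):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Average of -[y*log(p) + (1-y)*log(1-p)] over all samples
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.4])
print(binary_cross_entropy(y_true, y_pred))  # lower is better
```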

3
Q

What are some key points about binary cross-entropy

A

-It’s a differentiable function, allowing optimization algorithms like gradient descent to efficiently adjust the model’s weights during training.

-Lower binary cross entropy indicates better model performance, meaning the predictions are more aligned with the true labels. High deviations from expected properties are punished

-It’s suitable for problems where the outcome can be classified into two categories.

4
Q

What must be present in the compilation step of a deep learning model?

A

A loss function
An optimiser
Metrics to monitor during training and testing
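
For example, a compilation step in Keras might look like the hedged sketch below (the layer sizes and the choice of optimiser are illustrative, not prescribed):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# The compilation step: a loss function, an optimiser, and metrics to monitor
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
```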

5
Q

What is Categorical cross entropy

A

A loss function that works very similarly to binary cross-entropy but for multi-class classification tasks. It measures the difference between the probability distribution the model predicts (often produced by a softmax layer) and the actual probability distribution of the correct class.

6
Q

When and why might categorical cross-entropy be used?

A

-It is used for multi-class classification problems

-It is a differentiable function, allowing optimization algorithms to efficiently adjust the model’s weights during training

7
Q

What is backpropagation and how does it work

A

-Training technique for deep neural networks
-Works to minimize the loss function by adjusting weights and biases within the network
-Uses a reversed flow of data and calculates the error at the output layer then propagates back through the network

8
Q

Define hyperparameter tuning

A

For a given neural network, there are several hyperparameters that can be optimised, including the number of hidden neurons, the batch size (BATCH_SIZE) and the number of epochs.

Hyperparameter tuning is the process of finding the optimal combination of those parameters that minimise the loss function.

9
Q

Define learning rate for gradient descent algorithms

A

Learning rate

Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as learning rate (also sometimes called step size) to determine the next point.
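
A tiny illustrative sketch of that update rule, assuming a one-parameter function f(w) = w^2 whose gradient is 2w (values are made up):

```python
learning_rate = 0.1   # the scalar step size
w = 5.0               # current point

for _ in range(10):
    gradient = 2 * w                    # df/dw at the current point
    w = w - learning_rate * gradient    # next point = current point - learning_rate * gradient
print(w)  # approaches the minimum at w = 0
```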

10
Q

When should we use SGD (Stochastic Gradient Descent) vs mini-batch SGD?

A

SGD:
-Simpler to implement
-Can escape local minima more easily due to the noisy updates
-Slow for large datasets

Mini-Batch SGD:
-Faster than SGD for large datasets (fewer updates per epoch)
-Requires tuning the batch size
-May be less accurate

11
Q

Define overfitting and some signs it’s occurring

A

Overfitting occurs when the model becomes too focused on memorizing the specific details and noise present in the training data, rather than learning the underlying patterns and relationships that generalize well to unseen data.

Signs:
High training accuracy, low validation accuracy - The model performs well on the training data but struggles on the validation data

12
Q

How can we avoid overfitting

A

Reduce model complexity: This can involve using fewer layers, neurons, or connections in the network. A simpler model has less capacity to overfit.

Data augmentation: Artificially increasing the size and diversity of your training data by techniques like flipping images, adding noise, or cropping.

Regularization: Techniques like L1/L2 regularization penalize large weights, discouraging the model from becoming too complex and overfitting the data.

Early stopping: Stop training the model before it starts to overfit. Monitor the validation accuracy and stop training when it starts to decrease.
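
As a hedged example, early stopping is available as a callback in Keras (the patience value and variable names are illustrative):

```python
from tensorflow import keras

# Stop training when the validation loss stops improving for 3 consecutive epochs
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=3,
                                           restore_best_weights=True)

# Then pass it to training, e.g. (x_train, y_train, x_val, y_val are your own data):
# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])
```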

13
Q

What are activation functions, and why are they necessary in deep learning?

A

They add non-linearity, crucial for complex learning.
Without them, networks can only learn simple patterns.
Activation functions transform neuron output (e.g. squashing values, using thresholds).
Different activation functions (ReLU, sigmoid, tanh) exist for various tasks.

14
Q

What is meant by non linearity

A

Non-linearity means the relationship between the input and output of a neuron is not a straight line.

15
Q

What do non linear activation functions do?

A

Non-linear activation functions solve the following limitations of linear activation functions:

They allow backpropagation because now the derivative function would be related to the input.

They allow the stacking of multiple layers of neurons as the output would now be a non-linear combination of input passed through multiple layers.

16
Q

What is the vanishing gradient problem

A

Vanishing Gradient Problem: In deep learning, gradients are used to train the network. This problem occurs when these gradients become very small as they travel through the network layers during training.

Impact: Small gradients make it difficult to update weights in earlier layers, hindering the network’s ability to learn complex patterns.

17
Q

What are some causes of the vanishing gradient problem

A

Blame the Activation Function: Certain activation functions (like sigmoid) have outputs that flatten out at extremes (very positive or negative inputs).
Backpropagation Culprit: During backpropagation, gradients are multiplied by the activation function’s derivative. With flattening activation, these derivatives become very small, shrinking the gradients as they travel back through layers.
Small Gradients, Big Problem: Tiny gradients make it hard to adjust weights in earlier layers, hindering learning in those crucial parts of the network.

18
Q

What is the difference between classification and regression in supervised machine learning

A

Classification:

Goal: Predict discrete categories (classes)
Output: Labels (e.g., spam/not spam, cat/dog)
Think of: Sorting things into groups
Regression:

Goal: Predict continuous values
Output: Numbers (e.g., house price, temperature)
Think of: Estimating a value on a spectrum

19
Q

What does a loss function do?

A

The loss function measures how badly the AI system did by comparing its predicted output to the ground truth

20
Q

What is the ground truth?

A

Ground truth refers to the correct or true information that a model is trained on and ultimately tries to predict.

21
Q

How does the mean-squared error loss function work

A

Mean squared error (MSE) is a common loss function used in machine learning, particularly in regression tasks. It measures the average of the squared differences between the predicted values by a model and the actual values (ground truth).

-Effective for continuous tasks
-Sensitive to outliers

22
Q

What is the log loss function (binary cross-entropy)?

A

Log loss leverages the logarithm function to penalize models for predicting probabilities that are far from the actual labels (0 or 1 in binary classification). The core idea is that the loss should be higher when the predicted probability diverges from the truth and lower when it aligns with the truth. The (negative) logarithm satisfies this property because:

The log of a probability close to 1 is close to 0, so the loss is small when the prediction matches the true label.
The log of a probability close to 0 is a large negative number, so the negative log gives a large loss when the prediction diverges from the true label.

-Used for binary tasks

23
Q

What is the difference between binary and multi-class cross entropy

A

Binary cross entropy deals with two classes (0 or 1), while multi-class cross entropy handles scenarios with more than two possible categories.
In a multi-class problem, the model outputs a vector of probabilities, where each element represents the probability of the input belonging to a specific class.
The individual loss terms are averaged to get an overall multi-class cross-entropy

24
Q

How does a vector output work for multi-class classification tasks?

A

The model outputs a vector containing the probability it assigns to each possible class, e.g. if the classes were apples, oranges and pears, the model might output [0.95, 0.03, 0.02] for an image of an apple (the probabilities sum to 1).

25
Q

What is a convolutional neural network (CNN) used for and why?

A

It’s commonly used in image classification problems as it’s able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters

26
Q

Do convolutional neural networks need more or less pre-processing compared to other classification algorithms?

A

It will require much less

27
Q

Note: the role of a convolutional neural network is to reduce the images to a form that is easier to process

A
28
Q

What are some key points about convolutional neural networks?

A

-CNNs learn features automatically
-Processing images in terms of RGB channels allows the network to capture color dependent features needed for image recognition
-progressively builds a hierarchy of features

29
Q

What is a basic overview of how CNNs work

A

-The image is entered as a 3D cube: the height and width represent the pixel locations and the depth represents the three color channels (RGB)
-filters will slide across the image looking for patterns for each color
-the filters will create feature maps showing where they found interesting features
-pooling layers will shrink these maps and grab the key points
-This will be repeated and then the network will flatten everything and feed it to a regular neural network
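
A minimal Keras sketch of this structure (the input size, filter counts and number of classes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small CNN: conv/pool blocks build feature maps, then flatten + dense layers classify
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),         # RGB input cube (height, width, color channels)
    layers.Conv2D(32, (3, 3), activation="relu"),   # filters slide across the image
    layers.MaxPooling2D((2, 2)),            # shrink feature maps, keep the strongest responses
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                       # flatten everything for the regular dense network
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),  # e.g. 3 classes
])
model.summary()
```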

30
Q

What is max pooling vs average pooling in a CNN?

A

Max pooling will take the maximum value in a specific window and will capture the most prominent features within a local area
-It is good for object recognition tasks where key features are crucial, but it might lose some spatial information

Average pooling will take the average value for a specific region of the feature map and will capture a more generalised representation of the features within a local area
-It will provide more spatial information than max pooling but may blur sharper features

31
Q

Note: pooling is the step in a CNN where the data is shrunk; it works by summarising the data in a region of the feature map

A
32
Q

What is the class imbalance problem?

A

occurs when the dataset has a significant skew in the distribution of classes.
This is problematic as most algorithms will favor the majority class and poorly identify the minority class

33
Q

What is an issue with having limited data

A

-The model may not be able to capture all the complexities present in the data
-The model may perform well on data it’s seen during training but may struggle with unseen data

33
Q

What is overfitting and why is it a problem?

A

Overfitting is where the model will memorise the training data and learn from random noise or quirks in that data set.
This means that the model will perform well on the training data but will fail on anything new.

34
Q

What are some solutions to a class imbalance?

A

-collect more data

-Delete data from the majority class

-Create synthetic data

35
Q

Random over/undersampling is a solution to the class imbalance problem, what does it mean?

A

Random oversampling is where random data points from the minority class are duplicated

Random undersampling is where random datapoints from the majority class are deleted

36
Q

What are some problems with over/under sampling

A

It can cause loss of information (undersampling)

It can cause overfitting and fixed boundaries (oversampling)

37
Q

When training a CNN on small/class imbalanced data overfitting and bias are two issues that can arise. What are they and how can they be solved

A

Overfitting is caused by having too few samples to learn from, rendering the model unable to generalise to new data

Bias is caused by having class imbalance, the model is unable to learn the boundaries between the classes

These can be solved by regularisation and data augmentation

38
Q

What are some techniques for data augmentation

A

Rotation range - takes a value in degrees which signifies a range in which to rotate pictures

Width shift and height shift - take a value (a fraction of the total width or height) within which to randomly translate pictures horizontally or vertically

Shear range - will randomly apply shear transformations

Zoom range - will randomly zoom into pictures

Horizontal flip - will randomly flip half the images horizontally

Fill mode - will fill in newly created pixels, which can appear after a rotation or width shift
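
These names correspond to arguments of Keras’s ImageDataGenerator; a hedged sketch (the specific ranges are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each argument corresponds to one augmentation technique from the list above
datagen = ImageDataGenerator(
    rotation_range=40,        # rotate pictures within +/- 40 degrees
    width_shift_range=0.2,    # translate horizontally by up to 20% of the width
    height_shift_range=0.2,   # translate vertically by up to 20% of the height
    shear_range=0.2,          # random shear transformations
    zoom_range=0.2,           # random zoom
    horizontal_flip=True,     # randomly flip half the images horizontally
    fill_mode="nearest",      # fill newly created pixels after a shift/rotation
)
```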

39
Q

What is the Synthetic Minority Oversampling Technique (SMOTE)?

A

It’s a data augmentation technique used to address a class imbalance in machine learning tasks
-Should only be performed after the train-test split
-Should only ever be performed on the training data

40
Q

How does SMOTE work (4 steps)?

A
  1. Identify minority class instances
  2. Select instances randomly for oversampling
  3. Find the nearest neighbours of the selected instance from the minority class
  4. Generate synthetic samples
    -Create synthetic samples by interpolating between the selected instance and its k nearest neighbours, e.g. taking the difference between the two points

The synthetic samples can then be added to the dataset
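
A hedged sketch using the imbalanced-learn library’s SMOTE implementation on a made-up imbalanced dataset, applied after the train-test split and only to the training data:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced data (90% / 10%)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SMOTE is fitted on the training split only
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_res))   # class counts before and after
```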

41
Q

What are three performance metrics used for a CNN?

A

Accuracy - the proportion of correct predictions; can be misleading for imbalanced data sets

Recall - a measure of how well the model identifies true positives (true positives out of all actual positives)

Precision - the ratio of true positives to all predicted positives
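
A small sketch computing these three metrics from made-up binary predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy = np.mean(y_pred == y_true)   # can be misleading when classes are imbalanced
precision = tp / (tp + fp)             # true positives over all predicted positives
recall = tp / (tp + fn)                # true positives over all actual positives
print(accuracy, precision, recall)
```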

42
Q

What is a pretrained model and what is it commonly used for?

A

It’s a saved network that was previously trained on a large dataset

it’s commonly used when attempting deep learning on smaller datasets

43
Q

What is an example of a pretrained model?

A

Models pretrained on ImageNet, a dataset with 1.4 million labeled images and 1,000 different classes.

ImageNet contains many different animal classes, so models pretrained on it will perform well at animal identification

44
Q

Give three examples of pre-trained models and their uses

A

Faster R-CNN:
Faster R-CNN is a widely used model for object detection.

Mask R-CNN:
An extension of Faster R-CNN, providing pixel-wise segmentation
masks along with bounding boxes.

YOLO (You Only Look Once):
YOLO is an object detection algorithm that divides the input
image into a grid and predicts bounding boxes and class
probabilities directly. YOLOv3 and YOLOv4 are some of the
popular versions.

45
Q

Define feature extraction

A

The process of identifying and extracting meaningful characteristics from data. Features help a deep learning model understand data by focusing only on the relevant parts, making the data easier for the model to process and analyse. The model can then learn patterns and relationships more effectively

46
Q

Why is feature extraction a useful part of pre-trained models?

A

As these features are often learned from massive datasets, pre-trained models will already have learned useful features

47
Q

How can we use extracted features and pre-trained models to improve our models

A

We can freeze the weights (parameters) of the pre-trained model’s initial layers (the convolutional base) so they act as a feature extractor and their knowledge remains unchanged. The later layers can then be trained specifically for the required task (this is closely related to fine-tuning).

48
Q

How can we create a feature extractor?

A

We can remove the final layers of a pre-trained model, removing the decision making sections and leaving only a feature extractor

49
Q

How can we use fine-tuning in pre-trained models?

A

Fine-tuning is about taking the pre-trained model’s large amount of generalised knowledge and adapting it to a specific problem. This can be done by freezing the initial layers, keeping the feature-extraction knowledge intact. Later layers responsible for classifying categories are left unfrozen, and additional layers can be added specifically for the classification task.
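
A hedged Keras sketch of this idea (the choice of VGG16, the input size, the number of unfrozen base layers and the dense layer sizes are all illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pretrained convolutional base without its original classifier layers
conv_base = keras.applications.VGG16(weights="imagenet",
                                     include_top=False,
                                     input_shape=(150, 150, 3))

# Freeze the early layers (keep their feature-extraction knowledge intact);
# leave the last few base layers unfrozen so they can be fine-tuned.
for layer in conv_base.layers[:-4]:
    layer.trainable = False

model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # new layers added for the specific task
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy",
              optimizer=keras.optimizers.RMSprop(1e-5),   # small learning rate for fine-tuning
              metrics=["accuracy"])
```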

50
Q

What are some benefits of fine-tuning?

A

Reduced Training Data Need: Since the pre-trained model already has a strong foundation, you can often achieve good results with a smaller amount of your own data for fine-tuning.

Faster Training Times: By only training a portion of the model, fine-tuning is generally much faster than training a model from scratch.

Improved Performance: By leveraging pre-trained knowledge and adapting it to your specific task, fine-tuning can significantly improve the accuracy of your final model.

51
Q

When reusing a convolutional base in a CNN, what is the difference between deep and shallow mode?

A

In deep mode, also known as heavy fine-tuning, you essentially fine-tune a significant portion of the pre-trained convolutional base along with your newly added layers.
-Improved performance
-Overfitting risk & increased training time

Shallow mode, also known as light fine-tuning, focuses on making smaller adjustments to the pre-trained model.
-Fast training
-Reduced overfitting risk
-limited performance

52
Q

What is dimensionality reduction?

A

Dimensionality reduction is the technique of representing multi-dimensional data (data with multiple features having a correlation with each other) in a lower dimension.

53
Q

Why do we need dimensionality reduction?

A

The curse of dimensionality is a phenomenon that happens because the sample density decreases exponentially with the increase of the dimensionality. When we keep adding (or increasing the number of) features without increasing the size/number of training samples, the dimensionality of the feature space grows and becomes sparser. Due to this sparsity, it becomes much easier to find a "perfect" solution for the machine learning model, which very likely leads to overfitting.

e.g.
As the number of dimensions increases, the amount of data needed to effectively train a model grows exponentially. This can lead to issues like overfitting and computational inefficiency. Dimensionality reduction helps alleviate this curse by reducing the number of features the model needs to learn from.

54
Q

What are autoencoders?

A

An autoencoder is a type of unsupervised ANN (artificial neural network). Its primary objective is to learn a representation (encoding) of the input data in a way that captures the essential features, reducing the dimensionality of the data. This encoding is then used to reconstruct the original input as closely as possible.

55
Q

What are the four parts of an autoencoder

A

Encoder: is the part of the autoencoder responsible for encoding/compressing the input data into a lower-dimensional representation.

Decoder: it takes the compressed representation generated by the encoder and attempts to reconstruct the original input data from this representation.

Latent Space (code): the compressed representation learned by the encoder is often referred to as the “latent space.”

Objective Function: to minimize the difference between the input data and its reconstruction, which can be achieved by using a loss function such as mean squared error (MSE) to measure the difference between the input and the reconstructed output.
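
A minimal Keras sketch showing the four parts (the dimensions are illustrative, assuming flattened 28x28 inputs):

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32                                   # e.g. flattened 28x28 images

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)     # encoder -> latent space (code)
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)  # decoder -> reconstruction

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")  # objective function: minimise reconstruction error
```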

56
Q

What are denoising autoencoders?

A

A specific type of autoencoder architecture designed to tackle noisy data. In addition to dimensionality reduction and feature extraction (like a normal autoencoder), they can remove noise from the data itself

57
Q

How do denoising autoencoders work?

A

An autoencoder is trained on blurry or noisy images. The DAE tries to reconstruct a clean, clear version of the image from the corrupted input. By forcing this reconstruction, it learns to identify and remove the noise while capturing the underlying features of the data.

-They are trained on a corrupted version of the data, this can be anything from salt-and-pepper noise to occluding or masking part of the data

58
Q

One application for AutoEncoders is adding color to black and white images

A
59
Q

What is the goal of natural language processing?

A

To make machines understand and interpret human language the way it is written or spoken

60
Q

What are the two levels of linguistic analysis that must be done before performing NLP

A

Syntax - whether the text is grammatically correct

Semantics - what the meaning of the given text is

61
Q

One thing that NLP has to deal with is morphology, what does that mean?

A

The formation of words and how they relate to each other

62
Q

What is natural language understanding and what are the ambiguities present in natural language?

A

Understanding the meaning of a given text
The following are ambiguities NLP will attempt to resolve
Lexical ambiguity - words have different meanings
Syntactic ambiguity - a sentence has multiple parse trees
Semantic ambiguity - a sentence has multiple meanings
Anaphoric ambiguity - a word or phrase refers back to something previously mentioned, but the reference is unclear

63
Q

What are 4 stages of Natural Language Understanding?

A

Syntax Analysis: identify the structure of the sentence, including parts of speech and sentence structure.

Semantics: Understand the meaning of words and phrases

Named Entity Recognition(NER): recognise named entities in the input. e.g. Exeter is a location

Intent Recognition: Understand the Users Intent

64
Q

What are the stages of the NLP pipeline?

A

Import text file

Sentence segmentation - identify sentence boundaries in the given text, e.g. full stops, new lines, etc.

Tokenisation - identify the different words, numbers and punctuation

Stemming - strip the endings of words, e.g. eating is reduced to eat

Part of speech (POS) tagging - assign each word in a sentence its own tag, such as designating a word as a noun or adverb

Parsing - divide the text into different categories to answer a question, e.g. this part of the sentence modifies another part

Named Entity Recognition - identifies entities such as persons, locations and times

Co-reference - define the relationship of the given words in a sentence with the previous and next sentences

65
Q

How does Sentence Segmentation work?

A

Will split a large section of text into component sentences, often using punctuation like full stops. To apply sentence tokenisation with NLTK we can use the nltk.sent_tokenize function, as in the sketch below.
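
A short NLTK sketch of sentence segmentation and tokenisation (the example text is made up; the 'punkt' download is only needed once, and newer NLTK versions may ask for 'punkt_tab' instead):

```python
import nltk

nltk.download("punkt")   # tokeniser models (only needed once)

text = "The quick dog jumped over the lazy fox. The dog is quick."
sentences = nltk.sent_tokenize(text)                   # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]    # tokenisation of each sentence
print(sentences)
print(tokens)
```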

66
Q

How does Tokenisation work?

A

Will split the segmented sentences up into “tokens”. Tokenised sentences are essentially just arrays of words

67
Q

What are stop words?

A

Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.

Stop words are removed when filtering text data as they are considered to have little meaning

67
Q

What does Lemmatization and Stemming do?

A

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

am, are, is => be
dog, dogs, dog’s, dogs’ => dog

67
Q

Define Lemmatization

A

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

e.g. The word better has “good” as its lemma

68
Q

Define Stemming

A

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes

69
Q

How does Part of Speech(POS) tagging work?

A

Part of speech (POS) tagging is a fundamental technique used to understand the grammatical function of each word in a sentence. It’s like dissecting a sentence and labeling each word based on its role.

Each word in a text is assigned a category based on its grammatical function. Common categories include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and determiners

-By knowing the POS of each word NLP applications can grasp the sentence structure and how words relate to each other

70
Q

How does the bag-of-words model work?

A

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

Any information about the order or structure of words is discarded; that’s why it’s called a bag of words. The model only captures whether a known word occurs in a document, not where it occurs in the document.
The intuition is that similar documents have similar contents, and from the content we can learn something about the meaning of the document.

71
Q

Using the bag of words model(sentence Level) how would these sentences be represented
The quick dog jumped over the lazy fox
The dog is quick

A

Note: stop words are not removed here, but keep in mind that they can be.

Using the vocabulary [the, quick, dog, jumped, over, lazy, fox, is]:

[2, 1, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 0, 0, 0, 0, 1]
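
A hedged scikit-learn sketch of the same idea; note that CountVectorizer lowercases the text and orders the vocabulary alphabetically, so the column order differs from the hand-built vectors above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The quick dog jumped over the lazy fox",
        "The dog is quick"]

vectorizer = CountVectorizer()             # pass stop_words="english" to drop stop words instead
bow = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the learned vocabulary (alphabetical order)
print(bow.toarray())                       # word counts per document
```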

72
Q

How can we measure the similarity of a BOW model?

A

measured as the cosine of the angle between two vectors
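
A minimal NumPy sketch of cosine similarity between two bag-of-words vectors (the vectors are the illustrative ones from the previous card):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([2, 1, 1, 1, 1, 1, 1, 0])
v2 = np.array([1, 1, 1, 0, 0, 0, 0, 1])
print(cosine_similarity(v1, v2))   # 1.0 means identical direction, 0.0 means no overlap
```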

73
Q

What is TF-IDF

A

TF-IDF, short for term frequency-inverse document frequency is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

The goal is to highlight words that are both important within a specific document and distinctive in the entire corpus.

74
Q

How can Term frequency be calculated?

A

A scoring of the frequency of the word in the current document.
This part measures how often a word appears in a document.
The more frequently a word appears, the higher its TF score for that document.

TF = times the term appears in the doc / total terms in the doc

75
Q

How can Inverse Document Frequency (IDF) be calculated?

A

A scoring of how rare the word is across documents.
This part measures how unique or rare a word is across multiple documents in the corpus.
The rarer the word, the higher its IDF score.

IDF = log(no. of documents / no. of documents with the term in them)

76
Q

How can TFIDF be calculated?

A

TF-IDF(term) = TF(term) * IDF(term)
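
A small worked sketch of the TF and IDF formulas from the previous cards (the tiny corpus is made up):

```python
import math

corpus = [["the", "dog", "is", "quick"],
          ["the", "cat", "sat"],
          ["the", "dog", "barked"]]
term = "dog"
doc = corpus[0]

tf = doc.count(term) / len(doc)                         # 1 occurrence / 4 terms = 0.25
docs_with_term = sum(1 for d in corpus if term in d)    # term appears in 2 of 3 documents
idf = math.log(len(corpus) / docs_with_term)            # log(3 / 2)
print(tf * idf)                                         # TF-IDF score of "dog" in doc 1
```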

77
Q

What is the basic idea of N-gram word prediction?

A

The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word (or correction of a spelling error).

We predict the next word as the one for which this (conditional) probability is highest. We can get this probability by estimating the relative frequency in a corpus.

78
Q

What is MLE and how does it work?

A

MLE (Maximum Likelihood Estimation) is a statistical technique for estimating the parameters of a probability distribution. In NLP, it’s used to estimate the likelihood of a sequence of words occurring in a natural language.

By analyzing a large corpus of text, MLE estimates the probability of each word appearing in a specific context. This information is then used to predict the most likely next word in a sequence.

79
Q

How can we use the chain rule to calculate the probability of a sequence of words?

A

Imagine a sentence as a sequence of words, like “The quick brown fox jumps.” The chain rule allows you to calculate the probability of this entire sentence by breaking it down into the probability of each word appearing given the words that came before it.

Using the chain rule:

P(S) = P(w1) * P(w2 | w1) * P(w3 | w1, w2) * … * P(wn | w1, w2, …, wn-1)

Directly calculating all these individual probabilities can be impractical, especially for long sentences.
In practice, NLP models estimate these probabilities using techniques like n-grams that rely on statistical learning from large text corpora.

80
Q

Define the Markov assumption

A

Only the prior local context – the last few words – affects the next word.
Making the Markov assumption for word prediction means assuming that the probability of a word depends only on the previous N-1 words (the N-gram model).

81
Q

What is the difference between a unigram, bigram and trigram n-gram model?

A

Unigram: This is the simplest type of n-gram, where n = 1. It just considers the probability of a single word appearing in isolation. For example, the probability of the word “the” appearing in a text.

Bigram: Here, n = 2. This considers the probability of a two-word sequence appearing together. For instance, the probability of the word “the” being followed by the word “cat” is likely higher than “the” being followed by “pineapple.”

Trigram: This type of n-gram uses a sequence of three words (n = 3). A trigram model would consider, for example, the probability of the word “store” following the two-word context “to the.”
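
A minimal sketch of a bigram model that estimates P(next | previous) by relative frequency from a toy corpus (the text is made up):

```python
from collections import Counter

tokens = "the quick dog jumped over the lazy dog".split()

unigrams = Counter(tokens)                    # counts of single words
bigrams = Counter(zip(tokens, tokens[1:]))    # counts of two-word sequences

def next_word(prev):
    # MLE estimate: P(next | prev) = count(prev, next) / count(prev)
    candidates = {w2: c / unigrams[prev]
                  for (w1, w2), c in bigrams.items() if w1 == prev}
    return max(candidates, key=candidates.get)

print(next_word("the"))   # the most likely word to follow "the" in this corpus
```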

82
Q

What are some drawbacks of N-gram language models?

A

The higher the N, the better the model usually is. But this leads to lots of computation overhead that requires large computation power in terms of RAM.
N-grams are a sparse representation of language. It will give zero probability to all the words that are not present in the training corpus.

83
Q

Give an overview of a Recurrent Neural Network

A

They are designed to handle sequential data where the order matters. They will use internal memory to store information about past inputs and influence how they process future ones

The idea is to have multiple copies of the same network, each passing a message to a successor. In this way the network can remember information.

RNN can be demonstrated as a graph of RNN cells, where each cell performs the same operation on every element in the input sequence.

84
Q

How does an RNN process sequences?

A

RNN processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far.

85
Q

What is padding sequences?

A

It’s a common pre-processing step when working with sequence data. Its purpose is to ensure that all sequences have the same length, which is required for feeding data into networks that expect fixed-size inputs
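
A hedged Keras sketch (the integer sequences are illustrative; maxlen, padding and truncating are the usual options):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 10, 5],
             [7, 2],
             [9, 1, 3, 8, 6]]

# Pad (or truncate) every sequence to the same fixed length of 4
padded = pad_sequences(sequences, maxlen=4, padding="post", truncating="post")
print(padded)
# [[ 4 10  5  0]
#  [ 7  2  0  0]
#  [ 9  1  3  8]]
```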

86
Q

How does word embedding work?

A

The embedding layer represents words as dense vectors where each dimension encodes information about how the word relates to others in the given context

87
Q

What is the training objective of word embedding?

A

The objective of the model during training is typically a language modeling task, where it predicts the probability distribution over the vocabulary for the next word given the context. This task encourages the model to learn representations that capture semantic and syntactic relationships.

88
Q

How can we create an RNN used for Word prediction?

A

-Tokenise the text into words, each represented by a unique integer index
-Create input output sequences where the input is a sequence of words and the output is the next word in the sequence
Building the model:
-Use an Embedding layer to convert word indices into dense vectors.
-Include one or more recurrent layers to capture sequential dependencies.
-Use a Dense layer with a softmax activation for predicting the next word.
-Choose an appropriate loss function, optimizer, and metrics for your problem.
-Fit the model on the training data.
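
A hedged Keras sketch of such a model (the vocabulary size, sequence length and layer sizes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 10000, 20   # illustrative values

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),                # word indices -> dense vectors
    layers.LSTM(128),                                # recurrent layer captures sequential dependencies
    layers.Dense(vocab_size, activation="softmax"),  # probability of each possible next word
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...)  # X_train: input sequences, y_train: next-word indices
```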

89
Q

What is the long term dependency problem

A

Also known as the vanishing gradient problem. This is where an RNN struggles to learn dependencies between data points that are far away from each other in long sequences.

Technical explanation:
For long sequences, the multiplication of gradients across many time steps can lead to extremely small values. If these values become close to zero, they may cause the gradients for early time steps to vanish.

90
Q

What is a Long Short-Term memory network?

A

Long Short-Term Memory (LSTM) networks are a special kind of Recurrent Neural Network (RNN) specifically designed to address the vanishing gradient problem that hinders traditional RNNs. They excel at handling long-term dependencies in sequential data, making them a powerful tool for tasks like machine translation, speech recognition, and time series analysis.

LSTM networks are a type of RNN that uses special units that include a ‘memory cell’ that can maintain information in memory for long periods of time.
A set of gates is used to control when information enters the memory, when it’s output, and when it’s forgotten.

91
Q

What are the three gates an LSTM uses to manipulate input information

A

Input gate to control the intake of new information.
-Determines what new information to store in the cell state. It considers the current input and the previous state, deciding which elements are relevant and assigning weights between 0 and 1.

Forget gate to determine what part of the cell state to be updated.
-Decides what information to discard from the previous cell state. It analyzes the current input and the previous state, assigning a value between 0 and 1 to each element in the cell state. Values close to 0 indicate forgetting, while 1 means retaining the information.

Output gate to determine what part of the cell state to output.
-Controls what information from the cell state is used to update the hidden state of the LSTM. It examines the current cell state and the previous hidden state, assigning weights between 0 and 1.

92
Q

What is the difference between machine and deep learning?

A

One major difference is that in machine learning feature extraction must be done manually; the quality of the model’s performance relies on the quality of the chosen features

In deep learning neural networks can automatically learn features from raw data

93
Q

How does the amount of data affect the performance of deep vs classical learning?

A

While deep learning and classical learning will perform at similar levels with a small amount of data, as the amount of data increases deep learning models will benefit much more

94
Q

Define regression in predictive modelling

A

Estimating numerical value of an attribute, based on the history of values assigned to this attribute and other affecting measurements

95
Q

Define classification in predictive modeling

A

Assigning a label to each instance depending on the values
of a set of attributes

96
Q

How does logistic regression work for classification?

A

It will create a linear decision boundary (e.g. a straight line) based on a formula that combines the features of the objects being classified; objects scoring below the boundary are assigned a different class from those scoring above it

97
Q

How do Decision trees work for classification

A

It is almost like a binary tree of yes/no decisions, with each level being a question about the object being classified. The model continues asking questions until it reaches a leaf node containing a classification for the object.

98
Q

What are the steps to create a decision tree?

A

Determine attribute to select as the root;
Partition input examples into subsets according to values of the root attribute;
Construct Decision Tree recursively for each subset;
Connect the roots of the sub-trees to the root of the whole tree via labelled links

99
Q

What is the first attribute of a decision tree called and how can we determine what it should be

A

It’s called the root, and it should be the attribute considered the best by some metric of goodness, e.g. information gain. The root node can be determined by computing the entropy of the classification data and creating a frequency table for the classes

100
Q

When does the exit decision for the decision tree occur?

A

Exit condition occurs when all examples belong to one class (other conditions exist to stop growing the tree earlier).

101
Q

How can we calculate the entropy for a decision tree

A

E = -Σ (pi * log2(pi))

102
Q

How can we calculate the information gain for attributes in a decision tree

A

IG(Feature) = E(parent) - Σ (wi * E(Di)), where wi = |Di|/|D| is the proportion of instances falling into subset Di
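
A small sketch applying the two formulas above to a made-up split of yes/no labels:

```python
import math
from collections import Counter

def entropy(labels):
    # E = -sum(p_i * log2(p_i)) over the classes
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, subsets):
    # IG = E(parent) - sum(|D_i|/|D| * E(D_i))
    total = len(parent)
    return entropy(parent) - sum(len(s) / total * entropy(s) for s in subsets)

parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(parent), information_gain(parent, split))   # 1.0 and 1.0
```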

103
Q

How can we use decision trees for regression

A

Decision tree regression is a supervised learning technique used for predicting continuous values.

Objective Function:
Common metrics for regression include Mean Squared Error (MSE) or Mean Absolute Error (MAE), or other regression-specific metrics.

Leaf Node Values:
In classification trees, the leaves represent the predicted class. In regression trees, the leaves should represent a continuous value.

Splitting Criteria:
Regression trees aim to minimise the chosen regression metric (e.g., MSE or MAE).

104
Q

What is the random forest model and how does it work?

A

The random forest model is a powerful ensemble learning technique used for both classification (predicting discrete categories) and regression (predicting continuous values) tasks.

A random forest is built by creating a multitude of decision trees at training time. Each tree is like an individual expert, and the final prediction is made by combining the predictions of all the trees (like a democratic vote).

105
Q

How can we create a random forest model?

A

Random Subsets: When creating each tree, the algorithm samples a random subset of data points (with replacement) from the original training data. This technique, called bagging (bootstrap aggregating), helps to introduce diversity among the trees and reduce overfitting.

Random Features: At each node of a tree, the algorithm randomly selects a subset of the features (instead of considering all features) to determine the best splitting rule. This randomness further diversifies the trees and prevents them from becoming too similar.

Growing the Trees: Each tree is grown to full depth without pruning (unlike some decision tree approaches). This ensures that each tree captures a unique aspect of the data.

Prediction: For classification tasks, the most frequent class predicted by the individual trees is chosen as the final prediction. For regression tasks, the average of the individual tree predictions is used as the final prediction.

106
Q

What are some advantages and disadvantages of the random forest model?

A

Advantages:
High Accuracy: By combining multiple decision trees, random forests can achieve higher accuracy than a single decision tree.
Robust to Overfitting: Randomization through bagging and feature selection helps to reduce overfitting, a common problem in decision trees.
Handles Missing Data: The algorithm can handle missing data points effectively.
Interpretability: While not as interpretable as a single decision tree, feature importance scores can provide insights into which features are most influential for the model’s predictions.

Disadvantages:
Can be Black Box: The inner workings of the individual trees can be complex, making it challenging to understand exactly why the model makes specific predictions.
Computationally Expensive: Training a random forest can be computationally expensive, especially for large datasets.

107
Q

What is nearest neighbor classification

A

Nearest Neighbor (NN) classification is a fundamental machine learning technique used to classify new data points based on their similarity to existing labeled data points.

108
Q

What are some key issues with nearest neighbor classification?

A

No model is built! All the data is retained.
Classifying a new observation can take up to O(np) (n training instances, p features).
It will not do well when the number of features is large.

109
Q

What is the k-NN algorithm (also known as nearest-neighbour classification)?

A

The user inputs k: an integer representing the number of nearest neighbours (instances) to search for.
For each unlabelled instance: calculate the distance between it and all the instances in the data set.
Find the k nearest neighbours.
Count the class labels assigned to the k nearest neighbours for each class.
The class with the highest count (majority vote) is the output.
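
A minimal NumPy sketch of these steps (the training points, labels and query point are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Distances from the query point to every training instance
    distances = np.linalg.norm(X_train - query, axis=1)
    # Labels of the k nearest neighbours
    nearest = y_train[np.argsort(distances)[:k]]
    # Majority vote
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 1]), k=3))   # "A"
```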

110
Q

What is nearest neighbor regression

A

Nearest Neighbor (NN) regression, also called k-Nearest Neighbors (KNN) regression, is a technique for predicting continuous values using the k nearest neighbors of a new data point.

111
Q

How does nearest neighbor regression work?

A

Calculate Distances:
Compute the distances between the query point and all other points in the training dataset based on the chosen distance metric (e.g., Euclidean distance).

Identify Neighbours:
Select the K nearest neighbours based on the calculated distances.

Aggregate Target Values:
For regression, instead of counting votes, take the average (or another aggregation measure) of the target variable values of the K neighbours. This aggregated value is the prediction for the query point.

112
Q

Define a multi-layer perceptron and its components

A

A feed-forward multi-layer perceptron is essentially a simple kind of neural network with:
input vector {x1, x2, …, xn}
output value y
intermediate “hidden” values {h1, …, hm}

113
Q

How are multi layer perceptrons used for classification?

A

A multi-layer perceptron is a non-linear function that maps the input vector to the output value.
The hidden values are also non-linear functions.
A non-linear sigmoid threshold function ensures that the output is a value in [0, 1], i.e., 0 <= y <= 1.
Weights are applied at each layer.

114
Q

Describe the learning process of an MLP

A

The learning process follows a steepest-descent approach.
The idea is that we search the landscape of possible output values, and we want to descend from large errors to small (ideally 0) error.
Backpropagation is the classic example.
The weights are adjusted by moving backwards:
first the hidden > output layer weights,
and then the input > hidden layer weights.
The amount of adjustment is proportional to the value of the error function:
big errors > big adjustments
small errors > small adjustments
The error rate should decline during the learning process.

115
Q

Define unsupervised learning

A

Unsupervised learning is a type of machine learning where algorithms learn patterns and insights from unlabeled data. Unsupervised learning allows the model to discover hidden structures or groupings within the data on its own.

116
Q

What are some advantages of unsupervised learning

A

Unsupervised learning excels at finding hidden patterns and relationships within unlabeled data
Can handle a wide variety of data types like text, images and sensor data
Good for anomaly detection

117
Q

What are some disadvantages of unsupervised learning?

A

There is no ground truth and therefore no clear “correct answer”, evaluating the result can be subjective

118
Q

Define clustering

A

Clustering is a type of unsupervised learning
A cluster is a collection of objects which are similar in some way
Clustering is the process of grouping similar objects into groups

119
Q

What is the purpose of clustering algorithms

A

To see whether the data fall into distinct groups, with members within each group being similar to other members in that group but different from members of other groups.

120
Q

How do clustering algorithms work?

A

We have to define a similarity or distance metric that computes how close two instances are to each other. Each instance is assigned to a cluster (group) based on this metric. Some instances are considered outliers and do not belong to any cluster

121
Q

What is the formula for the Euclidean distance measure?

How can we calculate the Euclidean distance for two instances p = (7, 10) and q = (5, 14)?

A

sqrt((q1 - p1)^2 + (q2 - p2)^2 +…+ (qn - pn)^2)

for p=(7,10) q=(5, 14)

= sqrt((5 - 7)^2 + (14 - 10)^2) = sqrt(20) ≈ 4.47

122
Q

What is the formula for the Manhattan distance measure?

How can we calculate the Manhattan distance for two instances p = (7, 10) and q = (5, 14)?

A

Σ_{i=1}^{n} |pi - qi|

for p=(7,10) q=(5, 14)

= |7 - 5| + |10 - 14| = 2 + 4 = 6
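
A quick NumPy sketch computing both the Euclidean and Manhattan distances from the two previous cards:

```python
import numpy as np

p = np.array([7, 10])
q = np.array([5, 14])

euclidean = np.sqrt(np.sum((q - p) ** 2))   # sqrt((5-7)^2 + (14-10)^2) ≈ 4.47
manhattan = np.sum(np.abs(p - q))           # |7-5| + |10-14| = 6
print(euclidean, manhattan)
```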

123
Q

What do each of these terms mean in clustering?
Centroids
size
variations

A

Centroids: the centre of each cluster (the average of each feature over all instances belonging to the cluster)
Size: the number of instances in the cluster
Variations: the variance or standard deviation of the instances belonging to each cluster

124
Q

How do I calculate standard deviation and variance?

A

Variance is the mean of the squares minus the square of the mean:
Variance = Σx^2/n - (Σx/n)^2
Standard deviation is the square root of the variance

125
Q

What is the cluster center?

A

A single data point that best describes a collection of objects (it stays in the centre)
For numeric data it is the centre of mass
-calculated using the mean of all data points
For nominal or ordinal data it could be the mode (the most frequent value)

It is a score function

126
Q

What are Cluster variations and how are they calculated?

A

Cluster variation is a score function that can show how compact/tight clusters are; it’s calculated as follows:

Calculate the difference between each data point and the centroid of its cluster, then square it

Add up the squared distances for every data point in a single cluster

We can then do this for every cluster and sum them to find the variation for the entire clustering

127
Q

What is the k-means algorithm (5 steps)?

A
  1. Define the number of clusters, k
  2. Choose k data objects randomly to serve as initial centroids for the k clusters
  3. Assign each data object to the cluster represented by its nearest centroid
  4. Find a new centroid for each cluster by calculating the mean vector of its members
  5. Remove all memberships and repeat from step 3 until cluster membership does not change / max iterations are reached
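
A minimal NumPy sketch of these steps (the data points and k are made up for illustration):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: choose k data objects randomly as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean vector of its members
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop when the centroids (and hence memberships) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
print(kmeans(X, k=2))
```
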
128
Q

What are some strengths of the k-means algorithm?

A

simple and easy to implement
quite efficient

129
Q

What are some weaknesses of the k-means algorithm?

A

Need to specify the value of k, but we may not know what the value should be beforehand
You may want to experiment with k value (i.e., number of clusters)
Sensitive to the initialisation
Sensitive to noise
Clusters of different size

130
Q

What is a hierarchical based clustering algorithm

A

Imagine a family tree. Hierarchical clustering builds a similar structure for your data points, where data points are grouped based on similarities, forming a hierarchy of clusters.
At the bottom of the hierarchy, each data point is in its own individual cluster.
As you move up the hierarchy, clusters are merged based on their similarities, forming larger and more general clusters.
The final result is a dendrogram, which shows the merging process and the relationships between clusters at different levels.

131
Q

What are the two main approaches for a hierarchical based clustering algorithm

A

Agglomerative Hierarchical Clustering (bottom-up): start with each data point in its own cluster and iteratively merge similar clusters until one remains

Divisive Hierarchical Clustering (top-down): start with all data in a single cluster and recursively split the cluster into subclusters based on dissimilarities

132
Q

What are the steps for an Agglomerative Clustering algorithm

A

Take n data objects as individual clusters and build an n x n dissimilarity matrix storing the distances between every pair of objects

While there is more than one cluster:
-find the two clusters/objects with the minimum distance and merge them into a bigger cluster
-replace the entries in the matrix for the original clusters or objects with the cluster tag of the newly formed cluster
-re-calculate the relevant distances and update the matrix

This whole process produces a dendrogram and relies on the definition of a distance metric between clusters

133
Q

What are the strengths of Agglomerative clustering algorithms

A

Deterministic results
Multiple possible versions of the clustering
No need to specify the value of k beforehand
Can create clusters of arbitrary shapes (single-link)

134
Q

What is the weakness of Agglomerative clustering algorithms

A

Does not scale up for large data sets

135
Q

What do these distance metrics between clusters mean:
nearest neighbour (single link)
furthest neighbour (complete link)
centroid distance
group average

A

Nearest neighbour (single link) - the distance between the two closest points in the two clusters
Furthest neighbour (complete link) - the distance between the two furthest-apart points in the two clusters
Centroid distance - the distance between the centroids of the two clusters
Group average - the average of all distances between pairs of points in the two clusters (every point links to every other)

136
Q

What is the algorithm for self-organising maps (SOMs)?

A

Select the size and type of the map.
Initialise all node/neuron weight vectors randomly.
Choose a random data point from the training data and present it to the SOM.
Find the “Best Matching Unit” (BMU) in the map – the most similar node. Similarity is calculated using the Euclidean distance formula.
Determine the nodes within the “neighbourhood” of the BMU.
Adjust the weights of the nodes in the BMU neighbourhood towards the chosen data point.
Repeat Steps 2-5 for N iterations / until convergence.

137
Q

What can we do to solve the issue of summary statistics being deceptive?

A

We can visualize our data attributes

138
Q

Why should we use data visualisation?

A

It is a good way to communicate complex information.
It is a critical tool in AI: it provides an effective way to identify summaries, structure, relationships, differences, and abnormalities in the data.

139
Q

What type of chart can we use to show categories as percentage?

A

Pie charts show categories as a proportion or a percentage of the whole. Use pie charts to show the composition of categorical data with each segment proportional to the quantity it represents.

140
Q

What is a grammar of graphics?

A

A framework that enables us to concisely describe the components of any graphics

141
Q

Why is the grammar of graphics important?

A

Learning the grammar will help you not only create graphics that you know about now, but will also help you to think about new graphics that would be even better. Without the grammar, there is no underlying theory and existing graphics packages are just a big collection of special cases.

142
Q

What are the key components in the grammar of graphs

A

Data: The raw information you want to visualize.

Aesthetic Mappings: How visual properties (aesthetics) like color, size, position, etc. are linked to the data variables

Geometrical Shapes: The basic visual marks used to represent data points (like points, lines, bars, etc.).

Scales: How data values are mapped to visual properties (e.g., color scale for temperature).

Coordinates: The system that defines the position of visual elements (e.g., x and y axes in a scatter plot).

Themes: The overall visual design elements that provide a consistent look and feel to the graphic (e.g., color palettes, fonts, backgrounds).

143
Q

What is principal component analysis?

A

Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and data analysis. It helps simplify complex datasets by focusing on the most important features that capture the majority of the variation in the data.

144
Q

What is mean centering in PCA

A

Imagine a dataset with features like height and weight. Without centering, features with larger scales (e.g., height in cm) can dominate the analysis.
Mean centering subtracts the mean value of each feature from all the data points in that feature.
This essentially shifts the data such that each feature has a mean of zero.
By centering the data around the origin (0, 0), PCA focuses on the direction of the spread (variance) rather than absolute values.

145
Q

What is the covariance matrix in PCA?

A

This matrix captures how much two features in your data vary together.
After centering, the covariance matrix shows the relationship between features in terms of their deviations from the mean.
A positive covariance indicates features tend to move together (higher values in one feature correspond to higher values in the other).
A negative covariance indicates features move in opposite directions (higher values in one feature correspond to lower values in the other).
A value close to zero suggests little linear relationship between the features.

146
Q

What are eigenvalues and eigenvectors in PCA?

A

Find the eigenvalues and eigenvectors of the covariance matrix. Eigenvalues represent the amount of variance captured in each principal component, and eigenvectors give the direction of these components

147
Q

What are the principal components in PCA?

A

The eigenvectors are the principal components, and they are ranked based on their corresponding eigenvalues. The first principal component captures the maximum variance in the data, followed by the second, third, and so on.
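
A NumPy sketch of the PCA steps from the last few cards (mean centering, covariance matrix, eigen-decomposition, ranking and projection); the small 2D dataset is made up:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)              # mean centering
cov = np.cov(X_centered, rowvar=False)       # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues/eigenvectors of the symmetric matrix

order = np.argsort(eigvals)[::-1]            # rank components by the variance they capture
components = eigvecs[:, order]               # principal components (one per column)
projected = X_centered @ components[:, :1]   # project onto the first principal component
print(eigvals[order])
print(projected)
```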

148
Q

What is a search algorithm?

A

Search algorithms provide search solutions through a sequence of actions that transform the start state to the goal state.

149
Q

What do these properties of search algorithms mean?
Completeness
Optimal
Time Complexity
Space Complexity

A

Completeness: A search algorithm is complete if it provides a solution for a given input whenever at least one solution exists for that input.

Optimality: A search algorithm is optimal if the solution it returns is the best one, i.e. the one with the lowest path cost.

Time Complexity: The time needed to accomplish a task or provide a solution.

Space Complexity: The memory or storage space needed when conducting a search operation.

150
Q

What do these terms mean?
Initial state
State space
Actions
Goal State
Goal Test
Path Cost

A

Initial state: The state in which the search starts.

State space: All the possible states that can be reached from the initial state through a series of actions.

Actions: The steps, activities, or operations undertaken by AI agents in a particular state.

Goal state: The endpoint, or desired, state.

Goal test: A test conducted to establish whether a particular state is a goal state.

Path cost: The cost associated with a given path taken by the agents.

151
Q

What are three uninformed/blind search algorithms

A

Breadth-first search
Depth-first search
Uniform cost search

152
Q

What are two Informed search algorithms

A

Greedy search
A* tree search

153
Q

How does Breadth-first search work?

A

It starts at the root node and systematically explores all the neighbouring nodes at the current depth level before moving on to nodes at the next depth level.
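
A minimal Python sketch of BFS on an adjacency-list graph (the graph and node names are made-up examples, not from the card):

from collections import deque

def bfs(graph, start, goal):
    """Return a path from start to goal, exploring level by level."""
    frontier = deque([[start]])          # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None                          # no path exists

# Hypothetical unweighted graph
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A", "D"))              # ['A', 'B', 'D']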

154
Q

How does Breadth-first search do according to these metrics?
Completeness
Optimality
Time complexity
Space complexity

A

Completeness:
BFS is complete. If there is a solution in the search space, BFS will eventually find it. This is because BFS explores all nodes at a given depth level before moving on to the next level, ensuring that it thoroughly searches the entire space.

Optimality:
BFS is optimal when all actions have the same cost. Since BFS systematically explores nodes level by level, it guarantees that the first solution found is one with the fewest possible steps or edges.

Time Complexity:
The time complexity of BFS is typically O(V+E), where V is the number of vertices (nodes) and E is the number of edges in the graph. In the worst case, BFS might visit all vertices and edges.

Space Complexity:
The space complexity of BFS is O(V) where V is the number of vertices. This is because BFS stores all vertices at a given depth level in a queue. In the worst case, the queue may contain all vertices at the maximum depth.

155
Q

How does Depth-first search work?

A

It starts at the root node, explores along a branch until it reaches a leaf node, and then backtracks to explore other branches.
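
A minimal iterative Python sketch of DFS with an explicit stack (same made-up graph style as the BFS sketch):

def dfs(graph, start, goal):
    """Follow one branch as deep as possible, backtracking when stuck."""
    stack = [[start]]                    # stack of partial paths
    visited = set()
    while stack:
        path = stack.pop()               # LIFO: most recently added path first
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                stack.append(path + [neighbour])
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(dfs(graph, "A", "D"))              # e.g. ['A', 'C', 'D']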

156
Q

How does Depth-first search do according to these metrics?
Completeness
Optimality
Time complexity
Space complexity

A

Completeness:
DFS is not necessarily complete. DFS may get stuck in an infinite loop if the graph has cycles.

Optimality:
DFS is not optimal. It may find a solution, but it does not guarantee that it is the shortest path. The first solution encountered may not be the optimal one.

Time complexity and space complexity are the same as for breadth-first search.

157
Q

How does uniform cost search work?

A

It explores nodes in a way that considers the cost associated with each node, prioritizing nodes with lower costs before those with higher costs.
It is guided by the cost to move from one node to another. The goal is to find a path where the cumulative sum of costs is the least.

cost(node) = cumulative cost of all nodes from root.
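
A minimal Python sketch using heapq as a priority queue; the weighted graph is a made-up example. Tracking the best known cost per node avoids re-expanding more expensive paths.

import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the cheapest frontier path first; step costs must be non-negative."""
    frontier = [(0, [start])]            # (cumulative cost, path)
    best_cost = {start: 0}
    while frontier:
        cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return cost, path
        for neighbour, step_cost in graph.get(node, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                heapq.heappush(frontier, (new_cost, path + [neighbour]))
    return None

# Hypothetical weighted graph: node -> [(neighbour, edge cost), ...]
graph = {"A": [("B", 1), ("C", 5)], "B": [("C", 1), ("D", 4)],
         "C": [("D", 1)], "D": []}
print(uniform_cost_search(graph, "A", "D"))   # (3, ['A', 'B', 'C', 'D'])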

158
Q

How does Uniform Cost search do in terms of completeness and optimality?

A

Completeness:
UCS is complete (provided every step cost is positive). Because it explores paths in increasing order of cumulative cost, it will eventually reach a goal state if one exists.

Optimality:
UCS is optimal. It guarantees that the first solution found is the one with the lowest cost. This is because UCS prioritizes paths with lower costs and explores them first.

159
Q

How does Greedy search work?

A

It picks the next step by focusing on the best-looking option at each moment, based mainly on heuristic values, without caring much about the total path cost.

The node closest to the goal is expanded next, where closeness is calculated using a heuristic function h(x).
h(x) is an estimate of the distance between a node and the goal node: the lower the value of h(x), the closer the node is thought to be to the goal.
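
A minimal Python sketch of greedy best-first search; the graph and the heuristic values h(x) are made-up estimates of distance to the goal:

import heapq

def greedy_search(graph, h, start, goal):
    """Always expand the node whose heuristic h(x) looks closest to the goal."""
    frontier = [(h[start], [start])]
    visited = set()
    while frontier:
        _, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(frontier, (h[neighbour], path + [neighbour]))
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}      # hypothetical estimates to the goal
print(greedy_search(graph, h, "A", "D"))  # ['A', 'C', 'D']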

160
Q

How does Greedy search do in terms of completeness and optimality?

A

Completeness:
Greedy search is not guaranteed to be complete. It may get stuck in an infinite loop or fail to find a solution, especially if it encounters cycles or cannot backtrack.

Optimality:
Greedy search is not necessarily optimal. It makes locally optimal choices at each step, but these choices may not lead to the globally optimal solution. The algorithm tends to prioritize immediate gains without considering the overall cost.

161
Q

How does A* tree search work?

A

A* search is like a smart explorer. It considers both the cost to reach a point and a heuristic estimate of how far that point is from the goal.

It combines the attributes of the uniform cost algorithm and the greedy algorithm: the evaluation adds the greedy search's heuristic cost h(x) to the uniform cost algorithm's path cost g(x).
h(x) is the forward cost, which is an estimate of the distance of the current node from the goal node.
g(x) is the backward cost, which is the cumulative cost of a node from the root node.

The total cost is denoted f(x) = g(x) + h(x).

The strategy is to select the node with the lowest total cost f(x).
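
A minimal Python sketch of A*; the weighted graph and the admissible heuristic h(x) are made-up examples. Nodes are expanded in order of f(x) = g(x) + h(x).

import heapq

def a_star(graph, h, start, goal):
    """Expand the node with the lowest f(x) = g(x) + h(x)."""
    frontier = [(h[start], 0, [start])]   # (f, g, path)
    best_g = {start: 0}
    while frontier:
        f, g, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return g, path
        for neighbour, step_cost in graph.get(node, []):
            new_g = g + step_cost
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                heapq.heappush(frontier,
                               (new_g + h[neighbour], new_g, path + [neighbour]))
    return None

graph = {"A": [("B", 1), ("C", 5)], "B": [("C", 1), ("D", 4)],
         "C": [("D", 1)], "D": []}
h = {"A": 3, "B": 2, "C": 1, "D": 0}      # never overestimates the true cost
print(a_star(graph, h, "A", "D"))          # (3, ['A', 'B', 'C', 'D'])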

162
Q

How does A* tree search do in terms of completeness and optimality?

A

Completeness:
A* Search is complete. If there is a solution in the search space, A* will eventually find it.

Optimality:
A* Search is optimal provided its heuristic is admissible (it never overestimates the true cost to the goal). Under that condition it guarantees finding the lowest-cost solution, using the heuristic to prioritise paths likely to lead to the best solution.

163
Q

What are some example applications of Breadth-first search and Depth-first search?

A

Shortest Path in Unweighted Graphs: Identifying the shortest path in a network.
Web Crawling: Indexing web pages level by level.
Puzzle Games: Solving puzzles with multiple states.
Maze Solving: Finding a path through a maze.
Game AI: Exploring possible moves in games like chess or tic-tac-toe.
Network Routing: Exploring paths in computer networks.

164
Q

What are some example applications of Uniform Cost Search?

A

Dijkstra’s Algorithm: Finding the shortest path in a graph with weighted edges.
Resource Allocation: Optimising the use of resources in project management.
Network Routing with Variable Costs: Routing in computer networks where edges have varying costs.

165
Q

What are some example applications of A* search and greedy search?

A

Robotics: Path planning for robots in an environment with obstacles.
Maps and Navigation Systems: Finding optimal routes on maps.
Video Game Pathfinding: Navigating characters in video games efficiently.
Traveling Salesman Problem: Determining the most efficient route to visit multiple locations.
Job Scheduling: Assigning tasks in a way that optimises a specific criterion.
Network Design: Optimising the layout of communication networks.

166
Q

What is a proposition?

A

A proposition is a declarative sentence (a sentence that declares a fact) that is either true or false, but not both.

167
Q

What is a propositional function?

A

A statement containing one or more variables that becomes a proposition (true or false) once specific values are substituted for the variables.

168
Q

What does the equivalence mean in propositional logic
denoted as ≡ or ↔

A

Equivalence (the biconditional) takes two propositions and is true when both have the same truth value (both true or both false), and false otherwise.
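
A tiny Python truth-table sketch, using == between booleans to play the role of ↔:

import itertools

# p <-> q is true exactly when p and q have the same truth value
for p, q in itertools.product([True, False], repeat=2):
    print(p, q, p == q)
# True  True  -> True
# True  False -> False
# False True  -> False
# False False -> True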

169
Q

What is a tautology?

A

A tautology is a compound proposition that is true under every possible assignment of truth values to its variables (every row of its truth table is true).
It is closely related to logical equivalence: two propositions are logically equivalent exactly when their biconditional is a tautology.
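
A minimal Python sketch that checks a formula against every assignment of its variables; the formulas passed in are illustrative examples:

import itertools

def is_tautology(formula, n_vars):
    """True if the formula is true under every assignment of its variables."""
    return all(formula(*values)
               for values in itertools.product([True, False], repeat=n_vars))

print(is_tautology(lambda p: p or not p, 1))             # True
print(is_tautology(lambda p, q: (p and q) or not p, 2))  # False (fails at p=True, q=False)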

170
Q

Why might we want to translate English into compound statements?

A

English (and every other human language) is often ambiguous. Translating sentences into compound statements removes the ambiguity.

171
Q

What does soundness mean in propositional logic?

A

Soundness ensures that if the initial statements and rules are true, then the conclusion will also be true.

172
Q

What does completeness mean in propositional logic?

A

Completeness ensures that any true statement can be derived within the system.

In our example, if we know the true statement “I have free time” (Q), the system allows us to derive the true statement “It’s the weekend” (P).
Completeness guarantees that the system captures all possible true statements within its rules.

173
Q
A