AI Flashcards
Define deep learning
Specific sub-field of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations.
What is binary cross entropy
A loss function commonly used in deep learning for binary classification tasks. It measures the difference between the probabilities predicted by the model and the actual binary labels in the data.
What are some key points about binary cross-entropy
-It’s a differentiable function, allowing optimization algorithms like gradient descent to efficiently adjust the model’s weights during training.
-Lower binary cross-entropy indicates better model performance, meaning the predictions are more closely aligned with the true labels. Large deviations from the true labels are penalised heavily
-It’s suitable for problems where the outcome can be classified into two categories.
What must be present in the compilation step of a deep learning model?
A loss function
An optimiser
Metrics to monitor during training and testing
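A minimal sketch of a compilation step showing all three pieces (Keras is assumed here; the cards don't name a framework, and the layer sizes are illustrative):
```python
from tensorflow import keras

# Illustrative model for a binary classification task
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    loss="binary_crossentropy",   # loss function
    optimizer="rmsprop",          # optimiser
    metrics=["accuracy"],         # metrics to monitor during training and testing
)
```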
What is Categorical cross entropy
A loss function that works very similarly to binary cross-entropy but for multi-class classification tasks. It measures the difference between the probability distribution the model predicts (typically the output of a softmax layer) and the actual probability distribution of the correct class.
When and why might Categorical cross entropy be used?
-It is used for multi-class classification problems
-It is a differentiable function, allowing optimization algorithms to efficiently adjust the model’s weights during training
What is backpropagation and how does it work
-Training technique for deep neural networks
-Works to minimize the loss function by adjusting weights and biases within the network
-Uses a reversed flow of information: it calculates the error at the output layer and then propagates it back through the network, layer by layer
Define hyperparameter tuning
For a given neural network, there are several hyperparameters that can be optimised, including the number of hidden neurons, the batch size (BATCH_SIZE) and the number of epochs.
Hyperparameter tuning is the process of finding the optimal combination of those parameters that minimise the loss function.
Define learning rate for gradient descent algorithms
Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as the learning rate (also sometimes called the step size) to determine the next point.
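A toy sketch of this update rule in plain Python (the example function f(w) = w² and the learning rate value are illustrative assumptions):
```python
# One gradient-descent step: next point = current point - learning_rate * gradient
learning_rate = 0.1

def step(w, gradient):
    return w - learning_rate * gradient

# e.g. minimising f(w) = w**2, whose gradient is 2*w
w = 5.0
for _ in range(3):
    w = step(w, 2 * w)   # w moves towards the minimum at 0
```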
When should we use SGD (Stochastic Gradient Descent) vs Mini-batch SGD
SGD:
-simpler to implement
-Can escape local minima more easily due to the noisy updates
-Slow for large datasets
Mini-Batch SGD:
-faster than SGD for large datasets (fewer updates per epoch)
- requires tuning batch size
- may be less accurate
Define Overfitting and some signs it’s occurring
Overfitting occurs when the model becomes too focused on memorizing the specific details and noise present in the training data, rather than learning the underlying patterns and relationships that generalize well to unseen data.
Signs:
High training accuracy, low validation accuracy - The model performs well on the training data but struggles on the validation data
How can we avoid overfitting
Reduce model complexity: This can involve using fewer layers, neurons, or connections in the network. A simpler model has less capacity to overfit.
Data augmentation: Artificially increasing the size and diversity of your training data by techniques like flipping images, adding noise, or cropping.
Regularization: Techniques like L1/L2 regularization penalize large weights, discouraging the model from becoming too complex and overfitting the data.
Early stopping: Stop training the model before it starts to overfit. Monitor the validation accuracy and stop training when it starts to decrease.
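A hedged Keras-style sketch combining two of these ideas, L2 regularisation and early stopping (the framework, layer sizes, patience value and data names are assumptions):
```python
from tensorflow import keras
from tensorflow.keras import regularizers

# L2 regularisation penalises large weights on this layer
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu",
                       kernel_regularizer=regularizers.l2(0.001),
                       input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Early stopping: halt training when validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])   # x_train etc. are placeholders
```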
What are activation functions, and why are they necessary in deep learning?
They add non-linearity, crucial for complex learning.
Without them, networks can only learn simple patterns.
Activation functions transform neuron output (e.g. squashing values, using thresholds).
Different activation functions (ReLU, sigmoid, tanh) exist for various tasks.
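A small NumPy sketch of the three functions mentioned (illustrative only):
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # thresholds: negative values become 0

def sigmoid(x):
    return 1 / (1 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes values into (-1, 1)
```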
What is meant by non linearity
Non-linearity means the relationship between the input and output of a neuron is not a straight line.
What do non linear activation functions do?
Non-linear activation functions solve the following limitations of linear activation functions:
They allow backpropagation because now the derivative function would be related to the input.
They allow the stacking of multiple layers of neurons as the output would now be a non-linear combination of input passed through multiple layers.
What is the vanishing gradient problem
Vanishing Gradient Problem: In deep learning, gradients are used to train the network. This problem occurs when these gradients become very small as they travel through the network layers during training.
Impact: Small gradients make it difficult to update weights in earlier layers, hindering the network’s ability to learn complex patterns.
What are some causes of the vanishing gradient problem
Blame the Activation Function: Certain activation functions (like sigmoid) have outputs that flatten out at extremes (very positive or negative inputs).
Backpropagation Culprit: During backpropagation, gradients are multiplied by the activation function’s derivative. With flattening activation, these derivatives become very small, shrinking the gradients as they travel back through layers.
Small Gradients, Big Problem: Tiny gradients make it hard to adjust weights in earlier layers, hindering learning in those crucial parts of the network.
What is the difference between classification and regression in supervised machine learning
Classification:
Goal: Predict discrete categories (classes)
Output: Labels (e.g., spam/not spam, cat/dog)
Think of: Sorting things into groups
Regression:
Goal: Predict continuous values
Output: Numbers (e.g., house price, temperature)
Think of: Estimating a value on a spectrum
What does a loss function do?
The loss function measures how badly the AI system did by comparing its predicted output to the ground truth
What is the ground truth?
Ground truth refers to the correct or true information that a model is trained on and ultimately tries to predict.
How does the mean-squared error loss function work
Mean squared error (MSE) is a common loss function used in machine learning, particularly in regression tasks. It measures the average of the squared differences between the predicted values by a model and the actual values (ground truth).
-Effective for continuous tasks
-Sensitive to outliers
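A minimal NumPy sketch of MSE (the example numbers are made up for illustration):
```python
import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences between predictions and ground truth
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

mse([3.0, 5.0], [2.5, 6.0])  # ((0.5)**2 + (1.0)**2) / 2 = 0.625
```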
What is the Log Loss loss function (binary cross entropy)
Log loss leverages the logarithm function to penalize models for predicting probabilities that are far from the actual labels (0 or 1 in binary classification). The core idea is that the loss should be higher when the predicted probability diverges from the truth and lower when it aligns with the truth. The negative logarithm inherently satisfies this property because:
-log(p) is close to 0 when the predicted probability p is close to 1 (i.e. close to the true label).
-log(p) grows very large as p approaches 0, so confident predictions that are wrong are penalised heavily.
-Used for binary tasks
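A minimal NumPy sketch of binary cross-entropy / log loss (the epsilon clipping and example values are illustrative assumptions):
```python
import numpy as np

def log_loss(y_true, p, eps=1e-12):
    # Clip predictions away from exactly 0 or 1 to avoid log(0)
    y_true = np.asarray(y_true)
    p = np.clip(np.asarray(p), eps, 1 - eps)
    # -log(p) when the true label is 1, -log(1 - p) when it is 0
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # small loss: predictions close to the labels
```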
What is the difference between binary and multi-class cross entropy
Binary cross entropy deals with two classes (0 or 1), while multi-class cross entropy handles scenarios with more than two possible categories.
In a multi-class problem, the model outputs a vector of probabilities, where each element represents the probability of the input belonging to a specific class.
The individual loss terms are averaged to get an overall multi-class cross entropy
How does a vector output work for multi-class classification tasks?
The model will output a vector containing the probability it assigns to each possible class. e.g. if the classes were apples, oranges and pears the model might output [0.9, 0.07, 0.03] for an object it believes is an apple (the probabilities sum to 1).
What is a convolutional neural network used for and why
It’s commonly used in image classification problems as it’s able to successfully capture the spatial and temporal dependencies of an image through the application of relevant filters
Do convolutional neural networks need more or less pre-processing compared to other classification algorithms
It will require much less
Note: the role of a convolutional neural network is to reduce the images to a form that is easier to process
What are some key points about convolutional neural networks
-CNNs learn features automatically
-Processing images in terms of RGB channels allows the network to capture color dependent features needed for image recognition
-progressively builds a hierarchy of features
What is a basic overview of how CNNs work
-The image is entered as a 3D cube: the height and width represent the pixel locations and the depth represents the three color channels (RGB)
-filters will slide across the image looking for patterns for each color
-the filters will create feature maps showing where they found interesting features
-pooling layers will shrink these maps and grab the key points
-This will be repeated and then the network will flatten everything and feed it to a regular neural network
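A hedged Keras sketch of this pipeline (the input size, filter counts and number of classes are assumptions):
```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu",
                        input_shape=(64, 64, 3)),     # RGB input cube
    keras.layers.MaxPooling2D((2, 2)),                # shrink feature maps, keep key points
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),                           # flatten everything
    keras.layers.Dense(64, activation="relu"),        # regular neural network on top
    keras.layers.Dense(3, activation="softmax"),      # e.g. 3 classes
])
```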
What is max pooling vs average pooling in a CNN?
Max pooling will take the maximum value in a specific window and will capture the most prominent features within a local area
-is good for object recognition tasks where key features are crucial, but might lose some spatial information
Average pooling will take the average value for a specific region of the feature map and will capture a more generalised representation of the features within a local area
-Will provide more spatial information than max pooling but may blur sharper features
Note: pooling is the step in a CNN where the data is shrunk; it works by summarising the data in a region of the feature map
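A toy NumPy sketch of 2×2 max vs average pooling on one small feature map (the values are made up):
```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

# Split the 4x4 map into 2x2 windows: (block_row, row_in_block, block_col, col_in_block)
blocks = fmap.reshape(2, 2, 2, 2)
max_pooled = blocks.max(axis=(1, 3))    # most prominent value per window: [[6, 2], [2, 7]]
avg_pooled = blocks.mean(axis=(1, 3))   # generalised summary per window: [[3.5, 1.0], [1.0, 4.25]]
```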
What is the class imbalance problem?
occurs when the dataset has a significant skew in the distribution of classes.
This is problematic as most algorithms will favor the majority class and poorly identify the minority class
What is an issue with having limited data
-The model may not be able to capture all the complexities present in the data
-The model may perform well on data it’s seen during training but may struggle with unseen data
What is overfitting and why is it a problem?
Overfitting is where the model will memorise the training data and learn from random noise or quirks in that data set.
This means that the model will perform well on the training data but will fail on anything new.
What are some solutions to a class imbalance?
-collect more data
-Delete data from the majority class
-Create synthetic data
Random over/undersampling is a solution to the class imbalance problem, what does it mean?
Random oversampling is where random data points from the minority class are duplicated
Random undersampling is where random datapoints from the majority class are deleted
What are some problems with over/under sampling
It can cause loss of information (undersampling)
It can cause overfitting and fixed boundaries (oversampling)
When training a CNN on small/class imbalanced data overfitting and bias are two issues that can arise. What are they and how can they be solved
Overfitting is caused by having too few samples to learn from, rendering the model unable to generalise to new data
Bias is caused by having class imbalance, the model is unable to learn the boundaries between the classes
These can be solved by regularisation and data augmentation
What are some techniques for data augmentation
Rotation range - takes a value in degrees which defines a range within which to randomly rotate pictures
Width shift and height shift - take a value defining a range within which to randomly translate pictures vertically or horizontally
Shear range - will randomly apply shear transformations
Zoom range - will randomly zoom into pictures
Horizontal flip - will randomly flip half the images horizontally
Fill mode - will fill in newly created pixels, which can appear after a rotation or width/height shift
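These options map onto arguments of Keras’s ImageDataGenerator; a hedged sketch with illustrative values:
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=40,        # degrees to randomly rotate
    width_shift_range=0.2,    # range for random horizontal translation
    height_shift_range=0.2,   # range for random vertical translation
    shear_range=0.2,          # random shear transformations
    zoom_range=0.2,           # random zoom
    horizontal_flip=True,     # randomly flip half the images
    fill_mode="nearest",      # fill pixels created by rotations/shifts
)
```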
What is Synthetic Minority Oversampling Technique (SMOTE)
It’s a data augmentation technique used to address a class imbalance in machine learning tasks
-Should only be performed after the train-test split
-Should only ever be performed on the training data
How does SMOTE work (4 steps)?
- Identify minority class instances
- Select instances randomly for oversampling
- Find the nearest neighbours of the selected instance within the minority class
- Generate synthetic samples
-Create synthetic samples by interpolating between the selected instance and its K nearest neighbours, e.g. taking a point along the difference between the two
The synthetic samples can then be added to the dataset
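A hedged sketch using the imbalanced-learn library’s SMOTE, applied only to the training split after the train-test split as the card notes (the synthetic dataset and parameter values are illustrative assumptions):
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced dataset (90% majority, 10% minority)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
# Oversample the minority class in the training data only
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```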
What are three performance metrics used for a CNN?
Accuracy - can be misleading for imbalanced data sets
Recall - a measure of the model correctly identifying true positives (true positives out of all actual positives)
Precision - the ratio of true positives to all predicted positives
What is a pretrained model and what is it commonly used for?
It’s a saved network that was previously trained on a large dataset
it’s commonly used when attempting deep learning on smaller datasets
What is an example of a pretrained model?
Networks pre-trained on ImageNet - a dataset with 1.4 million labeled images and 1000 different classes
It contains many different animal classes, so such models perform well at animal identification
Give three examples of pre-trained models and their uses
Faster R-CNN:
A widely used model for object detection.
Mask R-CNN:
An extension of Faster R-CNN, providing pixel-wise segmentation masks along with bounding boxes.
YOLO (You Only Look Once):
An object detection algorithm that divides the input image into a grid and predicts bounding boxes and class probabilities directly. YOLOv3 and YOLOv4 are some of the popular versions.
Define feature extraction
The process of identifying and extracting meaningful characteristics from data. Features help a deep learning model understand data by focusing only on the relevant parts, making the data easier for the model to process and analyse. The model can then learn patterns and relationships more effectively
Why is feature extraction a useful part of pre-trained models?
As these features are often learned from massive datasets, pre-trained models will already have learned useful features
How can we use extracted features and pre-trained models to improve our models
We can freeze the weights (parameters) of the pre-trained model’s initial layers (the convolutional base) so they act as a feature extractor and their knowledge remains unchanged. The later layers can then be trained specifically for the required task (this is essentially fine-tuning).
How can we create a feature extractor?
We can remove the final layers of a pre-trained model, removing the decision making sections and leaving only a feature extractor
How can we use fine-tuning in pre-trained models?
Fine-tuning is about taking the pre-trained model’s large amount of generalised knowledge and adapting it to a specific problem. This can be done by freezing the initial layers, keeping the feature-extraction knowledge intact. The later layers responsible for classifying categories are left unfrozen, and additional layers can be added specifically for the classification task
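A hedged Keras sketch of this approach, assuming VGG16 as the pre-trained convolutional base and a binary classification task (both assumptions, not named by the cards):
```python
from tensorflow import keras

# Pre-trained convolutional base without its original classifier layers
conv_base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                     input_shape=(150, 150, 3))
conv_base.trainable = False   # freeze the feature-extraction knowledge

model = keras.Sequential([
    conv_base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # new task-specific classifier
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
```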
What are some benefits of fine tuning?
Reduced Training Data Need: Since the pre-trained model already has a strong foundation, you can often achieve good results with a smaller amount of your own data for fine-tuning.
Faster Training Times: By only training a portion of the model, fine-tuning is generally much faster than training a model from scratch.
Improved Performance: By leveraging pre-trained knowledge and adapting it to your specific task, fine-tuning can significantly improve the accuracy of your final model.
When reusing a convolutional base in a CNN what is the difference between deep and shallow mode?
In deep mode, also known as heavy fine-tuning, you essentially fine-tune a significant portion of the pre-trained convolutional base along with your newly added layers.
-Improved performance
-Overfitting risk & increased training time
Shallow mode, also known as light fine-tuning, focuses on making smaller adjustments to the pre-trained model.
-Fast training
-Reduced overfitting risk
-limited performance
What is dimensionality reduction?
Dimensionality reduction is the technique of representing multi-dimensional data (data with multiple features having a correlation with each other) in a lower dimension.
Why do we need dimensionality reduction?
The curse of dimensionality is a phenomenon that happens because the sample density decreases exponentially as the dimensionality increases. When we keep adding (or increasing the number of) features without increasing the number of training samples, the dimensionality of the feature space grows and the data becomes sparser. Due to this sparsity, it becomes much easier to find a seemingly perfect solution for the machine learning model, which very likely leads to overfitting.
e.g.
As the number of dimensions increases, the amount of data needed to effectively train a model grows exponentially. This can lead to issues like overfitting and computational inefficiency. Dimensionality reduction helps alleviate this curse by reducing the number of features the model needs to learn from.
What are autoencoders?
Autoencoders are a type of unsupervised ANN. Their primary objective is to learn a representation (encoding) of the input data in a way that captures the essential features, reducing the dimensionality of the data. This encoding is then used to reconstruct the original input as closely as possible.
What are the four parts of an autoencoder
Encoder: is the part of the autoencoder responsible for encoding/compressing the input data into a lower-dimensional representation.
Decoder: it takes the compressed representation generated by the encoder and attempts to reconstruct the original input data from this representation.
Latent Space (code): the compressed representation learned by the encoder is often referred to as the “latent space.”
Objective Function: to minimize the difference between the input data and its reconstruction, which can be achieved by using a loss function such as mean squared error (MSE) to measure the difference between the input and the reconstructed output.
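A hedged Keras sketch showing the encoder, latent space (code), decoder and MSE objective (the input and latent sizes are assumptions, e.g. flattened 28×28 images):
```python
from tensorflow import keras

input_dim, latent_dim = 784, 32

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(latent_dim, activation="relu")(inputs)      # encoder -> latent space
decoded = keras.layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder -> reconstruction

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")   # objective: reconstruction error
```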
What are denoising autoencoders?
A specific type of autoencoder architecture designed to tackle noisy data. In addition to dimensionality reduction and feature extraction (like a normal autoencoder), they can remove noise from the data itself
How do denoising autoencoders work?
An autoencoder is trained on blurry or noisy images. The DAE tries to reconstruct a clean, clear version of the image from the corrupted input. By forcing this reconstruction, it learns to identify and remove the noise while capturing the underlying features of the data.
-They are trained on a corrupted version of the data, this can be anything from salt-and-pepper noise to occluding or masking part of the data
One application for AutoEncoders is adding color to black and white images
What is the goal of natural language processing?
To make machines understand and interpret human language the way it is written or spoken
What are the two levels of linguistic analysis that must be done before performing NLP
syntax- what part of the text is grammatically correct
semantics - What is the meaning of the given text
One thing that NLP has to deal with is morphology, what does that mean?
The formation of words and how they relate with each other
What is natural language understanding and what are the ambiguities present in natural language?
Understanding the meaning of a given text
The following are ambiguities NLP will attempt to resolve
Lexical ambiguity - words have different meanings
syntactic - a sentence has multiple parse trees
semantic - a sentence has multiple meanings
anaphoric - a phrase or word refers back to something previously mentioned, and the intended meaning is unclear
What are 4 stages of Natural Language Understanding?
Syntax Analysis: identify the structure of the sentence, including parts of speech and sentence structure.
Semantics: Understand the meaning of words and phrases
Named Entity Recognition(NER): recognise named entities in the input. e.g. Exeter is a location
Intent Recognition: Understand the user’s intent
What are the stages of the NLP pipeline?
Import text file
Sentence Segmentation - identify sentence boundaries in the given text, e.g. full stops, new lines, etc.
Tokenisation - identify different words, numbers and punctuation
Stemming - strip the endings of words, e.g. eating is reduced to eat
Part of speech (POS) tagging - assign each word in a sentence its own tag, such as designating a word as a noun or adverb
Parsing - divide text into different categories to answer a question, e.g. this part of the sentence modifies another part
Named Entity Recognition - identifies entities such as persons, locations and times
Co-reference - define the relationship of the words in a sentence with the previous and next sentences
How does Sentence Segmentation work?
Will split a large section of text into component sentences, often using punctuation like full stops. To apply sentence tokenisation with NLTK we can use the nltk.sent_tokenize function
How does Tokenisation work?
Will split the segmented sentences up into “tokens”. Tokenised sentences are essentially just arrays of words
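A small NLTK sketch of both steps, sentence segmentation and tokenisation (the example text is made up; the punkt tokeniser data must be downloaded first):
```python
import nltk
nltk.download("punkt")

text = "NLP is fun. It has many steps."
sentences = nltk.sent_tokenize(text)                    # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]     # tokenisation: arrays of words
# sentences -> ['NLP is fun.', 'It has many steps.']
```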
What are stop words?
Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application.
Stop words are removed before filtering text data as they are considered to have little meaning
What does Lemmatization and Stemming do?
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
am, are, is => be
dog, dogs, dog’s, dogs’ => dog
Define Lemmatization
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
e.g. The word better has “good” as its lemma
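A small NLTK sketch contrasting a stemmer with a lemmatizer (PorterStemmer and WordNetLemmatizer are illustrative choices; the WordNet data must be downloaded first):
```python
import nltk
nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmer.stem("eating")                   # 'eat'  (strips the ending)
lemmatizer.lemmatize("dogs")             # 'dog'
lemmatizer.lemmatize("better", pos="a")  # 'good' (uses vocabulary + part of speech)
```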