Let's Go: Flashcards
What learning approach does linear regression use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional
Linear regression is a supervised learning approach.
What is the aim of linear regression?
Linear regression tries to fit a line to historical (training) data so that values can be predicted for new inputs.
The equation for a line is Y = MX + C. What variables does linear regression change in order to minimize cost?
Y = MX + C
With P0 = C (the intercept) and P1 = M (the slope), the hypothesis is written as:
Y = P0 + P1X
Linear regression changes P0 and P1 to minimize cost.
What is cost defined as in linear regression? Short
Cost measures how poorly the hypothesis (line) fits the data, e.g. the sum of squared differences between predicted and actual values.
What is the hypothesis defined as in linear regression? Short
The hypothesis in linear regression just means the predicted function (line).
What are two algorithms that are often used in linear regression?
The two algorithms often used in linear regression are gradient descent and the normal equation. Gradient descent iteratively changes P0 and P1 to find the minimum of the cost; the normal equation computes the optimal P0 and P1 directly, without iteration.
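A minimal sketch of gradient descent on the hypothesis Y = P0 + P1X, minimizing a mean-squared-error cost (function and variable names are illustrative, not from the slides):

```python
# Gradient descent sketch for the hypothesis y = p0 + p1 * x,
# minimizing the mean squared error cost. Illustrative only.

def gradient_descent(xs, ys, lr=0.01, epochs=1000):
    p0, p1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # error of each prediction under the current hypothesis
        errors = [(p0 + p1 * x) - y for x, y in zip(xs, ys)]
        # partial derivatives of the MSE cost w.r.t. p0 and p1
        grad_p0 = (2 / n) * sum(errors)
        grad_p1 = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        # step downhill on the cost surface
        p0 -= lr * grad_p0
        p1 -= lr * grad_p1
    return p0, p1

# Example: data lying on the line y = 2x + 1
p0, p1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```

After enough iterations p0 and p1 approach the true intercept (1) and slope (2).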
What learning approach does K-NN (K-Nearest Neighbours) use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional
Supervised learning
Does K-NN (K-Nearest Neighbours) classify or regress? How does it do that? Short
Classification algorithm that determines the class of a given point by looking at its K nearest neighbours. Whichever class occurs most often among them is assigned to the new ('predicted') data point.
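A toy sketch of the majority-vote idea (a real implementation would use something like scikit-learn's `KNeighborsClassifier`; names here are illustrative):

```python
# Toy K-NN classifier: find the k nearest training points to a query,
# then take a majority vote over their labels.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))
    labels = [label for _, label in nearest[:k]]
    # majority vote among the k nearest neighbours
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (1, 1)))  # majority of the 3 nearest is "A"
```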
How does a regression differ from a classification? Short
Regression predicts a continuous value (the position on the other axis of the graph).
Classification predicts the discrete group/class a point on the graph belongs to.
What learning approach does K-means use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional
Unsupervised learning
Is K-means good for regression or classification?
K-means is a clustering algorithm, so it is mainly suited to classification (grouping points).
It could arguably be used for regression if the data forms a line: split the line into K groups and estimate the value of the other axis from the cluster a point falls into.
Does K-means need labeled data?
No. Unsupervised learning…
How does K-means work? Explain the algo in short.
Find K groups (clusters) within the data, each represented by a centroid. The groups found can then be used to classify new data.
The algo:
a. Start with K centroids (placed randomly)
b. Assign each data point to its nearest centroid
c. Move each centroid to the mean of the points assigned to it
d. Repeat steps b and c until a stopping criterion is met
(i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
Randomize the starting points and rerun the algo to find other results, then choose the best result.
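The steps above can be sketched in one dimension (illustrative code, not from the slides):

```python
# Minimal 1-D K-means sketch following steps a-d above.
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)            # a. random start
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # b. assign to nearest centroid
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]     # c. move to cluster mean
        if new == centroids:                        # stop: nothing changed
            break
        centroids = new
    return sorted(centroids)

print(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2))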
Does K-means always give a result?
K-means is guaranteed to converge to a result; it does not shift around forever. That result MAY be a local optimum.
Does K-NN (K-Nearest Neighbours) need labeled data?
Yes. Supervised learning…
Does linear regression need labeled data?
Yes. Supervised learning…
What is the ‘search space’ in regards to neural networks? Short
The search space is the set of all possible ‘answers’ (candidate solutions) for a problem.
What are the most basic activation functions?
Hard limit (step function): the simplest! Converts its input to either 0 or 1; if the input is above some threshold it becomes a 1, otherwise a 0.
Linear: returns its input unchanged, or scaled by a constant.
Sigmoidal: more advanced, but essential to solving useful problems with ANNs; based on an S-shaped curve that squashes the input into the range (0, 1).
ReLU (Rectified Linear Unit): outputs the input if it is positive, otherwise 0, i.e. max(0, x).
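The four functions above can be written directly (a sketch; threshold and scale parameters are illustrative):

```python
# Sketches of the basic activation functions described above.
import math

def hard_limit(x, threshold=0.0):
    return 1.0 if x > threshold else 0.0   # step function: 0 or 1

def linear(x, scale=1.0):
    return scale * x                       # identity, optionally scaled

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))     # S-curve, output in (0, 1)

def relu(x):
    return max(0.0, x)                     # Rectified Linear Unit
```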
What is feed forward in regards to neural networks? Short
Feed forward refers to data flowing in one direction through the network: each node multiplies its inputs by its weights, sums them, applies an activation function, and passes the result on to the connected nodes in the next layer.
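A single feed-forward step for one neuron might look like this (illustrative sketch; ReLU is used here just as an example activation):

```python
# One feed-forward step: weighted sum of inputs plus bias, then activation.
def neuron_output(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return max(0.0, z)  # ReLU activation as an example

print(neuron_output([1.0, 2.0], [0.5, 0.5], 0.0))  # → 1.5
```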
What are ‘learning rules’ in regards to neural networks? Name & explain them. Detailed
TLDR: Change weights based on the correctness of the prediction.
Perceptron Learning Rule: states that the algorithm will automatically learn the optimal weight coefficients. Single-layer perceptrons can learn only linearly separable patterns. Basically the same idea as error back-propagation, but with a single layer there is nothing to propagate through, so it has a different definition.
Error back-propagation rule: the training samples are fed through one by one, and for each, the actual output is compared to the correct output. The resulting error is propagated backwards and used to alter the weights between all neurons throughout the network.
Should datasets be split into multiple groups? Which ones and what ratio? Short
Datasets should be split into three groups with the ratio 70/20/10: 70% for training, 20% for evaluation (validation), and 10% for testing. Evaluation happens after every epoch to check the model's current output and detect overfitting. The test data, in contrast, is used after the model is trained to see how it performs on completely new data.
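A sketch of the 70/20/10 split (function name and seed are illustrative; shuffling first avoids ordering bias):

```python
# Split a dataset 70/20/10 into train/validation/test sets.
import random

def split_dataset(data, seed=42):
    data = data[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)     # shuffle before splitting
    n = len(data)
    train = data[: int(0.7 * n)]
    val   = data[int(0.7 * n): int(0.9 * n)]
    test  = data[int(0.9 * n):]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 20 10
```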
What are the effects of a NN (neural network) having a high ‘bias’? Short
A model with high bias tends to be smaller and/or simpler. If too much so, it fails to capture the relationships within the data, leading to lower accuracy. In other words, it’s biased toward seeing different values in the training data as too similar, lacking the ability to capture the variations.
What are the effects of a NN (neural network) having high ‘variance’? Short
A model with high variance tends to be larger and/or more complex. If too much so, it sees even slightly different samples as vastly different, giving inaccurate predictions, as it learns from the noise in the data. In other words, it sees a lot of variance in training data, even if some samples are very similar to each other.
Regularization techniques are used to balance bias and variance, and prevent overfitting. What are the ones covered in the slides? Detail
Lasso (L1)
Least Absolute Shrinkage and Selection Operator
Prevents overfitting by adding a penalty term to the loss function.
Remember, the loss function is the way the error is calculated.
It detects the synaptic weights associated with less important features, and drives them to zero value.
It therefore carries out feature selection internally.
Leads to smaller models
Ridge (L2)
Similar to L1, it adds an extra term to the loss function, which is related to the squares of the weights.
Unlike L1, which drives weights to zero, L2 drives those associated with less important features down to smaller values.
More computationally efficient than L1.
Dropout
Doesn’t change the loss function, but the network architecture itself.
Neurons in each layer are randomly “shut down” (or ignored), and the selection is changed every epoch.
It forces the network to learn more robust features and has been shown to increase generalization.
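The L1 and L2 penalty terms above can be sketched directly (the lambda values are illustrative hyperparameters, not from the slides):

```python
# L1 (Lasso) and L2 (Ridge) penalty terms added to a base loss.
def l1_penalty(weights, lam=0.01):
    # sum of absolute weights: pushes unimportant weights to exactly zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.01):
    # sum of squared weights: shrinks unimportant weights toward zero
    return lam * sum(w * w for w in weights)

def regularized_loss(base_loss, weights, penalty=l2_penalty):
    return base_loss + penalty(weights)
```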
What are genetic algorithms? short
Genetic algorithms are algorithms that mimic evolution to ‘train’ a solution. It’s easy to get a good result using your domain-specific knowledge, as you can encode it directly into the starting population.
How does a genetic algorithm work? Simple step by step
GA procedure at a high level:
1. Initialize a population of solutions (chromosomes).
2. Evaluate each chromosome in the population using an objective function.
3. Create new solutions from the previous population by applying the reproduction & modification operations.
4. Replace the old population with the new one just created.
5. Repeat steps 2 to 4 for a number of iterations (generations).
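The steps above can be sketched on the classic "OneMax" toy problem, maximizing the number of 1-bits in a chromosome (all parameters here are illustrative):

```python
# Toy GA following steps 1-5: maximize the number of 1-bits ("OneMax").
import random

def run_ga(length=20, pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    fitness = lambda c: sum(c)                       # objective function
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]                 # 1. initialize population
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # 2. evaluate & rank
        new_pop = pop[:2]                            # elitism: keep top 2 as-is
        while len(new_pop) < pop_size:
            p1, p2 = rng.sample(pop[:10], 2)         # reproduction: pick fit parents
            cut = rng.randrange(1, length)           # 3. single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                   # 3. occasional mutation
                i = rng.randrange(length)
                child[i] ^= 1
            new_pop.append(child)
        pop = new_pop                                # 4. replace old population
    return max(pop, key=fitness)                     # best after 5. (iterations)

best = run_ga()
```

After 50 generations the best chromosome is at or near the all-ones optimum.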
What is reproduction in a genetic algorithm? Short
Reproduction is the process in which individual solutions are ranked according to their objective function values (i.e., performance) and selected accordingly for the next generation.
What is crossover in a genetic algorithm? Short
The idea of crossover is that the genetic material (i.e. binary patterns) of parents is combined to produce new solutions (children) that will ‘hopefully’ benefit from the strengths of both parents. A biased roulette wheel (fitness-proportionate selection) is used to select the parents. Crossover is achieved by exchanging coding bits between the two ‘mated’ strings.
What is elitism in a genetic algorithm? Short
Take the top x percent of performers and include them in the next population without any reproduction/mutation.
What is mutation in a genetic algorithm? Short
This is the occasional random alteration of the value of a bit in the string.
The mutation operation plays the role of occasionally providing new material to add to the diversity of the search. This gives the algorithm the ability to potentially find a new/better local optimum.
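Single-point crossover and bit-flip mutation on bit strings can be sketched as (illustrative; the mutation rate is an assumed hyperparameter):

```python
# Single-point crossover and bit-flip mutation on bit-string chromosomes.
import random

def crossover(parent1, parent2, point):
    # exchange coding bits after the crossover point
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.01, rng=random):
    # occasionally flip a bit to add diversity to the search
    return [b ^ 1 if rng.random() < rate else b for b in chromosome]

c1, c2 = crossover([1, 1, 1, 1], [0, 0, 0, 0], 2)
print(c1, c2)  # [1, 1, 0, 0] [0, 0, 1, 1]
```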