Let's go Flashcards

1
Q

What learning approach does linear regression use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional

A

Linear regression is a supervised learning approach.

2
Q

What is the aim of linear regression?

A

Linear regression tries to fit a line to historical data, so that the line can then be used to predict values for new inputs.

3
Q

The equation for a line is Y = MX + C. What variables does linear regression change in order to minimise cost?

A

Y = MX + C
P0 = C (the intercept), P1 = M (the slope)
Y = P0 + P1X
Linear regression adjusts P0 and P1 to minimise the cost.

4
Q

What is cost defined as in linear regression? Short

A

Cost is a measure of how poorly the hypothesis (line) fits the data, typically the average squared difference between the predicted and actual values (mean squared error).
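As a concrete illustration (not from the deck), here is a minimal sketch of the hypothesis and a mean-squared-error cost in plain Python; the data points and parameter values are made up:

```python
# Hypothesis: y = p0 + p1 * x  (p0 = intercept C, p1 = slope M)
def hypothesis(p0, p1, x):
    return p0 + p1 * x

# Cost: mean squared error between predictions and actual values
def cost(p0, p1, xs, ys):
    errors = [(hypothesis(p0, p1, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Toy historical data (hypothetical): roughly y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 9.0]
print(cost(1.0, 2.0, xs, ys))  # small cost: the line fits well
print(cost(0.0, 0.5, xs, ys))  # large cost: the line fits poorly
```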

5
Q

What is the hypothesis defined as in linear regression? Short

A

The hypothesis in linear regression just means the predicted function (line).

6
Q

What are two algorithms that are often used in linear regression?

A

The two algorithms often used in linear regression are gradient descent and the normal equation. Gradient descent iteratively changes P0 and P1 to find a (possibly local) optimum of the cost; the normal equation solves for the optimal parameters directly in a single step.
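A rough sketch of both ideas, assuming a simple one-feature dataset; the data, learning rate and iteration count are made up, and the normal equation is shown in its textbook (XᵀX)⁻¹Xᵀy form:

```python
import numpy as np

# Toy data (hypothetical): y is roughly 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

# Gradient descent: iteratively nudge p0, p1 downhill on the MSE cost
p0, p1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = p0 + p1 * x
    error = pred - y
    p0 -= lr * 2 * error.mean()          # d(cost)/d(p0)
    p1 -= lr * 2 * (error * x).mean()    # d(cost)/d(p1)
print("gradient descent:", p0, p1)

# Normal equation: solves for the parameters directly, no iteration
X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
theta = np.linalg.inv(X.T @ X) @ X.T @ y      # (X^T X)^-1 X^T y
print("normal equation:", theta)
```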

7
Q

What learning approach does K-NN (K-Nearest Neighbours) use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional

A

Supervised learning

8
Q

Does K-NN (K-Nearest Neighbours) classify or regress? How does it do that? Short

A

K-NN is a classification algorithm that determines the class of any given point by looking at its k nearest neighbours. Whichever class occurs most among those neighbours is the class given to the 'predicted' data point.
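A minimal sketch of that majority-vote idea; the points, labels and choice of k are hypothetical:

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points with two classes
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(points, labels, (2, 2)))  # -> "A"
print(knn_classify(points, labels, (7, 8)))  # -> "B"
```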

9
Q

How does a regression differ from a classification? Short

A

Regression tries to predict a continuous value (e.g. the value on the other axis of the graph).
Classification tries to predict the group/class that a point on the graph belongs to.

10
Q

What learning approach does K-means use?
a. supervised
b. unsupervised
c. reinforcement
d. deep
e. meta-heuristics
f. convolutional

A

Unsupervised learning

11
Q

Is K-means good for regression or classification?

A

Good for classification (clustering) and, with some adaptation, regression.
For regression, if the data roughly forms a line you can split it into K groups and determine the value on the other axis from the group a new point is assigned to (e.g. that group's mean), although this is not its typical use.

12
Q

Does K-means need labeled data?

A

No. Unsupervised learning…

13
Q

How does K-means work? Explain the algo in short.

A

Find K groups within the data, each represented by a centroid. The groups found can then be applied to new data.

The algorithm (see the sketch below):
1. Start with K centroids placed at random.
2. Assign each data point to its nearest centroid.
3. Move each centroid to the mean of the points assigned to it.
4. Repeat steps 2 and 3 until a stopping criterion is met (e.g. no data points change clusters, the sum of the distances is minimised, or some maximum number of iterations is reached).

Randomise the starting centroids and rerun the algorithm to find other results, then choose the best one.
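A minimal k-means sketch following those steps, assuming 2-D points; the data and the stopping criterion (centroids stop moving) are illustrative only:

```python
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means: assign points to nearest centroid, recompute means."""
    centroids = random.sample(points, k)                    # step 1: random centroids
    for _ in range(max_iters):
        # step 2: group each point with its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # step 3: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                      # stop: nothing changed
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two obvious groups
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(data, k=2)[0])
```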

14
Q

Does K-means always give a result?

A

K-means is guaranteed to converge to a result (it does not shift around forever), but that result may only be a local optimum.

15
Q

Does K-NN (K-Nearest Neighbours) need labeled data?

A

Yes. Supervised learning…

16
Q

Does linear regression need labeled data?

A

Yes. Supervised learning…

17
Q

What is the ‘search space’ in regards to neural networks? Short

A

The search space is the set of all possible ‘answers’ (candidate solutions) to a problem.

18
Q

What are the most basic activation functions?

A

Hard limit (step function): the simplest. Converts a fraction to either 0 or 1: if the fraction is above some threshold it becomes 1, otherwise 0.

Linear: returns a fraction, either the same fraction or a scaled one.

Sigmoidal: more advanced, but essential to solving useful problems with ANNs. Based on an S-shaped curve that squashes values into the range 0 to 1.

ReLU (rectified linear unit): returns the input if it is positive, otherwise 0.
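For reference, a small sketch of these four functions; the threshold and scale parameters are arbitrary:

```python
import math

def hard_limit(x, threshold=0.0):
    return 1 if x > threshold else 0          # step function

def linear(x, scale=1.0):
    return scale * x                          # identity or scaled

def sigmoid(x):
    return 1 / (1 + math.exp(-x))             # S-shaped, squashes into (0, 1)

def relu(x):
    return max(0.0, x)                        # 0 for negatives, x otherwise

for f in (hard_limit, linear, sigmoid, relu):
    print(f.__name__, [round(f(v), 3) for v in (-2.0, 0.5, 2.0)])
```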

19
Q

What is feed forward in regards to neural networks? Short

A

Feed-forward refers to the fact that data flows in one direction through the network: each node multiplies its inputs by the corresponding weights, sums them, applies its activation function, and passes the result to the nodes it connects to in the next layer.
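A minimal sketch of a forward pass through a tiny network, assuming sigmoid activations; the layer sizes and weights are made up:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def feed_forward(inputs, layers):
    """Pass `inputs` through each layer: weighted sum + bias, then activation."""
    activations = inputs
    for weights, biases in layers:           # weights[j][i]: weight from input i to node j
        activations = [
            sigmoid(sum(w * a for w, a in zip(node_weights, activations)) + b)
            for node_weights, b in zip(weights, biases)
        ]
    return activations

# Hypothetical 2-input -> 2-hidden -> 1-output network with made-up weights
hidden = ([[0.5, -0.6], [0.1, 0.8]], [0.0, -0.1])
output = ([[1.2, -0.4]], [0.05])
print(feed_forward([1.0, 0.5], [hidden, output]))
```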

20
Q

What are ‘learning rules’ in regards to neural networks? Name & explain them. Detailed

A

TLDR: change the weights based on the correctness of the prediction.

Perceptron learning rule: states that the algorithm will automatically learn the optimal weight coefficients by adjusting the weights whenever a prediction is wrong. Single-layer perceptrons can only learn linearly separable patterns. It is conceptually similar to error back-propagation, but with no hidden layers there is nothing to propagate the error back through, so it has a different definition.

Error back-propagation rule: the training samples are fed through one by one, and for each one the actual output is compared to the correct output. This error is then used to alter the weights between all neurons throughout the network.

21
Q

Should datasets be split into multiple groups? Which ones and what ratio? Short

A

Datasets should be split into three groups, typically in the ratio 70/20/10: 70% for training, 20% for evaluation (validation), and 10% for testing. Evaluation happens after every epoch to check the current model's output and detect overfitting. The test data, by contrast, is used only after the model is trained, to see how it performs on completely new data.
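A simple way such a split might be done in code; the 70/20/10 ratio is taken from the card, everything else (shuffling, the seed) is an assumption:

```python
import random

def split_dataset(samples, train=0.7, val=0.2, seed=0):
    """Shuffle and split into train / validation / test (remainder) sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_dataset(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 70 20 10
```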

22
Q

What are the effects of a NN (neural network) having a high ‘bias’? Short

A

A model with high bias tends to be smaller and/or simpler. If too much so, it fails to capture the relationships within the data, leading to lower accuracy. In other words, it’s biased toward seeing different values in the training data as too similar, lacking the ability to capture the variations.

23
Q

What are the effects of a NN (neural network) having high ‘variance’? Short

A

A model with high variance tends to be larger and/or more complex. If too much so, it sees even slightly different samples as vastly different, giving inaccurate predictions, as it learns from the noise in the data. In other words, it sees a lot of variance in training data, even if some samples are very similar to each other.

24
Q

Regularization techniques are used to balance bias and variance, and prevent overfitting. What are the ones covered in the slides? Detail

A

Lasso (L1)
Least Absolute Shrinkage and Selection Operator.
Prevents overfitting by adding a penalty term to the loss function (remember, the loss function is the way the error is calculated).
It detects the synaptic weights associated with less important features and drives them to zero.
It therefore carries out feature selection internally and leads to smaller models.

Ridge (L2)
Similar to L1, it adds an extra term to the loss function, related to the squares of the weights.
Unlike L1, which drives weights to zero, L2 drives those associated with less important features down to smaller values.
More computationally efficient than L1.

Dropout
Doesn't change the loss function, but the network architecture itself.
Neurons in each layer are randomly "shut down" (or ignored), and the selection is changed every epoch.
It forces the network to learn more robust features and has been shown to increase generalisation.
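A small numerical sketch of the L1/L2 penalty terms and of a dropout mask; the loss, the λ value and the dropout rate are made up, and real dropout implementations also rescale the surviving activations:

```python
import numpy as np

def loss_with_regularisation(y_true, y_pred, weights, lam=0.01, kind="l2"):
    """MSE loss plus an L1 (lasso-style) or L2 (ridge-style) penalty term."""
    mse = np.mean((y_true - y_pred) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(weights))     # pushes weights to exactly zero
    else:
        penalty = lam * np.sum(weights ** 2)        # shrinks weights towards zero
    return mse + penalty

def dropout(activations, rate=0.5, seed=0):
    """Randomly zero out a fraction of activations (training time only).
    Real implementations also rescale the survivors by 1 / (1 - rate)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= rate
    return activations * mask

w = np.array([0.8, -0.1, 0.0, 2.5])
print(loss_with_regularisation(np.array([1.0]), np.array([0.9]), w, kind="l1"))
print(dropout(np.ones(8)))
```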

25
Q

What are genetic algorithms? short

A

Genetic algorithms are algorithms that mimic evolution to 'train' a solution. It is easy to get a good result using your domain-specific knowledge, as you can encode it directly into the starting population.

26
Q

How does a genetic algorithm work? Simple step by step

A

GA procedure at a high level (see the sketch below):
1. Initialise a population of solutions (chromosomes).
2. Evaluate each chromosome in the population using an objective function.
3. Create new solutions from the previous population by applying the reproduction and modification operations.
4. Replace the old population with the newly created one.
5. Repeat steps 2 to 4 for a number of iterations (generations).
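A minimal GA loop in that spirit; the bit-string target, population size, mutation rate and simple truncation-style parent selection are all made up for illustration (the biased roulette wheel is sketched on a later card):

```python
import random

TARGET = [1, 0, 1, 1, 0, 1, 0, 1]            # hypothetical problem: match this bit string

def fitness(chrom):
    return sum(1 for a, b in zip(chrom, TARGET) if a == b)

def crossover(a, b):
    point = random.randint(1, len(a) - 1)     # single-point crossover
    return a[:point] + b[point:]

def mutate(chrom, rate=0.05):
    return [1 - g if random.random() < rate else g for g in chrom]

# 1. initialise a random population of chromosomes
population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for generation in range(50):
    # 2. evaluate each chromosome with the objective (fitness) function
    population.sort(key=fitness, reverse=True)
    # 3. create new solutions via reproduction, crossover and mutation
    parents = population[:10]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population))]
    # 4. replace the old population with the newly created one
    population = children
print(max(population, key=fitness), "after", generation + 1, "generations")
```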

27
Q

What is reproduction in a genetic algorithm? Short

A

Reproduction is the process in which individual solutions are ranked according to their objective function values (i.e. performance) and selected accordingly for the next generation.

28
Q

What is crossover in a genetic algorithm? Short

A

The idea of crossover is that the genetic material (i.e. binary patterns) of parents is combined to produce new solutions (children) that will 'hopefully' benefit from the strengths of the parents. A biased roulette wheel is used to select the parents. Crossover is achieved by exchanging coding bits between the two 'mated' strings.

29
Q

What is elitism in a genetic algorithm? Short

A

Take the top X per cent of performers and include them unchanged in the next population, without any reproduction or mutation.

30
Q

What is mutation in a genetic algorithm? Short

A

This is the occasional random alteration of the value of a bit in a string.
The mutation operation plays the role of occasionally providing new genetic material to add to the diversity of the search. This gives the algorithm the ability to potentially find a new/better local optimum.

31
Q

What is the hamming cliff problem in GAs? Detail

A

The way we represent any given agent in our genetic algorithm can have implications for its performance. For example, if we use binary: the number 127 is 01111111, which has a Hamming weight of 7 (i.e. seven ones), but 128 is 10000000, with a Hamming weight of 1, even though these are consecutive numbers. Moving between them requires flipping every bit at once. This Hamming cliff can cause a problem for optimisation.

Choose a representation for your genetic algorithm that allows easy modification but does not suffer from the Hamming cliff problem (e.g. Gray coding).

32
Q

What are parallel genetic algorithms? Short

A

Two or more GAs working independently and occasionally interacting. This can greatly increase the amount of the search space the genetic algorithm explores whilst decreasing the time needed to find a local optimum.

33
Q

What are the 3 most common types of parallel genetic algorithm models?

A

Master-slave Model (also known as Global Model).
Diffusion Model.
Migration Model.

34
Q

What is the problem with parallel genetic algorithm models? Short

A

Tuning PGA parameters could be treated as an optimisation issue by itself.

35
Q

What is reinforcement learning? Short & practical definition

A

RL can be considered similar to supervised learning, but with delayed feedback (in the form of rewards). Put simply, an agent explores a (real or simulated) world, does things right/wrong, and learns from its experiences.

36
Q

What does ‘markov decision processes’ or ‘Markov Chains’ refer to in reinforcement learning? Detailed

A

Reinforcement learning uses ‘Markov decision processes’, which choose the next action based only on the current state (the outcome of the previous action), since storing the outcome of all previous actions is not feasible (memory).

This behaviour gives rise to Markov chains: although we do not store the outcome of every action and have no explicit ‘knowledge’ of our history, what has been learnt is always propagated forward through the previous action and/or the Q matrix. Subsequent states therefore have a kind of “memory” of previous states, especially recent ones.

MDPs (Markov decision processes) follow the cycle of state, action, reward. Basically, given a state, an action is taken, which yields a reward (positive or negative).

37
Q

What is a Q-Matrix in reinforcement learning?

A

A matrix defining what action to take depending on what state the agent is in. If the action values are the same, pick one at random. Keep track of the current reward and update the Q matrix after the action is taken, according to the Bellman equation. Given enough timesteps, the Q matrix will fill up and converge to an optimal action for each state.
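A minimal Q-learning sketch of this idea on a made-up corridor world; the learning rate, discount factor and exploration probability are arbitrary choices, not values from the slides:

```python
import random

# Hypothetical 1-D corridor: states 0..4, reward only at the right-hand end
n_states, n_actions = 5, 2                  # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, episodes = 0.5, 0.9, 200

for _ in range(episodes):
    state = 0
    while state != n_states - 1:
        if random.random() < 0.2 or Q[state][0] == Q[state][1]:
            action = random.choice([0, 1])                    # explore, or break a tie randomly
        else:
            action = max((0, 1), key=lambda a: Q[state][a])   # exploit best-known action
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Bellman-style update: move Q towards reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print([[round(q, 2) for q in row] for row in Q])
```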

38
Q

What are the common problems with reinforcement learning and their solutions? Detail

A

Problems:
1. T (the number of timesteps) might be infinite, leading to a meaningless (unbounded) utility value.
2. Equal consideration is given to rewards received a long time ago as to those received only recently.

Fix:
We introduce a discount factor, γ (the Greek letter gamma), a value between 0 and 1 exclusive (normally close to 1, e.g. 0.9). Multiply the next utility by this factor, e.g. 4 × 0.9, then by the factor raised to increasing powers for subsequent state utilities, e.g. 6 × 0.9 × 0.9. Thus only near-future states really matter.
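A tiny numerical sketch of the discounting, using the card's γ = 0.9 and made-up rewards:

```python
gamma = 0.9
future_rewards = [4, 6, 3, 5]        # hypothetical rewards at t+1, t+2, t+3, t+4

# each future reward is multiplied by gamma raised to an increasing power,
# so distant rewards contribute less and the sum stays bounded
discounted = sum(r * gamma ** (t + 1) for t, r in enumerate(future_rewards))
print(round(discounted, 4))          # 4*0.9 + 6*0.9**2 + 3*0.9**3 + 5*0.9**4 ≈ 13.93
```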

39
Q

What are the names of the steps to compute a single population for a genetic algorithm?

A
  1. Initialize population
  2. Sort by fitness value
  3. Elitism
  4. Calculate the fractional fitness value of total fitness for each
  5. Calculate cumulative fractional fitness value for each
  6. Reproduction
  7. Crossover
  8. Mutation
40
Q

What are the steps to compute which chromosomes pass to the next round through ‘elitism’ by hand in a genetic algorithm?

A

Sort the initial population by fitness value if not already done

The number of chromosomes passing to the next round by elitism = population * elitism rate hyper parameter. Assuming the list is sorted, just take the top X values and pass them to the next population

41
Q

What are the steps to compute which chromosomes pass to the next round through ‘reproduction’ by hand in a genetic algorithm?

A
  1. Sort the initial population by fitness value, if not already done.
  2. Calculate the fractional fitness value of the total fitness: sum all fitness values and divide each chromosome's fitness by the total.
  3. Calculate the cumulative fractional fitness value for each chromosome: aggregate the fractional fitness values as you iterate down the sorted list.
  4. Reproduction (population × reproduction-rate hyperparameter = N):
    - Pick a random number from the random-number list and go down the cumulative fractional fitness values until the cumulative value exceeds the random number; that chromosome passes to the next round. Do that N times (this is the biased roulette wheel, sketched below).
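A small sketch of that biased roulette wheel; the population, fitness values and random numbers are made up:

```python
# Hypothetical sorted population: (chromosome, fitness)
population = [("1101", 8), ("1011", 6), ("0111", 4), ("0001", 2)]

total = sum(f for _, f in population)
fractional = [f / total for _, f in population]                          # step 2
cumulative = [sum(fractional[:i + 1]) for i in range(len(fractional))]   # step 3

def roulette_pick(r):
    """Walk down the cumulative list until it exceeds the random number."""
    for (chrom, _), c in zip(population, cumulative):
        if c > r:
            return chrom
    return population[-1][0]

random_numbers = [0.12, 0.55, 0.83]   # e.g. taken from a provided random-number list
print([roulette_pick(r) for r in random_numbers])   # -> ['1101', '1011', '0111']
```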
42
Q

What are the steps to compute which chromosomes pass to the next round through ‘crossover’ by hand in a genetic algorithm?

A

Make sure the chromosomes are sorted by fitness and that the cumulative fractional fitness value has been computed for each row.

The number of chromosomes to be crossed over = population × crossover-rate hyperparameter. Take X random numbers and, for each, go down the cumulative fractional fitness list, storing the picked rows' genes as pairs. Pick a random point in each pair at which to cross them over (see the sketch below). These crossed-over genes are added to the next population.
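A minimal sketch of single-point crossover on two made-up bit strings:

```python
import random

def single_point_crossover(parent_a, parent_b, point=None):
    """Exchange the tails of two bit strings at a (random) crossover point."""
    if point is None:
        point = random.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

print(single_point_crossover("110011", "000110", point=3))  # ('110110', '000011')
```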

43
Q

What are the steps to compute which chromosomes pass to the next round through ‘mutation’ by hand in a genetic algorithm?

A

Make sure the chromosomes are sorted by fitness and that the cumulative fractional fitness value has been computed for each row.

Find the number of genes that need to be mutated = total genes × mutation-rate hyperparameter. Take X random numbers and multiply each by the total number of genes. Take the corresponding gene, counting genes from left to right, top to bottom, and invert that gene's value: new value = (MAX + 1) − current value.

44
Q

Week 11: Data Pre-processing and Feature Selection

A
45
Q

Week 12: Unsupervised learning and visualization

A
46
Q

Week 13: Hyperparameter tuning and pipelines

A
47
Q

What are deep neural networks and what are they good at?

A

A deep neural network is a network with more than one 'hidden' layer. Before deep neural networks, the vanishing gradient problem limited the performance of state-of-the-art networks; deep networks largely avoid it because they generally use ReLU activation functions (whose gradient does not shrink for positive inputs) or LSTM (memory) cells.

48
Q

What types of layers are often introduced in deep neural networks?

A
  1. Flatten
  2. Dense layer (fully connected)
  3. Convolutional layer
  4. Pooling layer
  5. Dropout layer
  6. Batch normalization layer
  7. Recurrent Layer
49
Q

What is the function of a ‘flatten’ layer in a deep neural network?

A

A “flatten” layer transforms a matrix of values into a vector.

50
Q

How are synapses defined in a neural network compared to a deep neural network?

A

The concept of synapses in deep neural networks is essentially the same as in basic neural networks, but with some added complexity due to the depth of the network.

51
Q

What are the types of deep neural networks? Name 2

A

Convolutional neural network
Recurrent neural network

52
Q

What are convolutional neural networks?

A
  • Type of deep neural network
  • Often used with image data & sound data
  • Perform automatic feature selection by picking out salient parts of an image in increasing levels of detail.
  • Basic concept: Weights are now defined as small tables of numbers (known as kernels), which are slid across the image to perform image recognition / computer vision.
53
Q

What are Recurrent neural networks?

A
  • Type of deep neural network
  • A neural network that takes multiple inputs (or sets of inputs) over time.
  • However, the output from the previous forward propagation through the model is also fed back in as part of the input.
54
Q

What is a LSTM (Long Short-term Memory)?

A
  • Used in deep neural networks
  • Overcomes the shortcomings of RNNs (recurrent neural networks) by learning to forget, thus improving accuracy.
  • The forget gate applies a sigmoid to the input and multiplies the result (element-wise) by the output from the previous iteration.
  • The input gate multiplies the results of a sigmoid and a tanh applied to the current input, then adds the result (element-wise) to the previous result operating on the recurred data.
  • The output gate applies a sigmoid to the new input and a tanh to the recurred input, then multiplies them to produce the output.
55
Q

Why do we use logits in deep neural networks?

A

Logits are used to gain insight into the confidence of a model. The logits are the raw, unnormalised values produced at the output layer. A characteristic of deep neural networks is that each possible result corresponds to one output node; normalising the output 'logits' (e.g. with softmax) gives us a probability/confidence value for each.

This also simplifies backpropagation and makes it easy to decide whether the model has found an answer using a decision boundary/threshold.
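A minimal softmax sketch, which is one common way to normalise logits into probabilities; the logit values and the decision threshold are made up:

```python
import math

def softmax(logits):
    """Normalise raw output-layer values into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]     # shift for numerical stability
    exps = [math.exp(z) for z in shifted]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.0, 0.1]        # hypothetical raw scores for 3 output nodes/classes
probs = softmax(logits)
print([round(p, 3) for p in probs])      # e.g. [0.659, 0.242, 0.099]
print(max(probs) > 0.5)                  # simple decision threshold on confidence
```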

56
Q

What is ‘attention’ in the context of deep neural networks/ transformers?

A
  • Attention is a technique that enables models to focus on specific parts of the input for processing.
  • Enhances the model’s ability to understand context and relationships in data.
  • Mimics human attention, providing “focus” to a neural network.
57
Q

What is the key difference between transformers and neural networks?

A

Self-Attention Mechanism: The core innovation of transformers is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when making predictions.

Encoder-Decoder Structure: The original transformer model consists of an encoder-decoder structure. The encoder processes the input sequence and generates a context, while the decoder uses this context to generate the output sequence.

Positional encoding: Which is a method to incorporate sequence order information in the model.

58
Q

What are common strategies used to manage attention in a transformer?

A

Multi-Head Attention: Enhances the model’s ability to focus on different positions.

Encoder-Decoder Attention: Helps the decoder focus on relevant parts of the input sequence.

Feed-Forward Neural Networks: Role in processing the output of the attention layers.

59
Q

What are generative adversarial networks (GANs)? Describe the architecture.

A

It is an architecture designed not for prediction but for generation: the input is random noise, and the output is based on what the network was trained on.
It uses a generator-discriminator architecture (each of which can be an ANN or a deep neural network). The generator takes the random-noise input and generates an output, usually an image. This image, alongside an example from the training data, is the input to the discriminator, which outputs a prediction of which image was generated by the generator. When the discriminator is right only about 50% of the time, the generator has converged. Note that the loss functions of the generator and discriminator are in competition.

60
Q

What is mode collapse?

A

Refers to when a GAN’s generator produces limited varieties of outputs - overfitting to training data.

61
Q

What is the solution to mode collapse?

A

Two approaches to solving this problem:
1. Introduce regularisation methods or alternative architectures (e.g. conditional GANs for more controlled generation).
2. StyleGAN-style architectures have also been explored. These introduce a more complex mapping from noise to data, allowing finer control over generated features. Developed by Nvidia and optimised for faces.
62
Q

What is the selling point of semi-supervised learning?

A

Semi-supervised learning uses a small set of labelled data alongside a large set of unlabelled data to learn.

63
Q

What are the four assumptions of semi-supervised learning? Give a brief explanation of each

A
  1. Manifold assumption: data points that share the same label, or that describe related objects, lie on the same manifold.
  2. Cluster assumption: data points that have the same label will form discrete clusters.
  3. Low-density assumption: the decision boundary should not pass through high-density areas of the input space.
  4. Smoothness assumption: if two samples x and x1 are close in the input space, their labels y and y1 should be the same.
64
Q

What is the difference between transductive and inductive semi-supervised approaches?

A

Transductive approaches aim to predict the labels of the specific unlabelled data points given during training. The model does not generalise to new, unseen data points but focuses only on the given unlabelled data.

Inductive approaches are what you intuitively think of: you train a model on the labelled data, use it to predict labels for the unlabelled data, and then add the newly labelled data to the training data. This approach is able to generalise to new, unseen data points.

65
Q

What are the two transductive approaches to semi-supervised learning covered in the slides?

A
  1. Support Vector Machines (SVMs)
  2. Random walks
66
Q

What are the inductive approaches to semi-supervised learning covered in the slides?

A
  1. Self-Training
  2. Co-Training
  3. Tri-Training
  4. Multi-View Learning
    These are all examples of ‘Wrapper Methods’
67
Q

How does the semi-supervised transductive approach: Random walks work?

A

Every labelled node walks N steps and marks each visited node with its classification. Each unlabelled node then counts the number of marks received from each classification and takes the most frequently occurring mark as its classification.

68
Q

How does the semi-supervised transductive approach: Support Vector Machines (SVM) work?

A

It is a discriminative learner, meaning it computes only the boundary and not the class of each point.
Support vectors are the data points closest to the boundary, which influence the boundary's position.
Transductive SVM is a graph-based label propagation algorithm:
data points are joined by links (nodes and edges), and labels can be propagated from one node to another based on factors such as link density.
In this case, labels are propagated to nearest neighbours and then the decision boundary is computed.

69
Q

How does the semi-supervised inductive approach: Self-Training work?

A

Self-Training: Train a model, classify instances, choose high-confidence examples, include them in the next training cycle, and continue until a stopping condition is met.

70
Q

How do the semi-supervised inductive approaches: Co-Training, Tri-Training, and Multi-View Learning work?

A

Co-Training: Uses two classifiers trained on different views to iteratively label and expand the training set.
Tri-Training: Extends co-training to three classifiers without requiring conditionally independent views.
Multi-View Learning: Generalizes the concept to multiple views, combining different feature sets to train models and improve predictions.

They are all ‘bag-esque’ methods

71
Q

What problem do inductive methods (in semi-supervised learning) suffer from? Briefly mention the solutions too

A

All inductive methods suffer from error propagation (a flaw in the dataset will be extrapolated to the new data as well)

Solutions to error prop:
1. Constraints: Have a mechanism that penalises the learner when it makes mistakes (Drury and Torgo 2011)
2. Reducing Weights of Pseudo Labels : The new instances that are selected have less influence on the model (Ribeiro et al, 2019)

72
Q

What is Natural Language Processing? What architectures are commonly used in NLP?

A

A field of AI which deals with interpreting and processing natural language. A common approach is to represent the input text as a graph of dependencies/relationships between words, essentially turning this into a graph problem.

Graph neural networks (GNN)
- CNN that works on graphs.
- Input is the dependency graph from the parsed text.
- Each node aggregates from its neighbors
- Classification as output

Transformers

RNN

73
Q

What two inputs are commonly computed within a NLP architecture? Brief definition of each

A
  1. POS tags: language is probabilistic, and there are methods to compute what type of word is likely to follow another type of word. These word types are nouns, verbs, adjectives, adverbs, etc.
  2. Document representations: these are wide-ranging; they contain all the words from the input text, but various methods change the grouping according to some attribute, e.g. bag of words (individual words, order ignored) or dependency trees (sequences of words that have dependencies).
74
Q

What are the two major algorithms used to compute these POS tags in NLP? Briefly explain each

A

Hidden Markov Model (HMM)
The Hidden Markov Model (HMM) is a statistical model that can be used for POS tagging. It treats POS tagging as a sequence labeling problem, where the goal is to find the most likely sequence of tags for a given sequence of words.

Viterbi Algorithm
The Viterbi Algorithm is a dynamic programming algorithm used to find the most probable sequence of hidden states (POS tags) given a sequence of observations (words).
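A tiny Viterbi sketch over two made-up tags; the example sentence and all probabilities are fabricated purely for illustration, not taken from the slides:

```python
# Most probable tag sequence for a 3-word sentence, via dynamic programming.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1, "loudly": 0.4},
          "VERB": {"dogs": 0.1, "bark": 0.7, "loudly": 0.2}}

def viterbi(words):
    # V[t][s] = (probability of the best path ending in state s at step t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][words[t]], p) for p in states
            )
            V[t][s] = (prob, prev)
    # trace back the best path from the most probable final state
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, V[t][path[0]][1])
    return path

print(viterbi(["dogs", "bark", "loudly"]))   # -> ['NOUN', 'VERB', 'NOUN']
```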

75
Q

What are the 5 types of document representation used in NLP? Briefly explain each

A
  1. Bag of Words: split text into individual words and remove stop-words. Loses semantic relationships between words and phrases (see the sketch after this list).
  2. Bigrams / Trigrams: sequences of two or three words, such as 'Computer Science' or 'Bag of Words'.
  3. Dependency Trees: sequences of words that have dependencies.
  4. Word Vectors: bag of words where each word is represented as a vector.
  5. Document Vectors: represent the whole document as a vector of words, or as a vector of vectors (tensors).
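A small sketch of the first two representations (bag of words and bigrams); the stop-word list and example sentence are made up:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}

def bag_of_words(text):
    """Split text into individual words, drop stop-words, count occurrences."""
    words = [w.strip(".,").lower() for w in text.split()]
    return Counter(w for w in words if w not in STOP_WORDS)

def bigrams(text):
    """Sequences of two consecutive words, e.g. 'computer science'."""
    words = [w.strip(".,").lower() for w in text.split()]
    return list(zip(words, words[1:]))

doc = "The bag of words model is a bag of word counts."
print(bag_of_words(doc))
print(bigrams(doc))
```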