Big data Flashcards
What is QSAR?
Quantitative Structure–Activity Relationship (QSAR) is a modeling technique used to predict the biological activity of a molecule.
It relates the structure of a molecule to numeric values that can describe almost any molecular property.
The basic idea of QSAR is to describe the structure of molecules mathematically and then use machine learning to predict some property of interest. The machine learning algorithm looks at all the fingerprints, and molecules with similar fingerprints get similar predicted values.
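A minimal sketch of this idea, assuming scikit-learn is available; the fingerprints and activity values below are made up for illustration (in practice the bit vectors would be computed from the molecular structures):

```python
# Minimal QSAR sketch (assumes scikit-learn is installed); toy data only.
from sklearn.ensemble import RandomForestRegressor

X = [
    [1, 0, 1, 1, 0, 0, 1, 0],   # fingerprint of molecule A
    [1, 0, 1, 0, 0, 0, 1, 0],   # molecule B, similar to A
    [0, 1, 0, 0, 1, 1, 0, 1],   # molecule C, different from A and B
]
y = [5.2, 5.0, 1.3]             # measured activity of each molecule

# Fit a machine learning model that maps fingerprints to activity
model = RandomForestRegressor(random_state=0).fit(X, y)

# A molecule with a fingerprint similar to A and B gets a similar predicted value
print(model.predict([[1, 0, 1, 1, 0, 0, 0, 0]]))
```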
What is a molecular descriptor and molecular fingerprint? Give examples
A numerical representation of a molecule derived from its symbolic representation. The goal is to create numerical vectors that capture the structural features of molecules. Similar molecules will therefore get similar vectors. The vector is called a molecular fingerprint.
An example of a simple molecular descriptor is just counting things such as atoms, pairwise distances, etc.
A more complex descriptor is a Morgan fingerprint.
Describe the Morgan fingerprint
The Morgan fingerprint is perhaps the most commonly used fingerprint and they are derived using the Morgan algorithm. It is created by describing the neighbourhood of each atom out to a certain radius, and then hashing (and sometimes folding) this down to a bit or count vector of a fixed length.
When you generate a Morgan fingerprint for a molecule, you end up with a binary vector, where each element (or bit) in the vector represents the presence or absence of a particular structural feature or environment within the molecule. So each position is a feature and the binary values [0,1] tells you if the feature is present or not.
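As a sketch, this is how a Morgan fingerprint can be generated with RDKit (assuming RDKit is installed; the molecule is just an example):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, as an example molecule

# radius = how far out each atom's neighbourhood is described,
# nBits = fixed length the hashed features are folded into
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits are set")
```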
What does it mean to use folding on a molecular fingerprint?
Folding a fingerprint is a way of reducing its dimensionality. You divide the fingerprint in half and combine the two halves using a logical OR.
This can for example be used to reduce a Morgan fingerprint down to its fixed length.
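A small sketch of folding with NumPy (the 8-bit fingerprint is made up):

```python
import numpy as np

fp = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # toy 8-bit fingerprint

# Fold: split in half and combine the halves with a logical OR
first_half, second_half = fp[:4], fp[4:]
folded = np.logical_or(first_half, second_half).astype(int)
print(folded)   # [1 1 0 1], half the original length
```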
What is Uppmax?
UPPMAX is a high-performance computing cluster that consists of both login nodes and
compute nodes as well as shared storage. It can be used to run computations that require a lot
of memory or that require multiple nodes (as is common when working with genomics data).
What did we do in assignment 1?
In the NGS assignment, we used Bowtie2 to align reads from next-generation sequencing to a
reference genome. This was first done for a bacterial genome (in an interactive node on
UPPMAX) and then for a larger genome (using a batch script and submitting it to the job queue
at UPPMAX via SLURM).
What did we do in assignment 2?
In assignment 2 we ran a Nextflow pipeline for part 1, and for part 2 we wrote a Nextflow pipeline consisting of 4 processes that preprocess mass spectrometry data using the open-source software collection OpenMS.
Why do we use 3 splits of the data instead of 2 in deep learning?
We always try to build a model and then evaluate it on data it has never seen before. This is the first split, training vs test, which you have probably seen before.
In deep learning we can use that split, but it is much more common to do a training/validation/test split.
The extra validation set lets us "test" the model as we train it. Based on that we can see trends in how the model learns and use them to improve the model during or at the end of training. Early stopping is one such example: the model that performed best on the validation set, which it has not been trained on, is chosen. It is then evaluated on the test set.
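A minimal sketch of a train/validation/test split using scikit-learn (the toy dataset and the 70/15/15 proportions are just examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(100, 1), np.arange(100)   # toy dataset

# First split off the test set, then split the remainder into train and validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, random_state=0)

# Train on the training set, monitor (e.g. early stopping) on the validation set,
# and report the final performance on the test set only once.
print(len(X_train), len(X_val), len(X_test))   # roughly 70 / 15 / 15
```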
Difference between batches and epochs when training your data?
A batch is the set of training examples that are forward propagated before we backpropagate. This lets us pick up overall trends in how we should adjust our parameters (weights and biases). If we backpropagated after every single forward propagation it would both take longer to train the network and make the model fit more to individual data points instead of overall trends.
Batch size is how many examples (e.g. images) we have in a batch.
Epoch: when all the training data has been forward propagated and backpropagated once, we have trained for one epoch. So if we have 10 batches per epoch, then we have run 2 epochs after 20 batches. More epochs means more training.
or
if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.
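In code, the relationship between dataset size, batch size, and iterations per epoch is just:

```python
import math

n_examples = 1000
batch_size = 100

iterations_per_epoch = math.ceil(n_examples / batch_size)
print(iterations_per_epoch)       # 10 iterations = 1 epoch
print(2 * iterations_per_epoch)   # after 20 iterations we have run 2 epochs
```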
What is forward propagation?
Forward propagation involves passing input data through the neural network to generate predictions or outputs. During this process, the input data is sequentially transformed as it propagates through the network’s layers, ultimately producing an output.
In deep learning, what are our parameters and what is x?
Weights and bias are parameters and x are our input values.
What is an activation function?
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Is deep learning without bias?
No. We are picking the data going into the model and that data is somehow going to be biased.
What is a perceptron?
A linear model + an activation function. A model that we can train.
What is depth and width in deep learning?
Depth = how many layers we have
Width = how many neurons we have in each layer
What is a multilayer perceptron?
Multilayer perceptron = more than one layer of neurons, sometimes referred to as ANN.
Difference between ANN and CNN in machine learning?
An Artificial Neural Network (ANN) is a group of multiple perceptrons or neurons at each layer. An ANN is also known as a feed-forward neural network because inputs are processed only in the forward direction.
This type of network is one of the simplest variants of neural networks. Information is passed in one direction, through the input nodes, until it reaches the output node. The network may or may not have hidden node layers, which affects how interpretable its functioning is.
Convolutional neural networks (CNNs) are one of the most popular models used today. This computational model uses a variation of multilayer perceptrons and contains one or more convolutional layers that can be either entirely connected or pooled. These convolutional layers create feature maps that record a region of the image, which is ultimately broken into rectangles and sent on for non-linear processing.
What is deep learning and why deep learning?
Deep learning is a subset of machine learning.
- High accuracies
- Adaptable
- Fast prediction times - long training times though
- Get rid of bias - we cannot lose all bias
Explain the definition of AI, machine learning and deep learning.
Artificial intelligence – trying to make computers do what the human brain can.
- Machine learning – algorithms that have the ability to learn without being explicitly programmed. Computer systems learn from data that represents experiences. The objective is to learn a target function (model) that can be used to predict the value or label of a future observation.
- Deep learning – a subset of machine learning in which neural networks adapt and learn from data.
Deep Learning == Training a Deep Neural Network
Explain the definitions:
Neural network
Neuron
Perceptron
Deep Learning == Training a Deep Neural Network
Neural network == Network of neurons
Neuron == Perceptron
Perceptron : linear model + activation function.
The perceptron has a linear function into which we put our input values; the weights and bias are our ONLY parameters. These parameters will change during training. The activation function then adds non-linearity to capture complex patterns in the data, and it also adds the option of not passing the output on.
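A minimal sketch of a single perceptron in NumPy (the input values, weights, and bias are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # input values
w = np.array([0.1, 0.4, -0.2])   # weights (parameters)
b = 0.3                          # bias (parameter)

z = np.dot(w, x) + b             # linear model
output = sigmoid(z)              # activation function adds non-linearity
print(output)
```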
Explain the sigmoid activation function
A logistic function that goes from 0 to 1.
The sigmoid function maps any real-valued input to a value between 0 and 1. As the input x becomes increasingly negative, the sigmoid function approaches 0 but never quite reaches it. Similarly, as the input x becomes increasingly positive, the sigmoid function approaches 1 but never quite reaches it.
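In formula form: sigmoid(x) = 1 / (1 + e^(-x)), which gives sigmoid(0) = 0.5 and approaches 0 and 1 in the negative and positive limits respectively.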
What is backward propagation?
Backward propagation, also known as backpropagation, is used to calculate the gradients of the loss function with respect to the weights and biases of the neural network. These gradients are then used to update the network’s parameters during the optimization process.
Explain the process of forward pass and backward pass.
Forward Pass:
Input data x is fed into the input layer.
Each neuron in the input layer computes a weighted sum of its inputs, adds a bias term, and applies an activation function, producing an output.
The outputs of the input layer neurons become the inputs to neurons in the next layer. This process continues through each layer, with each layer transforming the input from the previous layer until reaching the output layer.
Loss Calculation:
The output of the network is compared to the ground truth (actual targets) using a loss function, which quantifies how well the network’s predictions match the true values.
The loss function provides a single scalar value representing the discrepancy between the predicted outputs and the true targets.
Backward Pass (Backpropagation):
The gradient of the loss function with respect to each parameter (weights and biases) in the network is computed using derivatives.
The gradient indicates how much the loss function would change if each parameter were adjusted slightly.
By computing these gradients backward through the network, starting from the output layer and moving backward, we determine how sensitive the loss function is to changes in each parameter.
These gradients guide the optimization process by indicating the direction and magnitude of parameter updates needed to minimize the loss function.
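A minimal sketch of one forward pass, loss calculation, and backward pass for a single sigmoid neuron, assuming a squared-error loss (all numbers are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])   # input
w = np.array([0.2, 0.4])    # weights
b = 0.1                     # bias
target = 1.0                # ground truth

# Forward pass
z = np.dot(w, x) + b
y_pred = sigmoid(z)

# Loss calculation (squared error)
loss = (y_pred - target) ** 2

# Backward pass: chain rule gives the gradient of the loss w.r.t. each parameter
dloss_dy = 2 * (y_pred - target)
dy_dz = y_pred * (1 - y_pred)     # derivative of the sigmoid
dloss_dw = dloss_dy * dy_dz * x   # gradient w.r.t. the weights
dloss_db = dloss_dy * dy_dz       # gradient w.r.t. the bias

# Parameter update (one gradient descent step with learning rate 0.1)
w -= 0.1 * dloss_dw
b -= 0.1 * dloss_db
print(loss, w, b)
```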
Why does the number of parameters change through the different layers of an MLP?
The number of parameters depends on the number of neurons in the present layer as well as the number of neurons in the previous layer.
Say the input is an image of 32x32 pixels; flattened, the input into the first layer is 32x32 = 1024 values, and each neuron also has 1 bias.
If those inputs go into 7 neurons, the number of parameters in that layer will be (1024 + 1) × 7 = 7175. From that layer we get 7 outputs that go into the next layer of 5 neurons; the number of parameters there will be (7 + 1) × 5 = 40.
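A sketch reproducing these parameter counts, assuming TensorFlow/Keras is available:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(32, 32)),
    keras.layers.Flatten(),                    # 32x32 pixels -> 1024 values
    keras.layers.Dense(7, activation="relu"),  # (1024 + 1) * 7 = 7175 parameters
    keras.layers.Dense(5, activation="relu"),  # (7 + 1) * 5 = 40 parameters
])
model.summary()
```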
What is the loss function?
The loss function compares the prediction to the target. There are many variations of it.
What is the main goal of backpropagation?
Backpropagation: adjusting the network based on how wrong the prediction was.
Each neuron is updated in proportion to how much it contributed to the loss of the next layer.
Each parameter is updated in proportion to how much it contributed to the neuron being wrong.
Why can’t we just use the test set to stop training?
Because we would then use up the unseen data and there would be nothing left for the real test.
What is the similarity between logistic regression and ANN models?
Logistic regression uses a curve as its activation function to predict a value, and ANN architectures use the same kind of curves, but many times and in different layers.
What is the general architecture of an ANN model?
Commonly used for classifications.
- An input layer that takes the input
- Layer that flattens the input into a vector.
- Hidden layers with multiple perceptrons
- Output layers with perceptrons
Input is processed only forward.
Why do we train the models in machine learning?
To get better than random chance at a specific task.
What is the right batch size?
Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small will introduce a high degree of variance (noisiness) within each batch as it is unlikely that a small sample is a good representation of the entire dataset. Conversely, if a batch size is too large, it may not fit in memory of the compute instance used for training and it will have the tendency to overfit the data.
What is convolution and filter maps?
During a convolution we move a filter across the image, doing some basic math on all the numbers the filter touches and putting the result into a new square (pixel) in our new "image". The resulting image is called a filter map, and we do this to extract features in images.
Kernel = filter = feature detector: a sliding window of predetermined size that moves across the original image and calculates new pixel values, producing the filter map.
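A naive sketch of a convolution in NumPy (the "image" and kernel are made up; stride 1 and no padding, so the filter map is smaller than the input):

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)       # toy 3x3 filter (edge detector)

out_size = image.shape[0] - kernel.shape[0] + 1    # 5 - 3 + 1 = 3
filter_map = np.zeros((out_size, out_size))

# Slide the kernel over the image and sum the element-wise products
for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i + 3, j:j + 3]
        filter_map[i, j] = np.sum(patch * kernel)

print(filter_map)   # 3x3 filter map of extracted features
```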
What is strides and padding in CNNs?
The stride is the number of pixels the kernel moves at each step across the original image.
Padding is what we do so that the corners of the image have as much influence on the output as the middle of the image. We add numbers (usually zeros) around the image. This is also done so that the filter map has the same size as the input image.
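With input width W, kernel size K, padding P and stride S, the filter map width is (W - K + 2P) / S + 1. For example, a 5x5 image with a 3x3 kernel, stride 1 and padding 1 gives (5 - 3 + 2) / 1 + 1 = 5, i.e. the same size as the input.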
What is pooling in CNNs?
If we keep the output the same size as the input but generate more filter map outputs than we have inputs, we quickly get a large amount of data, i.e. a lot of math to do, which means training takes longer and requires more data.
In these filter maps there may be many pixels that do not add more information. To concentrate the information we use pooling. Pooling uses another matrix that moves across the filter map, but the stride is always the same as the matrix size so no pixel is touched twice.
AveragePooling vs MaxPooling?
We set a pooling window size; within that window, max pooling picks the highest value and writes it to the corresponding position in the new (smaller) matrix. Average pooling instead takes the average of all values in the window.
Pooling reduces the signal so that we look at the image more generally.
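A small sketch of 2x2 max pooling and average pooling in NumPy (the filter map values are made up):

```python
import numpy as np

filter_map = np.array([[1, 3, 2, 0],
                       [5, 6, 1, 2],
                       [0, 2, 4, 4],
                       [3, 1, 0, 8]], dtype=float)

pooled_max = np.zeros((2, 2))
pooled_avg = np.zeros((2, 2))

# 2x2 window moving with stride 2, so no pixel is used twice
for i in range(2):
    for j in range(2):
        window = filter_map[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled_max[i, j] = window.max()
        pooled_avg[i, j] = window.mean()

print(pooled_max)   # [[6. 2.] [3. 8.]]
print(pooled_avg)   # [[3.75 1.25] [1.5  4.  ]]
```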
What is the general architecture of CNNs?
CNNs are commonly used for image recognition and have:
- input layer
- one or more convolution layers
- pooling layer
- flatten layer
- MLP
- output
The convolution layers can either be fully connected or pooled.
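A minimal sketch of such an architecture, assuming TensorFlow/Keras and 32x32 RGB input images (the layer sizes are arbitrary):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),            # input layer
    keras.layers.Conv2D(16, (3, 3), padding="same",
                        activation="relu"),           # convolution layer
    keras.layers.MaxPooling2D((2, 2)),                # pooling layer
    keras.layers.Flatten(),                           # flatten layer
    keras.layers.Dense(32, activation="relu"),        # MLP part
    keras.layers.Dense(10, activation="softmax"),     # output layer
])
model.summary()
```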
What is data Augmentation?
A problem in life sciences is that we usually do not have big data sets.
This means that we run higher risk of overfitting our models because there is not enough variety.
A way of solving this is data augmentation where we randomly change the input to reduce the risk of overfitting our data.
These transformations are typically designed to preserve the underlying characteristics of the data while introducing variability.
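A sketch of random image augmentation with Keras preprocessing layers, assuming a recent TensorFlow/Keras; the specific transformations and settings are only examples:

```python
from tensorflow import keras

# Random flips, rotations and zooms preserve what the image depicts while
# introducing variability, reducing the risk of overfitting on small datasets.
augment = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
])
# augmented = augment(images, training=True)   # applied to a batch of images
```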
What is the Dropout function doing?
Dropout is a form of regularization that helps prevent overfitting by randomly setting a fraction of input units to zero during training. For example, Dropout(0.2) means that during training, 20% of the units in the previous layer will be randomly set to zero. This forces the network to learn more robust features and prevents it from becoming overly reliant on specific activations.
By randomly dropping out units during training, Dropout helps to reduce the interdependency between neurons, making the network more resilient and less likely to overfit to the training data. At test time, Dropout is typically turned off, and the full network is used for making predictions.
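A sketch of where Dropout(0.2) would sit in a Keras model (assuming TensorFlow/Keras; the layer sizes are arbitrary):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(1024,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.2),   # zeroes 20% of the previous layer's units during training
    keras.layers.Dense(10, activation="softmax"),
])
# Dropout is only active during training; model.predict() uses the full network.
model.summary()
```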