Big data Flashcards

1
Q

What is QSAR?

A

Quantitative Structure–Activity Relationship (QSAR) is a modelling technique used to predict the biological activity of a molecule.

It relates the structure of a molecule to numeric values that can describe almost any molecular property.

The basic idea of QSAR is to mathematically describe the structure of molecules and then use machine learning to predict some property of interest. The machine learning algorithm looks at all the fingerprints, and molecules with similar fingerprints will get similar predicted values.
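
A minimal sketch of that workflow in Python, assuming RDKit and scikit-learn are installed; the SMILES strings and activity values below are made-up placeholders:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "CCN", "CCC", "CCCl"]       # placeholder molecules
activity = np.array([1.2, 0.9, 0.4, 1.5])    # placeholder measured property

def fingerprint(smi, radius=2, n_bits=2048):
    # numerical description of the structure: a Morgan bit-vector fingerprint
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array([int(b) for b in fp.ToBitString()])

X = np.array([fingerprint(s) for s in smiles])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, activity)
print(model.predict([fingerprint("CCBr")]))  # predicted property for a new molecule
```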

2
Q

What is a molecular descriptor and molecular fingerprint? Give examples

A

A numerical representation of a molecule derived from its symbolic representation. The goal is to create numerical vectors that capture the structural features of molecules. Similar molecules will therefore get similar vectors. The vector is called a molecular fingerprint.

An example of a simple molecular descriptor is a count, for example of atoms of each type, pairwise distances, etc.

A more complex descriptor is a Morgan fingerprint.

3
Q

Describe the Morgan fingerprint

A

The Morgan fingerprint is perhaps the most commonly used fingerprint; it is derived using the Morgan algorithm. It is created by describing the neighbourhood of each atom out to a certain radius, and then hashing (and sometimes folding) this down to a bit or count vector of a fixed length.

When you generate a Morgan fingerprint for a molecule, you end up with a binary vector, where each element (or bit) in the vector represents the presence or absence of a particular structural feature or environment within the molecule. So each position is a feature and the binary values [0,1] tells you if the feature is present or not.
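
A small RDKit illustration (assuming RDKit is installed; the molecule is an arbitrary example):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, arbitrary example
# radius controls how far each atom's neighbourhood reaches,
# nBits is the fixed length of the final bit vector
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
print(fp.GetNumBits())                  # 1024 positions, one per hashed feature
print(list(fp.GetOnBits()))             # indices of features present in this molecule
```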

4
Q

What does it mean to use folding on a molecular fingerprint?

A

Folding fingerprints is a way of reducing the dimensionality of a fingerprint. You divide the fingerprint in half and combine the two halves using a logical OR.

This can, for example, be used to reduce a Morgan fingerprint down to a fixed length.
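
A minimal Python sketch of folding a toy bit vector in half with a logical OR:

```python
import numpy as np

fp = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # toy 8-bit fingerprint

def fold(bits):
    half = len(bits) // 2
    # element-wise OR of the two halves -> half the original length
    return np.logical_or(bits[:half], bits[half:]).astype(int)

print(fold(fp))   # [1 1 1 1]
```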

5
Q

What is Uppmax?

A

UPPMAX is a high-performance computing cluster that consists of both login nodes and
compute nodes as well as shared storage. It can be used to run computations that require a lot
of memory or that require multiple nodes (as is common when working with genomics data).

6
Q

What did we do in assignment 1?

A

In the NGS assignment, we used Bowtie2 to align reads from next-generation sequencing to a reference genome. This was first done for a bacterial genome (on an interactive node on UPPMAX) and then for a larger genome (using a batch script submitted to the job queue at UPPMAX via SLURM).

7
Q

What did we do in assignment 2?

A

In assignment 2, we ran an existing Nextflow pipeline for part 1; for part 2 we wrote a Nextflow pipeline consisting of 4 processes that preprocess mass spectrometry data using the open-source software collection OpenMS.

8
Q

Why do we use 3 splits of the data instead of 2 in deep learning?

A

So what we always try to do is make a model and then evaluate it on data it has never seen before. This is the first split, training vs test, which you’ve probably seen a lot before.

In Deep Learning we can use that split, but it is much, much more common to do a training, validation, test split.

The extra validation set lets us “test” the model as we train it. Based on that we can see trends in how the model learns and use them to improve the model during or at the end of training. Early stopping is one such example: the model that performed best on the validation set, which the model has not been trained on, is chosen. It is then evaluated on the test set.
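
One common way to get the three splits, sketched with scikit-learn (the toy data here is just a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)   # toy data

# first carve off a held-out test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ... then split what remains into training and validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# -> 60% train, 20% validation, 20% test
```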

9
Q

Difference between batches and epochs when training your data?

A

A batch is the set of training examples that are forward propagated before we backpropagate. This lets us pick up overall trends in how we should adjust our parameters (weights and biases). If we backpropagated after every single forward propagation, it would both take longer to train the network and make the model fit more to individual data points instead of overall trends.

Batch size is how many examples (e.g. images) we have in a batch.

Epoch: when all the training data has been forward propagated and backpropagated, we have trained for one epoch. So if we have 10 batches per epoch, then we have run 2 epochs after 20 batches. More epochs == more training.

or

if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.
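
The same arithmetic spelled out:

```python
dataset_size = 1000
batch_size = 100
iterations_per_epoch = dataset_size // batch_size   # 10 batches per epoch
epochs = 2
total_iterations = epochs * iterations_per_epoch    # 20 batches -> 2 epochs
print(iterations_per_epoch, total_iterations)
```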

10
Q

What is forward propagation?

A

Forward propagation involves passing input data through the neural network to generate predictions or outputs. During this process, the input data is sequentially transformed as it propagates through the network’s layers, ultimately producing an output.

11
Q

In deep learning, what are our parameters and what is x?

A

Weights and biases are the parameters, and x is the input values (the data).

12
Q

What is an activation function?

A

A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

13
Q

Is deep learning without bias?

A

No. We are picking the data going into the model and that data is somehow going to be biased.

14
Q

What is a perceptron?

A

A linear model + an activation function. A model that we can train.

15
Q

What is depth and width in deep learning?

A

Depth = how many layers we have

Width = how many neurons we have in a layer

16
Q

What is a multilayer perceptron?

A

Multilayer perceptron = more than one layer of neurons, sometimes referred to as ANN.

17
Q

Difference between ANN and CNN in machine learning?

A

An Artificial Neural Network (ANN) is a group of multiple perceptrons or neurons at each layer. An ANN is also known as a feed-forward neural network because inputs are processed only in the forward direction.
This type of neural network is one of the simplest variants. It passes information in one direction, through the input nodes, until it reaches the output node. The network may or may not have hidden layers, and its functioning is relatively interpretable.

Convolutional neural networks (CNNs) are among the most popular models used today. This computational model uses a variation of multilayer perceptrons and contains one or more convolutional layers that can be either entirely connected or pooled. These convolutional layers create feature maps that record regions of the image, which are ultimately broken into rectangles and sent on for non-linear processing.

18
Q

What is deep learning and why deep learning?

A

Deep learning is a subset of machine learning. Why use it:

  • High accuracy
  • Adaptable
  • Fast prediction times (though long training times)
  • Reduced bias - although we can never remove all bias
19
Q

Explain the definition of AI, machine learning and deep learning.

A

Artificial intelligence – trying to make computers do what the human brain can.

  • Machine learning – algorithms that have the ability to learn without being explicitly programmed. Computer systems learn from data that represents experiences. The objective is to learn a target function (model) that can be used to predict the value or label of a future observation.
  • Deep learning – a subset of machine learning in which neural networks adapt and learn from data.

Deep Learning == Training a Deep Neural Network

20
Q

Explain the definitions:

Neural network
Neuron
Perceptron

A

Deep Learning == Training a Deep Neural Network
Neural network == Network of neurons
Neuron == Perceptron
Perceptron : linear model + activation function.

The perceptron has a linear function into which we put our input values; the weights and bias are our ONLY parameters, and these parameters will change during training. The activation function then adds non-linearity to capture complex patterns in the data, and it also adds the option of not passing the output on.
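
A single perceptron written out as a minimal numpy sketch (toy numbers):

```python
import numpy as np

def sigmoid(z):                      # one possible activation function
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])       # input values
w = np.array([0.1, 0.4, -0.2])       # weights (parameters)
b = 0.3                              # bias (parameter)

output = sigmoid(np.dot(w, x) + b)   # linear model + activation function
print(output)
```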

21
Q

Explain the sigmoid activation function

A

A logistic function that goes from 0 to 1: σ(x) = 1 / (1 + e^(−x)).

The sigmoid function maps any real-valued input to a value between 0 and 1. As the input x becomes increasingly negative, the sigmoid function approaches 0 but never quite reaches it. Similarly, as x becomes increasingly positive, it approaches 1 but never quite reaches it.

22
Q

What is backward propagation?

A

Backward propagation, also known as backpropagation, is used to calculate the gradients of the loss function with respect to the weights and biases of the neural network. These gradients are then used to update the network’s parameters during the optimization process.

23
Q

Explain the process of forward pass and backward pass.

A

Forward Pass:
Input data x is fed into the input layer.
Each neuron in the input layer computes a weighted sum of its inputs, adds a bias term, and applies an activation function, producing an output.
The outputs of the input layer neurons become the inputs to neurons in the next layer. This process continues through each layer, with each layer transforming the input from the previous layer until reaching the output layer.

Loss Calculation:
The output of the network is compared to the ground truth (actual targets) using a loss function, which quantifies how well the network’s predictions match the true values.

The loss function provides a single scalar value representing the discrepancy between the predicted outputs and the true targets.

Backward Pass (Backpropagation):
The gradient of the loss function with respect to each parameter (weights and biases) in the network is computed using derivatives.
The gradient indicates how much the loss function would change if each parameter were adjusted slightly.
By computing these gradients backward through the network, starting from the output layer and moving backward, we determine how sensitive the loss function is to changes in each parameter.

These gradients guide the optimization process by indicating the direction and magnitude of parameter updates needed to minimize the loss function.
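
A minimal numpy sketch of one forward pass, a squared-error loss and the backward pass for a single sigmoid neuron (toy numbers, not any particular network from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])        # input
y_true = 1.0                    # target
w = np.array([0.1, -0.3])       # weights
b = 0.0                         # bias

# forward pass
z = np.dot(w, x) + b
y_pred = sigmoid(z)

# loss calculation (squared error)
loss = 0.5 * (y_pred - y_true) ** 2

# backward pass: chain rule gives d(loss)/dw and d(loss)/db
dloss_dpred = y_pred - y_true
dpred_dz = y_pred * (1.0 - y_pred)   # derivative of the sigmoid
grad_w = dloss_dpred * dpred_dz * x
grad_b = dloss_dpred * dpred_dz

# parameter update (gradient descent)
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)
```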

24
Q

Why does the number of parameters change through the different layers of an MLP?

A

The number of parameters depends on the number of neurons in the present layer as well as the number of neurons in the previous layer.

Say that the input is an image of 32×32 pixels; flattened, that gives 32×32 = 1024 input values.

If those values go into a layer of 7 neurons, each neuron has 1024 weights plus 1 bias, so that layer has (1024+1)×7 = 7175 parameters. From that layer we get 7 outputs that go into the next layer of 5 neurons, so the number of parameters there is (7+1)×5 = 40.
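
The counting rule as a small helper, using the layer sizes from the example above:

```python
def dense_params(n_in, n_out):
    # each of the n_out neurons has n_in weights plus 1 bias
    return (n_in + 1) * n_out

print(dense_params(32 * 32, 7))   # 7175 parameters in the first layer
print(dense_params(7, 5))         # 40 parameters in the next layer
```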

25
Q

What is the loss function?

A

The loss function compares the prediction to the target and quantifies how wrong the prediction was. There are many variations of it.

26
Q

What is the main goal of backpropagation?

A

Backpropagation: adjusting the network based on how wrong the prediction was.

Each neuron is updated in proportion to how much it contributed to the loss of the next layer.

Each parameter is updated in proportion to how much it contributed to its neuron being wrong.

27
Q

Why can’t we just use the test set to stop training?

A

Because we would then use up the unseen data and there would be nothing left for the real test.

28
Q

What is the similarity between logistic regression and ANN models?

A

Logistic regression uses the curve of the activation (logistic) function to predict a value, and ANN architectures use the same kind of curves, but many times and in different layers.

29
Q

What is the general architecture of an ANN model?

A

Commonly used for classification.

  • An input layer that takes the input
  • A layer that flattens the input into a vector
  • Hidden layers with multiple perceptrons
  • An output layer with perceptrons

Input is processed only forward.

30
Q

Why do we train the models in machine learning?

A

To get better than random chance at a specific task.

31
Q

What is the right batch size?

A

Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small will introduce a high degree of variance (noisiness) within each batch, as it is unlikely that a small sample is a good representation of the entire dataset. Conversely, if the batch size is too large it may not fit in the memory of the compute instance used for training, and it will have a tendency to overfit the data.

32
Q

What is convolution and filter maps?

A

During a convolution we move a filter across the image, doing some basic math on all the numbers the filter touches and putting the result into a new square (pixel) in our new “image”. The resulting image is called a filter map, and we do this to extract features in images.

Kernel = filter = feature detector: a sliding window of predetermined size that moves across the original image and calculates new values pixel by pixel –> filter map.
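
A bare-bones 2D convolution in numpy (no padding, stride 1), just to show the sliding-window idea:

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1            # output height (no padding, stride 1)
    ow = image.shape[1] - kw + 1            # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply the numbers the filter touches and sum them up
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out                              # this is the filter (feature) map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # toy edge-like filter
print(convolve2d(image, kernel))
```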

33
Q

What is strides and padding in CNNs?

A

Stride is the number of pixels the kernel moves at each step across the original image.

Padding means adding numbers around the edge of the image so that the corners have as much influence on the output as the middle of the image. It is also done so that the filter map will have the same size as the input image.

34
Q

What is pooling in CNNs?

A

If we keep the output the same size as the input but generate more filter maps than we have inputs, we quickly get a large amount of data, i.e. a lot of math to do, which means training takes longer and requires more data.

In these filter maps there may be many pixels that do not add more information. To concentrate the information we use pooling. This is another window that moves across the filter map, but the stride is always equal to the window size so that no pixel is touched twice.

35
Q

AveragePooling vs MaxPooling?

A

We set a pooling window size; within that window, max pooling picks the highest value and writes it to the corresponding position in the new, smaller matrix. Average pooling instead takes the average of all values in the window.

Pooling condenses the signal so that we look at the image more generally.
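
Max vs average pooling on a toy 4×4 filter map, with a 2×2 window and stride 2 (numpy sketch):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)

def pool(fm, size=2, mode="max"):
    oh, ow = fm.shape[0] // size, fm.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fm[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

print(pool(fmap, mode="max"))       # [[4. 2.] [2. 8.]]
print(pool(fmap, mode="average"))   # [[2.5  1.  ] [1.25 6.5 ]]
```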

36
Q

What is the general architecture of CNNs?

A

CNNs are commonly used for image recognition and have:

  • an input layer
  • one or more convolution layers
  • pooling layers
  • a flatten layer
  • an MLP
  • an output layer

The convolution layers can either be fully connected or pooled.

37
Q

What is data Augmentation?

A

A problem in life sciences is that we usually do not have big data sets.

This means that we run higher risk of overfitting our models because there is not enough variety.

A way of solving this is data augmentation where we randomly change the input to reduce the risk of overfitting our data.

These transformations are typically designed to preserve the underlying characteristics of the data while introducing variability.

38
Q

What is the Dropout function doing?

A

Dropout is a form of regularization that helps prevent overfitting by randomly setting a fraction of input units to zero during training. For example, Dropout(0.2) means that during training, 20% of the units in the previous layer will be randomly set to zero. This forces the network to learn more robust features and prevents it from becoming overly reliant on specific activations.

By randomly dropping out units during training, Dropout helps to reduce the interdependency between neurons, making the network more resilient and less likely to overfit to the training data. At test time, Dropout is typically turned off, and the full network is used for making predictions.
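
A hedged Keras sketch of where a Dropout(0.2) layer typically sits (the layer sizes are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dropout(0.2),   # 20% of the previous layer's units zeroed each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# Dropout is active during model.fit(...) and automatically disabled during model.predict(...)
```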

39
Q

What are the 5 Vs in big data?

A

Volume: massive amount of data generated
Velocity: data is generated very fast
Variety: data is complex and unstructured
Veracity: quality of the data
Value: can it be turned into something useful?

40
Q

Explain the concept of working in batches with big data? When is it common to work in batches and when is it not?

A

The analysis of the data is separated from the generation of data. The data needs to be stored somewhere before we do the analysis later.

This is a very common approach for working with sequencing data where we might sequence some DNA, store it in a file and then map it back to a reference later.

When we work with, for example, image analysis, it may not be possible to store the data before analysis due to the massive amounts generated from the images. It is then more common to do the analysis right after data generation.

It is also hard to work in batches when we have a continuous flow of data being generated, like a video stream.

41
Q

What is synthetic data generation?

A

Generate more/new data based on existing data.

42
Q

What do we need to work with big data?

A
  • Infrastructure: storage, compute, networks
  • Systems and software
  • Methods and algorithms
  • Expertise
43
Q

What is cloud computing? Give examples of cloud applications.

What are the keys behind cloud computing?

A

The use of computers and applications that are off premises, meaning you can do computations somewhere else if you do not have the resources yourself. This covers both ready-made applications and hiring entire servers where you can install whatever you want.

Examples: Jupyter notebooks, Google Docs.

Keys behind cloud computing: a good internet connection and virtualization.

44
Q

What are virtual machines?

A

A virtual machine (VM) is a software emulation of a physical computer. VMs run an operating system and applications just like a physical computer, but they do so within a virtual environment.

Virtual machine software can run programs and operating systems, store data, connect to networks, and do other computing functions

45
Q

What is containers and docker?

A

If you ran one virtual machine, with its own operating system, per application you would avoid conflicts, but that is not a reasonable use of resources. Containers are a way to run applications in a packaged operating-system environment; Docker found a way to run containers. Containers are like lightweight virtual machines.

Docker is an open-source platform that automates the deployment, scaling, and management of containerized applications. It provides tools and services for creating, managing, and running containers.

Containers are the concept and docker is the implementation.

46
Q

Why is better to use containers from docker than using virtual machines?

A

Docker containers occupy less space

Shorter boot-up time

Containers have better performance since they are hosted in a single Docker engine

High efficiency

Easily portable across different platforms

Data volumes can be shared and reused among multiple containers.

47
Q

What is Kubernetes?

A

An open-source platform that automates container operations: the deployment, scaling, and management of containerized applications.

48
Q

Supervised vs unsupervised learning?

A

Supervised learning is when we know the labels of the objects; the aim is to predict new observations correctly.

Unsupervised learning is when we do not know the labels of the objects; the aim is often to group objects based on patterns. Clustering methods are a form of unsupervised learning.

49
Q

What do we need to be able to know how good a model is?

A
  • Observations with known labels
  • Some metric
  • Some test procedure: usually we divide the data into a training set and a test set so that the model can be evaluated on unseen data and we can check that it is not overfitted.
50
Q

What is accuracy?

A

Metric for seeing how good your model is.

number of correct classifications / total number of classifications.

51
Q

What are the performance metrics we can use to see how good a model is?

A

Accuracy: fraction of correct predictions. (TP + TN) / (TP + TN + FP + FN)

Precision: what proportion of positive predictions is correct? TP / (TP + FP)

Sensitivity: TP / (TP + FN)

Specificity: TN / (TN + FP)

F1-score: the harmonic mean (a weighted average) of precision and sensitivity. The optimal value is 1 and the worst case is 0.
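
The same formulas written out as a small sketch, starting from counts of true/false positives and negatives:

```python
def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall / true positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f1

print(metrics(tp=80, tn=50, fp=10, fn=20))
```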

52
Q

What metric for model quality should we choose in what situation?

A

Accuracy is good when we have balanced (symmetric) datasets and when the costs of FP and FN are similar.

Use sensitivity when you do not want to miss any positives, and specificity when false positives have a higher cost.

F1 is good if your dataset has an uneven class distribution or if the costs of FP and FN are not even.

53
Q

What is a ROC curve?

A

A plot of sensitivity vs 1 − specificity; it tells you how the model performs across classification thresholds.

The area under the ROC curve (AUC) tells you whether your model is better than random chance, which would give an AUC of 50% (0.5).
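
A scikit-learn sketch of computing the ROC curve and AUC from true labels and predicted scores (toy arrays):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]   # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # 1 - specificity vs sensitivity
print(roc_auc_score(y_true, y_score))                 # 0.5 would be random chance
```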

54
Q

What does it mean that your model is overfitted?

What are the cons of using the test set method to reduce the risk of overfitting?

A

It means the model captures the patterns of the training data too specifically rather than the general patterns. This leads to the model performing very poorly on unseen data because it is too specifically fitted to the training set.

E.g. if the training data only contains blue flowers, an overfitted model may say that a red flower is not a flower just because it is not blue.

The con is that we waste a lot of data on the test set that we could have used to train the model.

55
Q

What is K-fold cross validation and why may we use it?

A

A way of testing a model on data it has not been trained on without permanently setting aside a single training/test split.

We randomly split the data into k sets (folds). The model is trained on k−1 folds and tested on the remaining one. This is repeated until the model has been tested on every fold, and we then take the average of the test results.
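
A scikit-learn sketch of 5-fold cross-validation on toy data (any estimator would work in place of the logistic regression):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())   # average of the k test results
```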

56
Q

What are the differences in using high k vs low k in cross validation?

A

Using fewer folds (small K) reduces training time and computational cost, but each model is then trained on a smaller fraction of the data, so the performance estimate tends to be more pessimistic and less stable.

With larger K each model is trained on almost all of the data, which gives a less biased estimate of performance, but the training sets overlap more, the fold results are more correlated, and the computational cost grows.

k = 10 is a common choice.

57
Q

What are some ways to improve an ML model?

A
  • Select another modelling method. CV can be used to decide which model is best: fit all models of a family to your data and use CV to see which one is best.
  • Choose new types of features
  • Select a subset of features (feature selection)
  • Tune the parameters of the method
  • Get more data - modelling methods can never fix the problem of bad or lacking data.
58
Q

What should you think about when choosing your variables for a prediction model?

A

More features than necessary increase the risk of overfitting your model, so you should only choose the features needed to predict Y; which ones those are can, however, be hard to know from the beginning.

Use cross-validation to find a useful set of features.

59
Q

What are some variable selection methods?

A

Forward selection
Backward selection
Wrapper approach: perform the modelling twice, and in the second run use only the variables that proved important in the first.

60
Q

What are some parameters that may need tuning? What are some approaches to do so?

A

Number of layers
batch sizes

Approaches:
- Manual search
- grid search
- Bayesian optimization

61
Q

What are the main uses for large-scale ML?

A

When we have big datasets that may not fit in a computer's memory, or when the training time is long.

Demanding ML methods like deep neural networks

Demanding model selection and parameter tuning - when we are training and assessing a lot of different models.

62
Q

What are some large-scale ML approaches?

A

Simplify the model:
- Reduce the model size by using feature selection
- Reduce the complexity of the model by reducing the number of layers, perceptrons, etc.

Optimize the model:
- Use approximations to get faster convergence

Compute in parallel:
- Use many nodes/cores to reduce running time.

63
Q

Generative models vs foundation models?

A

Generative:
- Generate new content based on input (e.g. HMMs).
- They generate new data samples that resemble a given training dataset: they learn the underlying distribution of the data and can create new instances similar to the original data.

Foundation:
Foundation models are large-scale machine learning models trained on vast amounts of diverse data. They are designed to serve as a general-purpose foundation that can be fine-tuned for a wide range of downstream tasks.

64
Q

What are large language models?

A

Large language models are advanced neural network models that are trained on massive corpora of text data to understand, generate, and manipulate human language. They use architectures such as transformers, which enable them to handle complex language tasks.

65
Q

What is cryoEM?

A

A microscopy method that can resolve individual atoms. It produces a massive volume of data very fast.

Because of the volume and velocity of the data, it can be hard to store, so we usually do not analyse it in batches but on the fly.

The data can be very useful since it can be used to view individual proteins, cells, etc.

66
Q

Describe the challenges and opportunities of using AI in precision medicine.

A

Opportunities: AI can help us make predictions about specific diseases based on medical data that can be very large.

Challenges: the model needs to be trained on a large, labelled dataset; training sets that are too small may lead to poor training and false results.

Another challenge is accountability: who is responsible for the predictions if they are wrong?

The models will also learn from real live data. What if they learn things we did not intend? There are ethical challenges: we need the model to learn what we intend, with high accuracy, but not anything else.

67
Q

What is map reduce?

A

A programming model.

Map is a higher-order function that applies a given function to each element of, e.g., a list.

Reduce is a higher-order function that, using a given function, combines the parts of, e.g., a list into a return value.

Map and reduce can be used with many languages, and since each step in map and reduce is separate, the model lends itself very well to parallelisation. The idea behind MapReduce frameworks is to process data in parallel because the datasets are usually too big to handle as one piece.
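
The map and reduce building blocks in plain Python (a toy counting example, not a distributed framework):

```python
from functools import reduce

words = ["big", "data", "big", "compute"]

# map: apply a given function to each element of a list
pairs = list(map(lambda w: (w, 1), words))            # [('big', 1), ('data', 1), ...]

# reduce: combine the parts of a list into a single return value
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)
print(total)   # 4 words counted in total
```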

68
Q

What is spark RDD?

A

In Apache Spark, Resilient Distributed Datasets (RDDs) are the fundamental data structure. They are immutable, distributed collections of objects that can be processed in parallel. RDDs offer fault tolerance, allowing for reliable processing of large datasets.

An RDD is an unchangeable data structure distributed over multiple computers, with built-in backup if something goes wrong.
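
A minimal PySpark sketch (assuming a local Spark installation) showing an RDD being created and processed in parallel:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize(range(1, 1001))   # immutable, distributed collection
result = (rdd.map(lambda x: x * x)     # transformations run in parallel over partitions
             .filter(lambda x: x % 2 == 0)
             .reduce(lambda a, b: a + b))
print(result)
sc.stop()
```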

69
Q

What is Apache spark and Apache Hadoop?

A

Both are open-source engines designed for large-scale data processing.

Spark is a newer engine that addresses many of the limitations of Hadoop's MapReduce model.

Hadoop processes data using a series of MapReduce tasks and uses HDFS to store data on clusters of machines. Data generated during the map step is written to disk, and output data goes back to HDFS.

Spark does its computing in memory, which means that intermediate data is also stored in memory, making it faster.

70
Q

Is Nextflow a top-to-bottom programming language?

A

No, it is a reactive language where different processes communicate through channels. As long as the channels have the right names, the processes do not have to be written in top-to-bottom order.

71
Q

What is wget, scp, sftp and rsync?

A

wget is a program for getting data from web servers.

scp is a secure copy protocol for copying files; it is considered outdated and inflexible.

Use sftp and rsync instead. rsync is used for transferring and synchronising files and uses MD5 checksums to make sure the files are identical to the originals.

sftp is a file transfer protocol that runs over SSH, so SSH takes care of the security.

72
Q

What is zip and gzip? What are the z commands?

A

Zip is an archive format that can contain both files and directories that can be compressed.

gzip is a way of compressing and decompressing ONLY files.

z-commands are different ways of working with gzipped files.
zcat - view
zgrep - search inside a zipped file
zdiff - compare zipped files
zless - browse a zipped file

73
Q

What is the point of using containers?

A

It means you do not need to keep all software installed on your own local computer. You set up a container that has everything you need to run a process, and when you no longer need it you can discard the container.

74
Q

What is batch normalization in deep learning?

A

A form of regularisation that reduces the risk of overfitting the model by applying a transformation that keeps the mean output close to 0 and the output standard deviation close to 1.

75
Q

Explain the differences between virtual machines and docker containers.

A

Virtual machines are like software versions of real computers. They share the hardware with the host computer while having their own operating system. Hypervisors are used to create and manage the VMs, and the VMs contain everything needed to run applications, such as storage, compute and memory. They have slow boot times and require a large overhead, but the isolation makes them secure.

Containers are lightweight, portable units of software that package up code and all its dependencies.
They share the host OS kernel but run as isolated processes in separate environments. Containers use a container engine such as Docker to create and manage them.
They have faster boot times and less overhead because they share the OS with the host and only the applications run separately, but this also means that they are less secure.

76
Q

Difference between zip and gzip?

A

zip can archive multiple files and directories, while gzip compresses just one file.

77
Q

Relate the 5 V’s in big data to cryoEM microscopy.

A

Volume: with cryoEM microscopy we can see individual atoms of the object we are imaging. This produces a massive number of images, and very large datasets are generated. This can introduce problems since the datasets probably will not fit into computer memory, so working in batches is usually not possible.

Velocity: The velocity of the generated data is very high, about 3TB a day.

Variety: This data is usually very complex.

Veracity: This data can be incredibly useful since it allows us to see individual proteins and protein systems or individual cells of an organ for example.

Value: It can be used in structural biology, drug development, medical research ect.

78
Q

What is FTP and sftp?

A

File Transfer Protocol created 1971. Was not designed to be secure and has a lot of security issues.

With SFTP, SSH takes care of the security.

79
Q

What are the cons of using containers?

A

One con is that data inside the container is lost if the container is shut down, which may cause problems for workflows.

The containers are also less isolated than VMs since they share OS with the host which means that they are less secure.

80
Q

What is pull vs push?

A

Push and pull are different workflow scheduling strategies.

Push: tasks in the workflow are executed as their input data becomes available; when a task is done, the workflow moves on to the next task.

Pull: the workflow asks for specific outputs and works backwards to figure out which tasks must run to produce them.

81
Q

What is nextflow?

A

Domain specific language for workflows, it enables scalable and reproducible workflows.

82
Q

Why do we use workflow tools?

A
  • run multiple tasks in parallel
  • workflow tools model dependencies between tasks
  • avoid re-running computations, to save time after a failure
  • separate unfinished from finished data, to avoid unfinished data being used downstream

~80% of the work in data-analysis pipelines is getting and preprocessing the data, since data is seldom in a clean state when you get it and often needs to be integrated from multiple sources. Workflow tools are commonly used for this part of data analysis.

83
Q

What are the benefits we gain from using workflow tools?

A
  1. Improved readability
  2. Less risk of data corruption
  3. Less risk of mis-use of data
  4. Improved understanding of how output data was created
  5. (Often) improved portability
  6. Improved maintainability
  7. Simplified scalability
84
Q

Explain file-based vs explicit dependency workflow tools?

A

File-based workflow tools run tasks based on whether specific files (filenames) are available.

Explicit-dependency tools do not care as much about file names; a task is run once its prerequisite tasks have been completed.

85
Q

Static vs dynamic scheduling?

A

In static scheduling tools the graph of tasks is determined completely before run.

In dynamic scheduling new tasks can be created during execution.

86
Q

Give examples of three workflow tools

A
  • nextflow
  • snakemake
  • luigi
87
Q

Why is high content imaging (HCI) in the big data field?

A

A common area of high content imaging is to take pictures of cells in different conditions.

We have many wells with different conditions in them and we take pictures of each well and measure many features of the cells.

This produces millions of datapoints per plate.

The images are very useful in many areas because the cells may die but the images remain for further analysis and AI applications.

88
Q

single point vs multiparametric data in HCI?

A

Single-point data means that we are only looking at one parameter, taking the average of the values from many cells.

Multiparametric means that for thousands of cells we are looking at hundreds to thousands of parameters.

89
Q

What are some general considerations you should think of when doing HCI?

A

Image quality is key.

  • Illumination correction: because the microscope objective is round, illumination is uneven and we lose light in the corners of the images; we correct for the uneven pixel intensities.
  • Quality control: out-of-focus (blurred) images make it hard to extract information, and the same goes for saturation (bubbles etc.). We measure how many pixels are saturated and flag these images so they are not analysed downstream.
  • Assay evaluation (signal-to-background, signal-to-noise, signal window, z-score)
90
Q

What is image segmentation in HCI?

What are the two ways of doing image segmentation?

A

Segment the cells/nuclei to then extract features from the segments. Shape, intensity, texture, microenvironment and context (relationship between one cell and surroundings).

It is important to also do quality control of the segmentation.

Model based: manual or with algorithms.
Machine learning: trained to find the optimal segmentation parameters.

91
Q

What are batch and plate effects in HCI?

A

The cells may run out of nutrients faster in one plate than in another, and then it will look like they are growing poorly, which we may mistake for an effect of the treatment.

92
Q

What are the different models for imaging assays?

A

2D cultures - most widely used, here we use cell lines

3D cultures - cells grow in clumps on a 3D surface, which should reflect reality better. Here we use primary cells. The challenge is that we start to lose the middle of the clumps because dyes, light, etc. cannot penetrate through.

Organoids - Even more physiologically correct

Animal models

93
Q

Biology directed assay vs unbiased profiling?

A

Two different strategies for screening with HCI

Biology-directed assays start with a specific hypothesis and targets. They can still be multiparametric. We may want to target things like
- membrane
- DNA
- RNA
- location/intensity of specific proteins

Unbiased profiling, such as cell painting, means that we look at many parameters/features (texture, shape, morphology, area, etc.) without specific targets, to uncover unexpected patterns/associations between different conditions.

94
Q

Name some differences between academic and industrial workflows/pipelines.

A

Academic workflows are generally very exploratory where many paths are investigated while industry workflows are more streamlined and the pipelines are very optimized for single tasks. This means that academic pipelines are more manual, iterative and typically batch oriented.

95
Q

Within a machine learning context, explain the concept of transfer learning and what challenge it
addresses.

A

Transfer learning within deep learning is the concept of reusing an already existing architecture, with weights and biases that have been trained on other data, on your own data. It addresses the challenge of training deep networks when you have limited data or compute, by reusing what has already been learned on a larger dataset.

You remove layers towards the end of the pre-trained model; depending on how different the applications are, you will need to remove more or fewer layers and add new ones.

You first train the “new” part of the model, i.e. your own layers. You freeze the pre-existing part so that it takes part in forward propagation but not backpropagation, so that the training already done is not affected. Lastly, you can unfreeze the pre-existing layers and train the model as a whole.

One challenge with transfer learning is deciding how many epochs to run with the pre-existing layers frozen and how many with them unfrozen.
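
A hedged Keras sketch of the freeze-then-unfreeze recipe described above (the pretrained base, input size and layer sizes are just examples):

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(128, 128, 3), weights="imagenet")
base.trainable = False                     # freeze: used in forward pass, not updated

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),    # new layers for our own task
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(...)                           # 1) train only the new head

base.trainable = True                      # 2) unfreeze the pre-trained layers ...
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
# model.fit(...)                           # ... and fine-tune the whole model
```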

96
Q

What is the z-factor?

A

A common plate-level metric for estimating assay quality, computed from the means and standard deviations of the positive and negative controls: Z = 1 − 3(σ_pos + σ_neg) / |μ_pos − μ_neg|.

Values between 0.5 and 1 are considered good.