Big data Flashcards

1
Q

What is QSAR?

A

Quantitative Structure–Activity Relationship (QSAR) is a modelling technique used to predict the biological activity of a molecule.

It relates the structure of a molecule to numeric values that can describe almost any molecular property.

The basic idea of QSAR is to mathematically describe the structure of molecules and then use machine learning to predict some property of interest. The machine learning algorithm looks at all the fingerprints, and molecules with similar fingerprints will get similar predicted values.
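
A minimal sketch of that workflow in Python, assuming RDKit and scikit-learn are installed; the SMILES strings and activity values below are made-up placeholders:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "CCN", "CCC", "CCCl"]       # placeholder molecules
activity = np.array([1.2, 0.9, 0.4, 1.5])    # placeholder measured property

def fingerprint(smi, radius=2, n_bits=2048):
    # numerical description of the structure: a Morgan bit-vector fingerprint
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array([int(b) for b in fp.ToBitString()])

X = np.array([fingerprint(s) for s in smiles])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, activity)
print(model.predict([fingerprint("CCBr")]))  # predicted property for a new molecule
```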

2
Q

What is a molecular descriptor and molecular fingerprint? Give examples

A

A numerical representation of a molecule derived from its symbolic representation. The goal is to create numerical vectors that capture the structural features of molecules. Similar molecules will therefore get similar vectors. The vector is called a molecular fingerprint.

An example of a simple molecular descriptor is a count, for example of atoms of each type, pairwise distances, etc.

A more complex descriptor is a Morgan fingerprint.

3
Q

Describe the Morgan fingerprint

A

The Morgan fingerprint is perhaps the most commonly used fingerprint; it is derived using the Morgan algorithm. It is created by describing the neighbourhood of each atom out to a certain radius, and then hashing (and sometimes folding) this down to a bit or count vector of a fixed length.

When you generate a Morgan fingerprint for a molecule, you end up with a binary vector, where each element (or bit) in the vector represents the presence or absence of a particular structural feature or environment within the molecule. So each position is a feature and the binary values [0,1] tells you if the feature is present or not.
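
A small RDKit illustration (assuming RDKit is installed; the molecule is an arbitrary example):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, arbitrary example
# radius controls how far each atom's neighbourhood reaches,
# nBits is the fixed length of the final bit vector
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
print(fp.GetNumBits())                  # 1024 positions, one per hashed feature
print(list(fp.GetOnBits()))             # indices of features present in this molecule
```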

4
Q

What does it mean to use folding on a molecular fingerprint?

A

Folding fingerprints is a way of reducing the dimensionality of a fingerprint. You divide the fingerprint in half and combine the two halves using a logical OR.

This can, for example, be used to reduce a Morgan fingerprint down to a fixed length.
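
A minimal Python sketch of folding a toy bit vector in half with a logical OR:

```python
import numpy as np

fp = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # toy 8-bit fingerprint

def fold(bits):
    half = len(bits) // 2
    # element-wise OR of the two halves -> half the original length
    return np.logical_or(bits[:half], bits[half:]).astype(int)

print(fold(fp))   # [1 1 1 1]
```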

5
Q

What is Uppmax?

A

UPPMAX is a high-performance computing cluster that consists of both login nodes and
compute nodes as well as shared storage. It can be used to run computations that require a lot
of memory or that require multiple nodes (as is common when working with genomics data).

6
Q

What did we do in assignment 1?

A

In the NGS assignment, we used Bowtie2 to align reads from next-generation sequencing to a reference genome. This was first done for a bacterial genome (on an interactive node on UPPMAX) and then for a larger genome (using a batch script submitted to the job queue at UPPMAX via SLURM).

7
Q

What did we do in assignment 2?

A

In assignment 2, we ran an existing Nextflow pipeline for part 1; for part 2 we wrote a Nextflow pipeline consisting of 4 processes that preprocess mass spectrometry data using the open-source software collection OpenMS.

8
Q

Why do we use 3 splits of the data instead of 2 in deep learning?

A

So what we always try to do is make a model and then evaluate it on data it has never seen before. This is the first split, training vs test, which you’ve probably seen a lot before.

In Deep Learning we can use that split, but it is much, much more common to do a training, validation, test split.

The extra validation set lets us “test” the model as we train it. Based on that we can see trends in how the model learns and use them to improve the model during or at the end of training. Early stopping is one such example: the model that performed best on the validation set, which the model has not been trained on, is chosen. It is then evaluated on the test set.
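
One common way to get the three splits, sketched with scikit-learn (the toy data here is just a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)   # toy data

# first carve off a held-out test set ...
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ... then split what remains into training and validation
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# -> 60% train, 20% validation, 20% test
```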

9
Q

Difference between batches and epochs when training your data?

A

A batch is the set of training examples that are forward propagated before we backpropagate. This lets us pick up overall trends in how we should adjust our parameters (weights and biases). If we backpropagated after every single forward propagation, it would both take longer to train the network and make the model fit more to individual data points instead of overall trends.

Batch size is how many examples (e.g. images) we have in a batch.

Epoch: when all the training data has been forward propagated and backpropagated, we have trained for one epoch. So if we have 10 batches per epoch, then we have run 2 epochs after 20 batches. More epochs == more training.

or

if a dataset includes 1,000 images split into mini-batches of 100 images, it will take 10 iterations to complete a single epoch.
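
The same arithmetic spelled out:

```python
dataset_size = 1000
batch_size = 100
iterations_per_epoch = dataset_size // batch_size   # 10 batches per epoch
epochs = 2
total_iterations = epochs * iterations_per_epoch    # 20 batches -> 2 epochs
print(iterations_per_epoch, total_iterations)
```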

10
Q

What is forward propagation?

A

Forward propagation involves passing input data through the neural network to generate predictions or outputs. During this process, the input data is sequentially transformed as it propagates through the network’s layers, ultimately producing an output.

11
Q

In deep learning, what are our parameters and what is x?

A

Weights and biases are the parameters, and x is the input values (the data).

12
Q

What is an activation function?

A

A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

13
Q

Is deep learning without bias?

A

No. We are picking the data going into the model and that data is somehow going to be biased.

14
Q

What is a perceptron?

A

A linear model + an activation function. A model that we can train.

15
Q

What is depth and width in deep learning?

A

Depth = how many layers we have

Width = how many neurons we have in a layer

16
Q

What is a multilayer perceptron?

A

Multilayer perceptron = more than one layer of neurons, sometimes referred to as ANN.

17
Q

Difference between ANN and CNN in machine learning?

A

An Artificial Neural Network (ANN) is a group of multiple perceptrons or neurons at each layer. An ANN is also known as a feed-forward neural network because inputs are processed only in the forward direction.
This type of neural network is one of the simplest variants. It passes information in one direction, through the input nodes, until it reaches the output node. The network may or may not have hidden layers, and its functioning is relatively interpretable.

Convolutional neural networks (CNNs) are among the most popular models used today. This computational model uses a variation of multilayer perceptrons and contains one or more convolutional layers that can be either entirely connected or pooled. These convolutional layers create feature maps that record regions of the image, which are ultimately broken into rectangles and sent on for non-linear processing.

18
Q

What is deep learning and why deep learning?

A

Deep learning is a subset of machine learning. Why use it:

  • High accuracy
  • Adaptable
  • Fast prediction times (though long training times)
  • Reduced bias - although we can never remove all bias
19
Q

Explain the definition of AI, machine learning and deep learning.

A

Artificial intelligence – trying to make computers do what the human brain can.

  • Machine learning – algorithms that have the ability to learn without being explicitly programmed. Computer systems learn from data that represents experiences. The objective is to learn a target function (model) that can be used to predict the value or label of a future observation.
  • Deep learning – a subset of machine learning in which neural networks adapt and learn from data.

Deep Learning == Training a Deep Neural Network

20
Q

Explain the definitions:

Neural network
Neuron
Perceptron

A

Deep Learning == Training a Deep Neural Network
Neural network == Network of neurons
Neuron == Perceptron
Perceptron : linear model + activation function.

The perceptron has a linear function into which we put our input values; the weights and bias are our ONLY parameters, and these parameters will change during training. The activation function then adds non-linearity to capture complex patterns in the data, and it also adds the option of not passing the output on.
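
A single perceptron written out as a minimal numpy sketch (toy numbers):

```python
import numpy as np

def sigmoid(z):                      # one possible activation function
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])       # input values
w = np.array([0.1, 0.4, -0.2])       # weights (parameters)
b = 0.3                              # bias (parameter)

output = sigmoid(np.dot(w, x) + b)   # linear model + activation function
print(output)
```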

21
Q

Explain the sigmoid activation function

A

A logistic function that goes from 0 to 1: σ(x) = 1 / (1 + e^(−x)).

The sigmoid function maps any real-valued input to a value between 0 and 1. As the input x becomes increasingly negative, the sigmoid function approaches 0 but never quite reaches it. Similarly, as x becomes increasingly positive, it approaches 1 but never quite reaches it.

22
Q

What is backward propagation?

A

Backward propagation, also known as backpropagation, is used to calculate the gradients of the loss function with respect to the weights and biases of the neural network. These gradients are then used to update the network’s parameters during the optimization process.

23
Q

Explain the process of forward pass and backward pass.

A

Forward Pass:
Input data x is fed into the input layer.
Each neuron in the input layer computes a weighted sum of its inputs, adds a bias term, and applies an activation function, producing an output.
The outputs of the input layer neurons become the inputs to neurons in the next layer. This process continues through each layer, with each layer transforming the input from the previous layer until reaching the output layer.

Loss Calculation:
The output of the network is compared to the ground truth (actual targets) using a loss function, which quantifies how well the network’s predictions match the true values.

The loss function provides a single scalar value representing the discrepancy between the predicted outputs and the true targets.

Backward Pass (Backpropagation):
The gradient of the loss function with respect to each parameter (weights and biases) in the network is computed using derivatives.
The gradient indicates how much the loss function would change if each parameter were adjusted slightly.
By computing these gradients backward through the network, starting from the output layer and moving backward, we determine how sensitive the loss function is to changes in each parameter.

These gradients guide the optimization process by indicating the direction and magnitude of parameter updates needed to minimize the loss function.
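
A minimal numpy sketch of one forward pass, a squared-error loss and the backward pass for a single sigmoid neuron (toy numbers, not any particular network from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])        # input
y_true = 1.0                    # target
w = np.array([0.1, -0.3])       # weights
b = 0.0                         # bias

# forward pass
z = np.dot(w, x) + b
y_pred = sigmoid(z)

# loss calculation (squared error)
loss = 0.5 * (y_pred - y_true) ** 2

# backward pass: chain rule gives d(loss)/dw and d(loss)/db
dloss_dpred = y_pred - y_true
dpred_dz = y_pred * (1.0 - y_pred)   # derivative of the sigmoid
grad_w = dloss_dpred * dpred_dz * x
grad_b = dloss_dpred * dpred_dz

# parameter update (gradient descent)
lr = 0.1
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)
```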

24
Q

Why does the number of parameters change through the different layers of an MLP?

A

The number of parameters depends on the number of neurons in the present layer as well as the number of neurons in the previous layer.

Say that the input is an image of 32×32 pixels; flattened, that gives 32×32 = 1024 input values.

If those values go into a layer of 7 neurons, each neuron has 1024 weights plus 1 bias, so that layer has (1024+1)×7 = 7175 parameters. From that layer we get 7 outputs that go into the next layer of 5 neurons, so the number of parameters there is (7+1)×5 = 40.
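
The counting rule as a small helper, using the layer sizes from the example above:

```python
def dense_params(n_in, n_out):
    # each of the n_out neurons has n_in weights plus 1 bias
    return (n_in + 1) * n_out

print(dense_params(32 * 32, 7))   # 7175 parameters in the first layer
print(dense_params(7, 5))         # 40 parameters in the next layer
```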

25
Q

What is the loss function?

A

The loss function compares the prediction to the target and quantifies how wrong the prediction was. There are many variations of it.

26
Q

What is the main goal of backpropagation?

A

Backpropagation: adjusting the network based on how wrong the prediction was.

Each neuron is updated in proportion to how much it contributed to the loss of the next layer.

Each parameter is updated in proportion to how much it contributed to its neuron being wrong.

27
Q

Why can’t we just use the test set to stop training?

A

Because we would then use up the unseen data and there would be nothing left for the real test.

28
Q

What is the similarity between logistic regression and ANN models?

A

Logistic regression uses the curve of the activation (logistic) function to predict a value, and ANN architectures use the same kind of curves, but many times and in different layers.

29
Q

What is the general architecture of an ANN model?

A

Commonly used for classification.

  • An input layer that takes the input
  • A layer that flattens the input into a vector
  • Hidden layers with multiple perceptrons
  • An output layer with perceptrons

Input is processed only forward.

30
Q

Why do we train the models in machine learning?

A

To get better than random chance at a specific task.

31
Q

What is the right batch size?

A

Like the number of epochs, batch size is a hyperparameter with no magic rule of thumb. Choosing a batch size that is too small will introduce a high degree of variance (noisiness) within each batch, as it is unlikely that a small sample is a good representation of the entire dataset. Conversely, if the batch size is too large it may not fit in the memory of the compute instance used for training, and it will have a tendency to overfit the data.

32
Q

What is convolution and filter maps?

A

During a convolution we move a filter across the image, doing some basic math on all the numbers the filter touches and putting the result into a new square (pixel) in our new “image”. The resulting image is called a filter map, and we do this to extract features in images.

Kernel = filter = feature detector: a sliding window of predetermined size that moves across the original image and calculates new values pixel by pixel –> filter map.
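
A bare-bones 2D convolution in numpy (no padding, stride 1), just to show the sliding-window idea:

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1            # output height (no padding, stride 1)
    ow = image.shape[1] - kw + 1            # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # multiply the numbers the filter touches and sum them up
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out                              # this is the filter (feature) map

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # toy edge-like filter
print(convolve2d(image, kernel))
```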

33
Q

What is strides and padding in CNNs?

A

Stride is the number of pixels the kernel moves at each step across the original image.

Padding means adding numbers around the edge of the image so that the corners have as much influence on the output as the middle of the image. It is also done so that the filter map will have the same size as the input image.

34
Q

What is pooling in CNNs?

A

If we keep the output the same size as the input but generate more filter maps than we have inputs, we quickly get a large amount of data, i.e. a lot of math to do, which means training takes longer and requires more data.

In these filter maps there may be many pixels that do not add more information. To concentrate the information we use pooling. This is another window that moves across the filter map, but the stride is always equal to the window size so that no pixel is touched twice.

35
Q

AveragePooling vs MaxPooling?

A

We set a pooling window size; within that window, max pooling picks the highest value and writes it to the corresponding position in the new, smaller matrix. Average pooling instead takes the average of all values in the window.

Pooling condenses the signal so that we look at the image more generally.
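
Max vs average pooling on a toy 4×4 filter map, with a 2×2 window and stride 2 (numpy sketch):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)

def pool(fm, size=2, mode="max"):
    oh, ow = fm.shape[0] // size, fm.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fm[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

print(pool(fmap, mode="max"))       # [[4. 2.] [2. 8.]]
print(pool(fmap, mode="average"))   # [[2.5  1.  ] [1.25 6.5 ]]
```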

36
Q

What is the general architecture of CNNs?

A

CNNs are commonly used for image recognition and have:

  • an input layer
  • one or more convolution layers
  • pooling layers
  • a flatten layer
  • an MLP
  • an output layer

The convolution layers can either be fully connected or pooled.

37
Q

What is data Augmentation?

A

A problem in life sciences is that we usually do not have big data sets.

This means that we run higher risk of overfitting our models because there is not enough variety.

A way of solving this is data augmentation where we randomly change the input to reduce the risk of overfitting our data.

These transformations are typically designed to preserve the underlying characteristics of the data while introducing variability.

38
Q

What is the Dropout function doing?

A

Dropout is a form of regularization that helps prevent overfitting by randomly setting a fraction of input units to zero during training. For example, Dropout(0.2) means that during training, 20% of the units in the previous layer will be randomly set to zero. This forces the network to learn more robust features and prevents it from becoming overly reliant on specific activations.

By randomly dropping out units during training, Dropout helps to reduce the interdependency between neurons, making the network more resilient and less likely to overfit to the training data. At test time, Dropout is typically turned off, and the full network is used for making predictions.
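
A hedged Keras sketch of where a Dropout(0.2) layer typically sits (the layer sizes are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dropout(0.2),   # 20% of the previous layer's units zeroed each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# Dropout is active during model.fit(...) and automatically disabled during model.predict(...)
```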

39
Q

What are the 5 Vs in big data?

A

Volume: massive amount of data generated
Velocity: data is generated very fast
Variety: data is complex and unstructured
Veracity: quality of the data
Value: can it be turned into something useful?

40
Q

Explain the concept of working in batches with big data? When is it common to work in batches and when is it not?

A

The analysis of the data is separated from the generation of data. The data needs to be stored somewhere before we do the analysis later.

This is a very common approach for working with sequencing data where we might sequence some DNA, store it in a file and then map it back to a reference later.

When we work with, for example, image analysis, it may not be possible to store the data before analysis due to the massive amounts generated from the images. It is then more common to do the analysis right after data generation.

It is also hard to work in batches when we have a continuous flow of data being generated, like a video stream.

41
Q

What is synthetic data generation?

A

Generate more/new data based on existing data.

42
Q

What do we need to work with big data?

A
  • Infrastructure: storage, compute, networks
  • Systems and software
  • Methods and algorithms
  • Expertise
43
Q

What is cloud computing? Give examples of cloud applications.

What are the keys behind cloud computing?

A

The use of computers and applications that are off premises, meaning you can do computations somewhere else if you do not have the resources yourself. This covers both ready-made applications and hiring entire servers where you can install whatever you want.

Examples: Jupyter notebooks, Google Docs.

Keys behind cloud computing: a good internet connection and virtualization.

44
Q

What are virtual machines?

A

A virtual machine (VM) is a software emulation of a physical computer. VMs run an operating system and applications just like a physical computer, but they do so within a virtual environment.

Virtual machine software can run programs and operating systems, store data, connect to networks, and do other computing functions

45
Q

What is containers and docker?

A

If you ran one virtual machine, with its own operating system, per application you would avoid conflicts, but that is not a reasonable use of resources. Containers are a way to run applications in a packaged operating-system environment; Docker found a way to run containers. Containers are like lightweight virtual machines.

Docker is an open-source platform that automates the deployment, scaling, and management of containerized applications. It provides tools and services for creating, managing, and running containers.

Containers are the concept and docker is the implementation.

46
Q

Why is better to use containers from docker than using virtual machines?

A

Docker containers occupy less space

Shorter boot-up time

Containers have better performance since they are hosted in a single Docker engine

High efficiency

Easily portable across different platforms

Data volumes can be shared and reused among multiple containers.

47
Q

What is Kubernetes?

A

An open-source platform that automates container operations: the deployment, scaling, and management of containerized applications.

48
Q

Supervised vs unsupervised learning?

A

Supervised learning is when we know the labels of the objects; the aim is to predict new observations correctly.

Unsupervised learning is when we do not know the labels of the objects; the aim is often to group objects based on patterns. Clustering methods are a form of unsupervised learning.

49
Q

What do we need to be able to know how good a model is?

A
  • Observations with known labels
  • Some metric
  • Some test procedure: usually we divide the data into a training set and a test set so that the model can be evaluated on unseen data and we can check that it is not overfitted.
50
Q

What is accuracy?

A

Metric for seeing how good your model is.

number of correct classifications / total number of classifications.

51
Q

What are the performance metrics we can use to see how good a model is?

A

Accuracy: fraction of correct predictions. (TP + TN) / (TP + TN + FP + FN)

Precision: what proportion of positive predictions is correct? TP / (TP + FP)

Sensitivity: TP / (TP + FN)

Specificity: TN / (TN + FP)

F1-score: the harmonic mean (a weighted average) of precision and sensitivity. The optimal value is 1 and the worst case is 0.
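
The same formulas written out as a small sketch, starting from counts of true/false positives and negatives:

```python
def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall / true positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f1

print(metrics(tp=80, tn=50, fp=10, fn=20))
```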

52
Q

What metric for model quality should we choose in what situation?

A

Accuracy is good when we have balanced (symmetric) datasets and when the costs of FP and FN are similar.

Use sensitivity when you do not want to miss any positives, and specificity when false positives have a higher cost.

F1 is good if your dataset has an uneven class distribution or if the costs of FP and FN are not even.

53
Q

What is a ROC curve?

A

A plot of sensitivity vs 1 − specificity; it tells you how the model performs across classification thresholds.

The area under the ROC curve (AUC) tells you whether your model is better than random chance, which would give an AUC of 50% (0.5).
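
A scikit-learn sketch of computing the ROC curve and AUC from true labels and predicted scores (toy arrays):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3]   # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # 1 - specificity vs sensitivity
print(roc_auc_score(y_true, y_score))                 # 0.5 would be random chance
```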

54
Q

What does it mean that your model is overfitted?

What are the cons of using the test set method to reduce the risk of overfitting?

A

It means the model captures the patterns of the training data too specifically rather than the general patterns. This leads to the model performing very poorly on unseen data because it is too specifically fitted to the training set.

E.g. if the training data only contains blue flowers, an overfitted model may say that a red flower is not a flower just because it is not blue.

The con is that we waste a lot of data on the test set that we could have used to train the model.

55
Q

What is K-fold cross validation and why may we use it?

A

A way of testing a model on data it has not been trained on without permanently setting aside a single training/test split.

We randomly split the data into k sets (folds). The model is trained on k−1 folds and tested on the remaining one. This is repeated until the model has been tested on every fold, and we then take the average of the test results.
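
A scikit-learn sketch of 5-fold cross-validation on toy data (any estimator would work in place of the logistic regression):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())   # average of the k test results
```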

56
Q

What are the differences in using high k vs low k in cross validation?

A

Using fewer folds (small K) reduces training time and computational cost, but each model is then trained on a smaller fraction of the data, so the performance estimate tends to be more pessimistic and less stable.

With larger K each model is trained on almost all of the data, which gives a less biased estimate of performance, but the training sets overlap more, the fold results are more correlated, and the computational cost grows.

k = 10 is a common choice.

57
Q

What are some ways to improve an ML model?

A
  • Select another modelling method. CV can be used to decide which model is best: fit all models of a family to your data and use CV to see which one is best.
  • Choose new types of features
  • Select a subset of features (feature selection)
  • Tune the parameters of the method
  • Get more data - modelling methods can never fix the problem of bad or lacking data.
58
Q

What should you think about when choosing your variables for a prediction model?

A

More features than necessary increase the risk of overfitting your model, so you should only choose the features needed to predict Y; which ones those are can, however, be hard to know from the beginning.

Use cross-validation to find a useful set of features.

59
Q

What are some variable selection methods?

A

Forward selection
Backward selection
Wrapper approach: perform the modelling twice, and in the second run use only the variables that proved important in the first.

60
Q

What are some parameters that may need tuning? What are some approaches to do so?

A

Number of layers
batch sizes

Approaches:
- Manual search
- grid search
- Bayesian optimization

61
Q

What are the main uses for large-scale ML?

A

When we have big datasets that may not fit in a computer's memory, or when the training time is long.

Demanding ML methods like deep neural networks

Demanding model selection and parameter tuning - when we are training and assessing a lot of different models.

62
Q

What are some large-scale ML approaches?

A

Simplify the model:
- Reduce the model size by using feature selection
- Reduce the complexity of the model by reducing the number of layers, perceptrons, etc.

Optimize the model:
- Use approximations to get faster convergence

Compute in parallel:
- Use many nodes/cores to reduce running time.

63
Q

Generative models vs foundation models?

A

Generative:
- Generate new content based on input (e.g. HMMs).
- They generate new data samples that resemble a given training dataset: they learn the underlying distribution of the data and can create new instances similar to the original data.

Foundation:
Foundation models are large-scale machine learning models trained on vast amounts of diverse data. They are designed to serve as a general-purpose foundation that can be fine-tuned for a wide range of downstream tasks.

64
Q

What are large language models?

A

Large language models are advanced neural network models that are trained on massive corpora of text data to understand, generate, and manipulate human language. They use architectures such as transformers, which enable them to handle complex language tasks.

65
Q

What is cryoEM?

A

A microscopy method that can resolve individual atoms. It produces a massive volume of data very fast.

Because of the volume and velocity of the data, it can be hard to store, so we usually do not analyse it in batches but on the fly.

The data can be very useful since it can be used to view individual proteins, cells, etc.

66
Q

Describe the challenges and opportunities of using AI in precision medicine.

A

Opportunities: AI can help us make predictions about specific diseases based on medical data that can be very large.

Challenges: the model needs to be trained on a large, labelled dataset; training sets that are too small may lead to poor training and false results.

Another challenge is accountability: who is responsible for the predictions if they are wrong?

The models will also learn from real live data. What if they learn things we did not intend? There are ethical challenges: we need the model to learn what we intend, with high accuracy, but not anything else.

67
Q

What is map reduce?

A

A programming model.

Map is a higher-order function that applies a given function to each element of, e.g., a list.

Reduce is a higher-order function that, using a given function, combines the parts of, e.g., a list into a return value.

Map and reduce can be used with many languages, and since each step in map and reduce is separate, the model lends itself very well to parallelisation. The idea behind MapReduce frameworks is to process data in parallel because the datasets are usually too big to handle as one piece.
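
The map and reduce building blocks in plain Python (a toy counting example, not a distributed framework):

```python
from functools import reduce

words = ["big", "data", "big", "compute"]

# map: apply a given function to each element of a list
pairs = list(map(lambda w: (w, 1), words))            # [('big', 1), ('data', 1), ...]

# reduce: combine the parts of a list into a single return value
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)
print(total)   # 4 words counted in total
```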

68
Q

What is spark RDD?

A

In Apache Spark, Resilient Distributed Datasets (RDDs) are the fundamental data structure. They are immutable, distributed collections of objects that can be processed in parallel. RDDs offer fault tolerance, allowing for reliable processing of large datasets.

An RDD is an unchangeable data structure distributed over multiple computers, with built-in backup if something goes wrong.
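
A minimal PySpark sketch (assuming a local Spark installation) showing an RDD being created and processed in parallel:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize(range(1, 1001))   # immutable, distributed collection
result = (rdd.map(lambda x: x * x)     # transformations run in parallel over partitions
             .filter(lambda x: x % 2 == 0)
             .reduce(lambda a, b: a + b))
print(result)
sc.stop()
```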

69
Q

What is Apache spark and Apache Hadoop?

A

Both are open-source engines designed for large-scale data processing.

Spark is a newer engine that addresses many of the limitations of Hadoop's MapReduce model.

Hadoop processes data using a series of MapReduce tasks and uses HDFS to store data on clusters of machines. Data generated during the map step is written to disk, and output data goes back to HDFS.

Spark does its computing in memory, which means that intermediate data is also stored in memory, making it faster.

70
Q

Is Nextflow a top-to-bottom programming language?

A

No, it is a reactive language where different processes communicate through channels. As long as the channels have the right names, the processes do not have to be written in top-to-bottom order.

71
Q

What is wget, scp, sftp and rsync?

A

wget is a program for getting data from web servers.

scp is a secure copy protocol for copying files; it is considered outdated and inflexible.

Use sftp and rsync instead. rsync is used for transferring and synchronising files and uses MD5 checksums to make sure the files are identical to the originals.

sftp is a file transfer protocol that runs over SSH, so SSH takes care of the security.

72
Q

What is zip and gzip? What are the z commands?

A

Zip is an archive format that can contain both files and directories that can be compressed.

gzip is a way of compressing and decompressing ONLY files.

z-commands are different ways of working with gzipped files.
zcat - view
zgrep - search inside a zipped file
zdiff - compare zipped files
zless - browse a zipped file

73
Q

What is the point of using containers?

A

It means you do not need to keep all software installed on your own local computer. You set up a container that has everything you need to run a process, and when you no longer need it you can discard the container.

74
Q

What is batch normalization in deep learning?

A

A form of regularisation that reduces the risk of overfitting the model by applying a transformation that keeps the mean output close to 0 and the output standard deviation close to 1.

75
Q

Explain the differences between virtual machines and docker containers.

A

Virtual machines are like software versions of real computers. They share the hardware with the host computer while having their own operating system. Hypervisors are used to create and manage the VMs, and the VMs contain everything needed to run applications, such as storage, compute and memory. They have slow boot times and require a large overhead, but the isolation makes them secure.

Containers are lightweight, portable units of software that package up code and all its dependencies.
They share the host OS kernel but run as isolated processes in separate environments. Containers use a container engine such as Docker to create and manage them.
They have faster boot times and less overhead because they share the OS with the host and only the applications run separately, but this also means that they are less secure.

76
Q

Difference between zip and gzip?

A

zip can archive multiple files and directories, while gzip compresses just one file.

77
Q

Relate the 5 V’s in big data to cryoEM microscopy.

A

Volume: with cryoEM microscopy we can see individual atoms of the object we are imaging. This produces a massive number of images, and very large datasets are generated. This can introduce problems since the datasets probably will not fit into computer memory, so working in batches is usually not possible.

Velocity: The velocity of the generated data is very high, about 3TB a day.

Variety: This data is usually very complex.

Veracity: This data can be incredibly useful since it allows us to see individual proteins and protein systems or individual cells of an organ for example.

Value: It can be used in structural biology, drug development, medical research ect.

78
Q

What is FTP and sftp?

A

File Transfer Protocol created 1971. Was not designed to be secure and has a lot of security issues.

With SFTP, SSH takes care of the security.

79
Q

What are the cons of using containers?

A

One con is that data inside the container is lost if the container is shut down, which may cause problems for workflows.

The containers are also less isolated than VMs since they share OS with the host which means that they are less secure.

80
Q

What is pull vs push?

A

Push and pull are different workflow scheduling strategies.

Push: tasks in the workflow are executed as their input data becomes available; when a task is done, the workflow moves on to the next task.

Pull: the workflow asks for specific outputs and works backwards to figure out which tasks must run to produce them.

81
Q

What is nextflow?

A

Domain specific language for workflows, it enables scalable and reproducible workflows.

82
Q

Why do we use workflow tools?

A
  • run multiple tasks in parallel
  • workflow tools model dependencies between tasks
  • avoid re-running computations, to save time after a failure
  • separate unfinished from finished data, to avoid unfinished data being used downstream

~80% of the work in data-analysis pipelines is getting and preprocessing the data, since data is seldom in a clean state when you get it and often needs to be integrated from multiple sources. Workflow tools are commonly used for this part of data analysis.

83
Q

What are the benefits we gain from using workflow tools?

A
  1. Improved readability
  2. Less risk of data corruption
  3. Less risk of mis-use of data
  4. Improved understanding of how output data was created
  5. (Often) improved portability
  6. Improved maintainability
  7. Simplified scalability
84
Q

Explain file-based vs explicit dependency workflow tools?

A

File-based workflow tools run tasks based on whether specific files (filenames) are available.

Explicit-dependency tools do not care as much about file names; a task is run once its prerequisite tasks have been completed.

85
Q

Static vs dynamic scheduling?

A

In static scheduling tools the graph of tasks is determined completely before run.

In dynamic scheduling new tasks can be created during execution.

86
Q

Give examples of three workflow tools

A
  • nextflow
  • snakemake
  • luigi
87
Q

Why is high content imaging (HCI) in the big data field?

A

A common area of high content imaging is to take pictures of cells in different conditions.

We have many wells with different conditions in them and we take pictures of each well and measure many features of the cells.

This produces millions of datapoints per plate.

The images are very useful in many areas because the cells may die but the images remain for further analysis and AI applications.

88
Q

single point vs multiparametric data in HCI?

A

Single-point data means that we are only looking at one parameter, taking the average of the values from many cells.

Multiparametric means that for thousands of cells we are looking at hundreds to thousands of parameters.

89
Q

What are some general considerations you should think of when doing HCI?

A

Image quality is key.

  • Illumination correction: because the microscope objective is round, illumination is uneven and we lose light in the corners of the images; we correct for the uneven pixel intensities.
  • Quality control: out-of-focus (blurred) images make it hard to extract information, and the same goes for saturation (bubbles etc.). We measure how many pixels are saturated and flag these images so they are not analysed downstream.
  • Assay evaluation (signal-to-background, signal-to-noise, signal window, z-score)
90
Q

What is image segmentation in HCI?

What are the two ways of doing image segmentation?

A

Segment the cells/nuclei to then extract features from the segments. Shape, intensity, texture, microenvironment and context (relationship between one cell and surroundings).

It is important to also do quality control of the segmentation.

Model based: manual or with algorithms.
Machine learning: trained to find the optimal segmentation parameters.

91
Q

What are batch and plate effects in HCI?

A

The cells may run out of nutrients faster in one plate than in another, and then it will look like they are growing poorly, which we may mistake for an effect of the treatment.

92
Q

What are the different models for imaging assays?

A

2D cultures - most widely used, here we use cell lines

3D cultures - cells grow in clumps on a 3D surface, which should reflect reality better. Here we use primary cells. The challenge is that we start to lose the middle of the clumps because dyes, light, etc. cannot penetrate through.

Organoids - Even more physiologically correct

Animal models

93
Q

Biology directed assay vs unbiased profiling?

A

Two different strategies for screening with HCI

Biology-directed assays start with a specific hypothesis and targets. They can still be multiparametric. We may want to target things like
- membrane
- DNA
- RNA
- location/intensity of specific proteins

Unbiased profiling, such as cell painting, means that we look at many parameters/features (texture, shape, morphology, area, etc.) without specific targets, to uncover unexpected patterns/associations between different conditions.

94
Q

Name some differences between academic and industrial workflows/pipelines.

A

Academic workflows are generally very exploratory where many paths are investigated while industry workflows are more streamlined and the pipelines are very optimized for single tasks. This means that academic pipelines are more manual, iterative and typically batch oriented.

95
Q

Within a machine learning context, explain the concept of transfer learning and what challenge it
addresses.

A

Transfer learning within deep learning is the concept of reusing an already existing architecture, with weights and biases that have been trained on other data, on your own data. It addresses the challenge of training deep networks when you have limited data or compute, by reusing what has already been learned on a larger dataset.

You remove layers towards the end of the pre-trained model; depending on how different the applications are, you will need to remove more or fewer layers and add new ones.

You first train the “new” part of the model, i.e. your own layers. You freeze the pre-existing part so that it takes part in forward propagation but not backpropagation, so that the training already done is not affected. Lastly, you can unfreeze the pre-existing layers and train the model as a whole.

One challenge with transfer learning is deciding how many epochs to run with the pre-existing layers frozen and how many with them unfrozen.
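
A hedged Keras sketch of the freeze-then-unfreeze recipe described above (the pretrained base, input size and layer sizes are just examples):

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(128, 128, 3), weights="imagenet")
base.trainable = False                     # freeze: used in forward pass, not updated

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),    # new layers for our own task
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(...)                           # 1) train only the new head

base.trainable = True                      # 2) unfreeze the pre-trained layers ...
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="binary_crossentropy")
# model.fit(...)                           # ... and fine-tune the whole model
```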

96
Q

What is the z-factor?

A

A common plate-level metric for estimating assay quality, computed from the means and standard deviations of the positive and negative controls: Z = 1 − 3(σ_pos + σ_neg) / |μ_pos − μ_neg|.

Values between 0.5 and 1 are considered good.