big_data Flashcards

1
Q

What are the 5 V’s of big data?

A

Volume, Velocity, Variety, Veracity and Value
- Volume: massive amounts of data generated
- Velocity: data is generated rapidly
- Variety: data comes in many forms, often complex or unstructured
- Veracity: quality and trustworthiness of the data (unclean data, missing values, etc.)
- Value: whether the data can be turned into something useful

2
Q

What do we need to work with large/Big data?

A

Infrastructure (storage, compute, networks); efficient and scalable systems and software; methods and algorithms; and expertise.

3
Q

Which command(s) can be used to check if your batch jobs are running?

A

squeue
jobinfo
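
A minimal example of how these can be used; the -u flag for jobinfo is assumed to behave like squeue's:

    # List your own jobs (running and pending) in the SLURM queue
    squeue -u $USER

    # UPPMAX's jobinfo wrapper gives a similar, more readable overview
    jobinfo -u $USER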

4
Q

Briefly describe UPPMAX. What is it? What does it have? What should be done there?

A

UPPMAX is the university's resource for high-performance computing (HPC) and large-scale storage. It provides researchers and students with access to powerful computational resources and specialized software. UPPMAX runs several clusters with large amounts of storage and processing power, for example Rackham and Snowy, and gives access to a wide range of software packages that can be loaded with "module load". Long or heavy computations should be run there on the compute nodes (via the batch system) rather than on the login nodes.
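
A small sketch of how software is found and loaded; the module names reflect the usual UPPMAX setup and may differ:

    # See which versions of a tool are available
    module avail bowtie2

    # Bioinformatics tools on UPPMAX typically sit behind an umbrella module
    module load bioinfo-tools bowtie2

    # Check what is currently loaded
    module list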

5
Q

What type of UPPMAX nodes are used for logging into a cluster?

A

login

6
Q

What is QSAR?

A

Quantitative Structure-Activity Relationship.

The basic idea of QSAR is to describe the structure of molecules mathematically (for example as fingerprints) and then use machine learning to predict some property of interest. The machine learning algorithm compares the fingerprints, and molecules with similar fingerprints get similar predicted values.

7
Q

BATCH VS STREAM

A

Batch processing means data is first generated and stored and then analyzed later; stream processing means data is analyzed continuously as it is produced. Working with batches of data is the most common approach in life science.

8
Q

Cross validation

A

Cross-validation is a technique used in machine learning to evaluate the performance of a model on unseen data. The data is split into multiple folds; one fold is held out for validation while the rest are used to train the model, and this is repeated so that each fold serves as the validation set once.

If needed, cross-validation can be run in parallel, for example with each fold on its own node (see the sketch below).
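
A hedged sketch of "one fold per node" as a SLURM array job; the project name, resources, and the train_fold.sh script are hypothetical:

    #!/bin/bash
    #SBATCH -A my_project        # hypothetical project/account
    #SBATCH -p node              # one full node per array task
    #SBATCH -t 02:00:00
    #SBATCH --array=0-4          # five array tasks = five folds

    # Each array task trains and validates the model on one fold;
    # train_fold.sh is a hypothetical script taking the fold index.
    ./train_fold.sh ${SLURM_ARRAY_TASK_ID}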

9
Q

parallel computing?

A

Divide a problem into parts that can be distributed across nodes.

To perform parallel computing in a computing cluster, the problem must be divided into smaller parts that can be solved independently. Each node then solves its part of the problem in parallel with the other nodes, producing a partial result. These partial results are combined to form the final solution.

10
Q

ASS 1

A

NGS ALIGNMENT BOWTIE2
In the NGS assignment, we used Bowtie2 to align reads from next-generation sequencing to a reference genome. This was first done for a bacterial genome (on an interactive node on UPPMAX) and then for a larger genome (using a batch script submitted to the job queue at UPPMAX via SLURM).

We were tasked with aligning multiple files of sequencing reads to a reference genome using the bowtie2 software (loaded from UPPMAX's bioinformatics module). First we did the alignment for a prokaryote, whose genome is smaller, so the task ran faster and the data was small enough to handle locally. We then did the same task on the genome of a fly, which is bigger, and used the node-local scratch storage (SNIC_TMP) since the data would be too big to handle locally. The batch script was submitted to compute nodes on UPPMAX using SLURM (see the sketch below).
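
A minimal sketch of what such a batch script could look like; the account, file names, paths, and resource requests are illustrative assumptions:

    #!/bin/bash
    #SBATCH -A my_project        # hypothetical UPPMAX project
    #SBATCH -p core
    #SBATCH -n 4
    #SBATCH -t 04:00:00

    module load bioinfo-tools bowtie2

    # Stage the data on node-local scratch so the large genome data
    # is not read and written over shared storage
    cp reads_1.fastq.gz reads_2.fastq.gz fly_index* $SNIC_TMP/
    cd $SNIC_TMP

    # Align paired-end reads against the pre-built reference index
    bowtie2 -p 4 -x fly_index -1 reads_1.fastq.gz -2 reads_2.fastq.gz -S aligned.sam

    # Copy the result back before the job ends and scratch is cleared
    cp aligned.sam /proj/my_project/results/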

11
Q

ASS 2

A

In assignment 2, part 1 consisted of running an existing Nextflow pipeline, and in part 2 we wrote our own Nextflow pipeline of four processes that preprocess mass spectrometry data using the open-source software collection OpenMS.

12
Q

What does SLURM do?

A

manage the job queue (schedule submitted batch jobs onto the cluster's compute nodes)

13
Q

How do you submit a job on UPPMAX?

A

sbatch …
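
For example (project name, resources, and script name are placeholders):

    # Submit a batch script to the queue; SLURM prints the assigned job ID
    sbatch -A my_project -p core -n 2 -t 01:00:00 my_job.sh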

14
Q

Where should long calculations be performed?

A

compute nodes (not the login nodes)
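
For example, on UPPMAX (project name and resources are placeholders; the interactive wrapper is UPPMAX-specific):

    # Long calculations: submit a batch job to the compute nodes
    sbatch my_long_job.sh

    # Shorter hands-on work: request a compute node interactively
    interactive -A my_project -p core -n 1 -t 01:00:00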

15
Q

What is Nextflow and how does it work?

A

Nextflow is workflow management software that enables writing scalable and reproducible scientific workflows. It allows tasks to run in parallel and is stream-oriented. A workflow consists of processes (which are isolated and can be written in different scripting languages) connected by channels; the channels determine the order in which the processes run.
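
A pipeline is then launched from the command line; main.nf and the --reads parameter below are hypothetical names:

    # Run the workflow; Nextflow works out the process order from the channels
    nextflow run main.nf --reads 'data/*.fastq.gz'

    # -resume reuses cached results of processes whose inputs have not changed
    nextflow run main.nf --reads 'data/*.fastq.gz' -resume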

16
Q

What does bowtie2 do?

A

sequence alignment

17
Q

Give arguments as to when dimensionality reduction can be useful for analyzing multivariate datasets.

A

Dimensionality reduction is a technique used to reduce the number of features (dimensions) in a dataset while keeping the essence of the data (the most important information). Analyzing a dataset with many features can become hard and time-consuming, especially when building or training a model, where it can make performance very slow. Dimensionality reduction can help improve a model's performance and also prevent overfitting. A common way of doing this is principal component analysis (PCA).

18
Q

What do we mean when we say that working with batches of data is the most common approach for biomedical data analysis? From this perspective, is there a difference when working with data such as images of cells?

A

Working with batches means that the data is generated and stored first, and only analyzed later; it is not analyzed directly as it is produced. This means we need somewhere to store the data beforehand. This is common in biomedical data analysis: for example, sequenced DNA reads can be stored in a file and later mapped or aligned to a reference.
This workflow changes when working with images of cells. Images tend to have many features, produce larger amounts of data, and be harder and more costly to store, so it is often preferable to analyze them in a continuous flow (stream analysis) as they are produced.

19
Q

UPPMAX?

A

UPPMAX is a high-performance computing cluster that consists of both login nodes and compute nodes as well as shared storage. It can be used to run computations that require a lot of memory or multiple nodes (as is common when working with genomics data).

20
Q

Within a machine learning context, explain the concept of transfer learning and what challenge it addresses

A

Transfer learning is a technique in ML where a model trained on one task is used as the starting point for a second, related task. By reusing the features learned on the first task, the second model can learn more quickly and effectively and needs less data and computational resources. It addresses the challenge that training a model from scratch requires large amounts of data and compute, which are often not available for the new task.

21
Q

In the course we have seen three different tools: scp, wget, and rsync. Explain how they differ from each other and give examples of when to use each tool.

A

scp (secure copy): encrypted file transfer over SSH, similar to the Bash cp command but between machines. It is considered outdated, inflexible, and less optimized, but it is still widely used and is convenient when transferring a single file.
rsync: a program for transferring and synchronizing files. It is faster and more optimized than scp, can transfer only the differences between source and destination, can resume aborted transfers, and checks that files were transferred correctly. It should probably be your go-to tool, for example when transferring or synchronizing whole remote folders.
wget: downloads files from the internet directly into your directory, which is helpful when you want to fetch something from the web or from outside your own network.
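
Illustrative invocations (host names, paths, and URLs are placeholders):

    # scp: copy a single file to a remote host over SSH, much like cp
    scp results.txt user@rackham.uppmax.uu.se:/proj/my_project/

    # rsync: synchronize a folder; only differences are re-sent and
    # an interrupted transfer can be resumed
    rsync -avP results/ user@rackham.uppmax.uu.se:/proj/my_project/results/

    # wget: download a file from the web into the current directory
    wget https://example.org/reference_genome.fa.gz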

22
Q

HADOOP

A

Hadoop
Pros: mature, reliable, extensive ecosystem.
Cons: slower due to disk I/O; not ideal for iterative processes.

Hadoop works by reading and writing data on disk and can use relatively cheap computers to run things in parallel. The Hadoop file system (HDFS) is a distributed, scalable storage system designed for extremely large data and can also be used as a storage solution for Spark.

23
Q

SPARK

A

Spark
Spark is often chosen for its speed and efficiency in iterative tasks, such as those common in QSAR projects.

Pros: faster due to in-memory processing; suitable for iterative algorithms.
Cons: requires more memory; less mature ecosystem.

24
Q

difference between a virtual machine and a docker container

A

A virtual machine virtualizes a whole machine, including its own operating system, which gives strong isolation; VMs are generally considered safer. A Docker container only packages the application together with its environment (libraries and dependencies) and shares the host's kernel, which makes it much more lightweight.

25
Q

parameters in deep learning

A

weights and biases

26
Q

Which Linux command(s) can be used to make a copy of a file?

A

cp

27
Q

The “z commands”, e.g., zless, zgrep, etc., are convenient…

A

when working in a streaming fashion directly with gzipped files without having to first unzip the entire files.
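
For example, on a gzipped FASTQ file (file name and search pattern are placeholders):

    # Page through a compressed file without unpacking it
    zless reads.fastq.gz

    # Search inside a compressed file
    zgrep -c '^@SRR' reads.fastq.gz

    # zcat streams the decompressed content, e.g. to count lines on the fly
    zcat reads.fastq.gz | wc -l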

28
Q

FTP is an …

A

FTP is an outdated insecure invention from the seventies that should not be used for any sensitive data.

29
Q

learning how to think in map and reduce operations is a general skill that can be reapplied in different programming languages
T/F

A

True. Most calculations can be converted into a series of map and reduce operations, and since many different languages support map and reduce, learning how to think in map and reduce operations is a general skill that can be reapplied in different programming languages.

30
Q

Map is a …

A

higher order function for applying a function on each element in a list. Another fitting name for it would be apply-to-all.

31
Q

One motivation for using map reduce is that

A

In a world where writing good parallel code is a hard problem that many struggle with, once an algorithm has been written as a series of map and reduce operations it is already parallelizable, so that hard problem is solved. If programmers get used to thinking in map and reduce, the parallelization comes more or less for free (although some problems do not lend themselves well to map/reduce).
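
A loose shell illustration of this idea, assuming GNU parallel is installed (the file name and block size are placeholders): each chunk of the file is "mapped" independently and in parallel, and a single "reduce" step combines the partial results.

    # Map: count lines per chunk, with the chunks processed in parallel
    # Reduce: sum the per-chunk counts into one total
    parallel --pipepart -a big_file.txt --block 10M wc -l | awk '{total += $1} END {print total}'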