big_data Flashcards
What are the 5 V’s of big data?
Volume, Velocity, Variety, Veracity and Value
- Volume: massive amounts of data generated
- Velocity: data is generated rapidly
- Variety: data is complex or unstructured
- Veracity: quality of the data (unclean data, missing values, etc.)
- Value: whether the data can be turned into something useful
What do we need to work with large/Big data?
Infrastructure (storage, compute, networks), systems and software (efficient, scalable), methods and algorithms, and expertise.
Which command(s) can be used to check if your batch jobs are running?
squeue
jobinfo
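For example (assuming a standard SLURM setup; `jobinfo` is an UPPMAX-specific wrapper, and the `-u` flag is shown as typically used):

```shell
# List only your own jobs in the queue (requires a SLURM cluster)
squeue -u "$USER"

# UPPMAX's jobinfo wrapper gives a similar but more detailed view
jobinfo -u "$USER"
```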
Briefly describe UPPMAX. What is it? What does it have? What should be done there?
UPPMAX is the university’s resource for high-performance computing (HPC) and large-scale storage. It gives researchers and students access to powerful computational resources and specialized software. UPPMAX has several clusters with large amounts of storage and processing power, for example Rackham and Snowy, and provides a wide range of software packages that can be loaded using “module load”.
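On the cluster, software is accessed through the module system; a typical session looks like this (on UPPMAX, bioinformatics tools sit behind the bioinfo-tools umbrella module):

```shell
# List the software modules available on the cluster
module avail

# Load the bioinformatics umbrella module, then a specific tool
module load bioinfo-tools
module load bowtie2
```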
What type of UPPMAX nodes are used for logging into a cluster?
login
What is QSAR?
Quantitative Structure-Activity Relationship.
The basic idea of QSAR is to mathematically describe the structure of molecules (for example as fingerprints) and then use machine learning to predict some property of interest. The algorithm compares the fingerprints, and molecules with similar fingerprints get similar predicted values.
BATCH VS STREAM
Batch processing works on a fixed, stored dataset all at once; stream processing handles data continuously as it arrives. Working with batches of data is most common in life science.
Cross validation
Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data. The data is split into multiple folds; one fold is held out for validation while the rest are used to train the model, and this is repeated so that each fold serves as the validation set once.
If needed, the folds can be run in parallel, e.g., one fold per node.
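One fold per node maps naturally onto a SLURM array job. A minimal sketch (the project allocation, time limit, and `train_fold.sh` script are hypothetical placeholders):

```shell
#!/bin/bash
#SBATCH -A my_project_alloc    # project allocation (placeholder)
#SBATCH -N 1                   # one node per array task, i.e. one fold per node
#SBATCH -t 01:00:00
#SBATCH --array=0-4            # 5-fold cross validation: one array task per fold

# Each array task validates on fold $SLURM_ARRAY_TASK_ID and
# trains on the remaining folds (train_fold.sh is hypothetical).
./train_fold.sh --fold "$SLURM_ARRAY_TASK_ID"
```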
parallel computing?
DIVIDE A PROBLEM INTO PARTS ACROSS NODES
To perform parallel computing in a computing cluster, the problem must be divided into smaller parts that can be solved independently. Each node then solves its part of the problem in parallel with the other nodes, producing a partial result. These partial results are combined to form the final solution.
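The divide / solve / combine pattern can be mimicked on a single machine with background shell jobs standing in for nodes (a toy sketch, not an actual cluster run):

```shell
# Divide: write the numbers 1..100 and split them into 4 chunks
seq 1 100 > numbers.txt
split -l 25 numbers.txt chunk_

# Solve: each "node" sums its own chunk in parallel (background jobs)
for f in chunk_aa chunk_ab chunk_ac chunk_ad; do
  awk '{s += $1} END {print s}' "$f" > "$f.sum" &
done
wait

# Combine: add the partial results into the final solution
cat chunk_*.sum | awk '{s += $1} END {print s}'   # prints 5050
```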
ASS 1
NGS ALIGNMENT BOWTIE2
In the NGS assignment, we were tasked with aligning multiple files of next-generation sequencing reads to a reference genome using Bowtie2, loaded via the bioinfo-tools module. First we did the alignment for a prokaryote, whose genome is smaller; the task was therefore fast and the data small enough to keep locally, so it was run in an interactive node on UPPMAX. We then did the same task on the larger genome of a fly, using the node-local scratch storage ($SNIC_TMP) since the data would be too big to keep locally, and submitted the job as a batch script to the UPPMAX job queue via SLURM.
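The batch script looked roughly like this (the project name, file names, and index prefix are placeholders, not the actual assignment files):

```shell
#!/bin/bash
#SBATCH -A my_project_alloc    # placeholder project allocation
#SBATCH -p core -n 4
#SBATCH -t 04:00:00

module load bioinfo-tools bowtie2

# Copy the input to node-local scratch; $SNIC_TMP is set on each UPPMAX node
cp reads.fastq genome_index.* "$SNIC_TMP"/
cd "$SNIC_TMP"

# Align single-end reads against the prebuilt index, using 4 threads
bowtie2 -p 4 -x genome_index -U reads.fastq -S aligned.sam

# Copy the result back from scratch to the submission directory
cp aligned.sam "$SLURM_SUBMIT_DIR"/
```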
ASS 2
In assignment 2, we ran an existing Nextflow pipeline for part 1; for part 2, we wrote a Nextflow pipeline of 4 processes that preprocesses mass spectrometry data using the open-source software collection OpenMS.
What does SLURM do?
SLURM is the cluster’s workload manager: it schedules jobs and manages the job queue.
How do you send a job on UPPMAX?
sbatch …
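For example (the script name is illustrative; SLURM prints the assigned job id on submission):

```shell
# Submit a batch script to the job queue
sbatch my_job.sh

# Options in the script header can also be overridden on the command line
sbatch -t 02:00:00 my_job.sh

# Then check the job's status in the queue
squeue -u "$USER"
```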
Where should long calculations be performed?
compute nodes, not the login nodes
What is Nextflow and how does it work?
Nextflow is workflow management software that enables writing scalable and reproducible scientific workflows. It can run tasks in parallel and is stream-oriented. Processes, which are isolated from each other and can be written in different scripting languages, are connected using channels; the channels determine the order in which the processes run.
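Launching a pipeline from the command line (the script name is illustrative; requires Nextflow to be installed):

```shell
# Run a local pipeline script
nextflow run main.nf

# Re-run after a failure, reusing cached results from completed processes
nextflow run main.nf -resume
```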