Exam I Flashcards

1
Q

bioinformatics

A

sequence analysis (DNA/RNA)

Stats, bio, cs

Ex/ forcing yeast to evolve drug resistance, then sequencing the genome to find the mutated protein (yeast is eukaryotic)

2
Q

computational structural biology

A

protein/ligand structures

Physics, bio, cs, pharmacology

Ex/ drug discovery, predicting structure from sequence (homology modeling)

3
Q

systems biology

A

models complex biological networks (metabolic or cell-signaling networks)

Math, stats, bio, cs

Ex/ building mathematical model of the whole cell using protein interactions

4
Q

epidemiology

A

disease transmission/patterns/outbreaks

Stats, sociology, bio

Ex/ statistically significant correlation between tick bites and Lyme disease

5
Q

computational neuroscience

A

models brain function and cognition by simulating the nervous system

Neurology, bio, cognitive science, physics, cs

Ex/ how to model neurons to gain insights into brain function

6
Q

computational ecology

A

model ecosystem dynamics/disease spread

Ecology, bio, cs

Ex/ create a computer model of disease outbreak within a population

7
Q

UNIX meaning

A

Uniplexed Information and Computing Service

8
Q

unix impact

A

Foundation of many modern operating systems

Heavy-duty comp bio calculations run on UNIX

HUGE IMPACT → Linux, macOS, Android, and Chrome OS are all based on or derived from UNIX

Most remote servers and supercomputers run Unix/Linux

HISTORICAL → popularized hierarchical (directory-nested) file system

Launched the free software movement

9
Q

unix historical impact

A

HISTORICAL → popularized hierarchical (directory-nested) file system

Launched the free software movement

10
Q

Linux

A

open-source, Unix-like operating system (not itself a GUI; graphical desktops run on top of it)

Used everywhere [servers, top500 supercomputers, ~2% of desktop computers]

Popular distributions: CentOS, Fedora, Linux Mint, Ubuntu

11
Q

BASH

A

Bourne-again shell

Used to interact with Linux/Unix/macOS w/o the GUI

Default shell on most Linux/Unix systems

File system commands streamline directory and file management
– Files are organized into directories (folders)
– Enable file viewing and manipulation

12
Q

BASH features

A

Entirely text-based (typed commands are faster than clicking through a GUI)

Able to navigate/view files/directories (files stored in directories (folders))

Able to run executable programs

***speed/control

13
Q

high performance computing

A

***specialized systems designed to handle large-scale computational tasks

– Different architecture and capabilities than desktop computers

***parallel processing with multiple nodes working in concert (100s/1000s connected via high-speed networks)

14
Q

Why build a HPC system? + examples

A

For when you need A LOT of computation:

  • Weather forecasting/climate modeling
  • Protein folding
  • Simulating galaxies
  • Simulating molecules/proteins
  • High-throughput virtual screening (drug discovery)
  • Machine learning
15
Q

Fundamental Architecture of HPCs

A

HPC systems rely on nodes for login, data transfer, and computation

System of nodes, each with a specific purpose/function that affects how you interact with the system and structure jobs

16
Q

login node

A

login via this node (gateway into HPC system)

DO NOT run calculations here!

Shared cluster meant for light tasks

17
Q

data transfer node

A

copy data to/from this node

Efficiently moves data in and out of the system

18
Q

compute node

A

calculations take place here

Connected via a fast interconnect: high-speed network allowing for rapid communication between nodes

May be grouped into clusters: optimized for different functions, but data is all still accessible

19
Q

SLURM meaning

A

Simple Linux Utility for Resource Management

20
Q

HPC job schedulers: SLURM

A

Job schedulers optimize resource allocation and queue management

How many processors/resources for how long?

Manages multiple users, complex jobs, and limited resources (intermediary between user and computer)

Waiting queue until resources are available

Sophisticated algorithms are used to balance system demands

21
Q

PBS

A

Portable Batch System
(job scheduler)

22
Q

OGE

A

Open Grid Scheduler
(job scheduler)

23
Q

Top500 Supercomputers

A

Ranked biannually: June (European conference) and November (US conference)

Reflect global competition in supercomputing excellence

Balance of performance and power consumption

Rmax = maximum performance actually measured
Rpeak = theoretical peak performance
Power = power consumption

24
Q

serial programming

A

Tasks executed sequentially (one after another)

Straightforward, but doesn't take advantage of a supercomputer's parallel abilities

25
Q

embarrassingly parallel programming

A

Large number of nearly identical jobs (easy)

No communication between them (independent)

26
Q

communication-intensive parallel programming

A

Communication between processors/nodes

Transferring data takes time, so it's not always a speed-up

Time intensive and much more complex

27
Q

ideal scenario for parallelization

A

Work divided into completely independent pieces with no need for communication between them

Each processor works on its own section of the data w/o needing to check in with others, then the results are combined

***highly efficient and scalable

However…many problems require communication between processes (more complex)

28
Q

ssh

A

secure shell: protocol allowing remote access and file transfers on HPC systems

ssh is a secure gateway to remote computing resources

Allows you to log into a remote machine and run commands as if you were sitting at the remote terminal

Highly secure tunnel from your computer to the supercomputer (everything encrypted)

29
Q

scp

A

secure copy

Allows you to securely copy files to and from a remote system (between local & remote systems)

You specify the destination while the transfer remains secure

30
Q

module systems

A

prevent software conflicts in HPC environments

If all programs were available at the same time, they might conflict

**lets you set up the programs you'll run on the compute nodes

You specify which software tools are required

Loads only the software needed for a particular task to maintain order in a complex environment

31
Q

idle, mixed, allocated

A

IDLE: node not being used

MIXED: some processors are being used, some are idle

ALLOCATED: node is fully busy

32
Q

squeue -u username

A

Allows users to monitor and manage their active HPC jobs

Monitors work, manages resources, troubleshoots problems (real-time job visibility)

33
Q

SLURM job script

A

Must create a BASH script with certain elements:
1. Comments with instructions to SLURM
2. Module commands to load the software you need
3. UNIX commands to run the software

34
Q

SLURM job script SBATCH directives

A

1. Job name
2. Output filename
3. # of compute nodes
4. # of tasks per node
5. Time needed
6. Cluster
7. Partition

35
Q

machine learning

A

- computers improve through experience (data)
- machines find patterns in the data → use patterns to do something (prediction, classification, clustering)
- automates/scales up tedious tasks

36
Q

traditional programming VS ML

A

- traditional: programming with a list of specific instructions
- ML: learning through experience and finding patterns in the data to be used

37
Q

ML purpose

A

- improves diagnostics, binding predictions, and structure analysis
- scales up repetitive tasks that require intelligence

38
Q

ML examples

A

Ex/ identifying biomarkers/symptoms → identifies patient disease

Ex/ using crystal structure of a protein → predict binding

Ex/ use amino acid seq → predict the secondary structure it belongs to

39
Q

supervised learning

A

- uses labeled data to train predictive models
- data is LABELED with features (right answers are known ahead of time)
- learns to get the right label by matching patterns in the data

**some set of features (numbers) that describe real things

40
Q

unsupervised learning

A

- discovers patterns without labeled data
- NO LABELS
- goal is NOT to map data features to right answers

***goal is to make sense of the data

Ex/ grouping data with similar characteristics

41
Q

ML training and testing

A

- separate training and testing sets ensure UNBIASED evaluation of models
- BEFORE TRAINING: set aside some data for testing
- then use the training data to teach the ML model
- then test the model on the testing data set aside initially

- training data trains
- testing data tests/evaluates the ML model's performance

42
Q

overfitting

A

- model captures noise instead of real patterns
- TOO MANY FEATURES
- training data isn't truly representative
- finds random patterns in the training data that don't generalize to new scenarios (doesn't predict accurately)

43
Q

underfitting

A

- model fails to learn from the data
- not enough features for the learner to distinguish one set of descriptors from another

44
Q

classification

A

Categorize the data into a limited number of predefined classes

Ex/ does molecule bind or not?

Ex/ does brain have tumor type A, B, or C?

45
Q

regression

A

Continuous output (not a category)

Ex/ binding affinity

Ex/ predicted age of death based on certain conditions

***any regression problem can be turned into a classification problem by binning the continuous output into ranges

46
Q

decision trees

A

classify data through hierarchical rules

The machine learns a series of nested decision rules

***SUPERVISED

- divides the data iteratively until it knows which category the examples go into

47
Q

decision tree advantages and disadvantages

A

Advantage: people can understand the machine's "thinking" by looking at the rules

Disadvantage: overfitting is easy

48
Q

Scikit-learn (Python library)

A

Simplifies the use of many ML algorithms

Implements very many machine-learning algorithms

Has done much to democratize ML [increased accessibility]

Installed/managed via pip and conda

49
Q

support vector machines

A

Support vector machines (SVMs) classify data using optimal separating boundaries

SUPERVISED ML used for classification and regression tasks

GOAL ⇒ find the line that maximizes the margins (boundary as far from the data points as possible)

**the line that can best separate 2 things/patterns in the data

50
Q

kernel functions

A

Enhance SVM performance by transforming data

Some data's characteristics don't lend themselves to a linear separation

A kernel calculates similarity between 2 data points, turning non-linear relationships into linear ones

Transform the data with a kernel function and you can get a better linear separation

BC most data WILL NOT be easily separated by lines

**used in both supervised and unsupervised methods

51
Q

K-means clustering

A

Groups data points based on similarity

UNSUPERVISED ML

Given objects with n characteristics each, can we divide those objects into k clusters?

k is specified beforehand – then the data is fit into the clusters

It's an iterative process that uses a mathematical distance measure to assign each data point to the nearest cluster centroid