Bioinformatics Exam 3 Review Flashcards
Programming
helps in collecting and manipulating data, automating analysis workflows (to show people what you did), minimizing human error and generating reproducible reports, quick processing of large datasets and repetitive tasks, visualizing and making sense of the data
Programming language
letters and symbols form words according to rules; a programming language is a language in which humans formulate instructions for a computer to generate some desired output. Compiler and interpreter software translate an instruction written in a programming language into executable machine-level operations.
Pathway: Instruction (in your mind) → instruction in programming language → instruction in machine level language → execution of instruction/computation → generated output
Source Code
a set of instructions formulated in a programming language that is readable by humans
Program
a set of instructions stored in a form that can be executed by a computer
Compiler
software that translates source code into a machine-level program that is (usually) optimized for the machine it is compiled for
Time to translate → Slow
Time to execute → Fast
Interpreter
translates source code scripts into machine level operations “on the fly” and executes them line by line
Time to translate → Fast
Time to execute → Slow
1976
Chambers, Becker and Wilks develop the S statistical programming language at Bell Laboratories
Aim: facilitate quick transitions from idea to software
This interpreter-based language made modifying, testing and troubleshooting programs quick and convenient.
1993
Ihaka and Gentleman re-implement S and name it the “R programming language”
1995
The decision is made to make R freely available under the GNU General Public License (but it is not yet officially released)
1997
The R Core Group is founded and takes over R’s further development; the Comprehensive R Archive Network (CRAN) is launched, enabling sharing and curation of user-developed components that extend R’s capabilities
2000
R version 1.0.0 is released to the general public
2009
New York Times article: “Data Analysts Captivated by R’s Power”, Ashlee Vance
Good description of how R makes a difference → Daryl Pregibon (Google): “it allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems”
2017
a study found that R has shown extreme growth
2019
Another study found that R is the most requested programming language
Comprehensive R Archive Network (CRAN)
a network of FTP and web servers storing versions of code and documentation for R. It serves as the main general-purpose repository for R packages; if you face a common problem, there is likely a pre-made package that solves it.
R
language and environment for statistical computing and graphics, open source language that is free, provides tools for statisticians, data miners, data analysts, data scientists and academic researchers
Bioconductor
Another free R package repository, dedicated to the analysis of genomic data and biological high-throughput assays; it primarily serves the needs of bioinformaticians and biomedical researchers
Packages available: >1800
Mission: accessibility of powerful analysis and visualization tools, reproducible research, rapid development of software components that are both scalable and compatible with each other
Commands in R
R’s interpreter can process 2 forms of these → expressions and assignments. Commands can be separated by line breaks or the “;” character, and individual components within a command can be arbitrarily separated by spaces and tabs
Expressions
commands that are evaluated and (optionally) printed, after which their output is lost because it is not stored; they take some input arguments or values and return some output values
Operators
are generally expressed via 1 to 3 consecutive special characters and often handle fundamental, essential programming tasks, there are several other operators that handle tasks such as logic or comparison
Examples: ? opens a webpage with helpful documentation and explanations of a function
Objects
individual pieces of data that have two major attributes:
Data type: what type of information it contains
Value: the actual information that it contains
NOTE: internally the value of an object is just a bunch of zeros and ones in the memory of the computer; the data type is what tells R how to interpret and display the value of the object.
Scalar and multidimensional data types
the two fundamental classes of data types
Character Objects
display letters, words and text, wrapped in quotation marks
Logical Objects
only two possible values (yes and no/ true and false {abbreviated T and F}), used when you want to check or remember whether or not something is true or has happened when you run a program.
Numerical Objects
integers and decimals
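A minimal sketch of the three scalar object types above. The variable names and values are made up for illustration; `class()` is the base R function that reports an object's data type.

```r
# Three scalar objects, one per data type described above.
gene_name <- "BRCA1"   # character: text wrapped in quotation marks
is_coding <- TRUE      # logical: TRUE or FALSE
gc_content <- 0.42     # numeric: a decimal number

# class() reports the data type R uses to interpret each value.
class(gene_name)    # "character"
class(is_coding)    # "logical"
class(gc_content)   # "numeric"
```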
Parentheses
used to group arguments of expressions in conjunction with commas and can be used to control the order of operations in expressions, the expression enclosed within the innermost parentheses will always be evaluated first
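A small illustration of how parentheses change the order of operations; the numbers are arbitrary.

```r
# Without parentheses, multiplication is evaluated before addition.
without_parens <- 2 + 3 * 4       # 2 + 12 = 14

# Parentheses force the addition to happen first.
with_parens <- (2 + 3) * 4        # 5 * 4 = 20

# With nesting, the innermost parentheses are evaluated first.
nested <- ((2 + 3) * 4) / 10      # (5 * 4) / 10 = 2
```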
Assignment Commands
commands that evaluate an expression and store it, so that it can be accessed again in the future. These store objects in variables where the expression on the right-hand side can either be an R object or any type of valid expression that creates an R object and the left hand side is a variable that can be understood as a label or name that is attached to an object in order for R to remember it.
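The difference between a bare expression and an assignment can be sketched as follows. The variable names are illustrative; `<-` is R's usual assignment operator, and `=` also works for assignments like these.

```r
# An expression alone is evaluated and printed, but its result is not kept.
2 + 3

# An assignment evaluates the right-hand side and stores the result
# under the variable name on the left-hand side.
total <- 2 + 3

# The stored object can be reused later; ";" separates two commands on one line.
doubled <- total * 2; doubled
```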
R Console
an interactive interface for the command line interpreter (it comes up when you open R); commands can be typed into the console and executed by hitting the Enter key
Incomplete Commands
expressions that have not provided necessary right-hand side arguments or expressions that have not (yet) closed all their opened parentheses or brackets.
Negation/ “NOT” operator
done with an exclamation point, this turns true into false and false into true (opposites)
Logical “AND”
takes two logical expressions/variables and returns TRUE ONLY if BOTH are TRUE, otherwise it will return FALSE
Operator: &
Logical “OR”
returns TRUE if either the left or the right expression (or both) is TRUE; otherwise it returns FALSE
Operator: |
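The three logical operators above, applied to two example values (the variable names are illustrative):

```r
a <- TRUE
b <- FALSE

not_a   <- !a       # negation flips TRUE to FALSE
a_and_b <- a & b    # AND is TRUE only if both sides are TRUE
a_or_b  <- a | b    # OR is TRUE if at least one side is TRUE
```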
“Equal to” Operator
takes R objects (or expressions generating an R object) and returns TRUE if both objects are identical to each other.
Operator: == (Two consecutive equal signs)
“Not equal to” Operator
takes R objects (or expressions generating an R object) and returns TRUE, if both objects are not identical to each other
Operator: !=
Inequality Operators
compare numerical objects to each other
Operators: < (less than), <= (less than or equal to), > (Greater than), >= (greater than or equal to)
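A short sketch of the comparison operators; each comparison returns a logical object. The values are arbitrary.

```r
x <- 5
y <- 7

is_equal     <- x == y    # FALSE: 5 is not equal to 7
is_different <- x != y    # TRUE
is_less      <- x < y     # TRUE
at_least_5   <- x >= 5    # TRUE: "greater than or equal to" includes equality

# "==" also works on character objects.
same_text <- "ACGT" == "ACGT"   # TRUE
```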
If blocks
how to make logical expressions useful, help us to execute pieces of code conditionally and react to different inputs/scenarios while the program is running
Syntax: if (condition){# conditional lines of code go here}
Where “condition” is a logical expression or variable. If the condition is met, the code block in curly brackets will be executed; if it is not, the block will be skipped.
If…else statements
allow us to conveniently cover two mutually exclusive cases (i.e. if one is true the other is false)
Syntax: if (condition){# if “condition” is TRUE then do…} else {# if “condition” is FALSE then do…}
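A minimal if…else sketch. The variable `read_depth`, the threshold of 10 and the labels are hypothetical and chosen only for illustration.

```r
read_depth <- 12   # hypothetical input value

if (read_depth >= 10) {
  # runs only when the condition is TRUE
  label <- "sufficient coverage"
} else {
  # runs only when the condition is FALSE
  label <- "low coverage"
}
```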
Vectors
a simple, ordered list of values that all share one specific scalar data type
Three components:
Scalar data type
Ordered cells
Values
Created through the concatenation function c()
All elements of one of these have to have the same data type (you CANNOT create one of these that contains both characters and numbers; R converts all elements to a single common type instead), so it’s not possible to have a mixed-type one of these
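A brief sketch of creating vectors with `c()`. The gene names and counts are made up. The last lines show what happens when types are mixed: R does not raise an error but converts every element to one common type.

```r
# Vectors are built with the concatenation function c().
counts <- c(4, 8, 15)
genes  <- c("TP53", "MYC", "EGFR")

n_counts <- length(counts)   # number of ordered cells: 3

# Mixing numbers and characters coerces everything to character.
mixed <- c(1, "a")
class(mixed)                 # "character"
```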
Multi-dimensional Data types
more complex data types that are able to contain and arrange multiple scalars (vectors are the simplest of these types)
Binary operators
require that either both arguments are vectors of the same length or one of the two arguments is a single object (i.e. a vector of length one)
If different length → R will “recycle” the shorter vector from the beginning to extend its length, and will give a warning if the longer length is not a multiple of the shorter one
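The recycling behaviour can be sketched with a few arbitrary vectors. R warns only when the longer length is not a multiple of the shorter one; `suppressWarnings()` is used here just to keep the example quiet.

```r
a <- c(1, 2, 3, 4)

same_len <- a + c(10, 20, 30, 40)   # element by element: 11 22 33 44
scalar   <- a * 2                   # length-one vector reused: 2 4 6 8

# Length 2 is recycled to length 4 (10 20 10 20); no warning, 4 is a multiple of 2.
recycled <- a + c(10, 20)           # 11 22 13 24

# Length 3 plus length 2: recycling still happens, but R emits a warning.
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))   # 11 22 13
```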
Subset operator
accesses a specific element in the list and returns a new list with said element
also accepts a logical vector (e.g. the result of a logical expression), which has to be the same length as the vector we want to subset
Symbol: […]
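A sketch of subsetting a vector by position and by a logical vector. The expression values are invented for illustration.

```r
expr <- c(2.5, 0.1, 7.3, 4.0)

second    <- expr[2]         # one element by position: 0.1
first_3rd <- expr[c(1, 3)]   # several positions: 2.5 7.3

# A logical expression yields a logical vector of the same length ...
keep <- expr > 2             # TRUE FALSE TRUE TRUE

# ... which keeps only the elements where it is TRUE.
high <- expr[keep]           # 2.5 7.3 4.0
```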
Matrix
works just like a vector but has 2 dimensions (cells have an x- and y-position), instead of just one
Array
the generalization of both vectors and matrices; a multidimensional matrix with an arbitrary number of user-specified dimensions
A 1-dimensional R ______ is equivalent to an R vector and a 2-dimensional R _______ is equivalent to an R matrix
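A minimal sketch using base R's `array()` function, with arbitrary dimensions chosen for illustration:

```r
# A 3-dimensional array with 2 x 3 x 4 cells, filled with the values 1..24.
arr <- array(1:24, dim = c(2, 3, 4))
arr_dims <- dim(arr)          # 2 3 4

# One index per dimension selects a single cell.
last_cell <- arr[2, 3, 4]     # 24

# A 2-dimensional array is a matrix.
two_d <- array(1:6, dim = c(2, 3))
is.matrix(two_d)              # TRUE
```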
List
are an extension of the vector idea, this is a generic collection of R objects
Each element in ______ is an R object with an arbitrary data type and dimension (also allows different lengths as well as complex objects)
Useful to group various types of data that belong together (where they do not conveniently fit into a single table)
Can also use the assignment operator inside the ______ function to give elements names by which they can be accessed in the future
Syntax of ____ function:
_____( arg1, arg2, arg3…)
Extraction operator
will access a specific R object in the list and return the said object directly.
Symbol: [[…]] or $
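The difference between subsetting and extraction on a list can be sketched like this. The list contents and element names are hypothetical.

```r
# A list can hold objects of different types and lengths; "=" names the elements.
sample_info <- list(id = "S01", reads = c(120, 98, 143), passed = TRUE)

# The subset operator returns a new (shorter) list.
as_list <- sample_info["reads"]

# The extraction operators return the stored object itself.
as_vector <- sample_info[["reads"]]
sample_id <- sample_info$id
```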
Wrappers
any entity that encapsulates another entity, an object that holds other objects or a “container” for other objects; lists are generic wrappers for R objects. We probably won’t work directly with lists in this course (lists allow us to form a general intuition about other generic _________)
Generic __________: Three types → S3, S4 and Reference Class (sometimes called R5) objects
Loops
help us perform repetitive tasks, reduce redundancy of code and reduce the amount of code required to perform a task. Note that R’s “for” loop is always a “for each” loop over the elements of a vector (counting from 1 to N is the special case of looping over 1:N); R also has “while” and “repeat” loops. The index variable lets us execute the same piece of code for different inputs.
“For” loops
repeat something “N” times (we choose N)
Structure: for( i in 1:n){# repeat the following code…}
“For each” loops
repeat something for each element in a set
Structure: for(i in elements){# repeat the following code…}
Header → defines an index variable (here named i) and an R vector of “elements” for each of which we want to repeat something
Body→ the code block that will be repeated
Can be read as for each “i” in a set of “elements”, do this
What it does (sequence of events):
Sets the index variable “i” to the 1st object in “elements”
Executes the code inside the body
Sets the index variable “i” to the 2nd object in “elements”
Executes the code inside the body
Repeats for all objects in “elements”
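Both loop styles described above can be sketched in a few lines. The sequences are made-up examples; `nchar()` is the base R function that counts the characters in a string.

```r
# "For" loop: repeat N = 5 times, with i taking the values 1, 2, ..., 5.
total <- 0
for (i in 1:5) {
  total <- total + i          # after the loop: 1 + 2 + 3 + 4 + 5 = 15
}

# "For each" loop: the index variable takes each element in turn.
sequences <- c("ACGT", "GG", "TTTAAA")
lengths_seen <- c()
for (s in sequences) {
  lengths_seen <- c(lengths_seen, nchar(s))   # append the length of s
}
# lengths_seen is now 4 2 6
```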
Functions
important form of expression; functions take arbitrarily many arguments (0, 1, 2, 3, ...), usually perform more complex tasks, and return some desired output. R has pre-defined functions and allows users to create their own; after a function has been created it can be used as a shorthand to run the code encapsulated inside of it.
NOTE: functions have "local scope" whereas code outside of functions has "global scope" (meaning variables inside of a function are created independently and separately from the R environment outside of the function); R will discard any assignments made inside of a function after it has been executed.
When a function has multiple input arguments they will be assigned in order when the function is executed (NOTE: the order can be arbitrarily changed if arguments are referred to by their names).
Default values for input arguments can be set by using the assignment operator next to input variables in the header line of the function.
Syntax (calling): nameOfFunction(arg1, arg2, arg3...)
Syntax (defining):
my_function = function(arg1, arg2, arg3, ...){
# function goes here...
}
Header: defines the name of the function on the left of the assignment operator and the names of input variables/arguments of the function (these variable names are provided in the parentheses)
Body: the code block that will be executed when "my_function" is used
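A minimal sketch of a user-defined function that illustrates positional arguments, named arguments, default values and local scope. The function `gc_fraction` and its arguments are hypothetical, not part of R or any package.

```r
# Header: function name on the left, input arguments (one with a default) in parentheses.
gc_fraction <- function(gc_count, total_length = 100) {
  # Body: gc_local exists only while the function is running (local scope).
  gc_local <- gc_count / total_length
  return(gc_local)
}

by_position  <- gc_fraction(40, 200)                          # arguments assigned in order: 0.2
by_name      <- gc_fraction(total_length = 200, gc_count = 40) # order changed via names: 0.2
with_default <- gc_fraction(40)                                # total_length defaults to 100: 0.4

# The variable created inside the function is not visible outside of it.
visible_outside <- exists("gc_local")                          # FALSE
```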
Matrix function
my_matrix = matrix(
data = ?, nrow = ?, ncol = ?, byrow = ?
)
Input arguments:
“data” → a vector of values or objects that will be put into the cells of the matrix
“nrow” → the number of rows in the matrix
“ncol” → the number of columns in the matrix
“byrow” → logical; if TRUE, values populate the table in order row-by-row; if FALSE, values populate the table in order column-by-column
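The effect of `byrow` can be seen by filling the same values into a matrix both ways; the values 1 to 6 are arbitrary.

```r
# Fill row by row: the first row is 1 2 3, the second row is 4 5 6.
by_row <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Fill column by column: the first row is 1 3 5, the second row is 2 4 6.
by_col <- matrix(data = 1:6, nrow = 2, ncol = 3, byrow = FALSE)

dim(by_row)   # 2 3
```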
cbind(A,B)
aka column bind; stitches A and B together into a single matrix such that the columns of B follow to the right of the columns of A
Subset
the ________ operator can also be used in combination with the assignment operator to modify specific elements inside of the matrix; a matrix can be subset and accessed in several different ways using the ________ operator [rows,columns]
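A short sketch that column-binds two small matrices and then reads and modifies cells with [rows,columns]. The matrices A and B hold arbitrary values.

```r
A <- matrix(1:4, nrow = 2)    # columns (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)    # columns (5, 6) and (7, 8)

AB <- cbind(A, B)             # 2 rows, 4 columns; B's columns sit to the right of A's

one_cell <- AB[1, 2]          # row 1, column 2: 3
row_two  <- AB[2, ]           # whole second row: 2 4 6 8
col_four <- AB[, 4]           # whole fourth column: 7 8

# Combined with assignment, the same operator modifies a cell in place.
AB[1, 1] <- 0
```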
Data Frames
the data frame is R's preferred data type to represent R x C data tables. Like a matrix it has R rows and C columns and it supports all subset operations [A,B] that matrices have access to; in contrast to a matrix, each column has its own associated data type and each column can have a different data type (this offers a lot of additional functionality)
Syntax: my_tab = data.frame(
column1 = c(...), column2 = c(...), column3 = c(...), ...
)
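A minimal data frame sketch with one column per data type. The table contents are invented; `stringsAsFactors = FALSE` is set explicitly so the text column stays a character vector in every R version.

```r
people <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(34, 28, 45),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Each column keeps its own data type.
class(people$name)     # "character"
class(people$age)      # "numeric"
class(people$smoker)   # "logical"

# Matrix-style subsetting with [rows, columns] works as well.
bens_age <- people[2, "age"]                  # 28
over_30  <- people[people$age > 30, "name"]   # "Ana" "Cy"
```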
Reading
the process of loading a file into memory
Function: read.table; it reads a simple text file containing table data into R and turns it into a data.frame object. It expects the file to contain plain text such that each line in the file represents a row of the table and columns are separated by a special character (usually a space “ ” or tab “\t” character); in the “people.txt” file columns are separated by the “” character
Syntax: my_tab = read.table(
file = “…”, header = TRUE, sep = “…”
)
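A self-contained sketch of `read.table`. Because the contents of “people.txt” are not shown here, the example writes its own small tab-separated file to a temporary location first; the file contents and column names are made up for illustration.

```r
# Write a tiny tab-separated table to a temporary file (a stand-in for a real data file).
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tage", "Ana\t34", "Ben\t28"), tmp)

# header = TRUE treats the first line as column names; sep gives the column separator.
tab <- read.table(file = tmp, header = TRUE, sep = "\t")

class(tab)   # "data.frame"
```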