Introduction to R Flashcards
This deck/module focuses on mastering key points from Hadley Wickham's R for Data Science (https://r4ds.had.co.nz/introduction.html). The goal is fluid recall that can be deployed while coding.
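What is Hadley’s model of the tools needed in a typical data science project?
Import, tidy, transform, visualise, model, and communicate, with programming surrounding every step.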
What does it mean to tidy data?
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
In brief, when your data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.
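A minimal sketch of what "each column is a variable, each row is an observation" looks like in practice. The table and its values are hypothetical, and the example assumes the tidyr package (part of the tidyverse) is installed; pivot_longer() is one way to tidy this shape, not the only one.

```r
library(tidyr)

# Hypothetical untidy table: the "year" variable is hidden in the column names
untidy <- data.frame(
  city = c("Austin", "Boston"),
  `2019` = c(10, 20),
  `2020` = c(12, 18),
  check.names = FALSE
)

# Tidy form: one column per variable (city, year, cases),
# one row per observation (a city in a given year)
tidy <- pivot_longer(
  untidy,
  cols = c("2019", "2020"),
  names_to = "year",
  values_to = "cases"
)

tidy
```

Once the data is in this shape, every function that expects variables in columns can be applied without further reshaping.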
What does it mean to transform data?
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year),
creating new variables that are functions of existing variables (like computing speed from distance and time),
and calculating a set of summary statistics (like counts or means).
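The three kinds of transformation above can be sketched with dplyr, assuming the package is installed. The trips data frame and its column names are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical trip data
trips <- data.frame(
  city = c("Austin", "Austin", "Boston"),
  distance_km = c(10, 30, 20),
  time_h = c(0.5, 1.0, 0.5)
)

# 1. Narrow in on observations of interest (all trips in one city)
austin <- filter(trips, city == "Austin")

# 2. Create a new variable from existing ones (speed from distance and time)
austin <- mutate(austin, speed_kmh = distance_km / time_h)

# 3. Calculate summary statistics (a count and a mean)
trip_summary <- summarise(austin, n = n(), mean_speed = mean(speed_kmh))

trip_summary
```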
What is wrangling data?
Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight.
What are the two engines of knowledge generation?
Visualisation and modelling.
These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.
What is a good visualisation?
A good visualisation will show you things that you did not expect, or raise new questions about the data.
A good visualisation might also hint that you’re asking the wrong question, or you need to collect different data.
Visualisations can surprise you, but don’t scale particularly well because they require a human to interpret them.
What are models? When should you use them? What can or can’t they do?
Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains!
But every model makes assumptions, and by its very nature a model cannot question its own assumptions.
That means a model cannot fundamentally surprise you.
What is the last step of data science?
Communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
How does programming tie into data science?
Programming is a cross-cutting tool that you use in every part of the project.
You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
What are the two camps analysis can be divided into?
Hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis).
What are the two key regions in the RStudio interface?
What is the tidyverse?
A collection of R packages that share a common philosophy of data and R programming and are designed to work together.
What is a package in R?
Packages are the fundamental units of reproducible R code.
They include reusable functions, the documentation that describes how to use them, and sample data.
How do you install a package? Use the tidyverse package as an example.
install.packages("tidyverse")
What is the “prompt” in R?
The “>” symbol in the console, where you type R code.
When asking a question, what are the three things you need to include to make your example reproducible?
Required packages, data, and code.
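A minimal sketch of how those three parts might be laid out when asking a question, assuming ggplot2 is installed. The small data frame is a hypothetical stand-in for whatever data triggers the problem; dput() prints R code that recreates an object, so the output can be pasted into the question.

```r
# 1. Packages: load everything the example needs at the top
library(ggplot2)

# 2. Data: keep it small; dput() prints code that recreates the object
small <- head(mtcars[, c("mpg", "wt")], 3)
dput(small)

# 3. Code: the shortest snippet that still shows the problem
p <- ggplot(data = small) +
  geom_point(mapping = aes(x = wt, y = mpg))
p
```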
What is data exploration?
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
What is the goal of data exploration?
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
What is the grammar of graphics?
A coherent system for describing and building graphs.
What is ggplot2?
One of the most elegant and versatile R systems for making graphs. It implements the grammar of graphics.
It is one of the core packages of the tidyverse.
How do you load a library? Use tidyverse as an example
library(tidyverse)
What must you do with a package every time you start a new R session?
Reload it with library(). You only need to install a package once, but you must load it again in every new session.
How do you call a function from a specific package? Give an example.
package::function()
e.g. ggplot2::ggplot()
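A short sketch contrasting the two ways of reaching a package's functions, assuming ggplot2 is installed. The object names p1 and p2 are illustrative only.

```r
# Option 1: attach the package once per session, then use bare names
library(ggplot2)
p1 <- ggplot(data = mpg)

# Option 2: name the package explicitly with ::
# (this works without a prior library() call and makes the source obvious)
p2 <- ggplot2::ggplot(data = ggplot2::mpg)
```

The explicit form is handy when two loaded packages define functions with the same name.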
What is a data frame?
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
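For illustration, a small data frame can be built in base R. The column names echo the mpg dataset, but the values here are made up.

```r
# Each column is a variable; each row is an observation (hypothetical values)
cars_df <- data.frame(
  model = c("a4", "civic", "mustang"),
  displ = c(1.8, 1.6, 4.6),
  hwy = c(29, 33, 26)
)

dim(cars_df)    # number of rows, then number of columns
names(cars_df)  # the variable names
str(cars_df)    # compact overview of each column's type and values
```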
How do you learn more about a function or dataset from a package loaded into R?
Type ? followed by its name in the console.
e.g. ?mpg
Provide code for a reusable template for making graphs with ggplot2
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
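As a sketch of how the template is reused, the two plots below fill in the same three slots (data, geom function, mappings) in different ways. They assume ggplot2 is installed and use its bundled mpg dataset; the choice of geoms and variables is only an example.

```r
library(ggplot2)

# Slots filled with: mpg, geom_point, x = displ / y = hwy
scatter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Same template, different geom function and mappings
boxes <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

scatter
boxes
```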
How do you begin a plot in ggplot2?
With ggplot().
What does the function ggplot() do?
ggplot() creates a coordinate system that you can add layers to.
The first argument of ggplot() is the dataset to use in the graph.
On its own, this creates an empty graph, e.g. ggplot(data = mpg).
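A brief sketch of the "empty graph plus layers" idea, assuming ggplot2 is installed. The names base_plot and layered_plot are illustrative only.

```r
library(ggplot2)

# ggplot() alone sets up the coordinate system; printing it shows a blank panel
base_plot <- ggplot(data = mpg)
base_plot

# Adding a geom with + puts a layer of points on that coordinate system
layered_plot <- base_plot +
  geom_point(mapping = aes(x = displ, y = hwy))
layered_plot
```

Because the base plot is an ordinary object, it can be saved once and extended with different layers later.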