R Flashcards
argument
(r) information that a function needs in order to run
variable
representation of a value in R that can be stored for use later during programming (can also be called OBJECT)
vector
a group of data elements of the same type stored in a sequence in R
Pipe
a tool in R for expressing a sequence of multiple operations, represented with “%>%”; takes the output of one statement and makes it the input of the next statement
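A minimal sketch of the pipe, assuming the dplyr package is installed (it re-exports `%>%`). The numbers are made up for illustration; the point is that the piped form and the nested form give the same result.

```r
library(dplyr)  # assumption: dplyr is installed; it provides %>%

# Piped: read left to right (take the vector, take square roots, then sum)
piped_total <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

# Nested: the same operations, read inside out
nested_total <- sum(sqrt(c(1, 4, 9)))

print(piped_total)
print(nested_total)
```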
The 4 types of Vectors
logical (TRUE, FALSE), character (words), integer (1L, 2L, 3L), double (2.5, 4.561)
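A quick way to check the four types is `typeof()`. The sketch below uses only base R; the vector names are hypothetical.

```r
# One small vector of each type named on the card
logical_vec   <- c(TRUE, FALSE, TRUE)
character_vec <- c("cut", "color", "clarity")
integer_vec   <- c(1L, 2L, 3L)
double_vec    <- c(2.5, 4.561)

types <- c(
  typeof(logical_vec),
  typeof(character_vec),
  typeof(integer_vec),
  typeof(double_vec)
)
print(types)

# A vector holds one type only, so mixing integer and double coerces to double
mixed <- c(1L, 2.5)
print(typeof(mixed))
```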
create a data frame
data.frame(x = c(1, 2, 3), y = c(1.4, 5.4, 10.4))
create a new folder
dir.create("destination_folder")
create a file
file.create("new_word_file.docx")
copy a file
file.copy("new_text_file.txt", "destination_folder")
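The three file cards above can be tried safely in a temporary directory. This is a sketch in base R; the `base_dir` location is an assumption chosen so nothing in a real project is touched.

```r
# Work inside a fresh temporary directory
base_dir <- tempfile("flashcard_demo_")
dir.create(base_dir)

dest_dir  <- file.path(base_dir, "destination_folder")
text_file <- file.path(base_dir, "new_text_file.txt")

dir.create(dest_dir)                      # create a new folder
file.create(text_file)                    # create an empty file
copied <- file.copy(text_file, dest_dir)  # copy the file into the folder

print(copied)
print(list.files(dest_dir))
```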
OR operator
| or ||
NOT operator
!
common function to preview data (1st 6 rows)
head()
these functions return a summary: a high-level view of each column in your data, arranged horizontally
str() and glimpse()
function for returning a list of column names from dataset
colnames()
renaming a column
rename(diamonds, carat_new = carat, cut_new = cut)
summarizing your data
summarize(diamonds, mean_carat = mean(carat))
separates plots by a characteristic
+ facet_wrap(~cut)
code for a scatter plot of the diamonds dataset: carat on the x axis, price on the y axis, dots colored by cut, and a separate plot for each cut
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point() +
facet_wrap(~cut)
packages (R)
units of reproducible R code
vignette
documentation that acts as a guide to an R package
browseVignettes()
filter by vitamin c dose 0.5
filtered_tg <- filter(ToothGrowth, dose == 0.5)
sort by tooth length (after a filter)
arrange(filtered_tg, len)
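A sketch of the filter-then-sort pattern from the two cards above, assuming dplyr is installed. `ToothGrowth` ships with R (columns `len`, `supp`, `dose`); the name `piped_tg` is hypothetical and shows the same steps written with the pipe.

```r
library(dplyr)  # assumption: dplyr is installed

# Step by step: keep the rows with dose 0.5, then sort by tooth length
filtered_tg <- filter(ToothGrowth, dose == 0.5)
sorted_tg   <- arrange(filtered_tg, len)

# The same two steps chained with the pipe
piped_tg <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

print(head(sorted_tg))
```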
Pipe operator shortcut
ctrl + shift + m
convert a date-time to a date
as_date() (in the lubridate package)
data frame
collection of columns
tibbles
data frames in the tidyverse; they never change the data type of your inputs (for example, a number is not turned into a string)
how to add a column to a dataframe
mutate(dataframe, column_new = column*100)
install tidyverse
install.packages("tidyverse")
after you’re done installing tidyverse, what is the next step?
load it: library(tidyverse)
Tibbles
only pull up first 10 rows of a dataset.
Never change the names of your variables,
or the data types of your inputs.
Part of tidyverse
how to read a csv file
read_csv()
import “hotel_bookings.csv” into R and save it as a data frame titled ‘bookings_df’
bookings_df <- read_csv("hotel_bookings.csv")
create another (smaller) data frame from an existing data frame (for example, with the "adr" and "adults" columns of the bookings_df data frame)
new_df <- select(bookings_df, adr, adults)
add a column to the dataframe: total = adr/adults
mutate(new_df, total = adr / adults)
skimr package
makes summarizing data really easy, lets you skim through it more quickly
janitor package
has functions for cleaning data
functions to get summaries of our dataframes
skim_without_charts(), glimpse(), head(), str(), select()
packages that simplify data cleaning tasks
skimr and janitor
select()
specifies certain columns or excludes columns
if you want all the columns in the penguins dataset EXCEPT the species column
penguins %>%
select( - species)
rename a column (in penguins dataset)
penguins %>%
rename(island_new = island)
make all columns uppercase (or lowercase)
rename_with(penguins, toupper) (or tolower)
clean_names()
ensures only characters, numbers and underscores in the names
%%
returns remainder after division
%/%
returns an integer value after division (5%/%2=2)
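A small worked example of `%/%` and `%%` together, in base R. The minutes value is made up; the two operators split it into whole hours and leftover minutes.

```r
total_minutes <- 135

hours   <- total_minutes %/% 60  # whole hours (integer division)
minutes <- total_minutes %% 60   # minutes left over (remainder)

print(c(hours = hours, minutes = minutes))

# The two results always recombine to the original value
recombined <- hours * 60 + minutes

# The card's own example
card_quotient  <- 5 %/% 2
card_remainder <- 5 %% 2
```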
4 kinds of operators
arithmetic, relational, logical, assignment
exponent
^
equal to
==
not equal to
!=
&&
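logical AND for single (length-one) values; use & to compare vectors element by element (R 4.3.0 and later raise an error if && is given a longer vector; earlier versions used only the first element)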
!
logical NOT
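The relational and logical operators on the cards above, sketched in base R on two made-up vectors. `&&` is applied to single values only, which works in both older and newer R versions.

```r
x <- c(3, 8, 12)
y <- c(3, 5, 20)

equal_to     <- x == y                  # element-wise comparison
not_equal_to <- x != y
both_true    <- (x > 2) & (y > 4)       # element-wise AND
either_true  <- (x > 10) | (y > 10)     # element-wise OR
negated      <- !equal_to               # logical NOT flips each value

# && works on one value from each side
first_pair <- (x[1] > 2) && (y[1] > 4)

print(equal_to)
print(both_true)
print(first_pair)
```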
arrange()
chooses what variable you want to sort by
sort by bill length (penguins) in descending order
penguins %>%
arrange(-bill_length_mm)
create a dataframe
assigning a name to something
view dataframe
View()
putting similar values together in a column
group_by()
leave the missing values out
drop_na( )
Get averages (or max values) of bill length per island penguins
penguins %>% group_by(island) %>% summarize (mean_bill_length_mm = mean (bill_length_mm))
(or replace mean with max)
get max and mean bill length for each species by island.
penguins %>% group_by(species, island) %>%
summarize(max_bl = max(bill_length_mm), mean_bl = mean(bill_length_mm))
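A sketch of the grouped summary, assuming dplyr and tidyr are installed. The `birds` data frame is a hypothetical stand-in for penguins with made-up values. It includes one missing value to show why `drop_na()` usually comes first: without it, `mean()` and `max()` return NA for that group.

```r
library(dplyr)  # assumption: installed
library(tidyr)  # assumption: installed; provides drop_na()

# Hypothetical stand-in for the penguins data
birds <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  island         = c("Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"),
  bill_length_mm = c(39.1, NA, 46.5, 50.0, 47.5)
)

bill_summary <- birds %>%
  drop_na(bill_length_mm) %>%        # leave the missing value out
  group_by(species, island) %>%      # one group per species/island pair
  summarize(
    max_bl  = max(bill_length_mm),
    mean_bl = mean(bill_length_mm),
    .groups = "drop"                 # return an ungrouped result
  )

print(bill_summary)
```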
only view Adelie penguins
penguins %>%
filter(species == "Adelie")
data cleaning packages
install.packages(c("tidyverse", "skimr", "janitor"))
import and save csv file “hotel bookings” as a dataframe
bookings_df <- read_csv("hotel_bookings.csv")
view only certain columns from a dataframe
trimmed_df <- select(bookings_df, column1, column2)
cleaning functions
- rename: (renames columns)
dataframe %>%
rename(column_new = column)
- unite: (combines columns)
dataframe %>%
unite(column1_2, c("column1", "column2"), sep = " ")
- mutate: (adds a column)
dataframe %>%
mutate(guests = babies + children + adults)
- summarize: (computes summary statistics)
dataframe %>%
summarize(newcolumn = mean(column), newcolumn1 = sum(column1))
transform data with these functions
separate( )
unite ( )
mutate ( )
separate( ) syntax
separate(dataframe, column, into = c("newcolumn1", "newcolumn2"), sep = " ")
unite( ) syntax
unite(dataframe, "newcolumn", column1, column2, sep = " ")
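A round-trip sketch of `separate()` and `unite()`, assuming tidyr is installed. The `staff` data frame and its column names are hypothetical.

```r
library(tidyr)  # assumption: tidyr is installed

# Hypothetical data: one column holding first and last name together
staff <- data.frame(
  id   = 1:3,
  name = c("Ada Lovelace", "Grace Hopper", "Alan Turing")
)

# Split the name column at the space into two new columns
split_names <- separate(staff, name,
                        into = c("first_name", "last_name"), sep = " ")

# Join the two columns back into one, separated by a space
rejoined <- unite(split_names, "name", first_name, last_name, sep = " ")

print(split_names)
print(rejoined)
```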
mutate( ) syntax
dataframe %>%
mutate(new_column = column/1000, new_column2 = column2/1000)
Convert data from wide to long or long to wide
pivot_longer( ), pivot_wider( )
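A sketch of going from wide to long and back, assuming tidyr is installed. The `wide` data frame and its booking counts are made up for illustration.

```r
library(tidyr)  # assumption: tidyr is installed

# Hypothetical wide data: one column per year
wide <- data.frame(
  hotel = c("City", "Resort"),
  y2015 = c(10, 20),
  y2016 = c(30, 40)
)

# Wide to long: the year columns become rows
long <- pivot_longer(wide, cols = c(y2015, y2016),
                     names_to = "year", values_to = "bookings")

# Long to wide: back to one column per year
back <- pivot_wider(long, names_from = year, values_from = bookings)

print(long)
print(back)
```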
makes sure column names are unique and consistent
clean_names( )
bias function (package, syntax)
SimDesign package, bias(actual, predicted)
sort hotel_bookings columns by lead time (most to least)
arrange(hotel_bookings, desc(lead_time))
how to find max & min lead time in hotel_bookings
max(hotel_bookings$lead_time)
min (hotel_bookings$lead_time)
average lead time in hotel_bookings
mean(hotel_bookings$lead_time)
Filter syntax into a “new_hotel_dataframe”
new_hotel_dataframe <- filter(hotel_bookings, column == value)
find min/max/mean lead times at the two hotels, call it “hotel_summary”
hotel_summary <- hotel_bookings %>%
group_by(hotel) %>%
summarise(average_lead_time = mean(lead_time),
max_lead_time = max(lead_time),
min_lead_time = min(lead_time))
functions that let you change your data
arrange( ), group_by( ), filter( )
making columns lower (or upper)case
rename_with(dataframe, tolower)
core concepts in ggplot2
aesthetics, geoms, facets, labels, and annotations
view palmerpenguins dataset
install.packages("palmerpenguins")
library("palmerpenguins")
data(penguins)
View(penguins)
two different geoms
geom_point and geom_bar
geom_point argument for flipper length as xaxis, and body mass g as yaxis
ggplot(data=penguins) +
geom_point(mapping = aes(x=flipper_length_mm,
y=body_mass_g))
geom
a geometric object used to represent your data (points, bars, lines and more)
aesthetic
a visual property of an object in your plot (position, color, shape or size)
mapping
matching up a specific variable in your dataset with a specific aesthetic
3 steps to plot a graph
- start with ggplot function and choose a dataset
- add a geom_ function to display your data
- map the variables you want to plot in the arguments of the aes( ) function
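The three steps above can be sketched as follows, assuming ggplot2 is installed. The `mpg` dataset ships with ggplot2 and is used here only as a stand-in; the object names are hypothetical.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Step 1: start with ggplot() and choose a dataset
base_plot <- ggplot(data = mpg)

# Steps 2 and 3: add a geom, and map variables to aesthetics inside aes()
scatter <- base_plot +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Calling print(scatter) draws the plot
```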
what other aesthetics can you add to variables
x,y, color, shape, size, alpha (transparency)
this geom shows general trends in data
geom_smooth
this aesthetic breaks out geom_smooth into pieces
aes(linetype = species)
this geom creates a little noise around each point
geom_jitter
When using geom_bar, the color aesthetic will…
only put outlines of the color around the bars, the “fill” aesthetic will fill in the color
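A side-by-side sketch of the two aesthetics, assuming ggplot2 is installed. It uses the `diamonds` dataset that ships with ggplot2; the object names are hypothetical.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# color: only the outline of each bar is colored
outlined <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

# fill: the inside of each bar is colored
filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

# Calling print(outlined) or print(filled) draws each plot
```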
data smoothing for plots with less than 1000 points
ggplot(data, aes(x = , y = )) +
geom_point() +
geom_smooth(method = "loess")
data smoothing for plots with more than 1000 points
ggplot(data, aes(x = , y = )) +
geom_point() +
geom_smooth(method = "gam", …)
facets
let you display smaller groups, or subsets, of your data
2 types of facets
facet_wrap, facet_grid
facet_wrap(~species)
lets us create a separate plot for each species
allows you to facet your plot with two variables;
facet_grid
vertically by the first variable, and horizontally by the second variable
~
tilde symbol
what rotates text 45 degrees to make it easier to read?
theme(axis.text.x = element_text(angle = 45))
how to add a label
labs(title = "Palmer Penguins", subtitle = "3 Species", caption = "collected by Dr.")
text INSIDE the grid of the plot
annotate function
“annotate” function syntax with font, size and tilt
annotate("text", x = 50, y = 50, label = "The largest", fontface = "bold", size = 4.5, angle = 25)
how to save a plot (2 ways)
1. Export (in the RStudio Plots tab)
2. ggsave("—.png")
find earliest year in hotel_bookings
min(hotel_bookings$arrival_date_year)
paste0
labs(subtitle = paste0("Data from: ", mindate, " to ", maxdate))
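A base R sketch of how `paste0()` builds the subtitle text. The `mindate` and `maxdate` values are hypothetical stand-ins for the results of `min()` and `max()` on a year column.

```r
# Hypothetical values; in practice these would come from min() and max()
mindate <- 2015
maxdate <- 2017

# paste0() joins its arguments with no separator, so spaces go inside the strings
subtitle_text <- paste0("Data from: ", mindate, " to ", maxdate)
print(subtitle_text)

# paste() inserts a space between arguments by default
spaced_text <- paste("Data from:", mindate, "to", maxdate)
print(spaced_text)
```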
ggsave syntax
ggsave("—.png", width = 7, height = 7)
R Markdown
file format for making dynamic documents with R
Markdown
a syntax for formatting plain text files
R Notebook
lets users run your code and shows the graphs and charts that visualize it
HTML
The set of markup symbols or codes used to create a webpage