WEEK 3 Flashcards
INDEXING
In R, we can use a logical vector built from one vector to index another vector.
INDEXING EXAMPLE PROGRAM
murders$rate <- murders$total / murders$population * 100000
index <- murders$rate <= 0.71
murders$state[index]
THE SUM FUNCTION
The function sum returns the sum of the entries of a vector. Logical vectors are coerced to numeric, with TRUE coded as 1 and FALSE as 0.
Thus we can count the states with a murder rate of at most 0.71 using:
sum(murders$rate <= 0.71)
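A minimal sketch of the coercion described above, using a small made-up vector of rates (the values are illustrative and are not taken from the murders data):

```r
# Hypothetical rates for five states (illustrative values only)
rate <- c(0.5, 2.1, 0.7, 3.4, 0.3)

low <- rate <= 0.71    # logical vector: TRUE where the rate is at most 0.71
as.numeric(low)        # TRUE becomes 1 and FALSE becomes 0
n_low <- sum(low)      # adding the 1s counts the TRUE entries
n_low
#> [1] 3
```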
LOGICAL OPERATOR PROGRAMMING EXAMPLE
west <- murders$region == "West"
safe <- murders$rate < 1
index <- west & safe
murders$state[index]
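To see how & combines two conditions entry by entry, here is a small sketch on a toy data frame. The data frame toy and all of its values are made up for illustration; only the pattern mirrors the example above.

```r
# Toy data frame standing in for murders (all values are made up)
toy <- data.frame(
  state  = c("A", "B", "C", "D"),
  region = c("West", "West", "South", "West"),
  rate   = c(0.8, 2.5, 0.6, 0.4),
  stringsAsFactors = FALSE
)

west <- toy$region == "West"   # TRUE TRUE FALSE TRUE
safe <- toy$rate < 1           # TRUE FALSE TRUE TRUE
index <- west & safe           # TRUE only where both conditions hold
safe_west <- toy$state[index]
safe_west
#> [1] "A" "D"
```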
WHICH FUNCTION
The function which converts a logical vector into the indexes of its TRUE entries, which helps us look up a specific entry.
example:
index <- which(murders$state == "California")
murders$rate[index]
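The difference between a logical index and the numeric index returned by which can be sketched with two short made-up vectors (the state subset and rates below are illustrative only):

```r
state <- c("Alabama", "California", "Texas")   # made-up subset of states
rate  <- c(2.8, 3.4, 3.2)                      # illustrative values

logical_index <- state == "California"   # FALSE TRUE FALSE
numeric_index <- which(logical_index)    # position of the TRUE entry

rate[logical_index]   # indexing with the logical vector
rate[numeric_index]   # indexing with the position gives the same entry
```

Both forms select the same entry; which is mainly useful when the positions themselves are needed.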
MATCH
This function tells us which indexes of a second vector match each of the entries of a first vector.
example:
index <- match(c("California", "New York", "Florida"), murders$state)
index
%in%
If, rather than an index, we want a logical vector that tells us whether or not each element of a first vector is in a second, we can use the operator %in%.
c("Boston", "Dakota", "Washington") %in% murders$state
#> [1] FALSE FALSE TRUE
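To contrast match with %in%, the sketch below applies both to the same lookup. The vector state is a made-up subset, not the real murders$state column.

```r
state <- c("Alabama", "California", "Florida", "New York")  # made-up subset

# match returns positions in `state`, or NA when there is no match
pos <- match(c("New York", "Dakota", "Florida"), state)
pos
#> [1]  4 NA  3

# %in% returns one TRUE/FALSE per element of the left-hand vector
found <- c("New York", "Dakota", "Florida") %in% state
found
#> [1]  TRUE FALSE  TRUE
```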
PLOT
The plot function can be used to make scatterplots.
example:
x <- murders$population / 10^6
y <- murders$total
plot(x, y)
The same plot can be made in one line using the with function:
with(murders, plot(population / 10^6, total))
HISTOGRAM
Histograms are a powerful graphical summary of a list of numbers that gives you a general overview of the values you have.
hist(murders$rate)
BOXPLOT
Boxplots provide a terser summary than histograms, but they are easier to stack with other boxplots.
murders$rate <- with(murders, total / population * 100000)
boxplot(rate~region, data = murders)
DPLYR
library(dplyr)
MUTATE FUNCTION
This function changes the data table by adding new columns or modifying existing ones. It does not add rows.
FILTER FUNCTION
This function keeps only the rows of the data table that satisfy a condition.
How to select a specific column in a data table?
By using the select function.
EXAMPLE FOR MUTATE - ADD A NEW COLUMN CALLED RATE IN MURDERS DATA TABLE
murders <- mutate(murders, rate = total / population * 100000)
Filter the states with a murder rate less than or equal to 0.71
filter(murders, rate <= 0.71)
Select only the state, region and rate columns, assign them to an object called new_table, and show the states with a murder rate of at most 0.71.
new_table <- select(murders, state, region, rate)
filter(new_table, rate<= 0.71)
Select only the state, region and rate columns and show the states with a murder rate of at most 0.71 in a single line of code.
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
MUTATE FUNCTION
The mutate function is used to add a column to a data set. It takes the data frame as its first argument, and the name and values of the new column as its second argument, using the convention name = values.
ADD MURDER RATE USING MUTATE FUNCTION
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)
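The same pattern can be tried without the dslabs package. The sketch below assumes only that dplyr is installed; the data frame toy and its numbers are invented for illustration.

```r
library(dplyr)

# Toy stand-in for murders (all values are made up)
toy <- data.frame(
  state      = c("A", "B"),
  population = c(1000000, 4000000),
  total      = c(20, 100),
  stringsAsFactors = FALSE
)

# Inside mutate, column names can be used directly, without toy$
toy <- mutate(toy, rate = total / population * 100000)
toy$rate
#> [1] 2.0 2.5
```

Note that mutate returns a new data frame; the result has to be assigned for the new column to be kept.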
Filter function to filter data
The filter function takes the data table as its first argument and the conditional statement as its second. It keeps only the rows for which the condition is TRUE.
Selecting columns with select
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
This selects only the state, region and rate columns of the murders data set and then keeps the rows with a rate of at most 0.71.
The pipe function
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe.
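A small sketch of this rule with ordinary numeric functions, assuming dplyr is loaded so that %>% is available. The nested call and the piped call compute the same value.

```r
library(dplyr)   # loading dplyr makes the %>% pipe available

# Nested form: read from the inside out
nested <- log2(sqrt(16))

# Piped form: 16 becomes the first argument of sqrt,
# and that result becomes the first argument of log2
piped <- 16 %>% sqrt() %>% log2()
piped
#> [1] 2
```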
summarize() function
The main purpose of summarize is to create a new summary table.
example:
s <- heights %>%
filter(sex == "Female") %>%
summarize(average = mean(height), standard_deviation = sd(height))
This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights.
pull() function
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000) %>%
pull(rate)
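Because summarize always returns a data frame, pull is used to extract the rate column as a vector. The resulting value is numeric, not a data frame.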
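The difference can be checked with class. This is a sketch on an invented two-row data frame (toy), assuming dplyr is installed:

```r
library(dplyr)

# Toy stand-in for murders (values are made up)
toy <- data.frame(population = c(1000000, 3000000), total = c(10, 30))

as_table <- toy %>%
  summarize(rate = sum(total) / sum(population) * 100000)
as_value <- as_table %>% pull(rate)

class(as_table)   # a data frame with one row and one column
class(as_value)   # a plain numeric vector
```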
group_by()
heights %>%
group_by(sex) %>%
summarize(average = mean(height), standard_deviation = sd(height))
The summarize function applies the summarization to each group separately.
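To see the per-group behavior without the heights data, here is a sketch on a made-up table (toy_heights and its four values are illustrative only):

```r
library(dplyr)

# Toy stand-in for heights (values are made up)
toy_heights <- data.frame(
  sex    = c("Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 72),
  stringsAsFactors = FALSE
)

by_sex <- toy_heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
by_sex   # one row per group: Female average 62, Male average 70
```

Without group_by, the same summarize call would return a single row computed over all four heights.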
arrange()
murders %>%
arrange(rate)
Note that the default behavior is to order in ascending order. In dplyr, the function desc
transforms a vector so that it is in descending order.
example:
murders %>%
arrange(desc(rate))
Nested sorting with arrange
murders %>%
arrange(region, rate)
Here we order by region, then within each region we order by murder rate.
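The sketch below shows nested sorting and desc on a four-row toy table (all names and values are made up), assuming dplyr is installed:

```r
library(dplyr)

# Toy stand-in for murders (values are made up)
toy <- data.frame(
  state  = c("A", "B", "C", "D"),
  region = c("West", "South", "West", "South"),
  rate   = c(3.0, 2.0, 1.0, 5.0),
  stringsAsFactors = FALSE
)

# Order by region first, then by rate within each region
sorted <- toy %>% arrange(region, rate)
sorted$state
#> [1] "B" "D" "C" "A"

# Order by rate from highest to lowest
highest_first <- toy %>% arrange(desc(rate))
highest_first$state
#> [1] "D" "A" "B" "C"
```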
What are tibbles?
A tibble (tbl) is a modern version of the data frame. The functions group_by and summarize always return this type of data frame. The group_by function returns a special kind of tbl, the grouped_df.
Do tibbles display better?
The print method for tibbles is more readable than that of a data frame. To see this, convert the data frame with as_tibble(murders) and print the result.
Are subsets of tibbles also tibbles?
If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar.
With tibbles this does not happen: subsetting columns with [ returns another tibble.
class(as_tibble(murders)[,4])
If you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $:
class(as_tibble(murders)$population)
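The three cases can be compared side by side on a small made-up data frame. This sketch assumes dplyr is loaded, which makes as_tibble available.

```r
library(dplyr)

# Made-up data frame for illustration
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

class(df[, 1])    # a regular data frame drops to an integer vector
class(tbl[, 1])   # a tibble stays a tibble (its class includes "tbl_df")
class(tbl$x)      # $ extracts the column as an integer vector
```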
Create a tibble using tibble?
To create a data frame in the tibble format, use the tibble function.
grades <- tibble(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
How to convert a regular data frame into a tibble?
To convert a regular data frame to a tibble, you can use the as_tibble function.
example: as_tibble(grades) %>% class()
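The dot operator?
The dot (.) stands for the data passed along by the pipe, so .$rate extracts the rate column as a vector instead of a data frame.
rates <- filter(murders, region == "South") %>%
mutate(rate = total / population * 10^5) %>%
.$rate
median(rates)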
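As a quick check that .$rate and pull(rate) give the same vector, here is a sketch on an invented three-row table (toy and its values are made up), assuming dplyr is installed:

```r
library(dplyr)

# Toy stand-in for murders (values are made up)
toy <- data.frame(
  region     = c("South", "South", "West"),
  total      = c(10, 30, 5),
  population = c(1000000, 1000000, 1000000),
  stringsAsFactors = FALSE
)

with_dot <- filter(toy, region == "South") %>%
  mutate(rate = total / population * 10^5) %>%
  .$rate

with_pull <- filter(toy, region == "South") %>%
  mutate(rate = total / population * 10^5) %>%
  pull(rate)

identical(with_dot, with_pull)   # both are the same numeric vector
median(with_dot)
```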
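The do function?
The do function applies a function to each group of a grouped data frame, with the dot passing the data for that group to the function. The function must return a data frame. Here my_summary is assumed to be a user-defined function of this kind.
heights %>%
group_by(sex) %>%
do(my_summary(.))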
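The notes do not define my_summary, so the version below is a hypothetical helper written only to make the example runnable: it takes a data frame with a height column and returns a one-row data frame. The table toy_heights is also made up.

```r
library(dplyr)

# Hypothetical helper (not from the notes): summarizes the height
# column of a data frame and returns a one-row data frame
my_summary <- function(dat) {
  data.frame(
    min    = min(dat$height),
    median = median(dat$height),
    max    = max(dat$height)
  )
}

# Toy stand-in for heights (values are made up)
toy_heights <- data.frame(
  sex    = c("Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 72),
  stringsAsFactors = FALSE
)

result <- toy_heights %>%
  group_by(sex) %>%
  do(my_summary(.))
result   # one row per sex, with columns sex, min, median and max
```

Because my_summary returns a data frame, do can bind the per-group results into a single table.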