WEEK 3: Working with R Flashcards
A data frame
is a collection of columns.
Data frame rules
First, columns should be named. Using empty column names can create problems with your results later on.
The data stored in your data frame can be many different types, like numeric, factor, or character.
Each column should contain the same number of data items, even if some of those data items are missing.
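A minimal sketch of these rules in base R. The `scores` data frame and its values are made up for illustration; `NA` stands in for a missing item so that both columns keep the same length.

```r
# Hypothetical data: every column is named and has three items.
scores <- data.frame(
  student = c("Ana", "Ben", "Caro"),
  score   = c(90, NA, 75)   # NA marks the missing item; the column still has length 3
)

# Columns of different lengths (3 and 2) are rejected by data.frame().
result <- tryCatch(
  data.frame(student = c("Ana", "Ben", "Caro"), score = c(90, 75)),
  error = function(e) "error"
)
result
```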
Tibbles
They make working with data easier.
They won’t change the data types of your inputs, for example by turning strings into factors.
Tibbles also never change the names of your variables.
They never create row names.
Tibbles make printing in R easier.
Data frames and tibbles are the building blocks for analysis in R so having set standards for how they’re built and dealt with is pretty important.
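A short sketch comparing a base data frame with a tibble, assuming the tibble package is installed. The column names and values are invented for illustration.

```r
library(tibble)

# Base R rewrites a column name that is not a valid R name.
df <- data.frame(`unit price` = c(1.5, 2.0), item = c("pen", "cup"))
names(df)        # the first name becomes "unit.price"

# A tibble keeps the name exactly as it was typed.
tb <- tibble(`unit price` = c(1.5, 2.0), item = c("pen", "cup"))
names(tb)        # the first name stays "unit price"
class(tb$item)   # the strings stay character, they are not turned into factors
```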
Tidy data standard
Tidy data refers to the principles that make data structures meaningful and easy to understand.
Variables are organized into columns.
Observations are organized into rows.
Each value must have its own cell.
The head function
Returns the first six rows by default.
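For example, using the built-in `mtcars` dataset (chosen here only for illustration), `head()` can be called with or without the `n` argument:

```r
# mtcars ships with base R and has 32 rows.
first_rows <- head(mtcars)         # default: the first six rows
nrow(first_rows)

top_three <- head(mtcars, n = 3)   # n asks for a different number of rows
nrow(top_three)
```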
str() and colnames()
str() shows the structure of the data frame; colnames() returns its column names.
The mutate function
Makes changes to our data frame.
mutate(name of the data frame to change, new column with its calculation if needed)
Ex: mutate(df, grammes = kilo * 1000), where df is a data frame with a kilo column
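A runnable sketch of `mutate()`, assuming the dplyr package is installed. The `parcels` data frame and its `kilo` column are hypothetical; converting kilograms to grams means multiplying by 1000.

```r
library(dplyr)

# Hypothetical data frame with weights in kilograms.
parcels <- data.frame(id = c("a", "b"), kilo = c(0.5, 2))

# mutate() returns a new data frame with the extra column.
parcels_g <- mutate(parcels, grammes = kilo * 1000)
parcels_g$grammes

# The original data frame is unchanged unless the result is assigned back to it.
names(parcels)
```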
There are three common sources for data
- A package with data that can be accessed by loading that package
- An external file like a spreadsheet or CSV that can be imported into R
- Data that has been generated from scratch using R code
Create a data frame from scratch
Create individual vectors of data and then combine them into a data frame using the data.frame() function.
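The idea can be sketched in base R as follows. The vector names and values are invented for illustration.

```r
# Individual vectors, one per future column.
city_names  <- c("Lima", "Oslo", "Kyoto")
populations <- c(10.1, 0.7, 1.5)   # made-up figures, in millions

# Combine the vectors; the argument names become the column names.
cities <- data.frame(city = city_names, population = populations)
cities
```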
To preview your data frame
colnames()
glimpse()
str()
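A quick sketch of the three preview functions on the built-in `mtcars` dataset, assuming dplyr is installed (it provides `glimpse()`):

```r
library(dplyr)   # provides glimpse()

column_names <- colnames(mtcars)   # character vector of the column names
column_names

str(mtcars)       # structure: type and first values of each column
glimpse(mtcars)   # similar summary, one column per line, sized to the console width
```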
readr
read_csv(): comma-separated values (.csv) files
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files
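A small sketch of three of these readers, assuming readr 2.0 or later is installed. To keep the example self-contained, the data is passed as literal text wrapped in `I()` instead of a file path; the contents are made up.

```r
library(readr)

# Comma-separated text read with read_csv().
items <- read_csv(I("name,qty\npen,3\ncup,5\n"), show_col_types = FALSE)

# Tab-separated text read with read_tsv().
items_tsv <- read_tsv(I("name\tqty\npen\t3\n"), show_col_types = FALSE)

# Any other delimiter goes through read_delim() with the delim argument.
items_semi <- read_delim(I("name;qty\npen;3\n"), delim = ";", show_col_types = FALSE)

items
```

With a real file, the first argument would be the path to that file.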
data() function
Loads datasets in R
If you run the data function without an argument, R will display a list of the available datasets.
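A brief sketch of both uses of `data()`. Storing the result in `available` is a choice made here so the list can be inspected in a script; in an interactive session, running `data()` on its own opens the list directly.

```r
# Without an argument: the list of datasets in the attached packages.
available <- data()
head(available$results[, "Item"])   # names of the first few datasets

# With a dataset name: load that dataset into the workspace.
data("mtcars")
dim(mtcars)
```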
readr_example
readxl_example
The way to use read_csv and read_excel
readr_example() with no argument –> displays the list of example files that come with readr
Ex: read_csv("file name")
read_csv(readr_example("file name"))
read_excel("file name")
read_excel(readxl_example("file name"))
excel_sheets("file name") –> lists the names of the individual sheets inside an Excel file.
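A sketch of these calls with example files that ship with the packages, assuming readr and readxl are installed. `mtcars.csv` and `datasets.xlsx` are bundled example files; the variable names are chosen here for illustration.

```r
library(readr)
library(readxl)

readr_example()                                  # lists the example files bundled with readr
csv_path <- readr_example("mtcars.csv")          # full path to one example file
cars <- read_csv(csv_path, show_col_types = FALSE)

readxl_example()                                 # lists the example files bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")
sheet_names <- excel_sheets(xlsx_path)           # names of the sheets in the workbook
iris_sheet <- read_excel(xlsx_path, sheet = "iris")
```

The `sheet` argument of `read_excel()` picks one sheet by name; without it, the first sheet is read.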
Data cleaning packages
The here package makes referencing files easier.
The skimr package makes summarizing data really easy and lets you skim through it more quickly.
The janitor package has functions for cleaning data.
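One function from each package, as a rough sketch that assumes here, skimr and janitor are installed. The path `data/sales.csv` and the `messy` data frame are hypothetical.

```r
library(here)
library(skimr)
library(janitor)

here()                                    # the project's root folder
sales_path <- here("data", "sales.csv")   # hypothetical path built relative to that root

skim_without_charts(mtcars)               # summary statistics for every column

# Hypothetical data frame with inconsistent column names.
messy <- data.frame(`First Name` = "Ana", `Total Sales` = 10, check.names = FALSE)
tidy <- clean_names(messy)                # lower case, words joined by underscores
names(tidy)
```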
Cleaning functions
The select() function is useful for pulling just a subset of variables from a large dataset.
The rename function makes it easy to change column names.
The rename_with() function can change column names to be more consistent.
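A compact sketch of the three functions on the built-in `mtcars` dataset, assuming dplyr is installed and that the subsetting function described above is `select()`. The new column name is made up for illustration.

```r
library(dplyr)

# Keep only a subset of the variables.
small <- select(mtcars, mpg, cyl, hp)

# rename() takes new_name = old_name.
renamed <- rename(small, miles_per_gallon = mpg)

# rename_with() applies a function to the column names, here toupper().
upper <- rename_with(small, toupper)

names(small)
names(renamed)
names(upper)
```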