Exploratory Data Analysis Flashcards

1
Q

What is another name for Exploratory Data Analysis?

A

Statisticians call this process exploratory data analysis, or EDA for short.

2
Q

What does it mean that EDA is an iterative cycle?

A

You:

Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.

Use what you learn to refine your questions and/or generate new questions.

3
Q

EDA is not a formal process with a strict set of rules; more than anything, EDA is a state of mind. What does this mean?

A

During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

4
Q

EDA is fundamentally a creative process. Discuss.

A

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

What type of variation occurs within my variables?

What type of covariation occurs between my variables?

5
Q

What is a variable?

A

A variable is a quantity, quality, or property that you can measure.

6
Q

What is a value?

A

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

7
Q

What is an observation?

A

An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.

8
Q

What is tabular data?

A

Tabular data is a set of values, each associated with a variable and an observation.

9
Q

When is tabular data tidy?

A

Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
In real life, most data isn’t tidy.

10
Q

What is Variation?

A

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.

11
Q

How do you visualise the distribution of a variable?

A

It depends on whether the variable is categorical or continuous.

12
Q

How do you visualise a categorical variable?

A

A variable is categorical if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors.

To examine the distribution of a categorical variable, use a bar chart:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

13
Q

How do you visualise a continuous variable?

A

A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

14
Q

Why is geom_freqpoly() recommended instead of geom_histogram()?

A

If you wish to overlay multiple histograms in the same plot, I recommend using geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but instead of displaying the counts with bars, uses lines instead. It’s much easier to understand overlapping lines than bars.
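A minimal sketch of overlaying distributions with geom_freqpoly(), using the diamonds data (the binwidth is chosen just for illustration):

library(ggplot2)
ggplot(data = diamonds, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)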

15
Q

What should you look for in typical values during EDA?

A

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

Which values are the most common? Why?

Which values are rare? Why? Does that match your expectations?

Can you see any unusual patterns? What might explain them?

16
Q

What are unusual values (outliers)?

A

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.
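One way to make rare outliers visible, sketched here with the diamonds data, is to zoom the y-axis down to small counts with coord_cartesian():

library(ggplot2)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))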

17
Q

How do you deal with unusual or missing values?

A

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

  1. Drop the entire row with the strange values:

diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))

I don’t recommend this option: just because one measurement is invalid doesn’t mean all the measurements are. Additionally, if you have low quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!

  2. Instead, I recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA:

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

18
Q

What is “real”: the objects in your environment, or your R scripts?

A

However, in the long run, you’ll be much better off if you consider your R scripts as “real”.

With your R scripts (and your data files), you can recreate the environment. It’s much harder to recreate your R scripts from your environment! You’ll either have to retype a lot of code from memory (making mistakes all the way) or you’ll have to carefully mine your R history.

There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:

Press Cmd/Ctrl + Shift + F10 to restart RStudio.
Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.

19
Q

How can you instantly identify your current working directory?

A

RStudio shows your current working directory at the top of the console:

20
Q

How do you print and change the working directory from code?

A

getwd() prints the current working directory; setwd() changes it.
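For example (the path here is hypothetical):

getwd()                       # print the current working directory
setwd("path/to/my/project")   # change the working directory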

21
Q

What is an RStudio project?

A

R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. This is such a wise and common practice that RStudio has built-in support for this via projects

22
Q

R Project

A

In summary, RStudio projects give you a solid workflow that will serve you well in the future:

Create an RStudio project for each data analysis project.

Keep data files there; we’ll talk about loading them into R in data import.

Keep scripts there; edit them, run them in bits or as a whole.

Save your outputs (plots and cleaned data) there.

Only ever use relative paths, not absolute paths.

Everything you need is in one place, and cleanly separated from all the other projects that you are working on.

23
Q

How does R store categorical variables?

A

Factors are how R stores categorical data.
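A minimal sketch of creating a factor by hand (the variable and levels are illustrative, not from the flashcards):

x <- factor(c("low", "high", "medium", "high"),
            levels = c("low", "medium", "high"))
levels(x)
#> [1] "low"    "medium" "high"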

24
Q

What is data wrangling?

A

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

Unfortunately, data wrangling is 80% of what a data scientist does. It’s where most of the real value is created and it’s the most thankless, difficult, and poorly understood job I know of. Nobody gets a degree in data wrangling. Nobody publishes papers on it. Nobody teaches how to do it. (Yes there are courses on how to use specific tools like R or Python to do simple joins and dupe removal, but they assume that you already know how and why you are wrangling.)

There are six steps in data wrangling:

Gather data from inside and outside the firewall
Understand (and document) your sources and their limitations
Clean up the duplicates, blanks, and other simple errors
Join all your data into a single table
Create new data by calculating new fields and recategorizing
Visualize the data to remove outliers and illogical results
The first four are straightforward albeit annoying. Most people do steps 1 and 3 and then jump in to do their analysis. They then spend several weeks discovering all kinds of additional errors as they try to get their models to work.

You can read more details here: The Value is in the Data (Wrangling)

One last thought. Step 5 is where almost all of the data science value is hidden. The creative intelligence used to imagine derivative variables is perhaps the single most valuable trait in a data scientist. It’s what separates the scientists from the pretenders.

25
Q

What is a tibble?

A

Tibbles are data frames, but they tweak some older behaviours to make life a little easier. The tibble package provides these opinionated data frames, which make working in the tidyverse a little easier.

26
Q

What is the relationship between the tidyverse and tibble?

A

The tibble package is part of the core tidyverse.

27
Q

What is the relationship between data.frame and tibble? How do you convert a data frame to a tibble?

A

Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble():
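For example, a minimal sketch using the built-in iris data frame:

library(tibble)
as_tibble(iris)   # coerces the data.frame iris to a tibble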

28
Q

How do you create a tibble from individual vectors?

A

You can create a new tibble from individual vectors with tibble()

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26
29
Q

Does tibble() change the type of its inputs or the names of variables?

A

No: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
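A minimal sketch of this behaviour (the column name letter is just an illustrative choice):

library(tibble)
tb <- tibble(letter = c("a", "b"))
class(tb$letter)
#> [1] "character"   # strings stay character; they are not converted to factors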

30
Q

What are some chunk options in knitr?

A

There are a large number of chunk options in knitr documented at https://yihui.name/knitr/options. We list a subset of them below:

eval: Whether to evaluate a code chunk.
echo: Whether to echo the source code in the output document (someone may not prefer reading your smart source code but only results).
results: When set to 'hide', text output will be hidden; when set to 'asis', text output is written "as-is", e.g., you can write out raw Markdown text from R code (like cat('Markdown is cool.\n')). By default, text output will be wrapped in verbatim elements (typically plain code blocks).
collapse: Whether to merge text output and source code into a single code block in the output. This is mostly cosmetic: collapse = TRUE makes the output more compact, since the R source code and its text output are displayed in a single output block. The default collapse = FALSE means R expressions and their text output are separated into different blocks.

warning, message, and error: Whether to show warnings, messages, and errors in the output document. Note that if you set error = FALSE, rmarkdown::render() will halt on error in a code chunk, and the error will be displayed in the R console. Similarly, when warning = FALSE or message = FALSE, these messages will be shown in the R console.

include: Whether to include anything from a code chunk in the output document. When include = FALSE, this whole code chunk is excluded in the output, but note that it will still be evaluated if eval = TRUE. When you are trying to set echo = FALSE, results = 'hide', warning = FALSE, and message = FALSE, chances are you simply mean a single option include = FALSE instead of suppressing different types of text output individually.
cache: Whether to enable caching. If caching is enabled, the same code chunk will not be evaluated the next time the document is compiled (if the code chunk was not modified), which can save you time. However, I want to honestly remind you of the two hard problems in computer science (via Phil Karlton): naming things, and cache invalidation. Caching can be handy but also tricky sometimes.
fig.width and fig.height: The (graphical device) size of R plots in inches. R plots in code chunks are first recorded via a graphical device in knitr, and then written out to files. You can also specify the two options together in a single chunk option fig.dim, e.g., fig.dim = c(6, 4) means fig.width = 6 and fig.height = 4.
out.width and out.height: The output size of R plots in the output document. These options may scale images. You can use percentages, e.g., out.width = '80%' means 80% of the page width.
fig.align: The alignment of plots. It can be 'left', 'center', or 'right'.
dev: The graphical device to record R plots. Typically it is 'pdf' for LaTeX output, and 'png' for HTML output, but you can certainly use other devices, such as 'svg' or 'jpeg'.
fig.cap: The figure caption.
child: You can include a child document in the main document. This option takes a path to an external file.
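A hedged example of how these options are written in a chunk header of an R Markdown document (the chunk label and the use of the built-in cars dataset are illustrative):

```{r cars-plot, echo=FALSE, fig.width=6, fig.height=4, fig.cap="A simple plot of the built-in cars data"}
plot(cars)
```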

31
Q

What is tribble()? What is its relationship with tibble?

A

tribble() is another way to create a tibble. It is short for “transposed tibble”.

tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5
32
Q

What are the main differences between a tibble and a data.frame?

A

printing and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str(). You can also control the default print behaviour by setting options:

options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.

Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen. A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

nycflights13::flights %>%
View()
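You can also override the print defaults for a single call, a minimal sketch assuming the tidyverse pipe and the nycflights13 package are available:

nycflights13::flights %>%
  print(n = 10, width = Inf)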

33
Q

How does subsetting work with a tibble vs a data.frame?

A

Subsetting uses $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

# Extract by name
df$x
df[["x"]]

# Extract by position
df[[1]]

To use these in a pipe, you’ll need to use the special placeholder .:

df %>% .$x
df %>% .[["x"]]

Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.

34
Q

How do you convert a tibble back to a data.frame, and why would you need to?

A

Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:

class(as.data.frame(tb))
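A minimal sketch, assuming tb is a tibble you created earlier (the name and contents here are illustrative):

library(tibble)
tb <- tibble(x = 1:3, y = c("a", "b", "c"))
class(as.data.frame(tb))
#> [1] "data.frame"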

35
Q

How can you check whether an object is a tibble?

A

You can use the function is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.

is_tibble(mtcars)
#> [1] FALSE
is_tibble(as_tibble(mtcars))
#> [1] TRUE

More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".

class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df"     "tbl"        "data.frame"
36
Q

What is the readr package?

A

It is used to load flat files into R, and is part of the core tidyverse.

37
Q

What is read_csv() used for?

A

read_csv() reads comma delimited files. Once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr.
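A minimal sketch (the file path is hypothetical; read_csv() prints the column specification it guessed for each column):

library(readr)
heights <- read_csv("data/heights.csv")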

38
Q

What is read_csv2() used for?

A

read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place).

39
Q

What is read_tsv() used for?

A

read_tsv() reads tab delimited files.

40
Q

What is read_delim() used for?

A

read_delim() reads in files with any delimiter.

41
Q

How do you drop header or metadata lines when reading a CSV?

A

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines;

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)

42
Q

If the data doesn’t have column headers and you want them labelled X1 to Xn, how do you do it?

A

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
43
Q

How do you read in data and name the columns at the same time?

A

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z")). The \n is a convenient shortcut for indicating a new line inside the string, so there is no need to break the code across actual new lines.

44
Q

Don’t forget: always create vectors with c(). What does c() do?

A

This is a generic function which combines its arguments.

The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.

?c states it stands for combine.

This is a good question, and the answer is kind of odd. “c”, believe it or not, stands for “combine”, which is what it normally does:

> c(c(1, 2), c(3))
[1] 1 2 3
But it happens that in R, a number is just a vector of length 1:

> 1
[1] 1
So, when you use c() to create a vector, what you are actually doing is combining together a series of 1-length vectors.

45
Q

How do you specify missing values in read_csv()?

A
read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 NA
46
Q

read.csv() vs read_csv()

A
  1. They are typically much faster (~10x) than the base R equivalents like read.csv().
  2. They produce tibbles, they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
  3. They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
47
Q

What function would you use to read a file where fields were separated with “|”?

A

read_delim(file, delim = "|")

48
Q

How do you get at the underlying representation of a string in R?

A

In R, we can get at the underlying representation of a string using charToRaw():
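For example:

charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79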

49
Q

How do you write data to a file using readr?

A

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv()
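A minimal sketch (assuming df is a data frame or tibble you want to save; the names are illustrative):

library(readr)
write_csv(df, "df.csv")
write_tsv(df, "df.tsv")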

50
Q

How do you write files in R, and how do you preserve column types?

A

If you want to export a csv file to Excel, use write_excel_csv(); otherwise use write_csv(), e.g. write_csv(challenge, "challenge.csv"). However, the type information is lost when you save to csv. This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load them in. There are two alternatives:

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")

The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")

Feather tends to be faster than RDS and is usable outside of R.

51
Q

Where is the R data import manual?

A

https://cran.r-project.org/doc/manuals/r-release/R-data.html

52
Q

What is tidy data?

A

A consistent way to organise your data is called tidy data.

53
Q

What are three interrelated rules which make a dataset tidy?

A
  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
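A small illustrative tibble that follows all three rules (the data here is made up for the example):

tibble::tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  cases   = c(10, 12, 20, 25)
)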
54
Q

Why ensure that your data is tidy?

A
  1. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
  2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
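A minimal sketch of point 2, using the tidy table1 dataset that ships with tidyr (assuming dplyr is loaded for mutate()): because each variable is its own column, vectorised arithmetic applies directly.

library(dplyr)
tidyr::table1 %>%
  mutate(rate = cases / population * 10000)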
55
Q

What packages are in the core tidyverse?

A

ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics.

dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges

tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.

readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.

tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code.

stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.

forcats: provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
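Attaching the core tidyverse loads all of these packages at once:

library(tidyverse)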