Exploratory Data Analysis Flashcards

1
Q

What is another name for Exploratory Data Analysis?

A

EDA. Statisticians call the task of systematically exploring your data through visualisation and transformation exploratory data analysis, or EDA for short.

2
Q

What does it mean that EDA is an iterative cycle?

A

You:

Generate questions about your data.

Search for answers by visualising, transforming, and modelling your data.

Use what you learn to refine your questions and/or generate new questions.

3
Q

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. What does this mean?

A

During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

4
Q

EDA is fundamentally a creative process. Discuss.

A

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

What type of variation occurs within my variables?

What type of covariation occurs between my variables?

5
Q

What is a variable?

A

A variable is a quantity, quality, or property that you can measure.

6
Q

What is a value?

A

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

7
Q

What is an observation?

A

An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.

8
Q

What is tabular data?

A

Tabular data is a set of values, each associated with a variable and an observation.

9
Q

When is tabular data tidy?

A

Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
In real life, most data isn’t tidy.

10
Q

What is variation?

A

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.

11
Q

How do you visualise the distribution of a variable?

A

It depends on whether the variable is categorical or continuous.

12
Q

How do you visualise a categorical variable?

A

A variable is categorical if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors.

To examine the distribution of a categorical variable, use a bar chart:

For example:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
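
The bar heights in such a chart are simply the number of observations in each category. As a minimal sketch, assuming the ggplot2 and dplyr packages are installed (ggplot2 supplies the diamonds dataset), the same counts can be computed by hand; the name cut_counts is only illustrative:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides count() and the pipe

# Count the observations in each category of cut.
# These counts are the heights that geom_bar() would draw.
cut_counts <- diamonds %>%
  count(cut)

print(cut_counts)
```

Comparing the table with the chart is a quick way to check that you are reading the bars correctly.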

13
Q

How do you visualise a continuous variable?

A

A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:

For example:

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
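
A histogram divides the x-axis into equally spaced bins and counts the observations in each one. The sketch below, which assumes ggplot2 and dplyr are installed, computes those bin counts by hand with ggplot2's cut_width() helper; bin_counts is an illustrative name:

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)

# Assign each carat value to a bin of width 0.5, then count per bin.
# This mirrors the calculation behind geom_histogram(binwidth = 0.5).
bin_counts <- diamonds %>%
  count(cut_width(carat, 0.5))

print(bin_counts)
```

It is worth trying several bin widths, since different widths can reveal different patterns.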

14
Q

Why is geom_freqpoly() recommended instead of geom_histogram() for overlaid distributions?

A

If you wish to overlay multiple histograms in the same plot, I recommend using geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but instead of displaying the counts with bars, uses lines instead. It’s much easier to understand overlapping lines than bars.
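
As a rough illustration, assuming ggplot2 and dplyr are installed, the following overlays one frequency polygon per cut. The carat < 3 filter and the 0.1 bin width are choices made for this sketch, not requirements:

```r
library(ggplot2)
library(dplyr)

# Keep the bulk of the data so the lines are easier to compare
smaller <- diamonds %>%
  filter(carat < 3)

# One line per level of cut, all drawn on the same axes
p <- ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

# Calling print(p) draws the plot
```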

15
Q

What can typical values tell you in EDA?

A

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

Which values are the most common? Why?

Which values are rare? Why? Does that match your expectations?

Can you see any unusual patterns? What might explain them?

16
Q

What are unusual values (outliers)?

A

Outliers are observations that are unusual: data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.
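
One way to make rare values visible is to zoom in on the small counts and then pull the suspicious rows out for inspection. This is a sketch under assumptions: it uses the y variable of diamonds, and the cut-offs 3 and 20 are borrowed from the missing-values card rather than derived here.

```r
library(ggplot2)
library(dplyr)

# Zoom the y-axis so bars with very few observations become visible
p <- ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

# Pull out the rows whose y value falls outside the plausible range
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)

print(unusual)
```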

17
Q

How do you handle unusual values (missing values)?

A

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

  1. Drop the entire row with the strange values:

diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))

I don’t recommend this option: just because one measurement is invalid doesn’t mean all the measurements are. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!

  2. Instead, I recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA:

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
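
To see what the replacement approach does, the sketch below (assuming ggplot2 and dplyr are installed) checks that no rows are dropped and counts how many values became NA. The names n_missing and mean_y are illustrative:

```r
library(ggplot2)
library(dplyr)

# Replace implausible y values with NA instead of dropping rows
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Every row is kept; only the suspicious measurements are blanked out
n_missing <- sum(is.na(diamonds2$y))

# Summaries need na.rm = TRUE once missing values are present
mean_y <- mean(diamonds2$y, na.rm = TRUE)

print(n_missing)
print(mean_y)
```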

18
Q

What should you treat as “real”: the objects in your environment or your R scripts?

A

In the long run, you’ll be much better off if you consider your R scripts as “real”.

With your R scripts (and your data files), you can recreate the environment. It’s much harder to recreate your R scripts from your environment! You’ll either have to retype a lot of code from memory (making mistakes all the way) or you’ll have to carefully mine your R history.

There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:

Press Cmd/Ctrl + Shift + F10 to restart RStudio.
Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.

19
Q

How can you see your current working directory at a glance?

A

RStudio shows your current working directory at the top of the console.

20
Q

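How do you print the working directory from R code?

A

Run getwd() to print the current working directory. (setwd() changes the working directory; it does not print it.)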
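
A short base R sketch of the distinction: getwd() returns the current working directory, while setwd() changes it. The path in the commented-out line is hypothetical.

```r
# getwd() returns the current working directory as a string
wd <- getwd()
print(wd)

# setwd() changes the working directory instead of reporting it.
# The path below is hypothetical, so the call is left commented out.
# setwd("path/to/my/project")
```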

21
Q

What is an RStudio project?

A

R experts keep all the files associated with a project together (input data, R scripts, analytical results, figures). This is such a wise and common practice that RStudio has built-in support for it via projects.

22
Q

What workflow do RStudio projects give you?

A

In summary, RStudio projects give you a solid workflow that will serve you well in the future:

Create an RStudio project for each data analysis project.

Keep data files there; we’ll talk about loading them into R in data import.

Keep scripts there; edit them, run them in bits or as a whole.

Save your outputs (plots and cleaned data) there.

Only ever use relative paths, not absolute paths.

Everything you need is in one place, and cleanly separated from all the other projects that you are working on.

23
Q

How does R store categorical variables?

A

Factors are how R stores categorical data.
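
A small base R sketch of what that means in practice. The size labels are made-up example data; the point is that a factor stores a fixed set of levels plus an integer code for each value.

```r
# Hypothetical categorical data
sizes <- c("small", "large", "medium", "small")

# Build a factor with an explicit level order
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

levels(size_factor)      # "small" "medium" "large"
as.integer(size_factor)  # 1 3 2 1 (position of each value in the levels)
table(size_factor)       # counts per level, in level order
```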

24
Q

What is data wrangling?

A

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time.

Unfortunately, data wrangling is 80% of what a data scientist does. It’s where most of the real value is created and it’s the most thankless, difficult, and poorly understood job I know of. Nobody gets a degree in data wrangling. Nobody publishes papers on it. Nobody teaches how to do it. (Yes there are courses on how to use specific tools like R or Python to do simple joins and dupe removal, but they assume that you already know how and why you are wrangling.)

There are six steps in data wrangling:

  1. Gather data from inside and outside the firewall
  2. Understand (and document) your sources and their limitations
  3. Clean up the duplicates, blanks, and other simple errors
  4. Join all your data into a single table
  5. Create new data by calculating new fields and recategorizing
  6. Visualize the data to remove outliers and illogical results

The first four are straightforward albeit annoying. Most people do steps 1 and 3 and then jump in to do their analysis. They then spend several weeks discovering all kinds of additional errors as they try to get their models to work.

You can read more details here: The Value is in the Data (Wrangling)

One last thought. Step 5 is where almost all of the data science value is hidden. The creative intelligence used to imagine derivative variables is perhaps the single most valuable trait in a data scientist. It’s what separates the scientists from the pretenders.

25
Q

What is a tibble?

A

Tibbles are data frames, but they tweak some older behaviours to make life a little easier. They come from the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier.

26
Q

What is the relationship between the tidyverse and tibble?

A

The tibble package is part of the core tidyverse.

27
Q

What is the relationship between a data.frame and a tibble? How do you convert a data frame to a tibble?

A

Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble().
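
A minimal sketch of the conversion, assuming the tibble package is installed and using the built-in iris data frame as example input:

```r
library(tibble)

# iris ships with base R as a regular data frame
iris_tbl <- as_tibble(iris)

class(iris)      # "data.frame"
class(iris_tbl)  # "tbl_df" "tbl" "data.frame"
```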

28
Q

How do you create a tibble from individual vectors?

A

You can create a new tibble from individual vectors with tibble():

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26
29
Q

Does tibble() change the type of its inputs or the variable names?

A

No. It never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
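
The name-preserving behaviour is easy to check. In this sketch (assuming the tibble package is installed) the column name "my value" is a made-up non-syntactic name; tibble() keeps it, while data.frame() rewrites it by default.

```r
library(tibble)

# tibble() keeps the input type and the column names exactly as given
tb <- tibble(
  name = c("a", "b", "c"),
  `my value` = 1:3
)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(name = c("a", "b", "c"), `my value` = 1:3)

names(tb)       # "name" "my value"
names(df)       # "name" "my.value"
class(tb$name)  # "character"
```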

30
Q

What are some knitr chunk options in R Markdown?

A

There are a large number of chunk options in knitr documented at https://yihui.name/knitr/options. We list a subset of them below:

eval: Whether to evaluate a code chunk.
echo: Whether to echo the source code in the output document (someone may not prefer reading your smart source code but only results).
results: When set to 'hide', text output will be hidden; when set to 'asis', text output is written “as-is”, e.g., you can write out raw Markdown text from R code (like cat('Markdown is cool.\n')). By default, text output will be wrapped in verbatim elements (typically plain code blocks).
collapse: Whether to merge text output and source code into a single code block in the output. This is mostly cosmetic: collapse = TRUE makes the output more compact, since the R source code and its text output are displayed in a single output block. The default collapse = FALSE means R expressions and their text output are separated into different blocks.

warning, message, and error: Whether to show warnings, messages, and errors in the output document. Note that if you set error = FALSE, rmarkdown::render() will halt on error in a code chunk, and the error will be displayed in the R console. Similarly, when warning = FALSE or message = FALSE, these messages will be shown in the R console.

include: Whether to include anything from a code chunk in the output document. When include = FALSE, this whole code chunk is excluded in the output, but note that it will still be evaluated if eval = TRUE. When you are trying to set echo = FALSE, results = ‘hide’, warning = FALSE, and message = FALSE, chances are you simply mean a single option include = FALSE instead of suppressing different types of text output individually.
cache: Whether to enable caching. If caching is enabled, the same code chunk will not be evaluated the next time the document is compiled (if the code chunk was not modified), which can save you time. However, I want to honestly remind you of the two hard problems in computer science (via Phil Karlton): naming things, and cache invalidation. Caching can be handy but also tricky sometimes.
fig.width and fig.height: The (graphical device) size of R plots in inches. R plots in code chunks are first recorded via a graphical device in knitr, and then written out to files. You can also specify the two options together in a single chunk option fig.dim, e.g., fig.dim = c(6, 4) means fig.width = 6 and fig.height = 4.
out.width and out.height: The output size of R plots in the output document. These options may scale images. You can use percentages, e.g., out.width = '80%' means 80% of the page width.
fig.align: The alignment of plots. It can be 'left', 'center', or 'right'.
dev: The graphical device to record R plots. Typically it is 'pdf' for LaTeX output, and 'png' for HTML output, but you can certainly use other devices, such as 'svg' or 'jpeg'.
fig.cap: The figure caption.
child: You can include a child document in the main document. This option takes a path to an external file.
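
Options can be set per chunk in the chunk header, or once for the whole document. The sketch below shows the document-wide route with knitr::opts_chunk$set(), assuming the knitr package is installed; the particular values are illustrative, not recommended defaults.

```r
library(knitr)

# Set defaults for every later chunk; a single chunk can still override them
opts_chunk$set(
  echo = FALSE,         # hide source code in the output
  fig.width = 6,        # plot width in inches
  fig.height = 4,       # plot height in inches
  fig.align = "center"  # centre plots
)

# Read a default back to confirm it took effect
opts_chunk$get("echo")
```

In an R Markdown file this call usually sits in a setup chunk near the top of the document.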

31
Q

What is tribble()? What is its relationship with tibble?

A

tribble() is another way to create a tibble. Its name is short for transposed tibble.

tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
#> # A tibble: 2 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         2   3.6
#> 2 b         1   8.5
32
Q

What are the main differences between a tibble and a data.frame?

A

Printing and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str().

Sometimes you need more output than the default display. You can control the default print behaviour by setting options:

options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.

Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.

A final option is to use RStudio’s built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

nycflights13::flights %>%
View()

33
Q

How does subsetting work in a tibble compared with a data.frame?

A

Subsetting using $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

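library(tibble)

# Example tibble, assumed here so that the calls below can run
df <- tibble(x = runif(5), y = rnorm(5))

# Extract by name
df$x
df[["x"]]

# Extract by position
df[[1]]

To use these in a pipe, you’ll need to use the special placeholder .:

df %>% .$x
df %>% .[["x"]]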

Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
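
The difference in partial matching can be seen directly. This sketch assumes the tibble package is installed; the column name abc is made up for the example.

```r
library(tibble)

df <- data.frame(abc = 1:3)
tb <- as_tibble(df)

# A data.frame silently matches the partial name "a" to column "abc"
df_result <- df$a

# A tibble refuses: it returns NULL and warns about the unknown column.
# The warning is suppressed here only to keep the output short.
tb_result <- suppressWarnings(tb$a)

print(df_result)  # 1 2 3
print(tb_result)  # NULL
```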

34
Q

How do you convert a tibble to a data.frame, and why would you need to?

A

Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:

class(as.data.frame(tb))

35
Q

How can you check whether an object is a tibble?

A

You can use the function is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.

is_tibble(mtcars)
#> [1] FALSE
is_tibble(as_tibble(mtcars))
#> [1] TRUE

More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".

class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df"     "tbl"        "data.frame"
36
Q

What is the readr package?

A

readr, which is part of the core tidyverse, is used to load flat files into R.

37
Q

What does read_csv() do?

A

read_csv() reads comma-delimited files. Once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr.

38
Q

What does read_csv2() do?

A

read_csv2() reads semicolon-separated files (common in countries where , is used as the decimal mark).
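
A small sketch of the behaviour, assuming the readr package is installed. The inline data is made up, and suppressMessages() is used only to hide readr's informational messages.

```r
library(readr)

# Fields are separated by ";" and "," is the decimal mark
euro <- suppressMessages(
  read_csv2("price;qty\n1,5;2\n2,25;4")
)

print(euro)
euro$price  # 1.50 2.25
```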

39
Q

What does read_tsv() do?

A

read_tsv() reads tab-delimited files.

40
Q

What does read_delim() do?

A

read_delim() reads in files with any delimiter.

41
Q

How do you skip metadata lines at the top of a CSV file?

A

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines:

read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

42
Q

If the data has no column headers and you want to label the columns X1 to Xn, how do you do it?

A

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

read_csv("1,2,3\n4,5,6", col_names = FALSE)
#> # A tibble: 2 x 3
#>      X1    X2    X3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6
43
Q

How do you read data and name the columns at the same time?

A

Pass a character vector to col_names:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

Here "\n" is a shortcut for adding a new line, so the inline data does not need literal line breaks.

44
Q

Why do you always create a vector with c()?

A

c() is a generic function which combines its arguments.

The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.

?c states it stands for combine.

This is a good question, and the answer is kind of odd. “c”, believe it or not, stands for “combine”, which is what it normally does:

> c(c(1, 2), c(3))
[1] 1 2 3
But it happens that in R, a number is just a vector of length 1:

> 1
[1] 1
So, when you use c() to create a vector, what you are actually doing is combining together a series of 1-length vectors.
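
The coercion to a common type mentioned above can be seen in a few lines of base R; the variable names are illustrative.

```r
# Mixed inputs are coerced to a single common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"

# An integer and a double combine into a double vector
nums <- c(1L, 2.5)
class(nums)   # "numeric"

# Nested calls are flattened into one vector
flat <- c(c(1, 2), c(3))
length(flat)  # 3
```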

45
Q

How do you tell read_csv() which values represent missing values?

A

The na argument specifies the value (or values) used to represent missing values in your file:

read_csv("a,b,c\n1,2,.", na = ".")
#> # A tibble: 1 x 3
#>       a     b c    
#>   <dbl> <dbl> <lgl>
#> 1     1     2 NA
46
Q

read.csv() vs read_csv(): why prefer the readr functions?

A
  1. They are typically much faster (about 10x) than their base equivalents.
  2. They produce tibbles; they don’t convert character vectors to factors, use row names, or munge the column names. These are common sources of frustration with the base R functions.
  3. They are more reproducible. Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.
47
Q

What function would you use to read a file where fields were separated with “|”?

A

read_delim(file, delim = "|")
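
A runnable sketch of the same call, assuming the readr package is installed. The inline string stands in for a file, and suppressMessages() only hides readr's informational messages.

```r
library(readr)

# Inline data with "|" between fields (a file path would work the same way)
piped <- suppressMessages(
  read_delim("x|y\n1|2\n3|4", delim = "|")
)

print(piped)
```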

48
Q

How do you get the underlying representation of a string in R?

A

In R, we can get at the underlying representation of a string using charToRaw().
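
For example, in base R each character of an ASCII string maps to one byte, shown in hexadecimal; the string "Hadley" is just sample input.

```r
# Each hexadecimal number is one byte of the string
raw_bytes <- charToRaw("Hadley")
print(raw_bytes)  # 48 61 64 6c 65 79

# rawToChar() reverses the conversion
rawToChar(raw_bytes)  # "Hadley"
```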

49
Q

How do you write to a file using readr?

A

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv().

50
Q

How do you write a file in R?

A

Use write_csv() to save a data frame as a csv file, for example write_csv(challenge, "challenge.csv"). If you want to export a csv file to Excel, use write_excel_csv(). Note that the type information is lost when you save to csv. This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the file. There are two alternatives:

  1. write_rds() and read_rds() store data in R’s binary RDS format:

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")

  2. The feather package implements a fast binary file format that can be shared across programming languages:

library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")

Feather tends to be faster than RDS and is usable outside of R.
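
The loss of type information in a CSV round trip can be demonstrated with a small made-up tibble. This sketch assumes readr and tibble are installed and writes only to temporary files; all object names are illustrative.

```r
library(readr)
library(tibble)

# Hypothetical data with a factor column
original <- tibble(id = 1:2, grade = factor(c("low", "high")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV round trip: the factor comes back as plain character
write_csv(original, csv_path)
from_csv <- suppressMessages(read_csv(csv_path))

# RDS round trip: the column types are preserved
write_rds(original, rds_path)
from_rds <- read_rds(rds_path)

class(from_csv$grade)  # "character"
class(from_rds$grade)  # "factor"
```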

51
Q

Where is the R data import/export manual?

A

https://cran.r-project.org/doc/manuals/r-release/R-data.html

52
Q

What is tidy data?

A

Tidy data is a consistent way to organise data.

53
Q

What are three interrelated rules which make a dataset tidy?

A
  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
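
As an illustration of the three rules, the sketch below tidies tidyr's built-in table4a dataset, in which the year values are spread across two column names. It assumes a tidyr version that provides pivot_longer() (1.0.0 or later); tidy_cases is an illustrative name.

```r
library(tidyr)

# table4a is untidy: the columns `1999` and `2000` are values of a
# year variable, not variables themselves
print(table4a)

# Move the year columns into a single year column, one row per observation
tidy_cases <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

print(tidy_cases)
```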
54
Q

Why ensure that your data is tidy?

A
  1. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
  2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
55
Q

What are the core packages in the tidyverse?

A

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.

dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges

tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.

readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.

tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more, forcing you to confront problems earlier, typically leading to cleaner, more expressive code.

stringr provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations.

forcats provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.