Data Analytics Flashcards by Max S

What is the operator to assign objects to variables?

<-
“assign operator”
The operator assigns the value on its right-hand side to the variable or object on its left-hand side.

c() function - what is it?

c() stands for combine (concatenate): it combines its arguments into a vector, so it is the usual way to define a vector.
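
As a minimal sketch of the two cards above, the snippet below defines a vector with c() and assigns it with <-. The variable name heights_cm and its values are made up for illustration.

```r
# Combine three numbers into a numeric vector and assign it to a name.
# `heights_cm` is a hypothetical variable used only for this example.
heights_cm <- c(172, 181, 165)

# c() can also append to an existing vector.
heights_cm <- c(heights_cm, 190)

length(heights_cm)  # number of elements in the vector
```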

tidy or untidy dataframe?

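In a tidy dataframe each variable has exactly one column (and each observation has one row); if one variable is spread across several columns, the dataframe is untidy.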

What is the pipe operator?

%>% - the pipe operator chains function calls so that code is more readable than nested functions: it takes whatever comes before the pipe and passes it as the first argument of the function that comes after, so the output of one function goes directly into the next.
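
A small sketch of the difference between nested calls and a pipe. The vector `values` is a made-up example, and the pipe is loaded here via dplyr (an assumption; any package that provides %>% would do).

```r
library(dplyr)  # attaches the %>% operator

values <- c(1, 4, 9)  # hypothetical example data

# Nested calls are read from the inside out.
nested <- sum(sqrt(values))

# The piped version reads left to right: values -> sqrt() -> sum().
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```

Both versions compute the same thing; the pipe only changes how the code reads.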

What function to use for making a dataframe tidy?

pivot_longer()
–> is used to transform data from a wide format (where each level of a variable has its own column) into a long format (where all values of a variable are placed into a single column)
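
A minimal sketch of that wide-to-long transformation, assuming the tidyr package is installed. The table `wide` and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: the variable "year" is spread over two columns.
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE  # keep the numeric column names as they are
)

# Gather the year columns into one "year" column and one "cases" column.
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)
long
```

The result has one row per country and year, so the number of rows grows while the number of columns shrinks.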

What does group_by() do?

The group_by() function in dplyr is used to group a data frame by one or more variables (columns), so that subsequent operations like summarizing, mutating, or filtering are applied to each group independently

What does summarise() do?

The summarise() function in dplyr is used to reduce or summarize a data frame by calculating a single value for each group of data. It is typically used after group_by() to perform aggregate calculations like finding the mean, sum, count, etc.
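
A short sketch of group_by() followed by summarise(), assuming dplyr is installed. The data frame `scores` and its columns are invented for illustration.

```r
library(dplyr)

# Hypothetical data: exam scores for two classes.
scores <- data.frame(
  class = c("a", "a", "b", "b", "b"),
  score = c(60, 80, 70, 90, 65)
)

# One output row per class, with the mean score and the group size.
by_class <- scores %>%
  group_by(class) %>%
  summarise(mean_score = mean(score), n = n())
by_class
```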

What does mutate() do?

A function which “mutates”, i.e. adds information: it computes transformations on columns, creating new columns from existing ones. To drop all non-transformed variables, use transmute().
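
To illustrate the difference between the two functions, here is a minimal sketch assuming dplyr is installed. The data frame `people` and the BMI calculation are hypothetical examples.

```r
library(dplyr)

# Hypothetical data: height in metres and weight in kilograms.
people <- data.frame(height_m = c(1.70, 1.80), weight_kg = c(65, 81))

# mutate() keeps the existing columns and appends the new one.
with_bmi <- people %>% mutate(bmi = weight_kg / height_m^2)

# transmute() keeps only the columns it creates.
only_bmi <- people %>% transmute(bmi = weight_kg / height_m^2)

names(with_bmi)
names(only_bmi)
```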

Confidence interval formula

x bar +/- z * s / sqrt(n)
(x bar = sample mean, z = critical value of the standard normal distribution, n = sample size, s = SD of sample)
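
The formula can be sketched in base R as follows. The sample `x` is made up, and the 95% level is an assumption; for small samples a t-critical value from qt() would normally replace z.

```r
# Hypothetical sample of measurements.
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)

x_bar <- mean(x)    # sample mean
s <- sd(x)          # sample standard deviation
n <- length(x)      # sample size

z <- qnorm(0.975)   # critical value for a 95% interval (about 1.96)
se <- s / sqrt(n)   # standard error of the mean

ci <- c(lower = x_bar - z * se, upper = x_bar + z * se)
ci
```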

What is the interpretation of the Confidence Interval?

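If we repeated the sampling many times and calculated a 95% confidence interval each time, about 95% of those intervals would contain the true population parameter (e.g. mu). It does not mean that x bar falls into the interval 95% of the time: x bar is always at the centre of its own interval.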

Parameter of the t-distribution

Degrees of freedom: df = n - 1 (n = sample size)

Hypothesis testing - z-test - formula

z-test = | (x bar - mu) / (sigma / sqrt(n)) |

The two outcomes for a hypothesis test

Compare the z-test statistic with z-critical:
–> z-test < z-critical: Fail to Reject the Null Hypothesis
–> z-test > z-critical: Reject the Null Hypothesis

–> Failing to reject H0 is not the same as accepting H0 - it simply means there isn’t strong enough evidence against H0, not that H0 is true
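
A minimal base R sketch of the z-test and the resulting decision. All numbers are hypothetical, and a two-sided test at the 5% level is assumed.

```r
# Hypothetical inputs: sample mean, hypothesised mean, population SD, sample size.
x_bar <- 52
mu0 <- 50
sigma <- 6
n <- 36

# Absolute value of the standardised distance between x bar and mu.
z_stat <- abs((x_bar - mu0) / (sigma / sqrt(n)))

# Two-sided critical value at the 5% significance level.
z_crit <- qnorm(0.975)

# TRUE means "reject H0", FALSE means "fail to reject H0".
reject_h0 <- z_stat > z_crit
reject_h0
```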

Hypothesis testing - comparing two samples - z-test (technically can also be a t-test) - to test whether two means are different or the same, i.e. H0: mu1 - mu2 = 0; H1: mu1 - mu2 is not 0 - set-up of formula for SE:

SE = sqrt(s1^2/n1 + s2^2/n2)
Use the Gaussian (normal) distribution if n1 and n2 are both > 30

Hypothesis testing - comparing two samples - t-test for comparing two sample means - formula

t-stat = ((x1 bar - x2 bar) - 0) / SE = …
0 because H0 is mu1 - mu2 = 0
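
The two-sample statistic can be sketched in base R and compared with t.test(), which runs Welch's test by default and uses the same standard error. The samples `a` and `b` are made-up data.

```r
# Hypothetical samples from two groups.
a <- c(5.1, 4.9, 5.6, 5.8, 5.2, 5.0, 5.4)
b <- c(4.6, 4.8, 5.0, 4.4, 4.9, 4.7, 4.5, 4.8)

# Standard error of the difference between the two sample means.
se <- sqrt(var(a) / length(a) + var(b) / length(b))

# Test statistic under H0: mu1 - mu2 = 0.
t_stat <- ((mean(a) - mean(b)) - 0) / se

# t.test() with its default var.equal = FALSE gives the same statistic.
welch <- t.test(a, b)
all.equal(unname(welch$statistic), t_stat)
```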

reorder() function

The reorder() function reorders categories (e.g. region_names) based on the numerical data provided (e.g. total_crime)

aes() function

In R, the aes() function is often used within other graphing elements to specify the desired aesthetics. The aes() function can be used in a global manner (applying to all of the graph’s elements) by nesting within ggplot()

function qt() - explain

qt() provides the t-critical value: given a probability and the degrees of freedom, it returns the value on the x-axis below which that probability (the area under the t-distribution) lies.

pt() is the inverse: it returns the probability that a t-distributed variable is less than a specified value.
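
A short base R sketch of how qt() and pt() undo each other. The choice of 9 degrees of freedom (a sample of size 10) is an arbitrary example.

```r
df <- 9  # hypothetical degrees of freedom, i.e. n - 1 for n = 10

# t-critical value with 97.5% of the area to its left (two-sided 95% test).
t_crit <- qt(0.975, df = df)

# pt() maps the critical value back to the probability we started from.
p_back <- pt(t_crit, df = df)

c(t_crit = t_crit, p_back = p_back)
```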

Explain what the :: operator does

:: operator says package::function i.e. from the stated package, use the stated function.

Explain what the functions pnorm()/xpnorm() and qnorm() do?

pnorm() gives the probability to the left of a specific z value in a Standard Normal distribution, i.e. 𝑃(𝑍 < 𝑧); xpnorm() (from the mosaic library) additionally reports the probability to the right, 𝑃(𝑍 > 𝑧) - e.g. pnorm(1.96) = 0.975

qnorm(): for a given probability, returns the corresponding z value such that 𝑃(𝑍 < 𝑧) equals that probability - e.g. qnorm(.975, mean=0, sd=1) = 1.96

xxx_join and full_join(A,B)

full_join(A,B) - It joins tibble A and tibble B and keeps all rows from both tibbles; rows that match are combined. If no by argument is given, it joins automatically on the column(s) the two tibbles have in common (e.g. colName). Where there is no matching value, the missing entries are filled with NA.
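
A minimal sketch of full_join(), assuming dplyr is installed. The tables A and B and the key column `id` are hypothetical, and `by` is given explicitly here rather than relying on automatic matching.

```r
library(dplyr)

# Hypothetical tables sharing the key column "id".
A <- data.frame(id = c(1, 2), x = c("a1", "a2"))
B <- data.frame(id = c(2, 3), y = c("b2", "b3"))

# Keep every id from both tables; unmatched cells become NA.
joined <- full_join(A, B, by = "id")
joined
```

Only id 2 appears in both tables, so ids 1 and 3 each get one NA.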

Discuss the plots that allow you to check whether assumptions of a Linear Regression model have been satisfied - and which R function is that?

The function name is autoplot() (from the ggfortify library). There are 3 plots for testing the residual behavior and they are:

Residuals vs Fitted: shows whether residuals are randomly scattered around zero (have an average of zero). The blue line should be roughly horizontal at 0.
Normal Q-Q plot: points should be located on the diagonal, which would mean the residuals are normally distributed.
Scale-Location: the blue line should be horizontal, which would mean we have a case of constant variance.
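
A sketch of how these plots might be produced, assuming the ggplot2 and ggfortify packages are installed. The model on the built-in mtcars data is an arbitrary example, and which = 1:3 is used on the assumption that it selects the three plots listed above.

```r
library(ggplot2)
library(ggfortify)

# Hypothetical model: fuel consumption explained by car weight.
model <- lm(mpg ~ wt, data = mtcars)

# Draw Residuals vs Fitted, Normal Q-Q and Scale-Location.
autoplot(model, which = 1:3)
```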

identify dplyr verbs?

filter() selects rows; group_by() groups by a given column; mutate() creates a new column with the result of the transform on existing columns;

Explain what the t.test() R function is used for?

t.test() is used to test hypotheses. It provides information on:
- Confidence Interval
- t-stat
- p-value
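
A short base R sketch showing where those three pieces sit in the t.test() result. The sample `x` and the hypothesised mean of 12 are made up.

```r
# Hypothetical sample and hypothesised population mean.
x <- c(12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7)

res <- t.test(x, mu = 12, conf.level = 0.95)

res$conf.int   # confidence interval for the mean
res$statistic  # t-stat
res$p.value    # p-value
```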

What does the separate function do?

separate() has been superseded in favour of separate_wider_position() and separate_wider_delim() because the two functions make the two uses more obvious, the API is more polished, and the handling of problems is better. Superseded functions will not go away, but will only receive critical bug fixes. Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

what does pivot_longer do?

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider()

function to make dataset tidy?

pivot_longer()

Basic Data Types in R

Untidy vs. Tidy Data

DPLYR Package - key verbs

Key Verbs: select, filter, arrange, rename, mutate, summarise, group_by

select()

select() picks columns (variables) directly by their names

filter()

filter() picks rows (observations) by their values using logical conditions

arrange()

arrange() reorders rows by the values of one or more variables while preserving the order of the columns
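
The three verbs above can be chained in one pipeline. This is a minimal sketch assuming dplyr is installed; it uses the built-in mtcars data, and the chosen columns and condition are arbitrary.

```r
library(dplyr)

cars_small <- mtcars %>%
  select(mpg, cyl, wt) %>%   # keep three columns by name
  filter(cyl == 4) %>%       # keep only the four-cylinder cars
  arrange(desc(mpg))         # sort rows from highest to lowest mpg

head(cars_small)
```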

Data Visualisation with ggplot2

stats::t.test

Use a function from the stats package to produce the results from all three approaches in a single table (i.e. the CI approach, z/t-stat approach, p-value approach)

reorder()

This function rearranges the levels of a factor variable based on a specified numeric summary (e.g., sum or mean of another variable). In the context of ggplot(aes(x = reorder())), it helps sort the bars or categories along the x-axis based on the values of another variable.

geom_point()

The point geom is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables.

geom_smooth()

Aids the eye in seeing patterns in the presence of overplotting. geom_smooth() and stat_smooth() are effectively aliases
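
A small sketch combining geom_point() and geom_smooth(), assuming ggplot2 is installed. The mtcars variables and the choice of a linear smoother (method = "lm") are assumptions made for illustration.

```r
library(ggplot2)

# Scatterplot of two continuous variables with a fitted trend line.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

p
```

Because aes() is set inside ggplot(), both layers share the same x and y mapping.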

ggfortify library - autoplot()

Use the ggfortify library to plot 3 essential plots for residuals.

huxreg() table

favstats()

from mosaic library - “favstats” is short for “favorite statistics”: it will give you some of the most popular summary statistics for numerical variables (mean, max, median, sd, etc.)

What does geom mean?

Geometric objects (geoms) define the basic shape of the elements on the plot. Every geom has a default statistic.

geom_col()

This function creates a bar plot where the heights of the bars represent the actual values of a variable (rather than counts or frequencies, which geom_bar() would use by default).

coord_flip()

This function flips the x and y axes, effectively turning a vertical bar chart into a horizontal bar chart (or vice versa for other plot types).
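
reorder(), geom_col() and coord_flip() are often used together. The sketch below assumes ggplot2 is installed; the data frame `crime` and its values are invented, reusing the column names region_names and total_crime as examples.

```r
library(ggplot2)

# Hypothetical data: total crime per region.
crime <- data.frame(
  region_names = c("North", "South", "East"),
  total_crime = c(30, 10, 20)
)

# reorder() sorts the categories by total_crime (ascending by default).
ordered_levels <- levels(reorder(crime$region_names, crime$total_crime))

# Bar heights are the actual values; coord_flip() makes the bars horizontal.
p <- ggplot(crime, aes(x = reorder(region_names, total_crime), y = total_crime)) +
  geom_col() +
  coord_flip()

p
```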

Give definition of the Central Limit Theorem

A builder orders 200 planks of walnut flooring for the new office you work in. The mean and standard deviation of the distribution of weights (in kg) of the planks are 15 and 1.1 respectively. The probability of a random variable 𝑋 falling between two values 𝑎 and 𝑏 (where 𝑎 < 𝑏) can be found using: 𝑃(𝑎 < 𝑋 < 𝑏) = 𝑃(𝑋 < 𝑏) – 𝑃(𝑋 < 𝑎). Calculate the probability that the mean weight of the planks delivered is between 14.9 kg and 15 kg
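
One way to set up the plank calculation in base R, assuming the Central Limit Theorem applies so that the mean weight of the 200 planks is approximately normal with mean 15 and standard error 1.1 / sqrt(200). The variable names are chosen for illustration.

```r
n <- 200      # number of planks
mu <- 15      # mean weight of a plank (kg)
sigma <- 1.1  # standard deviation of plank weights (kg)

# Standard error of the sample mean under the Central Limit Theorem.
se <- sigma / sqrt(n)

# P(14.9 < X bar < 15) = P(X bar < 15) - P(X bar < 14.9)
p_between <- pnorm(15, mean = mu, sd = se) - pnorm(14.9, mean = mu, sd = se)
p_between  # roughly 0.40
```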

Carry out the hypothesis test based on the sample data to establish whether or not the mean difference between population battery lives of notepads from company A and company B equals zero. Provide a decision on the rejection or non-rejection of H0. Interpret the decision.

Construct 95% confidence intervals (CIs): one for battery life of notepad from company A, and one for battery life of notepad from company B. Show your workings. Based on the two CIs found, what can you conclude about difference between mean population battery lives of notepads from company A and company B?

How would you use Excel Solver to find the best integer solution?

* We need to add a constraint specifying that our decision variables are integer and re-solve.
* By default the Excel Solver will find a solution which is not guaranteed to be optimal, but will be within a tolerance of the optimal solution. This is controlled by a Solver option called “Integer optimality (%)”. By setting this option to zero, we can force the Solver to search for the true optimal solution, but this may increase the solving time. In that case we will no longer get the sensitivity analysis results.

Increase of a constraint above "allowable increase":

* The impact cannot be assessed using the solution of the base formulation.
* You would need to reformulate the problem, changing the right-hand side of the 4th constraint, and re-solve the problem.
* One thing you know is that the increase in profit will be at least ...

TRUE. The p-value approach provides the probability of obtaining a value of the test statistic as extreme as, or more extreme than, the actual value obtained when the null hypothesis is true, therefore allowing each individual to select their own significance level.

Type I and Type II errors

* Type I Error – rejecting the null hypothesis when it is true (its probability is the significance level, alpha).
* Type II Error – not rejecting the null hypothesis when it is false (its probability is beta).

Data Analytics Flashcards

(55 cards)