Data Analytics Flashcards

1
Q

What is the operator for assigning objects to variables?

A

<-
The assignment operator: it assigns an object to a variable name, e.g. x <- 5

2
Q

c() function - what is it?

A

c() (combine, or concatenate) defines a vector by combining its arguments, e.g. c(1, 2, 3)
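
A minimal sketch of the assignment operator and c() working together. The vector names and values are made up for illustration.

```r
# Assign a numeric vector built with c() to the name `scores`
scores <- c(72, 85, 90)

# c() can also append values to an existing vector
scores_more <- c(scores, 64)

length(scores_more)  # number of elements in the extended vector
mean(scores)         # average of the original three values
```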

3
Q

tidy or untidy dataframe?

A
  • each variable should have exactly one column
  • each observation should have exactly one row
  • each value should have exactly one cell
4
Q

What is the pipe operator?

A

%>% - the pipe operator chains function calls so the code reads left to right instead of as nested functions. It takes the result of whatever comes before the pipe and passes it as the first argument of the function that comes after it.
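
To make the contrast with nested calls concrete, here is a small sketch. It assumes dplyr is installed (dplyr re-exports %>% from magrittr); the values vector is illustrative.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

values <- c(1, 4, 9, 16)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped form: reads left to right; each result becomes the
# first argument of the next function
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both forms compute the same thing; the piped version just lists the steps in the order they happen.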

5
Q

What function to use for making a dataframe tidy?

A

pivot_longer()
–> is used to transform data from a wide format (where each level of a variable has its own column) into a long format (where all values of a variable are placed in a single column)
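
A short sketch of the wide-to-long transformation, assuming tidyr is installed. The table, its column names, and the numbers are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 24),
  check.names = FALSE  # keep the year column names as they are
)

# Stack the year columns into a name column and a value column
long <- pivot_longer(
  wide,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

long
```

After pivoting, `year` is a single variable in its own column and every row is one country-year observation.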

6
Q

What does group_by() do?

A

The group_by() function in dplyr is used to group a data frame by one or more variables (columns), so that subsequent operations like summarizing, mutating, or filtering are applied to each group independently

7
Q

What does summarise() do?

A

The summarise() function in dplyr reduces a data frame to a single row of summary values for each group of data. It is typically used after group_by() to perform aggregate calculations like finding the mean, sum, count, etc.
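
A sketch of group_by() followed by summarise(), assuming dplyr is installed. The sales data frame is invented for the example.

```r
library(dplyr)

# Hypothetical sales records
sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 5, 15, 25)
)

# One output row per region, with two summary columns
by_region <- sales %>%
  group_by(region) %>%
  summarise(
    mean_amount = mean(amount),
    n_sales = n()
  )

by_region
```

Without the group_by() step, summarise() would collapse the whole data frame into a single row.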

8
Q

What does mutate() do?

A

mutate() “mutates” a data frame, i.e. adds information: it computes transformations on columns and creates new columns from existing ones. To keep only the transformed variables and drop all others, use transmute().
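
The difference between mutate() and transmute() can be sketched as follows, assuming dplyr is installed. The data frame and the BMI calculation are illustrative only.

```r
library(dplyr)

people <- data.frame(height_cm = c(170, 180), weight_kg = c(65, 81))

# mutate() keeps the existing columns and adds the new one
with_bmi <- people %>%
  mutate(bmi = weight_kg / (height_cm / 100)^2)

# transmute() keeps only the columns it creates
only_bmi <- people %>%
  transmute(bmi = weight_kg / (height_cm / 100)^2)

names(with_bmi)
names(only_bmi)
```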

9
Q

Confidence interval formula

A
  • x bar +/- z * s / sqrt(n)
  • (x bar = sample mean, z = critical value for the chosen confidence level, n = sample size, s = SD of sample)
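
As a worked example of the formula, the sketch below computes a 95% interval in base R. The summary statistics (x_bar, s, n) are hypothetical, and the critical value comes from qnorm().

```r
# Hypothetical sample summary
x_bar <- 50   # sample mean
s     <- 10   # sample standard deviation
n     <- 100  # sample size

# Critical value for a 95% interval: 2.5% in each tail
z <- qnorm(0.975)

# Standard error of the mean
se <- s / sqrt(n)

ci <- c(lower = x_bar - z * se, upper = x_bar + z * se)
round(ci, 2)
```

With these inputs the standard error is 1, so the interval is roughly 50 plus or minus 1.96.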
10
Q

What is the interpretation of the Confidence Interval?

A

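If the sampling were repeated many times and a 95% confidence interval were computed from each sample, about 95% of those intervals would contain the true population parameter (e.g. the population mean). It is the interval that changes from sample to sample, so the statement is not that x bar falls into one fixed interval 95% of the time.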

11
Q

Parameter of the t-distribution

A

Degrees of freedom: n - 1 (where n is the sample size)

12
Q

Hypothesis testing - z-test - formula

A

z-test = | (x bar - mu) / (sigma / sqrt(n)) |

13
Q

The two outcomes for a hypothesis test

A

Compare the z-test statistic with z-critical:
–> z-test < z-critical: Fail to Reject the Null Hypothesis
–> z-test > z-critical: Reject the Null Hypothesis

–> Failing to reject H0 is not the same as accepting H0 - it simply means there isn’t strong enough evidence against H0, not that H0 is true
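
The decision rule can be sketched in base R. All inputs here (x_bar, mu0, sigma, n, and the 5% significance level) are hypothetical values chosen for illustration.

```r
# Hypothetical inputs
x_bar <- 52   # sample mean
mu0   <- 50   # mean under H0
sigma <- 10   # population standard deviation
n     <- 100  # sample size
alpha <- 0.05 # significance level (two-sided test)

# Absolute value of the z statistic
z_stat <- abs((x_bar - mu0) / (sigma / sqrt(n)))

# Two-sided critical value
z_crit <- qnorm(1 - alpha / 2)

decision <- if (z_stat > z_crit) "reject H0" else "fail to reject H0"
decision
```

Here z_stat is 2 and z_crit is about 1.96, so the sketch ends in the "reject H0" branch.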

14
Q

Hypothesis testing - comparing two samples - z-test (technically can also be a t-test) - to test whether two means are different or the same, i.e. H0: mu1 - mu2 = 0; H1: mu1 - mu2 is not 0 - set-up of formula for SE:

A
  • SE = sqrt(s1^2/n1 + s2^2/n2)
  • Use the Gaussian (normal) distribution if n1 and n2 are both > 30
15
Q

Hypothesis testing - comparing two samples - t-test for comparing two sample means - formula

A

t-test = ((x1 bar - x2 bar) - 0) / SE = …
0 because H0 is mu1 - mu2 = 0
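
A worked sketch of the standard error and the test statistic for two samples, in base R. The sample means, standard deviations, and sizes are hypothetical and were picked so the arithmetic comes out clean.

```r
# Hypothetical summaries for two independent samples
x1_bar <- 106; s1 <- 12; n1 <- 36
x2_bar <- 100; s2 <- 15; n2 <- 45

# Standard error of the difference in means
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# Test statistic; the 0 is the difference under H0: mu1 - mu2 = 0
t_stat <- ((x1_bar - x2_bar) - 0) / se

se
t_stat
```

With these numbers s1^2/n1 is 4 and s2^2/n2 is 5, so SE is 3 and the statistic is 2. Both samples are larger than 30, so the statistic could be compared against a normal critical value.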

16
Q

reorder() function

A

The reorder() function reorders categories (e.g. region_names) based on the numerical data provided (e.g. total_crime)

17
Q

aes() function

A

In ggplot2, the aes() function maps variables in the data to visual properties (aesthetics) such as x, y, colour, or size. It can be used inside an individual geom, or in a global manner (applying to all of the graph’s elements) by nesting it within ggplot().

18
Q

function qt() - explain

A

qt() provides the t-critical value (students may also talk about the probability that it expects, and the degrees of freedom)

Given a probability and the degrees of freedom, it returns the value on the x-axis below which that probability (area under the t-distribution) lies. pt() is the inverse: it returns the probability that x is less than a specified value.
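
The relationship between qt() and pt() can be checked directly in base R. The degrees of freedom and the probability below are arbitrary example values.

```r
df <- 24  # degrees of freedom, e.g. a sample of n = 25

# t-critical value with 97.5% of the area to its left
t_crit <- qt(0.975, df)

# pt() maps that value back to the probability
p_back <- pt(t_crit, df)

t_crit
p_back
```

For a two-sided 95% interval, 0.975 is used because 2.5% of the area is left in each tail.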

19
Q

Explain what the :: operator does

A

The :: operator is written package::function, i.e. from the stated package, use the stated function. It works without attaching the package via library() first.

20
Q

Explain what the functions xpnorm() and qnorm() do?

A

xpnorm() (from the mosaic package) provides the probability to the left and to the right of a specific z value in a Standard Normal distribution, i.e. 𝑃(𝑍 < 𝑧) as well as 𝑃(𝑍 > 𝑧) for a normally distributed standardized random variable z.

qnorm() does the reverse: for a given probability p, it returns the z value for which 𝑃(𝑍 < 𝑧) = p.
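
A base R sketch of the two directions. Note the assumption: xpnorm() lives in the mosaic package, so this example uses pnorm() from base R as a stand-in for the left-tail probability and derives the right tail by hand.

```r
z <- 1.96

# Probability to the left of z (stand-in for what xpnorm() reports)
p_left <- pnorm(z)

# Probability to the right of z
p_right <- 1 - p_left

# qnorm() goes the other way: probability in, z value out
z_back <- qnorm(p_left)

c(p_left = p_left, p_right = p_right, z_back = z_back)
```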

21
Q

xxx_join and full_join(A,B)

A

full_join(A,B) - It joins tibble A and tibble B. If no by argument is given, it automatically joins on the columns the two tibbles have in common (e.g. a shared column called colName). The result contains all rows from both tibbles, with matching rows combined. Where a row has no match in the other tibble, the missing values are filled with NA.
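
A small sketch of full_join(), assuming dplyr is installed. The two tables and the key column id are hypothetical; the key is named explicitly with by = "id".

```r
library(dplyr)

A <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
B <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

# Keep every row from both tables; unmatched cells become NA
joined <- full_join(A, B, by = "id")

joined
```

Here id 1 exists only in A, so its y is NA, and id 4 exists only in B, so its x is NA.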

22
Q

Discuss the plots that allow you to check whether assumptions of a Linear Regression model have been satisfied - and which R function is that?

A

The function name is autoplot(). There are 3 plots for testing the residual behavior and they are:

  • Residuals vs Fitted: shows whether the residuals are randomly scattered around zero (have an average of zero). This is shown with the blue line, which should be horizontal at 0.
  • Normal Q-Q plot: points should lie on the diagonal, which would mean the residuals are normally distributed.
  • Scale-Location: should have a horizontal blue line, which would mean we have a case of constant variance of the residuals.
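
A sketch of how these plots might be produced, assuming ggplot2 and ggfortify are installed. The model (mpg explained by wt in the built-in mtcars data) is only an example, and which = 1:3 is used here to request the three plots described above.

```r
library(ggplot2)
library(ggfortify)  # provides autoplot() support for lm objects

# Example linear model on a built-in data set
model <- lm(mpg ~ wt, data = mtcars)

# Residuals vs Fitted, Normal Q-Q, and Scale-Location
autoplot(model, which = 1:3)
```
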
23
Q

identify dplyr verbs?

A

filter() selects rows; group_by() groups by a given column; mutate() creates a new column with the result of a transformation on existing columns.

24
Q

Explain what the t.test() R function is used for?

A

t.test() is used to test hypotheses. It provides information on:
- Confidence Interval
- t-stat
- p-value
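
A one-sample sketch showing where each of these pieces sits in the result. t.test() is part of the stats package that ships with R; the data vector and the hypothesized mean mu = 5 are made up.

```r
# Hypothetical measurements
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2)

# H0: the population mean is 5
res <- t.test(x, mu = 5, conf.level = 0.95)

res$conf.int   # confidence interval for the mean
res$statistic  # t statistic
res$p.value    # p-value
res$parameter  # degrees of freedom (n - 1)
```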

25
Q

What does the separate function do?

A

separate() has been superseded in favour of separate_wider_position() and separate_wider_delim() because the two functions make the two uses more obvious, the API is more polished, and the handling of problems is better. Superseded functions will not go away, but will only receive critical bug fixes.

Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
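
A minimal sketch of separate() with a delimiter, assuming tidyr is installed. The data frame and column names are hypothetical. separate() is superseded but still available.

```r
library(tidyr)

df <- data.frame(period = c("2023-01", "2024-12"))

# Split one character column into two at the "-"
parts <- separate(df, period, into = c("year", "month"), sep = "-")

parts
```

The new columns are character columns unless they are converted afterwards.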

26
Q

what does pivot_longer do?

A

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider()

27
Q

function to make dataset tidy?

A

pivot_longer()

28
Q

Basic Data Types in R

A
29
Q

Untidy vs. Tidy Data

A
30
Q

DPLYR Package - key verbs

A

Key Verbs: select, filter, arrange, rename, mutate, summarise, group_by

31
Q

select()

A

select() picks columns (variables) directly by their names

32
Q

filter()

A

filter() picks rows (observations) by their values using logical conditions

33
Q

arrange()

A

arrange() reorders rows according to the values of one or more variables while preserving the order of the columns
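
The three verbs select(), filter(), and arrange() are often chained. A sketch, assuming dplyr is installed; the cars data frame is invented for the example.

```r
library(dplyr)

cars <- data.frame(
  model = c("a", "b", "c", "d"),
  mpg   = c(21, 33, 18, 27),
  cyl   = c(6, 4, 8, 4)
)

result <- cars %>%
  select(model, mpg) %>%  # keep two columns
  filter(mpg > 20) %>%    # keep rows meeting the condition
  arrange(desc(mpg))      # sort rows, highest mpg first

result
```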

34
Q

Data Visualisation with ggplot2

A
35
Q

stats::t.test

A

Use a function from the stats package to produce the results from all three approaches in a single table (i.e. the CI approach, z/t-stat approach, p-value approach)

36
Q

reorder()

A

This function rearranges the levels of a factor variable based on a specified numeric summary (e.g., sum or mean of another variable). In the context of ggplot(aes(x = reorder())), it helps sort the bars or categories along the x-axis based on the values of another variable.
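
The effect on factor levels can be seen in base R without any plotting. The region names and crime totals are hypothetical; by default reorder() sorts levels in increasing order of the summary value.

```r
region      <- factor(c("east", "north", "west"))
total_crime <- c(300, 120, 210)

levels(region)  # default order is alphabetical

# Reorder the levels by the numeric variable
region_sorted <- reorder(region, total_crime)

levels(region_sorted)
```

Inside aes(), the same call determines the order in which the categories appear along the axis.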

37
Q

geom_point()

A

The point geom is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables.

38
Q

geom_smooth()

A

Aids the eye in seeing patterns in the presence of overplotting. geom_smooth() and stat_smooth() are effectively aliases

39
Q

ggfortify library - autoplot()

A

Use the ggfortify library to plot 3 essential plots for residuals.

40
Q

huxreg() table

A
41
Q

favstats()

A

from mosaic library - “favstats” is short for “favorite statistics”: it will give you some of the most popular summary statistics for numerical variables (mean, max, median, sd, etc.)

42
Q

What does geom mean?

A

Geometric objects (geoms) define the basic shape of the elements on the plot. Every geom has a default statistic.

43
Q

geom_col()

A

This function creates a bar plot where the heights of the bars represent the actual values of a variable (rather than counts or frequencies, which geom_bar() would use by default).

44
Q

coord_flip()

A

This function flips the x and y axes, effectively turning a vertical bar chart into a horizontal bar chart (or vice versa for other plot types).
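
A sketch that combines geom_col(), reorder(), and coord_flip() into a sorted horizontal bar chart, assuming ggplot2 is installed. The crime data frame is hypothetical.

```r
library(ggplot2)

crime <- data.frame(
  region      = c("east", "north", "west"),
  total_crime = c(300, 120, 210)
)

p <- ggplot(crime, aes(x = reorder(region, total_crime), y = total_crime)) +
  geom_col() +      # bar height is the actual value, not a count
  coord_flip() +    # turn the vertical bars into horizontal ones
  labs(x = "region")

p
```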

45
Q

Give definition of the Central Limit Theorem

A
46
Q

Write down the distribution of the sample mean for…

A