Data Science Flashcards

Question 1

Q

How are the mean and variance of the sum and of the difference of two 🎃INDEPENDENT, NORMALLY DISTRIBUTED🎃 variables calculated?

Answer

A

X+Y=T: mean(T) = mean(X) + mean(Y), var(T) = var(X) + var(Y)

X-Y=Z: mean(Z) = mean(X) - mean(Y), var(Z) = var(X) + var(Y) (the variances add even when the variables are subtracted)
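
A quick simulation can sanity-check these rules. The sketch below is only illustrative: the means, standard deviations, seed and number of draws are arbitrary assumptions, not values from the card.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility
N = 20_000      # number of simulated draws (assumption)

# X ~ Normal(mean=10, std=3) and Y ~ Normal(mean=4, std=2), drawn independently
xs = [random.gauss(10, 3) for _ in range(N)]
ys = [random.gauss(4, 2) for _ in range(N)]

sums = [x + y for x, y in zip(xs, ys)]
diffs = [x - y for x, y in zip(xs, ys)]

mean_sum = statistics.fmean(sums)       # should land near 10 + 4 = 14
mean_diff = statistics.fmean(diffs)     # should land near 10 - 4 = 6
var_sum = statistics.pvariance(sums)    # should land near 9 + 4 = 13
var_diff = statistics.pvariance(diffs)  # should also land near 9 + 4 = 13

print(mean_sum, var_sum)
print(mean_diff, var_diff)
```

Both variances come out close to 13, which illustrates that subtracting independent variables still adds their variances.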

Question 2

Q

Write the formulas for the mean and the variance using expected value:

Answer

A

The expected value is the same as the mean. Here's how it's calculated:
mean = E(X) = ∑ x*p(x) over all values of x when X is discrete (when X is continuous, the sum becomes an integral: ∫ x*f(x) dx)

For variance: variance = E((X - E(X))^2), i.e. the mean of the variable (X - mean(X))^2
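
As a small worked example of the discrete case, the sketch below applies both formulas to a fair six-sided die. The die is an illustrative assumption, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# A fair six-sided die: each face has probability 1/6 (illustrative example)
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# E(X) = sum of x * p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E((X - E(X))^2) = sum of (x - mean)^2 * p(x)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean, variance)  # 7/2 35/12
```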

Question 3

Q

What does central limit theorem say?

Answer

A

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger,
🧨 regardless of the population's distribution 🧨 (as long as the population has a finite variance).
Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold, although this is only a rule of thumb.
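
One way to see this is to simulate it. The sketch below assumes a strongly skewed population (exponential with rate 1) and compares how much of the data lies within one standard deviation of the mean; for a normal distribution that share is about 68%. The population, seed and number of samples are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only for reproducibility

def share_within_one_std(values):
    """Fraction of values lying within one standard deviation of their mean."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(1 for v in values if abs(v - m) <= s) / len(values)

# Individual draws from the skewed population
raw_draws = [random.expovariate(1.0) for _ in range(5000)]

# 5000 samples of size n = 30, keeping only the mean of each sample
n = 30
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5000)
]

share_raw = share_within_one_std(raw_draws)       # well above 0.68 (not normal)
share_means = share_within_one_std(sample_means)  # close to 0.68 (roughly normal)
print(share_raw, share_means)
```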

Question 4

Q

What is a latent variable?

Answer

A

In statistics, latent variables are variables that are not directly observed but are instead inferred, through a mathematical model, from other variables that are observed.

Question 5

Q

What do we mean when we say the sample mean targets the population mean?

Answer

A

It means the sample mean is an unbiased estimate of the population mean: averaged over many samples, the sample means center on the population mean.

Question 6

Q

What is the variance of the sampling distribution of means telling us? (Considering CLT)

Answer

A

The variance of the sampling distribution of means is: population variance / n. So the larger the sample size, the smaller the variance and the closer the sample means are to the population mean. In the extreme case where n is the whole population, however many samples we take we get only one mean (the population mean itself), so the variance is essentially zero.
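
The "population variance / n" relationship can be checked with a short simulation. This is a sketch under stated assumptions: the population is uniform on (0, 1), whose variance is 1/12, and the sample sizes, seed and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(2)    # arbitrary seed, only for reproducibility
POP_VAR = 1 / 12  # variance of a uniform(0, 1) population

def variance_of_sample_means(n, num_samples=4000):
    """Draw num_samples samples of size n and return the variance of their means."""
    means = [
        statistics.fmean(random.random() for _ in range(n))
        for _ in range(num_samples)
    ]
    return statistics.pvariance(means)

# Observed variance of the sample means for a small and a larger sample size
results = {n: variance_of_sample_means(n) for n in (5, 50)}

for n, observed in results.items():
    print(n, observed, POP_VAR / n)  # observed value next to population variance / n
```

The observed variance shrinks by roughly a factor of 10 when n goes from 5 to 50.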

Question 7

Q

What would the sampling distribution of the means look like if the original distribution is not normal?

Answer

A

It approaches a normal distribution as the sample size grows.

Question 8

Q

What is the relation between non-normal original distribution and the suitable sample size?

Answer

A

The further the original distribution is from normal, the larger the sample size should be for the sampling distribution of means to approach a normal distribution (typically a sample size >= 30 is enough).

Question 9

Q

If the original distribution is normal, the sampling distribution of the means will also be normal. True or False?

Answer

A

True

Question 10

Q

Where do we use CLT?

Answer

A

When the population is not normally distributed, the CLT lets us still apply tools designed for the normal distribution to the sample mean, as long as the sample is large enough.

Question 11

Q

Summarize the 3 parts of CLT; which parts hold true for any sample size?

Answer

A

1- The mean of the sampling distribution of means equals the mean of the population from which the samples were drawn, so the sample mean is a fair estimate of it

2- The variance of the sampling distribution of means equals the variance of the population from which the samples were drawn, divided by n

3- If the original distribution is normal, the sampling distribution of the means will also be normal. Otherwise, if n>=30, we can usually still assume it is approximately normal.

Parts 1 and 2 are true for any sample size

Question 12

Q

What is the sampling distribution of the mean?

Answer

A

The distribution of the means of the samples taken from the original data. (Each sample has a mean; the sampling distribution of the mean is the distribution of the means of all the samples taken.)

Question 13

Q

What is the formula for calculating variance for continuous and discrete variables?

Answer

A

f(x) is the probability density function
p(x) is the probability of each discrete value the variable can take
🤓
Continuous Variables:
∫ (x-mean)^2*f(x) dx
🤓
Discrete Variables:
∑ (xi-mean)^2*p(xi)
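
For the continuous case, the integral can be approximated numerically. The sketch below assumes a hypothetical density f(x) = 2x on [0, 1] and uses a simple midpoint rule; the analytic answers for this density are mean = 2/3 and variance = 1/18.

```python
def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h

def f(x):
    # Hypothetical density on [0, 1]: f(x) = 2x, which integrates to 1
    return 2 * x

# mean = integral of x * f(x) dx
mean = integrate(lambda x: x * f(x), 0, 1)

# variance = integral of (x - mean)^2 * f(x) dx
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0, 1)

print(mean, variance)
```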

Question 14

Q

How does adding/subtracting a constant to/from a variable change its variance?

Answer

A

It doesn't change the variance: every value and the mean shift by the same amount, so the deviations from the mean stay the same.

Question 15

Q

How does multiplying a variable by a constant change its variance?

Answer

A

The new variance = the old variance * constant^2
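
The effects of adding a constant and of multiplying by a constant can be checked on a small data set. The data and the constant below are arbitrary values picked for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example data
c = 3                            # arbitrary constant

var_original = statistics.pvariance(data)
var_shifted = statistics.pvariance([x + c for x in data])  # constant added
var_scaled = statistics.pvariance([x * c for x in data])   # constant multiplied

print(var_original, var_shifted, var_scaled)
```

The shifted variance matches the original, while the scaled variance is the original multiplied by c^2 = 9.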

Question 16

Q

How does the variance change when we have a set of INDEPENDENT variables added together?

Answer

A

When they are INDEPENDENT we add the variance of each variable:
var (x±y) =var(x) + var(y)

Question 17

Q

How does the variance change when we have a set of DEPENDENT variables added/subtracted?

Answer

A

var (x+y) =var(x) + var(y) + 2 cov(x, y) (cov=covariance)

var (x-y) =var(x) + var(y) - 2 cov(x, y) (cov=covariance)
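
These two identities can be verified numerically. The sketch below uses a small made-up pair of dependent variables and a hand-written population covariance helper (`pcov` is a name introduced here for illustration).

```python
import statistics

# Two clearly dependent variables: y tends to grow with x (arbitrary data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def pcov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

var_x, var_y = statistics.pvariance(x), statistics.pvariance(y)
cov_xy = pcov(x, y)

# Variances computed directly from the summed / subtracted data
var_sum = statistics.pvariance([u + v for u, v in zip(x, y)])
var_diff = statistics.pvariance([u - v for u, v in zip(x, y)])

print(var_sum, var_x + var_y + 2 * cov_xy)   # the two values agree
print(var_diff, var_x + var_y - 2 * cov_xy)  # the two values agree
```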

Question 18

Q

How is the STD of x+y calculated?

Answer

A

1- Calculate the variance of x+y (assuming x and y are independent): var(x) + var(y)

2- Take the square root of the x+y variance: √ (var(x) + var(y))
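
The two steps can be sketched as a tiny helper. The function name and the example values are assumptions for illustration, and the calculation is only valid when x and y are independent.

```python
import math

def std_of_sum(std_x, std_y):
    """Standard deviation of x + y, assuming x and y are independent."""
    variance_of_sum = std_x ** 2 + std_y ** 2  # step 1: add the variances
    return math.sqrt(variance_of_sum)          # step 2: take the square root

print(std_of_sum(3, 4))  # 5.0
```

Note that the standard deviations themselves do not add: 3 + 4 = 7, but the standard deviation of the sum is 5.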

Question 19

Q

What is the most important probability distribution for discrete random variables?

Answer

A

The binomial distribution

Question 20

Q

What conditions should be met before being certain a distribution is binomial?

Answer

A

1- Probability of success in each separate trial is the same
2- Trials are independent: the result of one trial doesn’t depend on others
3- Fixed number of trials
4- Each trial can be classified as either fail or success

Question 21

Q

What is the formula for getting x number of successes with n trials in a binomial distribution? And the shorthand of it?

Answer

A

P(X = x) = [n!/(x!(n-x)!)] p^{x} q^{n-x}, where q = 1-p
X ~ B(n, p)
X is a binomial random variable with n trials and success probability p
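
A minimal sketch of this formula in code, using `math.comb` for the n!/(x!(n-x)!) term. The function name `binom_pmf` and the example values are assumptions for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Probability of exactly 5 successes in 10 trials with p = 0.5
print(binom_pmf(5, 10, 0.5))  # 0.24609375
```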

Question 22

Q

What’s the equation for the cumulative probability distribution for binomial distributions?

Answer

A

P(X<=x) = ∑ [n!/(k!(n-k)!)] p^{k} q^{n-k}, summed over k = 0, 1, ..., x
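
The cumulative probability is just the individual binomial probabilities added up from k = 0 to x. A minimal sketch follows; the function name `binom_cdf` and the example values are assumptions for illustration.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p): the sum of P(X = k) for k = 0 .. x."""
    q = 1 - p
    return sum(comb(n, k) * p ** k * q ** (n - k) for k in range(x + 1))

# Probability of at most 5 successes in 10 trials with p = 0.5
print(binom_cdf(5, 10, 0.5))  # 0.623046875
```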

Question 23

Q

On which parameter does the shape of the binomial probability distribution (probability vs number of successes) depend? How does the shape change as this parameter changes?

Answer

A

p, the probability of success. The higher p is (the closer to 1), the more left-skewed the distribution; the lower p is (the closer to 0), the more right-skewed. When p is around 0.5, it's almost symmetric.
Explanation: Suppose p is close to 1 and we have 10 trials. The probability of just 1 success in 10 trials is really small, since there's a high chance we succeed more than once. The probability grows as the number of successes increases, so most of the probability sits on the right with a long tail to the left. Therefore the distribution is left-skewed.
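
The direction of the skew can be made concrete by computing the skewness (the third standardized moment) directly from the binomial probabilities. This is an illustrative sketch: the helper name, n = 10 and the three values of p are assumptions. A negative result means left-skewed, a positive one right-skewed.

```python
from math import comb

def binom_skewness(n, p):
    """Skewness of B(n, p), computed directly from its probabilities."""
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
    third = sum((k - mean) ** 3 * pk for k, pk in enumerate(pmf))
    return third / var ** 1.5

for p in (0.1, 0.5, 0.9):
    # positive for p = 0.1, about zero for p = 0.5, negative for p = 0.9
    print(p, round(binom_skewness(10, p), 3))
```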

Question 24

Q

Binomial probability distribution is discrete. True or False?

Answer

A

True

Question 25

Q

Explain how the binomial probability distribution is plotted

Answer

A

The binomial distribution is plotted with the number of successes on the x axis and the probability of that number of successes on the y axis. For example, with 20 trials the x axis runs from 0 to 20 successes; the point for exactly one success in 20 trials is x = 1, y = P(X = 1).

Question 26

Q

How are the mean and std of a binomial variable calculated? What do they depend on?

Answer

A

They depend on the number of trials and the probability of success
n= number of trials
p= probability of success
mean=n*p
std=√(np(1-p))
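
A minimal sketch of the two formulas; the function name and the example of 20 trials with p = 0.5 are assumptions for illustration.

```python
import math

def binom_mean_std(n, p):
    """Mean and standard deviation of X ~ B(n, p)."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return mean, std

# 20 trials with success probability 0.5: mean is 10, std is sqrt(5)
mean, std = binom_mean_std(20, 0.5)
print(mean, std)
```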

Question 27

Q

What is the 10% rule for assuming independence between trials?

Answer

A

We can make inferences when a situation is close to a binomial or a normal distribution. For the binomial distribution, when we sample without replacement the trials are not strictly independent, but if the sample is less than or equal to 10% of the population, it's OK to assume approximate independence.
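
To get a feel for why a small sample fraction matters, the sketch below compares the exact probabilities for drawing without replacement (the hypergeometric distribution) with the binomial approximation. The population of 1000 items with 300 successes and the two sample sizes are hypothetical numbers chosen for illustration.

```python
from math import comb

def hypergeom_pmf(k, pop, successes, n):
    """P(k successes) when drawing n items WITHOUT replacement."""
    return comb(successes, k) * comb(pop - successes, n - k) / comb(pop, n)

def binom_pmf(k, n, p):
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def max_gap(pop, successes, n):
    """Largest difference between the exact and the binomial probabilities."""
    p = successes / pop
    return max(
        abs(hypergeom_pmf(k, pop, successes, n) - binom_pmf(k, n, p))
        for k in range(n + 1)
    )

gap_small = max_gap(1000, 300, 10)   # sample is 1% of the population
gap_large = max_gap(1000, 300, 500)  # sample is 50% of the population
print(gap_small, gap_large)
```

With the 1% sample the two sets of probabilities nearly coincide, while the 50% sample shows a noticeably larger gap.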

Question 28

Q

What do the binompdf and binomcdf functions take as input, and what is their output?

Answer

A

binompdf and binomcdf both take n: number of trials, p: probability of success, x: number of successes.
binompdf output: the probability of exactly x successes out of n trials.
binomcdf output: the cumulative probability of up to and including x successes out of n trials.

Question 29

Q

What is Bernoulli distribution?

Answer

A

The Bernoulli distribution describes a single trial with exactly two outcomes, success (with probability p) or failure (with probability 1-p). Such events are ubiquitous in real life. Some examples are: a team will win a championship or not, a student will pass or fail an exam, and a rolled die will either show a 6 or any other number. It is the special case of the binomial distribution with one trial (n = 1).
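
Taking the die example, a Bernoulli variable can be written as a two-entry probability table. The sketch below codes "shows a 6" as 1 and "any other number" as 0 (a labeling assumption) and computes the mean and variance from the expected-value definitions; they come out as p and p*(1-p).

```python
from fractions import Fraction

# Bernoulli trial: a rolled die shows a 6 (coded 1) or any other number (coded 0)
p = Fraction(1, 6)
pmf = {1: p, 0: 1 - p}

mean = sum(x * px for x, px in pmf.items())                    # equals p
variance = sum((x - mean) ** 2 * px for x, px in pmf.items())  # equals p * (1 - p)

print(mean, variance)  # 1/6 5/36
```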
