Data Science Flashcards
How are the mean and variance of the sum and the difference of two 🎃INDEPENDENT, NORMALLY DISTRIBUTED🎃 variables calculated?
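X+Y=T mean(T)=mean(X)+mean(Y) var(T)=var(X)+var(Y)
X-Y=Z mean(Z)=mean(X)-mean(Y) var(Z)=var(X)+var(Y) (variances add even for a difference)
Equivalently, in terms of standard deviations: sd(T)^2=sd(X)^2+sd(Y)^2, and the same for Z. T and Z are themselves normally distributed.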
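The rule for sums and differences can be checked with a quick simulation. This is a minimal sketch using only the standard library; the parameters (X with mean 10 and sd 3, Y with mean 4 and sd 2) and the helper names `mean` and `variance` are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: average squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


N = 100_000
# Assumed example: X ~ N(10, 3^2) and Y ~ N(4, 2^2), drawn independently.
xs = [random.gauss(10, 3) for _ in range(N)]
ys = [random.gauss(4, 2) for _ in range(N)]

total = [x + y for x, y in zip(xs, ys)]  # T = X + Y
diff = [x - y for x, y in zip(xs, ys)]   # Z = X - Y

mean_t, var_t = mean(total), variance(total)
mean_z, var_z = mean(diff), variance(diff)

# Theory: mean(T) = 14, mean(Z) = 6, var(T) = var(Z) = 9 + 4 = 13
print(f"T: mean={mean_t:.2f} var={var_t:.2f}")
print(f"Z: mean={mean_z:.2f} var={var_z:.2f}")
```

Both variances should come out close to 13, which shows that subtracting Y does not subtract its variance.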
Write the formulas for mean and variance using expected value.
The expected value is the same thing as the (population) mean. Here’s how it’s calculated:
mean = E(x) = sum of x*p(x) over all values of x (for a discrete variable we use sigma ∑; for a continuous variable we use an integral with the density f(x))
variance = E((x-E(x))^2), i.e. the mean of the new variable (x-mean(x))^2
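As a worked example of the expected-value formulas, here is a small sketch for a fair six-sided die. The die is an assumed example, not part of the original card; `Fraction` is used so the results are exact.

```python
from fractions import Fraction

# Assumed example: a fair die, each face has probability 1/6.
values = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

# mean = E(x) = sum of x * p(x)
die_mean = sum(x * p for x in values)

# variance = E((x - E(x))^2) = sum of (x - mean)^2 * p(x)
die_variance = sum((x - die_mean) ** 2 * p for x in values)

print(die_mean)      # 7/2
print(die_variance)  # 35/12
```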
What does the central limit theorem say?
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger,
🧨 regardless of the population’s distribution (as long as it has a finite variance). 🧨
As a rule of thumb, sample sizes of 30 or more are often considered sufficient for the CLT approximation to be usable.
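One way to see the CLT at work is to draw many samples from a clearly non-normal population and check that the sample means behave like a normal variable. The sketch below assumes an exponential population with rate 1 (mean 1, sd 1) and samples of size 30. For a normal distribution about 68% of values lie within one standard deviation of the mean, so the printed fraction should be near 0.68.

```python
import random

random.seed(1)  # fixed seed for repeatability

N_SAMPLES = 20_000  # how many samples we draw (assumed for illustration)
n = 30              # size of each sample

# Exponential(rate=1) is strongly right-skewed, with mean 1 and sd 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(N_SAMPLES)
]

# Standard deviation of the sample mean predicted by theory: sd / sqrt(n).
theory_sd = 1.0 / n ** 0.5

# Fraction of sample means within one theoretical sd of the population mean.
within_one_sd = sum(abs(m - 1.0) <= theory_sd for m in sample_means) / N_SAMPLES

print(f"fraction within 1 sd: {within_one_sd:.3f}")
```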
What is a latent variable?
In statistics, latent variables are variables that are not directly observed but are instead inferred, through a mathematical model, from other variables that are observed.
What do we mean when we say the sample mean targets the population mean?
It means the sample mean is an unbiased estimator of the population mean: averaged over many samples, the sample means center on the population mean.
What is the variance of the sampling distribution of means telling us? (Considering CLT)
The formula for the variance of the sampling distribution of means is: population variance/n. So the larger the sample size, the smaller the variance, and the closer the sample means are to the population mean. In the extreme case, n is the whole population: however many samples we take, every sample gives the same mean, so the variance of the sample means is zero.
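The population variance/n relationship can be checked numerically. This is a minimal sketch assuming a Uniform(0, 1) population, whose variance is 1/12; the function name `variance_of_sample_means` and the chosen sample sizes are illustrative.

```python
import random

random.seed(2)  # fixed seed for repeatability

POP_VAR = 1 / 12  # variance of Uniform(0, 1)


def variance_of_sample_means(n, repeats=10_000):
    """Draw `repeats` samples of size n and return the variance of their means."""
    means = [sum(random.random() for _ in range(n)) / n for _ in range(repeats)]
    grand_mean = sum(means) / repeats
    return sum((m - grand_mean) ** 2 for m in means) / repeats


results = {n: variance_of_sample_means(n) for n in (1, 4, 16, 64)}

for n, observed in results.items():
    print(f"n={n:>2}  observed={observed:.5f}  theory={POP_VAR / n:.5f}")
```

Each fourfold increase in n should cut the observed variance to roughly a quarter.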
How would the sampling distribution of the mean look if the original distribution is not normal?
It approaches a normal distribution as the sample size grows.
What is the relation between how non-normal the original distribution is and the suitable sample size?
The further the original distribution is from normal, the larger the sample size must be for the sampling distribution of means to be close to normal (as a rule of thumb, n>=30 is usually enough, but heavily skewed distributions may need more).
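To see how sample size interacts with a non-normal population, the sketch below measures the skewness of the sample means for several sample sizes. A normal distribution has skewness 0. The exponential population, the sample sizes, and the helper names are assumptions for illustration.

```python
import random

random.seed(3)  # fixed seed for repeatability


def skewness(data):
    """Third standardized moment; 0 for a symmetric distribution."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5


def sample_mean_skewness(n, repeats=10_000):
    """Skewness of the means of `repeats` samples of size n from Exponential(1)."""
    means = [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(repeats)
    ]
    return skewness(means)


skew_by_n = {n: sample_mean_skewness(n) for n in (1, 5, 30, 100)}

for n, s in skew_by_n.items():
    print(f"n={n:>3}  skewness of sample means={s:.2f}")
```

With n=1 the "sample means" are just the raw, strongly skewed population values; the skewness shrinks toward 0 as n increases.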
If the original distribution is normal, the sampling distribution of the means will also be normal. True or False?
True
Where do we use the CLT?
When the population is not normally distributed, the CLT still lets us apply tools designed for the normal distribution to sample means, provided the sample is large enough.
Summarize the 3 parts of CLT; which parts hold true for any sample size?
1-The mean of the sampling distribution of means equals the mean of the population from which the samples were drawn
2-The variance of the sampling distribution of means equals the variance of the population from which the samples were drawn, divided by n
3-If the original distribution is normal, the sampling distribution of the means will also be normal. Otherwise, if n>=30, we can usually treat it as approximately normal.
Parts 1 and 2 are true for any sample size.
What is the sampling distribution of the mean?
The distribution of the means of samples taken from the original data. Each sample of size n has its own mean; the sampling distribution of the mean is the distribution of those means across all the samples that could be taken.
What is the formula for calculating variance for continuous and discrete variables?
f(x) is the probability density function of a continuous variable; p(xi) is the probability of each discrete value xi the variable can take.
🤓 Continuous Variables: ∫ (x-mean)^2*f(x) dx
🤓 Discrete Variables: ∑ (xi-mean)^2*p(xi)
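The continuous formula can be tried out numerically. The sketch below assumes the density f(x) = 2x on [0, 1] (chosen only for illustration) and approximates the integrals with a midpoint Riemann sum; the helper name `integrate` is hypothetical. The exact answers are mean = 2/3 and variance = 1/18.

```python
def f(x):
    """Assumed example density: f(x) = 2x on the interval [0, 1]."""
    return 2 * x


def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h


# mean = integral of x * f(x) dx
cont_mean = integrate(lambda x: x * f(x), 0.0, 1.0)

# variance = integral of (x - mean)^2 * f(x) dx
cont_variance = integrate(lambda x: (x - cont_mean) ** 2 * f(x), 0.0, 1.0)

print(f"mean={cont_mean:.6f}  variance={cont_variance:.6f}")
```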
How does adding/subtracting a constant to/from a variable change its variance?
It doesn’t change the variance: var(X+c)=var(X), because every value and the mean shift by the same amount.
How does multiplying a variable by a constant change its variance?
The new variance = the old variance * constant^2, i.e. var(c*X)=c^2*var(X).
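Both rules, for adding a constant and for multiplying by one, can be confirmed on a small dataset. The numbers below are an assumed example; `statistics.pvariance` computes the population variance.

```python
import statistics

# Assumed example data with mean 5 and population variance 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]

shifted = [x + 10 for x in data]  # add a constant
scaled = [3 * x for x in data]    # multiply by a constant

var_original = statistics.pvariance(data)
var_shifted = statistics.pvariance(shifted)
var_scaled = statistics.pvariance(scaled)

print(var_original, var_shifted, var_scaled)
```

The shifted variance matches the original, and the scaled variance is 3^2 = 9 times the original.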