DSE1101 Flashcards
What is a variable?
A characteristic observed in a study.
When is a variable categorical?
When each observation belongs to one of a set of categories.
When is a variable quantitative?
When observations take on numerical values that represent different magnitudes.
What is another name for the independent variable?
Explanatory variable
What is another name for the dependent variable?
Response variable
What is the mean?
The average; one way to measure the center of a distribution.
What is the sample mean?
A sample statistic that serves as a point estimate of the population mean.
What kind of variable does a histogram show?
The distribution of a continuous variable.
What is modality?
The number of peaks in the data. Data with a single peak show one general pattern and are called unimodal.
What is unimodal?
1 peak
What is data with 2 peaks called?
Bimodal
What is data with more than 2 peaks called?
Multimodal
What is it called when all values are about equally frequent (no distinct peak)?
Uniform
Where is the peak on negatively skewed data?
Long tail on left
Peak on right
Give an example of negatively skewed data
GPA
Age at death
Where is the peak on positively skewed data?
Long tail on right
Peak on left
If a question asks whether data is left or right skewed, do we remove outliers first?
Yes
When you find data on some people who spend $1000 in a supermarket, is it an error?
No, set them aside to be analysed separately
Why use median over mean?
More robust to outliers
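A small sketch in R of what "robust to outliers" means. The spending values are made up for illustration and are not from the course data.

```r
# Hypothetical supermarket spending in dollars; the last value is an outlier
spend <- c(20, 25, 30, 35, 40, 1000)

mean(spend)    # pulled far above the bulk of the data by the 1000
median(spend)  # stays near the typical values

# Setting the outlier aside changes the mean a lot, the median only slightly
typical <- spend[spend < 1000]
mean(typical)
median(typical)
```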
What is the con of using the median?
The median requires more computing power because the data must be sorted first.
The mean is easier to compute: no sorting needed.
If a distribution is skewed or has some extreme values, which measure of center should be used?
Median
If a distribution is left skewed, where is the median in relation to the mean?
The mean is smaller than the median.
The median is typically closer to the peak than the mean.
What is variance?
Roughly the average squared deviation from the sample mean.
What is the formula for variance?
s^2 = Σ (y_i − ȳ)^2 / (n − 1), summing over i = 1, ..., n
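As a quick check, the sample variance (squared deviations from the sample mean, divided by n − 1) can be computed by hand and compared with R's built-in var(). The data vector is made up for illustration.

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up sample
n <- length(y)

# Sum of squared deviations from the sample mean, divided by n - 1
s2_manual <- sum((y - mean(y))^2) / (n - 1)

s2_builtin <- var(y)            # var() also divides by n - 1
all.equal(s2_manual, s2_builtin)
```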
Why do we square the deviations instead of taking absolute values?
Squares take less computational power to work with, and squaring gets rid of negative values.
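What is the interquartile range?
IQR = Q3 − Q1, the width of the range from Q1 to Q3 (the middle 50% of the data)
Where do the whiskers of a box plot extend up to?
The most extreme observations within 1.5 × IQR of the lower and upper quartiles
What is Tukey's rule?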
outliers are values more than 1.5 times the IQR from the quartiles — either below Q1 - 1.5IQR, or above Q3 + 1.5IQR.
Where are outliers?
more than 1.5 times the IQR from the quartiles — either below Q1 - 1.5IQR, or above Q3 + 1.5IQR.
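A minimal sketch of the 1.5 × IQR rule in R, using a made-up data vector. It relies on quantile() with its default quartile definition; other quartile conventions (such as the hinges used by boxplot()) can give slightly different cutoffs.

```r
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 45)  # made-up data

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- IQR(x)                 # same as q3 - q1

lower <- q1 - 1.5 * iqr       # anything below this is flagged
upper <- q3 + 1.5 * iqr       # anything above this is flagged

outliers <- x[x < lower | x > upper]
outliers
```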
What are the robust statistics for center and spread?
Median (center) and IQR (spread)
What to do to extremely skewed data?
natural log transformation
The horizontal axis of a histogram is ____
discrete (the values are grouped into bins)
What is denoted by omega
sample space
What does a probability model describe
the uncertainty of a random process.
What is an outcome
mutually exclusive and collectively exhaustive results of a random process.
What is an event
collection of one or more outcomes. It is a subset of the sample space.
What is a probability distribution?
lists all possible outcomes and the probabilities with which each of them occurs.
What is a cumulative probability distribution?
The probability that a variable is less than or equal to a particular value, e.g. P(X ≤ 2)
What are disjoint outcomes?
Outcomes that cannot happen at the same time
What does it mean for 2 events to be independent?
The occurrence of B provides no information about A: P(A|B) = P(A).
What is P(A ∩ B)?
P(B) × P(A|B)
In 2013, SurveyUSA interviewed a random sample of 500 residents in North Carolina, asking them whether they think widespread gun ownership protects law-abiding citizens from crime or makes society more dangerous.
58% of all respondents said it protects citizens. 67% of White respondents, 28% of Black respondents, and 64% of Hispanic respondents shared the same view.
Based on the probabilities above, opinion on gun ownership and race/ethnicity are most likely:
- complementary
- disjoint
- independent
- dependent
Dependent: the conditional probabilities given race (67%, 28%, 64%) differ from the overall 58%.
How to express a joint probability in X and Y?
P(X = x, Y = y)
e.g. P(Rain, Long commute) = P(X = 0, Y = 0) = 0.15
What is a random variable?
A numeric quantity whose value depends on the outcome of a random process.
Lowercase letters denote the values the variable takes.
What is the difference between a discrete random variable and a continuous random variable?
Discrete: takes countable values, such as integers
Continuous: takes any real value in an interval
What is covariance?
extent to which 2 variables move in the same direction
What is correlation?
covariance between two variables divided by the product of their standard deviations.
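The definition above can be checked directly in R: dividing the covariance by the product of the standard deviations reproduces cor(). The two vectors are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)   # made-up values, roughly linear in x

# Correlation = covariance / (sd of x * sd of y)
r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_manual, cor(x, y))
```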
What is the Bernoulli distribution?
- for discrete variables
- binary, with only 2 possible outcomes (0 or 1)
How to express the Bernoulli distribution?
X ∼ Bernoulli(p)
p is the probability that the value is 1
How to express the normal distribution?
X ∼ N(µ, σ²)
What is error?
= true value of population parameter - point estimate
What is bias?
the systematic tendency to over or under-estimate the true population parameter.
What is sample variability
how much an estimate will tend to vary from one sample to the next.
The sample average is…
an estimator of the population mean
What does Y bar (Ȳ) stand for?
The sample mean. Ȳ is a random variable.
What is population parameter?
fixed feature of a particular population
- usually unknown in real life
What is a sample statistic?
A quantity that varies from one sample to another
- easy to compute, as it is a statistic of a sample from simple random sampling
What kind of distribution is used when the parameters and exact distribution are not known?
Asymptotic distribution (an approximation based on the sample)
“Tending to a distribution”
What do we rely on when following asymptotic distribution?
Law of large numbers
central limit theorem
What is law of large numbers?
sample mean approaches population mean as the sample size increases
What is the central limit theorem used for?
Using the sample mean and sample variance to approximate the distribution of the sample mean
What does the central limit theorem state?
For a random sample of size n from a population with mean µ and known variance σ^2:
When n is large, the sampling distribution of Ȳ is approximately normal, regardless of the distribution of the underlying population.
Ȳ is approximately normally distributed with mean µ and variance σ^2/n.
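A small simulation sketch of the central limit theorem in R. The exponential population, sample size and number of repetitions are choices made for illustration, not part of the course material.

```r
set.seed(1)    # for reproducibility
n    <- 50     # size of each random sample
reps <- 5000   # number of repeated samples

# Population: Exponential(rate = 1), which is skewed, with mean 1 and variance 1
ybar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(ybar)  # should be close to mu = 1
var(ybar)   # should be close to sigma^2 / n = 1 / 50 = 0.02

# The histogram of sample means looks roughly bell-shaped despite the skewed population
hist(ybar, breaks = 40, main = "Sampling distribution of the sample mean")
```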
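If the population variance is unknown, what does the sample mean follow?
The standardised sample mean, (Ȳ − µ)/(s/√n), follows a Student's t distribution with n − 1 degrees of freedom
Its tails are heavier than those of the normal distribution
The estimated variance of Ȳ is s^2/n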
If you want to conduct a hypothesis test on whether a coin is fair, what is the variance?
σ^2 = p(1 − p) = 0.5 × 0.5 = 0.25 (assuming the coin is fair, p = 0.5)
By the CLT, the sample proportion p hat is approximately normally distributed with:
var(p hat) = σ^2/n = 0.25/100 = 0.0025 (for n = 100 tosses)
Use a 2-tailed test
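The coin example can be sketched in R as a two-tailed z-test. The count of 60 heads in 100 tosses is a hypothetical result chosen for illustration; n = 100 matches the variance of 0.0025 above.

```r
n     <- 100   # number of tosses
heads <- 60    # hypothetical observed number of heads

p0    <- 0.5                        # null hypothesis: the coin is fair
p_hat <- heads / n
se0   <- sqrt(p0 * (1 - p0) / n)    # sqrt(0.0025) = 0.05

z       <- (p_hat - p0) / se0       # test statistic
p_value <- 2 * pnorm(-abs(z))       # two-tailed p-value

z
p_value
```

With these made-up numbers |z| exceeds 1.96, so the null hypothesis of a fair coin would be rejected at the 5% level.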
What is a confidence interval?
plausible range of values for the population parameter.
What is a 95% confidence interval?
point estimate ± 1.96 × standard error
Suppose we take many samples and build a confidence interval from each sample, then about 95% of these intervals would contain the true population parameter
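A minimal sketch of a 95% confidence interval for a mean in R, built as point estimate ± 1.96 × standard error. The sample is made up; with only 10 observations a t critical value would be more appropriate than 1.96, so treat this as an illustration of the formula only.

```r
y <- c(4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.2, 4.7, 5.4, 5.1)  # made-up sample

ybar <- mean(y)                   # point estimate
se   <- sd(y) / sqrt(length(y))   # standard error of the sample mean

ci <- c(lower = ybar - 1.96 * se, upper = ybar + 1.96 * se)
ci
```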
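What is the standard error?
The standard deviation of an estimator's sampling distribution
What is the margin of error?
Half the width of the CI: 1.96 × standard error for a 95% CI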
Linear regression is ____ (supervised or unsupervised?)
Supervised learning
What is a characteristic of the y variable in linear regression?
Continuous dependent variable
Can linear regression be used to predict discrete outcomes?
Yes (e.g. credit card default)
What does the hat (^) denote?
An estimate or a predicted value
what is the typical equation of a linear regression model?
Y = β0 + β1X + ϵ
What does ϵ represent in the linear regression model?
The error term; its estimate from the fitted line is the residual
The difference between the regression line and the actual observed data
What is the equation for a residual?
e_i = y_i − ŷ_i = y_i − (b0 + b1 x_i), where b0 and b1 are the estimates of β0 and β1
= vertical distance from each point to the fitted line
What is the residual sum of squares (RSS)?
The sum of the squared residuals over all observations: RSS = Σ e_i^2
Minimising it is called least squares
It is the variation in Y that is left unexplained after fitting the regression model.
What is the model supposed to minimise in linear regression? How?
RSS
- write RSS as a function of b0 and b1
- take the derivatives with respect to b0 and b1 and set them to zero
Which point does the regression line always pass through?
(x bar, y bar)
b0 = y bar − b1(x bar)
Substitute into the equation y hat = b0 + b1 x at x = x bar:
y hat = y bar − b1(x bar) + b1(x bar) = y bar
The b1(x bar) terms cancel.
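The claim that the fitted line passes through (x bar, y bar) can be checked numerically. This sketch uses R's built-in cars dataset (speed and dist columns) as a stand-in, since the course data is not available here.

```r
# cars is a built-in R dataset, used here only as a stand-in for the course data
fit <- lm(dist ~ speed, data = cars)

b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["speed"]]

x_bar <- mean(cars$speed)
y_bar <- mean(cars$dist)

# The prediction at x bar equals y bar ...
all.equal(b0 + b1 * x_bar, y_bar)
# ... which is the same as saying b0 = y bar - b1 * x bar
all.equal(b0, y_bar - b1 * x_bar)
```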
What does the best fit line do?
Minimises the sum of squared deviations from the line (the least squares fit)
How to interpret the y-intercept?
When x is 0, y is b0 on average
How to interpret the slope of a regression line?
The average change in Y when X increases by one unit
What is the residual standard error (RSE)?
An estimate of the standard deviation of the error term
Measures the lack of fit of a model to the data
How many degrees of freedom are there for the RSE?
n − 2, because two parameters (b0 and b1) are estimated: RSE = sqrt(RSS / (n − 2))
What is TSS?
The total sum of squares: the total variation in Y, TSS = Σ (y_i − ȳ)^2
TSS = variation explained by the model (TSS − RSS) + variation left unexplained (RSS)
What is R^2?
Measures the goodness of fit
The proportion of the variance in Y that the model explains (the larger the R^2, the better the fit)
Formula:
(TSS − RSS) / TSS
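A short sketch computing RSS, TSS, R^2 and the residual standard error by hand and comparing them with what summary() reports. The built-in cars dataset is a stand-in for the course data.

```r
# cars is a built-in R dataset, used here only as a stand-in for the course data
fit <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(fit)^2)                  # unexplained variation
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in Y

r2  <- (tss - rss) / tss                      # proportion explained
rse <- sqrt(rss / (nrow(cars) - 2))           # n - 2 degrees of freedom

all.equal(r2, summary(fit)$r.squared)
all.equal(rse, summary(fit)$sigma)
```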
What is the purpose of hypothesis testing for linear regression?
To assess how close the estimated b0 hat and b1 hat are to the true values of b0 and b1
How to find the standard error of an estimator?
Repeated sampling: see how the values of b0 hat and b1 hat vary across samples
How do we conduct hypothesis testing for b0 and b1?
t-test with n − 2 degrees of freedom, where n is the sample size (2 are lost to estimating b0 and b1)
t = (b1 hat − 0) / se(b1 hat)
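The t statistic for the slope can be reproduced from the coefficient table. This sketch again uses the built-in cars dataset as a stand-in for the course data.

```r
# cars is a built-in R dataset, used here only as a stand-in for the course data
fit   <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

b1_hat <- coefs["speed", "Estimate"]
se_b1  <- coefs["speed", "Std. Error"]

t_stat  <- (b1_hat - 0) / se_b1          # null hypothesis: b1 = 0
df      <- nrow(cars) - 2                # n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df) # two-tailed p-value

all.equal(t_stat, coefs["speed", "t value"])
p_value
```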
What are the assumptions for the least squares line?
- The relationship between X and Y should be linear
- Residuals are nearly normal
- Residuals have constant variability (homoscedasticity)
What graph should we use to check whether X and Y are linearly related?
Residuals vs Fitted plot
The red line should be roughly horizontal
How to check whether residuals are nearly normal?
Normal Q-Q plot
Points should lie roughly along the straight diagonal line
What is the formula for a standardised residual?
(e_i − e bar) / SE(e), where e bar is the mean of the residuals
How to check for constant variability?
Scale-Location plot (you want to see no pattern in the residuals)
The red line is roughly horizontal
How to check for influential values?
Residuals vs Leverage plot
Check for outlying values at the upper right or lower right
If they fall outside the Cook's distance lines, they are influential (should remove points)
How to improve model?
transforming variables(scaling)
seeking additional variables to explain Y
Using more advanced methods
How to read data in R?
read.csv("file", header = TRUE)
How to create a linear model in R?
lm1 <- lm(y ~ x, data = Advertising)  # y and x are column names in Advertising
How to show the coefficients?
summary(lm1)$coefficients
When do we reject the null hypothesis that b1 = 0 at the 5% significance level (95% confidence)?
When |t| for b1 is greater than 1.96 (for large n)
Then there is a relationship between the variables
How to obtain confidence intervals for b0 and b1 in R?
confint(lm1)
By default 95%
How to find a 90% confidence interval in R for b0 and b1?
confint(lm1, level = 0.90)
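A brief usage sketch of confint() on a fitted model, with the slope interval rebuilt by hand from the t distribution. The built-in cars dataset stands in for the course's Advertising data.

```r
# cars is a built-in R dataset, used here only as a stand-in for the course data
fit <- lm(dist ~ speed, data = cars)

ci95 <- confint(fit)                 # 95% by default
ci90 <- confint(fit, level = 0.90)   # narrower 90% intervals

# Rebuild the 95% interval for the slope: estimate +/- t critical value * SE
est    <- coef(summary(fit))["speed", "Estimate"]
se     <- coef(summary(fit))["speed", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(est - t_crit * se, est + t_crit * se)

all.equal(unname(ci95["speed", ]), ci_manual)
```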
How to specify a column in a dataset?
data$column_name
For normally distributed data, how much of the data lies within 1, 2 and 3 standard deviations of the mean?
1 sd: 68%
2 sd: 95%
3 sd: 99.7%
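These percentages come from the normal distribution and can be recovered in R with pnorm(), the standard normal cumulative distribution function.

```r
k <- 1:3   # number of standard deviations from the mean

# Probability that a standard normal value falls within k sd of the mean
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)  # 68.3 95.4 99.7
```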