INFERENTIAL STATISTICS Flashcards
inferential stats?
draw conclusions about a population that extend beyond the immediate sample data
Bernoulli distribution?
important special case of a discrete variable -> binary: only 2 possible outcomes (1 with probability p, 0 with probability 1-p)
population parameter?
fixed feature of a particular population e.g. pop mean, pop variance
sample stats?
quantity that varies from one sample to another; used to estimate a population parameter from a random sample, as surveying the entire population is not practical
Law of large numbers?
as sample size n increases, the sample mean gets closer to population mean
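A minimal simulation sketch of this card in R. The Bernoulli population with p = 0.3, the seed and the sample sizes are illustrative assumptions, not part of the original notes.

```r
# Law of large numbers: running sample mean of Bernoulli(p) draws.
set.seed(1)                               # arbitrary seed, reproducibility only
p <- 0.3                                  # assumed population mean
x <- rbinom(10000, size = 1, prob = p)    # 10,000 draws of 0/1
running_mean <- cumsum(x) / seq_along(x)  # sample mean after each draw
# Absolute error of the sample mean at a few sample sizes
err <- abs(running_mean[c(10, 100, 1000, 10000)] - p)
print(err)
```

The error need not shrink at every single step, but it tends towards 0 as n grows.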
Central limit theorem?
when sample size is large (rule of thumb: n>=30), the sampling distribution of the sample mean x̄ is approximately normal, regardless of the distribution we started out with (provided it has finite variance)
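A rough simulation sketch of this card in R. The exponential population (rate 1, so mean 1 and SD 1), the seed and the 5000 replications are illustrative assumptions.

```r
# Central limit theorem: means of n = 30 draws from a skewed population.
set.seed(2)                    # arbitrary seed, reproducibility only
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))
mean(sample_means)             # should be close to the population mean, 1
sd(sample_means)               # should be close to 1/sqrt(30)
```

Calling hist(sample_means) afterwards should show a roughly bell-shaped histogram, even though the exponential population itself is strongly skewed.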
hypothesis testing
tells us how extreme our sample outcome is if the null hypothesis is true. creates a rejection region, beyond which the sample is too extreme to maintain that the null hypothesis is true
standardisation
Z=(x-mean)/SD; if X is normal, Z~N(0,1)
test stat Z
(p observed - p)/standard error, where standard error = sqrt(p(1-p)/n) (two-sided test at 5% level: reject if |Z|>1.96)
reject H0?
p-value<0.05 (at the 5% significance level)
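A worked sketch of the Z test for a proportion in R. The numbers (60 successes in 100 trials, null value 0.5) are made up for illustration. The denominator is the standard error under H0, sqrt(p0(1-p0)/n).

```r
# Two-sided Z test for a proportion (illustrative numbers).
n  <- 100                          # assumed sample size
x  <- 60                           # assumed number of successes
p0 <- 0.5                          # assumed proportion under H0
p_hat <- x / n                     # observed proportion, 0.6
se <- sqrt(p0 * (1 - p0) / n)      # standard error under H0, 0.05
z  <- (p_hat - p0) / se            # test statistic, approximately 2
p_value <- 2 * pnorm(-abs(z))      # two-sided p-value, approximately 0.0455
reject <- abs(z) > qnorm(0.975)    # TRUE: |z| exceeds about 1.96
```

Both decision rules agree here: |Z| is above 1.96 and the p-value is below 0.05.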
95% Confidence Interval
(sample mean - 1.96 SE, sample mean + 1.96 SE), where SE = SD/sqrt(n); reject H0 if the hypothesised value is not in range
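A small sketch of a 95% confidence interval in R, here for a proportion. The observed proportion 0.6, n = 100 and the hypothesised value 0.5 are made-up numbers. The interval is centred on the sample estimate and uses the estimated standard error.

```r
# 95% confidence interval for a proportion (illustrative numbers).
n     <- 100                                 # assumed sample size
p_hat <- 0.6                                 # assumed observed proportion
p0    <- 0.5                                 # assumed hypothesised value
se_hat <- sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_hat
ci                                           # roughly (0.504, 0.696)
reject <- p0 < ci[1] | p0 > ci[2]            # TRUE: 0.5 lies outside the interval
```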
import data from file?
Auto=read.csv('link',header=TRUE,na.strings='?')
class of Auto?
class(Auto) -> "data.frame"
structure of data?
str(Auto)
first few rows of data (with headers)?
head(Auto)
names of variables?
names(Auto)
number of observations and variables?
dim(Auto)
frequency of each value of the origin variable?
table(Auto$origin)
recoding data for 'origin'? (check using table(Auto$originf))
Auto$originf = factor(Auto$origin,
labels = c("USA", "Europe", "Japan"))
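A toy sketch of the recoding in R. The data frame `toy` is hypothetical. It assumes the codes 1, 2, 3 stand for USA, Europe, Japan, which is the order the labels above imply once the codes are sorted.

```r
# Recode a numeric code column into a labelled factor (toy data).
toy <- data.frame(origin = c(1, 3, 2, 1, 1))
toy$originf <- factor(toy$origin,
                      levels = 1:3,          # explicit levels guard against a missing code
                      labels = c("USA", "Europe", "Japan"))
table(toy$originf)                           # USA 3, Europe 1, Japan 1
```

Giving `levels` explicitly keeps the label order fixed even if one of the codes is absent from the data.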
create new data.frame without variable ‘origin’?
new_data=subset(Auto,select=c(-origin))
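identify number of missing values (NA) and number of rows containing them?
sum(is.na(Auto)) counts missing entries; sum(!complete.cases(Auto)) counts rows with at least one NA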
locate entries (which row and column) with missing values
which(is.na(Auto),arr.ind=TRUE)
remove rows with missing values?
Auto=na.omit(Auto)
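A toy sketch of the missing-value cards in R. The data frame `toy` and its columns are made up. It shows that counting missing entries and counting incomplete rows can give different answers.

```r
# Missing values on a toy data frame: row 2 has two NAs, row 3 has one.
toy <- data.frame(mpg        = c(18, NA, 24, 30),
                  horsepower = c(130, NA, NA, 95))
n_missing <- sum(is.na(toy))              # 3 missing entries
n_bad_rows <- sum(!complete.cases(toy))   # 2 rows contain at least one NA
which(is.na(toy), arr.ind = TRUE)         # row and column index of each NA
clean <- na.omit(toy)                     # keeps only rows 1 and 4
nrow(clean)                               # 2
```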
summarising data for a variable?
mean(Auto$variable)
median(Auto$variable) -> same as quantile(Auto$variable,0.5)
max(Auto$variable),min(Auto$variable) (max minus min gives the range)
var(Auto$variable)
sd(Auto$variable)
5 number summary?
quantile(Auto$variable) OR summary(Auto$variable) (summary also reports the mean)
interquartile range?
IQR(Auto$variable)
covariance and correlation of two variables?
attach(Auto)
cov(var1,var2)
cor(var1,var2)
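A toy sketch of covariance and correlation in R. The data frame and the names var1, var2 are placeholders. It uses with() rather than attach(), an alternative that avoids leaving the data frame on the search path.

```r
# Covariance and correlation on toy data where var2 is exactly 2 * var1.
toy <- data.frame(var1 = c(1, 2, 3, 4),
                  var2 = c(2, 4, 6, 8))
cv <- with(toy, cov(var1, var2))   # equals 2 * var(var1), about 3.33
cr <- with(toy, cor(var1, var2))   # 1: perfect positive linear relationship
```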
barplot of variable?
barplot(table(Auto$variable), xlab='label', ylab='frequency', col='wheat')
histogram of variable?
hist(Auto$variable, breaks=20, xlab='variable (#bin=20)', ylab='frequency', main='', col='wheat')
side by side graphs with 1 row and 2 columns?
par(mfrow=c(1,2))
box plot of variable?
boxplot(Auto$variable, col='wheat', main='title', horizontal=TRUE)
Detect outliers based on IQR: values outside [Q1 - 1.5IQR, Q3 + 1.5IQR]?
boxplot.stats(Auto$variable)$out
Locate the outliers in the dataset
outlier= boxplot.stats(Auto$variable)$out
outlier_row=which(Auto$variable %in% outlier)
Auto[outlier_row, ]
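A toy sketch of the IQR-based outlier cards in R. The data frame `toy` and column `hp` are made up. Note that boxplot.stats() computes the quartiles as hinges, which can differ slightly from quantile().

```r
# IQR-based outliers on toy data with one obvious extreme value.
toy <- data.frame(hp = c(10, 12, 11, 13, 12, 11, 50))
outlier <- boxplot.stats(toy$hp)$out        # values beyond 1.5 * IQR from the hinges
outlier_row <- which(toy$hp %in% outlier)   # row positions of those values
toy[outlier_row, , drop = FALSE]            # row 7, hp = 50
```

`drop = FALSE` keeps the result a data frame even though `toy` has a single column.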
Detect outliers based on percentile: 2.5% - 97.5%
lower=quantile(Auto$variable,0.025)
upper=quantile(Auto$variable,0.975)
outlier_row=which(Auto$variable>upper | Auto$variable<lower)
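A toy sketch of the percentile rule in R, flagging values below the 2.5th or above the 97.5th percentile. The data frame `toy` holding the values 1 to 100 is made up for illustration.

```r
# Percentile-based outliers on toy data 1..100.
toy <- data.frame(value = 1:100)
lower <- quantile(toy$value, 0.025)     # 3.475 with R's default quantile type
upper <- quantile(toy$value, 0.975)     # 97.525
outlier_row <- which(toy$value > upper | toy$value < lower)
toy[outlier_row, , drop = FALSE]        # rows 1, 2, 3, 98, 99, 100
```

Unlike the IQR rule, this rule always flags roughly 5% of the observations, whether or not they are unusual.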
scatterplot?
plot(Auto$var1,Auto$var2, xlab='var1', ylab='var2')