8 | Statistics for Numerical Data I Flashcards
(POLL)
Which measures can be used to describe the data scatter?
* mean
* median
* cv
* sem
* sd
- cv
- sd
(POLL)
If one of the values is 0, the geometric mean is:
* positive
* negative
* zero
* undefined
* 1
zero
(POLL)
If the data are normally distributed, the values within +/- 1 SD are …
* 50% of all data
* 2/3 of all data
* 1/3 of all data
* the SD does not tell us this
- 2/3 of all data
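The 2/3 rule of thumb can be verified with the standard normal CDF pnorm; a quick sketch in base R:

```r
# Fraction of normally distributed data within +/- 1 and +/- 2 SD
print(pnorm(1) - pnorm(-1))   # 0.6826895 ... roughly 2/3
print(pnorm(2) - pnorm(-2))   # 0.9544997 ... roughly 95%
```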
(POLL)
Which of the following measures can be used to describe the shape of the distribution of the data?
* cv
* kurtosis
* mean
* sd
* sem
* skewness
* var
- kurtosis
- skewness
(POLL)
With the 3 SD criterion, how many outliers do you expect for 1000 values if the data are normally distributed?
* 0
* 1
* 2-3
* 5-10
* 10-100
* 100-1000
- 2-3
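The 2-3 answer follows directly from the normal tail probabilities; a quick check in base R:

```r
# Probability of falling outside +/- 3 SD, scaled to 1000 values
p.out = 2 * pnorm(-3)
print(p.out)          # 0.002699796
print(1000 * p.out)   # 2.699796 -> expect 2-3 outliers per 1000 values
```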
(POLL)
You have a numerical variable with ten values. Which plot would you use for visualization?
* Histogram
* Stripchart
* density line
* violinplot
- Stripchart
also possible but not best:
* Histogram (stripchart better)
(POLL)
You visualize a numerical variable against a categorical one; which are the appropriate plots to use?
* barplot
* boxplot
* histogram
* stripchart
* violinplot
* xyplot
- boxplot
- stripchart
- violinplot
(POLL)
Which test(s) could you use to check if your data are normally distributed?
* Chisq-Test
* Fisher-Test
* Kolmogorov-Smirnov-Test
* Shapiro-Wilk-Test
* T-Test
- Kolmogorov-Smirnov-Test
- Shapiro-Wilk-Test
(POLL)
Which things are shown on a boxplot?
* mean,max,minimum,outliers
* median,1st quartile,3rd quartile,outliers
* mean,1st quartile,3rd quartile,outliers
- median,1st quartile,3rd quartile,outliers
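The numbers behind a boxplot can be inspected with boxplot.stats; a small sketch with one artificial outlier:

```r
# boxplot.stats returns the five hinge values and the outliers
set.seed(42)                 # for reproducibility
x = c(rnorm(100), 8)         # standard normal data plus one outlier at 8
bs = boxplot.stats(x)
print(bs$stats)              # lower whisker, Q1, median, Q3, upper whisker
print(bs$out)                # values beyond 1.5 * IQR from the box
```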
R:
How would you create a summary of a dataset with the following variables?
Cat
Data Summaries
---------------------------------------------------------------------
1D  2D  3D  | function
---------------------------------------------------------------------
Cat NA  NA  | table(c1)
Cat Cat NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
---------------------------------------------------------------------
Cat Num NA  | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
---------------------------------------------------------------------
Num NA  NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA  | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
---------------------------------------------------------------------
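As a concrete illustration of a few table rows (a sketch using the built-in mtcars dataset, with cyl playing the categorical role and mpg/hp the numerical ones):

```r
# Walking through a few rows of the summary table with mtcars
c1 = factor(mtcars$cyl)                      # Cat
n1 = mtcars$mpg; n2 = mtcars$hp              # Num, Num
print(table(c1))                             # Cat: counts per cylinder class
print(aggregate(n1, by=list(cyl=c1), mean))  # Cat Num: mean mpg per class
print(cor(n1, n2))                           # Num Num: correlation mpg vs hp
```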
R:
How would you create a summary of a dataset with the following variables?
Cat Cat
- table(c1,c2)
- chisq.test(table(c1,c2))
R:
How would you create a summary of a dataset with the following variables?
Cat Cat Cat
- ftable(c1,c2,c3)
R:
How would you create a summary of a dataset with the following variables?
Cat Num
- aggregate(n2,by=list(c1),func)
R:
How would you create a summary of a dataset with the following variables?
Cat Num Num
- sbi$aggregate2(n2,n3,c1,cor)
R:
How would you create a summary of a dataset with the following variables?
Cat Cat Num
- aggregate(n3,by=list(c1,c2),func)
R:
How would you create a summary of a dataset with the following variables?
Num
- mean(n1), median(n1), sd(n1), mad(n1)
R:
How would you create a summary of a dataset with the following variables?
Num Num
- cor(n1,n2)
R:
How would you create a summary of a dataset with the following variables?
Num Num Num
- cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
Univariate Descriptions of Numerical Data
How can we describe the center?
- center: mean, mean(x,trim=0.1), median
Univariate Descriptions of Numerical Data
How can we describe the scatter?
- scatter: var, sd, cv
Univariate Descriptions of Numerical Data
How can we describe the distribution?
- distribution: quantile, IQR, max, min, range
Univariate Descriptions of Numerical Data
How can we describe the shape?
- shape: skewness, kurtosis
Univariate Descriptions of Numerical Data
How can we describe with graphics?
- plots: boxplot (barplot with arrows)
R:
Calculate mean
- normal/arithmetic mean: mean(x,na.rm=TRUE)
- trimmed mean: mean(x,na.rm=TRUE,trim=0.1)
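A short sketch of what trim does: the trimmed mean is robust against a single extreme value:

```r
# trim=0.1 drops the lowest and highest 10% of values before averaging
x = c(1:9, 100)           # nine small values and one outlier
print(mean(x))            # 14.5 -- pulled up by the outlier
print(mean(x, trim=0.1))  # 5.5  -- mean of 2..9 after trimming
```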
R:
Calculate median
- median: median(x,na.rm=TRUE)
Pythagorean means? Inequality?
- (arithmetic) mean
- geometric mean
- harmonic mean
- they hold this inequality: arithmetic ≥ geometric ≥ harmonic
Formulae: see SPICK (cheat sheet)
Pythagorean Means – ranges? Properties?
(arithmetic) mean:
* standard
* values must have same properties, same ranges
geometric mean:
* average of multiple properties on different scales/ranges
* all values must be above zero
harmonic mean:
* average of rates, all values must be above zero
Why do we need other types of mean:
What is the average speed if travelling 60 km at 60km/h and another 60 km at 20km/h?
- one might think 40 → wrong!
- 1 h at 60 km/h plus 3 h at 20 km/h → 120 km / 4 h → 30 km/h
- This can also be calculated using the harmonic mean
- No 0s or negative with harmonic mean!
~~~
> library(diagram)
> openplotmat(xlim=c(-0.05,1.05),ylim=c(0.2,0.75))
> textdiamond(c(0.5,0.65),lab='60km/h',
+ box.col='light blue',radx=0.12,rady=0.08)
> textdiamond(c(0.5,0.35),lab='20km/h',
+ box.col='light blue',radx=0.12,rady=0.08)
> straightarrow(c(0.1,0.55),c(0.9,0.55),lwd=2)
> straightarrow(c(0.9,0.45),c(0.1,0.45),lwd=2)
> axis(1,labels=c('0km','60km'),at=c(0.1,0.9))
> textround(c(0.1,0.5),lab='A',box.col='salmon',
+ radx=0.01,rady=0.1,cex=1.2)
> textround(c(0.9,0.5),lab='B',box.col='salmon',
+ radx=0.01,rady=0.1,cex=1.2)
> hmean(c(20,60))    # hmean: course helper function (sbi)
[1] 30
> hmean(c(20,0))
[1] 0
> hmean(c(20,60,NA))
[1] NA
> hmean(c(20,60,NA),na.rm=TRUE)
[1] 30
~~~
Why do we need other types of mean:
Geometric Mean example GDP?
- Average GDP of European Union?
- Changes in GDP: each country has equal impact?
→ Solution: geometric mean!
→ This is also helpful for a stock index!
→ Exercise: implement the geometric and harmonic mean! (SPICK)
~~~
> # USD 2016
> bip.gre = 19000
> bip.ger = 42000
> mean(c(bip.ger,bip.gre))
[1] 30500
> mean(c(bip.ger+(bip.ger/10),bip.gre))
[1] 32600
> mean(c(bip.ger,bip.gre+(bip.gre/10)))
[1] 31450
> gmean(c(bip.ger,bip.gre))
[1] 28248.89
> gmean(c(bip.ger+(bip.ger/10),bip.gre))
[1] 29627.69
> gmean(c(bip.ger,bip.gre+(bip.gre/10)))
[1] 29627.69
> gmean(c(0.1,10))
[1] 1
> gmean(c(0.2,10))
[1] 1.414214
> gmean(c(0.1,20))
[1] 1.414214
> gmean(c(0.1,20,100))
[1] 5.848035
> gmean(c(0.15,20,100))
[1] 6.69433
> gmean(c(0.1,30,100))
[1] 6.69433
> gmean(c(0.1,20,150))
[1] 6.69433
~~~
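The gmean and hmean helpers used above are not part of base R; they are course-specific functions (stored in sbi). A minimal sketch of possible implementations, matching the exercise above:

```r
# Sketch implementations of the geometric and harmonic mean
# (assumed versions of the course's sbi helpers; all values must be > 0)
gmean = function(x, na.rm=FALSE) {
  if (na.rm) x = x[!is.na(x)]
  exp(mean(log(x)))        # log trick avoids overflow of prod(x)
}
hmean = function(x, na.rm=FALSE) {
  if (na.rm) x = x[!is.na(x)]
  length(x) / sum(1/x)
}
print(hmean(c(20, 60)))    # 30, the average-speed example
print(gmean(c(0.1, 10)))   # 1
```

With these, the Pythagorean inequality can be checked directly, e.g. for c(1,2,4): mean 2.33 ≥ gmean 2 ≥ hmean 1.71.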
Data scatter results from?
- variability:
– imprecision
– experimental error
– biological variability
- (not from bias – bias is a systematic error)
What is variability, ie where does it come from?
– imprecision
– experimental error
– biological variability
What is bias?
– systematic error
– does not! contribute to scatter
Variability and Bias – illustrated with two guns and two gunmen?
Bad gun = bias
Bad gunman = imprecision
Good gunman (precise)
* Good gun → small scatter, right in the bull's eye (accurate, low variance)
* Bad gun → small scatter, but shifted away from the centre (inaccurate: bias)
Bad gunman (imprecise)
* Good gun → a lot of scatter, but the bull's eye is roughly the average (accurate, high variance)
* Bad gun → a lot of scatter and shifted away from the centre (inaccurate: bias)
What data scatter measures are there for numerical data?
- standard deviation (SD)
- coefficient of variation (CV)
- standard error of the mean (SEM)
Standard deviation (SD) formula for sample?
– s: sample standard deviation: s = √[ ∑(xᵢ − x̄)² / (N − 1) ]
Standard deviation (SD) formula for population?
– σ: population standard deviation: σ = √[ ∑(xᵢ − μ)² / N ]
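Note that R's sd() uses the sample formula (N − 1). A sketch comparing both versions by hand:

```r
# Sample vs population standard deviation, computed from the formulas
x = c(2, 4, 4, 4, 5, 5, 7, 9)              # mean is exactly 5, sum of squares 32
n = length(x)
s   = sqrt(sum((x - mean(x))^2) / (n - 1)) # sample SD, same as sd(x)
sig = sqrt(sum((x - mean(x))^2) / n)       # population SD
print(c(sd(x), s, sig))                    # ~2.138  ~2.138  2
```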
Standard deviation (SD) – proportions in normal data?
normal data:
- ~ 2/3 of data are within 1 SD
- ~ 95% of data are within 2 SD
Issue with SD?
- unit based → SD in m is smaller than SD in cm → use CV
Coefficient of variation (CV) formula? Used to?
– CV% = 100 · sd(x) / mean(x) → cv = 100 · s_x / x̄
– used to compare different magnitudes
Standard error of the mean formula?
– SEM = sd(x)/sqrt(N) → sem = s_x / √N
Standard error of the mean what does it tell us?
– how close are we to the true population mean
Standard error of the mean how can we get closer to pop mean?
– more measurements → closer to the population mean
SEM vs SD?
- SEM is always smaller than SD, so some people prefer to show it
- more measurements → SEM shrinks as we get closer to the population mean
- it does not actually express the scatter, though it depends on the scatter
- it tells us how close we are to the true population mean
R:
SD, CV, SEM
~~~
> sd(survey$kg)
[1] NA
> sd(survey$kg,na.rm=TRUE)    # standard deviation
[1] 12.74224
> print(try(cv(survey$kg)))   # there is no cv function in base R
[1] "Error in cv(survey$kg) : konnte Funktion \"cv\" nicht finden\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in cv(survey$kg): konnte Funktion "cv" nicht finden>
> cv=function(x,na.rm=FALSE) {    # implement own cv function
+   100*sd(x,na.rm=na.rm)/mean(x,na.rm=na.rm)
+ }                               # where to place this? → sbi, see below
> cv(survey$kg)
[1] NA
> sem=function(x,na.rm=FALSE) {   # implement own sem function
+   sd(x,na.rm=na.rm)/sqrt(length(x[!is.na(x)]))
+ }
> sem(survey$kg)
[1] NA
> sem(survey$kg,na.rm=TRUE)
[1] 0.6033626
> sbi$cv=cv
> sbi$sem=sem
~~~
What are the first, second, third, and fourth moments of a distribution?
- first moment: mean
- second moment: variance
- third moment: skewness
- fourth moment: kurtosis
What is skewness?
- third central moment of a distribution
- shape: signed measure for the degree of asymmetry
Skewness – positive values?
- positive values: skewed to the right (long tail on the right)
Skewness – negative values?
- negative values: skewed to the left (long tail on the left)
Skewness – around 0?
- around zero: symmetrical distribution
R:
Which library for skewness and kurtosis?
> library(e1071)
R:
Skewness – normal distribution?
~~~
> library(e1071)
> library(UsingR)
> nym=nym.2002[nym.2002$age>16,]
> skewness(nym$time)
[1] 1.083422 ### longer tail on the right
> skewness(nym$time*-1)
[1] -1.083422 ### longer tail on the left
> x <- rnorm(1000)
> skewness(x)
[1] 0.08863
> x <- rnorm(1000)
> skewness(x)
[1] -0.03025407
> hist(nym$time,freq=F,ylim=c(0,0.01),main='',xlab='time',ylab='density')
> box()
> lines(density(nym$time),col="blue",lwd=3)
~~~
R:
Skewness - Bimodal distribution?
~~~
> norm1=c(rnorm(1000,mean=5,sd=2))
> norm2=c(rnorm(500,mean=13,sd=2))
> nonorm=c(norm1,norm2)
> median(nonorm)
[1] 6.304157
> mean(nonorm)
[1] 7.654257   ### mean and median are not close together
               ### → indication of a long tail or bimodal distribution
> skewness(nonorm)
[1] 0.4822618  ## right tail
> par(mfrow=c(2,1))
> hist(nonorm,col="light blue");box()
> points(mean(nonorm),145);text(mean(nonorm),160,"mean")
> points(median(nonorm),185);text(median(nonorm),200,"median")
> qqnorm(nonorm);qqline(nonorm)
~~~
R:
Skewness implementation in R?
~~~
> dskewness <- function (x,na.rm=TRUE) {
+   if (na.rm) {
+     x=x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g=(sum((x-mean(x,na.rm=na.rm))^3)/length(x))/(sd(x)^3)
+   return(g)
+ }
> skewness(nonorm)
[1] 0.4822618
> dskewness(nonorm)
[1] 0.4822618
> sbi$skewness=dskewness
~~~
What is Kurtosis?
- fourth central moment of a distribution
- peakedness of a distribution
Kurtosis formula: see SPICK (cheat sheet)
What does a large kurtosis mean?
* large values: sharp peak
What does a ca 0 kurtosis mean?
* around zero: normal peak
What does a small kurtosis mean?
* low values: broad peak
R:
Kurtosis usage?
~~~
> library(e1071)
> xn1=runif(1000)
> kurtosis(xn1)
[1] -1.195974
> xn2=runif(1000)
> kurtosis(xn1+xn2)
[1] -0.5967484
> xn3=runif(1000)
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> # rnorm gives around 0
> kurtosis(rnorm(1000))
[1] 0.036207
> # rt can be peakier ...
> kurtosis(rt(1000,df=10))
[1] 1.443117
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(xn1,col='#ff3333',breaks=25); box()
> hist(xn1+xn2+xn3,col='#ffff33',breaks=20);box()
> hist(rt(1000,df=10),col='#33ff33',breaks=25);box()
~~~
R:
Kurtosis – Implementation?
~~~
> dkurtosis <- function (x,na.rm=TRUE) {
+   if (na.rm) {
+     x=x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g=(sum((x-mean(x,na.rm=na.rm))^4)/length(x))/(sd(x)^4)-3
+   return(g)
+ }
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> dkurtosis(xn1+xn2+xn3)
[1] -0.3342538
> sbi$kurtosis=dkurtosis
~~~
Outliers definition?
Grubbs (1969) defined an outlier as:
an observation that appears to deviate markedly from the other members of the sample in which it occurs.
Outliers measures?
- more than 3 x SD away from the mean
- more than 1.5 × IQR beyond the quartiles (mild) or 3 × IQR (extreme) → the boxplot criteria
What could an outlier be?
* an error prone data point ?
* a normal data scatter point ?
* an important specific observation ?
* → It depends
How can we deal with outliers?
* delete the variable
* delete the value → missing value imputation
* transform the variable
* transform the value
R:
Implement outlier function to check for outliers in normal distribution
~~~
> is.outlier <- function (x) {
+   sd=sqrt(var(x,na.rm=TRUE))
+   xm=x-mean(x,na.rm=TRUE)
+   return(abs(xm)-3*sd>0)
+ }
> rn=rnorm(1000)
> table(is.outlier(rn))
FALSE  TRUE
  998     2
> table(abs(scale(rn))>3)   # same criterion via scale()
FALSE  TRUE
  998     2
> is.outlier = function (x) { return(abs(scale(x))>3) }
> table(is.outlier(rn))
FALSE  TRUE
  998     2
~~~
What is a violinplot?
- similar to a box plot
- but it also shows the probability density of the data at different values,
- usually smoothed by a kernel density estimator
R:
Violinplot (lattice)?
~~~
> library(lattice)
> print(bwplot(nym.2002$age,main="nym.2002$age",
+   panel=panel.violin,col="light blue"))
~~~
Density plot?
- For single numerical variable
- like a smoothed histogram → bars become lines
R:
Density plot?
~~~
> hist(nym.2002$age,main="nym.2002$age",freq=F,col="light blue")
> box()
> lines(density(nym.2002$age),col="blue")
> lines(density(nym.2002$age,bw=1),col="red")
> lines(density(nym.2002$age,bw=4),col="green")
> lines(density(nym.2002$age,bw=10),col="magenta")
~~~
Boxplot vs Stripchart?
Boxplot:
* many values
Stripchart:
* few values
* 1D scatter plots (or dot plots)
* good alternative to boxplot when few values
R:
Boxplot vs Stripchart?
~~~
par(mfrow=c(1,3))
boxplot(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,method="stack",
  col=c(2,4),vertical=TRUE)
~~~
The stack option makes it look a bit like a histogram:
'a mixture between histogram, density line and violin plot'
When to use which Num ~ Cat plot ?
* violinplot: ?
* stripplot: ?
* boxplot: ?
- violinplot: show bi- or multimodal data
- stripplot: show few values (a dozen per group)
- boxplot: all other cases
Numerical ~ Numerical
Two ways to analyse the relationship between two numerical variables?
- Correlation
- Regression
Which method of analysing a relationship between two numerical variables is interested in the direction of the relationship?
interested: regression
Which method of analysing a relationship between two numerical variables is not interested in the direction of the relationship?
not interested: correlation
R:
How can we get from some other distribution to a normal distribution?
Uniform to Normal Distribution
(extra)
Uniform distribution – let's throw dice:
~~~
> options(digits=3)
> runif(6,1,7)
[1] 2.30 2.69 1.15 3.54 5.52 3.50
> as.integer(runif(10,1,7))
[1] 4 1 1 2 5 3 1 2 2 3
> ru1=as.integer(runif(1000,1,7))
> table(ru1)
ru1
1 2 3 4 5 6
163 166 164 157 162 188
> ru2=as.integer(runif(1000,1,7))
> par(mfrow=c(2,1),mai=rep(0.4,4))
> barplot(table(ru1))
> barplot(table(ru2))
~~~
R:
Mean of two uniform samples
~~~
> ru1u2=(ru1+ru2)/2
> table(ru1u2)
ru1u2
  1 1.5   2 2.5   3 3.5   4 4.5   5 5.5   6
 30  47  92  99 131 176 159 108  89  48  21
> par(mfrow=c(2,1),mai=rep(0.4,4))
> hist(ru1u2)
> barplot(table(ru1u2))
~~~
→ hist problem with discrete data
R:
Means of Uniform Distributions
→ from a uniform (or any other) distribution to a Gaussian distribution …
~~~
> options(digits=3)
> r1=runif(1000,1,7)
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(r1,col='light blue',main='r1')
> box()
> for (i in 1:10) {
+   r1=r1+runif(1000,1,7)
+   if (i == 2) {
+     hist(r1/3,main='r3',col='light blue')
+   }
+ }
> hist(r1/11,main='r11',col='light blue')
> box()
~~~
Central Limit Theorem
- if the sample size is large enough
- the population of sample means will approximate a Gaussian distribution
- no matter how the population is distributed
Central Limit Theorem
How large does sample size need to be?
It depends:
* the more normal the population → 10 or more samples are enough
* the less normal the population → more samples needed; 100 should always be enough
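A sketch demonstrating the theorem: individual uniform values fail a normality test, but their sample means look normal:

```r
# Means of uniform samples approach a normal distribution (CLT)
set.seed(1)
u = runif(1000, 1, 7)                         # clearly non-normal data
means = replicate(1000, mean(sample(u, 30)))  # 1000 sample means, n=30
print(shapiro.test(u)$p.value)      # tiny: uniform data rejected as normal
print(shapiro.test(means)$p.value)  # typically > 0.05: means look normal
```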
Properties of a Normal Distribution?
- symmetrical bell shape
- extends in both directions to ∞
- mean, median close together
- 95% of values within 2 SD
For which distribution(s) is this true: 95% of all values are within 2 SD?
Normal distribution
this assumption gives very wrong results if the distribution is non-normal ‼
Normal numerical data (num ~ cat) → which test?
t.test
Non-normal numerical data (skewed or multi-modal distributions, num ~ cat) → which test?
wilcox.test
R:
The [dpqr]norm function family?
- d: pdf (probability density function); dnorm(N) → density at N
- p: cdf (cumulative distribution function); pnorm(N) → P(X ≤ N)
- q: inverse cdf (quantile function); qnorm(P) → X
- r: random number generator; rnorm(N) → N random values
Example:
~~~
> options(digits=3)
> rn=rnorm(1000)
> hist(rn,freq=F,col='light blue')
> lines(density(rn),col='red',lwd=2)
> box()
> pnorm(0)
[1] 0.5
> summary(rn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐3.36 ‐0.62 0.01 0.04 0.68 3.42
> dnorm(0)
[1] 0.399
> qnorm(0.5)
[1] 0
~~~
R:
How can we test for normality?
H0 for Shapiro-Wilk test assumes normality:
~~~
> rn=rnorm(1000,mean=5.5,sd=0.5)
> mean(rn)
[1] 5.5
> median(rn)
[1] 5.51
> shapiro.test(rn)
Shapiro‐Wilk normality test
data: rn
W = 1, p‐value = 0.7
> shapiro.test(rn)$p.value
[1] 0.746
> shapiro.test(runif(100,1,6))$p.value
[1] 0.00346
> shapiro.test(runif(100,1,6))$statistic
W
0.956
> shapiro.test(survey$cm)
Shapiro‐Wilk normality test
data: survey$cm
W = 1, p‐value = 4e‐04
> shapiro.test(survey$cm[survey$gender=='M'])$p.value
[1] 0.409
~~~
→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution
→ p-value < 0.05 we reject H0, that the distribution comes from a normal distribution
Shapiro-Wilk Test:
→ p-value >= 0.05 we ______ H0, that the distribution comes from a normal distribution
→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution
Reporting Shapiro-Wilk Test?
Analysing height: W = 0.987, p = 0.00042 and weight, W = 0.951, p = 0, of students.
The size of students, W = 0.987, p = 0.00042, as well as the weight of students, W = 0.951, p = 0, were both significantly non-normally distributed.
* with many samples the Shapiro-Wilk test becomes significant very easily
* with many samples use it only together with visual inspection (histogram, qqplot)
[Better:
The size of students, W = 0.987, p < 0.001, as well as the weight of students, W = 0.951, p < 0.001, were both significantly non-normal.]
The Shapiro-Wilk test can be used to check if data comes from a normal distribution. How else can you check for normality?
- Kolmogorov-Smirnov test
- a generalized test for any distribution
- checks whether two samples might come from the same distribution → or one sample against e.g. a normal distribution
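A sketch of ks.test in both roles (comparison against a theoretical distribution, and two-sample comparison):

```r
# Kolmogorov-Smirnov test: one-sample and two-sample usage
set.seed(1)
x = rnorm(100); y = rnorm(100); z = runif(100)
print(ks.test(x, "pnorm")$p.value)  # typically high: consistent with normal
print(ks.test(x, y)$p.value)        # typically high: same distribution
print(ks.test(x, z)$p.value)        # low: clearly different distributions
```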
R:
Normal Data Visualization
~~~
> norm=c(rnorm(1000,mean=5,sd=3))
> median(norm)
[1] 5.06
> mean(norm)
[1] 5.03
> shapiro.test(norm)$p.value
[1] 0.419
> par(mfrow=c(1,2),mai=c(0.4,0.4,0.4,0.4))
> hist(norm,col="light blue")
> points(mean(norm),145)
> text(mean(norm),160,"mean")
> points(median(norm),185)
> text(median(norm),200,"median")
> box()
> qqnorm(norm)
> qqline(norm)
~~~
R:
QQ-plots
Plotting the observed values against the expected/theoretical
~~~
> options(digits=3)
> rn=rnorm(10)
> head(sort(signif(rn,3)),n=5)
[1] ‐1.030 ‐0.657 ‐0.540 ‐0.439 0.147
> qqnorm(rn)
> qqline(rn)
~~~
T-Distribution: Derived from _____________
Normal distribution
T-Distribution: t is the _______ between ________ and the ______, divided by the _____
t is the difference between the sample mean and the population mean, divided by the SEM
T-Distribution: you perform a sampling experiment where you know the __________
Population value
R:
Practical – what’s being demonstrated here?
T-Distribution: There is always a difference … Let’s simulate many t’s
~~~
> rn=rnorm(10000) # our population is rn
> summary(rn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐3.68 ‐0.67 ‐0.02 0.00 0.66 3.99
> mcompare=function () {sam1=sample(rn,10); sam2=sample(rn,10); return(mean(sam1)-mean(sam2))}
> res=c() # create empty result vector
> for (i in 1:1000) { res=c(res,mcompare()) }  # append to result
> hist(res,col='light blue',cex.lab=1.5,cex.main=1.5)
> box()
~~~
R:
Practical – how to create own t distribution?
Distribution of t Values
~~~
> getT=function(n) {s1=sample(rn,n); t=(mean(s1)-mean(rn))/(sd(s1)/sqrt(n)); return(t) }
> getT(5)
[1] 0.642
> res=c()
> for (i in 1:1000) { res=c(res,getT(10)) }
> par(mfrow=c(2,1),mai=c(0.4,0.4,0.4,0.4))
> summary(res)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐6.26 ‐0.75 ‐0.01 ‐0.07 0.63 5.36
> hist(res,col='beige',freq=F,ylim=c(0,0.5))
> lines(density(res),col='red',lwd=2)
> lines(seq(-5,5,0.1),dt(seq(-5,5,0.1),df=10),
+   col='blue',lwd=2,lty=1); box()
~~~
R:
Function to generate random data from the t-distribution?
~~~
> print(try(rt(1000)))   # the df argument is required
[1] "Error in rt(1000) : Argument \"df\" fehlt (ohne Standardwert)\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in rt(1000): Argument "df" fehlt (ohne Standardwert)>
> xt=rt(1000,99)
> summary(xt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -3.96   -0.69    0.02    0.01    0.69    3.97
> shapiro.test(xt)$p.value
[1] 0.271
> ks.test(xt,"pt",99)$p.value
[1] 0.721
~~~
R:
The qt function?
t* (critical t values) with qt:
> df=c(1:6,10,20,50,100,10000);
> t=qt(0.975,df)
> data.frame(t,df)
t df
1 12.71 1
2 4.30 2
3 3.18 3
4 2.78 4
5 2.57 5
6 2.45 6
7 2.23 10
8 2.09 20
9 2.01 50
10 1.98 100
11 1.96 10000
(QUIZ 3)
The ______ is useful for calculating the mean of two speeds, whereas the ______ can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the ______.
The harmonic mean is useful for calculating the mean of two speeds, whereas the geometric mean can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the arithmetic mean.
(QUIZ 3)
A measure to describe how close our data are to the population mean is the ______. Its values are smaller than the values of the ______, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the ______ should be considered as well.
A measure to describe how close our data are to the population mean is the standard error of the mean. Its values are smaller than the values of the standard deviation, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the coefficient of variation should be considered as well.
(QUIZ 3)
The skewness is the ______ central moment of a distribution, whereas the kurtosis is the ______ moment. The ______ measures how sharp or flat a distribution is whereas the ______looks for the symmetry. A ______kurtosis means a very sharp distribution, whereas a ______kurtosis indicates a flat, more uniform like distribution
The skewness is the 3rd central moment of a distribution, whereas the kurtosis is the 4th central moment. The kurtosis measures how sharp or flat a distribution is, whereas the skewness looks at the symmetry. A positive kurtosis means a very sharp distribution, whereas a negative kurtosis indicates a flat, more uniform-like distribution.
(QUIZ 3)
A boxplot displays the overall ______ of the distribution. The line in the middle of the box indicates the ______, whereas the upper bound indicates the ______, the lower bound the ______. Outside of the whiskers the ______ are shown, which are more than 1.5 ______ away from the box.
A boxplot displays the overall shape of the distribution. The line in the middle of the box indicates the 50% quantile (median), whereas the upper bound indicates the 75% quantile, the lower bound the 25% quantile. Outside of the whiskers the outliers are shown, which are more than 1.5 IQR away from the box.
(QUIZ 3)
The Null-Hypothesis of the ______ test assumes that the data are coming from a ______ distribution. A p-value of less than 0.05 indicates that the data are coming from a ______. The ______ test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the ______ population. A low p-value indicates that the distributions are coming from _______. In a normal distribution mean and median have values which are ______.
The Null-Hypothesis of the Shapiro-Wilk test assumes that the data are coming from a normal distribution. A p-value of less than 0.05 indicates that the data are coming from a non-normal distribution. The Kolmogorov-Smirnov test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the same population. A low p-value indicates that the distributions are coming from different populations. In a normal distribution, mean and median have values which are close.