8 | Statistics for Numerical Data I Flashcards

1
Q

(POLL)

Which measures can be used to describe the data scatter?
* mean
* median
* cv
* sem
* sd

A
  • cv
  • sd
2
Q

(POLL)

If one of the values is 0, the geometric mean is:
* positive
* negative
* zero
* undefined
* 1

A

zero

3
Q

(POLL)

If the data are normally distributed, the values within +/- 1 SD of the mean are …
* 50% of all data
* 2/3 of all data
* 1/3 of all data
* the SD does not tell us this

A
  • 2/3 of all data
4
Q

(POLL)

Which of the following measures can be used to describe the shape of the distribution of the data?
* cv
* kurtosis
* mean
* sd
* sem
* skewness
* var

A
  • kurtosis
  • skewness
5
Q

(POLL)

With the 3 SD criterion, how many outliers do you expect among 1000 values if the data are normally distributed?
* 0
* 1
* 2-3
* 5-10
* 10-100
* 100-1000

A
  • 2-3
6
Q

(POLL)

You have a numerical variable with ten values. Which plot would you use for visualization?
* Histogram
* Stripchart
* density line
* violinplot

A
  • Stripchart

also possible, but not the best choice:
* Histogram (a stripchart is better with so few values)

7
Q

(POLL)

You visualize a numerical variable against a categorical one. Which plots are appropriate to use?
* barplot
* boxplot
* histogram
* stripchart
* violinplot
* xyplot

A
  • boxplot
  • stripchart
  • violinplot
8
Q

(POLL)

Which test(s) could you use to check if your data are normally distributed?
* Chisq-Test
* Fisher-Test
* Kolmogorov-Smirnov-Test
* Shapiro-Wilk-Test
* T-Test

A
  • Kolmogorov-Smirnov-Test
  • Shapiro-Wilk-Test
9
Q

(POLL)

Which things are shown on a boxplot?
* mean, max, min, outliers
* median, 1st quartile, 3rd quartile, outliers
* mean, 1st quartile, 3rd quartile, outliers

A
  • median, 1st quartile, 3rd quartile, outliers
10
Q

R:
How would you create a summary of a dataset with the following variables?
Cat

A

For Cat: table(c1)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

11
Q

R:
How would you create a summary of a dataset with the following variables?

Cat

A

For Cat: table(c1)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

12
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat

A

For Cat Cat: table(c1,c2), chisq.test(table(c1,c2))

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

13
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat Cat

A

For Cat Cat Cat: ftable(c1,c2,c3)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

14
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Num

A

For Cat Num: aggregate(n2,by=list(c1),func)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

15
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Num Num

A

For Cat Num Num: sbi$aggregate2(n2,n3,c1,cor)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

16
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat Num

A

For Cat Cat Num: aggregate(n3,by=list(c1,c2),func)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

17
Q

R:
How would you create a summary of a dataset with the following variables?

Num

A

For Num: mean(n1), median(n1), sd(n1), mad(n1)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

18
Q

R:
How would you create a summary of a dataset with the following variables?

Num Num

A

For Num Num: cor(n1,n2)

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

19
Q

R:
How would you create a summary of a dataset with the following variables?

Num Num Num

A

For Num Num Num: cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))

Data Summaries (reference)
------------------------------------------------------------------
1D   2D   3D  | function
------------------------------------------------------------------
Cat  NA   NA  | table(c1)
Cat  Cat  NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat  Cat  Cat | ftable(c1,c2,c3)
------------------------------------------------------------------
Cat  Num  NA  | aggregate(n2,by=list(c1),func)
Cat  Num  Num | sbi$aggregate2(n2,n3,c1,cor)
Cat  Cat  Num | aggregate(n3,by=list(c1,c2),func)
------------------------------------------------------------------
Num  NA   NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num  Num  NA  | cor(n1,n2)
Num  Num  Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
------------------------------------------------------------------

20
Q

Univariate Descriptions of Numerical Data
How can we describe the center?

A
  • center: mean, mean(x,trim=0.1), median
21
Q

Univariate Descriptions of Numerical Data
How can we describe the scatter?

A
  • scatter: var, sd, cv
22
Q

Univariate Descriptions of Numerical Data
How can we describe the distribution?

A
  • distribution: quantile, IQR, max, min, range
23
Q

Univariate Descriptions of Numerical Data
How can we describe the shape?

A
  • shape: skewness, kurtosis
24
Q

Univariate Descriptions of Numerical Data
How can we describe with graphics?

A
  • plots: boxplot (rather than a barplot with error-bar arrows)
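
A minimal R sketch (my own, not from the slides) pulling the measures from cards 20-24 together on one example vector; it assumes the e1071 package for skewness and kurtosis and defines cv and sem inline since they are not in base R:
```
library(e1071)                        # provides skewness() and kurtosis()
x = rnorm(100, mean = 170, sd = 10)   # example data

# center
mean(x); mean(x, trim = 0.1); median(x)
# scatter (cv and sem are small helper functions, not base R)
var(x); sd(x)
cv  = function(x) 100 * sd(x) / mean(x)
sem = function(x) sd(x) / sqrt(length(x))
cv(x); sem(x)
# distribution
quantile(x); IQR(x); range(x); max(x); min(x)
# shape
skewness(x); kurtosis(x)
# graphics
boxplot(x)
```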
25
R: Calculate mean
* normal/arithmetic mean: mean(x, na.rm=TRUE)
* trimmed mean: mean(x, na.rm=TRUE, trim=0.1)
26
R: Calculate median
* median: median(x, na.rm=TRUE)
27
Pythagorean means? Inequality?
* (arithmetic) mean
* geometric mean
* harmonic mean
* they obey the inequality: arithmetic ≥ geometric ≥ harmonic (see cheat sheet for the formulae)
28
Pythagorean Means – ranges? Properties?
(arithmetic) mean:
* the standard mean
* values must have the same properties and the same ranges

geometric mean:
* average of multiple properties on different scales / ranges
* all values must be above zero

harmonic mean:
* average of rates
* all values must be above zero
29
Why do we need other types of mean: What is the average speed if travelling 60 km at 60km/h and another 60 km at 20km/h?
* you might think 40 → wrong!
* 1 h at 60 km/h plus 3 h at 20 km/h → 120 km / 4 h → 30 km/h
* this can also be calculated using the harmonic mean
* no zeros or negative values with the harmonic mean!
```
> library(diagram)
> openplotmat(xlim=c(-0.05,1.05), ylim=c(0.2,0.75))
> textdiamond(c(0.5,0.65), lab='60km/h',
+   box.col='light blue', radx=0.12, rady=0.08)
> textdiamond(c(0.5,0.35), lab='20km/h',
+   box.col='light blue', radx=0.12, rady=0.08)
> straightarrow(c(0.1,0.55), c(0.9,0.55), lwd=2)
> straightarrow(c(0.9,0.45), c(0.1,0.45), lwd=2)
> axis(1, labels=c('0km','60km'), at=c(0.1,0.9))
> textround(c(0.1,0.5), lab='A', box.col='salmon',
+   radx=0.01, rady=0.1, cex=1.2)
> textround(c(0.9,0.5), lab='B', box.col='salmon',
+   radx=0.01, rady=0.1, cex=1.2)
> hmean(c(20,60))
[1] 30
> hmean(c(20,0))
[1] 0
> hmean(c(20,60,NA))
[1] NA
> hmean(c(20,60,NA), na.rm=TRUE)
[1] 30
```
30
Why do we need other types of mean: Geometric Mean example GDP?
* Average GDP of the European Union?
* Changes in GDP: should each country have equal impact? → solution: geometric mean!
* This is also helpful for a stock index!
* Exercise: implement the geometric and harmonic mean! (see cheat sheet)
```
> # GDP in USD, 2016 (bip = German "BIP" = GDP)
> bip.gre = 19000
> bip.ger = 42000
> mean(c(bip.ger, bip.gre))
[1] 30500
> mean(c(bip.ger+(bip.ger/10), bip.gre))
[1] 32600
> mean(c(bip.ger, bip.gre+(bip.gre/10)))
[1] 31450
> gmean(c(bip.ger, bip.gre))
[1] 28248.89
> gmean(c(bip.ger+(bip.ger/10), bip.gre))
[1] 29627.69
> gmean(c(bip.ger, bip.gre+(bip.gre/10)))
[1] 29627.69
> gmean(c(0.1,10))
[1] 1
> gmean(c(0.2,10))
[1] 1.414214
> gmean(c(0.1,20))
[1] 1.414214
> gmean(c(0.1,20,100))
[1] 5.848035
> gmean(c(0.15,20,100))
[1] 6.69433
> gmean(c(0.1,30,100))
[1] 6.69433
> gmean(c(0.1,20,150))
[1] 6.69433
```
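
The gmean() and hmean() helpers used above are not base R functions. A possible sketch of how they could be implemented (the names come from the slides, the bodies are my own), together with a check of the arithmetic ≥ geometric ≥ harmonic inequality:
```
gmean = function(x, na.rm = FALSE) {   # geometric mean: exp of the mean of the logs
  if (na.rm) x = x[!is.na(x)]
  exp(mean(log(x)))                    # all values must be > 0
}
hmean = function(x, na.rm = FALSE) {   # harmonic mean: reciprocal of the mean reciprocal
  if (na.rm) x = x[!is.na(x)]
  1 / mean(1 / x)
}

x = c(20, 60)
mean(x)    # 40     (arithmetic)
gmean(x)   # 34.64  (geometric)
hmean(x)   # 30     (harmonic)  -> arithmetic >= geometric >= harmonic
```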
31
Data scatter results from?
* variability: imprecision, experimental error, biological variability
* bias: systematic error (does not contribute to scatter)
32
What is variability, ie where does it come from?
– imprecision
– experimental error
– biological variability
33
What is bias?
– systematic error
– does not contribute to scatter!
34
Variability and Bias - with two Guns and two Gunmen?
Bad gun = bias; bad gunman = imprecision

Good gunman (precise):
* good gun → small scatter, right in the bullseye (accurate, low variance)
* bad gun → small scatter, but shifted away from the centre (inaccurate: bias)

Bad gunman (imprecise):
* good gun → a lot of scatter, but the bullseye is roughly the average (accurate, high variance)
* bad gun → a lot of scatter and shifted away from the centre (inaccurate: bias)
35
What data scatter measures are there for numerical data?
* standard deviation (SD)
* coefficient of variation (CV)
* standard error of the mean (SEM)
36
Standard deviation (SD) formula for sample?
– s: sample standard deviation: s = √[ Σ(xᵢ − x̄)² / (N − 1) ]
37
Standard deviation (SD) formula for population?
– σ: population standard deviation: σ = √[ Σ(xᵢ − μ)² / N ]
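
A quick sketch (my own worked example) checking the two formulas against R's sd(), which uses the N − 1 denominator:
```
x = c(2, 4, 4, 4, 5, 5, 7, 9)
N = length(x)
sqrt(sum((x - mean(x))^2) / (N - 1))   # sample SD: 2.13809
sd(x)                                  # same: 2.13809
sqrt(sum((x - mean(x))^2) / N)         # population SD: 2
```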
38
Standard deviation (SD) – proportions in normal data?
* ~ 2/3 of the data are within 1 SD
* ~ 95% of the data are within 2 SD
39
Issue with SD?
* unit based → SD in m is smaller than SD in cm → use CV
40
Coefficient of variation (CV) formula? Used to?
– CV% = 100 * sd(x) / mean(x), i.e. CV = 100 * s_x / x̄
– used to compare scatter across different magnitudes
41
Standard error of the mean formula?
– SEM = sd(x) / sqrt(N), i.e. SEM = s_x / √N
42
Standard error of the mean: what does it tell us?
– how close are we to the true population mean
43
Standard error of the mean: how can we get closer to the population mean?

– more measurements → closer to the population mean
44
SEM vs SD?
* the SEM is always smaller than the SD, so some people prefer to show it
* but: more measurements → closer to the population mean
* it does not actually express the scatter, although it depends on the scatter
* it tells us how close we are to the true population mean
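
A small illustration of this point (my own sketch): with more measurements the SD stays roughly constant while the SEM shrinks.
```
# more measurements: SD stays about the same, SEM shrinks
set.seed(123)
pop = rnorm(100000, mean = 50, sd = 10)          # "population"
for (n in c(10, 100, 1000, 10000)) {
  s = sample(pop, n)
  cat("n =", n, " sd =", round(sd(s), 2),
      " sem =", round(sd(s) / sqrt(n), 3), "\n")
}
```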
45
R: SD, CV, SEM
```
> sd(survey$kg)
[1] NA
> sd(survey$kg, na.rm=TRUE)           # standard deviation
[1] 12.74224
> print(try(cv(survey$kg)))           # there is no cv function in base R
[1] "Error in cv(survey$kg) : konnte Funktion \"cv\" nicht finden\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
> # (German locale error message: could not find function "cv")
> cv = function(x, na.rm=FALSE) {     # implement our own cv function
+   100 * sd(x, na.rm=na.rm) / mean(x, na.rm=na.rm)
+ }
> cv(survey$kg)
[1] NA
> sem = function(x, na.rm=FALSE) {    # implement our own sem function
+   sd(x, na.rm=na.rm) / sqrt(length(x[!is.na(x)]))
+ }
> sem(survey$kg)
[1] NA
> sem(survey$kg, na.rm=TRUE)
[1] 0.6033626
> sbi$cv = cv                         # where to place this? → store them in the sbi environment
> sbi$sem = sem
```
46
What are the first, second, third, and fourth moments of a distribution?
* first moment: mean
* second moment: variance
* third moment: skewness
* fourth moment: kurtosis
47
What is skewness?
* third central moment of a distribution
* shape: a signed measure of the degree of (a)symmetry
48
Skewness – positive values?
* positive values: skewed to the right (long tail on the right)
49
Skewness – negative values?
* negative values: skewed to the left (long tail on the left)
50
Skewness – around 0?
* around zero: symmetrical distribution
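
A short sketch (my own) illustrating the sign convention, using the e1071 skewness() function introduced in the next card:
```
library(e1071)
skewness(rexp(1000))      # right-skewed data -> positive
skewness(-rexp(1000))     # mirrored, left-skewed -> negative
skewness(rnorm(1000))     # roughly symmetric -> close to 0
```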
51
R: Which library for skewness and kurtosis?
> library(e1071)
52
R: Skewness – normal distribution?
```
> library(e1071)
> library(UsingR)
> nym = nym.2002[nym.2002$age > 16, ]
> skewness(nym$time)
[1] 1.083422                  ### longer tail on the right
> skewness(nym$time * -1)
[1] -1.083422                 ### longer tail on the left
> x <- rnorm(1000)
> skewness(x)
[1] 0.08863
> x <- rnorm(1000)
> skewness(x)
[1] -0.03025407
> hist(nym$time, freq=F, ylim=c(0,0.01),
+   main='', xlab='time', ylab='density')
> box()
> lines(density(nym$time), col="blue", lwd=3)
```
53
R: Skewness - Bimodal distribution?
```
> norm1 = c(rnorm(1000, mean=5, sd=2))
> norm2 = c(rnorm(500, mean=13, sd=2))
> nonorm = c(norm1, norm2)
> median(nonorm)
[1] 6.304157
> mean(nonorm)
[1] 7.654257
> ### mean and median are not close together
> ### → indication of a long tail or a bimodal distribution
> skewness(nonorm)
[1] 0.4822618                 ## right tail
> par(mfrow=c(2,1))
> hist(nonorm, col="light blue"); box()
> points(mean(nonorm), 145); text(mean(nonorm), 160, "mean")
> points(median(nonorm), 185); text(median(nonorm), 200, "median")
> qqnorm(nonorm); qqline(nonorm)
```
54
R: Skewness implementation in R?
```
> dskewness <- function(x, na.rm=TRUE) {
+   if (na.rm) {
+     x = x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g = (sum((x - mean(x, na.rm=na.rm))^3) / length(x)) / (sd(x)^3)
+   return(g)
+ }
> skewness(nonorm)
[1] 0.4822618
> dskewness(nonorm)
[1] 0.4822618
> sbi$skewness = dskewness
```
55
What is Kurtosis?
* fourth central moment of a distribution
* peakedness of a distribution
* formula: g = (Σ(xᵢ − x̄)⁴ / N) / s⁴ − 3 (excess kurtosis; see cheat sheet)

What does a large kurtosis mean?
* large values: sharp peak

What does a kurtosis around 0 mean?
* around zero: normal peak

What does a small kurtosis mean?
* low values: broad peak
56
R: Kurtosis usage?
```
> library(e1071)
> xn1 = runif(1000)
> kurtosis(xn1)
[1] -1.195974
> xn2 = runif(1000)
> kurtosis(xn1 + xn2)
[1] -0.5967484
> xn3 = runif(1000)
> kurtosis(xn1 + xn2 + xn3)
[1] -0.3342538
> # rnorm gives around 0
> kurtosis(rnorm(1000))
[1] 0.036207
> # rt can be peakier ...
> kurtosis(rt(1000, df=10))
[1] 1.443117
> par(mfrow=c(3,1), mai=rep(0.4,4))
> hist(xn1, col='#ff3333', breaks=25); box()
> hist(xn1+xn2+xn3, col='#ffff33', breaks=20); box()
> hist(rt(1000, df=10), col='#33ff33', breaks=25); box()
```
57
R: Kurtosis – Implementation?
```
> dkurtosis <- function(x, na.rm=TRUE) {
+   if (na.rm) {
+     x = x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g = (sum((x - mean(x, na.rm=na.rm))^4) / length(x)) / (sd(x)^4) - 3
+   return(g)
+ }
> kurtosis(xn1 + xn2 + xn3)
[1] -0.3342538
> dkurtosis(xn1 + xn2 + xn3)
[1] -0.3342538
> sbi$kurtosis = dkurtosis
```
58
Outliers definition?
Grubbs (1969) defined an outlier as: an observation that appears to deviate markedly from the other members of the sample in which it occurs.
59
Outliers measures?
* more than 3 x SD away from the mean
* more than 1.5 x IQR away from the box (mild) or 3 x IQR (extreme) → the boxplot criterion
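
A minimal sketch (my own) of the 1.5 x IQR (mild) and 3 x IQR (extreme) criteria, compared with what boxplot() itself would flag:
```
x = c(rnorm(100), 8)                     # one clear outlier
q = quantile(x, c(0.25, 0.75))
iqr = IQR(x)
mild    = x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
extreme = x < q[1] - 3.0 * iqr | x > q[2] + 3.0 * iqr
sum(mild); sum(extreme)
boxplot.stats(x)$out                     # the points boxplot() would draw as outliers
```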
60
What could an outlier be?
* an error-prone data point?
* a normal data scatter point?
* an important specific observation?
→ it depends

How can we deal with outliers? (delete or transform?)
* delete the variable
* delete the value → missing value imputation
* transform the variable
* transform the value
61
R: Implement outlier function to check for outliers in normal distribution
```
> is.outlier <- function(x) {
+   sd = sqrt(var(x, na.rm=TRUE))
+   xm = x - mean(x, na.rm=TRUE)
+   return(abs(xm) - 3*sd > 0)
+ }
> rn = rnorm(1000)
> table(is.outlier(rn))
FALSE  TRUE
  998     2
> table(abs(scale(rn)) > 3)
FALSE  TRUE
  998     2
> # shorter version using scale()
> is.outlier = function(x) {
+   return(abs(scale(x)) > 3)
+ }
> table(is.outlier(rn))
FALSE  TRUE
  998     2
```
62
What is a violinplot?
* similar to a box plot
* but it also shows the probability density of the data at different values,
* usually smoothed by a kernel density estimator
63
R: Violinplot (lattice)?
```
> library(lattice)
> print(bwplot(nym.2002$age, main="nym.2002$age",
+   panel=panel.violin, col="light blue"))
```
64
Density plot?
* for a single numerical variable
* like a smoothed histogram → the bars become a line
65
R: Density plot?
```
> hist(nym.2002$age, main="nym.2002$age",
+   freq=F, col="light blue")
> box()
> lines(density(nym.2002$age), col="blue")
> lines(density(nym.2002$age, bw=1), col="red")
> lines(density(nym.2002$age, bw=4), col="green")
> lines(density(nym.2002$age, bw=10), col="magenta")
```
66
Boxplot vs Stripchart?
Boxplot:
* many values

Stripchart:
* few values
* 1D scatter plots (or dot plots)
* good alternative to a boxplot when there are only few values
67
R: Boxplot vs Stripchart?
```
par(mfrow=c(1,3))
boxplot(survey$cm ~ survey$gender, col=c(2,4))
stripchart(survey$cm ~ survey$gender, col=c(2,4))
stripchart(survey$cm ~ survey$gender, method="stack",
  col=c(2,4), vertical=TRUE)
```
The stack option looks a bit like a histogram: a mixture between histogram, density line and violin plot.
68
When to use which Num ~ Cat plot?
* violinplot: ?
* stripplot: ?
* boxplot: ?
* violinplot: show bi- or multimodal data * stripplot: show few values (a dozen per group) * boxplot: all other cases
69
Numerical ~ Numerical: which two ways are there to analyse the relationship between two numerical variables?
* Correlation * Regression
70
Which method of analysing a relationship between two numerical variables **is** interested in the direction of the relationship?
interested: regression
71
Which method of analysing a relationship between two numerical variables **is not** interested in the direction of the relationship?
not interested: correlation
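
A short sketch (my own) of the difference: cor() is symmetric in its arguments, while lm() changes when response and predictor are swapped:
```
# correlation is symmetric, regression is not
x = rnorm(100)
y = 2 * x + rnorm(100)
cor(x, y); cor(y, x)          # identical: no direction involved
coef(lm(y ~ x))               # slope of y explained by x
coef(lm(x ~ y))               # slope of x explained by y: a different model
```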
72
R: How can we get from some other distribution to a normal distribution? Uniform to Normal Distribution (extra)
Uniform distribution, let's throw dice:
```
> options(digits=3)
> runif(6,1,7)
[1] 2.30 2.69 1.15 3.54 5.52 3.50
> as.integer(runif(10,1,7))
[1] 4 1 1 2 5 3 1 2 2 3
> ru1 = as.integer(runif(1000,1,7))
> table(ru1)
ru1
  1   2   3   4   5   6
163 166 164 157 162 188
> ru2 = as.integer(runif(1000,1,7))
> par(mfrow=c(2,1), mai=rep(0.4,4))
> barplot(table(ru1))
> barplot(table(ru2))
```
73
R: Mean of two uniform samples
```
> ru1u2 = (ru1 + ru2) / 2
> table(ru1u2)
ru1u2
  1 1.5   2 2.5   3 3.5   4 4.5   5 5.5   6
 30  47  92  99 131 176 159 108  89  48  21
> par(mfrow=c(2,1), mai=rep(0.4,4))
> hist(ru1u2)
> barplot(table(ru1u2))
```
→ hist() has problems with discrete data
74
R: Means of Uniform Distributions
→ from a uniform (or any other) distribution to a Gaussian distribution …
```
> options(digits=3)
> r1 = runif(1000,1,7)
> par(mfrow=c(3,1), mai=rep(0.4,4))
> hist(r1, col="light blue", main='r1')
> box()
> for (i in 1:10) {
+   r1 = r1 + runif(1000,1,7)
+   if (i == 2) {
+     hist(r1/3, main="r3", col="light blue")
+   }
+ }
> hist(r1/11, main="r11", col="light blue")
> box()
```
75
Central Limit Theorem
* if the sample size is large enough
* the population of sample means will approximate a Gaussian distribution
* no matter how the population itself is distributed
76
Central Limit Theorem How large does sample size need to be?
It depends:
* fairly normal distribution → 10 or more samples
* less normal distribution → more samples; 100 should always be enough
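
A small simulation sketch (my own): means of samples drawn from a skewed (exponential) distribution look more normal as the sample size grows, here judged by the sample skewness:
```
library(e1071)
set.seed(42)
sim.means = function(n, reps = 1000) replicate(reps, mean(rexp(n)))
skewness(rexp(1000))       # raw exponential data: strongly right-skewed (~2)
skewness(sim.means(10))    # means of n = 10: noticeably less skewed
skewness(sim.means(100))   # means of n = 100: close to symmetric
par(mfrow = c(3, 1))
hist(rexp(1000), col = "light blue"); box()
hist(sim.means(10), col = "light blue"); box()
hist(sim.means(100), col = "light blue"); box()
```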
77
Properties of a Normal Distribution?
* symmetrical bell shape * extends in both directions to ∞ * mean, median close together * 95% of values within 2 SD
78
For which distribution(s) is this true: 95% of all values are within 2 SD?
Normal distribution. This assumption gives very wrong results if the distribution is non-normal!
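
A quick check (my own sketch) of the proportion of values within k SD of the mean, for normal versus uniform data:
```
# fraction of values within k standard deviations of the mean
prop.within = function(x, k) mean(abs(x - mean(x)) < k * sd(x))
xn = rnorm(100000)          # normal data
xu = runif(100000)          # uniform data
prop.within(xn, 1); prop.within(xn, 2)   # ~0.68 and ~0.95
prop.within(xu, 1); prop.within(xu, 2)   # ~0.58 and 1.00: the rule misleads here
```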
79
Normal numerical data n~c → which test?
t.test
80
Non-normal numerical data, skewed or multi-modal distributions n~c → which test?
wilcox.test
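
A minimal usage sketch of the num ~ cat formula interface for both tests; it assumes the survey data used elsewhere in the slides is a data frame with cm and gender columns:
```
# same formula interface, different assumptions
t.test(cm ~ gender, data = survey)        # roughly normal numerical data
wilcox.test(cm ~ gender, data = survey)   # skewed or multi-modal data
```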
81
R: the norm functions (dnorm, pnorm, qnorm, rnorm)?
* d: pdf (probability density function = point probability); dnorm(N) → density at N
* p: cdf (cumulative distribution function); pnorm(N) → P(X <= N)
* q: quantile function (inverse cdf); qnorm(p) → the N for which P(X <= N) = p
* r: random numbers from the distribution; rnorm(n)
```
> options(digits=3)
> rn = rnorm(1000)
> hist(rn, freq=F, col="light blue")
> lines(density(rn), col="red", lwd=2)
> box()
> pnorm(0)
[1] 0.5
> summary(rn)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -3.36   -0.62    0.01    0.04    0.68    3.42
> dnorm(0)
[1] 0.399
> qnorm(0.5)
[1] 0
```
82
R: How can we test for normality?
H0 for the Shapiro-Wilk test assumes normality:
```
> rn = rnorm(1000, mean=5.5, sd=0.5)
> mean(rn)
[1] 5.5
> median(rn)
[1] 5.51
> shapiro.test(rn)

        Shapiro-Wilk normality test

data:  rn
W = 1, p-value = 0.7

> shapiro.test(rn)$p.value
[1] 0.746
> shapiro.test(runif(100,1,6))$p.value
[1] 0.00346
> shapiro.test(runif(100,1,6))$statistic
    W
0.956
> shapiro.test(survey$cm)

        Shapiro-Wilk normality test

data:  survey$cm
W = 1, p-value = 4e-04

> shapiro.test(survey$cm[survey$gender=='M'])$p.value
[1] 0.409
```
→ p-value >= 0.05: we don't reject H0 that the data come from a normal distribution
→ p-value < 0.05: we reject H0 that the data come from a normal distribution
83
Shapiro-Wilk Test: → p-value >= 0.05 we ______ H0, that the distribution comes from a normal distribution
→ p-value >= 0.05 we _don’t reject_ H0, that the distribution comes from a normal distribution
84
Reporting Shapiro-Wilk Test? Analysing height: W = 0.987, p = 0.00042 and weight, W = 0.951, p = 0, of students.
The size of the students, W = 0.987, p = 0.00042, as well as the weight of the students, W = 0.951, p < 0.001, were both significantly non-normally distributed.
* with many samples the Shapiro-Wilk test becomes significant very easily
* use it only together with visual inspection (histogram, qqplot) when you have many samples

[Better: The size of students, W = 0.987, _p < 0.001_, as well as the weight of students, W = 0.951, _p < 0.001_, were both significantly non-normal.]
85
The Shapiro-Wilk test can be used to check if data comes from a normal distribution. How else can you check for normality?
* Kolmogorov-Smirnov test
* a generalized test for any distribution
* checks whether two samples might come from the same distribution → e.g. from a normal one
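
A short sketch (my own) of ks.test() usage, against a named theoretical distribution and comparing two samples directly (strictly, the parameters should not be estimated from the same data that is being tested):
```
x = rnorm(100, mean = 5, sd = 2)
y = rnorm(100, mean = 5, sd = 2)
ks.test(x, "pnorm", mean(x), sd(x))   # x against a fitted normal distribution
ks.test(x, y)                         # two-sample: same distribution?
```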
86
R: Normal Data Visualization
```
> norm = c(rnorm(1000, mean=5, sd=3))
> median(norm)
[1] 5.06
> mean(norm)
[1] 5.03
> shapiro.test(norm)$p.value
[1] 0.419
> par(mfrow=c(1,2), mai=c(0.4,0.4,0.4,0.4))
> hist(norm, col="light blue")
> points(mean(norm), 145)
> text(mean(norm), 160, "mean")
> points(median(norm), 185)
> text(median(norm), 200, "median")
> box()
> qqnorm(norm)
> qqline(norm)
```
87
R: QQ-plots
Plotting the observed values against the expected/theoretical quantiles:
```
> options(digits=3)
> rn = rnorm(10)
> head(sort(signif(rn,3)), n=5)
[1] -1.030 -0.657 -0.540 -0.439  0.147
> qqnorm(rn)
> qqline(rn)
```
88
T-Distribution: Derived from _____________
Normal distribution
89
T-Distribution: t is the _______ between ________ and the ______, divided by the _____
t is the _difference_ between the _sample mean_ and the _population mean_, divided by the _SEM_
90
T-Distribution: you perform a sampling experiment where you know the __________
Population value
91
R: Practical – what is being demonstrated here?

T-Distribution: there is always a difference between two sample means, even when both samples come from the same population. Let's simulate many such differences:
```
> rn = rnorm(10000)                 # our population is rn
> summary(rn)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -3.68   -0.67   -0.02    0.00    0.66    3.99
> mcompare = function() {
+   sam1 = sample(rn, 10)
+   sam2 = sample(rn, 10)
+   return(mean(sam1) - mean(sam2))
+ }
> res = c()                         # create empty result vector
> for (i in 1:1000) { res = c(res, mcompare()) }   # append to result
> hist(res, col="light blue", cex.lab=1.5, cex.main=1.5)
> box()
```
92
R: Practical – how to create own t distribution?
Distribution of t values:
```
> getT = function(n) {
+   s1 = sample(rn, n)
+   t = (mean(s1) - mean(rn)) / (sd(s1) / sqrt(n))
+   return(t)
+ }
> getT(5)
[1] 0.642
> res = c()
> for (i in 1:1000) { res = c(res, getT(10)) }
> par(mfrow=c(2,1), mai=c(0.4,0.4,0.4,0.4))
> summary(res)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -6.26   -0.75   -0.01   -0.07    0.63    5.36
> hist(res, col="beige", freq=F, ylim=c(0,0.5))
> lines(density(res), col="red", lwd=2)
> lines(seq(-5,5,0.1), dt(seq(-5,5,0.1), df=10),
+   col="blue", lwd=2, lty=1); box()
```
93
R: Function to draw random data from the t-distribution?
```
> print(try(rt(1000)))
[1] "Error in rt(1000) : Argument \"df\" fehlt (ohne Standardwert)\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
> # (German locale error message: argument "df" is missing, with no default)
> xt = rt(1000, 99)
> summary(xt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -3.96   -0.69    0.02    0.01    0.69    3.97
> shapiro.test(xt)$p.value
[1] 0.271
> ks.test(xt, "pt", 99)$p.value
[1] 0.721
```
94
R: Qt function?
t* with qt:
```
> df = c(1:6, 10, 20, 50, 100, 10000)
> t = qt(0.975, df)
> data.frame(t, df)
       t    df
1  12.71     1
2   4.30     2
3   3.18     3
4   2.78     4
5   2.57     5
6   2.45     6
7   2.23    10
8   2.09    20
9   2.01    50
10  1.98   100
11  1.96 10000
```
95
(QUIZ 3) The ______ is useful for calculating the mean of two speeds whereas the ______ can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say "mean" we usually are speaking about the ______.
The _harmonic mean_ is useful for calculating the mean of two speeds whereas the _geometric mean_ can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say "mean" we usually are speaking about the _arithmetic mean_.
96
(QUIZ 3) A measure to describe how close our data are to the population mean is the ______. Its values are smaller than the values of the ______, which are also unit dependent. As it is good to also have a unit-free description of the data scatter, the ______ should be considered as well.
A measure to describe how close our data are to the population mean is the _standard error of the mean_. Its values are smaller than the values of the _standard deviation_, which are also unit dependent. As it is good to also have a unit-free description of the data scatter, the _coefficient of variation_ should be considered as well.
97
(QUIZ 3) The skewness is the ______ central moment of a distribution, whereas the kurtosis is the ______ moment. The ______ measures how sharp or flat a distribution is whereas the ______ looks for the symmetry. A ______ kurtosis means a very sharp distribution, whereas a ______ kurtosis indicates a flat, more uniform-like distribution.
The skewness is the _3rd_ central moment of a distribution, whereas the kurtosis is the _4th_ moment. The _kurtosis_ measures how sharp or flat a distribution is whereas the _skewness_ looks for the symmetry. A _positive_ kurtosis means a very sharp distribution, whereas a _negative_ kurtosis indicates a flat, more uniform-like distribution.
98
(QUIZ 3) A boxplot displays the overall ______ of the distribution. The line in the middle of the box indicates the ______, whereas the upper bound indicates the ______, the lower bound the ______. Outside of the whiskers the ______ are shown, which are more than 1.5 ______ away from the box.
A boxplot displays the overall _shape_ of the distribution. The line in the middle of the box indicates the _50% quantile (median)_, whereas the upper bound indicates the _75% quantile_, the lower bound the _25% quantile_. Outside of the whiskers the _outliers_ are shown, which are more than 1.5 _IQR_ away from the box.
99
(QUIZ 3) The Null-Hypothesis of the ______ test assumes that the data are coming from a ______ distribution. A p-value of less than 0.05 indicates that the data are coming from a ______. The ______ test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the ______ population. A low p-value indicates that the distributions are coming from _______. In a normal distribution mean and median have values which are ______.
The Null-Hypothesis of the Shapiro-Wilk test assumes that the data are coming from a _normal_ distribution. A p-value of less than 0.05 indicates that the data are coming from a _non-normal_ distribution. The _Kolmogorov-Smirnov_ test can be used to compare different distributions. So it can also be used to check if two sample distributions might be coming from the _same_ population. A low p-value indicates that the distributions are coming from _different populations_. In a normal distribution, mean and median have values which are _close_.