8 | Statistics for Numerical Data I Flashcards

1
Q

(POLL)

Which measures can be used to describe the data scatter?
* mean
* median
* cv
* sem
* sd

A
  • cv
  • sd
2
Q

(POLL)

If one of the values is 0, the geometric mean is:
* positive
* negative
* zero
* undefined
* 1

A

zero

3
Q

(POLL)

If the data are normally distributed, the values within +/- 1 SD are …
* 50% of all data
* 2/3 of all data
* 1/3 of all data
* the SD does not tell us this

A
  • 2/3 of all data
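A quick check (added here as a minimal sketch in base R, not part of the original card) that roughly 2/3 of normal data fall within ±1 SD:
~~~
# theoretical share of a normal distribution within +/- 1 SD
pnorm(1) - pnorm(-1)              # ~ 0.683, i.e. roughly 2/3
# empirical check with simulated data
x <- rnorm(10000)
mean(abs(x - mean(x)) < sd(x))    # close to 0.68
~~~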
4
Q

(POLL)

Which of the following measures can be used to describe the shape of the distribution of the data?
* cv
* kurtosis
* mean
* sd
* sem
* skewness
* var

A
  • kurtosis
  • skewness
5
Q

(POLL)

With the 3 SD criterion, how many outliers do you expect for 1000 values if the data are normally distributed?
* 0
* 1
* 2-3
* 5-10
* 10-100
* 100-1000

A
  • 2-3
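A minimal added sketch (base R only) showing where the 2-3 expected outliers come from:
~~~
p.out <- 2 * pnorm(-3)              # probability beyond +/- 3 SD, ~ 0.0027
1000 * p.out                        # ~ 2.7 expected flagged values per 1000
table(abs(scale(rnorm(1000))) > 3)  # simulated count, typically 2-3 TRUE
~~~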
6
Q

(POLL)

You have a numerical variable with ten values. Which plot would you use for visualization?
* Histogram
* Stripchart
* density line
* violinplot

A
  • Stripchart

also possible but not best:
* Histogram (stripchart better)

7
Q

(POLL)

You visualize a numerical variable against a categorical one; which is the appropriate plot to use?
* barplot
* boxplot
* histogram
* stripchart
* violinplot
* xyplot

A
  • boxplot
  • stripchart
  • violinplot
8
Q

(POLL)

Which test(s) could you use to check whether your data are normally distributed?
* Chisq-Test
* Fisher-Test
* Kolmogorov-Smirnov-Test
* Shapiro-Wilk-Test
* T-Test

A
  • Kolmogorov-Smirnov-Test
  • Shapiro-Wilk-Test
9
Q

(POLL)

Which things are shown on a boxplot?
* mean,max,minimum,outliers
* median,1st quartile,3rd quartile,outliers
* mean,1st quartile,3rd quartile,outliers

A
  • median,1st quartile,3rd quartile,outliers
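A small added sketch showing how boxplot.stats() in base R exposes exactly these components:
~~~
x <- c(rnorm(50), 8)   # normal data plus one extreme value
bs <- boxplot.stats(x)
bs$stats               # lower whisker, 1st quartile, median, 3rd quartile, upper whisker
bs$out                 # points flagged as outliers (beyond 1.5 * IQR from the box)
~~~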
10
Q

R:
How would you create a summary of a dataset with the following variables?
Cat

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

11
Q

R:
How would you create a summary of a dataset with the following variables?

Cat

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

12
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

13
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat Cat

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

14
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

15
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Num Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

16
Q

R:
How would you create a summary of a dataset with the following variables?

Cat Cat Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

17
Q

R:
How would you create a summary of a dataset with the following variables?

Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

18
Q

R:
How would you create a summary of a dataset with the following variables?

Num Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

19
Q

R:
How would you create a summary of a dataset with the following variables?

Num Num Num

A

Data Summaries
—————————————————————————————-
1D 2D 3D | function
—————————————————————————————-
Cat NA NA | table(c1)
Cat Cat NA | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
—————————————————————————————-
Cat Num NA | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
—————————————————————————————-
Num NA NA | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
—————————————————————————————-

20
Q

Univariate Descriptions of Numerical Data
How can we describe the center?

A
  • center: mean, mean(x,trim=0.1), median
21
Q

Univariate Descriptions of Numerical Data
How can we describe the scatter?

A
  • scatter: var, sd, cv
22
Q

Univariate Descriptions of Numerical Data
How can we describe the distribution?

A
  • distribution: quantile, IQR, max, min, range
23
Q

Univariate Descriptions of Numerical Data
How can we describe the shape?

A
  • shape: skewness, kurtosis
24
Q

Univariate Descriptions of Numerical Data
How can we describe the data with graphics?

A
  • plots: boxplot (barplot with arrows)
25
Q

R:
Calculate mean

A
  • normal/arithmetic mean: mean(x,na.rm=TRUE)
  • trimmed mean: mean(x,na.rm=TRUE,trim=0.1)
26
Q

R:
Calculate median

A
  • median: median(x,na.rm=TRUE)
27
Q

Pythagorean means? Inequality?

A
  • (arithmetic) mean
  • geometric mean
  • harmonic mean
  • they hold this inequality: arithmetic ≥ geometric ≥ harmonic
    formulae: see the cheat sheet and the sketch below
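A minimal added sketch of the three means and the inequality, computed directly in base R (variable names am/gm/hm are just for illustration):
~~~
x <- c(2, 4, 8, 16)
am <- mean(x)                   # arithmetic mean: sum(x)/n
gm <- prod(x)^(1/length(x))     # geometric mean: n-th root of the product
hm <- length(x)/sum(1/x)        # harmonic mean: n over the sum of reciprocals
c(arithmetic = am, geometric = gm, harmonic = hm)
am >= gm && gm >= hm            # TRUE: arithmetic >= geometric >= harmonic
~~~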
28
Q

Pythagorean Means – ranges? Properties?
(arithmetic) mean:

A

(arithmetic) mean:
* the standard mean
* values must have the same properties and the same ranges
geometric mean:
* average of multiple properties on different scales/ranges
* all values must be above zero
harmonic mean:
* average of rates
* all values must be above zero

29
Q

Why do we need other types of mean:
What is the average speed if travelling 60 km at 60km/h and another 60 km at 20km/h?

A
  • you might think 40 → wrong!
  • 1 h at 60 km/h plus 3 h at 20 km/h → 120 km / 4 h → 30 km/h
  • This can also be calculated using the harmonic mean
  • No 0s or negative values with the harmonic mean!
    ~~~
    > library(diagram)
    > openplotmat(xlim=c(-0.05,1.05),ylim=c(0.2,0.75))
    > textdiamond(c(0.5,0.65),lab='60km/h',
    + box.col='light blue',radx=0.12,rady=0.08)
    > textdiamond(c(0.5,0.35),lab='20km/h',
    + box.col='light blue',radx=0.12,rady=0.08)
    > straightarrow(c(0.1,0.55),c(0.9,0.55),lwd=2)
    > straightarrow(c(0.9,0.45),c(0.1,0.45),lwd=2)
    > axis(1,labels=c('0km','60km'),at=c(0.1,0.9))
    > textround(c(0.1,0.5),lab='A',box.col='salmon',
    + radx=0.01,rady=0.1,cex=1.2)
    > textround(c(0.9,0.5),lab='B',box.col='salmon',
    + radx=0.01,rady=0.1,cex=1.2)
    > hmean(c(20,60))
    [1] 30
    > hmean(c(20,0))
    [1] 0
    > hmean(c(20,60,NA))
    [1] NA
    > hmean(c(20,60,NA),na.rm=TRUE)
    [1] 30
    ~~~
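hmean() is not part of base R; a minimal sketch consistent with the calls above (the course presumably provides its own version, e.g. in the sbi environment):
~~~
hmean <- function (x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  length(x) / sum(1/x)     # NA if NAs remain; 0 if any value is 0 (1/0 = Inf)
}
hmean(c(20, 60))                   # 30
hmean(c(20, 60, NA), na.rm = TRUE) # 30
~~~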
30
Q

Why do we need other types of mean:
Geometric Mean example GDP?

A
  • Average GDP of the European Union?
  • Changes in GDP: should each country have equal impact?
    → Solution: geometric mean!
    → This is also helpful for a stock index!
    → Exercise: implement the geometric and harmonic mean (see the sketch after the example)
    ~~~
    > # USD 2016
    > bip.gre = 19000
    > bip.ger = 42000
    > mean(c(bip.ger,bip.gre))
    [1] 30500
    > mean(c(bip.ger+(bip.ger/10),bip.gre))
    [1] 32600
    > mean(c(bip.ger,bip.gre+(bip.gre/10)))
    [1] 31450
    > gmean(c(bip.ger,bip.gre))
    [1] 28248.89
    > gmean(c(bip.ger+(bip.ger/10),bip.gre))
    [1] 29627.69
    > gmean(c(bip.ger,bip.gre+(bip.gre/10)))
    [1] 29627.69
    > gmean(c(0.1,10))
    [1] 1
    > gmean(c(0.2,10))
    [1] 1.414214
    > gmean(c(0.1,20))
    [1] 1.414214
    > gmean(c(0.1,20,100))
    [1] 5.848035
    > gmean(c(0.15,20,100))
    [1] 6.69433
    > gmean(c(0.1,30,100))
    [1] 6.69433
    > gmean(c(0.1,20,150))
    [1] 6.69433
    ~~~
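gmean() is likewise not a base R function; a minimal sketch matching the output above (again, the course version is presumably provided via sbi):
~~~
gmean <- function (x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  exp(mean(log(x)))        # n-th root of the product; requires all values > 0
}
gmean(c(42000, 19000))     # 28248.89
gmean(c(0.1, 20, 100))     # 5.848035
~~~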
31
Q

Data scatter results from?

A
  • variability:
    – imprecision
    – experimental error
    – biological variability
  • bias: systematic error (does not contribute to the scatter)
32
Q

What is variability, ie where does it come from?

A

– imprecision
– experimental error
– biological variability

33
Q

What is bias?

A

– systematic error
– does not! contribute to scatter

34
Q

Variability and Bias - with two Guns and two Gunmen?

A

Bad gun = bias
Bad gunman = imprecision
Good gunman (precision):
* good gun → small scatter right in the bullseye (accurate, low variance)
* bad gun → small scatter but shifted away from the centre (inaccurate: bias)
Bad gunman (imprecision):
* good gun → a lot of scatter, but the bullseye is roughly the average (accurate, high variance)
* bad gun → a lot of scatter and shifted away from the centre (inaccurate: bias)

35
Q

What data scatter measures are there for numerical data?

A
  • standard deviation (SD)
  • coefficient of variation (CV)
  • standard error of the mean (SEM)
36
Q

Standard deviation (SD) formula for sample?

A

– s: sample standard deviation: s = √[ ∑(xᵢ − x̄)² / (N − 1) ]

37
Q

Standard deviation (SD) formula for population?

A

– σ: population standard deviation: σ = √[ ∑(xᵢ − μ)² / N ]
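An added sketch contrasting both formulas with R's sd(), which uses the sample version (N − 1):
~~~
x <- c(4, 8, 6, 5, 3, 7)
n <- length(x)
s.sample <- sqrt(sum((x - mean(x))^2) / (n - 1))  # sample SD, identical to sd(x)
s.pop    <- sqrt(sum((x - mean(x))^2) / n)        # population SD
c(builtin = sd(x), sample = s.sample, population = s.pop)
~~~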

38
Q

Standard deviation (SD) – proportions in normal data?
normal data:

A
  • ~ 2/3 of data are within 1 SD
  • ~ 95% of data are within 2 SD
39
Q

Issue with SD?

A
  • unit based → SD in m is smaller than SD in cm → use CV
40
Q

Coefficient of variation (CV) formula? Used to?

A

– CV% = 100 · sd(x) / mean(x), i.e. cv = 100 · sₓ / x̄
– used to compare scatter across different magnitudes and units

41
Q

Standard error of the mean formula?

A

– SEM = sd(x) / sqrt(N), i.e. sem = sₓ / √N

42
Q

Standard error of the mean what does it tell us?

A

– how close are we to the true population mean

43
Q

Standard error of the mean how can we get closer to pop mean?

A

more measurements → closer to pop mean

44
Q

SEM vs SD?

A
  • SEM is always smaller than SD, so some people prefer to show it
  • but more measurements → SEM shrinks, as we get closer to the population mean
  • SEM does not actually express the scatter, though it depends on the scatter
  • it tells us how close we are to the true population mean (see the sketch below)
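A small added simulation illustrating the point: SD stays roughly constant while SEM shrinks with growing n.
~~~
set.seed(1)
for (n in c(10, 100, 1000)) {
  x <- rnorm(n, mean = 5, sd = 2)
  cat("n =", n, " sd =", round(sd(x), 2), " sem =", round(sd(x)/sqrt(n), 3), "\n")
}
# sd stays near 2; sem decreases roughly like 1/sqrt(n)
~~~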
45
Q

R:
SD, CV, SEM

A
> sd(survey$kg)
[1] NA
> sd(survey$kg,na.rm=TRUE) # standard deviation
[1] 12.74224
> print(try(cv(survey$kg))) # there is no cv function in base R
[1] "Error in cv(survey$kg) : could not find function \"cv\"\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in cv(survey$kg): could not find function "cv">
> cv=function(x,na.rm=FALSE) {   # implement our own cv function
    100*sd(x,na.rm=na.rm)/mean(x,na.rm=na.rm) }
> cv(survey$kg)
[1] NA
> sem=function(x,na.rm=FALSE) {  # implement our own sem function
    sd(x,na.rm=na.rm)/sqrt(length(x[!is.na(x)])) }
> sem(survey$kg)
[1] NA
> sem(survey$kg,na.rm=TRUE)
[1] 0.6033626
> sbi$cv=cv
> sbi$sem=sem
46
Q

What are the first, second, third, and fourth moments of a distribution?

A
  • first moment: mean
  • second moment: variance
  • third moment: skewness
  • fourth moment: kurtosis
47
Q

What is skewness?

A
  • third central moment of a distribution
  • shape: signed measure of the degree of asymmetry
48
Q

Skewness – positive values?

A
  • positive values: skewed to the right (long tail on the right)
49
Q

Skewness – negative values?

A
  • negative values: skewed to the left (long tail on the left)
50
Q

Skewness – around 0?

A
  • around zero: symmetrical distribution
51
Q

R:
Which library for skewness and kurtosis?

A

> library(e1071)

52
Q

R:
Skewness – normal distribution?

A

~~~
> library(e1071)
> library(UsingR)
> nym=nym.2002[nym.2002$age>16,]
> skewness(nym$time)
[1] 1.083422 ### longer tail on the right
> skewness(nym$time*-1)
[1] -1.083422 ### longer tail on the left
> x <- rnorm(1000)
> skewness(x)
[1] 0.08863
> x <- rnorm(1000)
> skewness(x)
[1] -0.03025407
> hist(nym$time,freq=F, ylim=c(0,0.01), main='',xlab='time', ylab='density')
> box()
> lines(density(nym$time), col="blue",lwd=3)
~~~
[figure: histogram of nym$time (x axis: time, y axis: density) with a blue density line]

53
Q

R:
Skewness - Bimodal distribution?

A
> norm1=c(rnorm(1000, mean=5,sd=2))
> norm2=c(rnorm(500,mean=13,sd=2))
> nonorm=c(norm1,norm2)
> median(nonorm) 
[1] 6.304157
> mean(nonorm)
[1] 7.654257  ### mean and median are not close together → indication of long tail or bimodal 
> skewness(nonorm)
[1] 0.4822618 ## right tail 
> par(mfrow=c(2,1))
> hist(nonorm,col="light blue");box()
> points(mean(nonorm),145);text(mean(nonorm),160,"mean")
> points(median(nonorm),185);text(median(nonorm),200,"median")
> qqnorm(nonorm);qqline(nonorm)
54
Q

R:
Skewness implementation in R?

A
> dskewness <- function (x,na.rm=TRUE) {
  if (na.rm) {
    x=x[!is.na(x)]
  } else {
    if (any(is.na(x))) {
      return(NA)
    }
  }
  g=(sum((x-mean(x,na.rm=na.rm))^3)/length(x))/(sd(x)^3)
  return(g)
}
> skewness(nonorm)
[1] 0.4822618
> dskewness(nonorm)
[1] 0.4822618
> sbi$skewness=dskewness
55
Q

What is Kurtosis?

A
  • fourth central moment of a distribution
  • peakedness of a distribution
    (formula: see the cheat sheet)

What does a large kurtosis mean?
* large values: sharp peak

What does a ca 0 kurtosis mean?
* around zero: normal peak

What does a small kurtosis mean?
* low values: broad peak

56
Q

R:
Kurtosis usage?

A
> library(e1071)
> xn1=runif(1000)
> kurtosis(xn1)
[1] -1.195974
> xn2=runif(1000)
> kurtosis(xn1+xn2)
[1] -0.5967484
> xn3=runif(1000)
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> # rnorm gives around 0
> kurtosis(rnorm(1000))
[1] 0.036207
> # rt can be peakier ...
> kurtosis(rt(1000,df=10))
[1] 1.443117
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(xn1,col='#ff3333',breaks=25); box()
> hist(xn1+xn2+xn3,col='#ffff33',breaks=20);box()
> hist(rt(1000,df=10),col='#33ff33',breaks=25);box()
57
Q

R:
Kurtosis – Implementation?

A
> dkurtosis <- function (x,na.rm=TRUE) {
  if (na.rm) {
    x=x[!is.na(x)]
  } else {
    if (any(is.na(x))) {
      return(NA)
    }
  }
  g=(sum((x-mean(x,na.rm=na.rm))^4)/length(x))/(sd(x)^4)-3
  return(g)
}
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> dkurtosis(xn1+xn2+xn3)
[1] -0.3342538
> sbi$kurtosis=dkurtosis
58
Q

Outliers definition?

A

Grubbs (1969) defined an outlier as:
an observation that appears to deviate markedly from the other members of the sample in which it occurs.

59
Q

Outliers measures?

A
  • more than 3 × SD away from the mean
  • more than 1.5 × IQR beyond the quartiles (mild) or 3 × IQR (extreme) → the boxplot criterion (sketch below)
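The 3 SD version is implemented later in the deck; here is a minimal added sketch of the 1.5 × IQR (boxplot) criterion, with iqr.outlier being an invented helper name:
~~~
iqr.outlier <- function (x, k = 1.5) {       # use k = 3 for extreme outliers
  q  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lo <- q[1] - k * IQR(x, na.rm = TRUE)      # lower fence
  hi <- q[2] + k * IQR(x, na.rm = TRUE)      # upper fence
  x < lo | x > hi
}
table(iqr.outlier(c(rnorm(100), 10)))
~~~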
60
Q

What could an outlier be? - …
* an error prone data point ?
* a normal data scatter point ?
* an important specific observation ?
* → It depends

A

How can we deal with outliers?
* delete the variable
* delete the value → missing value imputation
* transform the variable
* transform the value

61
Q

R:
Implement outlier function to check for outliers in normal distribution

A
> is.outlier <- function (x) {
  sd=sqrt(var(x,na.rm=TRUE))
  xm=x-mean(x,na.rm=TRUE)
  return(abs(xm)-3*sd>0)
}
> rn=rnorm(1000)
> table(is.outlier(rn))
FALSE TRUE
998 2
> table(abs(scale(rn))>3)
FALSE TRUE
998 2
> is.outlier = function (x) { return(abs(scale(x))>3) }
> table(is.outlier(rn))
FALSE TRUE
998 2
62
Q

What is a violinplot?

A
  • similar to a boxplot
  • except it also shows the probability density of the data at different values,
  • usually smoothed by a kernel density estimator
63
Q

R:
Violinplot (lattice)?

A
> library(lattice)
> print(bwplot(nym.2002$age,
  main="nym.2002$age",
  panel=panel.violin,
  col="light blue"))
[figure: violin plot of nym.2002$age, ages roughly 20-80]
64
Q

Density plot?

A
  • For single numerical variable
  • like a smoothed histogram → bars become lines
65
Q

R:
Density plot?

A
> hist(nym.2002$age,
main="nym.2002$age",
freq=F,col="light blue")
> box()
> lines(density(nym.2002$age),
col="blue")
> lines(density(nym.2002$age,bw=1),
col="red")
> lines(density(nym.2002$age,bw=4),
col="green")
> lines(density(nym.2002$age,bw=10),
col="magenta")
66
Q

Boxplot vs Stripchart?

A

Boxplot:
* many values
Stripchart:
* few values
* 1D scatter plots (or dot plots)
* good alternative to boxplot when few values

67
Q

R:
Boxplot vs Stripchart?

A
par(mfrow=c(1,3))
boxplot(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,method="stack", col=c(2,4),vertical=TRUE)

Stack option → looks a bit like a histogram
‘mixture between histogram, density line, violin plot’

68
Q

When to use which Num ~ Cat plot ?
* violinplot: ?
* stripplot: ?
* boxplot: ?

A
  • violinplot: show bi- or multimodal data
  • stripplot: show few values (a dozen per group)
  • boxplot: all other cases
69
Q

Numerical Numerical
Two ways to analyse a relationship between?

A
  • Correlation
  • Regression
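A minimal added sketch (invented example data) contrasting the two approaches in R:
~~~
x <- rnorm(50)
y <- 2 * x + rnorm(50)
cor(x, y)         # correlation: symmetric, cor(x, y) equals cor(y, x)
coef(lm(y ~ x))   # regression: direction matters, y is modelled from x
~~~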
70
Q

Which method of analysing a relationship between two numerical variables is interested in the direction of the relationship?

A

interested: regression

71
Q

Which method of analysing a relationship between two numerical variables is not interested in the direction of the relationship?

A

not interested: correlation

72
Q

R:
How can we get from some other distribution to a normal distribution?
Uniform to Normal Distribution
(extra)

A

Uniform distribution; let's throw dice:
~~~
> options(digits=3)
> runif(6,1,7)
[1] 2.30 2.69 1.15 3.54 5.52 3.50
> as.integer(runif(10,1,7))
[1] 4 1 1 2 5 3 1 2 2 3
> ru1=as.integer(runif(1000,1,7))
> table(ru1)
ru1
1 2 3 4 5 6
163 166 164 157 162 188
> ru2=as.integer(runif(1000,1,7))
> par(mfrow=c(2,1),mai=rep(0.4,4))
> barplot(table(ru1))
> barplot(table(ru2))
~~~

73
Q

R:
Mean of two uniform samples

A
> ru1u2=(ru1+ru2)/2
> table(ru1u2)
ru1u2
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
30 47 92 99 131 176 159 108 89 48 21
> par(mfrow=c(2,1),mai=rep(0.4,4))
> hist(ru1u2)
> barplot(table(ru1u2))

→ hist has problems with discrete data

74
Q

R:
Means of Uniform Distributions

A

→ from a uniform (or any other) distribution to a Gaussian distribution …
~~~
> options(digits=3)
> r1=runif(1000,1,7)
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(r1,col="light blue",main='r1')
> box()
> for (i in 1:10) {
  r1=r1+runif(1000,1,7)
  if (i == 2) {
    hist(r1/3,main="r3", col="light blue")
  }
}
> hist(r1/11,main="r11",
  col="light blue")
> box()
~~~

75
Q

Central Limit Theorem

A
  • if the sample size is large enough,
  • the population of sample means will approximate a Gaussian distribution (see the sketch below),
  • no matter how the population itself is distributed
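A small added simulation: sample means drawn from a clearly non-normal (uniform) population already look approximately Gaussian:
~~~
pop <- runif(100000, 1, 7)                        # non-normal population
sample.means <- replicate(1000, mean(sample(pop, 30)))
shapiro.test(sample.means)$p.value                # usually > 0.05
hist(sample.means, col = "light blue"); box()
~~~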
76
Q

Central Limit Theorem
How large does sample size need to be?

A

It depends:
* roughly normal distribution → 10 or more
* clearly non-normal distribution → more samples; 100 should almost always be enough

77
Q

Properties of a Normal Distribution?

A
  • symmetrical bell shape
  • extends in both directions to ∞
  • mean, median close together
  • 95% of values within 2 SD
78
Q

For which distribution(s) is this true: 95% of all values are within 2 SD?

A

Normal distribution
this assumption gives very wrong results if the distribution is non-normal ‼

79
Q

Normal numerical data n~c → which test?

A

t.test

80
Q

Non-normal numerical data, skewed or multi-modal distributions n~c → which test?

A

wilcox.test

81
Q

R:

What do the prefixes d, p, q, r mean for distribution functions such as dnorm, pnorm, qnorm, rnorm?

A
  • d: pdf (probability density function; the point probability for discrete distributions); dnorm(N) → density at N
  • p: cdf (cumulative distribution function); pnorm(N) → P(X ≤ N)
  • q: inverse cdf (quantile function); qnorm(P) → X
  • r: random number generator
    Example:
    ~~~
    > options(digits=3)
    > rn=rnorm(1000)
    > hist(rn,freq=F,col="light blue")
    > lines(density(rn), col="red",lwd=2)
    > box()
    > pnorm(0)
    [1] 0.5
    > summary(rn)
    Min. 1st Qu. Median Mean 3rd Qu. Max.
    -3.36 -0.62 0.01 0.04 0.68 3.42
    > dnorm(0)
    [1] 0.399
    > qnorm(0.5)
    [1] 0
    ~~~
82
Q

R:
How can we test for normality?

A

H0 for Shapiro-Wilk test assumes normality:
~~~
> rn=rnorm(1000,mean=5.5,sd=0.5)
> mean(rn)
[1] 5.5
> median(rn)
[1] 5.51
> shapiro.test(rn)
Shapiro-Wilk normality test
data: rn
W = 1, p-value = 0.7
> shapiro.test(rn)$p.value
[1] 0.746
> shapiro.test(runif(100,1,6))$p.value
[1] 0.00346
> shapiro.test(runif(100,1,6))$statistic
W
0.956
> shapiro.test(survey$cm)
Shapiro-Wilk normality test
data: survey$cm
W = 1, p-value = 4e-04
> shapiro.test(survey$cm[survey$gender=='M'])$p.value
[1] 0.409
~~~
→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution
→ p-value < 0.05 we reject H0, that the distribution comes from a normal distribution

83
Q

Shapiro-Wilk Test:
→ p-value >= 0.05 we ______ H0, that the distribution comes from a normal distribution

A

→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution

84
Q

Reporting Shapiro-Wilk Test?
Analysing height: W = 0.987, p = 0.00042 and weight, W = 0.951, p = 0, of students.

A

The height of students, W = 0.987, p = 0.00042, as well as the weight of students, W = 0.951, p = 0, were both significantly non-normally distributed.
* with many samples the Shapiro-Wilk test becomes significant very easily
* with many samples use it only together with visual inspection (histogram, qqplot)
[Better:
The height of students, W = 0.987, p < 0.001, as well as the weight of students, W = 0.951, p < 0.001, were both significantly non-normally distributed.]

85
Q

The Shapiro-Wilk test can be used to check if data comes from a normal distribution. How else can you check for normality?

A
  • Kolmogorov-Smirnov test
  • a generalized test for any distribution
  • checks whether two samples might come from the same distribution → e.g. from a normal (see the sketch below)
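A short added usage sketch (parameters estimated from the data, which makes the one-sample test only approximate):
~~~
x <- rnorm(200, mean = 5, sd = 2)
ks.test(x, "pnorm", mean(x), sd(x))$p.value   # one-sample: compare against a normal
y <- rnorm(200, mean = 5, sd = 2)
ks.test(x, y)$p.value                         # two-sample: same distribution?
~~~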
86
Q

R:
Normal Data Visualization

A
> norm=c(rnorm(1000, mean=5,sd=3))
> median(norm)
[1] 5.06
> mean(norm)
[1] 5.03
> shapiro.test(norm)$p.value
[1] 0.419
> par(mfrow=c(1,2),mai=c(0.4,0.4,0.4,0.4))
> hist(norm,col="light blue")
> points(mean(norm),145)
> text(mean(norm),160,"mean")
> points(median(norm),185)
> text(median(norm),200,"median")
> box()
> qqnorm(norm)
> qqline(norm)
87
Q

R:
QQ-plots

A

Plotting the observed values against the expected/theoretical quantiles
~~~
> options(digits=3)
> rn=rnorm(10)
> head(sort(signif(rn,3)),n=5)
[1] -1.030 -0.657 -0.540 -0.439 0.147
> qqnorm(rn)
> qqline(rn)
~~~

88
Q

T-Distribution: Derived from _____________

A

Normal distribution

89
Q

T-Distribution: t is the _______ between ________ and the ______, divided by the _____

A

t is the difference between the sample mean and the population mean, divided by the SEM

90
Q

T-Distribution: you perform a sampling experiment where you know the __________

A

Population value

91
Q

R:
Practical – what’s being demonstrated here?
T-Distribution: There is always a difference … Let’s simulate many t’s
~~~
> rn=rnorm(10000) # our population is rn
> summary(rn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐3.68 ‐0.67 ‐0.02 0.00 0.66 3.99
> mcompare=function () {sam1=sample(rn,10); sam2=sample(rn,10) ; return(mean(sam1) mean(sam2))}
> res=c() # create empty result vector
> for (i in 1:1000) {res=c(res,mcompare()) # append to result}
> hist(res,col=”light blue”,cex.lab=1.5,cex.main=1.5)
> box()
~~~

92
Q

R:
Practical – how to create own t distribution?

A

Distribution of t Values
~~~
> getT=function(n) {s1=sample(rn,n) ; t=(mean(s1)-mean(rn))/(sd(s1)/sqrt(n)) ; return(t) }
> getT(5)
[1] 0.642
> res=c()
> for (i in 1:1000) { res=c(res,getT(10)) }
> par(mfrow=c(2,1),mai=c(0.4,0.4,0.4,0.4))
> summary(res)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.26 -0.75 -0.01 -0.07 0.63 5.36
> hist(res,col="beige",freq=F,ylim=c(0,0.5))
> lines(density(res),col="red",lwd=2)
> lines(seq(-5,5,0.1),dt(seq(-5,5,0.1),df=10),
  col="blue",lwd=2,lty=1) ; box()
~~~

93
Q

R:
Function to random data from t-distribution?

A
> print(try(rt(1000)))
[1] "Error in rt(1000) : argument \"df\" is missing, with no default\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in rt(1000): argument "df" is missing, with no default>
> xt=rt(1000,99)
> summary(xt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.96 -0.69 0.02 0.01 0.69 3.97
> shapiro.test(xt)$p.value
[1] 0.271
> ks.test(xt,"pt",99)$p.value
[1] 0.721
94
Q

R:
Qt function?

A

t* with qt
> df=c(1:6,10,20,50,100,10000);
> t=qt(0.975,df)
> data.frame(t,df)
t df
1 12.71 1
2 4.30 2
3 3.18 3
4 2.78 4
5 2.57 5
6 2.45 6
7 2.23 10
8 2.09 20
9 2.01 50
10 1.98 100
11 1.96 10000

95
Q

(QUIZ 3)
The ______ is useful for calculating the mean of two speeds, whereas the ______ can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the ______.

A

The harmonic mean is useful for calculating the mean of two speeds, whereas the geometric mean can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the arithmetic mean.

96
Q

(QUIZ 3)

A measure to describe how close our data are to the population mean is the ______. Its values are smaller than the values of the ______, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the ______ should be considered as well.

A

A measure to describe how close our data are to the population mean is the standard error of the mean. Its values are smaller than the values of the standard deviation, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the coefficient of variation should be considered as well.

97
Q

(QUIZ 3)
The skewness is the ______ central moment of a distribution, whereas the kurtosis is the ______ moment. The ______ measures how sharp or flat a distribution is, whereas the ______ looks at the symmetry. A ______ kurtosis means a very sharp distribution, whereas a ______ kurtosis indicates a flat, more uniform-like distribution.

A

The skewness is the 3rd central moment of a distribution, whereas the kurtosis is the 4th moment. The kurtosis measures how sharp or flat a distribution is whereas the skewness looks for the symmetry. A positive kurtosis means a very sharp distribution, whereas a negative kurtosis indicates a flat, more uniform like distribution

98
Q

(QUIZ 3)
A boxplot displays the overall ______ of the distribution. The line in the middle of the box indicates the ______, whereas the upper bound indicates the ______, the lower bound the ______. Outside of the whiskers the ______ are shown, which are more than 1.5 ______ away from the box.

A

A boxplot displays the overall shape of the distribution. The line in the middle of the box indicates the 50% quantile, whereas the upper bound indicates the 75% quantile, the lower bound the 25% quantile. Outside of the whiskers the outliers are shown, which are more than 1.5 IQR away from the box (i.e. from the quartiles).

99
Q

(QUIZ 3)
The Null-Hypothesis of the ______ test assumes that the data are coming from a ______ distribution. A p-value of less than 0.05 indicates that the data are coming from a ______. The ______ test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the ______ population. A low p-value indicates that the distributions are coming from _______. In a normal distribution mean and median have values which are ______.

A

The Null-Hypothesis of the Shapiro-Wilk test assumes that the data are coming from a normal distribution. A p-value of less than 0.05 indicates that the data are coming from a non-normal distribution. The Kolmogorov-Smirnov test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the same population. A low p-value indicates that the distributions are coming from a different population. In a normal distribution mean and median have values which are close.