8 | Statistics for Numerical Data I Flashcards
(POLL)
Which measures can be used to describe the data scatter?
* mean
* median
* cv
* sem
* sd
- cv
- sd
(POLL)
If one of the values is 0, the geometric mean is:
* positive
* negative
* zero
* undefined
* 1
zero
(POLL)
If the data are normally distributed, the values within +/- 1 SD are …
* 50% of all data
* 2/3 of all data
* 1/3 of all data
* the SD does not tell us this
- 2/3 of all data
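The 2/3 rule of thumb can be verified with the standard normal CDF pnorm; a quick sketch in base R:

```r
# Fraction of normally distributed data within +/- 1 and +/- 2 SD
print(pnorm(1) - pnorm(-1))   # 0.6826895 ... roughly 2/3
print(pnorm(2) - pnorm(-2))   # 0.9544997 ... roughly 95%
```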
(POLL)
Which of the following measures can be used to describe the shape of the distribution of the data?
* cv
* kurtosis
* mean
* sd
* sem
* skewness
* var
- kurtosis
- skewness
(POLL)
With the 3 SD criterion, how many outliers do you expect for 1000 values if the data are normally distributed?
* 0
* 1
* 2-3
* 5-10
* 10-100
* 100-1000
- 2-3
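The 2-3 answer follows directly from the normal tail probabilities; a quick check in base R:

```r
# Probability of falling outside +/- 3 SD, scaled to 1000 values
p.out = 2 * pnorm(-3)
print(p.out)          # 0.002699796
print(1000 * p.out)   # 2.699796 -> expect 2-3 outliers per 1000 values
```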
(POLL)
You have a numerical variable with ten values. Which plot would you use for visualization?
* Histogram
* Stripchart
* density line
* violinplot
- Stripchart
also possible but not best:
* Histogram (stripchart better)
(POLL)
You visualize a numerical variable against a categorical one; which are the appropriate plots to use?
* barplot
* boxplot
* histogram
* stripchart
* violinplot
* xyplot
- boxplot
- stripchart
- violinplot
(POLL)
Which test(s) could you use to check if your data are normally distributed?
* Chisq-Test
* Fisher-Test
* Kolmogorov-Smirnov-Test
* Shapiro-Wilk-Test
* T-Test
- Kolmogorov-Smirnov-Test
- Shapiro-Wilk-Test
(POLL)
Which things are shown on a boxplot?
* mean,max,minimum,outliers
* median,1st quartile,3rd quartile,outliers
* mean,1st quartile,3rd quartile,outliers
- median,1st quartile,3rd quartile,outliers
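The numbers behind a boxplot can be inspected with boxplot.stats; a small sketch with one artificial outlier:

```r
# boxplot.stats returns the five hinge values and the outliers
set.seed(42)                 # for reproducibility
x = c(rnorm(100), 8)         # standard normal data plus one outlier at 8
bs = boxplot.stats(x)
print(bs$stats)              # lower whisker, Q1, median, Q3, upper whisker
print(bs$out)                # values beyond 1.5 * IQR from the box
```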
R:
How would you create a summary of a dataset with the following variables?
Cat
Data Summaries
---------------------------------------------------------------------
1D  2D  3D  | function
---------------------------------------------------------------------
Cat NA  NA  | table(c1)
Cat Cat NA  | table(c1,c2), chisq.test(table(c1,c2))
Cat Cat Cat | ftable(c1,c2,c3)
---------------------------------------------------------------------
Cat Num NA  | aggregate(n2,by=list(c1),func)
Cat Num Num | sbi$aggregate2(n2,n3,c1,cor)
Cat Cat Num | aggregate(n3,by=list(c1,c2),func)
---------------------------------------------------------------------
Num NA  NA  | mean(n1), median(n1), sd(n1), mad(n1)
Num Num NA  | cor(n1,n2)
Num Num Num | cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
---------------------------------------------------------------------
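As a concrete illustration of a few table rows (a sketch using the built-in mtcars dataset, with cyl playing the categorical role and mpg/hp the numerical ones):

```r
# Walking through a few rows of the summary table with mtcars
c1 = factor(mtcars$cyl)                      # Cat
n1 = mtcars$mpg; n2 = mtcars$hp              # Num, Num
print(table(c1))                             # Cat: counts per cylinder class
print(aggregate(n1, by=list(cyl=c1), mean))  # Cat Num: mean mpg per class
print(cor(n1, n2))                           # Num Num: correlation mpg vs hp
```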
R:
How would you create a summary of a dataset with the following variables?
Cat Cat
- table(c1,c2)
- chisq.test(table(c1,c2))
R:
How would you create a summary of a dataset with the following variables?
Cat Cat Cat
- ftable(c1,c2,c3)
R:
How would you create a summary of a dataset with the following variables?
Cat Num
- aggregate(n2,by=list(c1),func)
R:
How would you create a summary of a dataset with the following variables?
Cat Num Num
- sbi$aggregate2(n2,n3,c1,cor)
R:
How would you create a summary of a dataset with the following variables?
Cat Cat Num
- aggregate(n3,by=list(c1,c2),func)
R:
How would you create a summary of a dataset with the following variables?
Num
- mean(n1), median(n1), sd(n1), mad(n1)
R:
How would you create a summary of a dataset with the following variables?
Num Num
- cor(n1,n2)
R:
How would you create a summary of a dataset with the following variables?
Num Num Num
- cor(n1,n2), cor(n1,n3), cor(n2,n3) OR cor(data.frame(n1=n1,n2=n2,n3=n3))
Univariate Descriptions of Numerical Data
How can we describe the center?
- center: mean, mean(x,trim=0.1), median
Univariate Descriptions of Numerical Data
How can we describe the scatter?
- scatter: var, sd, cv
Univariate Descriptions of Numerical Data
How can we describe the distribution?
- distribution: quantile, IQR, max, min, range
Univariate Descriptions of Numerical Data
How can we describe the shape?
- shape: skewness, kurtosis
Univariate Descriptions of Numerical Data
How can we describe with graphics?
- plots: boxplot (barplot with arrows)
R:
Calculate mean
- normal/arithmetic mean: mean(x,na.rm=TRUE)
- trimmed mean: mean(x,na.rm=TRUE,trim=0.1)
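A short sketch of what trim does: the trimmed mean is robust against a single extreme value:

```r
# trim=0.1 drops the lowest and highest 10% of values before averaging
x = c(1:9, 100)           # nine small values and one outlier
print(mean(x))            # 14.5 -- pulled up by the outlier
print(mean(x, trim=0.1))  # 5.5  -- mean of 2..9 after trimming
```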
R:
Calculate median
- median: median(x,na.rm=TRUE)
Pythagorean means? Inequality?
- (arithmetic) mean
- geometric mean
- harmonic mean
- they hold this inequality: arithmetic ≥ geometric ≥ harmonic
Formulae: see SPICK (cheat sheet)
Pythagorean Means – ranges? Properties?
(arithmetic) mean:
* standard
* values must have same properties, same ranges
geometric mean:
* average of multiple properties on different scales/ranges
* all values must be above zero
harmonic mean:
* average of rates, all values must be above zero
Why do we need other types of mean:
What is the average speed if travelling 60 km at 60km/h and another 60 km at 20km/h?
- one might think 40 → wrong!
- 1 h at 60 km/h plus 3 h at 20 km/h → 120 km / 4 h → 30 km/h
- This can also be calculated using the harmonic mean
- No 0s or negative with harmonic mean!
~~~
> library(diagram)
> openplotmat(xlim=c(-0.05,1.05),ylim=c(0.2,0.75))
> textdiamond(c(0.5,0.65),lab='60km/h',
+ box.col='light blue',radx=0.12,rady=0.08)
> textdiamond(c(0.5,0.35),lab='20km/h',
+ box.col='light blue',radx=0.12,rady=0.08)
> straightarrow(c(0.1,0.55),c(0.9,0.55),lwd=2)
> straightarrow(c(0.9,0.45),c(0.1,0.45),lwd=2)
> axis(1,labels=c('0km','60km'),at=c(0.1,0.9))
> textround(c(0.1,0.5),lab='A',box.col='salmon',
+ radx=0.01,rady=0.1,cex=1.2)
> textround(c(0.9,0.5),lab='B',box.col='salmon',
+ radx=0.01,rady=0.1,cex=1.2)
> hmean(c(20,60))    # hmean: course helper function (sbi)
[1] 30
> hmean(c(20,0))
[1] 0
> hmean(c(20,60,NA))
[1] NA
> hmean(c(20,60,NA),na.rm=TRUE)
[1] 30
~~~
Why do we need other types of mean:
Geometric Mean example GDP?
- Average GDP of European Union?
- Changes in GDP: each country has equal impact?
→ Solution: geometric mean!
→ This is also helpful for a stock index!
→ Exercise: implement the geometric and harmonic mean! (SPICK)
~~~
> # USD 2016
> bip.gre = 19000
> bip.ger = 42000
> mean(c(bip.ger,bip.gre))
[1] 30500
> mean(c(bip.ger+(bip.ger/10),bip.gre))
[1] 32600
> mean(c(bip.ger,bip.gre+(bip.gre/10)))
[1] 31450
> gmean(c(bip.ger,bip.gre))
[1] 28248.89
> gmean(c(bip.ger+(bip.ger/10),bip.gre))
[1] 29627.69
> gmean(c(bip.ger,bip.gre+(bip.gre/10)))
[1] 29627.69
> gmean(c(0.1,10))
[1] 1
> gmean(c(0.2,10))
[1] 1.414214
> gmean(c(0.1,20))
[1] 1.414214
> gmean(c(0.1,20,100))
[1] 5.848035
> gmean(c(0.15,20,100))
[1] 6.69433
> gmean(c(0.1,30,100))
[1] 6.69433
> gmean(c(0.1,20,150))
[1] 6.69433
~~~
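The gmean and hmean helpers used above are not part of base R; they are course-specific functions (stored in sbi). A minimal sketch of possible implementations, matching the exercise above:

```r
# Sketch implementations of the geometric and harmonic mean
# (assumed versions of the course's sbi helpers; all values must be > 0)
gmean = function(x, na.rm=FALSE) {
  if (na.rm) x = x[!is.na(x)]
  exp(mean(log(x)))        # log trick avoids overflow of prod(x)
}
hmean = function(x, na.rm=FALSE) {
  if (na.rm) x = x[!is.na(x)]
  length(x) / sum(1/x)
}
print(hmean(c(20, 60)))    # 30, the average-speed example
print(gmean(c(0.1, 10)))   # 1
```

With these, the Pythagorean inequality can be checked directly, e.g. for c(1,2,4): mean 2.33 ≥ gmean 2 ≥ hmean 1.71.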
Data scatter results from?
- variability:
– imprecision
– experimental error
– biological variability
- (not from bias – bias is a systematic error)
What is variability, ie where does it come from?
– imprecision
– experimental error
– biological variability
What is bias?
– systematic error
– does not! contribute to scatter
Variability and Bias – illustrated with two guns and two gunmen?
Bad gun = bias
Bad gunman = imprecision
Good gunman (precise)
* Good gun → small scatter, right in the bull's eye (accurate, low variance)
* Bad gun → small scatter, but shifted away from the centre (inaccurate: bias)
Bad gunman (imprecise)
* Good gun → a lot of scatter, but the bull's eye is roughly the average (accurate, high variance)
* Bad gun → a lot of scatter and shifted away from the centre (inaccurate: bias)
What data scatter measures are there for numerical data?
- standard deviation (SD)
- coefficient of variation (CV)
- standard error of the mean (SEM)
Standard deviation (SD) formula for sample?
– s: sample standard deviation: s = √[ ∑(xᵢ − x̄)² / (N − 1) ]
Standard deviation (SD) formula for population?
– σ: population standard deviation: σ = √[ ∑(xᵢ − μ)² / N ]
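Note that R's sd() uses the sample formula (N − 1). A sketch comparing both versions by hand:

```r
# Sample vs population standard deviation, computed from the formulas
x = c(2, 4, 4, 4, 5, 5, 7, 9)              # mean is exactly 5, sum of squares 32
n = length(x)
s   = sqrt(sum((x - mean(x))^2) / (n - 1)) # sample SD, same as sd(x)
sig = sqrt(sum((x - mean(x))^2) / n)       # population SD
print(c(sd(x), s, sig))                    # ~2.138  ~2.138  2
```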
Standard deviation (SD) – proportions in normal data?
normal data:
- ~ 2/3 of data are within 1 SD
- ~ 95% of data are within 2 SD
Issue with SD?
- unit based → SD in m is smaller than SD in cm → use CV
Coefficient of variation (CV) formula? Used to?
– CV% = 100 · sd(x) / mean(x) → cv = 100 · s_x / x̄
– used to compare different magnitudes
Standard error of the mean formula?
– SEM = sd(x)/sqrt(N) → sem = s_x / √N
Standard error of the mean what does it tell us?
– how close are we to the true population mean
Standard error of the mean how can we get closer to pop mean?
– more measurements → closer to the population mean
SEM vs SD?
- SEM is always smaller than SD, so some people prefer to show it
- more measurements → SEM shrinks as we get closer to the population mean
- it does not actually express the scatter, though it depends on the scatter
- it tells us how close we are to the true population mean
R:
SD, CV, SEM
~~~
> sd(survey$kg)
[1] NA
> sd(survey$kg,na.rm=TRUE)    # standard deviation
[1] 12.74224
> print(try(cv(survey$kg)))   # there is no cv function in base R
[1] "Error in cv(survey$kg) : konnte Funktion \"cv\" nicht finden\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in cv(survey$kg): konnte Funktion "cv" nicht finden>
> cv=function(x,na.rm=FALSE) {    # implement own cv function
+   100*sd(x,na.rm=na.rm)/mean(x,na.rm=na.rm)
+ }                               # where to place this? → sbi, see below
> cv(survey$kg)
[1] NA
> sem=function(x,na.rm=FALSE) {   # implement own sem function
+   sd(x,na.rm=na.rm)/sqrt(length(x[!is.na(x)]))
+ }
> sem(survey$kg)
[1] NA
> sem(survey$kg,na.rm=TRUE)
[1] 0.6033626
> sbi$cv=cv
> sbi$sem=sem
~~~
What are the first, second, third, and fourth moments of a distribution?
- first moment: mean
- second moment: variance
- third moment: skewness
- fourth moment: kurtosis
What is skewness?
- third central moment of a distribution
- shape: signed measure for the degree of asymmetry
Skewness – positive values?
- positive values: skewed to the right (long tail on the right)
Skewness – negative values?
- negative values: skewed to the left (long tail on the left)
Skewness – around 0?
- around zero: symmetrical distribution
R:
Which library for skewness and kurtosis?
> library(e1071)
R:
Skewness – normal distribution?
~~~
> library(e1071)
> library(UsingR)
> nym=nym.2002[nym.2002$age>16,]
> skewness(nym$time)
[1] 1.083422 ### longer tail on the right
> skewness(nym$time*-1)
[1] -1.083422 ### longer tail on the left
> x <- rnorm(1000)
> skewness(x)
[1] 0.08863
> x <- rnorm(1000)
> skewness(x)
[1] -0.03025407
> hist(nym$time,freq=F,ylim=c(0,0.01),main='',xlab='time',ylab='density')
> box()
> lines(density(nym$time),col="blue",lwd=3)
~~~
R:
Skewness - Bimodal distribution?
~~~
> norm1=c(rnorm(1000,mean=5,sd=2))
> norm2=c(rnorm(500,mean=13,sd=2))
> nonorm=c(norm1,norm2)
> median(nonorm)
[1] 6.304157
> mean(nonorm)
[1] 7.654257   ### mean and median are not close together
               ### → indication of a long tail or bimodal distribution
> skewness(nonorm)
[1] 0.4822618  ## right tail
> par(mfrow=c(2,1))
> hist(nonorm,col="light blue");box()
> points(mean(nonorm),145);text(mean(nonorm),160,"mean")
> points(median(nonorm),185);text(median(nonorm),200,"median")
> qqnorm(nonorm);qqline(nonorm)
~~~
R:
Skewness implementation in R?
~~~
> dskewness <- function (x,na.rm=TRUE) {
+   if (na.rm) {
+     x=x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g=(sum((x-mean(x,na.rm=na.rm))^3)/length(x))/(sd(x)^3)
+   return(g)
+ }
> skewness(nonorm)
[1] 0.4822618
> dskewness(nonorm)
[1] 0.4822618
> sbi$skewness=dskewness
~~~
What is Kurtosis?
- fourth central moment of a distribution
- peakedness of a distribution
Kurtosis formula: see SPICK (cheat sheet)
What does a large kurtosis mean?
* large values: sharp peak
What does a ca 0 kurtosis mean?
* around zero: normal peak
What does a small kurtosis mean?
* low values: broad peak
R:
Kurtosis usage?
~~~
> library(e1071)
> xn1=runif(1000)
> kurtosis(xn1)
[1] -1.195974
> xn2=runif(1000)
> kurtosis(xn1+xn2)
[1] -0.5967484
> xn3=runif(1000)
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> # rnorm gives around 0
> kurtosis(rnorm(1000))
[1] 0.036207
> # rt can be peakier ...
> kurtosis(rt(1000,df=10))
[1] 1.443117
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(xn1,col='#ff3333',breaks=25); box()
> hist(xn1+xn2+xn3,col='#ffff33',breaks=20);box()
> hist(rt(1000,df=10),col='#33ff33',breaks=25);box()
~~~
R:
Kurtosis – Implementation?
~~~
> dkurtosis <- function (x,na.rm=TRUE) {
+   if (na.rm) {
+     x=x[!is.na(x)]
+   } else {
+     if (any(is.na(x))) { return(NA) }
+   }
+   g=(sum((x-mean(x,na.rm=na.rm))^4)/length(x))/(sd(x)^4)-3
+   return(g)
+ }
> kurtosis(xn1+xn2+xn3)
[1] -0.3342538
> dkurtosis(xn1+xn2+xn3)
[1] -0.3342538
> sbi$kurtosis=dkurtosis
~~~
Outliers definition?
Grubbs (1969) defined an outlier as:
an observation that appears to deviate markedly from the other members of the sample in which it occurs.
Outliers measures?
- more than 3 x SD away from the mean
- more than 1.5 × IQR beyond the quartiles (mild) or 3 × IQR (extreme) → the boxplot criteria
What could an outlier be?
* an error prone data point ?
* a normal data scatter point ?
* an important specific observation ?
* → It depends
How can we deal with outliers?
* delete the variable
* delete the value → missing value imputation
* transform the variable
* transform the value
R:
Implement outlier function to check for outliers in normal distribution
~~~
> is.outlier <- function (x) {
+   sd=sqrt(var(x,na.rm=TRUE))
+   xm=x-mean(x,na.rm=TRUE)
+   return(abs(xm)-3*sd>0)
+ }
> rn=rnorm(1000)
> table(is.outlier(rn))
FALSE  TRUE
  998     2
> table(abs(scale(rn))>3)   # same criterion via scale()
FALSE  TRUE
  998     2
> is.outlier = function (x) { return(abs(scale(x))>3) }
> table(is.outlier(rn))
FALSE  TRUE
  998     2
~~~
What is a violinplot?
- similar to a box plot
- but it also shows the probability density of the data at different values,
- usually smoothed by a kernel density estimator
R:
Violinplot (lattice)?
~~~
> library(lattice)
> print(bwplot(nym.2002$age,main="nym.2002$age",
+   panel=panel.violin,col="light blue"))
~~~
Density plot?
- For single numerical variable
- like a smoothed histogram → bars become lines
R:
Density plot?
~~~
> hist(nym.2002$age,main="nym.2002$age",freq=F,col="light blue")
> box()
> lines(density(nym.2002$age),col="blue")
> lines(density(nym.2002$age,bw=1),col="red")
> lines(density(nym.2002$age,bw=4),col="green")
> lines(density(nym.2002$age,bw=10),col="magenta")
~~~
Boxplot vs Stripchart?
Boxplot:
* many values
Stripchart:
* few values
* 1D scatter plots (or dot plots)
* good alternative to boxplot when few values
R:
Boxplot vs Stripchart?
~~~
par(mfrow=c(1,3))
boxplot(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,col=c(2,4))
stripchart(survey$cm~survey$gender,method="stack",
  col=c(2,4),vertical=TRUE)
~~~
The stack option makes it look a bit like a histogram:
'a mixture between histogram, density line and violin plot'
When to use which Num ~ Cat plot ?
* violinplot: ?
* stripplot: ?
* boxplot: ?
- violinplot: show bi- or multimodal data
- stripplot: show few values (a dozen per group)
- boxplot: all other cases
Numerical ~ Numerical
Two ways to analyse the relationship between two numerical variables?
- Correlation
- Regression
Which method of analysing a relationship between two numerical variables is interested in the direction of the relationship?
interested: regression
Which method of analysing a relationship between two numerical variables is not interested in the direction of the relationship?
not interested: correlation
R:
How can we get from some other distribution to a normal distribution?
Uniform to Normal Distribution
(extra)
Uniform distribution – let's throw dice:
~~~
> options(digits=3)
> runif(6,1,7)
[1] 2.30 2.69 1.15 3.54 5.52 3.50
> as.integer(runif(10,1,7))
[1] 4 1 1 2 5 3 1 2 2 3
> ru1=as.integer(runif(1000,1,7))
> table(ru1)
ru1
1 2 3 4 5 6
163 166 164 157 162 188
> ru2=as.integer(runif(1000,1,7))
> par(mfrow=c(2,1),mai=rep(0.4,4))
> barplot(table(ru1))
> barplot(table(ru2))
~~~
R:
Mean of two uniform samples
~~~
> ru1u2=(ru1+ru2)/2
> table(ru1u2)
ru1u2
  1 1.5   2 2.5   3 3.5   4 4.5   5 5.5   6
 30  47  92  99 131 176 159 108  89  48  21
> par(mfrow=c(2,1),mai=rep(0.4,4))
> hist(ru1u2)
> barplot(table(ru1u2))
~~~
→ hist problem with discrete data
R:
Means of Uniform Distributions
→ from a uniform (or any other) distribution to a Gaussian distribution …
~~~
> options(digits=3)
> r1=runif(1000,1,7)
> par(mfrow=c(3,1),mai=rep(0.4,4))
> hist(r1,col='light blue',main='r1')
> box()
> for (i in 1:10) {
+   r1=r1+runif(1000,1,7)
+   if (i == 2) {
+     hist(r1/3,main='r3',col='light blue')
+   }
+ }
> hist(r1/11,main='r11',col='light blue')
> box()
~~~
Central Limit Theorem
- if the sample size is large enough
- the population of sample means will approximate a Gaussian distribution
- no matter how the population is distributed
Central Limit Theorem
How large does sample size need to be?
It depends:
* the more normal the population → 10 or more samples are enough
* the less normal the population → more samples needed; 100 should always be enough
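A sketch demonstrating the theorem: individual uniform values fail a normality test, but their sample means look normal:

```r
# Means of uniform samples approach a normal distribution (CLT)
set.seed(1)
u = runif(1000, 1, 7)                         # clearly non-normal data
means = replicate(1000, mean(sample(u, 30)))  # 1000 sample means, n=30
print(shapiro.test(u)$p.value)      # tiny: uniform data rejected as normal
print(shapiro.test(means)$p.value)  # typically > 0.05: means look normal
```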
Properties of a Normal Distribution?
- symmetrical bell shape
- extends in both directions to ∞
- mean, median close together
- 95% of values within 2 SD
For which distribution(s) is this true: 95% of all values are within 2 SD?
Normal distribution
this assumption gives very wrong results if the distribution is non-normal ‼
Normal numerical data (num ~ cat) → which test?
t.test
Non-normal numerical data (skewed or multi-modal distributions, num ~ cat) → which test?
wilcox.test
R:
The [dpqr]norm function family?
- d: pdf (probability density function); dnorm(N) → density at N
- p: cdf (cumulative distribution function); pnorm(N) → P(X ≤ N)
- q: inverse cdf (quantile function); qnorm(P) → X
- r: random number generator; rnorm(N) → N random values
Example:
~~~
> options(digits=3)
> rn=rnorm(1000)
> hist(rn,freq=F,col='light blue')
> lines(density(rn),col='red',lwd=2)
> box()
> pnorm(0)
[1] 0.5
> summary(rn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐3.36 ‐0.62 0.01 0.04 0.68 3.42
> dnorm(0)
[1] 0.399
> qnorm(0.5)
[1] 0
~~~
R:
How can we test for normality?
H0 for Shapiro-Wilk test assumes normality:
~~~
> rn=rnorm(1000,mean=5.5,sd=0.5)
> mean(rn)
[1] 5.5
> median(rn)
[1] 5.51
> shapiro.test(rn)
Shapiro‐Wilk normality test
data: rn
W = 1, p‐value = 0.7
> shapiro.test(rn)$p.value
[1] 0.746
> shapiro.test(runif(100,1,6))$p.value
[1] 0.00346
> shapiro.test(runif(100,1,6))$statistic
W
0.956
> shapiro.test(survey$cm)
Shapiro‐Wilk normality test
data: survey$cm
W = 1, p‐value = 4e‐04
> shapiro.test(survey$cm[survey$gender=='M'])$p.value
[1] 0.409
~~~
→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution
→ p-value < 0.05 we reject H0, that the distribution comes from a normal distribution
Shapiro-Wilk Test:
→ p-value >= 0.05 we ______ H0, that the distribution comes from a normal distribution
→ p-value >= 0.05 we don’t reject H0, that the distribution comes from a normal distribution
Reporting Shapiro-Wilk Test?
Analysing height: W = 0.987, p = 0.00042 and weight, W = 0.951, p = 0, of students.
The size of students, W = 0.987, p = 0.00042, as well as the weight of students, W = 0.951, p = 0, were both significantly non-normally distributed.
* with many samples the Shapiro-Wilk test becomes significant very easily
* with many samples use it only together with visual inspection (histogram, qqplot)
[Better:
The size of students, W = 0.987, p < 0.001, as well as the weight of students, W = 0.951, p < 0.001, were both significantly non-normal.]
The Shapiro-Wilk test can be used to check if data comes from a normal distribution. How else can you check for normality?
- Kolmogorov-Smirnov test
- a generalized test for any distribution
- checks whether two samples might come from the same distribution → or one sample against e.g. a normal distribution
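A sketch of ks.test in both roles (comparison against a theoretical distribution, and two-sample comparison):

```r
# Kolmogorov-Smirnov test: one-sample and two-sample usage
set.seed(1)
x = rnorm(100); y = rnorm(100); z = runif(100)
print(ks.test(x, "pnorm")$p.value)  # typically high: consistent with normal
print(ks.test(x, y)$p.value)        # typically high: same distribution
print(ks.test(x, z)$p.value)        # low: clearly different distributions
```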
R:
Normal Data Visualization
~~~
> norm=c(rnorm(1000,mean=5,sd=3))
> median(norm)
[1] 5.06
> mean(norm)
[1] 5.03
> shapiro.test(norm)$p.value
[1] 0.419
> par(mfrow=c(1,2),mai=c(0.4,0.4,0.4,0.4))
> hist(norm,col="light blue")
> points(mean(norm),145)
> text(mean(norm),160,"mean")
> points(median(norm),185)
> text(median(norm),200,"median")
> box()
> qqnorm(norm)
> qqline(norm)
~~~
R:
QQ-plots
Plotting the observed values against the expected/theoretical
~~~
> options(digits=3)
> rn=rnorm(10)
> head(sort(signif(rn,3)),n=5)
[1] ‐1.030 ‐0.657 ‐0.540 ‐0.439 0.147
> qqnorm(rn)
> qqline(rn)
~~~
T-Distribution: Derived from _____________
Normal distribution
T-Distribution: t is the _______ between ________ and the ______, divided by the _____
t is the difference between the sample mean and the population mean, divided by the SEM
T-Distribution: you perform a sampling experiment where you know the __________
Population value
R:
Practical – what’s being demonstrated here?
T-Distribution: There is always a difference … Let’s simulate many t’s
~~~
> rn=rnorm(10000) # our population is rn
> summary(rn)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐3.68 ‐0.67 ‐0.02 0.00 0.66 3.99
> mcompare=function () {sam1=sample(rn,10); sam2=sample(rn,10); return(mean(sam1)-mean(sam2))}
> res=c() # create empty result vector
> for (i in 1:1000) { res=c(res,mcompare()) }  # append to result
> hist(res,col='light blue',cex.lab=1.5,cex.main=1.5)
> box()
~~~
R:
Practical – how to create own t distribution?
Distribution of t Values
~~~
> getT=function(n) {s1=sample(rn,n); t=(mean(s1)-mean(rn))/(sd(s1)/sqrt(n)); return(t) }
> getT(5)
[1] 0.642
> res=c()
> for (i in 1:1000) { res=c(res,getT(10)) }
> par(mfrow=c(2,1),mai=c(0.4,0.4,0.4,0.4))
> summary(res)
Min. 1st Qu. Median Mean 3rd Qu. Max.
‐6.26 ‐0.75 ‐0.01 ‐0.07 0.63 5.36
> hist(res,col='beige',freq=F,ylim=c(0,0.5))
> lines(density(res),col='red',lwd=2)
> lines(seq(-5,5,0.1),dt(seq(-5,5,0.1),df=10),
+   col='blue',lwd=2,lty=1); box()
~~~
R:
Function to generate random data from the t-distribution?
~~~
> print(try(rt(1000)))   # the df argument is required
[1] "Error in rt(1000) : Argument \"df\" fehlt (ohne Standardwert)\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in rt(1000): Argument "df" fehlt (ohne Standardwert)>
> xt=rt(1000,99)
> summary(xt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -3.96   -0.69    0.02    0.01    0.69    3.97
> shapiro.test(xt)$p.value
[1] 0.271
> ks.test(xt,"pt",99)$p.value
[1] 0.721
~~~
R:
The qt function?
t* (critical t values) with qt:
> df=c(1:6,10,20,50,100,10000);
> t=qt(0.975,df)
> data.frame(t,df)
t df
1 12.71 1
2 4.30 2
3 3.18 3
4 2.78 4
5 2.57 5
6 2.45 6
7 2.23 10
8 2.09 20
9 2.01 50
10 1.98 100
11 1.96 10000
(QUIZ 3)
The ______ is useful for calculating the mean of two speeds, whereas the ______ can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the ______.
The harmonic mean is useful for calculating the mean of two speeds, whereas the geometric mean can be used if our data are on different scales but we would like to monitor changes over time where changes of both variables have the same impact. If we say “mean” we usually are speaking about the arithmetic mean.
(QUIZ 3)
A measure to describe how close our data are to the population mean is the ______. Its values are smaller than the values of the ______, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the ______ should be considered as well.
A measure to describe how close our data are to the population mean is the standard error of the mean. Its values are smaller than the values of the standard deviation, which are unit dependent as well. As it is good to also have a unit-free description of the data scatter, the coefficient of variation should be considered as well.
(QUIZ 3)
The skewness is the ______ central moment of a distribution, whereas the kurtosis is the ______ moment. The ______ measures how sharp or flat a distribution is whereas the ______looks for the symmetry. A ______kurtosis means a very sharp distribution, whereas a ______kurtosis indicates a flat, more uniform like distribution
The skewness is the 3rd central moment of a distribution, whereas the kurtosis is the 4th central moment. The kurtosis measures how sharp or flat a distribution is, whereas the skewness looks at the symmetry. A positive kurtosis means a very sharp distribution, whereas a negative kurtosis indicates a flat, more uniform-like distribution.
(QUIZ 3)
A boxplot displays the overall ______ of the distribution. The line in the middle of the box indicates the ______, whereas the upper bound indicates the ______, the lower bound the ______. Outside of the whiskers the ______ are shown, which are more than 1.5 ______ away from the box.
A boxplot displays the overall shape of the distribution. The line in the middle of the box indicates the 50% quantile (median), whereas the upper bound indicates the 75% quantile, the lower bound the 25% quantile. Outside of the whiskers the outliers are shown, which are more than 1.5 IQR away from the box.
(QUIZ 3)
The Null-Hypothesis of the ______ test assumes that the data are coming from a ______ distribution. A p-value of less than 0.05 indicates that the data are coming from a ______. The ______ test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the ______ population. A low p-value indicates that the distributions are coming from _______. In a normal distribution mean and median have values which are ______.
The Null-Hypothesis of the Shapiro-Wilk test assumes that the data are coming from a normal distribution. A p-value of less than 0.05 indicates that the data are coming from a non-normal distribution. The Kolmogorov-Smirnov test can be used to compare different distributions. So it can be used as well to check if two sample distributions might be coming from the same population. A low p-value indicates that the distributions are coming from different populations. In a normal distribution, mean and median have values which are close.