5 | Introduction To Statistics Flashcards

1
Q

(POLL)

Two wards of one hospital have very different survival rates for patients with heart problems. Ward 2 (second floor) performs much better than ward 1 (ground floor). Who is most likely responsible?

  • The doctors on ward 2: they are just better!
  • The porter, who usually asks patients: “Do you feel fit enough to walk up the stairs to the second floor?”
  • The patients, who just feel better on higher floors
  • None of them
A

The porter, who usually asks patients: “Do you feel fit enough to walk up the stairs to the second floor?” (Fitter patients end up on the second floor: selection bias.)

2
Q

(POLL)

Looking at the relation between dark chocolate consumption and IQ is:

  • correlational research
  • experimental research
  • making me hungry
  • none of these suggestions
A

correlational research

3
Q

(POLL)

Double-blind trials are clinical trials in which neither the patient nor the doctor knows which medication is given. Select the true statements:

  • they are correlational research
  • they are experimental research
  • they are worse than observational studies because they increase selection bias
  • they are better than observational studies because they decrease selection bias
  • they have no selection bias
  • they still can have selection bias
A
  • they are experimental research
  • they are better than observational studies because they decrease selection bias
  • they still can have selection bias
4
Q

(POLL)

To evaluate the outcome for a patient after a virus infection, the following boxes were prepared for a survey: asymptomatic, common cold, long-term suffering, dead … What type of variable is this?

  • discrete numerical 0, 1, 2, 3 etc for the levels
  • continuous numerical 0.0, 1.0, 2.0, 3.0 for the levels
  • nominal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)
  • ordinal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)
A

ordinal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)

5
Q

(POLL)

Which of the following measures are robust against outliers?

  • mean
  • trimmed mean
  • Median
A
  • trimmed mean
  • Median
6
Q

(POLL)

Which measures give information about the spread of the data?

  • mean
  • trimmed mean
  • median
  • IQR
  • sd
  • z-score
A
  • IQR
  • sd
7
Q

Name some terms from statistics that are misleading because they mean something different in the general context of science

A

significant, error, hypothesis

8
Q

What is the aim of statistics?

A

descriptive:
- describe and summarize the data
- visualize trends for better understanding

inferential –> draw conclusions about the population from a sample:
- decide whether two groups are different
- decide whether the sample differs from the total population
- describe the relationship between two variables

Statistics:
- can help decide whether an observed difference is just a random one
- just estimates the “degree” of randomness
- can’t tell you how likely it is that there really is a difference

9
Q

Sampling: requirements / best practices?

A
  • sample items must be randomly selected (not some convenient subset)
  • items must be independent of each other (selecting one item should not alter the chance of other items being selected)
  • having a plan before entering the greenhouse
  • more samples are better
  • if there are groups (eg experiment and control): balanced sampling is better
10
Q

Name some common sampling problems

A
  • your cohort and topic might change over time (cancer, virus)
  • true population is more diverse than the population you were sampling from
  • using a convenience sample rather than a random sample (falling cats)
  • your measured variable is just a proxy for another variable (poll)
  • imprecise measurements (misunderstandings, wrong scale for some people)
  • combination of different measurements required
  • even with clear results: there is room for interpretation
  • statistics can’t help against bad experimental design
11
Q

Three examples of poor sampling

A
  • study about cats surviving falls: cats that ended up in the bin were not part of the study, only the ones that were taken to the vet
  • telephone poll: for the 1936 US election (Landon vs Roosevelt) a landslide victory for Landon was predicted, but telephone owners were disproportionately conservative/Republican (Roosevelt won)
  • coronavirus in Germany: total cases were increasing, but no real random sampling took place, eg to check antibodies. The numbers were based on reported illnesses
12
Q

Name 4 different types of sampling strategies

A
  • simple random sampling (SRS)
  • systematic sampling
  • stratified sampling
  • cluster sampling
13
Q

Simple random sampling?

A

simple random sampling:

all subsets of the same size of a sampling frame have an equal probability of being selected

eg roll a die to decide whom to choose
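
A minimal R sketch of simple random sampling, assuming a hypothetical sampling frame of 100 numbered people (the names `frame` and `srs` are illustrative and not from the course). `sample()` draws without replacement by default, so every subset of 10 people has the same chance of being picked.

```r
# Hypothetical sampling frame: 100 people identified by number
frame <- 1:100

set.seed(1)                      # only to make the example reproducible
srs <- sample(frame, size = 10)  # draw 10 people without replacement
srs
```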

14
Q

Systematic sampling?

A

systematic sampling

relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list.

eg every kth person
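
Selecting every kth person could be sketched in R as follows. The list of 100 people, the interval `k` and the random starting point are assumptions made for illustration.

```r
frame <- 1:100            # hypothetical ordered list of 100 people
k <- 10                   # sampling interval: every 10th person

set.seed(1)               # only for reproducibility
start <- sample(1:k, 1)   # random starting point within the first interval
systematic <- frame[seq(start, length(frame), by = k)]
systematic
```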

15
Q

Stratified sampling?

A

stratified sampling

dividing the population into subgroups (strata) based on categories, eg male and female, and then sampling from each subgroup
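
A rough R sketch of stratified sampling, assuming a made-up population of 60 women and 40 men and a fixed draw of 5 people per stratum (both are illustrative choices, not from the course).

```r
# Hypothetical population: 60 women and 40 men
population <- data.frame(id = 1:100,
                         gender = rep(c("F", "M"), times = c(60, 40)))

set.seed(1)  # only for reproducibility
# Draw 5 people at random from each stratum (gender)
picked <- lapply(split(population, population$gender),
                 function(stratum) stratum[sample(nrow(stratum), 5), ])
stratified <- do.call(rbind, picked)
table(stratified$gender)
```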

16
Q

Clustered sampling?

A

Sometimes it is more cost-effective to select respondents in groups (‘clusters’). Sampling is often clustered by geography, or by time periods.
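
A possible R sketch of cluster sampling by geography, assuming a made-up population of 10 villages with 10 people each. Whole villages are drawn at random, and everyone in a chosen village is included.

```r
# Hypothetical population: 100 people living in 10 villages of 10 people each
population <- data.frame(id = 1:100,
                         village = rep(paste0("V", 1:10), each = 10))

set.seed(1)  # only for reproducibility
villages <- unique(as.character(population$village))
chosen <- sample(villages, 3)                       # pick 3 whole clusters
cluster_sample <- population[population$village %in% chosen, ]
nrow(cluster_sample)
```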

17
Q

Research Methods for the Sample - two types?

A

correlational and experimental

18
Q

Correlational vs experimental research - what’s the difference?

Explain with the example: does reading books help with learning?

A

correlational research
– we don’t manipulate a variable

experimental research
– we manipulate a variable

Eg: does reading books help with learning?

correlational research
– we don’t manipulate a variable
– we observe what happens naturally w/o interfering
– we just collect answers for reading behaviour

experimental research
– we manipulate a variable
– divide our sample for reading into two groups
– one group must read statistics books
– other group is not allowed to read statistics books
– after a month we summarize

19
Q

Correlational vs experimental - what is the preferred method?

A

experimental

20
Q

Correlational vs experimental - why would you choose correlational?

A

It’s difficult to research experimentally for some reason:
- ethical reasons
- financial reasons

21
Q

What are the four different levels of measurement, in ascending level of precision?

A

Data types:

Categorical (quality)
- nominal
- ordinal

Numerical (quantity)
- discrete
- continuous
(the distinction is often not clear, eg with height)

22
Q

Define this level of measurement and give an example:

nominal

A

A categorical level of measurement that doesn’t have any order.

eg
- gender: female, male (binary)
- smoker: yes, no (binary)
- protein structures: H, E, …
- nucleotides: A,C, G, T, U

23
Q

Define this level of measurement and give an example:

ordinal

A

A categorical level of measurement with a specified order.

eg:
- age: young,medium,old
- grade: 1,2,3,4,5
- month: 01..12(?)

24
Q

Define this level of measurement and give an example:

discrete

A

A numerical level of measurement that takes only distinct, countable values (eg whole numbers).

eg:
– age: 6, 8, 84
– height: 112, 176, 161cm
– number of helices per 1000 AA
– length of helices

25
Q

Define this level of measurement and give an example:

continuous

A

A numerical level of measurement that can take any value within a range (non-discrete).

eg:
– weight: 79.99kg, 72kg,…
– height: 12.2, 12.5, 15.0

26
Q

What can you ask about a datatype, in order to determine which datatype/level of measurement it is?

A
  • Can you calculate a mean?

Yes –>
- is the mean always a possible value?
Yes –> continuous numerical
No –> discrete numerical

No –>
- Is there a logical order of values?
Yes –> Ordinal categorical
No –> Nominal categorical
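
The two questions above can be written down as a small R helper. The function `level_of_measurement` and its argument names are hypothetical and only encode the decision tree from this card.

```r
# Hypothetical helper that encodes the two questions from the card
level_of_measurement <- function(can_take_mean,
                                 mean_is_possible_value = FALSE,
                                 has_order = FALSE) {
  if (can_take_mean) {
    if (mean_is_possible_value) "continuous numerical" else "discrete numerical"
  } else {
    if (has_order) "ordinal categorical" else "nominal categorical"
  }
}

level_of_measurement(TRUE, mean_is_possible_value = TRUE)   # eg weight in kg
level_of_measurement(TRUE, mean_is_possible_value = FALSE)  # eg number of helices
level_of_measurement(FALSE, has_order = TRUE)               # eg grade
level_of_measurement(FALSE, has_order = FALSE)              # eg smoker yes/no
```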

27
Q

Explain univariate, bivariate, multivariate

A

The terms describe how many variables there are in the statistical problem: one (univariate), two (bivariate), or more than two (multivariate)

28
Q

Groups in Categorical and Normality in Numerical Data...

meaning for descriptive statistics? inferential statistics?

A

How many groups/levels does the variable have?

eg:
- two groups (running times for female vs male)
- three or more groups (running times for young, medium-aged and masters)

descriptive statistics:
same graphics, same functions (barplot, table, mode, …)

inferential statistics:
different tests!
– t.test vs aov (2 groups vs 3 or more groups, for normal data)
– wilcox.test vs kruskal.test (2 groups vs 3 or more groups, for non-normal data)
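
As a rough illustration of how the four tests are called in R, the sketch below uses simulated running times. The data frame `times` and its columns are invented for this example, and the simulated values carry no real meaning.

```r
set.seed(1)  # only for reproducibility
# Hypothetical running times in minutes for three age groups
times <- data.frame(
  minutes = c(rnorm(10, 50, 5), rnorm(10, 55, 5), rnorm(10, 60, 5)),
  group   = rep(c("young", "medium", "masters"), each = 10)
)
two <- subset(times, group != "masters")   # keep only two groups

t2 <- t.test(minutes ~ group, data = two)         # 2 groups, normal data
w2 <- wilcox.test(minutes ~ group, data = two)    # 2 groups, non-normal data
a3 <- aov(minutes ~ group, data = times)          # 3 or more groups, normal data
k3 <- kruskal.test(minutes ~ group, data = times) # 3 or more groups, non-normal data
summary(a3)
```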

29
Q

Explain the difference between dependent and independent variables and give an example

A

dependent:
– depends on another –> an outcome variable

independent
– variable influences another (dependent) variable –> is a predictor variable
– might be manipulated

example:
let’s assume we can predict weight based on sex and height (machine learning!)
– weight is the outcome variable (dependent)
– sex and height are predictor variables (independent)

30
Q

Define descriptive statistics

A

Descriptive statistics are used to describe the main features of a collection of data in quantitative terms.

Descriptive statistics are distinguished from inferential statistics, in that descriptive statistics aim to quantitatively summarize a data set, rather than being used to support inferential statements about the population that the data are thought to represent. … to give the audience an overall sense of the data being analysed.

31
Q

Define inferential statistics

A

Statistical inference or statistical induction comprises the use of statistics and random sampling to make inferences concerning some unknown aspect of a population.

32
Q

Name the different sample data centers

A

– Mode: most frequent value (more males?)
– Median: value where 50% of the data are smaller and 50% are larger (robust against outliers)
– Interquartile range (IQR) = 3rd quartile - 1st quartile (the middle 50%; strictly a measure of spread rather than of center)
– Mean: sum of all values divided by the number of values

33
Q

Which sample data center can be used for nominal datatypes?

A
  • mode
34
Q

Which sample data center can be used for ordinal datatypes?

A
  • mode
  • (median ?)
  • (mean ?)
35
Q

Which sample data center can be used for numerical datatypes?

A
  • median
  • mean
36
Q

How can one describe the sample distribution?

A

with:
- max, min
- quantile
- IQR (interquartile range)
- standard deviation
- CV (coefficient of variation)

37
Q

Do the following describe the sample distribution?

  • SEM
  • CI
  • P-value
A

Not really: they say more about the population (inferential statistics)

38
Q

SEM:

What does this stand for?
What does it measure?
How is it calculated?

A

Standard error of the mean.

Measures how much discrepancy is likely in a sample’s mean compared with the population mean.

Calculate: divide SD by the square root of the sample size.
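
A short R sketch of this calculation, using a made-up vector of five heights (base R has no dedicated SEM function, so it is computed by hand):

```r
# Hypothetical sample of body heights in cm
x <- c(160, 165, 170, 175, 180)

sem <- sd(x) / sqrt(length(x))   # SD divided by the square root of n
sem
```

If the data contain NA values, both `sd()` and the sample size would need to ignore them.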

39
Q

CI:

What does this stand for?
What does it measure?
How is it calculated?

A

Confidence interval

The range of values within which the estimate is expected to fall if the test is repeated, at a given level of confidence.

Confidence, in statistics, is another way to describe probability.

CI = estimate (eg the mean) plus and minus the variation in that estimate.
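
One common way to turn "estimate plus and minus variation" into numbers is a t-based 95% interval for the mean. The sketch below assumes roughly normal data and uses a made-up sample; it is one possible calculation, not necessarily the one used in the course.

```r
x <- c(160, 165, 170, 175, 180)         # hypothetical heights in cm
n <- length(x)
sem <- sd(x) / sqrt(n)                  # standard error of the mean
margin <- qt(0.975, df = n - 1) * sem   # t quantile for a 95% interval
ci <- c(lower = mean(x) - margin, upper = mean(x) + margin)
ci
```

`t.test(x)$conf.int` reports the same interval.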

40
Q

P-value

What does this stand for?
What does it measure?
How is it calculated?

A

p stands for probability

P-values are used in hypothesis testing to help decide whether to reject the null hypothesis (inferential statistics)

Describes how likely you are to have found a particular set of observations if the null hypothesis were true.

Calculated from a statistical test.
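
As an illustration, a p-value can be read from the result of a test such as `t.test()`. The two groups below are invented numbers, and the null hypothesis is that both groups have the same mean.

```r
# Hypothetical running times (minutes) for two groups
group_a <- c(50, 52, 47, 55, 51, 49)
group_b <- c(58, 60, 57, 62, 59, 61)

result <- t.test(group_a, group_b)  # Welch two-sample t-test
result$p.value                      # small value: data are unlikely under H0
```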

41
Q

IQR

Stands for?
Meaning?

A

Interquartile range.

In descriptive statistics: tells you spread of middle half of distribution.

Quartiles segment any distribution that’s ordered from low to high into four equal parts.

The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set
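
A small R sketch with made-up data, showing the IQR as the distance between the 1st and 3rd quartiles:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)    # hypothetical data
q <- quantile(x, c(0.25, 0.75))      # 1st and 3rd quartile: 3 and 7
iqr_by_hand <- unname(q[2] - q[1])   # 4
IQR(x)                               # built-in function, also 4
```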

42
Q

What is meant by parameters vs statistics?

A

A parameter is a number describing a whole population (e.g., population mean)

A statistic is a number describing a sample (e.g., sample mean).

we use the sample to estimate the parameters of the population

Parameters and statistics have the same name but mean different things (eg mean of sample ȳ ≈ μ, mean of the population)

43
Q

How can you tell which one is parameter and which one is statistic? eg ȳ, μ

A

Usually:
- use latin letters for statistics (sample)
- use greek letters for parameters (population)

44
Q

How is deviation of samples and populations described?

A

standard deviation from the mean: s (sd) for the sample, σ for the population (sample variance s², population variance σ²)

45
Q

How is the uncertainty/quality of results described, in inferential statistics?

A

SEM
CI

46
Q

R:

how do you get the median from a data object survey with a subset cm?

What if there are some empty cells?

A

> median(survey$cm)

> median(survey$cm,na.rm=TRUE)

(na.rm=TRUE ignores the empty cells, which R stores as NA; without it the result is NA)
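
To try this without the course data, a tiny stand-in for `survey` can be built by hand (the values below are invented):

```r
# Hypothetical stand-in for the survey data: heights in cm, one value missing
survey <- data.frame(cm = c(160, 172, NA, 181, 168))

median(survey$cm)                # NA, because one value is missing
median(survey$cm, na.rm = TRUE)  # 170, the median of the four known values
```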

47
Q

R:

what does this command do?

> summary(survey$cm)

A

returns:
- min
- 1st qu
- median
- mean
- 3rd qu
- max
- NA’s

of that dataset

47
Q

R:

how do you get the mean from a data object survey with a subset cm, and we want it to be less sensitive to outliers? How does this work exactly?

A

> mean(survey$cm,na.rm=TRUE,trim=0.1)

trim=0.1 removes the lowest 10% and the highest 10% of the values before the mean is calculated
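
The effect is easiest to see on a small made-up vector with one extreme value:

```r
x <- c(1:9, 100)       # hypothetical data with one outlier

mean(x)                # 14.5, pulled up by the outlier
mean(x, trim = 0.1)    # 5.5, lowest and highest value dropped first
median(x)              # 5.5
```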

48
Q

R:

how do you get the min / max / standard deviation ?

A

min()
max()
sd()

49
Q

R:

How do you get the quantiles for a data object survey with a subset cm? and also take care of any NA’s

A

> quantile(survey$cm,c(0.25,0.5,0.75),na.rm=T)

50
Q

R:

you have a data object survey with 4 subsets. how can you get a summary for all of them?

A

> summary(survey[,1:4])

51
Q

R:

mean, trimmed mean, median - what is sensitive to outliers?

A
  • mean: too sensitive to outliers!
  • trimmed mean: less sensitive
  • median: not sensitive - most robust measure

–> use trimmed mean or median if there are outliers

52
Q

R:

what does the aggregate function do?

A

In R, you can use the aggregate function to compute summary statistics for subsets of the data. This function is very similar to the tapply function, but you can also input a formula or a time series object and, in addition, the output is of class data.frame.

53
Q

R:

how can you use the aggregate function on a data object survey with subsets cm and gender to get the median height for each gender?

A

> aggregate(survey$cm,by=list(survey$gender),median,na.rm=TRUE)
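
A runnable sketch with an invented stand-in for `survey`. The formula variant is an alternative spelling of the same computation; it drops rows with NA by default.

```r
# Hypothetical stand-in for the survey data
survey <- data.frame(
  cm     = c(160, 165, 170, NA, 175, 180, 185),
  gender = c("F", "F", "F", "F", "M", "M", "M")
)

# Median height per gender; the result column is called "x"
res <- aggregate(survey$cm, by = list(gender = survey$gender),
                 median, na.rm = TRUE)
res

# Formula interface: the result column keeps the name "cm"
res2 <- aggregate(cm ~ gender, data = survey, FUN = median)
```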

54
Q

R:

How can you get a graphic to illustrate the height statistics, showing the quartiles of two gender subsets of a data object survey with subsets cm and gender ?

use red for female and blue for male.

A

> boxplot(survey$cm~survey$gender,col=c("red","blue"))

55
Q

R:

Boxplot - what are the whiskers? how big are they?

A

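extensions of the box (the IQR)

Each whisker extends at most 1.5 x IQR beyond the box, ending at the most extreme data point within that range.

anything beyond this: outlier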
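
The numbers behind a boxplot can be inspected with `boxplot.stats()`, which uses the same 1.5 x IQR rule by default. The data below are made up.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # hypothetical data with one extreme value
s <- boxplot.stats(x)                    # default coef = 1.5

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers, drawn individually as outliers
```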

56
Q

What is Z-Score Center-Deviation/Dispersion?

A

Normalisation procedure

  • data transformation
  • data now have mean of 0
  • data now have sd of 1
  • for normally distributed data, 95% of the values have a z-score between -1.96 and +1.96

z = (x - x̄) / s

56
Q

R:

You have a data object survey with subsets gender, smoker.

How can you make a table showing the number per gender? eg:
F M
297 175

How can you make a table showing the number per gender divided by whether they smoke or not? eg:
N Y
F 260 36
M 148 27

Do you need to do anything about NA’s?

Can you apply functions to the resulting tables?

A

> table(survey$gender)

> table(survey$gender,survey$smoker)

NA’s are left out automatically (the default behaviour of table())

Yes, you can
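
A small sketch with an invented stand-in for `survey`, including one missing gender, to show the default NA handling and a function applied to the resulting table:

```r
# Hypothetical stand-in for the survey data
survey <- data.frame(
  gender = c("F", "F", "M", "M", "F", NA),
  smoker = c("N", "Y", "N", "N", "N", "Y")
)

by_gender <- table(survey$gender)                 # NA is left out by default
by_both   <- table(survey$gender, survey$smoker)  # gender by smoker
by_gender
prop.table(by_gender)                             # proportions instead of counts
table(survey$gender, useNA = "ifany")             # show the NA count explicitly
```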

57
Q

R:

How can z-score be implemented?

A

Two ways:

> summary((survey$cm - mean(survey$cm,na.rm=TRUE))/sd(survey$cm,na.rm=TRUE))

> summary(scale(survey$cm))
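
Both ways give the same z-scores, which can be checked on a made-up vector:

```r
x <- c(160, 165, 170, 175, 180, NA)   # hypothetical heights in cm

z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
z_scale  <- as.numeric(scale(x))      # scale() returns a one-column matrix

round(z_manual, 2)
all.equal(z_manual, z_scale)          # TRUE
```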

58
Q

R:

What is the cut function?

A

The cut function is used in R for cutting a numeric value into bins of continuous values

numeric –> categorical (factor)

‘poor man’s start’ to get a first overview of the data

58
Q

R:

what is known as factors in R?

A

categorical data

59
Q

R:

You have a data object survey with subset cm.

How can you make a new object cSize which holds the cm info, but converted from numeric to a factor with three intervals?

How can you give the intervals names as well?

A

> cSize=cut(survey$cm,c(0,160,185,250))

(values are breakpoints, so the intervals will be (0,160], (160,185], (185,250]; by default each interval includes its upper bound)

> levels(cSize)=c("dwarfs","normals","giants")
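
The same steps on a short made-up vector of heights show where the boundary values end up:

```r
heights <- c(150, 160, 161, 185, 190)       # hypothetical heights in cm

cSize <- cut(heights, c(0, 160, 185, 250))
levels(cSize)    # "(0,160]" "(160,185]" "(185,250]"

levels(cSize) <- c("dwarfs", "normals", "giants")
as.character(cSize)
table(cSize)
```

160 falls into the first interval and 185 into the second, because the intervals are closed on the right.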

59
Q

R:

With cut , how can you make the result not just categorical nominal, but also ordinal?

A

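> cSize=cut(survey$cm,c(0,160,185,250),ordered_result=TRUE)

(ordered_result is the full argument name; ordered=TRUE only works through R’s partial argument matching)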
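
A short check on made-up heights that the result really is an ordered factor. The `labels` argument is used here as an alternative to setting `levels()` afterwards.

```r
heights <- c(150, 170, 190)   # hypothetical heights in cm

cSize <- cut(heights, c(0, 160, 185, 250),
             labels = c("dwarfs", "normals", "giants"),
             ordered_result = TRUE)

is.ordered(cSize)      # TRUE
cSize[1] < cSize[3]    # TRUE: comparisons are meaningful for ordered factors
```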