5 | Introduction To Statistics Flashcards
(POLL)
Two sections of one hospital have very different survival rates for patients with heart problems. Station 2 (second floor) performs much better than station 1 (ground floor). Who is most likely responsible?
- The doctors on station 2; they are just better!
- The porter, who usually asks the patients: “Do you feel fit enough to walk up the stairs to the second floor?”
- The patients, who just feel better on higher floors
- None of them
The porter, who usually asks the patients: “Do you feel fit enough to walk up the stairs to the second floor?”
(POLL)
Looking at the relation between dark chocolate consumption and IQ is:
- correlational research
- experimental research
- making me hungry
- none of these suggestions
correlational research
(POLL)
Double-blind trials are clinical trials where neither the patient nor the doctor knows which medication is given. Select the true statements:
- they are correlational research
- they are experimental research
- they are worse than observational studies because they increase selection bias
- they are better than observational studies because they decrease selection bias
- they have no selection bias
- they still can have selection bias
- they are experimental research
- they are better than observational studies because they decrease selection bias
- they still can have selection bias
(POLL)
To evaluate an outcome of a patient after a virus infection the following boxes were prepared for a survey: asymptomatic, common cold, long term suffering, dead … What type of variable is this?
- discrete numerical 0, 1, 2, 3 etc for the levels
- continuous numerical 0.0 1.0, 2.0, 3.0 for the level
- nominal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)
- ordinal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)
ordinal categorical, 0 (asymptomatic), 1 (cold), 2 (suffer a lot), 3 (dead)
(POLL)
Which of the following measures are robust against outliers?
- mean
- trimmed mean
- Median
- trimmed mean
- Median
(POLL)
Which measures give information about the spread of the data?
- mean
- trimmed mean
- median
- IQR
- sd
- z-score
- IQR
- sd
Name some terms from statistics which are deceptive when compared with the terms in the context of science
significant, error, hypothesis
Explain the difference between dependent and independent variables and give an example
dependent:
– depends on another –> an outcome variable
independent
– variable influences another (dependent) variable –> is a predictor variable
– might be manipulated
example:
let’s assume we can predict weight based on sex
and height (machine learning!)
– weight is the outcome variable (dependent)
- sex and height are predictor variables (independent)
Define descriptive statistics.
vs inferential?
📊descriptive statistics:
- describe main features of data (sample)
- in quantitative terms
vs inferential statistics:
- used to support inferential statements
- data (sample👥) –> population 👥👥👥
Define inferential statistics
Statistical inference or statistical induction comprises the use of statistics and random sampling to make inferences concerning some unknown aspect of a population.
Name the different sample data centers
- Modus (mode): most frequent value
- Median: value where 50% of data are smaller and 50% of data are larger (robust against outliers)
- IQR (interquartile range) = 3rd quartile - 1st quartile = middle 50%
- Mean
Which sample data center can be used for nominal datatypes?
- modus
Which sample data center can be used for ordinal datatypes?
- modus
- (median ?)
- (mean ?)
Which sample data center can be used for numerical datatypes?
- median
- mean
How can one describe the sample distribution?
with:
- max, min
- quantile
- IQR (interquartile range)
- standard deviation
- CV (coefficient of variation)
Do the following describe the sample distribution?
- SEM
- CI
- P-value
No, they describe the population rather than the sample.
SEM:
What does this stand for?
What does it measure?
How is it calculated?
stands for:
standard error of the mean.
measures:
how far the sample mean is likely to deviate from the population mean.
calculate:
SD / sqrt(N)
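The formula above can be sketched in R as follows. The vector `cm` is a made-up stand-in for a real height column, and the missing value is dropped by hand so that N counts only observed values.

```r
# Hypothetical sample of heights in cm (made-up values, one missing)
cm <- c(158, 162, 171, 175, 180, 168, NA, 190)

x <- cm[!is.na(cm)]             # drop missing values first
sem <- sd(x) / sqrt(length(x))  # SEM = SD / sqrt(N)
sem
```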
CI:
What does this stand for?
What does it measure?
How is it calculated?
Confidence interval
The range of values within which an estimate is expected to fall if the test is repeated, at a given level of confidence.
Confidence, in statistics, is another way to describe probability.
CI = mean of estimate plus and minus variation in that estimate.
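As a rough illustration of "mean plus and minus variation", here is a minimal R sketch of a 95% interval for a mean. It assumes a t-based interval (the card does not say which method is meant) and uses made-up data.

```r
# Hypothetical sample of heights in cm (made-up values)
x <- c(158, 162, 171, 175, 180, 168, 190)

n   <- length(x)
sem <- sd(x) / sqrt(n)

# 95% CI using the t distribution with n - 1 degrees of freedom
margin <- qt(0.975, df = n - 1) * sem
ci <- c(lower = mean(x) - margin, upper = mean(x) + margin)
ci

# t.test() reports the same interval
t.test(x)$conf.int
```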
P-value
What does this stand for?
What does it measure?
How is it calculated?
p stands for probability
P-values are used in hypothesis testing to help decide whether to reject the null hypothesis (inferential statistics).
Describes how likely you would be to find observations at least as extreme as yours if the null hypothesis were true.
Calculated from a statistical test.
IQR
Stands for?
Meaning?
Interquartile range.
In descriptive statistics: tells you spread of middle half of distribution.
Quartiles segment any distribution that’s ordered from low to high into four equal parts.
The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set
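A small R sketch of the same idea, using made-up values: the IQR is the distance between the 75% and 25% quantiles, and base R's `IQR()` computes it directly.

```r
# Made-up values, sorted for readability; 250 is an extreme value
x <- c(150, 155, 160, 165, 170, 175, 180, 185, 250)

q <- quantile(x, c(0.25, 0.75))
iqr_by_hand <- q[["75%"]] - q[["25%"]]  # 3rd quartile minus 1st quartile
iqr_by_hand
IQR(x)                                  # same value from the built-in function
```

Note that the extreme value 250 does not affect the result, which is why the IQR counts as a robust measure of spread.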
What is meant by parameters vs statistics?
A parameter is a number describing a whole population (e.g., population mean)
A statistic is a number describing a sample (e.g., sample mean).
We use the sample to estimate the parameters of the population.
Parameters and statistics have the same name but mean different things (eg mean of sample ȳ ≈ μ, mean of the population)
How can you tell which one is parameter and which one is statistic? eg ȳ, μ
Usually:
- use Latin letters for statistics (sample)
- use Greek letters for parameters (population)
How is deviation of samples and populations described?
- deviation from the mean: s (sd, standard deviation)
- (sample variance s², population variance σ²)
How is the uncertainty/quality of results described, in inferential statistics?
SEM
CI
R:
how do you get the median from a data object survey with a subset cm?
What if there are some empty cells?
> median(survey$cm)
> median(survey$cm,na.rm=T)
R:
what does this command do?
> summary(survey$cm)
returns:
- min
- 1st qu
- median
- mean
- 3rd qu
- max
- NA’s
of that dataset
R:
how do you get the mean from a data object survey with a subset cm, and we want it to be less sensitive to outliers? How does this work exactly?
> mean(survey$cm,na.rm=TRUE,trim=0.1)
removes the lowest and highest 10% of the values before the mean is calculated
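To see what the trimming does, here is a minimal sketch on made-up data with one obvious outlier. With 10 values and `trim=0.1`, one value is dropped from each end.

```r
# Made-up heights, already sorted; 300 is an outlier
x <- c(150, 160, 162, 165, 168, 170, 172, 175, 180, 300)

mean(x)                           # pulled upward by the outlier
trimmed <- mean(x, trim = 0.1)    # drops floor(10 * 0.1) = 1 value per end
by_hand <- mean(x[2:9])           # same calculation done manually
trimmed
by_hand
median(x)                         # unaffected by the outlier
```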
R:
how do you get the min / max / standard deviation ?
min()
max()
sd()
R:
How do you get the quantiles for a data object survey with a subset cm? and also take care of any NA’s
> quantile(survey$cm,c(0.25,0.5,0.75),na.rm=T)
R:
you have a data object survey with 4 subsets. how can you get a summary for all of them?
> summary(survey[,c(1:4)])
R:
mean, trimmed mean, median - what is sensitive to outliers?
- mean: too sensitive to outliers!
- trimmed mean: less sensitive
- median: not sensitive - most robust measure
–> use trimmed mean or median if there are outliers
R:
what does the aggregate function do?
- compute summary stats for data subsets
- very similar to tapply() but:
- can also input formula or time series object
- output is df
R:
how can you use the aggregate function on a data object survey with subsets cm and gender to get the median height for each gender?
> aggregate(survey$cm,by=list(survey$gender),median,
na.rm=T)
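The formula input mentioned earlier could look like the sketch below. The small `survey` data frame is a made-up stand-in with the same column names, not the real data.

```r
# Hypothetical stand-in for the survey data (made-up values)
survey <- data.frame(
  cm     = c(160, 165, 170, NA, 175, 180, 185, 190),
  gender = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Formula interface: cm grouped by gender; rows with NA are dropped by default
res <- aggregate(cm ~ gender, data = survey, FUN = median)
res

# tapply() gives the same medians as a named vector instead of a data frame
tapply(survey$cm, survey$gender, median, na.rm = TRUE)
```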
R:
How can you get a graphic to illustrate the height statistics, showing the quartiles of two gender subsets of a data object survey with subsets cm and gender ?
use red for female and blue for male.
> boxplot(survey$cm~survey$gender, col=c("red","blue"))
R:
Boxplot - what are the whiskers? how big are they?
extensions of the box (the IQR)
they reach to the most extreme data point within 1.5 x IQR of the box
anything beyond this: outlier
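The numbers behind a boxplot can be inspected with `boxplot.stats()`, sketched here on made-up data. One caveat: R's boxplot uses hinges, which are close to but not always identical to the quartiles from `quantile()`.

```r
# Made-up heights; 300 is an extreme value
x <- c(150, 160, 162, 165, 168, 170, 172, 175, 180, 300)

s <- boxplot.stats(x)  # default coef = 1.5, i.e. whiskers reach at most 1.5 x IQR
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers, drawn as outliers
```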
What is Z-Score Center-Deviation/Dispersion?
Normalisation procedure
- data transformation
- data now have mean of 0
- data now have sd of 1
- for normally distributed data, 95% of values have a z-score between -1.96 and +1.96
z = (x - x̄) / s
R:
You have a data object survey with subsets gender, smoker.
How can you make a table showing the number per gender? eg:
F M
297 175
How can you make a table showing the number per gender divided by whether they smoke or not? eg:
N Y
F 260 36
M 148 27
Do you need to do anything about NA’s
Can you apply functions to the resulting tables?
> table(survey$gender)
> table(survey$gender,survey$smoker)
NA’s are removed automatically
Yes, you can
R:
How can z-score be implemented?
Two ways:
> summary((survey$cm - mean(survey$cm,na.rm=TRUE))/
sd(survey$cm,na.rm=TRUE))
> summary(scale(survey$cm))
R:
What is the cut function?
The cut function is used in R for cutting a numeric value into bins of continuous values
numeric –> categorical (factor)
a ‘poor man’s start’ to get a first overview of the data
R:
what is known as factors in R?
categorical data
R:
You have a data object survey with subset cm.
How can you make a new object cSize which holds the cm info, but changed from numeric to factor? with three intervals .
Give the divisions a name as well?
> cSize=cut(survey$cm,c(0,160,185,250))
(values are breakpoints, so the intervals will be (0,160], (160,185], (185,250])
> levels(cSize)=c("dwarfs","normals","giants")
R:
With cut , how can you make the result not just categorical nominal, but also ordinal?
> cSize=cut(survey$cm,c(0,160,185,250),ordered_result=TRUE)
Aims of statistics?
- Describe and summarize data
- Visualize trends for better understanding
- Make conclusions about a population based on analysis of the sample
What types of conclusions can we make about a population based on the sample using statistics?
- Decide whether two groups can be taken as different
- Decide if sample is different to total population
- Describe the relationship between two variables (proportional analysis)
- Decide if data difference is just a random one
Statistics and randomness?
- Statistics just estimates “degree” of randomness
- Statistics can’t tell you how likely it is that there is really a difference
What is important when sampling?
- Sample items must be randomly selected (not a subset but a random selection!)
- Items must be independent from each other (selection of one item should not alter the chance of other items to be selected)
- Having a plan before entering the greenhouse
- More samples are better
- Groups: balanced sampling is better
Example of when random sampling is difficult?
- Cats falling from windows: the information usually came from veterinarians, so it covered just the cats that survived the fall (= convenience sample)
- Roosevelt survey: telephone interviewing. Mostly only wealthy people had phones, and they tended to be Republicans
- Coronavirus: selective testing - no proper sampling of population!
Common sampling problems
- your cohort and topic might change over time (cancer, virus)
- true population is more diverse than the population you were sampling from
- using a convenience sample rather than a random sample (falling cats)
- your measured variable is just a proxy for another variable
- imprecise measurements (misunderstandings, wrong scale for some people)
- combination of different measurements required
- even with clear results: there is room for interpretation
- statistics can’t help against bad experimental design
Name 5 sampling strategies
- Accidental sampling (close to hand) → often very biased
- Simple random sampling
- Systematic sampling (eg every 10th person)
- Stratified sampling (making subgroups based on categories)
- Cluster sampling
What is simple random sampling? What problems can arise and how can this be improved upon?
- SRS
- Every element has same chance of selection as sample (eg dice)
- Randomness might be a problem, especially for large populations or small samples
- Systematic or stratified sampling might overcome this
- Example: balancing sexes 50/50 or 60/40 (UP ratio)
What is systematic sampling? Aka? Example? What problem may arise?
- Arranging population by some ordering
- Selecting elements in regular intervals
- Aka interval sampling
Example: - Every kth person, but don’t start at the beginning
Issues: - Vulnerable to periodicities, every 10th house is on street crossings …
What is stratified sampling?
- You know categories in your data eg male, female, old, young…
- Sample from each category according to your distribution of those categories in your population
- One of the methods above for sampling within the strata
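The three strategies described above can be sketched in R roughly as follows. The population data frame, the sample sizes, and the 60/40 gender split are all made up for illustration.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical population of 100 people (made-up)
pop <- data.frame(
  id     = 1:100,
  gender = rep(c("F", "M"), times = c(60, 40))
)

# Simple random sampling: every row has the same chance of selection
srs <- pop[sample(nrow(pop), 10), ]

# Systematic sampling: every k-th row, starting at a random offset
k <- 10
start <- sample(k, 1)
sys <- pop[seq(start, nrow(pop), by = k), ]

# Stratified sampling: 10% from each gender, preserving the 60/40 ratio
strat <- do.call(rbind, lapply(split(pop, pop$gender), function(g) {
  g[sample(nrow(g), round(0.1 * nrow(g))), ]
}))
table(strat$gender)
```

Within each stratum this sketch uses simple random sampling; systematic sampling inside the strata would work as well.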
What two types of research can be done on the sample?
- correlational research
- experimental research
What is meant by correlational research?
- Just observe what happens in nature
- we don’t manipulate a variable
- example: does reading books improve learning?
- we just collect answers about reading behaviour
What is meant by experimental research?
- we manipulate a variable
- we divide our sample into two groups for reading
- one group must read statistics books
- the other group is not allowed to read statistics books
- after a month we summarize the results
Correlational vs experimental - why would you choose correlational?
It’s difficult to research experimentally for some reason:
- ethical reasons
- financial reasons
What is the main distinction when it comes to types of data?
- Measurement levels
- Categorical (qualitative) vs numerical
What is meant by data dimensions?
- Uni-, bi-, or multivariate data
How can categorical data types be further distinguished?
- Nominal: eg gender, smoker, protein structures, nucleotides etc
- Ordinal: age (young, medium, old), grade, month etc
How can numerical data be further distinguished?
- Discrete: eg age (1,2,3,4..), height (eg 100, 101, 102 cm), length of helices
- Continuous: eg height (eg 100.100… cm), weight (79.998… kg)
Define this level of measurement and give an example:
nominal
A categorical level of measurement that doesn’t have any order.
eg
- gender: female, male (binary)
- smoker: yes, no (binary)
- protein structures: H, E, …
- nucleotides: A,C, G, T, U
Define this level of measurement and give an example:
ordinal
A categorical level of measurement with a specified order.
eg:
- age: young,medium,old
- grade: 1,2,3,4,5
- month: 01..12(?)
Define this level of measurement and give an example:
Discrete
A numerical level of measurement, of discrete numbers.
eg:
– age: 6, 8, 84
– height: 112, 176, 161cm
– number of helices per 1000 AA
– length of helices
Define this level of measurement and give an example:
Continuous
A numerical level of measurement, non-discrete
eg:
– weight: 79.99kg, 72kg,…
– height: 12.2, 12.5, 15.0
What can you ask about a datatype, in order to determine which datatype/level of measurement it is?
Can you calculate a mean?
- Yes –> Is the mean always a possible value?
  - Yes –> continuous numerical
  - No –> discrete numerical
- No –> Is there a logical order of values?
  - Yes –> ordinal categorical
  - No –> nominal categorical
Explain univariate, bivariate, multivariate
The terms describe how many variables are involved in the statistical problem: one (univariate), two (bivariate), or more than two (multivariate).
What is important to consider regarding group number and distribution type when analysing categorical data ?
- Different numbers of groups and different distribution types require different statistical tests
In inferential statistics, which test would you use if you have 2 groups and a normal distribution? Eg running times for female vs male.
- t test
In inferential statistics, what test would you use if you have 3 or more groups and a normal distribution? Eg running time for young, medium, and old ages.
- ANOVA
In inferential statistics, what test would you use if you have 2 groups and a non-normal distribution?
- Wilcoxon test (wilcox.test() in R)
In inferential statistics, what test would you use if you have 3 groups and a non-normal distribution?
- Kruskal-Wallis test (kruskal.test() in R)
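The four cases above map onto base R functions as in the sketch below. The running times are simulated, so the data, group sizes, and resulting p-values are purely illustrative.

```r
set.seed(42)  # only for reproducibility

# Made-up running times in minutes for 30 people
time <- c(rnorm(10, 30, 3), rnorm(10, 27, 3), rnorm(10, 33, 3))
sex  <- factor(rep(c("F", "M"), 15))
age  <- factor(rep(c("young", "medium", "old"), each = 10))

p_t <- t.test(time ~ sex)$p.value        # 2 groups, normal distribution
fit <- aov(time ~ age)                   # 3 or more groups, normal distribution
summary(fit)
p_w <- wilcox.test(time ~ sex)$p.value   # 2 groups, non-normal distribution
p_k <- kruskal.test(time ~ age)$p.value  # 3 or more groups, non-normal distribution
c(t = p_t, wilcoxon = p_w, kruskal = p_k)
```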
What are the different ways to find the center of the sample?
- Mean – sensitive to outliers
- Trimmed mean – less sensitive to outliers
- Median – not sensitive to outliers
- Modus (mode) – most frequent value; more relevant to categorical data
Regarding dependent and independent variables, which might be manipulated in an experiment?
- independent variable
What is an example of categorical nominal data and which statistic would you use for the sample data center?
- Gender
- Modus: most frequent value
What is an example of categorical ordinal data and what statistic would you use for the sample data center?
- Age (young, medium, old)
- Modus is most appropriate
- (Median, mean could also possibly be used if the data can be converted to numerical)
When are median and mean most appropriate?
- Numerical data – both discrete and continuous
How can we figure out the sample distribution?
- Max, min, IQR, standard deviation, CV
- (To describe the population: SEM, CI, p-value)
What are statistics vs parameters?
- Sample → characterised by statistic
- Population → characterised by parameter
- We use the sample to estimate the parameters of the population
- parameters and statistics have the same name but mean different things
- Usually: latin letters for sample, Greek letters for population
What are uncertainty qualities?
* SEM, CI
* How much can we trust the statistics
How can we look at the dispersion of a sample?
* SD
* Quartiles
* IQR = the range of the middle 50% of the data, around the median (the box of a boxplot is the IQR)
Boxplot: what do the whiskers represent?
* They usually extend at most 1.5 x the IQR beyond the box (points outside this are outliers)
What is z-score?
* A Transformation →
Normalisation procedure
* New mean: 0
* New sd: 1
* For normally distributed data, 95% of values have a z-score between -1.96 and +1.96
What is the formula for Z-Score Center-Deviation/Dispersion?
z = (x - x̄) / s
R:
What does as.factor() do?
- converts a numeric or character vector into a factor with levels
- converts a vector into a factor, preserving its categorical nature.
- Factors are essential for handling categorical data in R.
R:
What does this do: mean(data, na.rm=TRUE, trim=0.1)
- Returns the mean of data after removing any NAs and trimming the lowest and highest 10% of values, which makes it less sensitive to outliers.
R:
Functions for median, minimum, maximum, standard deviation?
- median()
- min()
- max()
- sd()
R:
What does quantile() function do?
- Returns median (value at 50%, which divides sample in 2)
- Along with other interquartile values (25%, 75%)
R:
What does the summary() function do?
- Numerical data: returns min, max, quartiles, median, mean, number of NAs
- Categorical data: returns count of groups, NAs
R:
What is the modus function in R?
- Trick question – there isn’t one
- But it can easily be implemented
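One possible implementation is sketched below. The helper name `modus` is hypothetical and not part of base R (the built-in `mode()` reports an object's storage mode, not the most frequent value). This version returns all values that tie for the highest count, as character strings.

```r
# Hypothetical helper (not part of base R): most frequent value(s) of x
modus <- function(x) {
  x <- x[!is.na(x)]                     # ignore missing values
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # all values tied for the top count
}

modus(c("A", "C", "G", "A", "T", "A", NA))
modus(c(1, 1, 2, 2, 3))  # a tie returns both values
```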
R:
What can we use the aggregate() function for ?
- Syntax: aggregate(x, by, FUN, …)
- apply a function to subsets of data, typically grouped by one or more factors or variables.
- The data in x is divided into groups based on the by variable(s).
- The specified function FUN is applied to each subset of the data.
- Result is a summary table (df) showing grouped values, computed summaries.
R:
What function can we use for z score normalisation?
- scale()
R:
What does the table() function do / which type of data do we use it on?
- Tabulates data to give summary (no NAs)
- Table is a matrix so relevant functions can be applied
- useful mostly for categorical data
- (numerical data can be transformed into categorical data and that way tabulated as well )
R:
What is a factor in R?
- Categorical data
R:
How can we transform data from numerical to categorical?
- cut() function, assign levels with function
- Example:
- csize = cut(survey$cm, c(0,160,185,250)) [add ordered_result=TRUE to keep the order]
- levels(csize) = c("dwarfs","normal","giants")
R:
Ordinal data, special type of categorical data – what’s the difference?
- nominal factors: can only use == or !=
- ordered factors: can also use comparison operators: <, <=, >, >=
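A short sketch of the difference, using made-up heights and the same breakpoints and level names as the cut() cards:

```r
cm <- c(150, 170, 190, 165)  # made-up heights in cm

cSize <- cut(cm, c(0, 160, 185, 250),
             labels = c("dwarfs", "normals", "giants"),
             ordered_result = TRUE)
cSize

at_least_normal <- cSize >= "normals"  # comparison is defined for ordered factors
at_least_normal
max(cSize)                             # so are min() and max()
```

On an unordered factor the same comparison returns NA with a warning, which is why the ordered variant matters for ordinal data.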
(QUIZ)
The goal of ______ statistics is to summarize and describe the ______ whereas the goal of ______ statistics is to draw conclusions about the population. We use sample ______ to estimate unknown ______ of the population.
The goal of descriptive statistics is to summarize and describe the sample whereas the goal of inferential statistics is to draw conclusions about the population. We use sample statistics to estimate unknown parameters of the population.