Lecture 1 - Basics Flashcards

1
Q

What is a parameter?

A

an attribute of a population, or relationship between populations or variables

2
Q

What is a population?

A

The complete set of items about which you want to draw inferences

3
Q

What is a sample?

A

a random subset of the population

4
Q

exploratory vs confirmatory

A

Exploratory is generating hypotheses, confirmatory is testing hypotheses

5
Q

how to calculate the standard error (SE)

A

SD divided by the square root of the sample size (n): SE = SD / √n
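
A minimal Python sketch of this calculation, assuming the SD meant here is the sample SD (n − 1 divisor). The function name and the data are made up for illustration.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD, uses the n - 1 divisor
    return sd / math.sqrt(n)

sample = [4.0, 6.0, 8.0, 10.0]  # made-up data
se = standard_error(sample)
print(round(se, 4))
```

Because n sits under a square root, quadrupling the sample size only halves the SE.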

6
Q

type 1 error

A

false rejection of H0

7
Q

type 2 error

A

false acceptance of H0 (failing to reject H0 when it is actually false)

8
Q

internal validity

A

extent to which you believe results of the study

9
Q

external validity

A

extent to which results apply to the real world

10
Q

observational vs experimental

A

in an experiment, only one factor varies between conditions, so if there's a difference you can be confident it is causal. In an observational study other factors may vary too, so a difference need not be causal

11
Q

what is the mean (for a normal distribution)?

A

value that minimises the sum of SQUARED deviations of data values from it
The average (i.e. sum(x)/N)
The central value
The most likely value to get if you were to sample from the population
Centre of ‘mass’

12
Q

what is the median?

A

value that minimises the sum of ABSOLUTE (unsquared) deviations of data values from it
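
The two "minimising" definitions (mean for squared deviations, median for absolute deviations) can be checked numerically. This is only an illustrative sketch: the data are made up, and a simple grid search over candidate centres stands in for a proper optimiser.

```python
import statistics

def sum_squared_dev(data, centre):
    """Sum of squared deviations of the data from a candidate centre."""
    return sum((x - centre) ** 2 for x in data)

def sum_abs_dev(data, centre):
    """Sum of absolute (unsquared) deviations from a candidate centre."""
    return sum(abs(x - centre) for x in data)

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up, strongly skewed data
mean = statistics.mean(data)
median = statistics.median(data)

# Try every candidate centre from 0.0 to 100.0 in steps of 0.1
candidates = [i / 10 for i in range(0, 1001)]
best_sq = min(candidates, key=lambda c: sum_squared_dev(data, c))
best_abs = min(candidates, key=lambda c: sum_abs_dev(data, c))

print(mean, best_sq)     # the squared-deviation minimiser matches the mean
print(median, best_abs)  # the absolute-deviation minimiser matches the median
```

The single extreme value drags the mean far from the bulk of the data but leaves the median almost untouched.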

13
Q

which test if data is categorical?

A

chi squared

14
Q

which test if data is ranked or ordinal?

A

non-parametric:
• Mann-Whitney U (or Wilcoxon-Mann-Whitney) – analogue of the two-sample t-test, comparing the medians of 2 independent groups (groups must be completely independent of each other)
• Wilcoxon one-sample test – analogue of the one-sample t-test
• Wilcoxon signed-rank test – analogue of the paired t-test; it compares two related samples, matched samples, or repeated measurements on a single sample
• Kruskal-Wallis (analogue of one-way ANOVA with independent measures) – extends Mann-Whitney U when there are >2 samples. It compares two or more independent samples of equal or different sample sizes
• Friedman test (analogue of one-way ANOVA with repeated measures)
• Spearman correlation (non-parametric version of Pearson correlation)

15
Q

what test to use for interval or ratio data?

A

parametric: t-test, ANOVA, regression, etc.

16
Q

what is an outlier, and who gave this definition?

A

an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism - Hawkins

17
Q

real life examples of when we look for outliers

A

Fraud detection & credit card theft- Unusual spending patterns
Medical diagnosis - Problems suggested by test results that don’t fit normal pattern for age/sex/history
Detecting drug cheats in sport - Abnormally high/low blood steroids, etc.
Detecting measurement errors or unusual events

18
Q

3 ways to detect outliers

graphs used to detect them

boxplot- what do they need to be?

A

Analyse sample - data points that are ‘far from’ the rest are outliers
Fit model to data – outliers are points that don’t fit that model
Either approach can be statistical or purely graphical (‘by eye’)

Start with graphical methods

can use a histogram, scatterplot, normal probability plot (or Q-Q plot), or boxplot. In a boxplot, points more than one-and-a-half inter-quartile ranges above the upper quartile, or more than one-and-a-half inter-quartile ranges below the lower quartile, are plotted as circles and so identified as possible outliers.
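
The boxplot rule can be sketched in a few lines of Python. This is an illustration under assumptions: the data are made up, and the quartiles come from `statistics.quantiles` with the "inclusive" method, so the fences may differ slightly from those of a particular plotting package, which may define quartiles differently.

```python
import statistics

def boxplot_outliers(data):
    """Return points lying more than 1.5 IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

data = [2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme point
outliers = boxplot_outliers(data)
print(outliers)
```

Points flagged this way are only candidates: the rule says they are far from the rest, not why.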

19
Q

what is the R package for outliers? and what could be a drawback?

A

the outliers package, e.g. its Dixon test (dixon.test); drawback: it only works with small sample sizes (<30)

20
Q

which two tests are robust to even quite large violations of normality and homogeneity of variance and when do they become less so?

A

ANOVA and t-test - One-way ANOVA only starts to give odd results if largest variance is >9x smallest

21
Q

what is homogeneity of variance? what test is it used in and when can this not be violated?

A

The assumption of homogeneity of variance is that the variance within each of the populations is equal. This is an assumption of analysis of variance (ANOVA). ANOVA works well even when this assumption is violated except in the case where there are unequal numbers of subjects in the various groups.

22
Q

what are ‘robust’ statistical methods? (there are 4 answers)

A

Non-parametric tests: simple, accepted, but range of tests limited
Parametric tests on ranked data
Permutation tests: flexible, but require decent sample size per group (see lecture on Monte Carlo methods)
Parametric methods that work on the ‘middle’ of the data
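
As a rough illustration of the permutation-test idea, here is a minimal Python sketch for a difference in means between two groups. The function name, the data, the number of permutations and the choice of test statistic are all assumptions made for the example, not taken from the lecture.

```python
import random

def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if mean_diff(pooled) >= observed - 1e-12:
            extreme += 1
    # Add 1 to both counts so the estimated p-value is never exactly 0
    return (extreme + 1) / (n_perm + 1)

group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]  # made-up measurements
group_b = [6.9, 7.2, 6.5, 7.8, 7.1, 6.8]
p_value = permutation_test(group_a, group_b)
print(p_value)
```

The p-value is the share of random relabellings that give a difference at least as large as the observed one, which is why very small groups (few possible relabellings) cannot produce small p-values.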

23
Q
non-parametric equivalents of the following:
one-sample t-test
two-sample t-test
paired t-test
one-way ANOVA
one-way repeated-measures ANOVA
correlation
A
one-sample t-test = Wilcoxon one-sample test
two-sample t-test = Mann-Whitney U
paired t-test = Wilcoxon matched-pairs signed-rank test
one-way ANOVA = Kruskal-Wallis
one-way repeated-measures ANOVA = Friedman test
correlation = Spearman
24
Q

when residuals are normally distributed (and so parametrics are OK) the non-parametric equivalents are … less powerful.

A

5%

25
Q

When residuals are non-normal, which has more power: non-parametric or parametric?

A

non parametric

26
Q

what are false positives?

A

type 1 error

27
Q

pros and cons of ranked parametric

A

Doing parametric stats on rank-transformed data seems to give you the same Type I error rate and power as the custom non-parametric test
… and has the bonus of more possible tests being available (i.e. any parametric test)
But, you can’t interpret the exact form of a relationship sensibly, so beware
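
A small Python sketch of "parametric stats on rank-transformed data": a Pearson correlation computed on ranks (which is how the Spearman correlation is defined). The helper names and the data are made up for illustration.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 100.0]  # increases with x, but not linearly
r_raw = pearson(x, y)
r_rank = pearson(ranks(x), ranks(y))
print(round(r_raw, 3), r_rank)
```

The rank version reports a perfect monotonic relationship but says nothing about its shape, which is the caveat in the card above.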

28
Q

limitation of regression on ranked data

A

Regression of rank(Y) on rank(X) tells you they are related, but not the shape of curve

robust regression ignores outliers

29
Q

why are missing values bad?

A

Lead to unbalanced design and so…

Loss of power (a shame but not tragic)

30
Q

types of missing values

A

Missing Completely At Random - good

Missing At Random = missingness is not completely random, but the unobserved values follow the same pattern as the observed values

Not Missing At Random = Unobserved values are different from the observed ones – a bigger problem.

31
Q

when are missing values okay?

A

when they are missing completely at random, or they are not random but follow the same pattern as the observed values.

32
Q

when are missing values bad?

A

if not missing at random and if the unobserved values are different from the observed ones

33
Q

what to do if missing at random?

A

ignore NA’s, unless repeated measures..
Step 1 : investigate missing values, where are they
Find whether the ‘NA’s are randomly distributed with respect to the predictors and responses

can do Loglinear model on missing/non-missing to see if they are missing at random!

or can replace with values that don’t bias subsequent analyses, such as the most common (central) value for that variable, i.e. the mean, if it is normally distributed
BUT if the data are skewed use the median!!

Or predict them from other variables that are correlated, OR from the 10 nearest neighbours

34
Q

what does DMwR package use when fitting most central value?

A

median for numeric variables

the mode for categorical variables
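
Central-value replacement can be sketched in Python as below. This is a hypothetical stand-in written for illustration, not the DMwR implementation: the function name is invented, missing values are represented as `None`, and the data are made up.

```python
import statistics

def impute_central(values, categorical=False):
    """Fill missing entries (None) with the median, or the mode if categorical."""
    observed = [v for v in values if v is not None]
    if categorical:
        fill = statistics.mode(observed)    # most common category
    else:
        fill = statistics.median(observed)  # robust to skew and outliers
    return [fill if v is None else v for v in values]

weights = [10.0, 12.0, None, 11.0, 50.0]  # made-up numeric variable
sexes = ["F", "M", "F", None, "F"]        # made-up categorical variable

weights_filled = impute_central(weights)
sexes_filled = impute_central(sexes, categorical=True)
print(weights_filled)
print(sexes_filled)
```

Using the median rather than the mean keeps the one extreme weight from pulling the filled-in value upwards.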

35
Q

survival analysis examples

A

- Parametric: specify a distribution, e.g. exponential, Weibull
- Cox regression: ‘semi-parametric’; assumes survival curves have the same shape, but different rates
- Non-parametric: can only test one factor

36
Q

what to do with outliers?

A

delete if known cause.
Use robust statistics to reduce their influence
Or replace with ‘non-influential’ values

37
Q

what is censoring?

A

Data points where the actual value isn’t known but you can set boundaries on what it must have been

Censoring is a special case of ‘partial information’ (not missing) that can be dealt with by survival analysis

38
Q

what is the basis to a statistical model?

A

Pattern observed = signal + noise

the mean is a fixed signal, as is a slope, as is the difference between two means
Random normal variation – the ‘noise’

39
Q

what are we modelling?

A

not the data, but the population from which our sample came

40
Q

overfitting: how can it come about?

A

more parameters can mean a better fit, but there is a risk of overfitting: the model can just end up redescribing the data

41
Q

Can all population parameters be estimated accurately with a large sample size?

A

no, better to start with the mean - can estimate the population mean from your sample mean and the likely range of values around this from your sample standard deviation

42
Q

why is SD biased?

A

Because the values in a sample are more likely to come from near the mean
So the sample is unlikely to have as many extreme values as the parent population
So the variance (and stdev) of the sample are lower than that of the population
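
A small simulation can illustrate the bias. The sketch below assumes a standard normal population (true variance 1), samples of size 5, and compares the variance computed with divisor n against the usual n − 1 correction. All settings are arbitrary choices for the example.

```python
import random

def variance(sample, ddof):
    """Variance with divisor n - ddof (ddof=0: n, ddof=1: n - 1)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - ddof)

rng = random.Random(1)   # fixed seed so the sketch is repeatable
n, n_samples = 5, 20000  # small samples make the bias easy to see

uncorrected, corrected = [], []
for _ in range(n_samples):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    uncorrected.append(variance(sample, ddof=0))
    corrected.append(variance(sample, ddof=1))

mean_uncorrected = sum(uncorrected) / n_samples
mean_corrected = sum(corrected) / n_samples
print(round(mean_uncorrected, 2), round(mean_corrected, 2))
```

On average the divisor-n variance comes out near (n − 1)/n of the true value, here about 0.8, while the n − 1 version averages close to 1.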

43
Q

how does a t test calculate its value?

A

it expresses the sample mean in units of “standard errors”, by dividing the sample mean (or its difference from the hypothesised value) by the standard error: t = (mean − μ0) / SE
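
A minimal Python sketch of a one-sample t statistic, assuming a hypothesised mean `mu0` (0 gives "sample mean divided by the standard error"). The function name and the data are made up.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """t = (sample mean - hypothesised mean) / standard error."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)  # SE = SD / sqrt(n)
    return (statistics.mean(values) - mu0) / se

sample = [4.0, 6.0, 8.0, 10.0]  # made-up data
t_value = one_sample_t(sample, mu0=5.0)
print(round(t_value, 3))
```

The statistic is then compared against a t distribution with n − 1 degrees of freedom to get a p-value, a step this sketch leaves out.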

44
Q

what is power?

what do you need to know to calc it?

A

probability of rejecting the null hypothesis if the null hypothesis is false, i.e. correct rejection of H0
= 1 − probability of a Type II error

need effect size and expected SD (together with the sample size and significance threshold)

45
Q

internal validity

A

Internal validity is the extent to which you believe the results of the study

Factors that increase internal validity:
Homogeneous sample (e.g. single strain, single sex, single genotype)
Homogeneous conditions (constant temperature, humidity, lighting regime, diet)
Single experimenter administering treatments
46
Q

Is it ever justified to raise the threshold p-value?

A

yes, if a Type II error is far worse than a Type I error

47
Q

problems of multiple testing

A

with 10 independent tests, the chance of at least one being ‘significant’ (p<0.05) is around 40% (1 − 0.95^10 ≈ 0.40) EVEN WHEN THE NULL HYPOTHESIS IS TRUE!
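
The size of this problem follows from a one-line formula. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name is made up.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

fwer_10 = familywise_error_rate(10)
for n_tests in (1, 5, 10, 20):
    print(n_tests, round(familywise_error_rate(n_tests), 3))
```

A figure of around 40% corresponds to 10 tests at alpha = 0.05.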

48
Q

solutions to avoid multiple testing

A

Fix primary and secondary dependent variables a priori
Control experiment-wise (family-wise) alpha and adjust test-specific alpha accordingly

Use multivariate methods
Data reduction
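
One common way to adjust the test-specific alpha is the Bonferroni correction, shown here as a Python sketch. The card does not name a specific method, so this choice is an assumption, and the p-values are made up.

```python
def bonferroni_reject(p_values, family_alpha=0.05):
    """Reject H0 only where p is at or below family_alpha / number of tests."""
    threshold = family_alpha / len(p_values)
    return [p <= threshold for p in p_values]

p_values = [0.001, 0.02, 0.04, 0.30]  # made-up p-values from 4 tests
decisions = bonferroni_reject(p_values)
print(decisions)
```

With 4 tests the per-test threshold drops to 0.0125, so results at p = 0.02 and p = 0.04 no longer count as significant. The price of this protection is reduced power for each individual test.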

49
Q

pseudoreplication?

A

Improper inflation of sample size due to non-independence of data

50
Q

objects in R include..

A

Single variables
Arrays and matrices
Structures (containing, e.g., different types of variables: text or numeric)

51
Q

sep “”
sep ‘\t’
sep ‘\n’

A

sep “” = whitespace-delimited data (one or more spaces, tabs or newlines)
sep ‘\t’ = tab-delimited data
sep ‘\n’ = new line delimited data

52
Q

correlation vs regression?

A

correlation (Pearson) requires both variables to be normally distributed; regression only requires that the residuals around the line relating y to x are normally distributed: x need not be normal (nor even y)

53
Q

Ordinary Least Squares

A

we minimise the sum of the squared residuals (so +ve and −ve differences have the same effect)
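
For a single predictor the least-squares line has a closed-form solution. The Python sketch below uses made-up data and hypothetical function names.

```python
def ols_fit(x, y):
    """Slope and intercept minimising the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def residual_sum_of_squares(x, y, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
y = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = ols_fit(x, y)
rss_best = residual_sum_of_squares(x, y, slope, intercept)
rss_other = residual_sum_of_squares(x, y, slope + 0.1, intercept)
print(round(slope, 2), round(intercept, 2))
```

Any other line, such as the one with a slightly steeper slope tried here, gives a larger residual sum of squares.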

54
Q

MA vs RMA (diffs in variance)

A
Major Axis (MA) Regression: if variances similar

Reduced Major Axis (RMA) Regression: if variances unequal