Definitions Flashcards
Statistical Inference
The process of drawing conclusions about the probability distribution function associated to one or more variables on a population from information obtained on a sample.
Population
A set about which we wish to draw conclusions.
Variable
A variable defined on a population is some characteristic of the elements of that population.
Census
A study where the variables in question are measured for every member of a population.
Survey
A study where the variables in question are measured on a SRS of the population.
SRS
A sample taken with replacement.
Subset
A sample taken without replacement.
Levels of a variable
The possible outcomes you consider for a variable.
Simple Random Sample (SRS)
An SRS of size N of a population is a vector of length N consisting of elements of the population, where every element of the population has an equal chance of being chosen for each entry of the vector.
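A minimal sketch of the SRS (with replacement) versus subset (without replacement) distinction, in Python with numpy; Python is not part of the flashcards and the population values are made up for illustration.

import numpy as np

rng = np.random.default_rng(seed=1)
population = np.array(["a", "b", "c", "d", "e", "f"])

N = 4
srs = rng.choice(population, size=N, replace=True)      # SRS: the same element can be drawn more than once
subset = rng.choice(population, size=N, replace=False)  # subset: each element appears at most once

print("SRS:   ", srs)
print("Subset:", subset)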
Observational study
The collection and analysis of data with the goal of determining the characteristics of a population.
Experiment
Occurs when a researcher is able to control which members of a sample receive one or more interventions or treatments (experimental group) and which do not (control group) or which receive some other comparison treatment (comparison group).
True experiment
An experiment where participants are randomly allocated to groups.
Quasi experiment
An experiment where participants are allocated to groups through some non-random process.
Response/dependent variable
The variable whose values are to be predicted from other values.
Predictor/independent variables
Variables whose values are used to predict values of another variable.
Lurking/confounding variables
Variables that are not measured in an observational study, but which influence both the predictor and response variables.
Nuisance/covariant variables
A variable that is recorded in a study because it may affect the response, but is not one of the primary variables of interest.
Factors
The nuisance and predictor variables in a study or experiment.
Probability Density Function (PDF)
A function that describes the likelihood of a random variable taking a given value.
Statistic
Any quantity that may be calculated from the values of a set of random variables on a random sample of a population.
Estimator
A statistic on a sample which is often taken to estimate some function of the parameters in a model for the random variables on the population.
Sampling distribution
The probability density function associated to a statistic calculated on samples of size n from a population.
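A short simulation sketch of a sampling distribution: draw many samples of size n from an assumed population model and record the statistic, here the sample mean. The normal(10, 2) model, the sample size, and the number of replications are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(seed=2)
n, reps = 25, 10_000

# Each replication: draw a sample of size n, compute the statistic (the sample mean).
means = np.array([rng.normal(loc=10, scale=2, size=n).mean() for _ in range(reps)])

# The spread of these means (the standard error) should be close to sigma / sqrt(n) = 0.4.
print(means.mean(), means.std(ddof=1))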
Most likely value
Of a statistic under a given null hypothesis is the value at which the sampling distribution of that statistic under the null hypothesis takes its maximum.
Region of acceptance
Given a significance level, for the null hypothesis it is the interval of possible values for the statistic on a given sample that will not lead you to reject the null hypothesis.
P-value
Tells you how likely a result as extreme as, or more extreme than, the one obtained from a given study or experiment is to have occurred purely by chance if the null hypothesis is correct.
Categorical variable
A random variable whose possible values cannot be put in any meaningful order.
Quantitative variable
Any random variable whose values can be put in a meaningful order.
Ordinal variable
Variables that have word labels and can be put into order.
Model
For a random variable, a model is a choice of a standard form that we know, or assume, the probability density function associated to the variable takes.
Bernoulli trial
A random variable, X, with two possible outcomes and a single parameter p, representing P(X = 1).
Normal random variable
A random variable whose PDF is a normal distribution.
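A small sketch of simulating the two standard models above; the parameter values p, mu, and sigma are chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(seed=3)

p = 0.3
bernoulli_draws = rng.binomial(n=1, p=p, size=10)   # Bernoulli(p) is Binomial(1, p)

mu, sigma = 5.0, 1.5
normal_draws = rng.normal(loc=mu, scale=sigma, size=10)

print(bernoulli_draws)
print(normal_draws.round(2))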
Exploratory data analysis
A set of techniques involving summary statistics and graphical methods for exploring data before you do formal inference.
Kth q-quantile
For a set of data or a distribution, it is the number below which k/q of the data or distribution lies.
Q-Q plot
For a data set with n points against a model distribution, it is the plot of (x, y) values where the kth y-value is the kth smallest datapoint in the set, and the kth x-value is the kth n-quantile of the model distribution.
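A minimal sketch of building a Q-Q plot against a standard normal model by hand; the data are made up, and the common k/(n+1) convention is used for the model quantiles so the largest point stays finite (conventions for the plotting positions vary).

import numpy as np
from scipy import stats

data = np.array([2.3, 1.9, 3.1, 2.7, 2.2, 2.9, 3.4, 2.5])
y = np.sort(data)                          # kth y-value: kth smallest datapoint
k = np.arange(1, len(data) + 1)
x = stats.norm.ppf(k / (len(data) + 1))    # kth x-value: quantile of the model distribution

# A roughly straight line of (x, y) points suggests the normal model fits.
for xi, yi in zip(x, y):
    print(f"{xi:6.3f}  {yi:5.2f}")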
Standard normal (Z)
A normal with mean = 0 and sd = 1.
5 number summary
{lowest datapoint, lower quartile, median, upper quartile, highest datapoint}
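A quick sketch of a five-number summary with numpy percentiles; the data are made up, and numpy's default quartile interpolation can differ slightly from hand-calculated textbook quartiles.

import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])
low, q1, median, q3, high = np.percentile(data, [0, 25, 50, 75, 100])
print(low, q1, median, q3, high)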
Robust
A statistic is robust if it is not strongly influenced by changes to only a few data points; the mean and standard deviation are strongly influenced by such changes, so they are not robust.
Interaction plot
Used when interested in studying the effect of two categorical predictor variables on a single response variable.
Effect size
The difference between the actual value of the parameter, on the population, and the value of the parameter under the null hypothesis.
Confidence interval
An X% confidence interval for a parameter theta is an interval (L,U) generated by some procedure that in repeated sampling has an X% probability of containing the true value of theta for all possible values of theta.
Confidence procedure
An X% confidence procedure is any procedure that generates intervals containing theta in X% of repeated samples.
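A simulation sketch of a 95% confidence procedure for a normal mean with known sigma: in repeated sampling the intervals should contain the true mu about 95% of the time. The population parameters, sample size, and number of replications are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=4)
mu, sigma, n, reps = 10.0, 2.0, 30, 5_000
z = stats.norm.ppf(0.975)                         # ~1.96 for a 95% interval

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    half_width = z * sigma / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= mu <= hi)

print(covered / reps)                             # should be close to 0.95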
Unstandardised effect size
The difference in means, m1 - m2.
Type 1 error
Rejecting the null hypothesis when it shouldn't be rejected (i.e. when the null hypothesis is true).
Type 2 error
Not rejecting the null hypothesis when we should (i.e. when the null hypothesis is false).
Smallest relevant effect size
The smallest difference from the null hypothesis value of the parameter that we consider to be important.
Power
The power (1 - beta) of a statistical test is the probability of rejecting the null when the null is false with some effect size greater than epsilon (i.e. the probability of not making a type 2 error when the effect size is large enough to be of interest to us).
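A simulation sketch of power for a two-sided one-sample z-test with known sigma: generate data with the true mean shifted by the smallest relevant effect size epsilon, and count how often H0 is rejected. The parameter values are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
mu0, sigma, n, alpha = 0.0, 1.0, 40, 0.05
epsilon = 0.5                                  # smallest relevant effect size
z_crit = stats.norm.ppf(1 - alpha / 2)

reps, rejections = 5_000, 0
for _ in range(reps):
    sample = rng.normal(mu0 + epsilon, sigma, size=n)
    z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
    rejections += abs(z) > z_crit

print("estimated power:", rejections / reps)   # 1 - beta at effect size epsilon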
T-statistic
The statistic we get by replacing the population standard deviation by the sample standard deviation in the z-statistic.
Degrees of freedom
df = n - 1 (for a one-sample t-statistic based on n observations).
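A sketch of computing the one-sample t-statistic and its degrees of freedom by hand, then checking against scipy; the data and the null value mu0 are made up.

import numpy as np
from scipy import stats

data = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.7, 5.2, 5.0])
mu0 = 5.0

n = len(data)
t_by_hand = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))  # sample sd replaces sigma
df = n - 1
p_value = 2 * stats.t.sf(abs(t_by_hand), df)

t_scipy, p_scipy = stats.ttest_1samp(data, popmean=mu0)
print(t_by_hand, df, p_value)
print(t_scipy, p_scipy)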
ANOVA
A generalisation of the t-test to comparing the means of more than two groups.
Full model density
Mu + alpha(i) + epsilon, where mu is a reference level and alpha(i) represents the deviation of the mean for the ith treatment group from the reference level mu.
Reduced model density
Mu + epsilon
SS(R)
Residual sum of squares from the reduced model.
SS(F)
Residual sum of squares from the full model.
One-way ANOVA
Used when there is one categorical predictor variable and one continuous response variable.
F distributions
The distributions followed by the F statistic used in ANOVA; the further the observed F statistic is from 1, the stronger the evidence against the null.
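A sketch of the nested-model comparison behind one-way ANOVA: compute SS(R) from the reduced model (grand mean only) and SS(F) from the full model (one mean per group), then turn the reduction into an F statistic. The groups and data are made up for illustration.

import numpy as np
from scipy import stats

groups = {
    "A": np.array([4.1, 5.0, 4.6, 4.8]),
    "B": np.array([5.9, 6.3, 5.7, 6.1]),
    "C": np.array([5.2, 4.9, 5.5, 5.1]),
}
all_values = np.concatenate(list(groups.values()))
n, k = len(all_values), len(groups)

ss_r = np.sum((all_values - all_values.mean()) ** 2)              # SS(R): residuals from the grand mean
ss_f = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())  # SS(F): residuals from group means

df_r, df_f = n - 1, n - k
F = ((ss_r - ss_f) / (df_r - df_f)) / (ss_f / df_f)
p_value = stats.f.sf(F, df_r - df_f, df_f)
print(F, p_value)    # matches stats.f_oneway(*groups.values())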
Non-parametric test
One that does not make any assumptions about the distribution of residuals.
Ranks
If you have a list of data from quantitative or an ordinal variable, you can put it in order. The position of the datapoint in this ordered list is its rank. If several datapoints are equal, then the rank of each one is the average of their positions on the list.
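A tiny sketch of ranking with ties averaged, which is scipy's default ('average') method; the data are made up.

from scipy.stats import rankdata

data = [7, 3, 3, 9, 5]
print(rankdata(data))   # [4.  1.5 1.5 5.  3. ] -- the two 3s share rank (1 + 2) / 2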
Wilcoxon signed-ranks test
Non-parametric version of a one-sample or paired t-test. For a one-sample test, it tests the null hypothesis H0: the median of the population is m0. For a paired test, it tests the null hypothesis H0: the medians of the two populations satisfy m1 - m2 = m0.
Mann-Whitney u test
Non-parametric version of the independent-samples t-test. It tests the null hypothesis that the probability of an element of the first group being greater than an element of the second group is exactly 0.5.
Kruskal-Wallis test
Non-parametric version of one-way ANOVA, carried out on ranks. H0: for any two groups you consider, the probability that a random element in the first group will yield a greater value of your variable than a random element in the second group is exactly 0.5. Ha: for at least two groups, the probability is different from 0.5.
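A hedged sketch of calling the three rank-based tests above via scipy with default settings; all data values are made up for illustration.

import numpy as np
from scipy import stats

before = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])
after = np.array([11.8, 11.0, 12.6, 12.9, 11.5, 12.0])
print(stats.wilcoxon(before, after))          # Wilcoxon signed-ranks test (paired)

group1 = np.array([3.1, 2.8, 3.6, 3.3, 2.9])
group2 = np.array([2.4, 2.7, 2.5, 2.2, 2.6])
print(stats.mannwhitneyu(group1, group2))     # Mann-Whitney U test (independent samples)

group3 = np.array([3.0, 3.4, 3.2, 2.9, 3.1])
print(stats.kruskal(group1, group2, group3))  # Kruskal-Wallis test (three or more groups)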
Chi-squared test
Compares the expected counts in each cell of a contingency table with the observed counts. H0: response is independent of condition. Ha: response depends upon condition. If there is a big difference, we can conclude that it is unlikely that there is no difference in the population.
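A short sketch of the chi-squared test of independence on a made-up contingency table where rows are conditions and columns are response categories.

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 25]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)
print(expected)   # expected counts under H0: response independent of condition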
Publication bias
Refers to the idea that the scientific studies which end up getting published are a biased sample of the total population of scientific studies.
P-hacking
The practice of adjusting data collection or analysis, driven by incentives to find significant p-values, until a nominally significant result is obtained.