intro to stats Flashcards
categorical variables
Variable whose values differ in type (category) rather than in amount
- Levels are usually string-based (character-based)
- Can be numerical if the numbers are used as names (no numerical value associated with the number)
integer variables
- Numerical variable consisting of whole numbers
- Numbers have real numerical meaning
continuous variables
- Numerical variables which can theoretically have infinite decimal places
Dichotomous variables
- only 2 levels
- 0/1, Ctrl/Treatment, TRUE/FALSE
- Can be categorical or integer
Variables defined by data type
nominal variables
categorical variables
ordinal variables
ranked data
ex. 1st, 2nd, 3rd
interval and ratio scale variables
can be integers or decimal places
ratio: has a true zero, so ratios can be meaningfully calculated (ex. 0 K is the absence of heat)
interval: does not have a true zero (ex. 0 °C is not the absence of heat)
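A small worked example of why ratios only make sense on a ratio scale. The two temperatures (10 °C and 20 °C) are hypothetical values chosen for illustration, not taken from the cards:

```python
# Hypothetical temperatures, chosen only for illustration.
celsius_a, celsius_b = 10.0, 20.0

# Convert to kelvin, a ratio scale with a true zero.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15

# On the interval (Celsius) scale the ratio comes out as 2, but that number
# depends entirely on where zero happens to be placed.
celsius_ratio = celsius_b / celsius_a

# On the ratio (kelvin) scale the ratio is meaningful: roughly 1.035.
kelvin_ratio = kelvin_b / kelvin_a

print(f"Celsius ratio: {celsius_ratio:.3f}")
print(f"Kelvin ratio:  {kelvin_ratio:.3f}")
```

So 20 °C is not "twice as hot" as 10 °C, even though the Celsius numbers suggest it.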
variables defined by causal relationship
In an experimental setting, we manipulate the independent variable, and measure scores for the dependent variable
nuisance variables
- confounding variables can potentially change the value of the outcome variable, and vary systematically with the predictor variable
- obscuring variables can also potentially change the value of the outcome variable, but do not vary systematically with the predictor variable
experimental designs 1
Different individuals are in different experimental conditions
* Between-subjects designs
* Independent groups designs
experimental designs 2
The same individuals are in different experimental conditions
* Within-subjects designs
* Repeated-measures designs
mixed designs
some predictor variables are between-subjects and some are within-subjects
inference
based on various methods such as hypothesis testing,
confidence interval estimation and parameter estimation
inferential statistics
uses sample statistics to estimate the value of a population parameter
parameter
a constant numerical characteristic of a population
- can include shapes (normal distribution), as shapes can be defined numerically
statistic
corresponding value calculated for a sample
population parameter and sample statistics symbols
standard deviation
- sigma: population parameter
- s: sample statistic
mean
- mu: population parameter
- M: sample statistic (or x bar)
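A minimal sketch of the parameter vs. statistic distinction. The population here is simulated (mean 100, standard deviation 15 are made-up values), and the variable names simply mirror the symbols on the card:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 scores (made up for illustration).
population = [random.gauss(100, 15) for _ in range(10_000)]

mu = statistics.fmean(population)      # population parameter: mean
sigma = statistics.pstdev(population)  # population parameter: standard deviation

# In practice we only see a sample, so we calculate statistics from it.
sample = random.sample(population, 50)
M = statistics.fmean(sample)           # sample statistic: mean (x bar)
s = statistics.stdev(sample)           # sample statistic: standard deviation (n - 1)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"M  = {M:.2f}, s     = {s:.2f}")
```

The sample values M and s will be close to, but not exactly equal to, mu and sigma.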
i
index or individual
- refers to each score
statistics are invented tools
- statistical tools are invented to estimate probabilities that guesses are correct
characteristics of popular statistical methods
- common sense
- ease of use
- inertia (being good enough)
x^2 (chi-square)
test statistic: a single number that represents how well the observed data fit the null hypothesis
- needs to produce a single number that incorporates two properties of the data (number and proportion)
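The cards do not state the formula, so the sketch below assumes the standard Pearson chi-square: the sum over categories of (observed - expected)^2 / expected. The coin-toss counts (60 heads, 40 tails) are hypothetical:

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected across categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 100 coin tosses, H0 says the coin is fair.
observed = [60, 40]   # heads, tails
expected = [50, 50]   # counts expected under H0

x2 = chi_square(observed, expected)
print(x2)  # 4.0
```

A value of 0 would mean the observed counts match the null hypothesis exactly; larger values mean a worse fit.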
probability distribution
divide counts (y) by the total number of simulations
- used to obtain probabilities associated with specific outcomes
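A rough sketch of building a probability distribution by simulation, assuming the outcome of interest is the number of heads in 100 fair coin tosses (the number of simulations and the seed are arbitrary choices):

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed, for repeatability
n_simulations = 10_000
n_tosses = 100

# Count how often each number of heads occurs across the simulations.
counts = Counter(
    sum(random.random() < 0.5 for _ in range(n_tosses))
    for _ in range(n_simulations)
)

# Divide each count by the total number of simulations to get probabilities.
probabilities = {
    heads: count / n_simulations for heads, count in sorted(counts.items())
}

print(f"P(exactly 50 heads) is roughly {probabilities.get(50, 0.0):.3f}")
```

The probabilities sum to 1, and each one estimates how likely that specific outcome is.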
p value
probability of obtaining results at least as extreme as those we observed, if H0 were true
- p value low = low probability of obtaining our observed results, if H0 is true
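One way to estimate a p value by simulation, sketched under these assumptions: the observed result is 60 heads in 100 tosses (hypothetical), H0 is a fair coin, and the test statistic is the standard chi-square. The p value is the share of simulated results at least as extreme as the observed one:

```python
import random

random.seed(7)  # arbitrary seed, for repeatability
n_tosses, n_simulations = 100, 10_000

def x2_for_heads(heads, n=n_tosses):
    """Chi-square statistic for a heads count, with H0 = fair coin."""
    expected = n / 2
    tails = n - heads
    return (heads - expected) ** 2 / expected + (tails - expected) ** 2 / expected

observed_x2 = x2_for_heads(60)  # hypothetical observed result

# Simulate many experiments in a world where H0 is true.
simulated = [
    x2_for_heads(sum(random.random() < 0.5 for _ in range(n_tosses)))
    for _ in range(n_simulations)
]

# Proportion of simulated statistics at least as large as the observed one.
p_value = sum(x2 >= observed_x2 for x2 in simulated) / n_simulations
print(f"p = {p_value:.3f}")
```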
classical statistics
calculates the theoretical probability distribution that would be obtained if the null hypothesis is correct
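ex. the theoretical chi-square distribution is defined by its degrees of freedom, df = k - 1 (k = number of categories)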
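A sketch of the classical approach for the simplest case, k = 2 categories (df = 1). It relies on the fact that a chi-square variable with df = 1 is a squared standard normal, so the upper-tail probability can be written with the complementary error function. The helper name is hypothetical, and it only applies to df = 1:

```python
import math

def chi_square_p_value_df1(x2):
    """Upper-tail probability of the chi-square distribution, df = 1 only."""
    return math.erfc(math.sqrt(x2 / 2))

k = 2        # number of categories (e.g. heads and tails)
df = k - 1   # degrees of freedom

p = chi_square_p_value_df1(4.0)  # 4.0 is a hypothetical observed x^2
print(f"df = {df}, p = {p:.4f}")
```

For other degrees of freedom a statistics library or a printed table would normally be used instead.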
simulation vs. classical statistics
theoretical
- any value of x^2 is possible, probability distribution is continuous
simulation based
- limited number of x^2 values possible
- small number of possible outcomes when counting numbers of heads/tails from 100 coin tosses
- probability distribution is discrete
simulation-based methods are arguably better, as they only deal with outcomes that are actually possible
- but classical stats are more widespread
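To see why the simulation-based distribution is discrete, the sketch below lists every x^2 value that 100 coin tosses can produce, assuming the standard chi-square formula and a fair-coin null hypothesis:

```python
n_tosses = 100
expected = n_tosses / 2  # expected heads (and tails) under H0

# Every possible heads count from 0 to 100 gives one x^2 value;
# a set keeps only the distinct ones.
possible_x2 = sorted({
    (heads - expected) ** 2 / expected
    + (n_tosses - heads - expected) ** 2 / expected
    for heads in range(n_tosses + 1)
})

print(len(possible_x2))               # number of distinct x^2 values
print(possible_x2[:4], possible_x2[-1])
```

Only 51 distinct values can occur, whereas the theoretical distribution treats every non-negative value as possible.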
how can we assess variability amongst a set of 6 scores?
- calculate difference between each score and a single point
- difference between each score and the mean
- DEVIATION SCORE (xi - x bar)
how to calculate mean deviation?
- deviation scores always sum to zero, so the signs have to be dealt with first
- ignore the signs (mean absolute deviation, MAD)
- remove the signs by squaring all deviation scores, calculating the average, then taking the square root (standard deviation)
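A worked sketch of both approaches on six hypothetical scores. It divides by n, as the card describes; dividing by n - 1 instead is the usual choice for a sample:

```python
scores = [2, 4, 4, 5, 7, 8]  # hypothetical set of 6 scores
n = len(scores)
mean = sum(scores) / n

# Deviation score for each individual: xi - x bar.
deviations = [x - mean for x in scores]
print(sum(deviations))  # deviations cancel out, so the plain average is useless

# Option 1: ignore the signs (mean absolute deviation).
mad = sum(abs(d) for d in deviations) / n

# Option 2: square, average, then take the square root.
root_mean_square = (sum(d ** 2 for d in deviations) / n) ** 0.5

print(f"MAD = {mad:.3f}, root mean squared deviation = {root_mean_square:.3f}")
```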
MAD vs standard deviation (s)
- outliers will distort estimates of s more than MAD: larger deviation scores get even larger when squared
- MAD is more intuitive because s is the result of squaring, adding, and square-rooting
- in real datasets, MAD estimates from a sample may be better estimates of the underlying population parameter than s
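A small illustration of the outlier point, using made-up scores. The second list replaces one score with an extreme value, and the sketch compares how much each measure grows:

```python
import statistics

def mean_absolute_deviation(scores):
    """Average absolute distance of each score from the mean."""
    mean = statistics.fmean(scores)
    return sum(abs(x - mean) for x in scores) / len(scores)

clean = [2, 4, 4, 5, 7, 8]          # hypothetical scores
with_outlier = [2, 4, 4, 5, 7, 38]  # same scores, last one made extreme

mad_growth = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(clean)
s_growth = statistics.stdev(with_outlier) / statistics.stdev(clean)

print(f"MAD grows by a factor of {mad_growth:.2f}")
print(f"s grows by a factor of   {s_growth:.2f}")
```

With these numbers s is inflated more than MAD, because the single large deviation is squared before it is averaged.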
s
- s is one of the parameters used to define the normal distribution, which is centrally important in classical statistics
- Fisher (1920) demonstrated that, in a perfect normal distribution, sample s estimates the population standard deviation better than sample MAD estimates the population MAD (s estimates its corresponding parameter better than MAD does)
s is the dominant measure of variation used in stats
standard deviation (s)
s and variance (s^2) are primary measures of variation
divide by n - 1 (degrees of freedom)
s = divide the sum of squares by the degrees of freedom, then take the square root
mean deviation score is calculated by summing the squared deviation scores, then dividing by the number of scores that can vary, then taking the square root
df
first calculate sum of squares (SS): sum((xi - x bar)^2)
df (number of things that can vary): n - 1
sum of x = n × x bar, so once the mean is fixed only n - 1 scores are free to vary
purpose of s = generate an estimate of average variation
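The steps can be sketched on six hypothetical scores, with a check against the standard library's sample standard deviation. The last lines illustrate the degrees-of-freedom idea: given the mean and the first n - 1 scores, the final score is already determined:

```python
import math
import statistics

scores = [2, 4, 4, 5, 7, 8]  # hypothetical scores
n = len(scores)
x_bar = sum(scores) / n

ss = sum((x - x_bar) ** 2 for x in scores)  # sum of squares
df = n - 1                                  # degrees of freedom
variance = ss / df                          # s^2
s = math.sqrt(variance)                     # s

print(f"SS = {ss}, df = {df}, s^2 = {variance:.2f}, s = {s:.3f}")
print(f"statistics.stdev gives {statistics.stdev(scores):.3f}")

# Because sum of x = n * x bar, the last score is fixed by the others.
last_score = n * x_bar - sum(scores[:-1])
print(last_score)
```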
normal distribution
natural variables commonly approximate to the normal distribution
errors of measurement commonly approximate to the normal distribution
means calculated from multiple samples drawn from a population will approximate to the normal distribution
is a probability distribution
- common use: derive the probability that a score selected at random from a normally distributed population will fall within a specific range of values
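A rough simulation of the point about sample means. It assumes a deliberately non-normal (exponential) population, draws many samples of 50 scores, and checks whether about 68% of the sample means fall within one standard deviation of their own mean, as a normal distribution would predict:

```python
import random
import statistics

random.seed(3)  # arbitrary seed, for repeatability

# Each sample: 50 scores from a skewed (exponential) population with mean 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(50)])
    for _ in range(5_000)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

# Share of sample means within one standard deviation of the center.
within_one = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)

print(f"mean of sample means = {center:.3f}")
print(f"share within 1 SD    = {within_one:.3f}")
```

The sample size, number of samples, and choice of population are all arbitrary; the pattern should hold for other reasonable choices.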
reading the normal distribution
distribution
y axis: ignore values
- for most plots: interested in value of y that corresponds to a value of x
- probability distributions: interested in the area under the curve between two values of x, expressed as a percentage of the total area under the curve
x-axis: number of standard deviations from the mean
- standard deviation: average deviation from the mean
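A short sketch of reading areas off the curve, using the standard normal distribution (mean 0, standard deviation 1) so that x is measured in standard deviations from the mean. The helper name is hypothetical:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_between(low, high):
    """Area under the curve between two values of x (in SD units)."""
    return standard_normal.cdf(high) - standard_normal.cdf(low)

within_1 = area_between(-1, 1)  # roughly 68% of the total area
within_2 = area_between(-2, 2)  # roughly 95% of the total area

print(f"within 1 SD: {within_1:.1%}")
print(f"within 2 SD: {within_2:.1%}")
```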
summary of normal distribution
originated from attempts to stop disputes between gamblers
distribution is an approximation of the binomial distribution with a large number of trials (games) and can be calculated simply from mu and sigma
combo of mathematical simplicity and usefulness of the normal distribution in modelling real variables and errors resulted in it holding a central position in classical statistics
if we know a population's mu and sigma, and know that the variable is normally distributed, we can easily estimate the probability that a score will be within a specific range of values
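A sketch tying the last two points together, assuming 100 tosses of a fair coin as the binomial example. The normal curve is built from mu = n × p and sigma = sqrt(n × p × (1 - p)), which are standard results rather than something stated on the cards:

```python
import math
from statistics import NormalDist

n, p = 100, 0.5                     # hypothetical: 100 fair coin tosses
mu = n * p                          # 50
sigma = math.sqrt(n * p * (1 - p))  # 5
normal = NormalDist(mu, sigma)

def binomial_probability(k):
    """Exact probability of k heads in n tosses."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# The normal curve tracks the exact binomial probabilities closely.
for k in (40, 45, 50, 55, 60):
    print(k, f"{binomial_probability(k):.4f}", f"{normal.pdf(k):.4f}")

# Probability that a score falls within a range, from mu and sigma alone.
p_range = normal.cdf(55) - normal.cdf(40)
print(f"P(40 to 55) is roughly {p_range:.3f}")
```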