Exam II Material Flashcards
distributions can be
skewed
skewed data can be
negative or positive
the property of a distribution being peaky or heavy-tailed is called
kurtosis
distributions that are very flat with long tails are called
platykurtic
distributions that are very pointy/peaky are called
leptokurtic
distributions that are just right are called
mesokurtic
what to keep in mind when testing whether data is normal (3)
- whether data is normal enough depends on what you will do with the data
- there aren't as many hard rules
- the key is to justify what you are doing
what are tests of normality
Kolmogorov-Smirnov and Shapiro-Wilk
what are Kolmogorov-Smirnov and Shapiro-Wilk very sensitive to
n (sample size)
between Kolmogorov-Smirnov and Shapiro-Wilk, which is considered better
S-W (Shapiro-Wilk)
what are Q-Q plots
a good visual method for double-checking data especially for large n
when do we not consider our data normal for skewness and kurtosis
if skewness and kurtosis are more than 2x their standard error
when would we consider alternate tests for skewness and kurtosis
3x standard error
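The 2× / 3× standard-error screen from the cards above can be sketched in Python (standard library only). The helper name is made up, and the standard errors use the common large-sample approximations √(6/n) for skewness and √(24/n) for kurtosis, which are assumptions here rather than something from the cards:

```python
import math

def moments_check(data):
    """Moment-based skewness/excess kurtosis plus the 2x-standard-error
    normality screen. SEs use the large-sample approximations
    sqrt(6/n) and sqrt(24/n)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # population SD
    skew = sum(((x - mean) / sd) ** 3 for x in data) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in data) / n - 3  # excess kurtosis
    se_skew = math.sqrt(6 / n)
    se_kurt = math.sqrt(24 / n)
    looks_normal = abs(skew) <= 2 * se_skew and abs(kurt) <= 2 * se_kurt
    return skew, kurt, looks_normal
```

For example, a flat (platykurtic) dataset passes the skew check but fails the kurtosis check, so the screen flags it as non-normal.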
what is correlation
describes a relationship between two variables
how should we do correlation by hand
arrange data in order of one of the quantitative variables
correlations are a descriptor….
of how reliably a change in one variable predicts change in another variable
what are positive relationships
ones where an increase in one variable predicts an increase in the other
what are negative relationships
ones where an increase in one variable predicts a decrease in the other
is there always a relationship in correlation
no
correlation alone cannot be used to make a
definitive statement about causation
correlation can be found in almost
everything
what is the most effective way of presenting relationship data
scatterplots
what are relationships that are best described by lines called
linear relationships
what relationships are best described with curves
curvilinear relationships
how can we quantify a correlation
by the pearson product moment correlation
the pearson correlation varies from
-1 to 1
what is the pearson value indicating the weakest/no correlation
0
what is the value for the strongest correlation
±1 (a magnitude of 1, positive or negative)
what indicates the direction of the correlation in pearson
the sign
positive sign means
positive correlation
negative sign means
negative correlation
assumptions of the pearson correlation (5)
- uses two variables
- variables are both quantitative (ratio/ interval)
- variable relationships are linear
- minimal skew/ no large outliers
- must observe the whole range for each variable
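Computing Pearson's r from raw scores (no binning, as the cards advise) can be sketched in plain Python; the function name is hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation from raw scores:
    r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A perfectly linear increasing pair gives r = 1; flipping one variable's direction flips the sign.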
what should you not do when working with correlation
do not bin data, use the raw scores/ values
in a correlation setup we will be comparing two different variables for the…
same set of cases
in pearson correlation output, p ≤ 0.05 means
significant
parametric analysis includes
pearson
both variables are ratio/ interval and normal
pearson
nonparametric analysis includes (3)
- spearman’s rank
- kendall’s tau-b
- ETA
appropriate for ordinal and skewed data
spearman’s rank
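A sketch of Spearman's rank correlation using the classic 1 − 6Σd²/(n(n² − 1)) shortcut, which assumes no tied values; the helper name is made up:

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation via rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    where d is the difference between a case's rank on x and its rank on y.
    Assumes no ties (each value maps to a unique rank)."""
    n = len(xs)
    rank = lambda vals: {v: i + 1 for i, v in enumerate(sorted(vals))}
    rx, ry = rank(xs), rank(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because only ranks matter, a monotonic but curved relationship (e.g. y = x²) still gives rho = 1, which is why it suits ordinal and skewed data.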
appropriate for ordinal and skewed data; generally considered superior to Spearman (especially for small groups) and less affected by error
kendall’s tau-b
a special coefficient used for curvilinear relationships, particularly good for nominal by interval analyses
ETA
an entire, comprehensive group
population
a subset of the population, used to infer things about the population
sample
random samples are not casual or haphazard, getting truly random samples requires care
sampling
for characteristics of populations in regression, we might know the true
n
do you know the population mean in regression
you might be able to estimate it
do you know the population standard deviation in regression
we probably don't
for samples in regression we always know
n, mean, and standard deviation (SD)
what do regression and correlation have in common
both are about relationships between variables and work best with quantitative variables
regression differs from correlation in that we have explicit “………” variables used to estimate the value of some target variable
predictor
with regression, do you need stronger evidence of causality than with correlation
yes
regression is primarily calculated by analyzing
error
a key to calculating regression is to look at the
predictive error for the y-axis variable
the ……… of the ……… is the key to calculating the regression line
sum of squares of the error
the goal of regression is to find a best fit line that minimizes the
sum of the squares of the error
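For simple (one-predictor) regression, the best-fit line that minimizes the sum of squared error has a closed form; a sketch with a hypothetical helper name:

```python
def least_squares(xs, ys):
    """Simple linear regression by least squares: returns the slope and
    intercept that minimize the sum of squared prediction error for y.
    slope = sum((x - mx)(y - my)) / sum((x - mx)^2); intercept = my - slope*mx."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept
```

For data lying exactly on y = 2x + 1, the fit recovers slope 2 and intercept 1 with zero error.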
assumptions of linear regression (4)
- requires 2 or more scalar variables
- there is one dependent variable and one or more independent variables
- the relationships between the independent variables and the dependent variable must be linear
- the data must be homoscedastic
property of a dataset having variability that is similar across its whole range
homoskedasticity
opposite of homoscedastic is
heteroskedastic
symbol for number of observations for a sample and a population
sample- n
population- N
symbol for a datum for a sample and a population
sample- x
population- X
symbol for mean of a sample and a population
sample- x bar
population- μ (mu)
symbol for variance for a sample and a population
sample- s²/SD²
population- σ² (sigma squared)
symbol for standard deviation for a sample and a population
sample- s/SD
population- σ (sigma)
what does R mean in a linear regression?
the correlation between the observed values and the ones the model predicts
what does R2 mean in a linear regression ?
the amount of variability in the dependent variable that is accounted for by changes in ALL the independent variables
what does unstandardized B represent in a linear regression?
tells you the unit change in the dependent per unit change in the independent
what does std err represent in a linear regression?
used in calculating the t
what does beta tell you in a linear regression?
how strongly this variable predicts the dependent
t and sig in a linear regression
tell you whether the variable was a significant predictor of the dependent
adjusted R^2 in a linear regression
If you have a lot of independent variables, you’ll get some relationships due to chance. This tries to correct for that
std error of the regression in a linear regression
A measure of how accurately the model predicts the dependent variable
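R² and the standard error of the regression can both be computed from sums of squares; in this sketch the function name and arguments are assumptions, and the denominator n − k − 1 (k = number of predictors) is the usual degrees-of-freedom correction:

```python
import math

def fit_stats(ys, preds, n_predictors=1):
    """R^2 = 1 - SS_residual/SS_total (variability in the dependent variable
    accounted for by the model), and the standard error of the regression
    sqrt(SS_residual / (n - n_predictors - 1)) (typical prediction error)."""
    n = len(ys)
    my = sum(ys) / n
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    r2 = 1 - ss_res / ss_tot
    see = math.sqrt(ss_res / (n - n_predictors - 1))
    return r2, see
```

A perfect fit gives R² = 1 and a standard error of 0; worse predictions shrink R² and grow the standard error.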
population -
an entire comprehensive group
sample-
a subset of the population
used to infer things about the population
sampling-
random samples are not casual or haphazard, getting truly random samples requires care
random sampling-
used when surveying
obtains a “snapshot” of the population
just because sampling is random doesn't mean that your sample is perfectly representative
random assignment-
a process used in an experiment to minimize bias in your experiment groups
in both random sampling and random assignment, what does increasing n do?
it will decrease the likelihood of seeing a non-representative or biased sample
what does probability tell us?
when events are common, vs when events are rare
are common outcomes statistically significant ?
no
what outcomes are considered “statistically significant”?
rare outcomes
is probability arbitrary ?
yes - 100%
the central limit theorem
“Regardless of the shape of the population, the shape of the sampling distribution of the mean approximates a normal curve if the sample size is large enough”
does the sample tell us everything about the population?
no
what is the criteria for the probability of obtaining any specific sample from a population to fit a normal curve?
if the sample is sufficiently large
is it likely to get a very extreme sample?
no but it is possible
you will most likely get a mean somewhere near the actual population mean.
what does the sampling distribution of the mean refer to?
the probability distribution of means for all possible random samples of size n for a population
standard error of the mean (SEM)
describes the average amount of variability sample means have around the true population mean
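A small simulation (standard library only, hypothetical helper name) showing that the spread of many sample means matches the theoretical SEM = σ/√n, as the central limit theorem predicts:

```python
import math
import random
import statistics

def sem_demo(pop_sd=10.0, n=25, trials=2000, seed=1):
    """Draw many samples of size n from a normal population and compare
    the SD of the sample means (the empirical sampling distribution of
    the mean) to the theoretical SEM = sigma / sqrt(n)."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(0.0, pop_sd) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means), pop_sd / math.sqrt(n)
```

With σ = 10 and n = 25, the theoretical SEM is 2, and the simulated spread of sample means lands close to it.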
what does a z-test do?
converts a sample mean to a z-score so we can judge whether it is sufficiently rare
what magnitude is considered sufficiently rare?
a magnitude greater than ±1.96 (|z| > 1.96)
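A one-sample z-test sketch using the two-tailed ±1.96 cutoff from the card; the function name is made up:

```python
import math

def z_test(sample_mean, pop_mean, pop_sd, n, cutoff=1.96):
    """One-sample z-test: z = (sample_mean - mu) / (sigma / sqrt(n)).
    The mean counts as 'sufficiently rare' (two-tailed, alpha = .05)
    when |z| exceeds the cutoff."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    return z, abs(z) > cutoff
```

For example, a sample mean of 105 against a population with μ = 100, σ = 15, n = 36 gives z = 2.0, which just clears the cutoff; a mean of 101 does not.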
alternative hypothesis / research hypothesis (H1)
states there is something special about the population being observed
null hypothesis (H0)
states there is nothing special about the population being observed
do we ever accept H1?
NO
we can only reject H0
what do decision rules define
precisely when you reject H0 or not
what do the decision rules depend on?
types of study
the variables
tests performed
your field
what is the significance level (alpha)?
the proportion of area under the curve considered “rare” for the purposes of your decision rule
originally set as α = 0.05
what do we say when we do not have a significant result?
“we fail to reject” H0
this is a weak result
what do we say when we do have a significant result?
we definitely “reject H0”
this is a strong result
what do we say when we keep or reject the null
keep: H0 could be true
reject: H0 is most likely false
one tailed vs two tailed tests-
one tail- not used very often, retain H0 for all except one side of the curve
two-tail- used more frequently, retain H0 for only middle of the curve, reject H0 for both ends
when should you choose a one tail test?
-if you are positive that your hypothesis could only possibly result in a change in one direction
- if you are only interested in a change in one direction
- must be established as an experimental and analytical protocol before any analysis occurs
-if the consequences of a difference matter in only one direction
why shouldn’t you choose a one-tailed test?
-if you do not have very strong justification, reviewers will be critical of your choice
- sometimes seen as a sketchy way of making something look significant
in general, what is alpha ?
a trade off between two types of mistake
the choice of what alpha is is mostly arbitrary
what is a type 1 error
false positive
equal to alpha, decreases as alpha decreases
what is a type 2 error
a miss (false negative)
what does it mean for the null if p is greater than alpha?
we retain the null
it is not significant
what does it mean when the p is less than or equal to alpha?
we reject the null
the data is significant
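The p-versus-alpha decision rule from these last cards, as a tiny hypothetical helper:

```python
def decide(p, alpha=0.05):
    """Decision rule from the cards: reject H0 when p <= alpha,
    otherwise retain H0 (we never 'accept' H1, only reject H0)."""
    return "reject H0" if p <= alpha else "retain H0"
```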