STA8170 Flashcards
Data
systematically recorded values (numbers or labels) together with their context
Categorical/qualitative variable
variable that names categories with words or numbers
Context (info required for?) (x6)
who was measured, what was measured, how data was collected, where data was collected, when and why the study was done
Rows in a data table hold…
individual cases, eg respondents, participants, subjects, units, records
Columns in a data table hold…
variables that give info about each individual case
Quantitative variable
an amount or degree, measured in meaningful numbers eg scale
Identifiers
variable that assigns unique value to each individual/case - cannot be analysed
Relational database
large databases that link data tables together by matching identifiers
Ordinal variable
categorical variable with ordering of values
Data table
an arrangement of data in which each row represents a case, and each column a variable
Case
individual about whom/which we have data
Record
info about an individual/case in a database
Sample (x2)
representative subset of population
analysed to estimate/learn about the population
Population
the collection of all individuals, items or objects of interest
Nominal variable
variable whose values are only names of categories
Units
quantity or amount used as standard of measurement
Parameter (and greek letter)
any numerical characteristic of a population - eg μ (mu, pronounced 'mew')
Distribution (x2)
description of all the values a variable can take, and how often those values occur
Three important things pictures can do in data analysis?
reveal things not able to be seen in data tables, helping to think about patterns/relationships
show important features in the data
tell others about the data
Area principle (for graphing data)
the area occupied by a part of the graph should correspond to the magnitude of value it represents
Frequency table (x3)
organises the cases according to the category they fall into on a variable
rows are category names
also records totals
describes the distribution of a categorical variable
Relative frequency table (x2)
displays percentages, rather than counts, of values in each category
describes the distribution of a categorical variable
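A minimal Python sketch of the two tables above. The eye colour data and the function names are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: count} and the total number of cases."""
    counts = Counter(values)
    return dict(counts), sum(counts.values())

def relative_frequency_table(values):
    """Return {category: percentage of cases in that category}."""
    counts, total = frequency_table(values)
    return {category: 100 * count / total for category, count in counts.items()}

# Hypothetical data: eye colour recorded for 8 cases
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
counts, total = frequency_table(eye_colour)
percentages = relative_frequency_table(eye_colour)
```

The percentages in a relative frequency table should add to 100 (allowing for rounding).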
Bar chart (x3)
Display distribution of a categorical variable
Categories on the x, counts on the y
spaces between the bars indicate that freestanding bars can be placed in any order
Relative frequency bar chart
shows the percentage/proportion of values (y) falling under each category (x)
Pie charts are used to display…?
Plus one disadvantage
categorical data
visual comparisons between categories are more difficult than in eg a bar chart
Contingency table
how cases are distributed along each variable, dependent on the other variable
Marginal distribution
the totals displayed (as counts or %) in the bottom row and last column of contingency tables
Conditional distribution
show the distribution of one variable for just those cases that satisfy a condition on another variable
Independent variables in a contingency table are when… (x2)
the distribution of one variable is the same for all categories of another
ie there is no association between them
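A rough Python sketch of marginal and conditional distributions in a contingency table. The cases below are invented, and the variable and function names are only for illustration.

```python
from collections import Counter

# Hypothetical cases: (ticket class, survived?) for 8 individuals
cases = [
    ("first", "yes"), ("first", "yes"), ("first", "no"),
    ("third", "yes"), ("third", "no"), ("third", "no"),
    ("third", "no"), ("third", "no"),
]
cell_counts = Counter(cases)

# Marginal distributions: the totals for each variable on its own
row_totals = Counter(row for row, _ in cases)
col_totals = Counter(col for _, col in cases)

def conditional_distribution(row):
    """Proportions of the column variable among cases in one row category."""
    return {col: cell_counts[(row, col)] / row_totals[row] for col in col_totals}

first = conditional_distribution("first")
third = conditional_distribution("third")
```

Here the conditional distributions differ between the two rows, which suggests the variables are associated rather than independent.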
Histogram (x3)
Bar chart for quantitative data
Counts (y) grouped into bins (x) that make up the bars
No gaps between bars - or gap indicates no values for that bin
Relative frequency histogram
Use percentage on y-axis instead of counts
Stem and leaf plot (x3)
Similar to histogram, but shows the individual values
Useful for doing by hand or in Word, for <100 values
Stem values on the vertical axis, leaves across the horizontal
Dotplots
Like a stem and leaf, but with dots
Can be vertical (like stem plot) or horizontal
Categorical data condition (for deciding on how to display data) (x2)
Data is counts or percentages of individual cases in categories
Categories do not overlap
Quantitative data condition (for deciding on how to display data)
Data are values of a quantitative variable whose units are known
Four components for descriptions of distribution (plus egs)
that means you should be able to…
shape - symmetry, skew, gaps
outliers
centre - median
spread - range, interquartile range
roughly sketch the distribution
Modes (plus 3 types)
the peaks in distributions
unimodal
bimodal
multimodal
A distribution with no modes is described as…
uniform
Skew (x2)
a distribution with longer tail on one side
skew is described as to the side with the longer tail
Median (x3, plus how to find, x2)
the middle value that divides a histogram into two equal areas
appropriate description of centre for skewed distributions or with outliers
always pair with the IQR
if n is odd, median is the middle value
if n is even, median is the average of the two middle values
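The odd/even rule above, as a small Python sketch (the function name is just for illustration):

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    data = sorted(values)
    n = len(data)
    middle = n // 2
    if n % 2 == 1:
        return data[middle]                       # n is odd: the single middle value
    return (data[middle - 1] + data[middle]) / 2  # n is even: average the middle pair

odd_case = median([3, 1, 7])
even_case = median([3, 1, 7, 5])
```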
Range
difference between min and max values in a distribution
Quartile
the values that divide the ordered cases in a distribution into four equal-sized groups
Interquartile range (x2)
= upper quartile - lower quartile
the spread of the middle 50% of the data, between the 25th and 75th percentiles
Percentile (plus eg x1)
the value that leaves that percentage of data below it
eg, 25th percentile has 25% of data below it
Five number summaries of distribution include…
minimum, q1, median, q3, maximum
Boxplots (x7)
display of the five number summary
vertical axis from min to max of data
box around q1 and q3
horizontal line inside box at the median
‘fences’ at 1.5 IQRs beyond lower and upper quartiles (not displayed, just for working)
whiskers from box to most extreme data values found within the fences
add dots for any values found outside the fences
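A sketch of the working behind a boxplot in Python. Software and textbooks use slightly different quartile rules; this assumes the "inclusive" method of `statistics.quantiles`, so hand-calculated quartiles may differ a little. The data are made up.

```python
import statistics

def boxplot_summary(values):
    """Five-number summary, fences, whisker ends and outliers for a boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # fences are for working only, not drawn
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "five_number": (data[0], q1, median, q3, data[-1]),
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # most extreme values within the fences
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

summary = boxplot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

In this example the value 30 lies beyond the upper fence, so it would be drawn as a dot and the upper whisker would stop at 9.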
Mean (x4)
average of all values in a distribution
appropriate description of centre for roughly symmetrical/normal data sets
always pair with SD
notation - a bar above the symbol, eg ū = the mean of u, pronounced u-bar
Standard deviation (x3)
describes the spread of a distribution
root of the average of squared deviation of each value from the mean
(average of deviations would cancel each other out)
Variance
the average of the squared deviations of each value from the mean
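A Python sketch of variance and standard deviation. The cards say "average"; many textbooks divide by n - 1 for sample data rather than n, so the `sample` flag (an assumption of this sketch) lets you choose. The scores are made up.

```python
import math

def variance(values, sample=True):
    """Average squared deviation from the mean (divides by n - 1 if sample)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor

def standard_deviation(values, sample=True):
    """Square root of the variance, so it is in the original units."""
    return math.sqrt(variance(values, sample))

scores = [2, 4, 4, 4, 5, 5, 7, 9]
mean_score = sum(scores) / len(scores)
# Raw deviations cancel out, which is why they are squared first
raw_deviation_total = sum(v - mean_score for v in scores)
```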
A calculated summary is described as resistant if… (x2)
outliers only have a small effect on it
eg median and IQRs
Timeplot (x2)
a display of values (y) against time (x)
discern patterns by applying the lowess method - makes a smooth trace line of best fit
Moving average (plus method x2)
method for smoothing timeplots to identify trends
find the average value for a given time window, then move the window along by one timepoint and take a new average
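The window method above, sketched in Python. The function name, `window` parameter and values are illustrative only.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values, sliding one timepoint at a time."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]

smoothed = moving_average([10, 12, 14, 13, 15, 17], window=3)
```

The smoothed series is shorter than the original by `window - 1` points.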
Exponential smoothing
method for smoothing timeplots to identify trends
more sophisticated than moving average method
gives more weight to recent values, and less as they recede into the past
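One common form is simple exponential smoothing, sketched here in Python. The smoothing weight `alpha` and starting the series at the first observation are assumptions of this sketch, not something the cards specify.

```python
def exponential_smoothing(values, alpha):
    """Each smoothed point mixes the newest value with the previous smoothed point."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be greater than 0 and at most 1")
    smoothed = [values[0]]  # assumes a non-empty series; start at the first value
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

trend = exponential_smoothing([10, 20, 30], alpha=0.5)
```

A larger `alpha` puts more weight on the most recent value; older values fade by a factor of `1 - alpha` at each step.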
Re-expressing/transforming data is… (x3)
applying a simple function to make a skewed distribution more symmetrical
enables better use of centre and spread distribution descriptors
can facilitate the comparison of groups with very different distributions of scores
Rules of thumb for transformations of skewed data (x2)
variables that skew to the right are often helped by square roots, logs or reciprocals
Skew to the left often helped by squaring the data
When comparing distributions consider their… (x3)
shape
centre
spread
When comparing boxplots, consider their…(x4)
shape - symmetric, skewed, diffs between groups
medians - which group has higher centre, any pattern to medians
IQRs - groups with more spread, patterns to change in IQRs
outliers - identify, consider, check for errors
For outliers, consider… (x2)
context - what is extreme in one context may be normal in another
Order of median, mode and mean in a positive/right skewed distribution
mean>median>mode
Order of median, mode and mean in a negative/left skewed distribution
mean<median<mode
Positive skew is… (x2)
skew to the right,
ie longer tail to the right
Negative skew is… (x2)
skew to the left,
ie longer tail to the left
How do you standardise a value? (calculate a z-score) (x2)
Subtract the mean from the value,
Divide the difference by the standard deviation
What does a z-score represent?
the distance of a value from the mean in standard deviations
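The two steps for standardising, as a short Python sketch. The mean of 100 and SD of 15 are made-up example numbers.

```python
def z_score(value, mean, sd):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / sd

z_high = z_score(130, mean=100, sd=15)  # above the mean, so positive
z_low = z_score(85, mean=100, sd=15)    # below the mean, so negative
```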
Greek letters are used for…
model parameters
Latin letters are used for…
statistics
What is the standard normal model/distribution? (x2)
a normal distribution with mean = 0 and SD = 1
ie after you’ve standardised/calculated z-scores
Nearly normal data condition, and how to check
shape of distribution is unimodal and symmetric
check with histogram or Normal probability plot
How much of a normal distribution falls within 1, 2 and 3 SDs of the mean?
68%
95%
99.7%
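These figures are rounded. A quick check in Python using the standard library's `NormalDist` (Python 3.8+); the helper name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1_sd = share_within(1)  # roughly 0.68
within_2_sd = share_within(2)  # roughly 0.95
within_3_sd = share_within(3)  # roughly 0.997
```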
Shifting a distribution… (x2)
is adding a constant to each value,
does not change SD or IQR
Rescaling a distribution…(x2)
is multiplying each value by a constant
also multiplies mean, median, quartiles, SD and IQR by the constant
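A Python sketch comparing the two operations on made-up heights. The quartile rule ("inclusive") and the sample SD are choices made for this sketch.

```python
import statistics

def summarise(values):
    """Centre and spread summaries for a list of values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

heights_cm = [150, 160, 165, 170, 180]
original = summarise(heights_cm)
shifted = summarise([v + 10 for v in heights_cm])   # add a constant
rescaled = summarise([v * 10 for v in heights_cm])  # multiply by a constant (cm to mm)
```

Shifting moves the mean and median by the constant but leaves SD and IQR alone; rescaling multiplies all four.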
Parameter
a numerically valued attribute of a model
Statistic
a value calculated to summarise data
Normal percentile
the percentage of values in a normal model found at or below a given z-score
Normal probability plot (x3)
plots actual vs expected score
if roughly straight, distribution is nearly normal
Called P-P plots in SPSS
σ (x2)
sigma
standard deviation
μ (x2)
mu (pronounced 'mew')
mean
N(μ, σ) (x2)
Normal model
Parameters are mean and SD
Formula for finding a value from a z-score
y = μ + z * σ
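A Python sketch of the formula above, plus the link between z-scores and Normal percentiles via `statistics.NormalDist` (Python 3.8+). The function names and the example mean and SD are illustrative.

```python
from statistics import NormalDist

def value_from_z(z, mean, sd):
    """Undo standardising: y = mean + z * sd."""
    return mean + z * sd

def normal_percentile(z):
    """Percentage of a standard normal model at or below z."""
    return 100 * NormalDist().cdf(z)

def z_from_percentile(percent):
    """z-score that leaves the given percentage of the model below it."""
    return NormalDist().inv_cdf(percent / 100)

value = value_from_z(1.5, mean=100, sd=15)
```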
Scatterplot (and how to describe x4)
dot point graph of two variables on x and y axes
describe with positive/negative direction/trend,
form/shape of dots (straight, curved, no pattern?),
strength of relationship (how close together dots are) and
unusual features/outliers
How to choose x and y axis for scatterplot vars?
put the variable of interest (DV), that you want to predict and responds to levels of the other var, on y-axis
put the explanatory or predictor var (IV) on x-axis
What assumptions/conditions must be met before using a correlation? (x3)
quantitative variables condition - can’t use categorical data
straight enough condition - check the scatter plot for linear relationship
no outliers condition - can distort strength or direction of a correlation
What is Spearman’s Rho (ρ) useful for?
Calculating non-parametric association (correlation) when the relationship is not straight enough or has outliers
What is Kendall’s tau (τ) useful for? (plus eg x1)
Calculating trend (monotonic relationship - correlation) when relationship is not linear eg when data not truly quantitative
What is a lurking variable?
A hidden variable that influences both variables in our relationship/correlation
Transformation through squaring is useful when (x2)
unimodal distribution is skewed to the left
scatterplot bends downwards
Transformation through finding the root is useful when
data is a count of something
Transformation using log is useful when (plus one note)
measurements cannot be negative, or grow by percentage increases
nb, if there are zeros in the data try adding a small constant first
Transformation through negative reciprocal square root (-1 divided by the root of y) is useful when
you want to preserve the direction of the relationship
Transformation through negative reciprocal (-1 divided by y) is useful when (plus one note)
your data is the ratio of two quantities, eg miles per hour
nb, if there are zeros in the data try adding a small constant first
What is the ladder of powers? (x2, plus the 6 steps)
orders transformations by the strength of their effect on the data
if a transformation makes the data worse, move in the other direction on the ladder
Power 2 - squaring the data
Power 1 - no change, going further down or up from here increases effect
Power 1/2 - square root
Power 0 - we place log in this spot
Power -1/2 - negative reciprocal root (-1 over root of y)
Power -1 - negative reciprocal (-1 over y)
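The six rungs above, sketched as one Python function. It assumes strictly positive data and uses base-10 logs at the "power 0" rung (the choice of base is an assumption; it only changes the scale). Names are illustrative.

```python
import math

def re_express(y, power):
    """Re-express a positive value y at one rung of the ladder of powers."""
    if y <= 0:
        raise ValueError("this sketch assumes strictly positive values")
    if power == 0:
        return math.log10(y)   # log sits in the 'power 0' spot
    if power < 0:
        return -(y ** power)   # eg -1/sqrt(y) or -1/y; the minus sign keeps the order
    return y ** power          # eg y squared, y itself, sqrt(y)

ladder = [2, 1, 0.5, 0, -0.5, -1]
rungs = {power: re_express(4, power) for power in ladder}
```

Because of the minus sign on the negative rungs, larger values of y still map to larger re-expressed values on every rung.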
What is y-hat (ŷ)?
the value predicted by a regression equation/line of best fit
What is a residual (in regression)?
the difference between predicted (y-hat) and observed/actual (y) value
residual = observed value - predicted value
The least squares line is…
the line of best fit in regression/scatterplot
the line for which the sum of the squared residuals is smallest
Why must residuals be squared when calculating line of best fit/least squares?
because some of them will be negative, so positive and negative residuals would cancel each other out if simply added
What does b represent in the linear model?
coefficients
What is the slope in the linear model (x2, plus notation)?
always measured/interpreted as units of y per unit of x
how rapidly y-hat responds to changes in x
b1 (1 is subscript)
What is the intercept in the linear model (x2, plus notation)?
where the line hits the y-axis
the starting point/baseline for our predictions
b0 (0 is subscript)
Equation for the linear model? (notation and in words)
y-hat = b0 + b1x
predicted y = intercept plus slope times x
What is the equation for finding slope in linear regression? (notation and in words)
b1 = r × (SDy / SDx)
slope = correlation times (standard deviation of y over the standard deviation of x)
What is the equation for finding the intercept in linear regression? (notation and in words)
b0 = mean of y - (b1 × mean of x)
intercept = the mean of y minus (the slope times the mean of x)
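A Python sketch that builds the linear model from the slope and intercept formulas, using only the standard library. The data (hours studied, test score) and all function names are made up for illustration; sample SDs (dividing by n - 1) are assumed.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """r as the sum of products of z-scores, divided by n - 1."""
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    z_products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(z_products) / (len(x) - 1)

def linear_model(x, y):
    """Return (b0, b1) for y-hat = b0 + b1 * x."""
    b1 = correlation(x, y) * sd(y) / sd(x)  # slope = r * SDy / SDx
    b0 = mean(y) - b1 * mean(x)             # intercept = mean of y - slope * mean of x
    return b0, b1

hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
b0, b1 = linear_model(hours, score)
predicted = b0 + b1 * 6  # y-hat for x = 6
```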
Define regression
the linear model fit by least squares
What are the conditions/assumptions that must be met before we can use regression?
same as for correlation:
quantitative data
straight enough relationship
no outliers
Explain regression to the mean (x3)
In standard deviation units, the predicted y is never further from its mean than x was from its mean
because equation for predicting z-scores is z-hat of y = r times the z of x
and r can only be between -1 and 1
What is the slope of a line of best fit for the z-scores of any two variables?
r (the correlation coefficient)
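A numerical check in Python that the least squares slope for standardised variables equals r. The data and function names are made up, and sample SDs are assumed.

```python
import math

def standardise(values):
    """Convert values to z-scores using the sample SD."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def least_squares_slope(x, y):
    """Slope of the least squares line of y on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def correlation(x, y):
    zx, zy = standardise(x), standardise(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
z_slope = least_squares_slope(standardise(x), standardise(y))
predicted_z_y = r * 2.0  # prediction for a case 2 SDs above the mean of x
```

Since r is between -1 and 1, `predicted_z_y` can never be more than 2 SDs from the mean of y, which is regression to the mean.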
Formula for standard deviation of the residuals
find the root of (sum of error squared over (n - 2))
What does R-squared represent?
the proportion of the variation in y accounted for by the linear model
What does 1 - R-squared represent?
the proportion of the variation in y not accounted for by the linear model (the residuals/error)
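A Python sketch tying together residuals, the standard deviation of the residuals and R-squared. The line is fitted with the direct least squares formula (equivalent to slope = r × SDy / SDx); the data and names are made up for illustration.

```python
import math

def fit_least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def model_summary(x, y):
    b0, b1 = fit_least_squares(x, y)
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]  # observed - predicted
    sum_sq_residuals = sum(e ** 2 for e in residuals)
    mean_y = sum(y) / len(y)
    total_variation = sum((b - mean_y) ** 2 for b in y)
    return {
        "residuals": residuals,
        "residual_sd": math.sqrt(sum_sq_residuals / (len(x) - 2)),
        "r_squared": 1 - sum_sq_residuals / total_variation,
    }

summary = model_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
unexplained = 1 - summary["r_squared"]
```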
Conditions that must be met for regression (x4)
quantitative variables condition
For both data and residuals, check:
straight enough condition (linear relationship on scatterplot)
does the plot thicken? condition (even scatter around the line of best fit, or across scatterplot for residuals)
outlier condition (investigate any - they strongly affect r)
Assumptions of the linear model (x4)
variables are quantitative
their relationship is linear
error is approximately normally distributed
variance of the error is constant
Leverage in regression models refers to…
the fact that the further a given point is from the mean of x, the more strongly it pulls on the regression line
A data point is ‘influential’ in regression models if…
removing it from the analysis makes a meaningful difference to the model
Best graphical display for exploring two categorical variables
Two-way table
Best graphical display for exploring two quantitative variables
Scatterplot
Best graphical display for exploring one categorical and one quantitative variable
Boxplot