STA8170 Flashcards

1
Q

Data

A

systematically recorded values (numbers or labels) together with their context

2
Q

Categorical/qualitative variable

A

variable that names categories with words or numbers

3
Q

Context (info required for?) (x6)

A
who was measured
what was measured 
how data was collected
where data was collected
when and why study was done
4
Q

Rows in a data table hold…

A

individual cases, eg respondents, participants, subjects, units, records

5
Q

Columns in a data table hold…

A

variables that give info about each individual case

6
Q

Quantitative variable

A

an amount or degree, measured in meaningful numbers eg scale

7
Q

Identifiers

A

variable that assigns unique value to each individual/case - cannot be analysed

8
Q

Relational database

A

a large database that links data tables together by matching identifiers

9
Q

Ordinal variable

A

categorical variable with ordering of values

10
Q

Data table

A

an arrangement of data in which each row represents a case, and each column a variable

11
Q

Case

A

individual about whom/which we have data

12
Q

Record

A

info about an individual/case in a database

13
Q

Sample (x2)

A

representative subset of population

analysed to estimate/learn about the population

14
Q

Population

A

the collection of all individuals, items or objects of interest

15
Q

Nominal variable

A

variable whose values are only names of categories

16
Q

Units

A

quantity or amount used as standard of measurement

17
Q

Parameter (and greek letter)

A

any numerical characteristic of a population - written with a Greek letter, eg μ (mu, pronounced 'mew') for the mean

18
Q

Distribution (x2)

A

description of all the values a variable can take, and how often those values occur

19
Q

Three important things pictures can do in data analysis?

A

reveal things not able to be seen in data tables, helping to think about patterns/relationships
show important features in the data
tell others about the data

20
Q

Area principle (for graphing data)

A

the area occupied by a part of the graph should correspond to the magnitude of value it represents

21
Q

Frequency table (x3)

A

organises the cases according to the categories of a variable, with a count for each
rows are category names
also records totals
describes the distribution of a categorical variable

22
Q

Relative frequency table (x2)

A

displays percentages, rather than counts, of values in each category
describes the distribution of a categorical variable
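
A minimal Python sketch of a frequency table and its relative frequency version. The eye-colour values are made up purely for illustration; only the counting logic matters.

```python
from collections import Counter

# Hypothetical categorical data: one value per case.
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

counts = Counter(eye_colour)        # frequency table: category -> count
total = sum(counts.values())        # the total recorded with the table
relative = {category: 100 * n / total for category, n in counts.items()}

# Print one row per category: name, count, percentage.
for category in counts:
    print(f"{category:<6} {counts[category]:>3} {relative[category]:>6.1f}%")
print(f"{'total':<6} {total:>3} {sum(relative.values()):>6.1f}%")
```

The percentages always add to 100, which is a quick check that every case landed in exactly one category.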

23
Q

Bar chart (x3)

A

Display distribution of a categorical variable
Categories on the x, counts on the y
spaces between the bars indicate that freestanding bars can be placed in any order

24
Q

Relative frequency bar chart

A

shows the percentage/proportion of values (y) falling under each category (x)

25
Q

Pie charts are used to display…?

Plus one disadvantage

A

categorical data

visual comparisons between categories are more difficult than in eg a bar chart

26
Q

Contingency table

A

table showing how cases are distributed along each variable, dependent on the value of the other variable

27
Q

Marginal distribution

A

the totals displayed (as counts or %) in the bottom row and last column of contingency tables

28
Q

Conditional distribution

A

shows the distribution of one variable for just those cases that satisfy a condition on another variable

29
Q

Independent variables in a contingency table are when… (x2)

A

the distribution of one variable is the same for all categories of another
ie there is no association between them
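
The marginal, conditional and independence ideas from the contingency table cards can be sketched in Python as below. The group labels and counts are invented for illustration, and the exact-equality independence check is a simplification: with real data you would look for conditional distributions that are roughly, not exactly, the same.

```python
# Hypothetical contingency table: rows are groups, columns are responses.
table = {
    "group A": {"yes": 20, "no": 10},
    "group B": {"yes": 10, "no": 20},
}

# Marginal distributions: the row and column totals.
row_totals = {row: sum(cols.values()) for row, cols in table.items()}
col_totals = {}
for cols in table.values():
    for col, n in cols.items():
        col_totals[col] = col_totals.get(col, 0) + n

# Conditional distributions: the response proportions within each group.
conditional = {
    row: {col: n / row_totals[row] for col, n in cols.items()}
    for row, cols in table.items()
}

# Independence would mean every group has the same conditional distribution.
distributions = list(conditional.values())
independent = all(d == distributions[0] for d in distributions)
print(row_totals, col_totals, independent)
```

Here the proportion answering "yes" differs between the groups, so the two variables are associated rather than independent.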

30
Q

Histogram (x3)

A

Bar chart for quantitative data
Counts (y) grouped into bins (x) that make up the bars
No gaps between bars - or gap indicates no values for that bin

31
Q

Relative frequency histogram

A

Use percentage on y-axis instead of counts

32
Q

Stem and leaf plot (x3)

A

Similar to histogram, but shows the individual values
Useful for doing by hand or in Word, for <100 values
Stem values on the vertical axis, leaves across the horizontal

33
Q

Dotplots

A

Like a stem and leaf, but with dots

Can be vertical (like stem plot) or horizontal

34
Q

Categorical data condition (for deciding on how to display data) (x2)

A

Data is counts or percentages of individual cases in categories
Categories do not overlap

35
Q

Quantitative data condition (for deciding on how to display data)

A

Data are values of a quantitative variable whose units are known

36
Q

Four components for descriptions of distribution (plus egs)

that mean you should be able to…

A

shape - symmetry, skew, gaps
outliers
centre - median
spread - range, interquartile range

roughly sketch the distribution

37
Q

Modes (plus 3 types)

A

the peaks in distributions
unimodal
bimodal
multimodal

38
Q

A distribution with no modes is described as…

A

uniform

39
Q

Skew (x2)

A

a distribution with longer tail on one side

skew is described as to the side with the longer tail

40
Q

Median (x3, plus how to find, x2)

A

the middle value that divides a histogram into two equal areas
appropriate description of centre for skewed distributions or with outliers
always pair with the IQR
if n is odd, median is the middle value
if n is even, median is the average of the two middle values
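
The odd/even rule for finding the median can be written as a short Python function. This is a sketch with a hypothetical function name, not a replacement for a statistics package.

```python
def median(values):
    """Return the middle value of the data once it is put in order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n is odd: the median is the single middle value.
        return ordered[mid]
    # n is even: the median is the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))      # odd n
print(median([4, 1, 3, 2]))   # even n
```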

41
Q

Range

A

difference between min and max values in a distribution

42
Q

Quartile

A

the dividing points that split the ordered values/cases in a distribution into four equal-sized groups

43
Q

Interquartile range (x2)

A

= upper quartile - lower quartile

the range covered by the middle 50% of the data, between the 25th and 75th percentiles

44
Q

Percentile (plus eg x1)

A

the value that leaves that percentage of data below it

eg, 25th percentile has 25% of data below it

45
Q

Five number summaries of distribution include…

A
minimum
q1
median
q3 
maximum
46
Q

Boxplots (x7)

A

display of the five number summary
vertical axis from min to max of data
box around q1 and q3
horizontal line inside box at the median
‘fences’ at 1.5 IQRs beyond lower and upper quartiles (not displayed, just for working)
whiskers from box to most extreme data values found within the fences
add dots for any values found outside the fences
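
The boxplot working can be sketched in Python as follows. One assumption to flag: quartiles here are the medians of the lower and upper halves of the ordered data (leaving out the overall median when n is odd). Textbooks and software use several slightly different quartile conventions, so results may differ a little. The data values are made up.

```python
def median(values):
    """Middle value of the ordered data (average of the middle two if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[n // 2 + n % 2:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical values
minimum, q1, med, q3, maximum = five_number_summary(data)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr          # fences are for working only, not drawn
upper_fence = q3 + 1.5 * iqr

inside = [v for v in data if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))  # most extreme values within the fences
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(five_number_summary(data), whiskers, outliers)
```

With these values the upper whisker stops at 8 and the value 30 is drawn as a separate dot.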

47
Q

Mean (x4)

A

average of all values in a distribution
appropriate description of centre for roughly symmetrical/normal data sets
always pair with SD
notation - a bar above the symbol, eg ū = the mean of u, pronounce u-bar

48
Q

Standard deviation (x3)

A

describes the spread of a distribution
square root of the average of the squared deviations of each value from the mean
(average of deviations would cancel each other out)

49
Q

Variance

A

the average of the squared deviations of each value from the mean
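
A Python sketch of variance and standard deviation, following the "average of squared deviations" wording of the cards. One caveat, stated as an assumption about what the course intends: for sample data the usual formula divides by n - 1 rather than n, so the sketch offers both through a hypothetical `sample` flag.

```python
import math


def variance(values, sample=True):
    """Average squared deviation from the mean (divide by n - 1 for a sample)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


def standard_deviation(values, sample=True):
    """Square root of the variance, so it is in the same units as the data."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5
mean = sum(data) / len(data)

# The raw deviations cancel out, which is why they are squared first.
print(sum(v - mean for v in data))
print(variance(data, sample=False), standard_deviation(data, sample=False))
print(variance(data), standard_deviation(data))
```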

50
Q

A calculated summary is described as resistant if… (x2)

A

outliers only have a small effect on it

eg median and IQR

51
Q

Timeplot (x2)

A

a display of values (y) against time (x)

discern patterns by applying the lowess method - makes a smooth trace line of best fit

52
Q

Moving average (plus method x2)

A

method for smoothing timeplots to identify trends

find the average value for a given time window, then move the window along by one timepoint and take a new average
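
The sliding-window method can be sketched in a few lines of Python. The function name and the example series are hypothetical.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values, moving along one timepoint at a time."""
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


series = [1, 2, 3, 4, 5]          # made-up values at successive timepoints
print(moving_average(series, 3))  # one smoothed value per window position
```

Note that the smoothed series is shorter than the original by window - 1 values.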

53
Q

Exponential smoothing

A

method for smoothing timeplots to identify trends
more sophisticated than moving average method
gives more weight to recent values, and less as they recede into the past
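
The card does not give a formula, so the sketch below assumes the common "simple exponential smoothing" form: each smoothed value is alpha times the newest observation plus (1 - alpha) times the previous smoothed value. The name `alpha` and starting the smoothed series at the first observation are assumptions.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing with weight `alpha` (between 0 and 1) on the newest value."""
    smoothed = [values[0]]                      # assumed starting point
    for value in values[1:]:
        previous = smoothed[-1]
        # Recent values get weight alpha; older ones fade by (1 - alpha) each step.
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed


print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger alpha tracks recent values more closely; a smaller alpha gives a smoother trace.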

54
Q

Re-expressing/transforming data is… (x3)

A

applying a simple function to make a skewed distribution more symmetrical
enables better use of centre and spread distribution descriptors
can facilitate the comparison of groups with very different distributions of scores

55
Q

Rules of thumb for transformations of skewed data (x2)

A

variables that skew to the right often helped by square roots, logs, reciprocal
Skew to the left often helped by squaring the data

56
Q

When comparing distributions consider their… (x3)

A

shape
centre
spread

57
Q

When comparing boxplots, consider their…(x4)

A

shape - symmetric, skewed, diffs between groups
medians - which group has higher centre, any pattern to medians
IQRs - groups with more spread, patterns to change in IQRs
outliers - identify, consider, check for errors

58
Q

For outliers, consider… (x2)

A

context - what is extreme in one context may be normal in another

59
Q

Order of median, mode and mean in a positive/right skewed distribution

A

mean>median>mode

60
Q

Order of median, mode and mean in a negative/left skewed distribution

A

mean<median<mode

61
Q

Positive skew is… (x2)

A

skew to the right,

ie longer tail to the right

62
Q

Negative skew is… (x2)

A

skew to the left,

ie longer tail to the left

63
Q

How do you standardise a value? (calculate a z-score) (x2)

A

Subtract the mean from the value,

Divide the difference by the standard deviation

64
Q

What does a z-score represent?

A

the distance of a value from the mean in standard deviations
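
Standardising is a one-line calculation. A Python sketch, with made-up numbers (a scale with mean 100 and SD 15):

```python
def z_score(value, mean, sd):
    """Distance of `value` from the mean, measured in standard deviations."""
    return (value - mean) / sd


print(z_score(130, mean=100, sd=15))   # two SDs above the mean
print(z_score(85, mean=100, sd=15))    # one SD below the mean
```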

65
Q

Greek letters are used for…

A

model parameters

66
Q

Latin letters are used for…

A

statistics

67
Q

What is the standard normal model/distribution? (x2)

A

a normal distribution with mean = 0 and SD = 1

ie after you’ve standardised/calculated z-scores

68
Q

Nearly normal data condition, and how to check

A

shape of distribution is unimodal and symmetric

check with histogram or Normal probability plot

69
Q

How much of a normal distribution falls within 1, 2 and 3 SDs of the mean?

A

68%
95%
99.7%
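
These three percentages can be checked numerically. The Python sketch below uses the standard result that, for a normal model, the proportion within k SDs of the mean equals erf(k / sqrt(2)); the function name is hypothetical.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, f"{100 * proportion_within(k):.1f}%")
```

The exact figures are close to, but not exactly, 68%, 95% and 99.7%, which is why the rule is treated as a rule of thumb.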

70
Q

Shifting a distribution… (x2)

A

is adding a constant to each value,

does not change SD or IQR

71
Q

Rescaling a distribution…(x2)

A

is multiplying each value by a constant

also multiplies mean, median, quartiles, SD and IQR by the constant
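
A quick Python check of the shifting and rescaling rules, using the standard library `statistics` module on made-up data. The quartiles come from `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
import statistics


def summarise(values):
    """Return centre and spread measures for a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


data = [1, 2, 3, 4, 5]
original = summarise(data)
shifted = summarise([v + 10 for v in data])   # add a constant to each value
rescaled = summarise([v * 2 for v in data])   # multiply each value by a constant

print(original)
print(shifted)    # centre moves by 10, spread is unchanged
print(rescaled)   # centre and spread are both doubled
```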

72
Q

Parameter

A

a numerically valued attribute of a model

73
Q

Statistic

A

a value calculated to summarise data

74
Q

Normal percentile

A

the percentile that corresponds to a z-score - gives the percentage of values found at that z-score and below

75
Q

Normal probability plot (x3)

A

plots actual vs expected score
if straight, distribution is approximately normal
Called P-P plots in SPSS

76
Q

σ (x2)

A

sigma

standard deviation

77
Q

μ (x2)

A

mu (pronounced 'mew')

mean

78
Q

N(μ, σ) (x2)

A

Normal model

Parameters are mean and SD

79
Q

Formula for finding a value from a z-score

A

y = μ + z * σ
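
This formula simply reverses standardising. A small Python sketch with made-up parameter values shows the round trip:

```python
def value_from_z(z, mu, sigma):
    """Convert a z-score back to the original scale: y = mu + z * sigma."""
    return mu + z * sigma


mu, sigma = 100, 15                  # hypothetical model parameters
y = value_from_z(2, mu, sigma)
print(y)                             # the value two SDs above the mean
print((y - mu) / sigma)              # standardising y recovers the z-score
```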

80
Q

Scatterplot (and how to describe x4)

A

dot point graph of two variables on x and y axes
describe with positive/negative direction/trend,
form/shape of dots (straight, curved, no pattern?),
strength of relationship (how close together dots are) and
unusual features/outliers

81
Q

How to choose x and y axis for scatterplot vars?

A

put the response variable (DV) - the variable of interest that you want to predict, and that responds to levels of the other var - on the y-axis
put the explanatory or predictor var (IV) on x-axis

82
Q

What assumptions/conditions must be met before using a correlation? (x3)

A

quantitative variables condition - can’t use categorical data
straight enough condition - check the scatter plot for linear relationship
no outliers condition - can distort strength or direction of a correlation

83
Q

What is Spearman’s Rho (ρ) useful for?

A

Calculating non-parametric association (correlation) when the relationship is not straight enough or has outliers

84
Q

What is Kendall’s tau (τ) useful for? (plus eg x1)

A
Calculating trend (monotonic relationship - correlation) when relationship is not linear
eg when data not truly quantitative
85
Q

What is a lurking variable?

A

A hidden variable that influences both variables in our relationship/correlation

86
Q

Transformation through squaring is useful when (x2)

A

unimodal distribution is skewed to the left

scatterplot bends downwards

87
Q

Transformation through finding the root is useful when

A

data is a count of something

88
Q

Transformation using log is useful when (plus one note)

A

measurements cannot be negative, or grow by percentage increases
nb, if there are zeros in the data try adding a small constant first

89
Q

Transformation through negative reciprocal square root (-1 divided by the root of y) is useful when

A

you want to preserve the direction of the relationship

90
Q

Transformation through negative reciprocal (-1 divided by y) is useful when (plus one note)

A

your data is the ratio of two quantities, eg miles per hour

nb, if there are zeros in the data try adding a small constant first

91
Q

What is the ladder of powers? (x2, plus the 6 steps)

A

orders transformations by the strength of their effect on data
if a transformation makes the data worse, move in the other direction on the ladder
Power 2 - squaring the data
Power 1 - no change, going further down or up from here increases effect
Power 1/2 - square root
Power 0 - we place log in this spot
Power -1/2 - negative reciprocal root (-1 over root of y)
Power -1 - negative reciprocal (-1 over y)
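
The six rungs can be sketched as a single Python function. Two assumptions: the card just says "log", so base 10 is used here (any base gives the same shape), and the function name is hypothetical. Values must be positive for the log, root and reciprocal rungs.

```python
import math


def re_express(y, power):
    """Apply the ladder-of-powers transformation for the given power to a positive value."""
    if power == 0:
        return math.log10(y)          # log takes the place of power 0
    if power < 0:
        return -1 / (y ** -power)     # negative sign preserves the direction of the data
    return y ** power


for power in (2, 1, 0.5, 0, -0.5, -1):
    print(power, re_express(4, power))
```

Because of the leading minus sign on the negative rungs, larger original values still give larger transformed values.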

92
Q

What is y-hat (y ̂ )?

A

the value predicted by a regression equation/line of best fit

93
Q

What is a residual (in regression)?

A

the difference between predicted (y-hat) and observed/actual (y) value
residual = observed value - predicted value

94
Q

The least squares line is…

A

the line of best fit in regression/scatterplot

the line for which the sum of the squared residuals is smallest

95
Q

Why must residuals be squared when calculating line of best fit/least squares?

A

because some of them will be negative, so positive and negative residuals would cancel each other out if simply added

96
Q

What does b represent in the linear model?

A

coefficients

97
Q

What is the slope in the linear model (x2, plus notation)?

A

always measured/interpreted as units of y per unit of x
how rapidly y-hat responds to changes in x
b1 (1 is subscript)

98
Q

What is the intercept in the linear model (x2, plus notation)?

A

where the line hits the y-axis
the starting point/baseline for our predictions
b0 (0 is subscript)

99
Q

Equation for the linear model? (notation and in words)

A

y-hat = b0 + b1x

predicted y = intercept plus slope times x

100
Q

What is the equation for finding slope in linear regression? (notation and in words)

A
b1 = r × (SDy / SDx)
slope = correlation times (standard deviation of y over the standard deviation of x)
101
Q

What is the equation for finding the intercept in linear regression? (notation and in words)

A
b0 = mean of y - (b1 × mean of x)
intercept = the mean of y minus the (slope times the mean of x)
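
The slope and intercept formulas from the two cards above can be sketched in Python. The data are made up, the function names are hypothetical, and r is computed as the sum of the products of z-scores divided by n - 1 (using sample SDs), which is an assumption about the convention the course uses.

```python
import statistics


def correlation(x, y):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)
    products = [((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)]
    return sum(products) / (n - 1)


def fit_line(x, y):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    b1 = correlation(x, y) * statistics.stdev(y) / statistics.stdev(x)   # slope
    b0 = statistics.mean(y) - b1 * statistics.mean(x)                    # intercept
    return b0, b1


x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values
b0, b1 = fit_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f}x")
```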
102
Q

Define regression

A

the linear model fit by least squares

103
Q

What are the conditions/assumptions that must be met before we can use regression?

A

same as for correlation:
quantitative data
straight enough relationship
no outliers

104
Q

Explain regression to the mean (x3)

A

You can never predict that y will be further away from the mean than x was
because equation for predicting z-scores is z-hat of y = r times the z of x
and r can only be between -1 and 1
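
In Python the z-score prediction is a single multiplication, which makes the shrinking toward the mean easy to see. The function name and the example values of r are hypothetical.

```python
def predicted_z_of_y(z_x, r):
    """Predicted z-score of y from the z-score of x: z-hat of y = r * z of x."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * z_x


# A case 2 SDs above the mean on x, under different strengths of correlation.
for r in (1.0, 0.5, 0.0, -0.5):
    print(r, predicted_z_of_y(2.0, r))
```

Unless r is exactly 1 or -1, the prediction for y is always closer to the mean (in SDs) than x was.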

105
Q

What is the slope of a line of best fit for the z-scores of any two variables?

A

r (the correlation coefficient)

106
Q

Formula for standard deviation of the residuals

A

find the root of (sum of error squared over (n - 2))

107
Q

What does R-squared represent?

A

the proportion of the variation in y accounted for by the linear model

108
Q

What does 1 - R-squared represent?

A

the proportion of the variation in y not accounted for by the linear model (the residuals/error)
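
A Python sketch tying together residuals, the standard deviation of the residuals and R-squared on made-up data. The line is fitted here with the equivalent least squares form of the slope (sum of cross-deviations over sum of squared x-deviations), and all variable names are hypothetical.

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical explanatory values
y = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
b1 = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      / sum((a - mean_x) ** 2 for a in x))
b0 = mean_y - b1 * mean_x

predicted = [b0 + b1 * a for a in x]                       # y-hat for each case
residuals = [obs - pred for obs, pred in zip(y, predicted)]  # observed - predicted

sum_squared_error = sum(e ** 2 for e in residuals)
residual_sd = math.sqrt(sum_squared_error / (n - 2))       # SD of the residuals

total_variation = sum((b - mean_y) ** 2 for b in y)
r_squared = 1 - sum_squared_error / total_variation        # share accounted for by the model

print(residual_sd, r_squared, 1 - r_squared)
```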

109
Q

Conditions that must be met for regression (x4)

A

quantitative variables condition
For both data and residuals, check:
straight enough condition (linear relationship on scatterplot)
does the plot thicken? condition (even scatter around the line of best fit, or across scatterplot for residuals)
outlier condition (investigate any - they strongly affect r)

110
Q

Assumptions of the linear model (x4)

A

variables are quantitative
their relationship is linear
error is approximately normally distributed
variance of the error is constant

111
Q

Leverage in regression models refers to…

A

the fact that the further a given point is from the mean of x, the more strongly it pulls on the regression line

112
Q

A data point is ‘influential’ in regression models if…

A

removing it from the analysis makes a meaningful difference to the model

113
Q

Best graphical display for exploring two categorical variables

A

Two-way table

114
Q

Best graphical display for exploring two quantitative variables

A

Scatterplot

115
Q

Best graphical display for exploring one categorical and one quantitative variable

A

Boxplot