Analysis of a Single Variable (Exam 1) Flashcards
what are the 4 steps in a statistical analysis?
1) identify a population of interest and a question about that population
2) collect sample data from the population
3) perform preliminary data analysis to summarize data (descriptive statistics)
4) draw conclusions from data (inferential statistics)
distribution
- tells us what values a variable takes and how often it takes those values
- ways of describing a variable’s distribution (graphically/numerically) from a sample depend on the variable type (categorical/quantitative)
categorical variables
variables that put the individual into one of several groups/categories
distribution of a categorical variable
- lists the categories and gives either the count or percentage of individuals who fall in each category
- 2 methods:
- pie chart, bar graph
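A minimal sketch of tabulating a categorical distribution (counts and percentages per category). The sample data and the helper name `categorical_distribution` are made up for illustration; they are not from the course material.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical sample: blood types of 10 individuals
sample = ["A", "O", "O", "B", "A", "O", "AB", "O", "A", "O"]
dist = categorical_distribution(sample)

for cat, (n, pct) in sorted(dist.items()):
    print(f"{cat:>2}: count={n}, percent={pct:.0f}%")
```

Because every individual is counted exactly once, the percentages add up to 100%, which is the condition a pie chart needs.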
pie chart
- shows how a whole group (the variable) is subdivided into smaller groups (categories)
- the size of the slice is proportional to the fraction of the sample/population in that category
- the percentages shown by the slices must add up to 100% (every individual must be represented, so an “other” category may be needed)
bar graph
- represents each category as a bar
- the height of each bar shows the category's count or %
- doesn't have to represent every individual in the sample
- to transform it into a pie chart, we’d have to know the “other” category to make it add up to 100%
- has space between the bars because the categories have no inherent order and aren't directly adjacent to each other
distribution of a quantitative variable
- tells us what values the variable takes and how often it takes these values
- 4 methods:
- histogram, stemplot, boxplot, time plot
quantitative variables
variables that take values for which arithmetic operations make sense
histogram
- a graph of the distribution of a quantitative variable whose values are grouped together
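- take the full range of values and divide it up into classes (intervals), usually of equal width
- bars are directly adjacent to each other (touch) because the classes are adjacent and ordered
- every observation falls into exactly one class (nobody is excluded)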
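A rough sketch of the grouping step behind a histogram, assuming equal-width classes. The scores and the function name `histogram_counts` are hypothetical, and the sketch assumes the data are not all the same value (otherwise the class width would be zero).

```python
def histogram_counts(data, num_classes):
    """Split the full range into equal-width classes and count values in each."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes
    counts = [0] * num_classes
    for x in data:
        # The maximum would fall just past the last class, so clamp it into the last class.
        idx = min(int((x - lo) / width), num_classes - 1)
        counts[idx] += 1
    return lo, width, counts

# Hypothetical exam scores
scores = [52, 55, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 91, 100]
lo, width, counts = histogram_counts(scores, 4)

# Text rendering: one row per class, one '#' per observation.
# The last class also includes its upper endpoint (the maximum).
for i, n in enumerate(counts):
    start = lo + i * width
    print(f"[{start:5.1f}, {start + width:5.1f}) {'#' * n}")
```

The counts always sum to the number of observations, which is the "nobody is excluded" property.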
stemplot
- “sideways” histogram that shows the actual numbers of the distribution
- stem: consists of all but the final digit
- leaf: the final digit
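- good for small datasets (fewer than about 40-50 observations)
- unlike histograms, stemplots show the actual values of the data
- less flexible than histograms (the classes are fixed by the stems, so the grouping can't be adjusted as freely)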
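A small sketch of building a stemplot, assuming non-negative whole numbers so that the stem is "all but the final digit" and the leaf is "the final digit". The data and the function name `stemplot` are illustrative only.

```python
from collections import defaultdict

def stemplot(data):
    """Return stemplot lines for non-negative integers."""
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)   # stem: all but the final digit; leaf: final digit
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Hypothetical exam scores
scores = [52, 55, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 91, 100]
lines = stemplot(scores)
print("\n".join(lines))
```

Each row reads like a sideways histogram bar, but the individual values can still be recovered from the stem and leaf digits.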
shape
- symmetric distribution: the right & left sides of the histogram are approx mirror images of each other
- skewed to the right: the right side of the histogram extends much further out than the left
- skewed to the left: the left side of the histogram extends much further out than the right
- unimodal: 1 peak
- bimodal: 2 peaks
- multimodal: multiple peaks
center
- where is the middle?
- mean: average of all observations of the variable, but highly influenced by outliers & extreme data values (preferred unless the distribution is strongly skewed or has outliers)
- median: resistant to outliers & extreme values, midpoint (50th percentile) of the distribution of the variable
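A quick illustration of why the median is called resistant, using Python's standard `statistics` module. The income figures are invented for the example.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]      # hypothetical values, in thousands
with_outlier = incomes + [400]      # same data plus one extreme value

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", round(mean(with_outlier), 2), median(with_outlier))
```

Adding the single extreme value moves the mean from 35 to roughly 95.8, while the median only moves from 35 to 36.5.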
spread
- how much do the data vary around the center?
these are not resistant to outliers…
- variance: measures the dispersion (spread about the mean) of all the observations, based on their squared deviations from the mean
- standard deviation: the square root of the variance, in the same units as the data; indicates how far observations typically fall from the mean
these are better for describing the spread when the data are strongly skewed or have outliers…
- 5-number summary (percentiles): min, Q1, median, Q3, max
- interquartile range (IQR): Q3 – Q1, can be used to identify outliers
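A sketch of the spread measures above. Two assumptions are not stated in the cards: quartiles are computed as the medians of the lower and upper halves of the sorted data (textbooks and software differ slightly on this), and outliers are flagged with the common 1.5 × IQR rule. The variance here is the sample variance (dividing by n − 1). The data and function names are hypothetical.

```python
from statistics import median, stdev, variance

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of each half."""
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]             # observations below the median's position
    upper = xs[n // 2 + n % 2:]      # observations above the median's position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def suspected_outliers(data):
    """Flag values more than 1.5 * IQR beyond the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [1, 3, 4, 5, 7, 9, 10, 12, 12]    # hypothetical sample
summary = five_number_summary(data)
iqr = summary[3] - summary[1]

print("variance:", variance(data))       # sample variance (divides by n - 1)
print("std dev: ", stdev(data))
print("5-number:", summary)
print("IQR:     ", iqr)
print("outliers:", suspected_outliers(data + [40]))
```

The same five numbers are what a boxplot draws, and the flagged values are the ones that would get a special symbol.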
boxplot
- graphical rendition of statistical data based on the min, Q1, median, Q3, max
- the info in the 5-number summary can be graphically displayed in a boxplot
- central box spans from Q1 to Q3
- line marks the median within the central box
- lines extend from the box to mark the min & max
- special symbols can denote outliers
time plot
- for some quantitative data, we are measuring 1 subject at many time points (rather than measuring our variable across many subjects)
- plots each observation against the time at which it was measured
- 2 patterns: cycle (regular up & down movements over time) & trend (a long-term upward/downward movement over time)