module 4 Flashcards
contingency tables
- shows frequency of sampling units
- tables of data frequencies within the different levels of categorical variables
types of contingency tables
one-way and two-way tables
calculate marginal distributions as frequencies
- Row: sum frequencies across all columns for each row
- Column: sum frequencies across all rows for each column
calculate marginal distributions as proportions
- Table total: sum all frequencies in the table
- Row: sum frequencies across all columns for each row and divide by table total
- Column: sum frequencies across all rows for each column and divide by table total
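A minimal sketch of the row, column, and table-total calculations above, in plain Python. The table and its counts are made up for illustration and are not from the module.

```python
# Hypothetical two-way contingency table (counts are invented).
# Each inner list is one row level; each position is one column level.
table = [
    [10, 20],  # row level A
    [30, 40],  # row level B
]

# Table total: sum all frequencies in the table.
table_total = sum(sum(row) for row in table)

# Marginal distributions as frequencies.
row_freq = [sum(row) for row in table]        # sum across columns for each row
col_freq = [sum(col) for col in zip(*table)]  # sum across rows for each column

# Marginal distributions as proportions: divide by the table total.
row_prop = [f / table_total for f in row_freq]
col_prop = [f / table_total for f in col_freq]

print("row frequencies:", row_freq)
print("column frequencies:", col_freq)
print("row proportions:", row_prop)
print("column proportions:", col_prop)
```

Each set of marginal proportions should add up to 1, which is a quick way to check the arithmetic.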
marginal distribution
- row and column sums of a two-way contingency table.
- can be shown as frequencies or proportions
what do marginal distributions show
- how many sampling units are in each level of one categorical variable, without reference to the other categorical variable
- they describe overall patterns in the sample
conditional distributions
- two-way tables that show the proportion of sampling units for one variable within each level of the second variable
- shows the relationship between two variables
- shown as a separate table
how do you create a conditional distribution table
- calculated from contingency table and marginal distribution
- select one of the categorical variables to be primary and one to be secondary (aka conditional)
- take the frequency from the contingency table and divide it by the marginal distribution of the primary variable
- basically: take the value and divide it by the sum of the row/column
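The steps above can be sketched as a small function. This is an illustration only: the function name `conditional_distribution`, the `primary` argument, and the counts are all hypothetical. It assumes that when the primary variable is on the rows you divide by row sums, and when it is on the columns you divide by column sums.

```python
def conditional_distribution(table, primary="row"):
    """Divide each frequency by the marginal frequency of the primary variable."""
    if primary == "row":
        # Primary variable is on the rows: divide each cell by its row sum.
        return [[cell / sum(row) for cell in row] for row in table]
    if primary == "column":
        # Primary variable is on the columns: divide each cell by its column sum.
        col_sums = [sum(col) for col in zip(*table)]
        return [[cell / col_sums[j] for j, cell in enumerate(row)] for row in table]
    raise ValueError("primary must be 'row' or 'column'")


# Hypothetical contingency table (counts are invented).
table = [
    [10, 30],
    [20, 40],
]

by_row = conditional_distribution(table, primary="row")
by_col = conditional_distribution(table, primary="column")

print(by_row)  # proportions add up to 1 within each row
print(by_col)  # proportions add up to 1 within each column
```

The same table gives different conditional distributions depending on which variable is primary, which is why the choice has to follow from the question being asked.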
how do you choose primary and secondary variables in calculating a conditional distribution
- depends on the question being asked
- ex. are there more _____ (primary) than _____ (primary) in the _____ (secondary) category? or how many people like _____ (secondary) when doing _____ (primary)?
what do the primary and secondary variables in calculating conditional distributions determine
- if you use the row or column marginal distribution
what do conditional distributions show
- relative frequency of secondary variables within each level of the primary variable
- shows how the secondary variable changes across the primary variable
t or f: bar graphs are only used to visualize single variable categorical data
false, single and two variable categorical data
t or f: bar graphs are good at visualizing numerical data
- false, a bar graph only shows a single numerical value per category (such as an average), so it is acceptable in only one case
- acceptable: when the data is not statistical in nature (statistical datasets have info on many sampling units)
- not acceptable: when the data is from a statistical population w one numerical and one categorical variable
t or f: bar graphs can be horizontal or vertical
true, choice depends on focus of research question with more relevant info on the horizontal axis
how do you display data w two categorical measurement variables in a bar graph
- designate one variable as the grouping variable (the base of the figure; levels of the other variable are shown within it), choosing whichever variable shows the info more clearly
- decide whether to create the figure as a grouped or stacked bar graph
what are the two types of two variable bar graphs
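- grouped: levels of the second variable are drawn as separate bars beside each other, in one group for each level of the grouping variable on the x axis
- stacked: levels of the second variable are stacked on top of each other, with just one bar for each level of the grouping variable on the x axis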
when should you not use stacked bar graphs
when two different variables you want to compare have the same value
each bar or group of bars in a bar graph should be separated by a ______
gap
steps in making a histogram
- divide the numerical variable into bins of equal size
- count how many sampling units fit within each bin (frequency)
- create a plot w each bin having a bar w a height equal to that bin's frequency
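A minimal sketch of these three steps in plain Python. The function name `histogram_counts`, the number of bins, and the data values are hypothetical. It assumes each bin includes its lower edge and that the largest value is placed in the last bin; other conventions exist.

```python
def histogram_counts(values, n_bins):
    """Split values into n_bins equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal, so equal-width bins are undefined")
    width = (hi - lo) / n_bins

    counts = [0] * n_bins
    for v in values:
        # Assumption: a bin includes its lower edge; the maximum value
        # would land on the final edge, so it is placed in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1

    edges = [lo + k * width for k in range(n_bins + 1)]
    return edges, counts


# Hypothetical numerical measurements (invented values).
values = [1, 2, 2, 3, 5, 6, 8, 9]
edges, counts = histogram_counts(values, n_bins=4)

# Text version of the plot: one bar per bin, bar length = bin frequency.
for k, count in enumerate(counts):
    print(f"{edges[k]:>4} to {edges[k + 1]:<4} {'#' * count}")
```

The bars are printed with no gap between bins because the bins cover a continuous numerical range.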
t or f: histograms show the separation of variables through a gap between the bars
false
histograms
- split numerical data into bins and display the number of sampling units of each bin
- for numerical data
what are the lines drawn from the edge of the box to the last data point within the extreme threshold in box plots called
whiskers
box plot
- based on quartiles
- shows 5 descriptive stats (minimum, 1st quartile, median, 3rd quartile, and maximum)
- shows interquartile range
- shows how numerical variables change across multiple categorical groups
- equally spaced categorical groups across the x axis with a box for each group drawn
t or f: box plots can sometimes show extreme values
true
what are the four parts of a box plot
- a box (drawn between the 1st and 3rd quartiles, showing the interquartile range)
- a solid line (drawn at the median)
- whiskers (drawn from edge of box to the last data point within the extreme threshold)
- extreme values (symbols drawn on top of data points outside the extreme threshold)
extreme threshold
- temporary reference line used to draw the whiskers and extreme values
- The thresholds are drawn at 1.5 × the interquartile range above the top of the box and below the bottom of the box. They are removed in the final graph.
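A rough sketch of how the box, thresholds, whiskers, and extreme values could be computed. The function name `box_plot_parts` and the data are hypothetical. Quartiles can be calculated in several ways; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the method used in the course or in other software.

```python
from statistics import quantiles


def box_plot_parts(data):
    """Return the values needed to draw one box in a box plot."""
    # Assumption: quartiles use the "inclusive" method; other methods
    # can give slightly different values for the same data.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range = height of the box

    # Extreme thresholds: 1.5 x IQR beyond the edges of the box.
    lower_threshold = q1 - 1.5 * iqr
    upper_threshold = q3 + 1.5 * iqr

    # Whiskers stop at the last data point inside the thresholds.
    inside = [x for x in data if lower_threshold <= x <= upper_threshold]
    extremes = [x for x in data if x < lower_threshold or x > upper_threshold]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "extremes": extremes,
    }


# Hypothetical numerical data for one categorical group (invented values).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
parts = box_plot_parts(data)
print(parts)
```

In this made-up data set the value 30 lies above the upper threshold, so it is drawn as an extreme value and the upper whisker stops at 9.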
in an observational study, the categorical group is a ________, and in an experimental study the categorical group is _______
a measured categorical variable, a treatment factor
secondary variable for box plots
- shown within each level of the grouping variable
- levels often shown in a legend
what to do when there are two categorical groups for a box plot
- designate one categorical variable as the grouping variable and one as the secondary variable
- draw a boxplot using the numerical data within each combination of levels of the two categorical variables
- grouping variables have large gaps between levels
grouping variable for box plots
- shown on the x axis of grouped box plots
- levels are often shown on the x axis
when to use box plot vs histogram
- boxplot: if you have many categorical groups or aren't interested in the shape of the data (shows median, quartiles, and interquartile range; easy to compare across categorical groups)
- histogram: if you have a small number of categorical groups and want to see the shape of the data (shows how the data is distributed; difficult to compare numerical variables across categorical groups)
scatterplot
- used to show pattern between two numerical variables collected from different sampling units
line plot
- used when data is collected repeatedly from the same sampling unit
each point on a scatter plot is a ____ _____
sampling unit
names of the axes in scatterplots for experimental studies
- when one variable is treatment and the other is response, x-axis=independent variable (treatment) and y-axis=dependent variable (response)
- when both variables are measured quantities, both axes are called covariates (evaluating pattern)
names of the axes in scatterplots for observational studies
- both numerical variables are measured quantities called covariates
scatterplot: for both observational and experimental studies, when the goal of a test is to evaluate whether one variable can predict the other, the x-axis is typically called the _____ and the y-axis the ______
predictor variable, response variable
scatterplot: for both experimental and observational studies when the goal of a test is to evaluate the association between numerical variables, both axes are called ______
covariates
in scatterplots, if extra variables are categorical they are differentiated using _____ but if they are numerical you use _____ to show the difference. both of these are shown in a ______
different symbols, different size or colour, legend
discrete vs continuous numerical values
- discrete = exact values you can count, w nothing in between (ex. number of offspring)
- continuous = can take any value within a range (ex. height)