Exploring Data Flashcards
What is a categorical variable?
A variable that takes on values that are names or labels, such as color or breed of dog.
What is a quantitative variable?
A variable that is numerical and represents a measurable quantity, like salary or height.
How do we represent categorical variables?
With bar charts or pie charts
How do we represent quantitative variables?
With histograms, stem and leaf plots, or boxplots
When do we use the mean to describe a distribution?
When the distribution is unimodal and symmetric
When do we use the median to describe a distribution?
When the distribution is not unimodal and symmetric.
What is standard deviation?
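Roughly the typical distance of the data values from the mean (technically, the square root of the average squared distance from the mean)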
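A minimal sketch of that calculation, using made-up values that are not from these cards. It assumes the sample version of the formula (divide by n - 1), and shows the literal "average distance" next to it only for comparison.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values for illustration

mean = statistics.mean(data)
squared_distances = [(x - mean) ** 2 for x in data]

# Sample standard deviation: average the squared distances using n - 1,
# then take the square root.
sample_sd = math.sqrt(sum(squared_distances) / (len(data) - 1))

# Literal "average distance from the mean", shown only for comparison.
mean_abs_distance = sum(abs(x - mean) for x in data) / len(data)

print(sample_sd, mean_abs_distance)
```

The two numbers are close in spirit but not equal, which is why "average distance" is best read as a rough description.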
What is a z score?
The number of standard deviations a value is above or below the mean: z = (value - mean) / s.d.
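As a quick sketch, the z score is one subtraction and one division. The function name `z_score` and the example numbers (mean 100, s.d. 15) are made up for illustration.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` is from `mean`."""
    return (value - mean) / sd

# Hypothetical example: a value of 130 when the mean is 100 and the s.d. is 15.
print(z_score(130, 100, 15))  # 2.0, i.e. two s.d.'s above the mean
print(z_score(85, 100, 15))   # -1.0, i.e. one s.d. below the mean
```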
What is percentile?
The percent of the data at or below a value (percent to the left)
What is the five number summary?
min, Q1, median, Q3, max
What is IQR?
interquartile range: Q3 - Q1
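A rough sketch of the five number summary and IQR. It assumes the "median of each half" way of finding quartiles (leave the median out of both halves when n is odd). Other software may use a different quartile rule and give slightly different Q1 and Q3. The function name and data are made up.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For odd n, skip the middle value so it is in neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (
        values[0],
        statistics.median(lower),
        statistics.median(values),
        statistics.median(upper),
        values[-1],
    )

minimum, q1, median, q3, maximum = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
iqr = q3 - q1
print(minimum, q1, median, q3, maximum, iqr)
```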
What is the empirical rule?
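The 68-95-99.7 rule: in a normal distribution, about 68% of the data is within 1 s.d. of the mean, about 95% within 2, and about 99.7% within 3. Yes!!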
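The three percentages can be checked against a standard normal curve. This sketch uses Python's `statistics.NormalDist`, which is a stand-in here and not something the cards themselves use.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} s.d.: {area:.1%}")
```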
What percent of data lies above the median?
50%
How do you determine “outliers”?
A value is an outlier if it is more than 1.5 IQRs above Q3 or below Q1
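A small sketch of the 1.5 IQR rule. The data are made up, and Q1 = 13 and Q3 = 19 are assumed to come from the median-of-halves method for that data; `find_outliers` is a hypothetical helper name.

```python
def find_outliers(data, q1, q3):
    """Return the values more than 1.5 IQRs below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

values = [10, 12, 14, 15, 16, 18, 20, 45]  # made-up data
# IQR = 19 - 13 = 6, so the fences are 13 - 9 = 4 and 19 + 9 = 28.
print(find_outliers(values, q1=13, q3=19))  # [45]
```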
If a distribution is skewed right, which is higher, the median or mean?
the mean–the mean chases the tail!
How do you know whether to use the mean and s.d. to describe data, or median and IQR?
If the data is unimodal and symmetric, use the mean and s.d., otherwise use median and IQR
What should you remember when making graphs?
Label your axes, give a key if needed, and give the graph a name!
How much data is between Q1 and Q3?
50%
What is a contingency table?
A table that shows how counts are distributed across 2 categorical variables, like gender and music preference. AKA a 2-way table
How can you tell if variables in a contingency table are independent?
If the conditional distributions of one variable are the same across the categories of the other. Then it doesn’t DEPEND, so INDEPENDENT
marginal distribution
The overall distribution of a single variable in a contingency table (out in the margins)
conditional distribution
The distribution of one variable within the table, along only one row or one column. NOT IN THE MARGINS
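A sketch of marginal and conditional distributions for a 2-way table. The counts below are invented for illustration (they are chosen so the rows match, which is what independence looks like).

```python
# Made-up counts: rows are gender, columns are music preference.
table = {
    "male": {"rock": 20, "pop": 30},
    "female": {"rock": 40, "pop": 60},
}
columns = ["rock", "pop"]

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of music preference: column totals over the grand total.
marginal = {
    c: sum(row[c] for row in table.values()) / grand_total for c in columns
}

# Conditional distribution of music preference within each row.
conditional = {
    name: {c: row[c] / sum(row.values()) for c in columns}
    for name, row in table.items()
}

print(marginal)
print(conditional)
```

Here every row has the same conditional distribution (40% rock, 60% pop), so in this made-up table music preference does not depend on gender.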
How do you describe distributions (histograms)?
Shape-Center-Spread and STRANGE (outliers and gaps). Some say GSOCS. Where’s yo GSOCS?
If asked to compare distributions, what should you write about?
Compare Shapes, Centers, Spreads, and Stranges.. The GSOCS
Give a simple example showing that adding a constant doesn’t change the spread, but changes the center. (this always happens)
Data set: 1,2,3,4,5 Spread(range): 5-1=4, Center: 3
add three and get new data set: 4,5,6,7,8 spread: still 4 Center: 6 (center went up by 3, spread stayed the same). The IQR and SD will stay the same, but median and mean +3
Give a simple example showing that multiplying by a constant changes both the spread and the center. (this always happens)
Data set: 1,2,3,4,5 Spread(range): 5-1=4, Center: 3
mult by three and get new data set: 3,6,9,12,15 spread:12 Center:9 (both center and spread were multiplied by three). IQR and SD will be multiplied by 3, and so will all values including Q1, median, etc.
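The two examples can be checked with a short sketch. It uses the same data set 1,2,3,4,5 and the constant 3, with Python's `statistics` module assumed as the calculator.

```python
import statistics

data = [1, 2, 3, 4, 5]
shifted = [x + 3 for x in data]  # add a constant
scaled = [x * 3 for x in data]   # multiply by a constant

def describe(values):
    """Return (mean, median, range, sample s.d.) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        max(values) - min(values),
        statistics.stdev(values),
    )

for label, values in [("original", data), ("plus 3", shifted), ("times 3", scaled)]:
    print(label, describe(values))
```

Adding 3 moves the mean and median from 3 to 6 and leaves the range and s.d. alone. Multiplying by 3 moves the center to 9 and triples the range and s.d.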
How do you describe center?
Talk about the mean (balance), median (splits area in half), mode (peaks? if bimodal, talk about both modes) or simply say: “centered around ____”
How do you describe shape?
unimodal, bimodal, multimodal, uniform AND symmetric, skewed
Spread description?
range, IQR, standard deviation, variance, or simply say: “From here to about here”
What happens if you ADD a constant to each value in a data set?
It is SHIFTED only. This affects all of the data values and measures of center (mean, med) and quartiles, deciles, etc… IT DOES NOT CHANGE THE SPREAD! (IQR, St Dev, Range all stay the SAME).
What happens if you multiply all of a data set by a constant?
It is scaled. Everything is affected. Mean/ median/ stand dev/ iqr/ quartiles all multiplied by that constant. Center, spread and all individual values are changed.
If you want to calculate % above a value, what do you put into normcdf(?, ?)
Find the z score for the value, and then normcdf(z, 999): z is the left (lower) bound, and 999 stands in for infinity on the right
Which calculator function gives you a z score?
invnorm(%ile)… YOU MUST USE PERCENTILE (% to the left)
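For comparison, a sketch of the same lookup off the calculator. Python's `statistics.NormalDist.inv_cdf` is assumed here as a stand-in for invnorm, and it takes the percentile as a decimal (0.90 for the 90th percentile).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# z score with 90% of the area to its left (the 90th percentile).
z_90 = standard_normal.inv_cdf(0.90)

# The 50th percentile of a normal curve is the mean, so z is 0.
z_50 = standard_normal.inv_cdf(0.50)

print(round(z_90, 2), z_50)
```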
What does normcdf do?
It gives you the area under the normal curve between any two z scores
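A rough stand-in for normcdf, sketched with Python's `statistics.NormalDist`. The function name `normalcdf` and the example z scores are assumptions for illustration. The 999 upper bound copies the calculator habit of using a huge number in place of infinity.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between `lower` and `upper`."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Area between z = -1 and z = 1 (about 68%).
between = normalcdf(-1, 1)

# Percent above z = 1.5: lower bound is the z score, upper bound is 999.
above = normalcdf(1.5, 999)

print(round(between, 4), round(above, 4))
```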