AFM 112 - Chp 3 Flashcards
Define a frequency table
a table listing the number of observations in each class/category of the data
define a relative frequency table
percentage of observations that fall in each class/category (frequency/total number of observations)
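The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_tables` and the sample payment data are made up for the example.

```python
from collections import Counter

def frequency_tables(observations):
    """Return (frequency, relative_frequency) dicts keyed by category."""
    frequency = Counter(observations)          # count per category
    total = sum(frequency.values())            # total number of observations
    relative = {category: count / total for category, count in frequency.items()}
    return dict(frequency), relative

# Hypothetical sample data, for illustration only.
payment_methods = ["cash", "credit", "credit", "debit",
                   "credit", "cash", "debit", "credit"]
freq, rel_freq = frequency_tables(payment_methods)
print(freq)      # counts per category
print(rel_freq)  # share of observations per category (sums to 1)
```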
when to use a bar graph vs pie chart
bar graph - uses a bar to represent each category, with the height of the bar equal to the frequency or relative frequency of the class/category (more specific)
pie chart - each class/category is represented by a slice and the size of each slice is proportional to the relative frequency of the class/category (more general)
What do we use to summarize + describe quantitative data
measures of central tendency + measures of variability (dispersion)
define measures of central tendency
capture the tendency of the data to cluster around some central values
define measure of variability
captures the spread of data
what’s the use of a histogram in graphing?
to capture the shape of distribution
what are the histogram distributions described as? and define them
- symmetric - looks the same on both sides from the centre
- skewed left - majority of the data is on the right but there is a little bit of data on the left
- skewed right - majority of the data is on the left but there is a little bit of data on the right
- bimodal - 2 spikes
- multimodal - multiple spikes
how do we use mean + median to predict the shape of distribution
- if mean = median, distribution = likely symmetric
- if mean > median, distribution = likely skewed to the right. if the difference is significant, it indicates there are outliers at the upper end (right side)
- if mean < median, distribution = likely skewed to the left. if the difference is significant, there are outliers at the lower end (left side)
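A rough sketch of the mean vs. median check in Python. The function name `guess_shape` and the `tolerance` cutoff (how small the gap must be, relative to the standard deviation, to count as "equal") are assumptions made for this example, not part of the course material.

```python
import statistics

def guess_shape(values, tolerance=0.05):
    """Guess the distribution shape by comparing the mean and the median.

    tolerance is a hypothetical cutoff: gaps smaller than
    tolerance * standard deviation are treated as mean = median.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "likely symmetric"
    if mean > median:
        return "likely skewed right"
    return "likely skewed left"

print(guess_shape([1, 2, 3, 4, 5]))    # mean equals median
print(guess_shape([1, 2, 3, 4, 50]))   # large value pulls the mean up
```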
define the interquartile range
distance between the first and third quartile
what’s the spreadsheet formula for first and third quartile?
1st quartile: =PERCENTILE(E1:E39, 0.25)
3rd quartile: =PERCENTILE(E1:E39, 0.75)
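For comparison, a minimal Python sketch of the same quartile and interquartile range calculation. It uses `statistics.quantiles` with `method="inclusive"`, which is assumed here to interpolate the same way as the spreadsheet PERCENTILE function; the function name and sample values are made up for the example.

```python
import statistics

def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) for a list of numbers."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1

# Hypothetical sample data, for illustration only.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
q1, q3, iqr = quartiles_and_iqr(sample)
print(q1, q3, iqr)
```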
define variance
average squared deviation from the mean
define standard deviation
square root of the variance - higher the value, higher the variability
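A short Python sketch of both definitions. By default it divides by n, matching the "average squared deviation" wording above; the `sample` flag (dividing by n - 1 instead) is an addition for this example, since the card does not say which version the course uses.

```python
import math

def variance_and_std(values, sample=False):
    """Return (variance, standard deviation) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Population version divides by n; sample version divides by n - 1.
    divisor = n - 1 if sample else n
    variance = sum(squared_deviations) / divisor
    return variance, math.sqrt(variance)

# Hypothetical sample data, for illustration only.
var, std = variance_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(var, std)
```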
what are the 3 predictions we can make if the shape of the distribution is bell shaped and symmetric
- 68% of the observations will fall within 1 standard deviation of the mean (range = x-1s to x+1s)
- 95% of the observations will fall within 2 standard deviations of the mean (range = x-2s to x+2s)
- 99.7% of the observations will fall within 3 standard deviations of the mean (range = x-3s to x+3s)
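The three ranges above can be computed directly from a mean and a standard deviation. A minimal sketch, where the function name and the example mean of 100 and standard deviation of 15 are made up for illustration:

```python
def empirical_rule_ranges(mean, std_dev):
    """Return {k: (lower, upper, expected share)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, share)
        for k, share in coverage.items()
    }

# Hypothetical bell-shaped data with mean 100 and standard deviation 15.
for k, (low, high, share) in empirical_rule_ranges(100, 15).items():
    print(f"within {k} sd: {low} to {high} (about {share:.1%} of observations)")
```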
What’s the importance of data understanding?
structure of the data + data captured in each variable
define accuracy/consistency in terms of data quality
data set = accurate + consistent if it is as free as possible from intentional/unintentional errors
data values = accurate if they capture what the decision maker would consider as the actual value
data values = consistent if they do not change across occurrences
Define timeliness in data quality context
a data set is timely if it contains info that is time relevant to the business problem that it will be used to address
define completeness in terms of data quality
data set is complete if all the data points needed to capture a transaction are available
define outliers + the importance
observations in a data set far away from the bulk of the observations
outliers are important as they may indicate various info - eg. potential fraud
what’s the rule of distribution for detecting outliers in a bell shaped distribution
any value which is more than 3 standard deviations above or below the mean.
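A minimal Python sketch of the 3 standard deviation rule. The function name and the sample values are made up for the example, and the rule only makes sense when the data is roughly bell shaped.

```python
import statistics

def three_sd_outliers(values):
    """Return the values more than 3 standard deviations away from the mean."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation
    if std_dev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) > 3 * std_dev]

# Hypothetical data: twenty ordinary values and one extreme value.
amounts = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
           11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
print(three_sd_outliers(amounts))
```

With very small data sets no value can be 3 standard deviations from the mean, so this check needs a reasonable number of observations to be useful.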
what’s the rule for detecting outliers in any distribution
any value which is above the end of the upper whisker (Q3 + 1.5*IQR) or below the end of the lower whisker (Q1 - 1.5*IQR)
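The whisker rule can be sketched in Python as follows. The function name and sample data are made up for the example, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is an assumption about how the course computes Q1 and Q3.

```python
import statistics

def iqr_outliers(values):
    """Return (outliers, (lower whisker end, upper whisker end))."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # end of the lower whisker
    upper = q3 + 1.5 * iqr   # end of the upper whisker
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical sample data, for illustration only.
outliers, limits = iqr_outliers([2, 4, 4, 5, 7, 9, 10, 12, 40])
print(outliers, limits)
```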
what are the 3 count functions in excel and the differences?
COUNT - counts cells that contain numerical data
COUNTA - counts cells that contain any type of data (non-empty cells)
COUNTBLANK - counts cells that are blank
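To make the difference concrete, here is a simplified Python imitation of the three counts. It is only a sketch: the function name is made up, blank cells are represented as `None` or an empty string, and it ignores spreadsheet details such as dates and formulas.

```python
def count_cells(cells):
    """Imitate COUNT, COUNTA and COUNTBLANK on a list of cell values."""
    def is_blank(cell):
        return cell is None or cell == ""

    def is_number(cell):
        # bool is excluded because Python treats True/False as integers.
        return isinstance(cell, (int, float)) and not isinstance(cell, bool)

    return {
        "COUNT": sum(1 for c in cells if is_number(c)),
        "COUNTA": sum(1 for c in cells if not is_blank(c)),
        "COUNTBLANK": sum(1 for c in cells if is_blank(c)),
    }

# Hypothetical column with numbers, text and missing values.
print(count_cells([12, 7.5, "n/a", None, "", "ok"]))
```

Comparing COUNTBLANK against the number of rows is one quick way to see how many values are missing in a column.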
What are some issues with data quality?
missing values (detected using count functions) and erroneous data
what’s the function to categorize data?
the UNIQUE function (returns the distinct values in a range)