Ch.14, Descriptive Statistics Flashcards
Define descriptive stats
Describe data in ways that give us a better idea of their characteristics; a number that summarizes a set of data
NOT a correlation statistic (correlations are inferential)
What is the simplest measure of dispersion?
Range: maximum minus minimum
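A minimal sketch of the range calculation in Python. The scores are invented for illustration and are not from the chapter.

```python
# Hypothetical scores, invented for illustration
scores = [4, 8, 15, 16, 23, 42]

# Range = maximum minus minimum
data_range = max(scores) - min(scores)
print(data_range)  # 38
```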
What are data matrices?
Putting data into a grid: a matrix
Opportunity to examine all data in one place
Histograms
graphical display of values where each bar indicates the frequency of the range or value
LIMITATIONS: The more “accessible” a data set is, the less information/less complexity you’re conveying
Advantages: identifies mode, helps to identify potential outliers
Binning
Binning in data mining is a data preprocessing technique that involves grouping data into smaller, more manageable categories or bins. It can be used for both numerical and categorical data and can help improve the efficiency and accuracy of data analysis.
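One way to see binning and histograms together is a small Python sketch. The scores and the bin width of 10 are assumptions chosen for illustration. Each bar is printed as text, one '#' per score in the bin.

```python
from collections import Counter

# Hypothetical exam scores, invented for illustration
scores = [52, 57, 61, 64, 68, 71, 73, 75, 78, 82, 85, 91]
bin_width = 10  # assumed bin width; other widths give a different picture

# Map each score to the lower edge of its bin (e.g. 57 -> 50)
bin_counts = Counter((s // bin_width) * bin_width for s in scores)

# Text histogram: one '#' per score that falls in the bin
for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bin_counts[lower]}")

# The tallest bar marks the modal bin
modal_bin = max(bin_counts, key=bin_counts.get)
```

Here the 70-79 bin has the tallest bar. The individual scores inside each bin are no longer visible, which is the information lost in exchange for a simpler picture.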
Stem-Plots
both a graph and a chart that displays each score in a data set so that it visually represents the distribution/frequency of scores
Stem: leading numbers
Leaves: trailing numbers
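A rough Python sketch of a stem-plot, assuming two-digit scores so that the tens digit is the stem and the ones digit is the leaf. The scores are invented for illustration.

```python
from collections import defaultdict

# Hypothetical two-digit scores, invented for illustration
scores = [12, 15, 21, 23, 23, 28, 34, 37, 41]

# Group leaves (ones digit) under their stem (tens digit)
stems = defaultdict(list)
for s in sorted(scores):
    stem, leaf = divmod(s, 10)
    stems[stem].append(leaf)

# One row per stem, leaves in ascending order
stem_lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
for line in stem_lines:
    print(line)
```

Unlike a histogram, every original score can be read back from the plot: the row "2 | 1 3 3 8" stands for 21, 23, 23 and 28.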
What does sigma mean and what is its symbol?
Σ = sum of all scores
What does x̄ (x-bar) mean?
The mean (of a sample)
Mean, advantages/disadvantages
Advantages: very common, takes into account every entry of a data set
Disadvantages: extremely influenced by outliers, knowledge about individual cases is completely lost with average
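A short Python sketch of the mean as the sum of all scores (Σ) divided by the number of scores, and of how one outlier pulls it. The salary figures are invented for illustration.

```python
# Hypothetical salaries in thousands, invented for illustration
salaries = [40, 45, 50, 55, 60]
with_outlier = salaries + [1000]  # add one extreme value

# Mean = sum of all scores divided by the number of scores
mean_without = sum(salaries) / len(salaries)
mean_with = sum(with_outlier) / len(with_outlier)

print(mean_without)         # 50.0
print(round(mean_with, 2))  # 208.33
```

A single extreme entry moves the mean from 50 to about 208, a value that describes none of the individual cases.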
Population vs. sample mean
Population Mean: mean of the entire population, i.e., whatever you’re trying to make a generalization about; CAN NEVER REALLY KNOW THIS (μ, the Greek letter mu, is the symbol for the population mean on charts)
Sample Mean: mean of your sample (x̄ on charts)
Median, advantages/disadvantages
Middle (from lowest to highest)
At the median half the data set is below that number and half the data set is above that number
Position of Median = (number of entries + 1) / 2
Odd Number of Entries: median is the middle data entry
Even Number of Entries: median is the mean of the two middle data entries
Advantages: not influenced by outliers; a reasonable estimate of what most people mean by the center of a distribution (e.g., a “reasonable” average salary in Canada, not skewed by billionaires)
Disadvantages: may not be good to ignore extreme values in all cases
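The position rule and the odd/even cases can be sketched in Python as follows. The function name find_median and the example data are illustrative assumptions. Python's standard library also offers statistics.median.

```python
def find_median(values):
    """Median via the position rule (n + 1) / 2 on sorted data."""
    ordered = sorted(values)  # lowest to highest
    n = len(ordered)
    position = (n + 1) / 2    # 1-based position of the median
    if n % 2 == 1:
        # Odd count: the position is a whole number, take that entry
        return ordered[int(position) - 1]
    # Even count: the position falls between two entries, average them
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd_median = find_median([3, 9, 1, 7, 5])              # sorted: 1 3 5 7 9
even_median = find_median([3, 9, 1, 7])                # sorted: 1 3 7 9
outlier_median = find_median([40, 45, 50, 55, 1000])   # outlier barely matters

print(odd_median, even_median, outlier_median)  # 5 5.0 50
```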
Mode, advantages/disadvantages
Mode: LEAST USED; the measure for NOMINAL/CATEGORICAL VARIABLES
Most frequently occurring; if there is no entry that is repeated there is no mode
Data can be bimodal (2 MODES); 3 OR MORE MODES = MULTI-MODAL
Elections use this often to show which party was named the most; also used to ask what the most popular dish at a cafe is
Advantages: most frequently obtained score which can be useful, not influenced by extreme scores and works when outliers aren’t relevant
Disadvantages: may not represent a large proportion of the scores, there’s still a bunch of answers that might be very frequent as well and it completely ignores those
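A small Python sketch of finding the mode(s), following the rule above that data with no repeated entry has no mode. The function name find_modes and the example data are illustrative assumptions.

```python
from collections import Counter

def find_modes(values):
    """Return every most-frequent value; empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no entry is repeated, so there is no mode
    return [value for value, count in counts.items() if count == highest]

# Works for categorical data, e.g. hypothetical cafe orders
one_mode = find_modes(["pasta", "soup", "pasta", "salad"])
two_modes = find_modes([1, 1, 2, 2, 3])  # bimodal
no_mode = find_modes([1, 2, 3])

print(one_mode, two_modes, no_mode)  # ['pasta'] [1, 2] []
```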
Advantages/disadvantages of range
Range can never be negative: maximum minus minimum is ALWAYS ZERO OR POSITIVE
Advantages: includes all the data, simple
Disadvantages: sensitive to small sample sizes (a small sample of a broader population is unlikely to capture the full range, so small samples = less range = not a representative range); doesn’t tell you anything about where the bulk of the values are; affected by outliers
What are interquartiles?
INTERQUARTILES SHOW DISPERSION AROUND MEDIAN
What is a quartile?
Quartiles: positions in a range of values representing multiples of 25%
What are the first and third quartiles?
First Quartile: 25% of scores fall below the first quartile, 75% above (Q1: splitting bottom half in half)
Third Quartile: 75% of scores fall below the third quartile, 25% fall above (Q3: splitting top half in half)
What does the interquartile range do?
Measures the distance between the first and third quartiles (a special kind of range that includes just the middle 50% of values); TELLS YOU WHERE THE MIDDLE HALF OF THE DATA IS. The second quartile (Q2) itself is the median. (25%, 30%, 30%, 25%) Interquartile Range would be 30%
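A sketch of the quartile cards in Python, using the "split each half in half" description: Q1 is the median of the bottom half and Q3 is the median of the top half. This assumes at least two values and, for an odd count, leaves the middle entry out of both halves. Other quartile conventions exist and can give slightly different numbers. The function name and the data are invented for illustration.

```python
from statistics import median

def find_quartiles(values):
    """Q1, Q2 (median), Q3 by taking the median of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]   # bottom half
    upper_half = ordered[-half:]  # top half (middle entry excluded if odd)
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical scores, invented for illustration
scores = [1, 3, 5, 7, 9, 11, 13, 15]
q1, q2, q3 = find_quartiles(scores)

# Interquartile range = Q3 minus Q1, the spread of the middle 50%
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 4.0 8.0 12.0 8.0
```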