Week 1 Flashcards
Population
Consists of all the members of a group about which you want to draw a conclusion
Sample
The portion of the population selected for analysis
Parameter
A numerical measure that describes a characteristic of a population.
Statistic
A numerical measure that describes a characteristic of a sample.
Descriptive Statistics
Collecting data (e.g. by survey), summarizing and presenting it (e.g. with tables and graphs), and characterizing it (e.g. with the sample mean)
Inferential Statistics
Drawing conclusions about a population based on sample data (i.e. estimating a parameter based on a statistic).
Example of Inferential Statistics
Estimate the population mean weight (parameter) using the sample mean (statistic).
Hypothesis testing - e.g. Test the claim that the population mean weight is 100 pounds.
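The estimation example above can be sketched in a few lines of Python. The population here is simulated, made-up data (normally distributed weights with mean 100 and standard deviation 15 are assumptions chosen only for illustration); the point is that the sample mean is a statistic used to estimate the population mean, a parameter.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated weights in pounds (made-up data).
population = [random.gauss(100, 15) for _ in range(10_000)]

# Select a portion of the population for analysis.
sample = random.sample(population, 50)

mu = statistics.fmean(population)  # parameter: describes the population
x_bar = statistics.fmean(sample)   # statistic: describes the sample

print(f"population mean = {mu:.1f}, sample mean = {x_bar:.1f}")
```

In practice the population mean is unknown, which is why the sample mean is used as the estimate.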
Types of Data
Categorical, Numerical Discrete, Numerical Continuous
Categorical Data
Simply classifies data into categories (e.g. marital status, hair color, gender)
Numerical Discrete
Counted items - finite number of items (e.g. number of children, number of people who have type-O blood)
Numerical Continuous
Measured characteristics - infinitely many possible values (e.g. weight, height)
Levels of Measurement and Measurement Scales
Highest level - Ratio Data
*Differences between measurements are meaningful and a true zero exists (height, weight, age, weekly food spending)
Interval Data
*Differences between measurements are meaningful but there is no true zero (temperature in Celsius, standardized exam scores)
Ordinal Data
*Ordered categories (rankings, order or scaling - tournament rankings, student letter grades, Likert scales)
Lowest level - Nominal Data
*Categories (no ordering or direction - marital status, type of car owned, gender, hair color)
Categorical Data (tables and charts)
Summary table
Graphing data - bar charts, pie charts
Numerical Data (tables and charts)
Ordered array, stem and leaf display, histogram, frequency and cumulative distributions
Examples of describing central tendency
Mean, Median, Mode, Geometric mean
Examples of describing variation
Range, interquartile range, variance, standard deviation, coefficient of variation
Examples of describing shape
Skewness
Median
Main advantage over mean is that it is not affected by extreme values
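A small sketch of why the median is called resistant: adding one extreme value moves the mean a long way but barely moves the median. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import statistics

values = [30, 32, 35, 38, 40]   # hypothetical data
with_outlier = values + [400]   # same data plus one extreme value

mean_before = statistics.mean(values)            # 175 / 5 = 35
mean_after = statistics.mean(with_outlier)       # 575 / 6, roughly 95.8

median_before = statistics.median(values)        # middle value: 35
median_after = statistics.median(with_outlier)   # (35 + 38) / 2 = 36.5

print(mean_before, mean_after)
print(median_before, median_after)
```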
Mode
- Not affected by extreme values
- Unlike the mean and median, there may be no unique (single) mode for a given data set
- Used for either numerical or categorical (nominal) data
- Least used of the 3
Mean
Generally used most often, unless extreme values (outliers) exist.
Quartiles
Split the ranked data into four segments, with an equal number of values per segment
The first quartile (Q1)
The value for which 25% of the observations are smaller and 75% are larger.
Q1 position = (n+1)/4
The second quartile (Q2)
Q2 is the same as the median (50% are smaller, 50% are larger)
Q2 position = (n+1)/2 (median)
The third quartile (Q3)
Only 25% of the observations are greater than the third quartile.
Q3 position = 3(n+1)/4
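The three position formulas can be tried on a small ranked data set. The cards do not say what to do when a position is not a whole number, so this sketch assumes linear interpolation between the two neighbouring values; other courses round the position instead. The helper name value_at_position and the data are hypothetical.

```python
def value_at_position(ranked, pos):
    """Return the value at a 1-based, possibly fractional, position.

    Assumes 1 <= pos <= len(ranked), which holds for the quartile
    positions whenever there are at least 3 values.
    """
    lower = int(pos)        # whole part of the position
    frac = pos - lower      # fractional part of the position
    if frac == 0:
        return ranked[lower - 1]
    # Assumption: interpolate linearly between the two neighbours.
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])


data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])  # hypothetical, n = 9
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)      # position 2.5
q2 = value_at_position(data, (n + 1) / 2)      # position 5 (the median)
q3 = value_at_position(data, 3 * (n + 1) / 4)  # position 7.5

print(q1, q2, q3)
```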
Measures of variation
Give information on the spread or variability of the data values
Range
Simplest measure of variation
Disadvantages - ignores how the data are distributed and is sensitive to outliers
Interquartile Range (IQR)
Like the median, Q1 and Q3, the IQR is a resistant summary measure (resistant to the presence of extreme values)
- Eliminates outlier problems, as high- and low-valued observations are removed from the calculation
- IQR = 3rd quartile - 1st quartile
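A short sketch comparing the IQR with the range when one value becomes extreme. It relies on Python's statistics.quantiles, whose default "exclusive" method uses (n+1)-based positions with interpolation; the data are hypothetical.

```python
import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical data
with_outlier = data[:-1] + [220]              # largest value made extreme


def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


print(iqr(data), iqr(with_outlier))                  # IQR is unchanged
print(value_range(data), value_range(with_outlier))  # range jumps
```

The range reacts to the single extreme value, while the IQR stays the same because it only looks at the middle half of the ranked data.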
Sample Variance (S^2)
Measures average scatter around the mean; its units are the square of the original units
Sample Standard Deviation - S
Most commonly used measure of variation, shows variation about the mean, has the same units as the original data
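The sample variance and sample standard deviation can be computed by hand and checked against Python's statistics module. Note the divisor n - 1, which is the standard definition for a sample (the cards do not spell it out). The data are hypothetical.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(data)
mean = sum(data) / n              # 40 / 8 = 5.0

# Sum of squared deviations from the mean: 9+1+1+1+0+0+4+16 = 32
squared_devs = sum((x - mean) ** 2 for x in data)

s2 = squared_devs / (n - 1)   # sample variance, in squared units
s = math.sqrt(s2)             # sample standard deviation, in original units

print(s2, s)
```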
Variance and Standard deviation - Advantages
- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight, as deviations from the mean are squared
Variance and Standard deviation - Disadvantages
- Sensitive to extreme values (outliers)
- Measures of absolute variation, not relative variation
The Z Score
The difference between a given observation and the mean, divided by the standard deviation
An observation with a z score above 3.0 or below -3.0 is considered an outlier
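A minimal sketch of the z score and the outlier cutoff described above. The function names and the example mean of 100 and standard deviation of 15 are hypothetical.

```python
def z_score(x, mean, std_dev):
    """Difference between an observation and the mean, in standard deviations."""
    return (x - mean) / std_dev


def is_outlier(z, cutoff=3.0):
    """Flag z scores above the cutoff or below its negative."""
    return abs(z) > cutoff


# Hypothetical mean of 100 and standard deviation of 15.
z_a = z_score(130, 100, 15)   # (130 - 100) / 15 = 2.0
z_b = z_score(150, 100, 15)   # (150 - 100) / 15, roughly 3.33

print(z_a, is_outlier(z_a))
print(z_b, is_outlier(z_b))
```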
Shape of a Distribution
Describes how data are distributed
-Left-skewed
-Symmetric
-Right-skewed
Population summary measures
Parameters
The population mean is the sum of the values in the population divided by the population size, N
Population Variance
The average of the squared deviations of values from the mean
Population Standard Deviation
Shows variation around the mean
-the square root of the population variance
-has the same units as the original data
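For a whole population the squared deviations are averaged over the population size N. A sketch using hypothetical values treated as a complete population, checked against statistics.pvariance and statistics.pstdev:

```python
import math
import statistics

population = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical, treated as the full population
N = len(population)

mu = sum(population) / N                                   # population mean: 5.0
sigma2 = sum((x - mu) ** 2 for x in population) / N        # population variance: 32 / 8 = 4.0
sigma = math.sqrt(sigma2)                                  # population standard deviation: 2.0

print(mu, sigma2, sigma)
```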
The Empirical Rule
If the data distribution is approximately bell-shaped, then the interval μ ± 1σ contains about 68% of the values in the population
μ ± 2σ contains about 95% of the values in the population
μ ± 3σ contains about 99.7% of the values in the population
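The three percentages can be checked for an exactly normal (bell-shaped) distribution using statistics.NormalDist from the Python standard library. Real data that are only approximately bell-shaped will match these figures only roughly.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Share of values within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```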
Determining Outliers
Using the empirical rule
-values above or below μ ± 3σ (the extreme values, fewer than 1% of the data)
Using Z scores
-above 3 or below -3
Exploratory Data Analysis - Box-and-Whisker Plot
A graphical display of data using the five-number summary (minimum, Q1, median, Q3, maximum)
-can determine skewness
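A sketch of the five numbers behind a box-and-whisker plot, using hypothetical data and the default "exclusive" method of statistics.quantiles. Comparing the distances on either side of the median gives a rough hint about skewness; treat it as an informal check, not a formal test.

```python
import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical data

q1, median, q3 = statistics.quantiles(data, n=4)
five_number_summary = (min(data), q1, median, q3, max(data))

# Rough skewness hint: a longer upper side suggests right skew,
# a longer lower side suggests left skew.
lower_side = median - min(data)   # distance from minimum to median
upper_side = max(data) - median   # distance from median to maximum

print(five_number_summary)
print(lower_side, upper_side)
```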