Week 1 Flashcards
What is statistics?
Statistics is the science of collecting, organizing, interpreting and learning from data.
What are the three aspects of statistics?
Design: Planning how to obtain data to answer the question of interest.
Description: Summarizing the data that are obtained.
Inference: Using sample data to learn about the population.
Population
The population is a collection of units of interest, such as all adults in the United States, alligators in the Everglades, or iPads from a factory.
Subject
Subjects are the individual units of a population, such as an adult, an alligator, or an iPad.
Sample
A sample is a subset of the units of a population.
What makes a good sample?
A sample should be representative of the population. This can be obtained by selecting sample subjects randomly.
Where do statistical methods come in?
1. Use DESIGN to obtain an appropriate sample from the population.
2. DESCRIBE the sample data with graphical and numerical summaries.
3. Perform STATISTICAL INFERENCE.
Statistical Inference
The procedure of using a sample to learn about a population is called statistical inference.
Parameter
A parameter is a number that describes a population. It is usually unknown.
Statistic
A statistic is a number that describes a sample. It can be computed from data; therefore, it is known.
We use a ____ to estimate a ______.
We use a sample statistic to estimate a population parameter.
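A minimal sketch of this idea in Python, using only the standard library. The population of heights is simulated and entirely hypothetical; in practice the parameter is usually unknown, which is why we estimate it from a sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: heights (cm) of 10,000 adults, invented for illustration.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a number describing the whole population (usually unknown in practice).
population_mean = statistics.mean(population)

# Design: select 100 subjects at random, without replacement.
sample = random.sample(population, 100)

# Statistic: a number computed from the sample, so it is known.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values should be close but not identical; the gap is the sampling error that statistical inference has to account for.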
Variable
A variable is any characteristic of a subject in a population.
Categorical (Qualitative) Variable
Classifies subjects as belonging to a certain group/category. For example, gender, race, political party, issue positions, etc.
Quantitative Variable
Takes on numerical values that represent different magnitudes. For example, height, weight, age, IQ, income, temperature, etc.
A quantitative variable can either be _____ or _____.
A quantitative variable can either be discrete or continuous.
Discrete
The possible values of a discrete quantitative variable form a set of separate numbers that can be listed or counted. For example, age in years, number of tattoos, etc.
Continuous
The possible values of a continuous quantitative variable form an interval. That is, there is an infinite continuum of possible values. For example, height, weight, income, time, etc.
Graphical Summaries for Categorical Variables
Graphical summaries of categorical variables help us visualize the distribution of the data among the separate categories. Before constructing the graphical summary, we first organize the categorical data into a frequency table.
Frequency Table
A frequency table is a listing of possible values for a variable, together with the number of observations for each value. (Note that we can also construct frequency tables for quantitative variables.)
Proportion
A proportion of observations that fall in a certain category is the count of observations in that category divided by the total number of observations.
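As an illustration, a frequency table and its proportions can be sketched in Python with `collections.Counter`. The survey responses below are made up for the example.

```python
from collections import Counter

# Hypothetical categorical data: political party of 10 survey respondents.
parties = ["Dem", "Rep", "Ind", "Dem", "Rep", "Dem", "Ind", "Dem", "Rep", "Dem"]

# Frequency table: each category mapped to its count.
counts = Counter(parties)

# Proportion = count in the category / total number of observations.
n = len(parties)
proportions = {category: count / n for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category}: count = {count}, proportion = {proportions[category]:.2f}")
```

The proportions always sum to 1, which is a quick sanity check on any frequency table.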
Two Graphical Summaries for Categorical Variables
Pie Charts and Bar Graphs
Pie Chart
A circle is drawn with a “slice of pie” representing each category’s percentage of observations.
Bar Graph
A bar is drawn for each category, with the bar’s height representing the percentage or count of observations.
Pie Charts vs. Bar Graphs
- Pie charts emphasize a category’s relation to the whole, but make it difficult to compare categories to each other!
- Bar graphs compare the sizes of each group of a categorical variable (not in relation to the whole).
- Bar graphs are easier to read and more flexible than pie charts.
Distribution
A distribution of data shows the values a variable takes and how often they occur.
Features of distributions visualized by graphical summaries:
- Overall Pattern (bell-shaped, skewed, bimodal, etc.)
- Center and spread
- Outliers (unusually large or small observations)
Two graphical summaries for quantitative variables:
Stem-and-leaf Plots and Histograms
Histogram
Histograms break up the range of values of a variable into classes and display the count (or percent) of the observations that fall into each class.
Steps to Construct a Histogram
- Divide the range of data into intervals of equal width. (We want to choose a width that gives us a good picture of the distribution of the data. The number of intervals should not be too many or too few.)
- Count the number of observations that fall into each interval.
- On the horizontal axis, mark the scale of the variable. On the vertical axis, mark the scale for counts or percents.
- Above each interval, draw a bar whose height is either the corresponding count or percent for that interval.
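The steps above can be sketched as a small text histogram in Python. The ages and the interval width of 10 are assumptions chosen for illustration; a different width would give a different picture of the same data.

```python
# Hypothetical data: ages (in years) of 12 subjects.
ages = [21, 23, 25, 28, 31, 33, 34, 36, 42, 45, 47, 58]

# Step 1: divide the range into intervals of equal width.
width = 10
start = (min(ages) // width) * width  # left edge of the first interval (20 here)

# Step 2: count the observations in each interval [lo, lo + width).
counts = {}
lo = start
while lo <= max(ages):
    counts[lo] = sum(lo <= age < lo + width for age in ages)
    lo += width

# Steps 3 and 4: one bar per interval, drawn sideways; bar length = count.
for lo, count in counts.items():
    print(f"{lo}-{lo + width - 1} | {'#' * count} ({count})")
```

The labels such as `20-29` work because these ages are whole numbers; for continuous data the intervals would be written as `[20, 30)`.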
Common Distribution Shapes
- Symmetric (normal, unimodal, bell-shaped), e.g., IQ, height, weight
- Right-skewed, e.g., income
- Left-skewed, e.g., lifespan, product failure rate
- Bimodal, e.g., height of men AND women (two populations)
- Uniform, e.g., commute time
n
The number of observations in a sample
Mean
(x̄, read "x bar") The average of all observations: sum the observations and divide by n.
Median
(M) The middle number when measurements are ordered from smallest to largest (the 50th percentile). When n is odd, M is the middle value; when n is even, M is the average of the two middle values.
Which measure of center is resistant to outliers?
The median.
Resistant
A numerical summary of the observations is resistant if extreme observations have little, if any, influence on its value. The mean is affected by outliers, while the median is resistant to the skewing effects of outliers.
Mean vs. Median
In symmetric distributions, the mean and median are approximately equal. In right-skewed distributions, the mean is typically greater than the median. In left-skewed distributions, the mean is typically less than the median.
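A short Python sketch of how one extreme value pulls the mean but barely moves the median. The incomes are hypothetical, and the added value of 500 plays the role of an outlier that creates right skew.

```python
import statistics

# Hypothetical incomes in thousands of dollars (symmetric, no outliers).
incomes = [30, 35, 40, 45, 50]

mean_before = statistics.mean(incomes)      # 200 / 5 = 40
median_before = statistics.median(incomes)  # middle value of 5 = 40

# Add one extreme observation, which makes the data right-skewed.
with_outlier = incomes + [500]

mean_after = statistics.mean(with_outlier)      # 700 / 6, about 116.67
median_after = statistics.median(with_outlier)  # (40 + 45) / 2 = 42.5

print(f"without outlier: mean = {mean_before}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

Note that with six observations (n even) the median is the average of the two middle values.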
Measures of Spread
It’s important to look at measures of spread in addition to measures of center to get a better understanding of the data.
What are three measures of spread?
Range, interquartile range and standard deviation.
Range
The range is the difference between the largest and smallest observations. That is, the maximum value - the minimum value = the range. While the range is a simple measure of spread that is easy to calculate, it is only calculated using the most extreme values of a data set. Therefore, it can be misleading and is not resistant to outliers.
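For example, in Python with made-up scores, a single extreme value is enough to change the range dramatically:

```python
# Hypothetical data, invented for illustration.
scores = [2, 4, 4, 5, 7, 9]

# Range = maximum value - minimum value.
data_range = max(scores) - min(scores)  # 9 - 2 = 7

# Add one extreme observation and recompute.
scores_with_outlier = scores + [50]
range_with_outlier = max(scores_with_outlier) - min(scores_with_outlier)  # 50 - 2 = 48

print(f"range without outlier: {data_range}")
print(f"range with outlier:    {range_with_outlier}")
```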
Interquartile Range
The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1. That is, it measures the spread of the middle 50% of the data.
Percentile
The pth percentile of a distribution is the value below which p% of the observations fall.
Notes about IQR
- The larger the IQR, the more spread out the data are.
- IQR is resistant to outliers since it’s calculated using only the middle 50% of the data set (outliers tend to be outside this range).
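A sketch of the IQR and its resistance in Python. The data are hypothetical, and the quartile convention is an assumption: `statistics.quantiles` with `method="inclusive"` is one of several common conventions, and other software or textbooks may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data, invented for illustration.
data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# n=4 splits the data into quarters and returns the cut points [Q1, median, Q3].
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # 13 - 5 = 8

# Add one extreme observation and recompute both measures of spread.
data_out = data + [100]
q1_out, _, q3_out = statistics.quantiles(data_out, n=4, method="inclusive")
iqr_out = q3_out - q1_out  # 14.5 - 5.5 = 9

print(f"IQR:   {iqr} -> {iqr_out}")
print(f"range: {max(data) - min(data)} -> {max(data_out) - min(data_out)}")
```

The outlier moves the IQR from 8 to 9, while the range jumps from 16 to 99.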
5-number summary
The 5-number summary is a brief numerical description of the center and spread of a distribution. It consists of the minimum, Q1, median, Q3, and maximum. It can be displayed in R with summary() (which also reports the mean) and fivenum().
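For readers working outside R, a rough Python equivalent can be sketched with the standard library. The helper name `five_number_summary` and the data are hypothetical, and the `"inclusive"` quartile method is an assumed convention, so the quartiles may differ slightly from what other software reports for some data sets.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical data, invented for illustration.
data = [4, 7, 8, 10, 12, 15, 18, 21, 30]

summary = five_number_summary(data)
for label, value in zip(["min", "Q1", "median", "Q3", "max"], summary):
    print(f"{label:>6}: {value}")
```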
Detecting Potential Outliers
As a rule of thumb, an observation is marked as a potential outlier if it falls more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
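The rule of thumb can be sketched in Python as follows. The data are made up, and the quartiles use the `"inclusive"` method of `statistics.quantiles`, which is an assumed convention; a different quartile method could shift the cutoffs slightly.

```python
import statistics

# Hypothetical data, invented for illustration; 60 looks suspicious.
data = [4, 7, 8, 10, 12, 15, 18, 21, 60]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                 # 18 - 8 = 10

lower_fence = q1 - 1.5 * iqr  # 8 - 15 = -7
upper_fence = q3 + 1.5 * iqr  # 18 + 15 = 33

# Flag anything outside the two cutoffs as a potential outlier.
potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"cutoffs: [{lower_fence}, {upper_fence}]")
print(f"potential outliers: {potential_outliers}")
```

A flagged value is only a candidate for closer inspection; the rule does not say the observation is an error.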
Box Plot
The box plot is a plot of the 5-number summary. Not only do box plots provide a picture of the center and spread of a distribution, they also give us an idea of the shape or skew of the distribution.