Week 1 Flashcards

1
Q

What is statistics?

A

Statistics is the science of collecting, organizing, interpreting and learning from data.

2
Q

What are the three aspects of statistics?

A

Design: Planning how to obtain data to answer the question of interest.

Description: Summarizing the data that are obtained.

Inference: Using sample data to learn about the population.

3
Q

Population

A

The population is the collection of units of interest, such as all adults in the United States, alligators in the Everglades, or iPads from a factory.

4
Q

Subject

A

Subjects are the individual units of a population, such as an adult, an alligator, or an iPad.

5
Q

Sample

A

A sample is a subset of the units of a population.

6
Q

What makes a good sample?

A

A sample should be representative of the population. Representativeness can be achieved by selecting the sample subjects randomly.
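
As a rough illustration of random selection, the sketch below draws a simple random sample with Python's standard library. The population of 1,000 ID numbers, the sample size of 50, and the seed are all made up for the example.

```python
import random

# Hypothetical population: ID numbers for 1,000 subjects.
population = list(range(1, 1001))

# Fixed seed only so the sketch is repeatable.
rng = random.Random(42)

# Simple random sample: every subject has the same chance of being
# selected, and no subject is selected twice.
sample = rng.sample(population, k=50)

print(sorted(sample)[:5])
```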

7
Q

Where do statistical methods come in?

A
  1. Use DESIGN to obtain an appropriate sample from the population.
  2. DESCRIBE the sample data with graphical and numerical summaries.
  3. Perform STATISTICAL INFERENCE.
8
Q

Statistical Inference

A

The procedure of using a sample to learn about a population is called statistical inference.

9
Q

Parameter

A

A parameter is a number that describes a population. It is usually unknown.

10
Q

Statistic

A

A statistic is a number that describes a sample. It can be computed from data; therefore, it is known.

11
Q

We use a ____ to estimate a ______.

A

We use a sample statistic to estimate a population parameter.
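
A small simulation can make the distinction concrete. In the sketch below the whole population is generated artificially (10,000 hypothetical heights), which is the only reason its mean can be computed at all. In a real study only the sample mean would be available.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 heights in cm. In practice the whole
# population is rarely observed, so its mean (the parameter) is unknown.
population = [rng.gauss(170, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

# The statistic is computed from a random sample, so it is known.
sample = rng.sample(population, k=100)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The two values should be close but not identical, and a different random sample would give a slightly different statistic.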

12
Q

Variable

A

A variable is any characteristic of a subject in a population.

13
Q

Categorical (Qualitative) Variable

A

Classifies subjects as belonging to a certain group/category. For example, gender, race, political party, issue positions, etc.

14
Q

Quantitative Variable

A

Takes on numerical values that represent different magnitudes. For example, height, weight, age, IQ, income, temperature, etc.

15
Q

A quantitative variable can either be _____ or _____.

A

A quantitative variable can either be discrete or continuous.

16
Q

Discrete

A

The possible values of a discrete quantitative variable form a set of separate numbers that can be listed or counted. For example, age in years, number of tattoos, etc.

17
Q

Continuous

A

The possible values of a continuous quantitative variable form an interval. That is, there is an infinite continuum of possible values. For example, height, weight, income, time, etc.

18
Q

Graphical Summaries for Categorical Variables

A

Graphical summaries of categorical variables help us visualize the distribution of the data among the separate categories. Before constructing the graphical summary, we first organize the categorical data into a frequency table.

19
Q

Frequency Table

A

A frequency table is a listing of possible values for a variable, together with the number of observations for each value. (Note that we can also construct frequency tables for quantitative variables.)

20
Q

Proportion

A

A proportion of observations that fall in a certain category is the count of observations in that category divided by the total number of observations.
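
As an illustrative sketch, a frequency table and the matching proportions can be computed as follows. The ten party labels are invented for the example.

```python
from collections import Counter

# Hypothetical categorical data: political party of 10 subjects.
observations = ["Dem", "Rep", "Ind", "Dem", "Dem",
                "Rep", "Ind", "Dem", "Rep", "Dem"]

# Frequency table: each category with its number of observations.
counts = Counter(observations)

# Proportion = count in the category / total number of observations.
n = len(observations)
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, "
          f"proportion={proportions[category]:.2f}")
```

The proportions always add up to 1, which is a quick check on the table.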

21
Q

Two Graphical Summaries for Categorical Variables

A

Pie Charts and Bar Graphs

22
Q

Pie Chart

A

A circle is drawn with a “slice of pie” representing each category’s percentage of observations.

23
Q

Bar Graph

A

A bar is drawn for each category, with the bar’s height representing the percentage or count of observations.

24
Q

Pie Charts vs. Bar Graphs

A
  1. Pie charts emphasize a category’s relation to the whole, but make it difficult to compare categories to each other!
  2. Bar graphs compare the sizes of each group of a categorical variable (not in relation to the whole).
  3. Bar graphs are easier to read and more flexible than pie charts.
25
Q

Distribution

A

A distribution of data shows the values a variable takes and how often they occur.

26
Q

Features of distributions visualized by graphical summaries:

A
  1. Overall Pattern (bell-shaped, skewed, bimodal, etc.)
  2. Center and spread
  3. Outliers (unusually large or small observations)
27
Q

Two graphical summaries for quantitative variables:

A

Stem-and-leaf Plots and Histograms

28
Q

Histogram

A

Histograms break up the range of values of a variable into classes and display the count (or percent) of the observations that fall into each class.

29
Q

Steps to Construct a Histogram

A
  1. Divide the range of data into intervals of equal width. (We want to choose a width that gives us a good picture of the distribution of the data. The number of intervals should not be too many or too few.)
  2. Count the number of observations that fall into each interval.
  3. On the horizontal axis, mark the scale of the variable. On the vertical axis, mark the scale for counts or percents.
  4. Above each interval, draw a bar whose height is either the corresponding count or percent for that interval.
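
The steps above can be sketched in a few lines. The exam scores, the starting value of 50, and the interval width of 10 are assumptions chosen for the example, and the bars are printed sideways as text rather than drawn vertically.

```python
# Hypothetical data: exam scores for 12 students.
scores = [52, 58, 61, 64, 67, 70, 71, 75, 78, 83, 88, 95]

# Step 1: equal-width intervals [50,60), [60,70), ..., [90,100).
low, width, n_bins = 50, 10, 5

# Step 2: count the observations that fall into each interval.
counts = [0] * n_bins
for x in scores:
    counts[(x - low) // width] += 1

# Steps 3-4: label each interval and draw a bar whose length is its count.
for i, count in enumerate(counts):
    left = low + i * width
    print(f"[{left}, {left + width}): {'#' * count}")
```

Trying a width of 5 or 25 on the same data shows why the number of intervals should be neither too many nor too few.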
30
Q

Common Distribution Shapes

A

symmetric (normal, unimodal, bell-shaped), e.g. IQ, height, weight; right-skewed, e.g. income; left-skewed, e.g. lifespan, product failure rate; bimodal, e.g., height of men AND women (two populations); uniform, e.g. commute time

31
Q

n

A

The number of observations in a sample.

32
Q

Mean

A

(x bar) The average of all observations: sum the observations and divide by n.

33
Q

Median

A

(M) The middle number when measurements are ordered from smallest to largest (the 50th percentile). When n is odd, M is the middle value; when n is even, M is the average of the two middle values.
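
A minimal sketch of both measures of center, using two small made-up data sets so that the odd and even cases of the median can be compared:

```python
import statistics

odd = [3, 1, 7, 5, 9]        # n = 5
even = [3, 1, 7, 5, 9, 11]   # n = 6

# Mean: sum the observations and divide by n.
mean_odd = sum(odd) / len(odd)

# Median, n odd: the middle value of the ordered data 1, 3, 5, 7, 9.
median_odd = statistics.median(odd)

# Median, n even: the average of the two middle values, 5 and 7.
median_even = statistics.median(even)

print(mean_odd, median_odd, median_even)
```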

34
Q

Which measure of center is resistant to outliers?

A

The median.

35
Q

Resistant

A

A numerical summary of the observations is resistant if extreme observations have little, if any, influence on its value. The mean is affected by outliers, while the median is resistant to the skewing effects of outliers.
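
To see resistance in action, the sketch below adds one extreme value to a small hypothetical data set (incomes in thousands) and compares how far each measure of center moves.

```python
import statistics

# Hypothetical incomes, in thousands.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]   # add one extreme observation

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

# The mean is pulled far toward the outlier; the median barely moves.
print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```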

36
Q

Mean vs. Median

A

In symmetric distributions, the mean and median are approximately equal. In right-skewed distributions, the mean is typically greater than the median. In left-skewed distributions, the mean is typically less than the median.

37
Q

Measures of Spread

A

It’s important to look at measures of spread in addition to measures of center to get a better understanding of the data.

38
Q

What are three measures of spread?

A

Range, interquartile range and standard deviation.

39
Q

Range

A

The range is the difference between the largest and smallest observations. That is, the maximum value - the minimum value = the range. While the range is a simple measure of spread that is easy to calculate, it is only calculated using the most extreme values of a data set. Therefore, it can be misleading and is not resistant to outliers.
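
A short sketch of the calculation, with made-up values, also shows how a single extreme observation changes the range:

```python
data = [12, 15, 15, 18, 21]

# Range = maximum value - minimum value.
data_range = max(data) - min(data)

# Adding one extreme observation stretches the range dramatically,
# which is why the range is not resistant to outliers.
with_outlier = data + [90]
outlier_range = max(with_outlier) - min(with_outlier)

print(data_range, outlier_range)
```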

40
Q

Interquartile Range

A

The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1. That is, it measures the spread of the middle 50% of the data.
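
One way to compute it is sketched below with Python's `statistics.quantiles`. Textbooks and software use several slightly different conventions for locating quartiles; the `"inclusive"` method here is just one of them, and the data are invented.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
# The "inclusive" method is one of several quartile conventions.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Distance between the first and third quartiles.
iqr = q3 - q1

print(q1, q3, iqr)
```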

41
Q

Percentile

A

The pth percentile of a distribution is the value below which p% of the observations fall.

42
Q

Notes about IQR

A
  1. The larger the IQR, the more spread out the data is.
  2. IQR is resistant to outliers since it’s calculated using only the middle 50% of the data set (outliers tend to be outside this range).
43
Q

5-number summary

A

The 5-number summary is a brief numerical description of the center and spread of a distribution. It consists of the min, Q1, median, Q3, and max values. In R, fivenum() returns these five values, and summary() reports them along with the mean.
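
For readers working outside R, a rough Python equivalent is sketched below. The helper name `five_number_summary` is made up for this example, and the `"inclusive"` quartile method is an assumption, so results can differ slightly from R for some data sets.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # "inclusive" is one of several quartile conventions.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

data = [2, 4, 4, 5, 7, 8, 10, 12, 15]
summary = five_number_summary(data)
print(summary)
```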

44
Q

Detecting Potential Outliers

A

As a rule of thumb, an observation is marked as a potential outlier if it falls more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
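
The rule of thumb can be sketched as follows. The data are invented, and the quartiles use the `"inclusive"` convention of `statistics.quantiles`, which is an assumption; other quartile conventions may shift the cutoffs slightly.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 10, 12, 40]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Cutoffs: 1.5 x IQR below Q1 and 1.5 x IQR above Q3.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

# Observations beyond either cutoff are flagged as potential outliers.
outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]

print(lower_cutoff, upper_cutoff, outliers)
```

A flagged value is only a candidate for closer inspection, not automatically an error.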

45
Q

Box Plot

A

The box plot is a plot of the 5-number summary. Not only do box plots provide a picture of the center and spread of a distribution, they also give us an idea of the shape or skew of the distribution.