Week 1 Flashcards

Question

Distribution

Answer 1

A distribution of data shows the values a variable takes and how often they occur.
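
As an illustration of this definition, a frequency table lists each value a variable takes together with how often it occurs. The sketch below is a minimal Python example; the `siblings` data and the `frequency_table` helper are made up for illustration and are not part of the course material.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, proportion) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, c, c / n) for v, c in sorted(counts.items())]

# Hypothetical data: number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 1, 0, 3, 2, 1, 1]

for value, count, proportion in frequency_table(siblings):
    print(f"{value}: count={count}, proportion={proportion:.1f}")
```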

Answer 2

1. Overall pattern (bell-shaped, skewed, bimodal, etc.)
2. Center and spread
3. Outliers (unusually large or small observations)

Answer 3

Stem-and-leaf Plots and Histograms

Answer 4

Histograms break up the range of values of a variable into classes and display the count (or percent) of the observations that fall into each class.

Answer 5

1. Divide the range of data into intervals of equal width. (We want to choose a width that gives us a good picture of the distribution of the data. The number of intervals should not be too many or too few.)
2. Count the number of observations that fall into each interval.
3. On the horizontal axis, mark the scale of the variable. On the vertical axis, mark the scale for counts or percents.
4. Above each interval, draw a bar whose height is either the corresponding count or percent for that interval.
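
The steps above can be sketched in Python as follows. This is one possible implementation under stated assumptions: the `scores` data, the choice of four intervals, and the `histogram_counts` helper are hypothetical, and the bars are drawn as rows of `#` characters rather than on real axes.

```python
def histogram_counts(values, num_bins):
    """Split [min, max] into num_bins equal-width intervals and count values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        raise ValueError("all values are identical; cannot form intervals")
    counts = [0] * num_bins
    for v in values:
        # Index of the interval containing v; the maximum is placed in the last interval.
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(num_bins + 1)]
    return edges, counts

# Hypothetical data: 15 exam scores.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 82, 85, 91, 92]
edges, counts = histogram_counts(scores, num_bins=4)

# One text "bar" per interval; its length is the count for that interval.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Trying a few different values of `num_bins` on the same data is a quick way to see why too many or too few intervals gives a poor picture of the distribution.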

Answer 6

Symmetric (normal, unimodal, bell-shaped), e.g., IQ, height, weight; right-skewed, e.g., income; left-skewed, e.g., lifespan, product failure rate; bimodal, e.g., height of men AND women (two populations); uniform, e.g., commute time.

Answer 7

The number of observations in a sample

Answer 8

(x bar) The average of all observations: sum the observations and divide by n.

Answer 9

(M) The middle number when measurements are ordered from smallest to largest (the 50th percentile). When n is odd, M is the middle value; when n is even, M is the average of the two middle values.
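
The definitions of the mean and the median can be written out directly. The following Python sketch is only illustrative; the function names and the two small data sets are assumptions chosen to show the odd and even cases.

```python
def mean(values):
    """x bar: sum the observations and divide by n."""
    return sum(values) / len(values)

def median(values):
    """M: the middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [3, 1, 4, 1, 5]        # n = 5, ordered: 1 1 3 4 5
even_data = [3, 1, 4, 1, 5, 9]    # n = 6, ordered: 1 1 3 4 5 9

print(mean(odd_data), median(odd_data))
print(mean(even_data), median(even_data))
```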

Answer 10

The median.

Answer 11

A numerical summary of the observations is resistant if extreme observations have little, if any, influence on its value. The mean is affected by outliers, while the median is resistant to the skewing effects of outliers.
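
A small experiment makes resistance concrete: add one extreme observation and see how far each summary moves. The sketch below uses Python's standard `statistics` module; the income figures are invented for illustration.

```python
import statistics

incomes = [30, 32, 35, 38, 40]      # hypothetical incomes, in thousands
with_outlier = incomes + [400]      # the same data plus one extreme observation

# How far each summary moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```

In this made-up example the mean moves by tens of thousands while the median barely changes, which is the sense in which the median is resistant.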

Answer 12

In symmetric distributions, the mean and median are approximately equal. In right-skewed distributions, the mean is greater than the median. In left-skewed distributions, the mean is less than the median.
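
These relationships can be checked on small examples. The data sets below are hypothetical and were chosen to have an obvious long tail on one side; the `mean_minus_median` helper is an illustrative name, not a standard function.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 20]   # long right tail
left_skewed = [-x for x in right_skewed]              # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5, 6, 7]

def mean_minus_median(values):
    """Positive when the mean exceeds the median, negative when it is below."""
    return statistics.mean(values) - statistics.median(values)

for name, data in [("right-skewed", right_skewed),
                   ("left-skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name}: mean - median = {mean_minus_median(data):.2f}")
```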

Answer 13

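Measures of center alone do not describe a distribution: two data sets can have the same mean or median but very different variability. Looking at measures of spread in addition to measures of center gives a better understanding of the data.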

Answer 14

Range, interquartile range and standard deviation.

Answer 15

The range is the difference between the largest and smallest observations: range = maximum value - minimum value. The range is simple to calculate, but it uses only the two most extreme values of a data set. It can therefore be misleading, and it is not resistant to outliers.
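
The sketch below computes the range in Python and shows why it is not resistant: a single extreme value changes it completely. The data and the `value_range` name are assumptions made for this example.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

data = [2, 4, 4, 5, 7, 8, 9, 10, 12]   # hypothetical observations
with_outlier = data + [100]            # the same data plus one extreme value

print(value_range(data))          # spread of the original data
print(value_range(with_outlier))  # dominated by the single outlier
```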

Answer 16

The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1. That is, it measures the spread of the middle 50% of the data.
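
A minimal Python sketch of this calculation follows. It uses `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions; textbooks and other software may compute Q1 and Q3 slightly differently. The data set is hypothetical.

```python
import statistics

def iqr(values):
    """Interquartile range: Q3 minus Q1, using linear-interpolation quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [2, 4, 4, 5, 7, 8, 9, 10, 12]   # hypothetical observations
print(iqr(data))
```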

Answer 17

The pth percentile of a distribution is the value below which p% of the observations fall.
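
Read in the other direction, the definition says that a value sits at about the pth percentile if p% of the observations fall below it. The Python sketch below counts that percentage directly; the `percent_below` helper and the scores are hypothetical, and software that reports percentiles often interpolates between observations instead.

```python
def percent_below(values, x):
    """Percent of observations that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical data: 10 exam scores.
scores = [55, 60, 62, 68, 70, 74, 78, 81, 85, 93]

# 7 of the 10 scores are below 81, so 81 is at roughly the 70th percentile.
print(percent_below(scores, 81))
```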

Answer 18

1. The larger the IQR, the more spread out the data is.
2. IQR is resistant to outliers since it's calculated using only the middle 50% of the data set (outliers tend to be outside this range).

Answer 19

The 5-number summary is a brief numerical description of the center and spread of a distribution. It consists of the minimum, Q1, median, Q3, and maximum values. It can be displayed in R with summary() (which also reports the mean) and fivenum().
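
For readers working outside R, the same five values can be sketched in Python. This is an illustrative helper, not a port of R's functions: it uses `statistics.quantiles` with `method="inclusive"`, and because quartile conventions differ between tools, the Q1 and Q3 it returns may differ slightly from other software on some data sets. The data are made up.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear-interpolation quartiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

data = [2, 4, 4, 5, 7, 8, 9, 10, 12]   # hypothetical observations
print(five_number_summary(data))
```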

Answer 20

As a rule of thumb, an observation is marked as a potential outlier if it falls more than 1.5 x IQR below Q1 or more than 1.5 x IQR above Q3.
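
The rule of thumb can be sketched in Python as follows. The `potential_outliers` helper and the data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions, so borderline observations may be flagged differently by other tools.

```python
import statistics

def potential_outliers(values, k=1.5):
    """Return observations more than k x IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]   # hypothetical observations
print(potential_outliers(data))
```

Flagged values are only candidates: the rule marks observations worth a closer look, not observations that must be removed.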

Answer 21

The box plot is a plot of the five-number summary. Not only do box plots provide a picture of the center and spread of a distribution, they also give us an idea of the shape or skew of the distribution.

(45 cards)