CRP 109 Stats Lecture 1 Flashcards
Data definition
Collections of observations, such as measurements or survey
responses
Statistics definition
The science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions
based on them
Population definition
The complete collection of all measurements or data that are
being considered. Typically, it is the complete collection of data that we would like to make inferences about
Census definition
The collection of data from every member of the population
Sample definition
The subcollection of members selected from a population
Variable definition
A characteristic that varies (changes) across individuals in a
population. The values (observations) recorded collectively
make up the data
Parameter definition
A numerical measurement describing some characteristic of a
population
Statistic definition
A numerical measurement describing some characteristic of a
sample
Discrete data
result when the data values are quantitative and the number of
values is finite or countable
Continuous data
result from infinitely many possible quantitative values
(not countable). They can be measured, but not counted.
Missing completely at random
The likelihood of the data value being
missing is independent of its value or any of the other values in
the data set (any data value is just as likely to be missing as
any other data value)
Missing not at random
The missing value is related to the reason that it is missing. Ignoring these could lead to bias in the remaining
values and the results may then become misleading
Simple random sample (SRS)
A sample of n subjects selected in such a way that every possible sample of the same size n has the
same chance of being chosen
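A minimal Python sketch of drawing an SRS, assuming the population is available as a list. The function name `simple_random_sample` and the example population are illustrative and not from the lecture. `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members; every size-n subset is equally likely."""
    rng = random.Random(seed)  # the optional seed only makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1 through 20
population = list(range(1, 21))
sample = simple_random_sample(population, 5, seed=1)
print(sample)
```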
Designed Experiment
We apply some treatment and then proceed to observe its effects on the individuals. The individuals in
designed experiments are called experimental units, and they
are often called subjects when they are people
Observational Study
We observe and measure specific characteristics, but
we do not attempt to modify the individuals being studied
Random Sampling Error
Occurs when the sample has been selected with a
random method, but there is a discrepancy between a sample
result and the true population result; such an error results from
chance sample fluctuations
Non-Sampling Error
The result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased
conclusions, or applying statistical methods that are not
appropriate for the circumstances
Non-Random Sampling Error
The result of using a sampling method that is not random, such as using a convenience sample or a
voluntary response sample
Frequency (of a class)
The number of original values that fall into that class.
Frequency Distribution/Table
Shows how data are partitioned among
several categories/classes by listing the categories along with
the number (frequency) of data values in each of them
-used to summarize large data sets, see the distribution and identify outliers, and/or have a basis for constructing
graphs.
Lower Class Limits
The smallest numbers that can belong to each of the
different classes
Upper Class Limits
The largest numbers that can belong to each of the
different classes
Class Boundaries
The numbers used to separate the classes, but without
the gaps created by class limits.
Class Midpoints
The values in the middle of the classes. Each class midpoint
is computed by taking the average of the lower and upper class
limits
Class Width
The difference between two consecutive lower class limits (or
boundaries) in a frequency distribution.
When choosing classes: class width ≈ (max value - min value) / number of classes, usually rounded up to a convenient number
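A rough Python sketch of how the class width formula leads to class limits and midpoints. The function name `build_classes` and the example values are made up for illustration. Rounding the width up with `math.ceil` and treating the data as whole numbers (so each upper limit is one less than the next lower limit) are assumptions, not something stated on the cards.

```python
import math

def build_classes(min_value, max_value, num_classes):
    # Assumption: round the width up so the classes cover the whole range
    width = math.ceil((max_value - min_value) / num_classes)
    lower = [min_value + i * width for i in range(num_classes)]
    # Assumption: whole-number data, so upper limit = next lower limit - 1
    upper = [lo + width - 1 for lo in lower]
    # Midpoint = average of the lower and upper class limits
    midpoints = [(lo + up) / 2 for lo, up in zip(lower, upper)]
    return width, lower, upper, midpoints

# Hypothetical data range: smallest value 12, largest value 58, 5 classes
width, lower, upper, midpoints = build_classes(12, 58, 5)
print(width)      # (58 - 12) / 5 = 9.2, rounded up to 10
print(lower)      # [12, 22, 32, 42, 52]
print(upper)      # [21, 31, 41, 51, 61]
print(midpoints)  # [16.5, 26.5, 36.5, 46.5, 56.5]
```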
Relative (or Percentage) Frequency Distribution
relative frequency = frequency of class / sum of all frequencies
-multiply by 100 to get the percentage frequency
Cumulative Frequency Distribution
The frequency for each class is the sum of the frequencies for that
class and all previous classes
- class limits are replaced by “less than” expressions that describe
the new ranges of values
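The three kinds of frequency distribution can be sketched together in Python. The helper name `frequency_table`, the class limits, and the data values below are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
def frequency_table(values, lower_limits, width):
    # Frequency: how many values fall in each class [lower, lower + width)
    freqs = [sum(1 for v in values if lo <= v < lo + width) for lo in lower_limits]
    total = sum(freqs)
    # Relative frequency: class frequency / sum of all frequencies
    relative = [f / total for f in freqs]
    # Cumulative frequency: running total over this class and all previous ones
    cumulative = []
    running = 0
    for f in freqs:
        running += f
        cumulative.append(running)
    return freqs, relative, cumulative

values = [12, 15, 22, 23, 25, 31, 38, 44, 47, 58]  # made-up data
freqs, relative, cumulative = frequency_table(values, [12, 22, 32, 42, 52], 10)
print(freqs)       # [2, 4, 1, 2, 1]
print(cumulative)  # [2, 6, 7, 9, 10]
print([round(r * 100) for r in relative])  # percentage frequencies
```

The last cumulative frequency always equals the total number of values, which is a quick check on the table.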
Histogram
A graph consisting of bars of equal width drawn adjacent to
each other (unless there are gaps in the data). The horizontal scale represents classes of quantitative data values, and the vertical scale represents frequencies. The heights of the bars
correspond to frequency values.
Relative Frequency Histogram
A graph that has the same shape and
horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies (i.e. proportion or percent)
Correlation
Exists between two variables when the values of one variable
are somehow associated with the values of the other variable.
Correlation does not imply causation.
Linear Correlation
Exists between two variables when there is a correlation
and the plotted points of paired data result in a pattern that
can be approximated by a straight line
Linear Correlation Coefficient, r
Measures the strength of the linear
correlation between the paired quantitative x and y values in a
sample. It is sometimes referred to as the Pearson product
moment correlation coefficient
Scatterplot
A plot of paired (x, y) quantitative data with a horizontal x-axis
and a vertical y-axis
Properties of r
-The value is always between −1 and 1, inclusive
-If r is close to −1, there appears to be a strong negative correlation.
-If r is close to 1, there appears to be a strong positive correlation.
-If r is close to 0, there appears to be a weak or no linear correlation.
-A value of exactly −1 or 1 implies that all of the data fall exactly on a line (perfect correlation)
-If all values of either variable are converted to a different scale, the
value of r does not change
- Interchange all x values and y values, and the value of r will not change
- not designed to measure the strength of a relationship that is not linear
-sensitive to outliers
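A small Python sketch that computes r from its usual sum-of-products form and checks a few of the properties listed above. The function name `pearson_r` and the data are illustrative, and the rescaling factor 2.54 is just an example of a change of units.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]  # made-up paired data
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)                               # about 0.775
r_swapped = pearson_r(ys, xs)                       # interchange x and y
r_rescaled = pearson_r([x * 2.54 for x in xs], ys)  # change the scale of x
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])         # points exactly on a line

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3), round(r_perfect, 3))
```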
Regression
Given a collection of paired sample data, the regression line or line of
best fit or least-squares line is the straight line that “best” fits the
scatterplot of the data
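A minimal sketch of the least-squares line in Python, assuming the standard formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x), which the card does not spell out. The function name and data are illustrative.

```python
def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

xs = [1, 2, 3, 4, 5]  # made-up paired data
ys = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(xs, ys)
predicted_at_6 = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(predicted_at_6, 2))
```

For this data the fitted line is approximately y = 2.2 + 0.6x.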
Descriptive Statistics
Methods and tools that summarize or describe
relevant characteristics of data
Inferential Statistics
Methods and tools that make inferences, or
generalizations, about populations
Mean
The measure of centre found by adding all of the data values and
dividing the total by the number of data values
-not resistant to outliers
Median
The measure of centre that is the middle value when the original data
values are arranged in order of increasing (or decreasing) magnitude.
-resistant to outliers (only changes slightly)
Mode
-The value(s) that occur with the greatest frequency.
-The mode can be found with qualitative data.
-A data set can have no mode, one mode (unimodal), or multiple
modes
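The three measures of centre can be compared with Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The data values are made up. Adding one extreme value shows why the mean is described as not resistant and the median as resistant.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]   # made-up values
with_outlier = data + [100]  # same data plus one extreme value

mean = statistics.mean(data)        # 30 / 6 = 5
median = statistics.median(data)    # middle of 3 and 5 = 4
modes = statistics.multimode(data)  # [3]

mean_out = statistics.mean(with_outlier)      # jumps to about 18.6
median_out = statistics.median(with_outlier)  # moves only from 4 to 5

# The mode also works for qualitative data
colour_modes = statistics.multimode(["red", "blue", "red", "green"])

print(mean, median, modes, round(mean_out, 1), median_out, colour_modes)
```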
s
sample standard deviation
s²
sample variance
σ
population standard deviation
σ²
population variance
Range
-The difference between the maximum data value and the minimum
data value
-Very sensitive to outliers (not resistant)
-does not truly reflect the variation among all of the data values
Standard Deviation of a Sample (s)
A measure of how much data values deviate away from the mean
-The value is never negative. It is zero only when all of the data values
are exactly the same.
-Larger values indicate greater amounts of variation.
-not resistant to outliers
- units are the same as the units of the original data values
Variance
-A measure of variation equal to the square of the standard deviation.
-The units are the squares of the units of the original data values
-not resistant to outliers
-The value is never negative. It is zero only when all of the data values
are the same number.
-s² is an unbiased estimator of σ²
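A short Python sketch of the sample and population versions, assuming the usual formulas (divide the sum of squared deviations by n - 1 for a sample, by n for a population), which the cards do not write out. The data are made up. The variable names follow the symbols on the cards.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)  # 32

s2 = squared_deviations / (n - 1)  # sample variance, 32 / 7
s = math.sqrt(s2)                  # sample standard deviation
sigma2 = squared_deviations / n    # population variance, 32 / 8 = 4
sigma = math.sqrt(sigma2)          # population standard deviation, 2

# The standard library gives the same sample values
print(round(s2, 3), round(statistics.variance(data), 3))
print(round(s, 3), round(statistics.stdev(data), 3))
print(sigma2, sigma)
```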
Chebyshev’s Rule
for any data set:
-at least 75% of data lies within 2 standard deviations of the mean.
-at least 89% of data lies within 3 standard deviations of the mean
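The 75% and 89% figures are the k = 2 and k = 3 cases of the general bound 1 - 1/k², which holds for any k > 1. A tiny Python sketch follows; the function name is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

bound_2 = chebyshev_lower_bound(2)  # 1 - 1/4 = 0.75
bound_3 = chebyshev_lower_bound(3)  # 1 - 1/9, about 0.889
print(bound_2, round(bound_3, 3))
```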
Empirical Rule
The empirical rule states that for bell-shaped data sets,
-approximately 68% of data lies within 1 standard deviation of the
mean.
-approximately 95% of data lies within 2 standard deviations of the
mean.
-approximately 99.7% of data lies within 3 standard deviations of the
mean
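As a rough check, the three percentages can be reproduced from a normal distribution, which is one common model for "bell-shaped" data (that modelling choice is an assumption here, not part of the card). The sketch uses `statistics.NormalDist`, available in Python 3.8 and later.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(share * 100, 1))  # roughly 68, 95 and 99.7 percent
```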
Percentiles
Measures of location, denoted P1, P2, ..., P99, which divide a set of
data into 100 groups with about 1% of the values in each group
-The 50th percentile, P50, has about 50% of the data values below it
and about 50% of the data values above it, corresponding to the
median
Finding the Percentile of a Data Value
percentile of value x = (number of values less than x) / (total number of values) *100
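The formula above translates directly into Python. The function name and data are illustrative. The result is left unrounded here; rounding to a whole number is a common convention but is not stated on the card.

```python
def percentile_of_value(values, x):
    # Count the values strictly less than x, then scale to a percentage
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 20]  # made-up values
p = percentile_of_value(data, 11)           # 5 of the 10 values are below 11
print(p)
```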
k
percentile being used
L
locator that gives the position of a value in a sorted list
Pk
kth percentile
Converting a Percentile to a Data Value
- arrange values lowest to highest
- L = (k/100)n
- if L is a whole number, the kth percentile is midway between the Lth value and the next value in the sorted set of data, i.e. Pk = (Lth value + next value) / 2
- if L is not a whole number, round L up. Pk is the Lth value counting from the lowest in the data set.
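The steps above can be sketched in Python. The function name `percentile_value` and the data are illustrative, and k is assumed to be a whole number from 1 to 99. Integer arithmetic (`divmod`) is used to test whether L is whole, which avoids floating-point surprises. Other textbooks and software use slightly different percentile conventions, so results may not match every tool.

```python
def percentile_value(values, k):
    data = sorted(values)                  # arrange values lowest to highest
    n = len(data)
    whole, remainder = divmod(k * n, 100)  # L = (k/100) * n, split into parts
    if remainder == 0:
        # L is a whole number: average the Lth value and the next one
        # (positions on the card are 1-based, Python indexes are 0-based)
        return (data[whole - 1] + data[whole]) / 2
    # L is not whole: round up to position whole + 1, which is index `whole`
    return data[whole]

data = [20, 3, 15, 8, 5, 9, 11, 18, 7, 13]  # made-up values, n = 10
p25 = percentile_value(data, 25)  # L = 2.5, round up to 3, 3rd value = 7
p50 = percentile_value(data, 50)  # L = 5, midway between 9 and 11 = 10.0
p75 = percentile_value(data, 75)  # L = 7.5, round up to 8, 8th value = 15
print(p25, p50, p75)
```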
Quartiles
Measures of location, denoted Q1, Q2, and Q3 which divide a set of
data into four groups with about 25% of the values in each group
Q1 = P25
Q2 = P50
Q3 = P75
Interquartile range (IQR)
IQR = Q3 − Q1
-another measure of spread that is less sensitive to outliers
5-Number Summary
For a set of data, consists of these five values:
1. Minimum
2. First quartile, Q1
3. Second quartile, Q2 (same as the median)
4. Third quartile, Q3
5. Maximum
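A hedged Python sketch of the 5-number summary and the IQR, using the locator rule L = (k/100)n from the percentile cards to find the quartiles. The function names and data are illustrative. Quartile conventions differ between textbooks and software, so other tools may give slightly different Q1 and Q3.

```python
def percentile_value(sorted_data, k):
    n = len(sorted_data)
    whole, remainder = divmod(k * n, 100)  # L = (k/100) * n
    if remainder == 0:
        # L is whole: midway between the Lth value and the next value
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    return sorted_data[whole]              # L rounded up, as a 0-based index

def five_number_summary(values):
    data = sorted(values)
    return (
        data[0],                     # minimum
        percentile_value(data, 25),  # Q1 = P25
        percentile_value(data, 50),  # Q2 = P50 (the median)
        percentile_value(data, 75),  # Q3 = P75
        data[-1],                    # maximum
    )

summary = five_number_summary([20, 3, 15, 8, 5, 9, 11, 18, 7, 13])  # made-up data
minimum, q1, q2, q3, maximum = summary
iqr = q3 - q1
print(summary, iqr)
```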
Constructing a Boxplot
Can be used to identify skewness
1. Find the 5-number summary.
2. Construct a line segment extending from the minimum data value to
the maximum data value.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in
the box at the median.
Identifying Outliers for Modified Boxplots
- Find the quartiles.
- Find the IQR.
- Evaluate 1.5×IQR.
- In a modified boxplot, a data value is an outlier if it is:
above Q3 by an amount greater than 1.5×IQR; or below Q1 by
an amount greater than 1.5×IQR.
-A special symbol (such as an asterisk or point) is used to identify
outliers as defined previously.
-The solid horizontal line extends only as far as the minimum data
value that is not an outlier and the maximum data value that is not
an outlier
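The outlier rule above can be sketched in Python as follows. The function names and data are illustrative, and the quartiles are found with the locator rule L = (k/100)n from the percentile cards, which is only one of several conventions in use.

```python
def percentile_value(sorted_data, k):
    n = len(sorted_data)
    whole, remainder = divmod(k * n, 100)  # L = (k/100) * n
    if remainder == 0:
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    return sorted_data[whole]

def modified_boxplot_parts(values):
    data = sorted(values)
    q1 = percentile_value(data, 25)
    q3 = percentile_value(data, 75)
    step = 1.5 * (q3 - q1)                  # 1.5 x IQR
    low_fence, high_fence = q1 - step, q3 + step
    # Outliers lie more than 1.5 x IQR below Q1 or above Q3
    outliers = [v for v in data if v < low_fence or v > high_fence]
    # The line extends to the smallest and largest values that are not outliers
    inside = [v for v in data if low_fence <= v <= high_fence]
    return outliers, (inside[0], inside[-1])

# Made-up data: Q1 = 7, Q3 = 15, IQR = 8, so the fences are -5 and 27
outliers, line_ends = modified_boxplot_parts([3, 5, 7, 8, 9, 11, 13, 15, 18, 40])
print(outliers, line_ends)
```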