CRP 109 Stats Lecture 1 Flashcards

1
Q

Data definition

A

Collections of observations, such as measurements or survey
responses

2
Q

Statistics definition

A

The science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions
based on them

3
Q

Population definition

A

The complete collection of all measurements or data that are
being considered. Typically, it is the complete collection of data that we would like to make inferences about

4
Q

Census definition

A

The collection of data from every member of the population

5
Q

Sample definition

A

The subcollection of members selected from a population

6
Q

Variable definition

A

A characteristic that varies (changes) across individuals in a
population. The values (observations) recorded collectively
make up the data

7
Q

Parameter definition

A

A numerical measurement describing some characteristic of a
population

8
Q

Statistic definition

A

A numerical measurement describing some characteristic of a
sample

9
Q

Discrete data

A

Result when the data values are quantitative and the number of
possible values is finite or countable

10
Q

Continuous data

A

result from infinitely many possible quantitative values
(not countable). They can be measured, but not counted.

11
Q

Missing completely at random

A

The likelihood of the data value being
missing is independent of its value or any of the other values in
the data set (any data value is just as likely to be missing as
any other data value)

12
Q

Missing not at random

A

The missing value is related to the reason that it is missing. Ignoring these could lead to bias in the remaining
values and the results may then become misleading

13
Q

Simple random sample (SRS)

A

A sample of n subjects selected in such a way that every possible sample of the same size n has the
same chance of being chosen
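
A minimal sketch of drawing a simple random sample, assuming the population can be listed in full. The helper name `simple_random_sample`, the population of 20 labelled subjects, and the seed are illustrative assumptions, not part of the lecture.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement.

    random.sample gives every subset of size n the same chance
    of being chosen, which is the defining property of an SRS.
    """
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, n)

population = list(range(1, 21))  # hypothetical population: subjects 1..20
sample = simple_random_sample(population, 5, seed=1)
print(sample)
```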

14
Q

Designed Experiment

A

We apply some treatment and then proceed to observe its effects on the individuals. The individuals in
designed experiments are called experimental units, and they
are often called subjects when they are people

15
Q

Observational Study

A

We observe and measure specific characteristics, but
we do not attempt to modify the individuals being studied

16
Q

Random Sampling Error

A

Occurs when the sample has been selected with a
random method, but there is a discrepancy between a sample
result and the true population result; such an error results from
chance sample fluctuations

17
Q

Non-Sampling Error

A

The result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased
conclusions, or applying statistical methods that are not
appropriate for the circumstances

18
Q

Non-Random Sampling Error

A

The result of using a sampling method that is not random, such as using a convenience sample or a
voluntary response sample

19
Q

Frequency (of a class)

A

The number of original values that fall into that class.

20
Q

Frequency Distribution/Table

A

Shows how data are partitioned among
several categories/classes by listing the categories along with
the number (frequency) of data values in each of them
-used to summarize large data sets, see the distribution and identify outliers, and/or have a basis for constructing
graphs.

21
Q

Lower Class Limits

A

The smallest numbers that can belong to each of the
different classes

22
Q

Upper Class Limits

A

The largest numbers that can belong to each of the
different classes

23
Q

Class Boundaries

A

The numbers used to separate the classes, but without
the gaps created by class limits.

24
Q

Class Midpoints

A

The values in the middle of the classes. Each class midpoint
is computed by taking the average of the lower and upper class
limits

25
Q

Class Width

A

The difference between two consecutive lower class limits (or
boundaries) in a frequency distribution.
When building a table: class width ≈ (max value − min value) / number of classes, rounded up to a convenient number
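
A rough sketch of how the class width, class limits, and class midpoints fit together for whole-number data. The data values and function names are made up for illustration. One assumption to flag: when the quotient is already a whole number, the sketch still moves up by one so that the maximum value lands inside the last class.

```python
import math

def class_width(values, num_classes):
    """(max - min) / number of classes, moved up to the next whole number."""
    raw = (max(values) - min(values)) / num_classes
    return math.floor(raw) + 1

def class_limits(values, num_classes):
    """(lower, upper) class limits for integer data, starting at the minimum."""
    width = class_width(values, num_classes)
    start = min(values)
    return [(start + i * width, start + (i + 1) * width - 1)
            for i in range(num_classes)]

def class_midpoints(limits):
    """Average of the lower and upper limit of each class."""
    return [(lower + upper) / 2 for lower, upper in limits]

data = [12, 15, 21, 22, 27, 30, 34, 38, 41, 49]  # made-up integer data
width = class_width(data, 4)
limits = class_limits(data, 4)
midpoints = class_midpoints(limits)
print(width, limits, midpoints)
```

Consecutive lower limits here differ by exactly the class width, which matches the definition on this card.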

26
Q

Relative (or Percentage) Frequency Distribution

A

relative frequency = frequency of class / sum of all frequencies

percentage frequency = relative frequency × 100
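
As a small worked example of the two formulas, assuming a hypothetical frequency table with three classes labelled A, B, and C:

```python
# Hypothetical class frequencies; the labels and counts are made up.
frequencies = {"A": 4, "B": 10, "C": 6}

total = sum(frequencies.values())

# relative frequency = class frequency / sum of all frequencies
relative = {label: f / total for label, f in frequencies.items()}

# percentage frequency = relative frequency expressed out of 100
percentage = {label: 100 * f / total for label, f in frequencies.items()}

print(relative)
print(percentage)
```

The relative frequencies should sum to 1 and the percentages to 100, apart from rounding.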

27
Q

Cumulative Frequency Distribution

A

frequency for each class is the sum of the frequencies for that
class and all previous classes
- class limits are replaced by “less than” expressions that describe
the new ranges of values
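
A short sketch of building a cumulative frequency distribution from an ordinary one. The classes (10-19, 20-29, 30-39, 40-49) and their frequencies are assumed for illustration only.

```python
from itertools import accumulate

# Hypothetical classes 10-19, 20-29, 30-39, 40-49 and their frequencies.
next_lower_limits = [20, 30, 40, 50]  # each class becomes "less than" this value
frequencies = [3, 7, 5, 2]

# Running total: each entry is that class plus all previous classes.
cumulative = list(accumulate(frequencies))

table = [(f"less than {limit}", count)
         for limit, count in zip(next_lower_limits, cumulative)]
for row in table:
    print(row)
```

The last cumulative frequency equals the total number of data values.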

28
Q

Histogram

A

A graph consisting of bars of equal width drawn adjacent to
each other (unless there are gaps in the data). The horizontal scale represents classes of quantitative data values, and the vertical scale represents frequencies. The heights of the bars
correspond to frequency values.

29
Q

Relative Frequency Histogram

A

A graph that has the same shape and
horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies (i.e. proportion or percent)

30
Q

Correlation

A

Exists between two variables when the values of one variable
are somehow associated with the values of the other variable.
Correlation does not imply causation.

31
Q

Linear Correlation

A

Exists between two variables when there is a correlation
and the plotted points of paired data result in a pattern that
can be approximated by a straight line

32
Q

Linear Correlation Coefficient, r

A

Measures the strength of the linear
correlation between the paired quantitative x and y values in a
sample. It is sometimes referred to as the Pearson product
moment correlation coefficient

33
Q

Scatterplot

A

A plot of paired (x, y) quantitative data with a horizontal x-axis
and a vertical y-axis

34
Q

Properties of r

A

-The value is always between −1 and 1, inclusive
-If r is close to −1, there appears to be a strong negative correlation.
-If r is close to 1, there appears to be a strong positive correlation.
-If r is close to 0, there appears to be a weak or no linear correlation.
-A value of exactly −1 or 1 implies that all of the data fall exactly on a line (perfect correlation)
-If all values of either variable are converted to a different scale, the
value of r does not change
-Interchange all x values and y values, and the value of r will not change
-Not designed to measure the strength of a relationship that is not linear
-Sensitive to outliers
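
A sketch that computes r from the standard Pearson formula and checks a few of the properties listed above on made-up paired data. The function name `pearson_r` and the data are illustrative, and the rescaling shown is a positive one (such as a change of units).

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient for paired sample data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]  # made-up paired data
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                          # interchange x and y
r_rescaled = pearson_r([10 * x + 3 for x in xs], ys)   # change the scale of x
r_perfect = pearson_r(xs, [2 * x + 1 for x in xs])     # points exactly on a line

print(r, r_swapped, r_rescaled, r_perfect)
```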

35
Q

Regression

A

Given a collection of paired sample data, the regression line or line of
best fit or least-squares line is the straight line that “best” fits the
scatterplot of the data
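
One common way to make "best" precise is least squares: choose the slope and intercept that minimize the sum of squared vertical distances from the points to the line. A minimal sketch under that assumption, with made-up data and a hypothetical helper name:

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

xs = [1, 2, 3, 4, 5]  # made-up paired sample data
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
print(slope, intercept)
```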

36
Q

Descriptive Statistics

A

Methods and tools that summarize or describe
relevant characteristics of data

37
Q

Inferential Statistics

A

Methods and tools that make inferences, or
generalizations, about populations

38
Q

Mean

A

The measure of centre found by adding all of the data values and
dividing the total by the number of data values
-not resistant to outliers

39
Q

Median

A

The measure of centre that is the middle value when the original data
values are arranged in order of increasing (or decreasing) magnitude.
-resistant to outliers (only changes slightly)
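
A quick illustration of resistance, using Python's standard `statistics` module on made-up data: replacing one value with an extreme outlier moves the mean a lot, while the median stays put.

```python
from statistics import mean, median

data = [2, 3, 4, 5, 6]            # made-up values
with_outlier = [2, 3, 4, 5, 100]  # same data, largest value replaced by an outlier

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # centre of the original data
print(mean_after, median_after)    # the mean shifts, the median does not
```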

40
Q

Mode

A

-The value(s) that occurs with the greatest frequency.
-The mode can be found with qualitative data.
-A data set can have no mode, one mode (unimodal), or multiple
modes
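
A small sketch of finding the mode(s), following the convention on this card that a data set in which no value repeats has no mode. The helper name `modes` and the example data are assumptions for illustration.

```python
from collections import Counter

def modes(values):
    """Return the value(s) with the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 3, 3, 4]))        # two modes
print(modes(["red", "blue", "red"]))    # works for qualitative data
print(modes([1, 2, 3]))                 # no mode
```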

41
Q

s

A

sample standard deviation

42
Q

s²

A

sample variance

43
Q

σ

A

population standard deviation

44
Q

σ²

A

population variance

45
Q

Range

A

-The difference between the maximum data value and the minimum
data value
-Very sensitive to outliers (not resistant)
-Does not truly reflect the variation among all of the data values

46
Q

Standard Deviation of a Sample (s)

A

A measure of how much data values deviate away from the mean
-The value is never negative. It is zero only when all of the data values
are exactly the same.
-Larger values indicate greater amounts of variation.
-not resistant to outliers
- units are the same as the units of the original data values

47
Q

Variance

A

-A measure of variation equal to the square of the standard deviation.
-The units are the squares of the units of the original data values
-not resistant to outliers
-The value is never negative. It is zero only when all of the data values
are the same number.
-s² is an unbiased estimator of σ²
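
A sketch of the sample variance and sample standard deviation, assuming the usual definition that divides the sum of squared deviations by n − 1 (the lecture cards do not spell the formula out). The data are made up, and the results are compared against Python's `statistics` module.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

s2 = sample_variance(data)
s = sample_std(data)
print(s2, s)
print(statistics.variance(data), statistics.stdev(data))  # library equivalents
```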

48
Q

Chebyshev’s Rule

A

for any data set:
-at least 75% of data lies within 2 standard deviations of the mean.
-at least 89% of data lies within 3 standard deviations of the mean
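
The two percentages come from the general bound 1 − 1/k² for k standard deviations (k > 1). A minimal sketch, with a hypothetical function name:

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, chebyshev_bound(k))  # 0.75 for k = 2, about 0.889 for k = 3
```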

49
Q

Empirical Rule

A

The empirical rule states that for bell-shaped data sets,
-approximately 68% of data lies within 1 standard deviation of the
mean.
-approximately 95% of data lies within 2 standard deviations of the
mean.
-approximately 99.7% of data lies within 3 standard deviations of the
mean
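
The three percentages can be checked against a normal distribution, which is the usual model for "bell-shaped" data (that modelling choice is an assumption here). A short sketch using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * proportion_within(k), 1))  # percentages near 68, 95, 99.7
```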

50
Q

Percentiles

A

Measures of location, denoted P1, P2, . . . , P99, which divide a set of
data into 100 groups with about 1% of the values in each group
-The 50th percentile, P50, has about 50% of the data values below it
and about 50% of the data values above it, corresponding to the
median

51
Q

Finding the Percentile of a Data Value

A

percentile of value x = (number of values less than x) / (total number of values) *100
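
A direct translation of this formula, with made-up data. Textbooks often round the result to the nearest whole number; the sketch leaves it unrounded.

```python
def percentile_of_value(values, x):
    """Percentage of data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up values
print(percentile_of_value(data, 70))  # 6 of the 10 values are below 70
```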

52
Q

k

A

percentile being used

53
Q

L

A

locator that gives the position of a value in a sorted list

54
Q

Pk

A

kth percentile

55
Q

Converting a Percentile to a Data Value

A
1. Arrange the values from lowest to highest.
2. Compute the locator L = (k/100)n, where n is the number of values.
3. If L is a whole number, the kth percentile is midway between the Lth value and the next value in the sorted data, i.e. Pk = (Lth value + next value) / 2.
4. If L is not a whole number, round L up to the next whole number. Pk is the Lth value counting from the lowest.
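
The steps above can be sketched as follows, assuming k is a whole number from 1 to 99. The function name and data are illustrative, and integer arithmetic is used to test whether L is whole so that floating-point rounding cannot interfere.

```python
def percentile_value(values, k):
    """Return Pk using the locator L = (k/100) * n described above."""
    data = sorted(values)            # step 1: lowest to highest
    n = len(data)
    product = k * n                  # L = product / 100
    if product % 100 == 0:           # step 3: L is a whole number
        L = product // 100
        return (data[L - 1] + data[L]) / 2   # midway between Lth and next value
    L = product // 100 + 1           # step 4: round L up
    return data[L - 1]               # Lth value, counting from the lowest

data = [15, 1, 9, 3, 13, 5, 11, 7]   # made-up values, deliberately unsorted
print(percentile_value(data, 25))    # L = 2, so midway between 3 and 5
print(percentile_value(data, 40))    # L = 3.2, rounded up to 4
```

Other textbooks and software packages use slightly different percentile conventions, so results may not match exactly.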
56
Q

Quartiles

A

Measures of location, denoted Q1, Q2, and Q3 which divide a set of
data into four groups with about 25% of the values in each group
Q1 = P25
Q2 = P50
Q3 = P75

57
Q

Interquartile range (IQR)

A

IQR = Q3 − Q1
-Another measure of spread that is less sensitive to outliers

58
Q

5-Number Summary

A

For a set of data, consists of these five values:
1. Minimum
2. First quartile, Q1
3. Second quartile, Q2 (same as the median)
4. Third quartile, Q3
5. Maximum
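
A sketch of the 5-number summary and the IQR, assuming the quartiles are found as P25, P50, and P75 with the locator method from the percentile cards. Function names and data are made up for illustration.

```python
def percentile_value(values, k):
    """Pk via the locator L = (k/100) * n, for whole-number k from 1 to 99."""
    data = sorted(values)
    product = k * len(data)
    if product % 100 == 0:               # L is whole: average Lth and next value
        L = product // 100
        return (data[L - 1] + data[L]) / 2
    return data[product // 100]          # L rounded up, converted to a 0-based index

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    return (min(values),
            percentile_value(values, 25),
            percentile_value(values, 50),
            percentile_value(values, 75),
            max(values))

data = [1, 3, 5, 7, 9, 11, 13, 15]       # made-up values
summary = five_number_summary(data)
iqr = summary[3] - summary[1]            # IQR = Q3 - Q1
print(summary, iqr)
```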

59
Q

Constructing a Boxplot

A

Can be used to identify skewness
1. Find the 5-number summary.
2. Construct a line segment extending from the minimum data value to
the maximum data value.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in
the box at the median.

60
Q

Identifying Outliers for Modified Boxplots

A
1. Find the quartiles.
2. Find the IQR.
3. Evaluate 1.5×IQR.
4. In a modified boxplot, a data value is an outlier if it is
above Q3 by an amount greater than 1.5×IQR, or below Q1 by
an amount greater than 1.5×IQR.
-A special symbol (such as an asterisk or point) is used to identify
outliers as defined above.
-The solid horizontal line extends only as far as the minimum data
value that is not an outlier and the maximum data value that is not
an outlier
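
A minimal sketch of the outlier check, taking Q1 and Q3 as inputs so that any quartile convention can be used. The data are made up, and the quartiles shown (5 and 13) are assumed to have been found beforehand.

```python
def find_outliers(values, q1, q3):
    """Return (outliers, whisker_low, whisker_high) for a modified boxplot."""
    step = 1.5 * (q3 - q1)               # 1.5 x IQR
    low_fence = q1 - step
    high_fence = q3 + step
    outliers = [v for v in values if v < low_fence or v > high_fence]
    inside = [v for v in values if low_fence <= v <= high_fence]
    # The line extends only to the most extreme values that are not outliers.
    return outliers, min(inside), max(inside)

data = [1, 3, 5, 7, 9, 11, 13, 15, 40]   # made-up values
outliers, whisker_low, whisker_high = find_outliers(data, q1=5, q3=13)
print(outliers, whisker_low, whisker_high)
```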