Chapter 1 Flashcards

1
Q

Population

A

The entire collection of individuals or objects about which information is desired

2
Q

Census

A

A census occurs when all of the desired information is available for every object or individual in the population

3
Q

Sample

A

A subset of the population, studied in place of the whole population because of limited time, resources, money, etc.

4
Q

Types of Variables

A

1) Categorical 2) Quantitative or Numerical

5
Q

Categorical Variable

A

A categorical variable places an individual or object into one of several groups or categories. Ex) Gender, race, type of job, hair color

6
Q

Quantitative or Numerical Variable

A

A quantitative or numerical variable takes numerical values for which arithmetic operations, such as adding and calculating an average value, make sense. Ex) Age, salary

7
Q

Discrete numerical data

A

Numerical data is discrete if its set of possible values is finite, or can be listed in an infinite sequence (countable). Ex) Your year in college (1, 2, 3 or 4)

8
Q

Continuous numerical data

A

Numerical data is continuous if its set of possible values forms an entire interval on the number line. Ex) Weight or height of an individual

9
Q

Univariate Data

A

Observations made on a single variable for each object in the dataset. Ex) The unemployment rate of each state (state = object, unemployment rate = single variable)

10
Q

Multivariate Data

A

Observations made on multiple variables for each object in the dataset. Ex) Each person -> age, gender, race, salary, job type

11
Q

Bivariate Data

A

Bivariate data is a special case of multivariate data, where observations are made on two variables for each object in the dataset

12
Q

Branches of Statistics

A

Descriptive Statistics; Inferential Statistics; Probability

13
Q

Descriptive Statistics

A
  • Objective: summarize and describe the important features of the data that is collected
  • Graphical approach: stem-and-leaf plots, histograms, box-plots, pie charts, scatter plots, etc.
  • Numerical approach: calculation of numerical summary measures such as the arithmetic mean, median, mode, standard deviation, correlation coefficient, etc.
14
Q

Inferential Statistics

A
  • Objective: use the information in the sample to draw a conclusion (or inference) about the population from which the sample was selected
  • Includes point estimation, hypothesis testing, confidence interval estimation, ANOVA, linear regression, etc.
15
Q

Probability

A
  • Forms a bridge between descriptive and inferential statistics
  • Probability makes assumptions about the structure of the population, and then asks what might result from selecting a sample from that population (deductive reasoning)
16
Q

Stem-Leaf Plot Construction

A

Separate each observation into a “stem,” consisting of all but the final (rightmost) digit, and a “leaf,” the final digit
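
The stem/leaf split described above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the observations are non-negative integers, so the leaf is the last digit and the stem is everything before it. The function name `stem_and_leaf` and the sample data are made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # stem = leading digits, leaf = final digit
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical sample data
data = [12, 15, 21, 23, 23, 30, 47]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Because the values are sorted first, the leaves on each stem come out in increasing order, which is what makes the plot usable for reading off the median.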

17
Q

Stem-Leaf Plot: Pros and Cons

A

Pros:
  • A quick way of describing and ordering data
  • Generally easy to construct
  • Displays the actual data values
  • An easy way to obtain a general idea of the distribution of the data (e.g. symmetric, skewed, bimodal)
  • To describe the data from a stem-and-leaf plot, look for: a typical or representative value (e.g. the median); the extent of spread about the typical value; the presence of any gaps in the data; the number and location of peaks; outliers

Cons:
  • Not always easy to construct an appropriate stem

18
Q

Relative Frequency

A

The proportion of times the value occurs in the sample. For continuous data, we have to create class intervals and find the relative frequency of each class interval.
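
For discrete data, the relative frequency of a value is its count divided by the sample size. A minimal sketch, assuming the sample is a plain Python list; the helper name `relative_frequencies` and the data are hypothetical:

```python
from collections import Counter

def relative_frequencies(sample):
    """Map each distinct value to the proportion of times it occurs."""
    counts = Counter(sample)
    n = len(sample)
    return {value: count / n for value, count in counts.items()}

# Hypothetical sample: year in college for 8 students
years = [1, 2, 2, 3, 4, 4, 4, 1]
rel = relative_frequencies(years)
print(rel)
```

The relative frequencies always sum to 1 (up to floating-point rounding), which is a useful sanity check.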

19
Q

Construction of Histograms for Discrete Data

A

Area of each rectangle is proportional to the relative frequency of the value

20
Q

Positive Skew of Histogram

A

A unimodal histogram with the right or upper tail stretched out more compared to the left or lower tail

21
Q

Symmetric Histogram

A

Left half of the histogram is mirror image of right half

22
Q

Unimodal Histogram

A

Histogram rises to a single peak and then declines

23
Q

Bimodal Histogram

A

Histogram has two different peaks

24
Q

Multimodal Histogram

A

A histogram with more than two peaks

25
Q

Number of Peaks in a Histogram

A

The number of peaks can depend on the choice of class intervals: the larger the number of class intervals, the more likely it is that bimodality or multimodality will manifest itself

26
Q

Negative Skew of Histogram

A

A unimodal histogram with the left or lower tail stretched out more compared to the right or upper tail

27
Q

Density Estimate

A

A smoothed histogram

28
Q

Sample Size

A
  • The number of observations in a sample
  • Notation = n
  • If two samples are simultaneously under consideration, either m and n; or n1 and n2
29
Q

The Mean

A
  • The most familiar and useful measure of the center is the mean, or arithmetic average, of the set
  • Physical interpretation: if each observation is treated as a weight placed on the measurement axis, the only point at which a fulcrum can be placed to balance the system of weights is the point corresponding to the value of x-bar
  • Notation: Sample Mean = x-bar
    • x-bar = (sum of the n sample values)/n
  • Notation: Population Mean = µ
    • µ = (sum of the N population values)/N
30
Q

Deficiencies of The Mean

A
  • A single outlier can greatly affect the mean, making it an inappropriate measure of the center
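
A small numeric illustration of this sensitivity, using made-up data and Python's standard `statistics` module. One extreme value pulls the mean far from the bulk of the observations, while the median barely moves:

```python
import statistics

# Hypothetical sample with one outlier (100)
with_outlier = [10, 12, 13, 15, 100]
without_outlier = [10, 12, 13, 15]

mean_with = statistics.mean(with_outlier)        # pulled toward the outlier
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(with_outlier)    # largely unaffected

print(mean_with, mean_without, median_with)
```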
31
Q

Sample Median

A
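  • Obtained by first ordering the n observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list)
  • The median is then the single middle value if n is odd, or the average of the two middle values if n is even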
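
The ordering step can be sketched as follows. The sketch assumes the usual convention of taking the single middle value when n is odd and the average of the two middle values when n is even; `sample_median` is a hypothetical helper name.

```python
def sample_median(xs):
    """Median of a non-empty sample, keeping repeated values in the ordering."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

print(sample_median([3, 1, 2]))
print(sample_median([4, 1, 3, 2]))
```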
32
Q

Median Distribution

A
33
Q

Quartiles

A
  • Quartiles divide the data set into 4 equal parts: the observations above the 3rd quartile constitute the upper quarter of the data set, the 2nd quartile is identical to the median, and the 1st quartile separates the lower quarter from the upper three quarters
34
Q

Percentiles

A
  • Similarly, a data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%
35
Q

Trimmed Mean

A
  • a compromise between the sample mean and the sample median
  • a 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains
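
A minimal sketch of the trimming procedure. It assumes that when the trimming proportion times n is not a whole number, the count to drop from each end is rounded down; other conventions exist. The helper name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(xs, proportion):
    """Drop the smallest and largest `proportion` of the sample, then average the rest."""
    ordered = sorted(xs)
    k = int(len(ordered) * proportion)   # assumption: round down the number trimmed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample of 10 values with one small and one large extreme
data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 200]
print(trimmed_mean(data, 0.10))
```

With 10 observations, a 10% trim removes exactly one value from each end, so the extremes 1 and 200 no longer influence the result.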
36
Q

Range

A
  • A measure of variability
  • Computed by finding the difference between the largest and smallest sample values
  • Defect: the range depends only on the two most extreme observations and disregards the positions of the remaining n - 2 values
37
Q

Deviations from The Mean

A
  • A deviation will be positive if the observation is larger than the mean (to the right of the mean on the measurement axis) and negative if the observation is smaller than the mean
  • If all the deviations are small in magnitude, then all the x_i are close to the mean and there is little variability.
  • Alternatively, if some of the deviations are large in magnitude, then some x_i lie far from the mean, suggesting a greater amount of variability.
38
Q

Sum of Deviations

A
  • The sum of the deviations from the sample mean is always zero, so the average deviation is always zero
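
A quick numeric check of this fact on made-up data. The positive and negative deviations from the mean cancel exactly (in general, floating-point arithmetic may leave a tiny nonzero remainder):

```python
def deviations_from_mean(xs):
    """Return x_i - x_bar for each observation."""
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

devs = deviations_from_mean([2, 4, 9])   # mean is 5
total = sum(devs)
print(devs, total)
```

This cancellation is why the deviations are squared before averaging when measuring variability.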
39
Q

Sample Variance

A
  • Denoted by s²
  • s² = (sum of the squared deviations (x_i - x-bar)²)/(n - 1)
40
Q

Sample Standard Deviation

A
  • Denoted by s; the positive square root of the sample variance s²
  • Note that s² and s are both nonnegative. The unit for s is the same as the unit for each of the x_i.
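
A short sketch of both quantities, assuming the standard definition of the sample variance as the sum of squared deviations about x-bar divided by n - 1. The function names are hypothetical.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Positive square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(xs))

data = [2, 4, 9]   # deviations -3, -1, 4; squared: 9, 1, 16
print(sample_variance(data), sample_std(data))
```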
41
Q

Population Variance

A
  • Denoted by σ²
  • For the population, the divisor is N and not N - 1
  • Note that σ² involves squared deviations about the population mean µ.
  • If we actually knew the value of µ, then we could define the sample variance as the average squared deviation of the sample x_i about µ.
  • However, the value of µ is almost never known, so the sum of squared deviations about x-bar must be used.
  • But the x_i tend to be closer to their own average x-bar than to the population average µ, so to compensate for this the divisor n - 1 is used rather than n
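
The effect of the two divisors can be seen with Python's standard `statistics` module, where `pvariance` divides by the number of values and `variance` divides by one less. The data below are made up for illustration.

```python
import statistics

data = [2, 4, 9]   # sum of squared deviations about the mean is 26

pop_var = statistics.pvariance(data)   # divisor N:     26 / 3
samp_var = statistics.variance(data)   # divisor n - 1: 26 / 2

print(pop_var, samp_var)
```

For the same set of numbers, the n - 1 divisor always gives the larger result (unless all values are equal), which compensates for measuring deviations about x-bar instead of µ.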
42
Q

Population Standard Deviation

A
  • Denoted by σ
43
Q

Five Number Summary

A
  • The five number summary consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest
  1. Minimum
  2. Q1
  3. M
  4. Q3
  5. Maximum
44
Q

To Calculate The Quartiles

A
  1. Arrange the observations in increasing order and locate the 50th percentile, or the median M, in the ordered list of observations
  2. The 25th percentile, or the first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the overall median
  3. The 75th percentile, or the third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the overall median
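
The three steps above can be sketched as follows. The sketch assumes that when n is odd the overall median itself is excluded from both halves, which is one reading of "to the left/right of the overall median"; software packages often use other quartile conventions. The function names are hypothetical.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Return (minimum, Q1, M, Q3, maximum) for a sample of at least 2 values."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # left of the overall median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # right of the overall median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))
```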
45
Q

Box-Plot

A
  • A box-plot is a graph of the five number summary
  • Box-plots are most useful for side-by-side comparison of several distributions
  • Describes location and spread of a sample
  • Procedure:
    1. Draw a rectangle with lower and upper edges at the 25th percentile, or 1st quartile, and the 75th percentile, or 3rd quartile
    2. Draw a horizontal line across the rectangle at the median
    3. Extend vertical lines, or whiskers, from the middle of the upper and lower edges of the rectangle to the minimum and maximum values
46
Q
A