Lesson 2.2: Numerically Summarizing Data Flashcards

Central Tendencies and Variability

1
Q

population mean

A
  • denoted by μ
  • population size = 𝑁
2
Q

sample mean

A
  • denoted by x̄ (x with a line over it)
  • sample size = 𝑛
  • R: mean(DataTable$Column)
3
Q

median

A
  • middle number of data in ascending order
  • if odd number of observations: middle number
  • if even number of observations: mean of two middle numbers
  • not greatly affected by outliers
  • R: median(DataTable$Column)
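
A small sketch of the outlier point above, using a made-up vector (the values are illustrative only, not from the lesson):

```r
# Hypothetical scores; the last value is an outlier
scores <- c(70, 72, 75, 78, 80, 300)

mean(scores)    # pulled upward by the outlier: 112.5
median(scores)  # even count, so mean of the two middle values (75 and 78): 76.5
```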
4
Q

mode

A
  • most frequent observation
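  • R (no built-in for the statistical mode; define a helper):
    Mode <- function(x) {
      ux <- unique(x)                          # distinct values
      ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
    }
  • histogram: hist(data, breaks = seq(low, high, interval))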
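
A usage sketch for the Mode helper from this card, written out in full so the snippet runs on its own. The data vector is made up for illustration.

```r
# Mode helper: most frequent value in a vector
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

obs <- c(3, 5, 5, 7, 5, 3)   # illustrative data
Mode(obs)                    # 5 occurs three times
table(obs)                   # frequency of each value, handy for spotting ties
```

If several values tie for the highest count, which.max() picks the first one encountered, so this helper reports only one mode.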
5
Q

Range
(R)

A
  • difference between largest and smallest value
  • R: range(DataTable$Column) shows the lowest and highest values; diff(range(DataTable$Column)) gives the difference
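
A short sketch with made-up values, showing that range() returns two numbers rather than the difference itself:

```r
values <- c(12, 7, 3, 18, 9)   # illustrative data

range(values)                  # smallest and largest value: 3 18
diff(range(values))            # the range as a single number: 15
max(values) - min(values)      # same result
```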
6
Q

population variance

A
  • denoted by σ²
  • sum of the squared deviations about
    the population mean, Σ(x − μ)²
  • divided by the number of observations in the population (N)

n <- length(data$Score)                     # number of observations
x <- data$Score
v.population <- sum((x - mean(x))^2) / n    # divide by n, not n - 1

7
Q

sample variance

A
  • denoted by s²
  • sum of the squared deviations about
    the sample mean, Σ(x − x̄)²
  • divided by the number of observations in the sample minus 1 (n − 1)

R: var(DataTable$Column)
Note: R's var() always returns the sample variance (n − 1 denominator)
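
A minimal sketch comparing the two denominators on a made-up vector. The variable names are illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data, mean = 5
n <- length(x)

v.sample <- var(x)                          # divides by n - 1: 32/7
v.population <- sum((x - mean(x))^2) / n    # divides by n: 4
v.population.alt <- var(x) * (n - 1) / n    # same value, rescaled from var()
```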

8
Q

standard deviation

population and sample

A

square root of the variance
- denoted by σ (population) or s (sample)

9
Q

Coefficient of Variation
(CV)

A
  • measures the scatter in the data relative to the mean
  • CV = standard deviation (s) / sample mean (x̄)
  • usually expressed as a percentage: CV × 100%
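
The CV has no dedicated function in base R, so a sketch computes it directly. The data is made up.

```r
x <- c(10, 12, 14, 16, 18)       # illustrative data, mean = 14

cv <- sd(x) / mean(x) * 100      # sample sd relative to the mean, in percent
round(cv, 2)                     # about 22.59
```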
10
Q

Computing standard deviation

Excel and R

A
  • Excel: =STDEV(data range)
  • R: sd(vector)
  • both return the sample standard deviation (n − 1 denominator)
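
A sketch of how sd() relates to the variance, plus one way to get the population value by hand. The data is illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                   # illustrative data

sd.sample <- sd(x)                               # same as sqrt(var(x))
sd.population <- sqrt(mean((x - mean(x))^2))     # divides by n instead: 2
```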
11
Q

Empirical Rule

A
  • Approximately 68% of data lies between μ − 1σ and μ + 1σ
  • Approximately 95% of data lies between μ − 2σ and μ + 2σ
  • Approximately 99.7% of data lies between μ − 3σ and μ + 3σ

left to right percentages under curve:
0.15 + (2.35 + (13.5 + (34 + 34) + 13.5) + 2.35) + 0.15
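
The three percentages can be checked against the standard normal curve with pnorm(), as a quick sketch:

```r
within1 <- pnorm(1) - pnorm(-1)   # about 0.6827
within2 <- pnorm(2) - pnorm(-2)   # about 0.9545
within3 <- pnorm(3) - pnorm(-3)   # about 0.9973
```

The rule's 68 / 95 / 99.7 figures are rounded versions of these areas.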

12
Q

Standardized values

z value

A
  • We compute standardized values to:
    - compute probabilities
    - compare two different distributions
  • z = 2: data value is 2 standard deviations above the mean
  • z = -1.6: data value is 1.6 standard deviations below the mean

z = (data value (y) − mean (μ)) / standard deviation (σ)
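
A sketch of comparing two different distributions. The two tests, their means, and their standard deviations are hypothetical numbers chosen for illustration.

```r
# Hypothetical: score of 650 on a test with mean 500, sd 100
z.test1 <- (650 - 500) / 100   # 1.5 standard deviations above the mean

# Hypothetical: score of 27 on a test with mean 21, sd 5
z.test2 <- (27 - 21) / 5       # 1.2 standard deviations above the mean

z.test1 > z.test2              # TRUE: the first score is relatively higher
```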

13
Q

p-value

A
  • Represents the area under the normal distribution curve to the left of a given z-value, i.e. P(Z < z)
  • z = 1.0: 0.15% + 2.35% + 13.5% + 34% + 34% = 84% (p-value 0.8413)
  • area under normal distribution curve = 1
14
Q

p-value calculation for specific range

example

A

probability of a student scoring between 450 and 600 on the SAT
mean = 500, sd = 100
z = (600 − 500)/100 = 1.0 → p-value 0.8413
z = (450 − 500)/100 = −0.50 → p-value 0.3085
0.8413 − 0.3085 = 0.5328, or 53.28%
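
The same SAT calculation sketched in R, using the mean and sd given on this card. The variable names are illustrative.

```r
mu <- 500
sigma <- 100

z.high <- (600 - mu) / sigma            # 1.0
z.low <- (450 - mu) / sigma             # -0.5
p <- pnorm(z.high) - pnorm(z.low)       # about 0.5328

# Equivalent without converting to z-values first
p.direct <- pnorm(600, mean = mu, sd = sigma) - pnorm(450, mean = mu, sd = sigma)
```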

15
Q

Standard normal curve
(μ = 0, σ = 1)

A
  • Symmetric about its mean (μ = 0, σ = 1)
  • Mean = Median = Mode
  • Single peak at z = 0
  • Inflection points at z = −1 and z = +1
  • Area under the curve = 1
  • Area to the left of the mean (μ = 0) = area to the right = ½
  • Follows the Empirical Rule
16
Q

Normalizing Data

Computing z-values

Excel and R

A
  • Excel: =STANDARDIZE(data value, mean, sd)
  • R: scale(data, center = mean, scale = sd); scale(data) alone uses the sample mean and sample sd
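
A sketch showing that scale() with its defaults matches the z formula computed by hand. The data is illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)         # illustrative data

z.manual <- (x - mean(x)) / sd(x)      # z = (value - mean) / sd
z.scaled <- as.vector(scale(x))        # scale() returns a one-column matrix

all.equal(z.manual, z.scaled)          # TRUE
```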
17
Q

Computing p-values
P(x < z)

Excel and R

A

'Left' Area (Probability) under Standard Normal Curve
- Excel: =NORMSDIST(z-value) (standard normal distribution)
- R: pnorm(z-value)

'Right' Area (Probability) under Standard Normal Curve
- Excel: =1-NORMSDIST(z-value)
- R: 1 - pnorm(z-value)

'In Between' Area (Probability) under Standard Normal Curve
- Excel: =NORMSDIST(high z) - NORMSDIST(low z)
- R: pnorm(high z) - pnorm(low z)
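
The three areas sketched in R for two arbitrary z-values, chosen only for illustration:

```r
z.low <- -0.5
z.high <- 1.25

left <- pnorm(z.high)                        # area to the left of z.high
right <- 1 - pnorm(z.high)                   # area to the right of z.high
right.alt <- pnorm(z.high, lower.tail = FALSE)   # same area, using pnorm's own argument
between <- pnorm(z.high) - pnorm(z.low)      # area between z.low and z.high
```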

18
Q

Converting p-value ('left' area) to z-value

Excel and R

A
  • Excel: =NORMSINV(p-value)
  • R: qnorm(p-value)
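
qnorm() is the inverse of pnorm(), which a short sketch makes visible:

```r
z1 <- qnorm(0.8413)           # about 1.0
z2 <- qnorm(0.975)            # about 1.96
z3 <- qnorm(pnorm(1.5))       # round trip returns 1.5
```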
19
Q

Skewness

types (3)

A
  • Mean < Median < Mode: Negative / left skewed distribution
  • Mean = Median = Mode: Symmetrical distribution with zero skewness
  • Mean > Median > Mode: Positive / right skewed distribution
20
Q

Percentile

A
  • The kth percentile of a set of data is a value such that k percent of the observations are less than or equal to the value
  • e.g. P2: 2% of observations are ≤ the value
21
Q

Quartile

A

The quartiles divide the data into 4 equal parts

First quartile: Q1
- Bottom 25% = 25th percentile

Second quartile: Q2
- Bottom 50% = 50th percentile (median)

Third quartile: Q3
- Bottom 75% = 75th percentile
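
A sketch of getting quartiles (or any percentile) in R with quantile(). The data is made up. Note that quantile() offers several interpolation methods through its type argument, so results on small samples can differ slightly from hand calculations.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)               # illustrative data

q <- quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, Q2, Q3 with the default method
q                                               # 3, 5, 7 for this vector
quantile(x, probs = 0.02)                       # P2, the 2nd percentile
```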

22
Q

Boxplot

features and R command

A
  • line = median
  • box = Q1 and Q3
  • whiskers (brackets) = min and max, excluding outliers
  • dot = outlier (by default in R, a point more than 1.5 × IQR beyond the box)
  • R: boxplot(data vector)
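
A sketch that prints the numbers behind a boxplot before drawing it, using boxplot.stats() from base R. The data is made up and includes one obvious outlier.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 50)   # illustrative data with one outlier

stats <- boxplot.stats(x)
stats$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out      # points drawn as dots: 50

boxplot(x, horizontal = TRUE)        # draws the plot itself
```

Here the upper whisker stops at 8, the largest value that is not an outlier, rather than at the maximum of 50.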
23
Q

Skewness and Boxplots

A

Symmetric Distribution (e.g. normal)
- (Q3-Q2) = (Q2-Q1)

Positive Skew
- (Q3-Q2) > (Q2-Q1)

Negative Skew
- (Q3-Q2) < (Q2-Q1)
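
A sketch of the quartile comparison on a made-up, right-skewed vector:

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 8, 15)          # illustrative right-skewed data

q <- unname(quantile(x, c(0.25, 0.50, 0.75)))  # Q1, Q2, Q3
upper <- q[3] - q[2]                           # Q3 - Q2
lower <- q[2] - q[1]                           # Q2 - Q1

upper > lower                                  # TRUE suggests positive skew
mean(x) > median(x)                            # TRUE, consistent with positive skew
```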

24
Q

Data Standardization and Scaling

A

Standardization Data Variation (z-value)
- Range: typically −3 to +3 (about 99.7% of normally distributed data, per the Empirical Rule)

Scaling Data Variation (min-max)
- (value − min value) / (max value − min value)
- Range: 0 to 1
- not effective with outliers, because an extreme min or max compresses the scaled values of the other data elements
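
A sketch of the scaling formula and of the outlier problem, on made-up vectors:

```r
x <- c(10, 20, 30, 40, 50)                       # illustrative data
scaled <- (x - min(x)) / (max(x) - min(x))       # 0, 0.25, 0.5, 0.75, 1

x.out <- c(10, 20, 30, 40, 1000)                 # same data with one outlier
scaled.out <- (x.out - min(x.out)) / (max(x.out) - min(x.out))
round(scaled.out, 3)                             # first four values squeezed near 0
```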