Lec 1 & TB Flashcards
2 main types of data
- Continuous
- Discrete
- Define discrete data or 2 types of discrete data
- 3 types of categorical data
Discrete data:
- counts, # of times smth happens
- Categorical: put in categories
- Binary (0/1)
- Ordered/ordinal: the order is meaningful; but the distance b/w each is not
- E.g. S, M, L; agree, neutral, disagree
- IOW: no meaningful distance b/w S to M
- Unordered/nominal: names; no logical order
- race, sex
Define continuous data
- 2 types of cont data
- Cont data: quantities that can, in principle, be measured to infinite precision (eg height, weight, BP); differences b/w values are meaningful
- Sometimes data that are technically discrete are treated like cont data (eg SF-36 QoL instrument)
- Eg: likert scale is ordinal
- Goal: add them up -> total #
- Since the items you add up are different variables, the total ends up behaving LIKE a cont v
- 2 types of cont data
- Ratio: has TRUE meaningful zero (eg height, weight)
- Interval: zero is arbitrary (eg scale data, temp)
- Eg 0 °C: freezing pt of water
- Eg 0 °F: freezing pt of salt water (brine)
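A quick numeric check of the ratio vs interval distinction above, as a minimal sketch. The temperatures are made-up values and the helper name `celsius_to_kelvin` is hypothetical; Kelvin is used here only because it has a true zero.

```python
def celsius_to_kelvin(temp_c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return temp_c + 273.15

day_a_c, day_b_c = 10.0, 20.0          # illustrative values only

# Differences are meaningful on an interval scale: same size in C and K
diff_c = day_b_c - day_a_c
diff_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are NOT meaningful in Celsius because 0 C is an arbitrary zero
ratio_c = day_b_c / day_a_c                                        # 2.0, misleading
ratio_k = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)  # about 1.035

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

So 20 °C is not "twice as hot" as 10 °C, even though the 10-degree difference is real.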
define “X”
- “X”: variable of interest
- “xi”: the subscript number is a specific item in the data set
- N: number in pop
- “n”: lower case is # in sample
- fi = frequency of xi
- f = total # of observations in an interval
- ∑ = sum
- X
- Greek letters represent pop characteristics (parameters)
- µ = pop mean
- σ² = pop variance
- σ = pop SD
- Roman letters rep sample characteristics (stats)
- x̄ = sample mean
- s² = sample variance
- s = sample sd
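The notation above can be tied to a tiny data set, as a rough sketch. The values in `x` are made up, and the variable names (`freq`, `weighted_total`) are illustrative choices, not part of the lecture.

```python
from collections import Counter

x = [2, 3, 3, 5, 5, 5]        # made-up sample; x[0] plays the role of x1
n = len(x)                    # n: number of observations in the sample

freq = Counter(x)             # f_i: how many times each value x_i occurs
total = sum(x)                # sum of all x_i
weighted_total = sum(value * f for value, f in freq.items())  # sum of f_i * x_i

x_bar = total / n             # sample mean
print(dict(freq), n, total, x_bar)
```

Summing every observation and summing value × frequency give the same total, which is why grouped frequency tables can be used to get the mean.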
Type of data described by descriptive statistics
What does the distribution of data tell us?
- Descriptive stats describe characteristics relating to distribution of data
- The most appropriate descriptive stats depend on data distribution
- Distribution of data = pattern of observations
mean
- Where is the mean if the graph is right skewed?
- Where is the mean if the graph is left skewed?
- Geometric mean
- Arithmetic mean
median
- define
- odd vs even # of observations
mode
mean = avg
- +ve or right skewed: mean is shifted to the right (pulled toward the long right tail)
- -ve or left skewed: mean is shifted to the left (pulled toward the long left tail)
- Geometric mean: multiply the n values, then take the nth root
- Arithmetic mean: add then divide
Median
- (Q2): middle data value
- Odd # of observations: median = middle #
- Even # of observations: median = avg of 2 middle values
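A minimal sketch of the mean and median rules above, using Python's standard `statistics` module. All data values are made up for illustration.

```python
import statistics

symmetric = [2, 4, 6, 8, 10]          # odd n: median is the middle value
right_skewed = [2, 4, 6, 8, 100]      # one large value creates a right tail
even_n = [2, 4, 6, 8]                 # even n: median averages the 2 middle values

mean_sym = statistics.mean(symmetric)            # 6
median_sym = statistics.median(symmetric)        # 6
mean_skew = statistics.mean(right_skewed)        # 24, pulled toward the tail
median_skew = statistics.median(right_skewed)    # 6, unaffected by the tail
median_even = statistics.median(even_n)          # (4 + 6) / 2

# Geometric mean: multiply the n values, then take the nth root
geo = statistics.geometric_mean([2, 8])          # sqrt(2 * 8), about 4.0

print(mean_sym, median_sym, mean_skew, median_skew, median_even, geo)
```

In the right-skewed list the mean moves far to the right while the median stays put, which is why the median is often preferred for skewed data.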
Mode
- The value that occurs most frequently in the data set
- 5 number summary
- How to get Q1 and Q3
- 5 # summary: min, Q1, Q2, Q3, max
- Quartiles
- Sort the data
- Q1 = (n+1)/4 th ordered observation
- Q3: 3(n+1)/4 th ordered observation
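The quartile positions above can be sketched in code. The function names are hypothetical, and linear interpolation for fractional positions is an assumption (the notes do not say how to handle a position such as 2.25); the sketch also assumes at least 3 observations.

```python
def ordered_observation(sorted_data, position):
    """Value at a 1-based position; fractional positions are interpolated."""
    lower = int(position)
    frac = position - lower
    if frac == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + frac * (above - below)

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the (n+1)/4 position rule."""
    s = sorted(data)                                # step 1: sort the data
    n = len(s)                                      # sketch assumes n >= 3
    q1 = ordered_observation(s, (n + 1) / 4)
    q2 = ordered_observation(s, (n + 1) / 2)        # median
    q3 = ordered_observation(s, 3 * (n + 1) / 4)
    return s[0], q1, q2, q3, s[-1]

summary = five_number_summary([13, 1, 9, 5, 11, 3, 7])
print(summary)   # (1, 3, 7, 11, 13)
```

With n = 7 the positions are 2, 4 and 6, so Q1, Q2 and Q3 are simply the 2nd, 4th and 6th sorted values.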
Formulas
- sample variance
- coefficient of variation
Sample variance: s² = ∑(xi − x̄)² / (n − 1)
Sample standard deviation: s = √(s²)
Interquartile range: IQR = Q3 − Q1
Range: Max − Min
Coefficient of variation: CV = (s / x̄) × 100% (only valid for ratio data)
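The spread formulas above, worked through on a small made-up data set as a sketch. The sample variance uses the standard n − 1 divisor; the quartiles come from `statistics.quantiles`, whose default "exclusive" method is based on (n + 1) positions.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]       # made-up ratio-scale data
n = len(x)
x_bar = sum(x) / n                  # sample mean = 5.0

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)   # 32 / 7
s = math.sqrt(s2)                   # SD, back in the original units

q1, _, q3 = statistics.quantiles(x, n=4)   # quartile cut points
iqr = q3 - q1                       # 6.5 - 4.0 = 2.5
data_range = max(x) - min(x)        # 9 - 2 = 7
cv = (s / x_bar) * 100              # coefficient of variation, in percent

print(round(s2, 3), round(s, 3), iqr, data_range, round(cv, 1))
```

The hand-computed `s2` matches `statistics.variance(x)`, which also divides by n − 1.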
degree of freedom
Why do we √ the variance
When is the empirical rule used?
Empirical rule
x
How do we determine outliers if data is asymmetrical, not normally distributed
Degrees of freedom
- df of an estimate is the # of independent pieces of info used to obtain the estimate
- x
- √ the variance gives us the sd, and restores the original unit
- X
- If the freq distribution is symmetrical and bell-shaped = normal distribution; empirical rule is used
- Empirical rule: for a normal distribution, nearly all the data lie w/in 3 sd of the mean
- 68% of data lie w/in interval µ +/- σ
- 95% of data in µ +/- 2σ
- 99.7% in µ +/- 3σ
- IOW: only ~0.3% of data lie beyond 3 sd; these are treated as outliers
- x
- When we do not have a normal distribution/ data distribution is asymmetric, outliers are identified as
- < Q1 – (1.5 x IQR)
- > Q3 + (1.5 x IQR)
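Both outlier approaches above can be checked numerically. This is a sketch with made-up data: the first part confirms the empirical-rule percentages from the standard normal distribution, the second applies the 1.5 × IQR fences to a right-skewed list. Quartiles use the default "exclusive" method of `statistics.quantiles`, which is one of several quartile conventions.

```python
from statistics import NormalDist, quantiles

# Empirical rule: share of a normal distribution within k SDs of the mean
z = NormalDist(mu=0, sigma=1)
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
# roughly 0.68, 0.95 and 0.997

# IQR fences for asymmetric data (made-up, right-skewed values)
data = [1, 2, 2, 3, 3, 4, 4, 5, 30]
q1, _, q3 = quantiles(data, n=4)        # Q1 = 2.0, Q3 = 4.5
iqr = q3 - q1                           # 2.5
lower_fence = q1 - 1.5 * iqr            # -1.75
upper_fence = q3 + 1.5 * iqr            # 8.25
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(within, outliers)                 # outliers is [30]
```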
Graphs that show distribution
Graphs that show association
Models for inferences
- distributional: histogram, density plot, box-whisker plot, quantile-quantile (Q-Q) plot
- Association: scatter plot
- Inferences
- t-tests (parametric), Wilcoxon (non-parametric)
- Linear regression, analysis of variance (ANOVA)
Graphs
- con
- most common graph
Descriptive stats
- what is displayed
- how do we display relationships
- Graphs
- Challenging to display at times
- Usually use dotplots
- Descriptive stats
- # or prop (%) of each category
- Crosstabulations b/w categorical v (have multiple v) to display relationships
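A small sketch of the descriptive stats for categorical data listed above: counts, percentages, and a crosstabulation of two categorical variables. The records are invented for illustration.

```python
from collections import Counter

# Made-up records: (smoking status, sex) for 8 hypothetical subjects
records = [
    ("smoker", "F"), ("non-smoker", "F"), ("non-smoker", "M"), ("smoker", "M"),
    ("non-smoker", "F"), ("non-smoker", "M"), ("smoker", "M"), ("non-smoker", "F"),
]

# Count and proportion (%) in each category of one variable
status_counts = Counter(status for status, _ in records)
status_pct = {k: 100 * v / len(records) for k, v in status_counts.items()}

# Crosstabulation: counts for each combination of the two variables
crosstab = Counter(records)
for status in sorted(status_counts):
    row = {sex: crosstab[(status, sex)] for sex in ("F", "M")}
    print(status, row)
```

Each printed row is one line of the crosstab, so a relationship shows up as rows whose F/M split differs.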
Inference and stat models
- binomial test
- Fisher's exact or χ² test
Inference and stat methods
- Binary data: Binomial (prop) test (single sample)
- Fisher's exact or χ² test (chi-square test) (comparing samples)
- More than 2 categories, use chi-squared
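To make the chi-square test above concrete, here is a hand-rolled sketch of the statistic for a 2x2 table. The counts are made up, no continuity correction is applied, and the p-value shortcut via `math.erfc` holds only for 1 degree of freedom; in practice a stats package would be used.

```python
import math

# Made-up 2x2 table: rows = groups, columns = outcome (yes, no)
observed = [[10, 20],
            [20, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# With df = 1, the upper-tail p-value equals erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 3), round(p_value, 4))
```

Every expected count here is 15, so the statistic is 4 × (5² / 15).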
- Population
- Population parameters
- Sample
- Sample stats
Population vs samples
- Pop
- Collection of all possible subjects
- Parameters:
- µ = mean
- σ² = variance
- π = proportion w/ characteristic
- Parameters: unknown constant to estimate
- Sample
- Subset of pop (estimates of pop using sample)
- x̄ = sample mean
- s^2 = sample variance
- p = proportion in the sample w/ characteristic
- Sample stats: variable b/c they depend on the particular sample drawn
- Used to estimate pop parameters
Mnemonic for the Greek letters: “m” = “mu” = “mean”; “s” = “sigma” = “sd”; “p” = “pi” = “prop”
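A short simulation of the population vs sample idea above, as a sketch. The population is invented (the integers 1 to 100) so that µ is known here; in real studies µ is unknown and only the sample statistics are available.

```python
import random
import statistics

population = list(range(1, 101))            # made-up population, N = 100
mu = statistics.mean(population)            # parameter: fixed constant, 50.5

rng = random.Random(0)                      # seeded so reruns match
sample_means = []
for _ in range(3):
    sample = rng.sample(population, 10)     # sample of size n = 10
    sample_means.append(statistics.mean(sample))   # statistic: varies by sample

print(mu, sample_means)
```

µ never changes, while each x̄ depends on which 10 subjects happened to be drawn.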
Describe box plot
Box whisker