Final Of Everything Flashcards
Data Matrix
A convenient way to store data (e.g., a spreadsheet or table). Each row is a unique case (observational unit). Each column corresponds to a variable.
The two types of variables
Numerical or Categorical
Numerical Variables
Can be discrete or continuous
Categorical Variables
Can be ordinal (ordered) or nominal
What type of variable is “Number of Siblings”?
Numerical (discrete)
What type of variable is “Student Height”?
Numerical (continuous)
What type of variable is “Previous Stats Courses Taken”?
Categorical (nominal)
Explanatory variables might affect
Response variable
Two types of data collection
Observational Studies and Experiments
Researchers collect data passively; they merely observe
Observational studies
Researchers actively control the data collection, trying to establish causation
Experiments
Sampling principles and strategies
1st step: Identify the topics and questions to be investigated
2nd: Clearly laid out research questions help identify which subjects/cases should be studied and which variables are important
3rd: Consider how the data are collected
Example: Suppose we want to estimate household size, where a household is defined as people living together in the same dwelling and sharing living accommodation. If we selected students at random at an elementary school and asked them what their family size is, will this be a good measure of household size?
- Average will be biased
- Only measuring households with children, not single people or people without children.
- Would likely estimate a higher number than the true number.
Relationship between Sample and Population
Sample is a subset of population:
Population- people
Sample- a group of selected people
Three sampling methods
1) simple random sample
2) stratified sample
3) cluster sample
Simple random sample
Randomly selected from population
What type of sample is cars passing through intersections in Kelowna
Simple random sample
Stratified sample
Cases are grouped into strata, then a simple random sample is taken within each stratum
Cluster sample
Divide the population into clusters, randomly select some clusters, and sample all cases in the selected clusters
Multistage sampling
Clusters are sampled randomly, then cases are randomly sampled within each selected cluster
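The sampling methods above can be sketched in a few lines of Python. This is only an illustration: the village data, the group sizes and all function names are made up, and villages are used as both strata and clusters purely to show the mechanics.

```python
import random

def simple_random_sample(population, k, rng):
    """Pick k cases so that every case is equally likely to be chosen."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Take a simple random sample of size k_per_stratum inside every stratum."""
    return {name: rng.sample(cases, k_per_stratum) for name, cases in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick n_clusters clusters and keep every case inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    """Randomly pick clusters, then take a simple random sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], k_per_cluster) for name in chosen}

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 30 villages with 40 residents each.
villages = {f"village_{v}": [f"v{v}_p{i}" for i in range(40)] for v in range(30)}
everyone = [person for residents in villages.values() for person in residents]

srs = simple_random_sample(everyone, 150, rng)     # 150 people from the whole region
strat = stratified_sample(villages, 5, rng)        # 5 people from each of the 30 villages
clust = cluster_sample(villages, 3, rng)           # everyone in 3 randomly chosen villages
multi = multistage_sample(villages, 15, 10, rng)   # 10 people in each of 15 chosen villages
```

Note that the stratified and multistage versions both reach 150 people here, but the multistage version only has to visit half of the villages.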
Scatterplot
A way to provide case by case view of data. Can visualize relationship between two numerical variables.
Dot plot
Visualize one numerical variable
Sample mean (sample average formula)
x̄ = (x1 + x2 + x3 +… +xn)/n
What is the unit of sample mean
The same units as the data in the sample
Symbol for population mean
μ
Histograms
Provides a view of the data density (ie the data distribution)
Unimodal histogram distribution
A single prominent peak
Bimodal/ multimodal histogram distribution
Several prominent peaks
Uniform histogram distribution
No apparent peaks
Types of skewness
Right skewed (tail on right), left skewed (tail on left) or symmetric
Deviation
Distance from the mean
Sample variance
s^2 = ((x1-x̄)^2 + (x2-x̄)^2 + … + (xn-x̄)^2)/(n-1)
What are the units of sample variance?
The square of the units of the data
Sample standard deviation formula
s = sqrt(s^2)
Population variance formula
σ^2 = ((x1-μ)^2 + … + (xN-μ)^2)/N, where μ is the population mean and N is the population size
Population standard deviation
σ = sqrt (σ^2)
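A short sketch of the difference between the sample and population formulas, using made-up numbers. The same five values are treated first as a sample (divide by n-1) and then as a whole population (divide by n).

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 10.0]  # made-up values for illustration

n = len(data)
x_bar = sum(data) / n                            # mean = 5.0
squared_devs = [(x - x_bar) ** 2 for x in data]  # squared deviations from the mean

# Treating the values as a sample: divide by n - 1.
sample_var = sum(squared_devs) / (n - 1)
sample_sd = math.sqrt(sample_var)

# Treating the same values as a full population: divide by n.
pop_var = sum(squared_devs) / n
pop_sd = math.sqrt(pop_var)

# The standard library has both versions built in.
builtin_sample_var = statistics.variance(data)   # n - 1 in the denominator
builtin_pop_var = statistics.pvariance(data)     # n in the denominator
```

The sample variance is always the larger of the two, because the same sum of squared deviations is divided by a smaller number.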
Main components of a box plot
- Median Q2
- First quartile Q1 (median of the lower half)
- Third quartile Q3 (median of the upper half)
- Max and min whisker reach: Q3 + 1.5 IQR and Q1 - 1.5 IQR, where IQR = Q3 - Q1
IQR formula
Q3-Q1
Steps to draw a box plot
1) Draw a thick line for the median (Q2)
2) Draw rectangle with bounds Q1 and Q3
3) Draw a dotted line for Q1-1.5IQR and Q3+1.5IQR
4) Label outliers and draw T-shaped upper/lower whiskers (they only go as far as the highest or lowest data points inside the dotted lines)
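The numbers needed for a box plot can be computed as in the sketch below. It assumes the "median of each half" convention for quartiles (the middle value is left out of both halves when n is odd); other textbooks and software use slightly different quartile rules. The data and the function name are made up.

```python
import statistics

def box_plot_summary(values):
    """Return the quantities needed to draw a box plot."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # lower half of the data
    upper = xs[half + 1:] if n % 2 else xs[half:]   # upper half (skip the middle value if n is odd)

    q1 = statistics.median(lower)
    q2 = statistics.median(xs)
    q3 = statistics.median(upper)
    iqr = q3 - q1

    low_fence = q1 - 1.5 * iqr     # dotted line below the box
    high_fence = q3 + 1.5 * iqr    # dotted line above the box
    inside = [x for x in xs if low_fence <= x <= high_fence]

    return {
        "Q1": q1, "median": q2, "Q3": q3, "IQR": iqr,
        "lower_whisker": min(inside),   # whiskers stop at actual data points
        "upper_whisker": max(inside),
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

For this data the fences sit at -5 and 15, so the upper whisker stops at 8 and the value 30 is labelled as an outlier.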
Robust Statistics
Median and IQR are more robust than mean and standard deviation (less affected by outlier behavior)
Common practices
-Symmetric distributions-> mean and SD
-Skewed distributions -> median and IQR
What type of plot would be most useful for visualizing the data density
Histogram
Suppose a data set only has two values. What can you say about the relationship between mean and median?
Mean= median
Consider a population of [1,2,3,4,10]. What are the mean and variance (Var)?
Mean =4 Var=10
A company records the commute distances of all 42 of its employees. By mistake, the smallest commute was recorded as 1 mile instead of 10. Compare the recorded median to the actual median.
The recorded median will be the same as the actual median
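A quick check of this answer with hypothetical commute distances (the real values are not given, so 10, 11, ..., 51 miles are assumed here). The median is unchanged because the mistaken value stays the smallest either way, while the mean shifts.

```python
import statistics

# Hypothetical commute distances in miles for 42 employees: 10, 11, ..., 51.
actual = list(range(10, 52))

# The recording mistake: the smallest commute is written down as 1 instead of 10.
recorded = actual.copy()
recorded[0] = 1

actual_median = statistics.median(actual)
recorded_median = statistics.median(recorded)

actual_mean = statistics.fmean(actual)
recorded_mean = statistics.fmean(recorded)
```

This is the robustness idea from the earlier card: the median only depends on the middle of the sorted data, so an error in the smallest value does not move it.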
Suppose we are interested in estimating the malaria rate in a densely tropical portion of a southeastern country. We learn there are 30 villages, each more or less similar to the next. Our goal is to test 150 individuals. What sampling method should be used?
Cluster sampling
What is the probability of rolling a 1 with a fair die?
1/6
Probability Definition
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times
Mutually exclusive or disjoint
Have no outcomes in common
Outcome
Random result from an experiment
Event
A set of outcomes that has a probability assigned to it
Sample space
All possible outcomes
Complement
The set of outcomes in which the event does not occur; its probability is 1 - P(event)
There are 18 balls in a box. Five are white, thirteen are black. Choose two balls at random, one after another without replacement. Find the probability that both chosen balls are white.
20/306
A fair coin is flipped twice. What is the probability that at least one flip is tails?
3/4
Twenty students including Miriam and Rachel are to be placed in four classes of equal size at random. What is the probability they end up in different classes?
15/19
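The three answers above can be checked by counting equally likely outcomes. The sketch below is one way to set up the counting, not the only one; the variable names are arbitrary.

```python
from fractions import Fraction
from itertools import permutations, product

# Two balls drawn one after another, without replacement, from 5 white and 13 black.
balls = ["W"] * 5 + ["B"] * 13
draws = list(permutations(range(18), 2))          # ordered pairs of distinct balls
both_white = sum(1 for i, j in draws if balls[i] == "W" and balls[j] == "W")
p_both_white = Fraction(both_white, len(draws))   # 20/306, which reduces to 10/153

# Two flips of a fair coin: at least one tail.
flips = list(product("HT", repeat=2))             # HH, HT, TH, TT
p_at_least_one_tail = Fraction(sum(1 for f in flips if "T" in f), len(flips))

# Twenty seats split into four classes of five; Miriam and Rachel get distinct seats.
seat_class = [c for c in range(4) for _ in range(5)]   # class label of each seat
pairs = list(permutations(range(20), 2))               # (Miriam's seat, Rachel's seat)
different = sum(1 for m, r in pairs if seat_class[m] != seat_class[r])
p_different_classes = Fraction(different, len(pairs))
```

The last count matches the shortcut argument: once Miriam is seated, 19 seats remain for Rachel and 15 of them are in the other three classes.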
If two events are independent, does P(A|B) = P(B|A)?
No. Independence gives P(A|B) = P(A) and P(B|A) = P(B), which are equal only when P(A) = P(B)
Random Variable
An assignment of numbers to outcomes in some sample space
Dataset
Mean and variance
Random variable
Expected value (similar to mean) and variance.
Expected value equation
E(X) = x1 P(X=x1) + x2 P(X=x2) + … + xn P(X=xn)
Expected value symbol
E(x) or μ
Variance of Random Variables (RV)
Var(X) = (x1-μ)^2 P(X=x1) + … + (xn-μ)^2 P(X=xn)
Variance of X symbols
Var(x) or σ^2
Standard deviation notation
SD(x) or σ
E(ax)=
aE(x)
E(ax+b)
aE(x) +b
SD(ax) =
|a| SD(x)
SD(ax+b) =
|a| SD(x)
Var(ax) =
a^2 Var(x)
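A small sketch that applies the expected value and variance formulas to a fair die and then checks the scaling rules above. The die, the constants a and b, and the helper names are chosen only for illustration.

```python
import math
from fractions import Fraction

def expected_value(dist):
    """E(X): sum of each value times its probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X): probability-weighted squared distance from the mean."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, a, b):
    """Distribution of aX + b (assumes a != 0 so the values stay distinct)."""
    return {a * x + b: p for x, p in dist.items()}

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

a, b = -2, 3
shifted = transform(die, a, b)

e_x, var_x = expected_value(die), variance(die)           # 7/2 and 35/12
e_y, var_y = expected_value(shifted), variance(shifted)   # a*E(X)+b and a^2*Var(X)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)           # SD(aX+b) = |a| * SD(X)
```

Adding b moves the center but not the spread, which is why b drops out of the variance and SD rules. The negative a shows why the SD rule needs the absolute value.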
Dependent Events probability notation
P(A ∩ B) = P(B) P(A|B)
Independent events probability notation
P(A ∩ B) = P(B) * P(A)
Area under the gaussian curve
Area = 1
Normal distribution Parameter notation
N( μ, σ)
Mean- μ
Standard deviation- σ
What is a Z score
A z score converts a value from N(μ, σ) to the standard normal N(0,1)
It describes how many standard deviations a value lies above or below the mean of a group of values
Z score formula
z = (x- μ)/σ
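A minimal sketch of the z score formula, with made-up numbers (test scores assumed to follow N(70, 10) and an observed score of 85). `statistics.NormalDist` from the standard library stands in for a z table.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical example: scores follow N(mu=70, sigma=10); one student scored 85.
z = z_score(85, 70, 10)

# Area to the left of z under the standard normal curve N(0, 1),
# which is what a z table reports.
area_below = NormalDist().cdf(z)

# The same area computed directly on the original N(70, 10) scale.
area_below_original_scale = NormalDist(mu=70, sigma=10).cdf(85)
```

Both areas agree, which is the point of the conversion: any normal distribution can be looked up on the single N(0,1) table.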
Quantile
Quantiles are cut points that split a probability distribution into groups of equal probability, e.g., quartiles (4 groups), percentiles (100 groups)
Q-Q plot of symmetric distribution
Straight line following y=x
Q-Q plot of a t-shaped (heavy-tailed) distribution
Starts lower than the line y=x then meets the line at the origin then slowly goes above the line.
Q-Q plot of a right skew distribution
Concave up curve, curve points right
Q-Q plot of left skew data
Concave down curve, pointing left
Geometric distributions
- trials continue until something happens (i.e., the first successful outcome)
- a series of independent trials with two outcomes
Binomial distribution:
- # of successes in a fixed # of trials, each with two outcomes: success or failure
4 conditions of binomial distribution
1) trials are independent
2) # of trials, n, is fixed
3) each trial is success or failure
4) probability of success, p, is same for each trial.
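When the four conditions hold, the probability of exactly k successes in n trials follows the standard binomial formula, C(n, k) p^k (1-p)^(n-k). The sketch below uses made-up numbers (4 fair coin flips).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 4 flips of a fair coin, counting heads as successes.
n, p = 4, 0.5
p_two_heads = binomial_pmf(2, n, p)

# Probabilities for every possible number of successes, 0 through n.
all_probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
```

The probabilities over k = 0, ..., n add up to 1, since some number of successes must occur.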
Confidence Intervals
A confidence interval is a range of plausible values for a population parameter, built around a sample estimate. We are a stated percentage confident (e.g., 95%) that the range contains the actual population value (e.g., the population mean).
Point estimate
A point estimate is a single calculated value that serves as the best guess for an unknown population parameter (e.g., a mean, or the proportion in support of a statement)
Population proportion notation
p
Sample proportion notation
p̂
Central limit theorem
- When many sample means are taken, the distribution of these sample means looks like a normal distribution (particularly for larger sample sizes)
- The population's distribution (even when skewed) does not change this normal appearance of the sample means.
How large is large enough when it comes to sample size?
Generally n ≥ 30
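A small simulation sketch of the central limit theorem. The population is assumed to be exponential with rate 1 (strongly right skewed, mean 1, SD 1), and the seed and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the simulation is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed population (exponential, mean 1, SD 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

n = 30
means = [sample_mean(n) for _ in range(5000)]

center = statistics.fmean(means)     # should be close to the population mean, 1
spread = statistics.stdev(means)     # should be close to sigma / sqrt(n)
predicted_spread = 1 / math.sqrt(n)

# Share of sample means within 1.96 predicted SDs of the population mean;
# a normal distribution would put about 95% of them there.
share_within = sum(abs(m - 1.0) <= 1.96 * predicted_spread for m in means) / len(means)
```

Even though single draws are heavily skewed, the sample means cluster symmetrically around 1 with a spread near 1/sqrt(30).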
Success failure condition
np>= 10 and n(1-p) >=10
95% confidence interval for the mean
Point estimate +- 1.96 *SE
Standard Error SE
SE = σ/sqrt(n)
σ- population standard deviation
n- sample size
The 95% confidence interval means:
Roughly 95% of the time, the interval sample mean +- 1.96 σ/sqrt(n) will contain the population mean
99% confidence interval
Point estimate +- 2.58 σ/sqrt(n)
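The multipliers 1.96 and 2.58 come from the standard normal distribution. The sketch below computes them instead of hard-coding them; the sample numbers (mean 50, σ = 10, n = 100) and the function name are made up.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Interval x_bar +- z* . sigma/sqrt(n), for a known population SD sigma."""
    # z* leaves (1 - confidence)/2 in each tail of N(0, 1).
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / sqrt(n)   # z* times the standard error
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 50 from n = 100 cases, population SD known to be 10.
ci_95 = z_confidence_interval(50, 10, 100, confidence=0.95)
ci_99 = z_confidence_interval(50, 10, 100, confidence=0.99)
```

With SE = 10/sqrt(100) = 1, the 95% interval is about 48.04 to 51.96 and the 99% interval is about 47.42 to 52.58. More confidence requires a wider interval.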
Consider the case for finding confidence intervals without population standard deviation
- Use sample SD instead of population SD
- Use t-tables instead of z table
t formula
t = (x̄- μ)/(s/sqrt(n))
Proof by contradiction
If the probability of the observed result under the original claim is very small, we should reject the claim and accept our conjecture. Either you are observing a rare event or something is wrong with the original claim
Four steps of proof by contradiction
1) State the hypotheses:
- null hypothesis Ho : μ =
- alternative hypothesis Ha: μ …
2) Compute the z score from the sample mean
3) Find the p-value: the area to the right of the z score (for an upper-tailed alternative)
4) Make the decision:
- if the p-value is below the alpha value, reject the null hypothesis and accept the alternative; otherwise fail to reject the null hypothesis
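The four steps can be sketched as follows for an upper-tailed test with known σ. All numbers (a null value of 100, σ = 15, n = 100, alpha = 0.05) and the function name are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Test H0: mu = mu_0 against Ha: mu > mu_0 when sigma is known."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))   # step 2: z score of the sample mean
    p_value = 1 - NormalDist().cdf(z)        # step 3: area to the right of z
    decision = "reject H0" if p_value < alpha else "fail to reject H0"  # step 4
    return z, p_value, decision

# Hypothetical data: H0 says mu = 100; sigma = 15 and n = 100, so SE = 1.5.
z_a, p_a, decision_a = upper_tail_z_test(x_bar=103, mu_0=100, sigma=15, n=100)
z_b, p_b, decision_b = upper_tail_z_test(x_bar=101, mu_0=100, sigma=15, n=100)
```

A sample mean of 103 gives z = 2 and a p-value near 0.023, small enough to reject at alpha = 0.05. A sample mean of 101 gives a p-value near 0.25, so the null hypothesis is not rejected.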
When do you use z tables
Population SD is given and you are trying to estimate population mean
When do you use t tables
Population SD is not given and you are trying to estimate population mean
When do you use chi squared tables (X^2)
Population SD is not given and you are trying to estimate population variance.
Using chi squared tables
- Find the row for the distribution's degrees of freedom
- Identify a range for the area (e.g., 0.025 to 0.05)
- The chi squared table provides upper tail values, which is different from z and t distribution tables
Population variance confidence interval
[(n-1)s^2/χ2^2, (n-1)s^2/χ1^2], where χ1^2 and χ2^2 are the lower and upper chi squared critical values for n-1 degrees of freedom
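A sketch of the interval calculation. The Python standard library has no chi squared distribution, so the two critical values are passed in as read from a table. The example assumes n = 10 and s^2 = 4, with the usual table values for 9 degrees of freedom at 95% confidence (2.700 and 19.023).

```python
def variance_confidence_interval(s_squared, n, chi2_lower, chi2_upper):
    """Confidence interval for the population variance.

    chi2_lower and chi2_upper are the lower and upper chi squared critical
    values for n - 1 degrees of freedom, looked up in a table.
    """
    lower = (n - 1) * s_squared / chi2_upper   # dividing by the larger value gives the lower bound
    upper = (n - 1) * s_squared / chi2_lower   # dividing by the smaller value gives the upper bound
    return lower, upper

# Hypothetical sample: n = 10, s^2 = 4, 95% confidence, 9 degrees of freedom.
# Table values: upper-tail area 0.975 -> 2.700, upper-tail area 0.025 -> 19.023.
var_low, var_high = variance_confidence_interval(4.0, 10, chi2_lower=2.700, chi2_upper=19.023)
```

The interval is not symmetric around s^2 = 4 (roughly 1.89 to 13.33) because the chi squared distribution is right skewed.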
What is instrumentation?
Term to describe the instruments used to measure physical quantities, e.g., pressure, temperature, voltage
Active instruments
-Have external power
- expensive (complicated)
- resolution can be very small
Passive instruments
-Do not have external power
- inexpensive (simple)
-resolution is limited
Null type instruments
-No display
- null pressure gauges have weights put on/taken off to measure pressure (cumbersome); weights are balanced until a reference mark is reached.
Deflection type instruments
-Display,
- deflection pressure gauges conveniently have a pointer moving against a scale
Analog instruments
- output varies continuously. Resolution is determined by what your eye can distinguish
Digital instruments
-Has discrete steps in resolution
- requires analog to digital converter (A/D)
-Expensive
- Slow, not good for fast processes
Smart instruments
Has a microprocessor
Non-smart instruments
Does not have a microprocessor
Inaccuracy
The extent to which a reading might be wrong; it is often quoted as a percentage of the full-scale (f.s.) reading (max value) of an instrument.
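A short worked example of why quoting inaccuracy as a percentage of full scale matters. The gauge (0 to 10 bar, ±1% f.s.) and the function names are made up.

```python
def max_error(full_scale, inaccuracy_percent_fs):
    """Largest expected error, in measurement units, anywhere on the scale."""
    return full_scale * inaccuracy_percent_fs / 100

def worst_case_percent_of_reading(reading, full_scale, inaccuracy_percent_fs):
    """The same error expressed as a percentage of the actual reading."""
    return 100 * max_error(full_scale, inaccuracy_percent_fs) / reading

# Hypothetical pressure gauge: range 0 to 10 bar, inaccuracy +-1% of full scale.
error_bar = max_error(10.0, 1.0)                             # +-0.1 bar everywhere
at_full_scale = worst_case_percent_of_reading(10.0, 10.0, 1.0)
at_low_reading = worst_case_percent_of_reading(1.0, 10.0, 1.0)
```

The ±0.1 bar error is only 1% of a 10 bar reading but 10% of a 1 bar reading, so an instrument is best chosen with a range that fits the values being measured.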
Tolerance
Describes the maximum deviation of a manufactured component from some specified value.
Range or span
Defines the min and max values of quantity that instruments can measure
Threshold
The input has to reach a certain level before the change in the instrument's output is large enough to be detectable
Resolution
The lower limit on the magnitude of change in the input measured quantity that produces an observable change in the instrument output.
Nonlinearity
Maximum deviation of any of the output readings from the fitted straight line.
Linear-regression line
The estimate of y in z-score units = R times the given x in z-score units
Linear regression line:
(ŷ-ȳ)/sy = R (x-x̄)/sx
ȳ- average of the y data points
ŷ- the predicted y value on the line of best fit
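Rearranging the z-score form gives slope = R sy/sx and intercept = ȳ - slope · x̄. The sketch below computes these from raw data; the five data points and the function name are made up.

```python
import statistics

def regression_line(xs, ys):
    """Return (R, slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample SDs (n - 1)

    # R is the average product of the x and y z-scores (dividing by n - 1).
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)

    slope = r * s_y / s_x                  # from (y_hat - y_bar)/s_y = R (x - x_bar)/s_x
    intercept = y_bar - slope * x_bar      # the line passes through (x_bar, y_bar)
    return r, slope, intercept

# Made-up data points for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, slope, intercept = regression_line(xs, ys)
r_squared = r ** 2
```

For this data the line is ŷ = 2.2 + 0.6x with R ≈ 0.77, so R squared is 0.6.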
Sensitivity
-Slope
- the measure of change in instrument output that occurs when the quantity measured changes by a given amount
-scale deflection/ value of measurement producing deflection
Zero drift
Bias: the zero reading of the instrument is modified by a change in ambient conditions
Sensitivity drift
- slope (ie sensitivity) drifts because of change in ambient conditions
-eg modulus of elasticity in spring changing as a function of temperature
Sensitivity drift coefficient
= sens drift/ change in environment
Zero drift coefficient
= Zero drift/ change in environment
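A worked sketch of both coefficients for a hypothetical spring balance calibrated at 20 °C and then used at 30 °C. All readings and the function name are invented for illustration.

```python
def drift_coefficient(value_at_calibration, value_now, ambient_at_calibration, ambient_now):
    """Drift per unit change in the ambient condition (here, per degree C)."""
    drift = value_now - value_at_calibration
    change_in_environment = ambient_now - ambient_at_calibration
    return drift / change_in_environment

# Hypothetical spring balance, calibrated at 20 C and used at 30 C.
# Zero reading (no load): 0.0 divisions at 20 C, 0.3 divisions at 30 C.
zero_drift_coeff = drift_coefficient(0.0, 0.3, 20.0, 30.0)      # divisions per C

# Sensitivity (slope): 20.0 divisions/kg at 20 C, 20.5 divisions/kg at 30 C.
sens_drift_coeff = drift_coefficient(20.0, 20.5, 20.0, 30.0)    # (divisions/kg) per C
```

Zero drift shifts every reading by the same amount, while sensitivity drift produces an error that grows with the size of the measured quantity.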
Reasons for incorrect or inaccurate measurements
- Behavior will gradually diverge from the stated specifications
- effects of dust, dirt, fumes and chemicals in the environment
Several factors impacting rate of divergence
- type of instrument
- the frequency of usage
- severity of the operating conditions
Systematic errors
Mean is wrong (accuracy)
Random errors
Standard deviation is large (precision)
What are static characteristics
1) linearity
2) tolerance
3) sensitivity
What is the strength of a linear fit model?
The R squared value
How many outcomes are there in a Bernoulli trial?
There can only be 2 outcomes
Expected number of trials until the first success, and its SD, given success probability p (geometric distribution)
1/p - expected value
sqrt((1-p)/p^2) - standard deviation
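A quick numerical check of these two formulas, assuming the example of rolling a fair die until the first 1 appears (p = 1/6). The long sum approximates the expected value directly from the geometric probabilities P(first success on trial k) = (1-p)^(k-1) p.

```python
from math import sqrt

def geometric_mean_sd(p):
    """Expected number of trials until the first success, and its SD."""
    return 1 / p, sqrt((1 - p) / p**2)

# Hypothetical example: rolling a fair die until the first 1 appears.
p = 1 / 6
mean_trials, sd_trials = geometric_mean_sd(p)

# Direct check of the mean: sum of k * P(first success on trial k).
# 2000 terms is plenty, since the remaining probabilities are negligible.
approx_mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2001))
```

On average it takes 6 rolls to see the first 1, with an SD of sqrt(30), about 5.48 rolls.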