Final Of Everything Flashcards
Data Matrix
A convenient way to store data (e.g., spreadsheet, table). Each row is a unique case (observational unit). Each column corresponds to a variable.
The two types of variables
Numerical or Categorical
Numerical Variables
Can be discrete or continuous
Categorical Variables
Can be ordered or nominal
What type of variable is “Number of Siblings”?
Numerical (discrete)
What type of variable is “Student Height”?
Numerical (continuous)
What type of variable is “Previous Stats Courses Taken”?
Categorical (nominal)
Explanatory variables might affect
Response variable
Two types of data collection
Observational Studies and Experiments
Researchers collect data passively; they merely observe
Observational studies
Researchers actively control the data collection, trying to establish causation
Experiments
Sampling principles and strategies
1st step: Identify topics and questions to be investigated
2nd: Clearly laid out research questions are important for identifying which subjects/cases to study and which variables are important
3rd: Consider how data are collected
Example: Suppose we want to estimate household size, where a household is defined as people living together in the same dwelling and sharing living accommodation. If we selected students at random at an elementary school and asked them what their family size is, would this be a good measure of household size?
- Average will be biased
- Only measuring households with children, not single people or people without children.
- Would likely estimate a higher number than the true number.
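A minimal sketch of why the estimate runs high, assuming a made-up set of 100 households (the sizes and counts below are invented for illustration, not taken from any data). Choosing a student at random chooses a household in proportion to how many students live in it, so households without children are never reached.

```python
# Hypothetical households as (household size, number of elementary-school children).
# These counts are invented for illustration only.
households = (
    [(1, 0)] * 30    # single people
    + [(2, 0)] * 30  # couples without children
    + [(3, 1)] * 20
    + [(4, 2)] * 15
    + [(5, 3)] * 5
)

# True average: every household counts exactly once.
true_mean = sum(size for size, _ in households) / len(households)

# Picking a student at random picks that student's household, so each
# household is weighted by the number of students living in it.
total_students = sum(children for _, children in households)
student_mean = (
    sum(size * children for size, children in households) / total_students
)

print(f"true mean household size:       {true_mean:.2f}")
print(f"mean reported by random student: {student_mean:.2f}")
```

Under these assumed counts the student-based average is larger than the true average, which matches the direction of bias described in the card.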
Relationship between Sample and Population
A sample is a subset of the population:
Population: people
Sample: a group of selected people
Three sampling methods
1) simple random sample
2) stratified sample
3) cluster sample
Simple random sample
Randomly selected from population
What type of sample is cars passing through intersections in Kelowna?
Simple random sample
Stratified sample
Cases are grouped into strata, then a simple random sample is taken within each stratum
Cluster sample
Divide the population into clusters, randomly select some clusters, and sample all cases in the selected clusters
Multistage sampling
Clusters are sampled randomly, then cases are randomly sampled within each selected cluster
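The sampling methods above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the population of 30 villages with 20 people each is hypothetical, and the function names are invented for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30 villages of 20 people, stored as (village, person).
population = [(village, person) for village in range(30) for person in range(20)]

def simple_random_sample(cases, n):
    """Every case has the same chance of being selected."""
    return rng.sample(cases, n)

def stratified_sample(cases, n_per_stratum):
    """Group cases into strata (here: villages), then sample within each one."""
    strata = {}
    for case in cases:
        strata.setdefault(case[0], []).append(case)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, n_per_stratum))
    return picked

def cluster_sample(cases, n_clusters):
    """Randomly pick some clusters, then keep every case inside them."""
    clusters = sorted({case[0] for case in cases})
    chosen = set(rng.sample(clusters, n_clusters))
    return [case for case in cases if case[0] in chosen]

def multistage_sample(cases, n_clusters, n_per_cluster):
    """Randomly pick clusters, then randomly sample cases within each."""
    clusters = sorted({case[0] for case in cases})
    picked = []
    for village in rng.sample(clusters, n_clusters):
        members = [case for case in cases if case[0] == village]
        picked.extend(rng.sample(members, n_per_cluster))
    return picked

srs = simple_random_sample(population, 150)
strat = stratified_sample(population, 5)        # 30 villages x 5 people
clus = cluster_sample(population, 5)            # 5 villages x all 20 people
multi = multistage_sample(population, 15, 10)   # 15 villages x 10 people
```

Stratified sampling touches every village, while cluster and multistage sampling only visit the selected villages, which is what makes them cheaper when clusters are similar to one another.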
Scatterplot
Provides a case-by-case view of the data. Can visualize the relationship between two numerical variables.
Dot plot
Visualize one numerical variable
Sample mean (sample average formula)
x̄ = (x1 + x2 + x3 +… +xn)/n
What are the units of the sample mean?
The same as the units of the data
Symbol for population mean
μ
Histograms
Provides a view of the data density (i.e., the data distribution)
Unimodal histogram distribution
A single prominent peak
Bimodal/multimodal histogram distribution
Two (bimodal) or more (multimodal) prominent peaks
Uniform histogram distribution
No apparent peaks
Types of skewness
Right skewed (tail on right), left skewed (tail on left) or symmetric
Deviation
Distance of an observation from the mean (xi - x̄)
Sample variance
s^2 = ((x1-x̄)^2 + (x2-x̄)^2 + … + (xn-x̄)^2)/(n-1)
What are the units of sample variance?
The square of the units of the data
Sample standard deviation formula
s = sqrt(s^2)
Population variance formula
σ^2 = ((x1-μ)^2 + … + (xN-μ)^2)/N, where μ is the population mean and N is the population size
Population standard deviation
σ = sqrt(σ^2)
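The variance and standard deviation formulas can be written out directly. The sketch below uses made-up data and hypothetical function names; the only difference between the two variances is the denominator (n - 1 for a sample, n for a population).

```python
import math

def sample_variance(xs):
    """s^2: squared deviations from the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """sigma^2: squared deviations from the mean, divided by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values only

s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)            # sample standard deviation
sigma = math.sqrt(sigma2)    # population standard deviation

print(f"s^2 = {s2:.3f}, s = {s:.3f}")
print(f"sigma^2 = {sigma2:.3f}, sigma = {sigma:.3f}")
```

For the same data the sample variance is always a little larger than the population variance, because it divides by a smaller number.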
Main components of a box plot
- Median Q2
- First quartile Q1 (median of the lower half)
- Third quartile Q3 (median of the upper half)
- Max and min whiskers: reach at most to Q3 + 1.5 IQR and Q1 - 1.5 IQR, where IQR = Q3 - Q1
IQR formula
Q3-Q1
Steps to draw a box plot
1) Draw a thick line for the median (Q2)
2) Draw rectangle with bounds Q1 and Q3
3) Draw a dotted line for Q1-1.5IQR and Q3+1.5IQR
4) Label outliers and draw T-shaped upper/lower whiskers (they only go as far as the highest or lowest data points inside those limits)
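The numbers needed for these steps can be computed as below. This is a sketch, not a definitive method: quartiles are taken as the median of each half (excluding the middle value when n is odd), which is one common convention among several, and the data and function name are invented for illustration.

```python
import statistics

def box_plot_summary(values):
    """Return the quantities needed to draw a box plot by hand."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: when n is odd, the middle value is left out of both halves.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]

    q1 = statistics.median(lower)
    q2 = statistics.median(xs)
    q3 = statistics.median(upper)
    iqr = q3 - q1

    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    inside = [x for x in xs if low_limit <= x <= high_limit]

    return {
        "Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
        # Whiskers stop at the most extreme data points inside the limits.
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the value 30 lies beyond Q3 + 1.5 IQR, so it is labeled as an outlier and the upper whisker stops at 9.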
Robust Statistics
Median and IQR are more robust than mean and standard deviation (less affected by outlier behavior)
Common practices
- Symmetric distributions -> mean and SD
- Skewed distributions -> median and IQR
What type of plot would be most useful for visualizing the data density?
Histogram
Suppose a data set only has two values. What can you say about the relationship between mean and median?
Mean= median
Consider a population of [1,2,3,4,10]. What are the mean and variance (VAR)?
Mean =4 Var=10
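A quick check of this card using Python's standard `statistics` module. Because the values are described as a population, the population variance (dividing by n) applies; the sample variance is shown only for contrast.

```python
import statistics

population = [1, 2, 3, 4, 10]

mean = statistics.mean(population)
pop_var = statistics.pvariance(population)    # divides by n
sample_var = statistics.variance(population)  # divides by n - 1, for contrast

print(mean, pop_var, sample_var)
```

The squared deviations are 9, 4, 1, 0 and 36, which sum to 50; dividing by 5 gives the population variance on the card.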
A company records the commute distances of all 42 of its employees. By mistake, the smallest commute was measured at 1 mile instead of 10. Compare the recorded median to the actual median.
The recorded median will be the same as the actual median
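A small sketch of this robustness, assuming 42 hypothetical commute distances (the values are invented; only the count of 42 and the 1-versus-10 error come from the card).

```python
import statistics

# Hypothetical commutes in miles: 42 values whose smallest is 10.
actual = list(range(10, 52))
recorded = [1] + actual[1:]   # the smallest value was mis-recorded as 1

actual_median = statistics.median(actual)
recorded_median = statistics.median(recorded)
actual_mean = statistics.mean(actual)
recorded_mean = statistics.mean(recorded)

print("median:", actual_median, "vs", recorded_median)
print("mean:  ", actual_mean, "vs", recorded_mean)
```

The error changes the smallest value but not the order of the middle values, so the median is unaffected while the mean shifts.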
Suppose we are interested in estimating the malaria rate in a dense tropical portion of a southeastern country. We learn there are 30 villages, each more or less similar to the next. Our goal is to test 150 individuals. What sampling method should be used?
Cluster sampling
What is the probability of rolling a 1 with a fair die?
1/6
Probability Definition
The probability of an outcome is the proportion of times the outcome would occur if we observed the random process an infinite number of times
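This definition can be illustrated by simulation: as the number of rolls grows, the proportion of 1s tends to settle near 1/6. A minimal sketch, assuming a fair die simulated with Python's `random` module; the function name is hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def proportion_of_ones(n_rolls):
    """Roll a simulated fair die n_rolls times; return the share of 1s."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == 1)
    return hits / n_rolls

for n in (10, 1_000, 100_000):
    print(f"{n:>7} rolls: proportion of 1s = {proportion_of_ones(n):.4f}")

print(f"theoretical probability = {1 / 6:.4f}")
```

A finite simulation only approximates the long-run proportion; small numbers of rolls can land far from 1/6.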
Mutually exclusive or disjoint
Have no outcomes in common
Outcome
Random result from an experiment
Event
A set of outcomes that has a probability assigned to it
Sample space
All possible outcomes
Complement
The event that A does not occur (all outcomes in the sample space that are not in A); P(not A) = 1 - P(A)
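The terms above (sample space, event, disjoint, complement) map naturally onto Python sets. A sketch for one roll of a fair die, assuming equally likely outcomes; the helper `prob` and the event names are invented for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # all possible outcomes of one roll

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {1}            # event: roll a 1
B = {2, 4, 6}      # event: roll an even number
A_complement = sample_space - A   # event: do not roll a 1

print("P(A) =", prob(A))
print("P(not A) =", prob(A_complement))
print("A and B disjoint?", A.isdisjoint(B))
print("P(A or B) =", prob(A | B))
```

Because A and B share no outcomes, P(A or B) equals P(A) + P(B), and P(A) plus P(not A) equals 1.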