Basics Flashcards
What is a variable?
- A characteristic or measurement that can be determined for each member of the population (for example, age)
A numeric variable
- Continuous variable
- Discrete variable
Numeric: variables that are expressed in numbers
Continuous variable (we measure it): any value within a range (height in cm, weight in kg, income in kr.)
Discrete variable (we count it): values are whole numbers (number of cars sold, number of steps)
A categorical variable
- Qualitative data (categorized data) (education, yes or no to a term deposit)
What is a population?
- The entire group of individuals or items of interest
What is a sample/sampling?
- A subset of individuals or observations from a population, which we use to make inferences about the population
What are inferences?
- Conclusions about a population drawn from sample data. Goal: make predictions or understand patterns
What is a statistic?
- Numbers that describe the sample (sample mean, sample variance)
What is a parameter?
- Numbers that describe the entire population (population mean, population variance)
Variance
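Variability around the mean. Measures the average squared deviation of the values in a dataset from their mean; it describes the dataset's spread. Note: the variance of the data itself does not shrink as the sample grows; what decreases with larger sample sizes is the variance of the sample mean.
Example: average height 170 cm, 2 observations: 160 cm and 180 cm.
Squared deviations: (160-170)^2 = 100, (180-170)^2 = 100
Variance: (100+100) / 2 = 100 (dividing by n = 2, the population formula; the sample formula divides by n - 1 = 1 and gives 200)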
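The example can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only. `pvariance` divides by n (the population formula), while `variance` divides by n - 1 (the sample formula).

```python
import statistics

heights = [160, 180]  # the two observations from the example, in cm

mean_height = statistics.mean(heights)           # average height
pop_variance = statistics.pvariance(heights)     # sum of squared deviations / n
sample_variance = statistics.variance(heights)   # sum of squared deviations / (n - 1)

print(mean_height, pop_variance, sample_variance)
```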
Standard deviation
Measures how far the values typically lie from the mean (square root of the variance). Example (as before): the square root of 100 is 10, so the observations deviate from the mean by 10 cm on average.
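A short sketch of the same calculation, assuming the two heights from the variance example and the population formula (dividing by n):

```python
import math
import statistics

heights = [160, 180]  # observations from the variance example, in cm

# Standard deviation computed directly ...
pop_sd = statistics.pstdev(heights)
# ... and as the square root of the population variance.
sd_from_variance = math.sqrt(statistics.pvariance(heights))

print(pop_sd, sd_from_variance)
```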
Why samples
Costs are reduced, and a sample is simpler to analyze than the whole population
Simple random sampling
Every member of the population has an equal chance of being selected, and the members of the sample are selected independently of each other.
Stratified random sampling
The population is divided into subgroups (strata), and a random sample is then drawn from each subgroup
Cluster sampling
The population is divided into clusters, and then we randomly select whole clusters to include in the sample
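The three sampling schemes above can be compared in a small sketch. The population, the `region` field, and the sample sizes are made up for illustration. Here the same grouping serves as strata in one scheme and as clusters in the other, and the selected clusters are kept whole.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 12 people, 4 in each of 3 regions.
regions = ["north"] * 4 + ["south"] * 4 + ["east"] * 4
population = [{"id": i, "region": region} for i, region in enumerate(regions)]

# Simple random sampling: every member has the same chance of selection.
simple_sample = random.sample(population, k=6)

# Group the members by region.
groups = {}
for person in population:
    groups.setdefault(person["region"], []).append(person)

# Stratified random sampling: draw randomly from every subgroup.
stratified_sample = [
    person
    for members in groups.values()
    for person in random.sample(members, k=2)
]

# Cluster sampling: randomly pick whole clusters and keep all their members.
chosen_regions = random.sample(sorted(groups), k=2)
cluster_sample = [person for region in chosen_regions for person in groups[region]]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))
```

Note the difference: the stratified sample contains members from every region, while the cluster sample contains everyone from only some regions.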
Central limit theorem
If the sample is sufficiently large, the sample mean approximately follows a normal distribution, regardless of the population's distribution (provided the population variance is finite)
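A simulation sketch of the idea, under assumed settings: the population is exponential with mean 1 and standard deviation 1 (clearly not normal), each sample has n = 50 observations, and the sampling is repeated 2000 times. All names and numbers are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

n = 50            # size of each sample
n_samples = 2000  # number of repeated samples

# Draw repeated samples from a skewed population and record each sample mean.
sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = statistics.mean(sample_means)  # should be close to the population mean, 1
sd_of_means = statistics.stdev(sample_means)   # should be close to 1 / sqrt(n)

# Share of sample means within one standard deviation of their centre;
# for a normal distribution this share is about 68%.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / n_samples

print(mean_of_means, sd_of_means, within_one_sd)
```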
Standard error
Measures how much a sample statistic, such as the sample mean, is expected to vary from the true population value due to sampling variability. Note: larger sample sizes yield smaller standard errors, so inferences are more precise.
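For the sample mean, the standard error is usually estimated as s / sqrt(n), where s is the sample standard deviation. A minimal sketch with made-up heights; the helper name `standard_error_of_mean` is hypothetical, and repeating the data four times is only a device to enlarge the sample while keeping a similar spread.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Estimate the standard error of the sample mean as s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

small_sample = [168, 172, 165, 175]  # hypothetical heights in cm
large_sample = small_sample * 4      # same values repeated: n = 16

se_small = standard_error_of_mean(small_sample)
se_large = standard_error_of_mean(large_sample)

print(se_small, se_large)  # the larger sample gives the smaller standard error
```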
Point estimator
A statistic calculated from sample data that gives a single-value estimate of an unknown population parameter
Sample mean: x̄ is a point estimator of the population mean μ
Sample proportion: p̂ is a point estimator of the population proportion p
Sample proportion
The fraction of individuals in a sample that possess a certain characteristic (for example, 40% smoke)
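Both point estimators can be computed in a few lines. The data below are invented for illustration, chosen so that 4 of 10 people smoke.

```python
import statistics

# Hypothetical sample of 10 people: age and whether they smoke.
ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44]
smokes = [True, False, False, True, False, True, False, False, True, False]

x_bar = statistics.mean(ages)      # point estimate of the population mean
p_hat = sum(smokes) / len(smokes)  # point estimate of the population proportion

print(x_bar, p_hat)
```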
Unbiasedness (estimator):
The expected value of an estimator equals the true value of the population parameter it is estimating, so it neither systematically overestimates nor underestimates the parameter. By extension, examples of biased estimators: the sample maximum is a biased estimator of the population maximum, and the sample standard deviation (s) is a biased estimator of the population standard deviation.
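Bias can be made visible by simulation: average an estimator over many samples and compare with the true value. The sketch below assumes a made-up population (the integers 1 to 100) and small samples of n = 5 drawn with replacement. On average, both the sample maximum and the sample standard deviation fall below the true values.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

population = list(range(1, 101))         # hypothetical population: 1..100
true_max = max(population)
true_sd = statistics.pstdev(population)  # population standard deviation

n, repeats = 5, 5000
max_estimates, sd_estimates = [], []
for _ in range(repeats):
    sample = random.choices(population, k=n)       # independent draws
    max_estimates.append(max(sample))              # sample maximum
    sd_estimates.append(statistics.stdev(sample))  # sample standard deviation s

avg_max = statistics.mean(max_estimates)  # long-run average of the sample maximum
avg_sd = statistics.mean(sd_estimates)    # long-run average of s

print(true_max, avg_max)
print(true_sd, avg_sd)
```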
Efficiency
Refers to how well an estimator uses the data to estimate a population parameter, relative to other estimators. If there are several unbiased estimators of a parameter, the one with the smallest variance is called the most efficient estimator.
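As an illustration, assume a normal population centred at 0. By symmetry, the sample mean and the sample median are both unbiased estimators of the centre, but the mean varies less from sample to sample, so it is the more efficient of the two. The sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

n, repeats = 25, 4000
means, medians = [], []
for _ in range(repeats):
    sample = [random.gauss(0, 1) for _ in range(n)]  # normal population, centre 0
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.variance(means)      # spread of the sample mean across samples
var_median = statistics.variance(medians)  # spread of the sample median across samples

print(var_mean, var_median)  # the smaller variance marks the more efficient estimator
```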
Sampling distribution of sample variances
Allows us to make inferences about the population variance. Essential for quality control and understanding process variability.
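A simulation sketch, assuming a normal population with known standard deviation sigma. In that case the scaled sample variance (n - 1) * s^2 / sigma^2 follows a chi-square distribution with n - 1 degrees of freedom, whose mean is n - 1 and whose variance is 2(n - 1). The values of sigma, n, and the population mean are arbitrary.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

sigma = 2.0     # assumed population standard deviation
n, repeats = 10, 5000

scaled_variances = []
for _ in range(repeats):
    sample = [random.gauss(50, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance (divides by n - 1)
    scaled_variances.append((n - 1) * s2 / sigma ** 2)

avg_scaled = statistics.mean(scaled_variances)      # compare with n - 1 = 9
var_scaled = statistics.variance(scaled_variances)  # compare with 2 * (n - 1) = 18

print(avg_scaled, var_scaled)
```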