Basic Statistics Flashcards
Mode
The value that occurs most frequently in a given data set.
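A minimal sketch of finding the mode with Python's standard `statistics` module. The data values are made up for illustration; `multimode` is shown because a data set can have more than one most frequent value.

```python
from statistics import mode, multimode

# Hypothetical data set, chosen only for illustration.
data = [2, 3, 3, 5, 7, 3, 5]

# The single most frequent value (3 occurs three times).
most_common = mode(data)

# When several values tie for most frequent, multimode returns all of them.
all_modes = multimode([1, 1, 2, 2, 3])

print(most_common)
print(all_modes)
```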
Interquartile Range (IQR)
Q3 - Q1: the difference between the third and first quartiles.
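A short sketch of computing the IQR with the standard library. The data are invented, and the choice of `method="inclusive"` is an assumption: quartile conventions differ between textbooks and software, so other tools may give slightly different values for the same data.

```python
from statistics import quantiles

# Hypothetical data set, chosen only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quartiles and returns the three cut points.
# "inclusive" treats the data as the whole population of interest.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q3, iqr)
```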
Standard Deviation (SD)
sqrt(variance): the square root of the variance, expressed in the same units as the data.
Variance
Population: sum((xi - mean)^2) / N, where xi is each element of the set and N is the population size
Sample: same sum, divided by n - 1 instead of N (n is the sample size)
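As a worked sketch, the population and sample variance can be computed by hand and checked against the standard `statistics` module. The data set is made up; the standard deviation is the square root of each variance.

```python
import math
from statistics import pvariance, variance

# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population variance divides by N; sample variance divides by n - 1.
pop_var = squared_deviations / len(data)
samp_var = squared_deviations / (len(data) - 1)

# Standard deviation is the square root of the variance.
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The library functions should agree with the manual calculation.
print(pop_var, pvariance(data))
print(samp_var, variance(data))
```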
How to describe a histogram
4 Main Aspects:
- Shape - Overall appearance of histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc…
- Center - Mean or Median
- Spread - How far our data spreads. Range, Interquartile Range (IQR), standard deviation, variance.
- Outliers - Data points that fall far from the bulk of the data.
Study design and types of study
Encompasses everything done in preparation for a data-driven research process.
Types:
- Confirmatory: Specify falsifiable hypothesis, then test it.
- Exploratory: Collect and analyze data without pre-specifying a question.
- Comparative: Contrast one quantity with another.
Dependent vs. Independent Data
- Dependent data: observations are correlated due to a feature of the study design (e.g., cluster sampling or longitudinal measurement).
- Independent data: observations are completely independent of each other; they may or may not arise from a common distribution.
i.i.d.
i = independent
id = identically distributed
Simple Random Samples (SRS)
Each sampling unit of a population has an equal chance of being included in the sample.
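A minimal sketch of drawing a simple random sample without replacement using `random.sample`. The population of 100 numbered units, the sample size, and the seed are all assumptions made for illustration.

```python
import random

# Hypothetical population: 100 sampling units labeled 1..100.
population = list(range(1, 101))

# A seeded generator makes the draw reproducible.
rng = random.Random(42)

# Every unit has the same chance (10/100) of ending up in the sample.
sample = rng.sample(population, k=10)
print(sample)
```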
Longitudinal Data
Repeated measures of same variable, collected from same unit over time → likely correlated.
Repeated Measures Data: Wide and Long
Wide format: one row per subject, each measure in separate column.
Long format: one row per measurement.
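The two layouts can be sketched with plain Python dictionaries. The subjects, column names (`t1`, `t2`, `t3`), and values below are hypothetical.

```python
# Wide format: one row per subject, each measurement in its own column.
wide = [
    {"subject": "A", "t1": 5.1, "t2": 5.4, "t3": 5.9},
    {"subject": "B", "t1": 4.8, "t2": 5.0, "t3": 5.3},
]

time_columns = ["t1", "t2", "t3"]

# Long format: one row per (subject, time) measurement.
long = [
    {"subject": row["subject"], "time": col, "value": row[col]}
    for row in wide
    for col in time_columns
]

for record in long:
    print(record)
```

Two subjects with three measurements each give two rows in wide format and six rows in long format.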
Quantitative Variables types
- Continuous - could take on any value within an interval, many possible values.
- Discrete - countable values, often a finite number of them (e.g., counts).
Categorical (or Qualitative) Variables
- Ordinal - groups have an order or ranking.
- Nominal - groups are merely names, no ranking.
Conducting a Population Census
Gather data from the whole population.
Probability Sampling
Selection of a sample from a population based on the principle of randomization, that is, random selection or chance.
Probability of selection for each unit is known.
Types: SRS, Complex (anything besides SRS: cluster sampling, stratification, etc…)
Stratification
Population is divided into different strata, and part of the sample is allocated to each stratum → ensures sample representation from each stratum and typically reduces the variance of survey estimates.
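A sketch of stratified sampling with proportional allocation, which is one common way (not the only one) to split the sample across strata. The strata names, sizes, and total sample size are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical population split into two strata.
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "rural": [f"r{i}" for i in range(400)],
}

total = sum(len(units) for units in strata.values())
sample_size = 50

# Proportional allocation: each stratum gets a share of the sample
# equal to its share of the population, then an SRS is drawn within it.
stratified_sample = {}
for name, units in strata.items():
    n_h = round(sample_size * len(units) / total)
    stratified_sample[name] = rng.sample(units, n_h)

for name, units in stratified_sample.items():
    print(name, len(units))
```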
Clustering
Clusters of population units (e.g., counties) are randomly sampled first (with known probability) within strata, to save data collection costs (data are collected from cases that are geographically close to each other).
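A simplified sketch of cluster sampling: clusters are drawn at random first, and then every unit in the selected clusters is included. The county and household labels are hypothetical, and the sketch assumes a single stratum and no subsampling within clusters.

```python
import random

rng = random.Random(1)

# Hypothetical population: 10 counties (clusters) with 5 households each.
clusters = {
    f"county_{c}": [f"c{c}_h{h}" for h in range(5)]
    for c in range(10)
}

# Stage 1: randomly select clusters, each with known probability 3/10.
sampled_clusters = rng.sample(sorted(clusters), k=3)

# Stage 2 (simplified): include every unit from the selected clusters.
cluster_sample = [
    unit for name in sampled_clusters for unit in clusters[name]
]

print(sampled_clusters)
print(len(cluster_sample))
```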
Non-Probability Sampling
- Probabilities of selection can’t be determined for sampled units
- Often cheap
- Examples: opt-in web surveys, volunteers
- Strong risk of sampling bias
Pseudo-Randomization
Combine a non-probability sample with a probability sample, then estimate the probability of being included in the non-probability sample as a function of auxiliary information available in both samples.
Non-Probability Sampling Calibration
Compute weights for responding units in the non-probability sample that allow the weighted sample to mirror a known population.
Example: If we got more responses from females than males (but population is 50/50), then down-weight females and up-weight males.
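The female/male example can be sketched as a simple one-variable calibration, where each group's weight is its population share divided by its sample share. The 70/30 split of respondents is an assumed figure for illustration; real calibration usually involves several variables at once.

```python
# Hypothetical non-probability sample: 70 female and 30 male respondents.
respondents = ["F"] * 70 + ["M"] * 30

# Known population shares (50/50, as in the example above).
population_share = {"F": 0.5, "M": 0.5}

n = len(respondents)
sample_share = {g: respondents.count(g) / n for g in population_share}

# Weight = population share / sample share.
# Over-represented groups get weights below 1, under-represented above 1.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# After weighting, the sample mirrors the population split.
weighted_total = sum(weights[g] for g in respondents)
weighted_share_f = (
    sum(weights[g] for g in respondents if g == "F") / weighted_total
)

print(weights)
print(weighted_share_f)
```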