Week 1: Exploratory Data Analysis & Basic Statistics Flashcards
What are the primary sources of data?
Data sources include Direct/Primary Data (data collected firsthand) and Indirect/Secondary Data (e.g., government datasets, or studies such as the English Longitudinal Study of Ageing or the Millennium Cohort Study).
Define variables, cases, and observations in a dataset.
A variable is a characteristic measured on each case. Cases (or units) are the individual subjects (like patients or schools), and an observation is the recorded value of a variable for a case.
What are the levels of measurement for variables?
Variables can be Nominal (unordered categories), Ordinal (ordered categories), Interval (numeric without a true zero), or Ratio (numeric with a true zero).
What distinguishes nominal, ordinal, interval, and ratio data?
- Nominal: Named categories without order.
- Ordinal: Ordered categories without meaningful intervals.
- Interval: Numeric with no true zero, like temperature in Celsius.
- Ratio: Numeric with a meaningful zero, like age or income.
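Note: Variables can be recoded and modified, but only downwards, from a higher level of measurement to a lower one (e.g., from ratio to ordinal or nominal), not the other way round.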
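A minimal sketch of recoding a ratio variable (age in years) into an ordinal one (age bands). The function name `age_group`, the cut points, and the ages are made up for illustration. Note that the exact ages cannot be recovered from the bands, which is why recoding only works in this direction.

```python
def age_group(age):
    """Recode a ratio-level age (in years) into an ordinal age band.

    The cut points 18 and 65 are illustrative assumptions.
    """
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"


ages = [12, 34, 70, 18, 65]  # made-up ratio data
groups = [age_group(a) for a in ages]
print(groups)
```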
What is Exploratory Data Analysis (EDA)?
EDA uses descriptive statistics to summarise data, often through numeric summaries (tables) and graphs. Summary measures attempt to describe a whole set of data with a single value.
How do tables vary based on data type?
For nominal/ordinal data, tables show counts, proportions, and percentages. For continuous data, tables display measures of central tendency and dispersion.
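As a rough illustration, a frequency table for a categorical variable could be built like this. The responses are made up, and `table` is just a name chosen for this sketch.

```python
from collections import Counter

# Made-up ordinal responses, for illustration only
responses = ["agree", "disagree", "agree", "neutral", "agree"]

counts = Counter(responses)
n = len(responses)

# For each category: (count, proportion, percentage)
table = {cat: (count, count / n, 100 * count / n) for cat, count in counts.items()}

for cat, (count, prop, pct) in table.items():
    print(f"{cat:<10}{count:>3}{prop:>8.2f}{pct:>8.1f}%")
```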
Define cross-tabulation.
Cross-tabulation shows the distribution of cases across categories of two variables, often identifying independent and dependent variables.
Note: In cross-tabs (X-tabs), the independent variable is usually placed in the columns, and “column” percentages are presented (each column sums to 100%).
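A small sketch of a cross-tabulation with column percentages, assuming the independent variable defines the columns. The data and the helper `column_percent` are hypothetical.

```python
from collections import Counter

# Made-up cases: (independent variable, dependent variable)
cases = [
    ("female", "yes"), ("female", "yes"), ("female", "no"), ("female", "yes"),
    ("male", "yes"), ("male", "no"), ("male", "no"), ("male", "no"),
]

cell_counts = Counter(cases)                   # count per (column, row) cell
col_totals = Counter(col for col, _ in cases)  # total cases per column


def column_percent(col, row):
    """Percentage of the cases in column `col` that fall in row `row`."""
    return 100 * cell_counts[(col, row)] / col_totals[col]


for col in col_totals:
    for row in ("yes", "no"):
        print(f"{col:<8}{row:<5}{column_percent(col, row):>6.1f}%")
```

Comparing the "yes" percentage across columns shows how the dependent variable differs between categories of the independent variable.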
What are measures of central tendency?
These include the Mode (most frequent value), Median (middle value), and Mean (average), each representing a different aspect of the data’s centre.
When should you use mode, median, or mean?
- Mode: Nominal data, or when one value occurs with high frequency. Can be used for interval/ratio variables as a “modal group” (which depends on how the groups are defined).
- Median: Ordinal and interval/ratio variables. Robust to outliers, so it suits skewed distributions with extreme values.
- Mean: Interval/ratio variables. Best when data are concentrated in the middle of their range and symmetric without outliers (the mean is not robust to extreme values).
Note: With the median, if N is odd, the median is the middle observation. If N is even, the median is the average of the two middle observations.
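A quick sketch using Python's standard `statistics` module to show the odd/even rule for the median and how one extreme value moves the mean far more than the median. All values are made up.

```python
import statistics

incomes = [20, 22, 25, 27, 30]   # N = 5 (odd), made-up values
with_outlier = incomes + [500]   # N = 6 (even), one extreme value added

# Odd N: the median is the middle observation (25)
median_odd = statistics.median(incomes)
# Even N: the median is the average of the two middle observations (25 and 27)
median_even = statistics.median(with_outlier)

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

# The mode also works for nominal data
transport = ["bus", "car", "bus", "walk"]
modal_category = statistics.mode(transport)

print(median_odd, median_even)
print(mean_before, mean_after)
print(modal_category)
```

The median shifts only slightly when the outlier is added, while the mean is pulled well above every typical value.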
Explain measures of dispersion.
Particularly for interval/ratio data, the mean and the median fail to describe the appearance of the data on their own. Dispersion measures spread and include the Range (max-min), Percentiles, Interquartile Range (IQR), Variance, and Standard Deviation.
Note: We are often interested in explaining variation and differences in variation between variables.
What does the Interquartile Range (IQR) represent?
The IQR is the range between the 25th and 75th percentiles. It shows the spread of the middle 50% of values, which makes it a measure of dispersion that is robust to outliers.
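A minimal sketch comparing the range and the IQR when one extreme value is present. It uses `statistics.quantiles` with its default method. Statistical packages use slightly different quartile conventions, so results may differ a little elsewhere. The data and the helper `iqr` are hypothetical.

```python
import statistics


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [2, 4, 6, 8, 10, 12, 14]            # made-up values
with_outlier = [2, 4, 6, 8, 10, 12, 1000]  # largest value replaced by an outlier

range_before = max(data) - min(data)
range_after = max(with_outlier) - min(with_outlier)

print(range_before, range_after)        # the range reacts strongly to the outlier
print(iqr(data), iqr(with_outlier))     # the IQR is unchanged here
```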
What are Variance and Standard Deviation?
Variance is the square of the standard deviation; both measure the spread of data around the mean. A lower standard deviation suggests that data points are closer to the mean, and that the mean is more “indicative” of the whole dataset.
Note: You can think of the standard deviation as a typical deviation from the mean; variability around the mean; precision of the mean.
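A worked sketch of the calculation, assuming the population formulas (dividing by N). The sample versions, `statistics.variance` and `statistics.stdev`, divide by N - 1 instead. The data are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

mean = sum(data) / len(data)

# Variance: the average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance, in the data's own units
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)

# The standard library gives the same population results
print(statistics.pvariance(data), statistics.pstdev(data))
```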
How does a bar chart represent data?
A bar chart uses rectangles of equal width representing categories, with heights proportional to category frequencies, useful for nominal and ordinal data.
What is a clustered bar chart?
A clustered bar chart groups bars for different categories side-by-side, allowing for comparison across subcategories.
Describe the structure of a histogram.
A histogram displays continuous data by grouping values into bins. The width of each bar is equal to the size of its bin, and the area of each bar is proportional to the number of observations falling into that bin.
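A small sketch of why area, not height, carries the count when bins have unequal widths: each bar's height is taken as count divided by bin width, so width times height gives back the count. The function `histogram_bars`, the ages, and the bin edges are made up for illustration.

```python
def histogram_bars(values, edges):
    """Return (lower, upper, count, height) for each bin.

    Height is count / bin width, so bar area (width * height) equals the count.
    Bins are half-open [lower, upper), so a value equal to the last edge is excluded.
    """
    bars = []
    for lower, upper in zip(edges, edges[1:]):
        count = sum(lower <= v < upper for v in values)
        bars.append((lower, upper, count, count / (upper - lower)))
    return bars


ages = [3, 7, 12, 15, 18, 22, 25, 31, 38, 44, 52, 67]  # made-up data
edges = [0, 10, 20, 40, 80]                            # unequal bin widths

bars = histogram_bars(ages, edges)
for lower, upper, count, height in bars:
    print(f"[{lower:>2}, {upper:>2})  count={count}  height={height:.3f}")
```

Here the 20 to 40 bin holds the most observations but is not the tallest bar, because its count is spread over a wider bin.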