Week 1: Exploratory Data Analysis & Basic Statistics Flashcards

1
Q

What are the primary sources of data?

A

Data sources include Direct/Primary Data (collected firsthand) and Indirect/Secondary Data (e.g., government datasets, or studies such as the English Longitudinal Study of Ageing or the Millennium Cohort Study).

2
Q

Define variables, cases, and observations in a dataset.

A

A variable is a characteristic measured on each case. Cases (or units) are the individual subjects (like patients or schools), and an observation is the recorded value of a variable for a case.

3
Q

What are the levels of measurement for variables?

A

Variables can be Nominal (unordered categories), Ordinal (ordered categories), Interval (numeric without a true zero), or Ratio (numeric with a true zero).

4
Q

What distinguishes nominal, ordinal, interval, and ratio data?

A
  • Nominal: Named categories without order.
  • Ordinal: Ordered categories without meaningful intervals.
  • Interval: Numeric with no true zero, like temperature in °C.
  • Ratio: Numeric with a meaningful zero, like age or income.

Note: Variables can be recoded and modified, but only downward, from a higher level of measurement to a lower one (e.g., from ratio to ordinal or nominal), not the reverse.
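
As a rough illustration of recoding in one direction only, the sketch below turns a ratio-level age into ordered bands and then into two named groups. The ages, the band cut-offs, and the function name `age_band` are made up for this example.

```python
def age_band(age):
    """Recode a ratio-level age (in years) into an ordered band."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"

# Hypothetical ratio-level data: exact ages in years.
ages = [4, 17, 18, 42, 64, 65, 90]

# Ordinal version: the bands have an order, but exact ages are lost.
bands = [age_band(a) for a in ages]

# Coarser still: two named groups.
groups = ["adult" if a >= 18 else "child" for a in ages]

print(bands)
print(groups)
```

Once only the bands are stored, the exact ages cannot be recovered, which is why the recoding cannot be reversed.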

5
Q

What is Exploratory Data Analysis (EDA)?

A

EDA uses descriptive statistics to summarise data, often through numeric summaries (tables) and graphs. Summary measures attempt to describe a whole set of data with a single value.

6
Q

How do tables vary based on data type?

A

For nominal/ordinal data, tables show counts, proportions, and percentages. For continuous data, tables display measures of central tendency and dispersion.
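
A minimal sketch of a frequency table for a nominal variable, using only the Python standard library. The `transport` data and the helper name `frequency_table` are invented for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, proportion, percentage)."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}

# Hypothetical nominal data: mode of transport for 8 cases.
transport = ["bus", "car", "car", "bike", "car", "bus", "car", "bike"]
table = frequency_table(transport)

for cat, (count, prop, pct) in sorted(table.items()):
    print(f"{cat:<5} {count:>3} {prop:>6.3f} {pct:>6.1f}%")
```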

7
Q

Define cross-tabulation.

A

Cross-tabulation shows the distribution of cases across categories of two variables, often identifying independent and dependent variables.

Note: In cross-tabs, the independent variable is conventionally placed in the columns, and column percentages are presented.
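
The column-percentage convention can be sketched as follows, assuming a made-up dataset in which area (urban/rural) is the independent variable and a yes/no response is the dependent variable. Each percentage is computed within a column, so every column sums to 100%.

```python
from collections import Counter

# Hypothetical cases: (independent variable, dependent variable).
cases = [
    ("urban", "yes"), ("urban", "yes"), ("urban", "no"), ("urban", "yes"),
    ("rural", "no"), ("rural", "yes"), ("rural", "no"), ("rural", "no"),
]

cell_counts = Counter(cases)                  # cases in each (column, row) cell
col_totals = Counter(iv for iv, _ in cases)   # cases in each column

def column_percent(iv, dv):
    """Percentage of cases in column `iv` that fall in row `dv`."""
    return 100 * cell_counts[(iv, dv)] / col_totals[iv]

print("      urban   rural")
for dv in ("yes", "no"):
    row = "  ".join(f"{column_percent(iv, dv):5.1f}%" for iv in ("urban", "rural"))
    print(f"{dv:<4} {row}")
```

Reading across a row then compares the categories of the independent variable directly.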

8
Q

What are measures of central tendency?

A

These include the Mode (most frequent value), Median (middle value), and Mean (average), each representing a different aspect of the data’s centre.

9
Q

When should you use mode, median, or mean?

A
  • Mode: Nominal data, or data where one value occurs with high frequency. Can also be used for interval/ratio variables as a “modal group” (which depends on how the groups are defined).
  • Median: Ordinal and interval/ratio variables. Robust to outliers, so suited to skewed distributions with extreme values.
  • Mean: Interval/ratio variables. Suited to symmetric data concentrated in the middle of their range, without outliers (the mean is not robust to extreme values).

Note: With the median, if N is odd, the median is the middle observation. If N is even, the median is the average of the two middle observations.
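
The sketch below compares the three measures on a small, made-up set of incomes containing one extreme value, using Python's standard `statistics` module. It also shows the odd and even cases for the median.

```python
import statistics

# Hypothetical incomes (in thousands); 500 is an extreme value.
incomes = [20, 22, 22, 25, 27, 30, 500]

mode_value = statistics.mode(incomes)      # most frequent value: 22
median_value = statistics.median(incomes)  # N = 7 is odd: the 4th sorted value, 25
mean_value = statistics.mean(incomes)      # 646 / 7, pulled upward by the 500

# N even: the median is the average of the two middle values, (22 + 25) / 2.
even_median = statistics.median([20, 22, 25, 27])

print(mode_value, median_value, round(mean_value, 1), even_median)
```

Here the mean lies far above every value except the outlier, while the median stays close to the bulk of the data.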

10
Q

Explain measures of dispersion.

A

Particularly for interval/ratio data, the mean and the median alone do not describe the shape of the data. Measures of dispersion describe the spread and include the Range (max − min), Percentiles, Interquartile Range (IQR), Variance, and Standard Deviation.

Note: We are often interested in explaining variation, and differences in variation between variables.

11
Q

What does the Interquartile Range (IQR) represent?

A

IQR is the range between the 25th and 75th percentiles. It shows the spread of the middle 50% of values, which makes it a measure of spread that is robust to outliers.
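
A small sketch of the IQR next to the plain range, on made-up values with one outlier. It uses `statistics.quantiles` with its default settings. Other quartile conventions exist and can give slightly different cut points.

```python
import statistics

# Hypothetical values; 100 is an outlier.
values = [1, 3, 4, 6, 7, 8, 9, 11, 12, 14, 100]

# n=4 returns the three quartile cut points (25th, 50th, 75th percentiles).
q1, q2, q3 = statistics.quantiles(values, n=4)

iqr = q3 - q1                            # spread of the middle 50% of values
data_range = max(values) - min(values)   # driven entirely by the extremes

print(q1, q3, iqr, data_range)
```

The outlier stretches the range to 99 but leaves the IQR at 8.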

12
Q

What are Variance and Standard Deviation?

A

Variance is the square of the standard deviation; both measure the spread of data around the mean. The lower the standard deviation, the closer the data points are to the mean and the more “indicative” the mean is of the whole dataset.

Note: You can think of the standard deviation as a typical deviation from the mean; variability around the mean; precision of the mean. The variance expresses the same idea in squared units.
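
The relationship between the two measures can be checked on a small made-up dataset. The sketch uses the population formulas (dividing by N) from Python's `statistics` module. The card does not say whether the population or sample version is meant, so that choice is an assumption; the sample version (dividing by N − 1) is shown for comparison.

```python
import statistics

# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population versions: average squared deviation from the mean, and its root.
pop_var = statistics.pvariance(scores)   # 32 / 8 = 4
pop_sd = statistics.pstdev(scores)       # square root of 4

# Sample versions divide by N - 1 instead of N.
sample_var = statistics.variance(scores)  # 32 / 7

print(pop_var, pop_sd, round(sample_var, 3))
```

The standard deviation is in the same units as the data, which is why it reads as a typical distance from the mean.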

13
Q

How does a bar chart represent data?

A

A bar chart uses rectangles of equal width representing categories, with heights proportional to category frequencies, useful for nominal and ordinal data.

14
Q

What is a clustered bar chart?

A

A clustered bar chart places the bars for the categories of a second variable side by side within each category of the first, allowing comparison across subcategories.

15
Q

Describe the structure of a histogram.

A

A histogram displays continuous data by grouping values into bins. The width of each bar equals the width of its bin, and the area of each bar is proportional to the number of observations falling into that bin.
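
To make the "area, not height" point concrete, the sketch below computes bar heights for bins of unequal width as count divided by bin width. The ages, the bin edges, and the helper name `histogram_heights` are invented for this example, and the last bin is assumed to include its upper edge.

```python
def histogram_heights(values, edges):
    """Return (count, height) per bin, where height = count / bin width."""
    result = []
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        # Bins are [lo, hi), except the last bin, which also includes hi.
        count = sum(lo <= v < hi or (is_last and v == hi) for v in values)
        result.append((count, count / (hi - lo)))
    return result

# Hypothetical ages and unequal bin edges.
ages = [3, 7, 12, 15, 18, 22, 25, 31, 38, 44, 52, 67]
edges = [0, 10, 20, 60, 80]

bars = histogram_heights(ages, edges)
for (lo, hi), (count, height) in zip(zip(edges, edges[1:]), bars):
    print(f"{lo:>2}-{hi:<2} count={count} height={height:.3f} area={height * (hi - lo):.1f}")
```

The 20-60 bin holds the most observations but is not the tallest bar, because its count is spread over a wider bin.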

16
Q

What is the purpose of binning in histograms?

A

Proper binning (number of classes) is crucial to reveal the data’s distribution features without oversimplification or over-detailing.

17
Q

What data requires which kind of graphical representation?

A

If data are nominal or ordinal, we use bar charts or clustered bar charts. If data are continuous, we use histograms or box plots.