Week 1: Exploratory Data Analysis & Basic Statistics Flashcards

Question 1

Q

What are the primary sources of data?

Answer

A

Data sources include Direct/Primary Data (collected firsthand) and Indirect/Secondary Data (e.g., government datasets, or studies like the English Longitudinal Study of Ageing and the Millennium Cohort Study).

Question 2

Q

Define variables, cases, and observations in a dataset.

Answer

A

A variable is a characteristic measured on each case. Cases (or units) are the individual subjects (like patients or schools), and an observation is the recorded value of a variable for a given case.

Question 3

Q

What are the levels of measurement for variables?

Answer

A

Variables can be Nominal (unordered categories), Ordinal (ordered categories), Interval (numeric without a true zero), or Ratio (numeric with a true zero).

Question 4

Q

What distinguishes nominal, ordinal, interval, and ratio data?

Answer

A

Nominal: Named categories without order.
Ordinal: Ordered categories without meaningful intervals.
Interval: Numeric with no true zero, like temperature.
Ratio: Numeric with a meaningful zero, like age or income.

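Note: Variables can be recoded and modified, but only downwards, from a higher level of measurement to a lower one (e.g., from ratio to ordinal or nominal), not the other way around.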
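
As a rough illustration of recoding down the levels of measurement, the sketch below turns a ratio-level age into an ordinal age band. The ages, the cut-off points, and the function name age_band are all made up for this example.

```python
# Hypothetical ages in years (ratio level: zero means "no age").
ages = [23, 35, 47, 52, 68, 71]

def age_band(age):
    """Recode a ratio-level age into an ordered (ordinal) band.

    The cut-offs 40 and 65 are arbitrary choices for illustration.
    """
    if age < 40:
        return "under 40"
    if age < 65:
        return "40-64"
    return "65+"

bands = [age_band(a) for a in ages]
print(bands)

# The exact ages cannot be rebuilt from the bands alone, which is why
# recoding only goes from a higher level of measurement to a lower one.
```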

Question 5

Q

What is Exploratory Data Analysis (EDA)?

Answer

A

EDA uses descriptive statistics to summarise data, often through numeric summaries (tables) and graphs. Summary measures attempt to describe a whole set of data with a single value.

Question 6

Q

How do tables vary based on data type?

Answer

A

For nominal/ordinal data, tables show counts, proportions, and percentages. For continuous data, tables display measures of central tendency and dispersion.
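
A minimal sketch of a frequency table for a nominal variable, using only the Python standard library. The variable (housing tenure) and its ten values are invented for illustration.

```python
from collections import Counter

# Hypothetical nominal variable: housing tenure for ten cases.
tenure = ["own", "rent", "own", "own", "rent",
          "other", "own", "rent", "own", "own"]

counts = Counter(tenure)
n = len(tenure)

# One row per category: (count, proportion, percentage).
table = {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}

for cat, (count, proportion, percentage) in table.items():
    print(f"{cat:<6} {count:>3} {proportion:>6.2f} {percentage:>6.1f}%")
```

For a continuous variable the same table would instead hold summary measures such as the mean, median, and standard deviation.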

Question 7

Q

Define cross-tabulation.

Answer

A

Cross-tabulation shows the distribution of cases across categories of two variables, often identifying independent and dependent variables.

Note: In cross-tabs, the convention is to place the independent variable in the columns and to present column percentages.
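
The sketch below shows one way column percentages could be computed by hand. The two variables (area type as the independent variable, public transport use as the dependent variable) and all eight cases are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (independent variable, dependent variable).
cases = [
    ("urban", "yes"), ("urban", "yes"), ("urban", "yes"), ("urban", "no"),
    ("rural", "no"), ("rural", "no"), ("rural", "no"), ("rural", "yes"),
]

cell_counts = Counter(cases)
column_totals = Counter(area for area, _ in cases)

# Column percentage: each cell as a share of its column total, where the
# columns are the categories of the independent variable.
column_pct = {
    (area, answer): 100 * count / column_totals[area]
    for (area, answer), count in cell_counts.items()
}

for (area, answer), pct in sorted(column_pct.items()):
    print(f"{area:<6} {answer:<4} {pct:5.1f}%")
```

Within each column the percentages sum to 100, so the columns can be compared directly even if the groups differ in size.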

Question 8

Q

What are measures of central tendency?

Answer

A

These include the Mode (most frequent value), Median (middle value), and Mean (average), each capturing a different aspect of the data’s centre.

Question 9

Q

When should you use mode, median, or mean?

Answer

A

Mode: Nominal data, or any variable where one value occurs with high frequency. Can also be used for interval/ratio variables via a “modal group” (which depends on how the groups are defined).
Median: Ordinal and interval/ratio variables. Robust to outliers, so suited to skewed distributions with extreme values.
Mean: Interval/ratio variables whose values are concentrated in the middle of their range. Suited to symmetric data without outliers (not robust to extreme values).

Note: With the median, if N is odd, the median is the middle observation. If N is even, the median is the average of the two middle observations.
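
A short sketch of the odd/even rule for the median, and of how the mean and median react to an extreme value, using Python's standard statistics module. All numbers are invented for illustration.

```python
import statistics

odd_n = [3, 1, 7, 5, 9]        # N = 5, sorted: 1 3 5 7 9
even_n = [3, 1, 7, 5, 9, 11]   # N = 6, sorted: 1 3 5 7 9 11

median_odd = statistics.median(odd_n)    # the middle observation, 5
median_even = statistics.median(even_n)  # average of 5 and 7, i.e. 6

# Hypothetical incomes, then the same data plus one extreme value.
incomes = [20, 22, 25, 27, 30]
with_outlier = incomes + [500]

# The mean moves from 24.8 to 104; the median only from 25 to 26.
print(statistics.mean(incomes), statistics.median(incomes))
print(statistics.mean(with_outlier), statistics.median(with_outlier))

# The mode also works for nominal data, where mean and median are undefined.
modal_transport = statistics.mode(["bus", "car", "car", "walk"])
print(modal_transport)
```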

Question 10

Q

Explain measures of dispersion.

Answer

A

Particularly for interval/ratio data, the mean and the median fail to describe the appearance of the data on their own. Dispersion measures spread and include the Range (max-min), Percentiles, Interquartile Range (IQR), Variance, and Standard Deviation.

Note: We are often interested in explaining variation, and differences in variation between variables.
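
To see why the mean and median alone do not describe the data, the sketch below compares two made-up datasets that share the same centre but differ in spread. The helper name summarise is hypothetical, and the population standard deviation (dividing by N) is used as an assumption; the sample version divides by N - 1.

```python
import statistics

# Two hypothetical datasets with the same mean and median (both 50).
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

def summarise(values):
    """Return centre and spread measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.pstdev(values),  # population SD (an assumption)
    }

narrow_summary = summarise(narrow)
wide_summary = summarise(wide)
print(narrow_summary)
print(wide_summary)
```

The centres match, but the range is 4 in one dataset and 80 in the other, which only the dispersion measures reveal.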

Question 11

Q

What does the Interquartile Range (IQR) represent?

Answer

A

IQR is the range between the 25th and 75th percentiles. It shows the spread of the middle 50% of values, which makes it a measure of spread that is robust to outliers.
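
A minimal sketch of that robustness, using statistics.quantiles from the Python standard library with its default settings. The data and the helper name iqr are made up, and different software packages use slightly different quartile conventions, so exact quartile values may differ elsewhere.

```python
import statistics

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical data, then the same data with the largest value made extreme.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

range_clean = max(clean) - min(clean)
range_outlier = max(with_outlier) - min(with_outlier)

print("range:", range_clean, "vs", range_outlier)
print("IQR:  ", iqr(clean), "vs", iqr(with_outlier))
```

The range jumps from 10 to 999, while the IQR is the same for both lists because the middle 50% of values is untouched.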

Question 12

Q

What are Variance and Standard Deviation?

Answer

A

Variance is the square of the standard deviation and measures the spread of data around the mean. A lower standard deviation means the data points lie closer to the mean, and the mean is more “indicative” of the whole dataset.

Note: You can think of the standard deviation as a typical deviation from the mean; variability around the mean; precision of the mean. The variance expresses the same idea in squared units.
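
The sketch below works through variance and standard deviation step by step on made-up numbers. It divides by N (the population formula) as an assumption; the sample formula divides by N - 1 and gives a slightly larger value.

```python
import math
import statistics

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: the average squared deviation (divide by N).
pop_variance = sum(squared_deviations) / len(data)

# Standard deviation: square root of the variance, back in the
# original units of the data.
pop_sd = math.sqrt(pop_variance)

print(mean, pop_variance, pop_sd)

# Sample variance divides by N - 1 instead of N.
sample_variance = statistics.variance(data)
print(sample_variance)
```

Here the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the standard deviation is 2.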

Question 13

Q

How does a bar chart represent data?

Answer

A

A bar chart uses rectangles of equal width representing categories, with heights proportional to category frequencies, useful for nominal and ordinal data.

Question 14

Q

What is a clustered bar chart?

Answer

A

A clustered bar chart groups bars for different categories side-by-side, allowing for comparison across subcategories.

Question 15

Q

Describe the structure of a histogram.

Answer

A

A histogram displays continuous data by grouping values into bins. The width of each bar equals the width of its bin, and the area of each bar is proportional to the number of observations falling into that bin.
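
To make the "area, not height" point concrete, the sketch below computes bar heights for bins of unequal width. The values and bin edges are hypothetical, and every value is assumed to lie inside the bin edges.

```python
import bisect

# Hypothetical values and unequal bins: [0, 10), [10, 20), [20, 40).
values = [1, 5, 8, 12, 15, 22, 25, 31, 38]
edges = [0, 10, 20, 40]

counts = [0] * (len(edges) - 1)
for v in values:
    # bisect_right locates the bin whose left edge is the last one <= v.
    counts[bisect.bisect_right(edges, v) - 1] += 1

widths = [right - left for left, right in zip(edges, edges[1:])]

# Height = count per unit of width (frequency density), so that
# area = height * width gives back the count in each bin.
heights = [c / w for c, w in zip(counts, widths)]
areas = [h * w for h, w in zip(heights, widths)]

print(counts)
print(heights)
print(areas)
```

The last bin holds the most observations (4) but is twice as wide, so its bar is no taller than the second bin's. When all bins have equal width, height and area carry the same information.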

Question 16

Q

What is the purpose of binning in histograms?

Answer

A

Proper binning (number of classes) is crucial to reveal the data’s distribution features without oversimplification or over-detailing.
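
A rough sketch of how the number of bins changes what a histogram shows. The data (two clusters of values) and the helper name equal_width_counts are made up for illustration.

```python
def equal_width_counts(values, k):
    """Count observations in k equal-width bins spanning min to max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value would land just past the last bin, so clamp it.
        index = min(int((v - lo) / width), k - 1)
        counts[index] += 1
    return counts

# Hypothetical data with two clusters, around 3 and around 13.
values = [1, 2, 2, 3, 3, 3, 4, 11, 12, 12, 13, 13, 13, 14]

for k in (2, 4, 13):
    print(k, equal_width_counts(values, k))
```

With 2 bins the counts are [7, 7] and the gap between the clusters is hidden; with 4 bins the counts are [7, 0, 0, 7] and the gap is visible; with 13 bins the shape within each cluster appears, but many bins are empty or hold a single observation.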

Question 17

Q

What data requires which kind of graphical representation?

Answer

A

If data are nominal or ordinal, we use bar charts or clustered bar charts. If data are continuous, we use histograms or box plots.
