Data Summary Flashcards

1
Q

what is quantitative data

A

Quantitative data measure some quantity resulting in a numerical value, e.g. weight, salary.

2
Q

what is qualitative data

A

Qualitative data measure the quality of something resulting in a value that does not have a numerical meaning, e.g. colour, religion, season.

3
Q

what is discrete quantitative data

A

Discrete: data whose possible values form a distinct series of numbers, typically counts (e.g. number of traffic accidents, number of children born to a woman).

4
Q

what is continuous quantitative data

A

Continuous: data that can be measured ever more precisely, so the possible values are essentially continuous (e.g. height, speed).

5
Q

what is ordinal qualitative data

A

Ordinal: non-numeric value but the values have some natural ordering; e.g. poor, fair, good, excellent.

6
Q

what is nominal qualitative data

A

Nominal: unordered, distinct by name only; e.g. retail, construction, manufacturing.

7
Q

What is a frequency distribution?

A

A frequency distribution summarizes discrete variables or qualitative data by counting how often each value occurs.
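
As a minimal sketch of the counting described above, the following uses Python's `collections.Counter` on a made-up qualitative sample. The `sectors` list is hypothetical illustration data, not taken from the cards.

```python
from collections import Counter

# Hypothetical qualitative sample: the sector of eight firms.
sectors = ["retail", "construction", "retail", "manufacturing",
           "retail", "construction", "retail", "manufacturing"]

# Frequency distribution: how often each distinct value occurs.
freq = Counter(sectors)

# Relative frequencies (proportions of the sample); these sum to 1.
n = len(sectors)
rel_freq = {value: count / n for value, count in freq.items()}

for value, count in freq.most_common():
    print(f"{value:15s} {count:2d} {rel_freq[value]:.3f}")
```

The same counts are what a bar plot or pie chart of these data would display.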

8
Q

what is the mode

A

The mode is the most frequently occurring value in a dataset

9
Q

What is a bimodal distribution?

A

A bimodal distribution has two distinct peaks in the frequency of values.
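
A small sketch of finding modes with Python's `statistics` module, using made-up samples. Note the assumption: `multimode` only reports values that tie for the highest count, so it detects the simplest kind of bimodality (two equally tall peaks). In practice, two peaks of different heights would be judged from a histogram.

```python
from statistics import mode, multimode

# Hypothetical samples for illustration.
unimodal = [2, 3, 3, 3, 4, 5]
bimodal = [1, 2, 2, 2, 3, 6, 7, 7, 7, 8]

single_mode = mode(unimodal)     # the most frequently occurring value
tied_modes = multimode(bimodal)  # every value tied for the highest count

print(single_mode)
print(tied_modes)
```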

10
Q

What are the 3 measures of centre in statistics?

A

mode
mean
median

11
Q

What are the 4 measures of spread?

A

range
interquartile range (IQR)
sample variance
standard deviation.

12
Q

Why is it important to know both the centre and spread of a dataset?

A

Knowing both provides a better understanding of the data’s behavior. The center gives us a “typical” value, while the spread tells us how much variability or dispersion exists in the data.

13
Q

What are the population mean and the sample mean?

A

The population mean (𝜇) is a parameter, which is typically unknown.

We take a sample and compute the sample mean (𝜇̂) as an estimate of it.

14
Q

How do you find the position of the sample median for even and odd sample sizes?

A

even: average of the values at positions 𝑛/2 and (𝑛 + 2)/2
odd: (𝑛 + 1)/2
𝑛: sample size
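
The position rule can be sketched in Python as below. For an even sample size the median is taken as the average of the two middle ordered values, at positions 𝑛/2 and (𝑛 + 2)/2. The function name `median_by_position` is a hypothetical helper for illustration. Positions in the rule are 1-based, so the code subtracts 1 for Python's 0-based indexing.

```python
from statistics import median

def median_by_position(values):
    """Sample median found from the position of the middle ordered value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the single middle value sits at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n / 2 and (n + 2) / 2.
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

odd_result = median_by_position([7, 1, 5])
even_result = median_by_position([4, 1, 3, 2])
print(odd_result, even_result)
```

Both results agree with `statistics.median` on the same inputs.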

15
Q

what is the range

A

The range is the difference between the maximum and minimum value.

16
Q

What is one disadvantage of the range?

A

It can be misleading if one value is very different from the rest (an outlier), because the range depends only on the two most extreme values.

17
Q

what is an outlier

A

An outlier is a value that is very different to the other values recorded.

18
Q

What are percentiles and how are they used?

A

Percentiles: Values that divide the dataset into 100 equal parts.

25th percentile (lower quartile or 1st quartile): 25% of data lies below it.

75th percentile (upper quartile or 3rd quartile): 75% of data lies below it.

19
Q

what is the interquartile range

A

The difference between the 75th percentile and 25th percentile, representing the spread of the middle 50% of data.
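
A minimal sketch of computing the quartiles and IQR with `statistics.quantiles`, on a made-up sample. Quartile conventions differ between textbooks and software. This example assumes the function's default "exclusive" method, so other tools may give slightly different values for the same data.

```python
from statistics import quantiles

# Hypothetical sample for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th percentile, the median, and the 75th percentile.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1  # spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```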

20
Q

population variance formula

A

𝜎² = ∑(𝑦𝑖 - 𝜇)² / 𝑁

𝑁: population size
𝑦𝑖: each value.

21
Q

what does variance measure

A

Measures the spread of data from the population mean (𝜇).

22
Q

What is sample variance and how is it different from population variance?

A

Measures the spread of data from the sample mean (𝜇̂).
Sample variance divides by (𝑛 - 1) instead of 𝑁 to correct for bias in estimating population variance

23
Q

sample variance formula

A

𝑠² = ∑(𝑦𝑖 - 𝜇̂)² / (𝑛 - 1)

where n-1 is the degrees of freedom

24
Q

why do we use standard deviation

A

Variance is expressed in squared units, so we take its square root to get a measure of spread in the same units as the data.

25
Q

standard deviation formula

A

𝑠 = √(𝑠²)
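
The variance and standard deviation formulas from the preceding cards can be checked numerically. This is a sketch on a made-up sample `y`, treating the same values first as a whole population (divide by the number of values) and then as a sample (divide by 𝑛 - 1), and comparing the results with the standard library's `pvariance` and `variance`.

```python
from math import sqrt
from statistics import pvariance, variance

# Hypothetical values for illustration.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(y)

mean = sum(y) / n
sum_sq = sum((yi - mean) ** 2 for yi in y)  # sum of squared deviations

pop_var = sum_sq / n          # population formula: divide by N
samp_var = sum_sq / (n - 1)   # sample formula: divide by n - 1
samp_sd = sqrt(samp_var)      # standard deviation, in the data's own units

print(pop_var, samp_var, samp_sd)
```

Dividing by 𝑛 - 1 always gives a slightly larger value than dividing by 𝑛, and the gap shrinks as the sample grows.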

26
Q

What is a bar plot and when is it used?

A

A bar plot represents frequency information across discrete categories or groups.

The height of each bar corresponds to the count or proportion of observations

27
Q

why are pie charts useful

A

Pie charts are useful for displaying frequency distributions across different groups.

28
Q

What is a histogram and what does it show?

A

A histogram is used to display continuous data by grouping values into bins.

The x-axis represents data bins, and the y-axis represents frequency.

It helps visualize the center, spread, and skewness of the data.

29
Q

how to find the median in a histogram

A

the median is the point where 50% of the area of a histogram is to the left and 50% to the right

30
Q

what is skewness

A

skewness is a measure of asymmetry about the mean.

31
Q

How can you tell if data is skewed using a histogram?

A

Right (positive) skewed: Long right tail, mean > median.

Left (negative) skewed: Long left tail, mean < median.

Symmetric distribution: Mean = median.
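
The mean and median comparison can be illustrated with three small made-up samples. This is a rule-of-thumb check, not a formal skewness statistic.

```python
from statistics import mean, median

# Hypothetical samples for illustration.
right_skewed = [1, 2, 2, 3, 3, 4, 20]        # long right tail
left_skewed = [-20, -4, -3, -3, -2, -2, -1]  # long left tail
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_skewed),
                     ("left", left_skewed),
                     ("symmetric", symmetric)]:
    # A single extreme value pulls the mean towards the tail,
    # while the median barely moves.
    print(name, mean(sample), median(sample))
```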

32
Q

How do you convert frequency to density in a histogram?

A

Density in interval 𝑖 = Frequency in interval 𝑖 / (Width of interval 𝑖 × Total number of observations)

This standardizes the histogram so that the total area sums to 1, making it easier to compare different distributions.
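
A sketch of the frequency-to-density conversion in plain Python, using made-up values and deliberately unequal bin widths. The `values` and `edges` lists are hypothetical. Bins are treated as closed on the left and open on the right, except the last bin, which also includes its right edge.

```python
# Hypothetical data and bin edges (bins: [1, 2), [2, 3), [3, 6]).
values = [1.2, 1.7, 2.1, 2.4, 2.8, 3.3, 3.9, 4.5, 5.5, 5.9]
edges = [1.0, 2.0, 3.0, 6.0]

# Count the observations falling in each bin.
counts = [0] * (len(edges) - 1)
for v in values:
    for i in range(len(counts)):
        is_last = i == len(counts) - 1
        if edges[i] <= v < edges[i + 1] or (is_last and v == edges[i + 1]):
            counts[i] += 1
            break

n = len(values)
widths = [edges[i + 1] - edges[i] for i in range(len(counts))]

# Density = frequency / (bin width * total number of observations).
densities = [c / (w * n) for c, w in zip(counts, widths)]

# The bar areas (density * width) add up to 1.
total_area = sum(d * w for d, w in zip(densities, widths))
print(counts, densities, total_area)
```

With unequal bin widths, bar heights based on raw frequency would exaggerate the wide bin. Density corrects for that.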

33
Q

What information does a box plot convey?

A

the lower limit of the box is the 25th percentile, the upper limit is the 75th percentile

the box spans the IQR

Median is a line inside the box.

Whiskers extend to the most extreme values, or only to the most extreme values within 1.5×IQR of the box when outliers are shown separately.

Outliers are plotted beyond the whiskers.
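
The quantities a box plot draws can be computed directly. This sketch assumes the common 1.5×IQR rule for whiskers and the default quartile method of `statistics.quantiles`. Plotting libraries may use a different quartile convention, so their boxes can differ slightly. The `data` list is made up.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

q1, med, q3 = quantiles(data, n=4)  # box edges and median line
iqr = q3 - q1

# Fences: the furthest a whisker may reach under the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Values beyond the fences are plotted individually as outliers.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, med, q3, whisker_low, whisker_high, outliers)
```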

34
Q

what do notched box plots include

A

Notched box plots show a confidence interval for the median.

35
Q

what are violin plots

A

A violin plot combines a box plot with a smoothed, sideways histogram:

Displays the median (often marked with a dot) and quartiles (box).

Shows the distribution shape to understand data spread.

36
Q

When should you use a cross-tabulation?

A

Used when both variables are qualitative or discrete with a small number of values.

Helps summarize relationships between categorical variables.
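
A cross-tabulation is a two-way frequency count. The sketch below builds one with `collections.Counter` from made-up (sector, rating) pairs. The `records` list and the category labels are hypothetical.

```python
from collections import Counter

# Hypothetical records: (sector, rating) for six firms.
records = [
    ("retail", "good"), ("retail", "poor"), ("construction", "good"),
    ("retail", "good"), ("manufacturing", "fair"), ("construction", "poor"),
]

# Count how often each combination of the two variables occurs.
table = Counter(records)

rows = sorted({sector for sector, _ in records})
cols = ["poor", "fair", "good"]  # keep the natural ordinal order

# Print the table: one row per sector, one column per rating.
print(f"{'':15s}" + "".join(f"{c:>6s}" for c in cols))
for r in rows:
    print(f"{r:15s}" + "".join(f"{table[(r, c)]:6d}" for c in cols))
```

Combinations that never occur are reported as 0, because a `Counter` returns 0 for missing keys.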

37
Q

How can histograms or box plots compare two variables?

A

If one variable is continuous and the other is discrete, use side-by-side histograms or box plots to compare groups.

38
Q

What is a scatter plot used for?

A

A scatter plot is used to visualize the relationship between two continuous variables by plotting:

Response variable on the y-axis

Explanatory variable on the x-axis

This helps identify trends, correlations, and patterns in data.

39
Q

What is a quilt plot?

A

A quilt plot is used for summarizing relationships between three continuous variables

The x and y axes form a grid of sections.

Each grid square is colored based on the average value of a third variable (e.g., water depth).

Useful for spatial analysis and heat maps.
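
The values behind a quilt plot can be sketched as a grid of cell averages. The example assumes made-up (x, y, depth) points and a hypothetical cell size of 1 unit. A plotting library would then map each cell's mean to a colour.

```python
from collections import defaultdict

# Hypothetical observations: (x, y, depth).
points = [
    (0.2, 0.3, 10.0), (0.8, 0.1, 14.0), (1.5, 0.4, 20.0),
    (0.4, 1.6, 30.0), (1.2, 1.9, 50.0),
]
cell_size = 1.0  # assumed width and height of each grid square

sums = defaultdict(float)
counts = defaultdict(int)
for x, y, depth in points:
    # Identify the grid square containing the point.
    cell = (int(x // cell_size), int(y // cell_size))
    sums[cell] += depth
    counts[cell] += 1

# Average of the third variable within each occupied square.
cell_means = {cell: sums[cell] / counts[cell] for cell in sums}
print(cell_means)
```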

40
Q

What can be learned from the random component?

A

Its values might follow a recognisable distribution (e.g. Normal).

It is used to decide whether the chosen fixed component is useful.

41
Q

What are the two components of data partitioning in linear models?

A

Fixed Component

Represents the systematic part of the data
Can be complex (e.g., includes multiple predictors)

Random Component

Represents random variation or error
Often follows a recognizable distribution (e.g., Normal)
Helps assess whether the fixed component is useful

Measurement = Fitted Value + Residual (the residual may be positive or negative)
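
The partition can be illustrated with a straight-line fit. This is only a sketch: it assumes a simple least-squares line as the fixed component, and the `x` and `y` values are made up. With a signed residual, each measurement equals its fitted value plus its residual.

```python
# Hypothetical explanatory (x) and response (y) values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fixed component: the fitted values. Random component: the residuals.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for yi, fi, ri in zip(y, fitted, residuals):
    print(f"{yi:6.2f} = {fi:6.2f} + {ri:6.2f}")
```

A histogram of the residuals is one way to check whether they look roughly Normal.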

42
Q

What are key visual summaries in data analysis?

A

Histograms (distribution)
Box plots (spread and outliers)
Scatter plots (relationships)