data summary Flashcards
what is quantitative data
Quantitative data measure some quantity resulting in a numerical value, e.g. weight, salary.
what is qualitative data
Qualitative data measure the quality of something resulting in a value that does not have a numerical meaning, e.g. colour, religion, season.
what is discrete quantitative data
Discrete: data that can take only a distinct series of values (e.g. number of traffic accidents, number of children born to a woman).
what is continuous quantitative data
Continuous: a value that can be measured ever more precisely and so can fall anywhere within a range (e.g. height, speed).
what is ordinal qualitative data
Ordinal: non-numeric value but the values have some natural ordering; e.g. poor, fair, good, excellent.
what is nominal qualitative data
Nominal: unordered, distinct by name only; e.g. retail, construction, manufacturing.
what is a frequency distribution
A frequency distribution summarizes discrete variables or qualitative data by counting how often each value occurs.
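As a minimal sketch of the counting step, the example below tallies a small discrete variable. The data values and the name `children` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: number of children per household (discrete data).
children = [0, 2, 1, 2, 3, 2, 0, 1, 2, 1]

# Count how often each distinct value occurs.
frequency = Counter(children)

# Print the frequency distribution in order of value.
for value in sorted(frequency):
    print(value, frequency[value])
```

The counts always add up to the number of observations, which is a quick check on any frequency table.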
what is the mode
The mode is the most frequently occurring value in a dataset.
What is a bimodal distribution?
A bimodal distribution has two distinct peaks in the frequency of values.
What are the 3 measures of centre in statistics?
mode
mean
median
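The three measures can be compared on one small sample. This is a sketch using Python's standard `statistics` module; the sample values are invented for illustration.

```python
import statistics

# Hypothetical sample, chosen so the three measures differ.
values = [2, 3, 3, 5, 7, 10]

mode = statistics.mode(values)      # most frequent value
mean = statistics.mean(values)      # sum of the values divided by the count
median = statistics.median(values)  # middle of the ordered data

# multimode returns every value tied for most frequent,
# which is one way to spot two modes in a small dataset.
modes = statistics.multimode([1, 1, 2, 5, 5])

print(mode, mean, median, modes)
```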
4 measures of spread
range
interquartile range (IQR)
sample variance
standard deviation.
Why is it important to know both the centre and spread of a dataset?
Knowing both provides a better understanding of the data’s behavior. The center gives us a “typical” value, while the spread tells us how much variability or dispersion exists in the data.
what is the population mean and sample mean
The population mean is a parameter (𝜇) which is typically unknown
we take a sample and obtain an estimate (𝜇̂), the sample mean
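how to find the position of the median for an even or odd sample size
even: the average of the values at positions 𝑛/2 and (𝑛 + 2)/2 in the ordered data
odd: the value at position (𝑛 + 1)/2 in the ordered data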
𝑛 - sample size
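A short sketch of the position rule, assuming 1-based positions in the ordered data: for odd n the median is the value at position (n + 1)/2, and for even n it is the average of the values at positions n/2 and (n + 2)/2. The function name `median_by_position` is hypothetical.

```python
def median_by_position(values):
    """Find the median using 1-based positions in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: single middle value at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n / 2 and (n + 2) / 2.
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


print(median_by_position([5, 1, 3]))
print(median_by_position([4, 1, 3, 2]))
```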
what is the range
The range is the difference between the maximum and minimum value.
one disadvantage of range
It can be misleading when one value (an outlier) is very different from the rest, because the range depends only on the two most extreme values.
what is an outlier
An outlier is a value that is very different to the other values recorded.
What are percentiles and how are they used?
Percentiles: Values that divide the dataset into 100 equal parts.
25th percentile (lower quartile or 1st quartile): 25% of data lies below it.
75th percentile (upper quartile or 3rd quartile): 75% of data lies below it.
what is the interquartile range
The difference between the 75th percentile and 25th percentile, representing the spread of the middle 50% of data.
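A sketch of the quartiles and the IQR using `statistics.quantiles`. Software packages use slightly different interpolation rules for percentiles, so other tools may give somewhat different quartiles for the same data; the sample is invented.

```python
import statistics

# Hypothetical ordered sample of 11 values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)

# The IQR is the spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```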
population variance formula
𝜎² = ∑(𝑦𝑖 - 𝜇)² / 𝑁
𝑁: population size
𝑦𝑖: each value.
what does variance measure
Measures how far the data are spread around the population mean (𝜇), as the average squared deviation from it.
What is sample variance and how is it different from population variance?
Measures the spread of data from the sample mean (𝜇̂).
Sample variance divides by (𝑛 - 1) instead of 𝑁 to correct for bias in estimating population variance
sample variance formula
𝑠² = ∑(𝑦𝑖 - 𝜇̂)² / (𝑛 - 1)
where n-1 is the degrees of freedom
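The two formulas differ only in the divisor. The sketch below writes both out directly and treats the same made-up values first as a whole population and then as a sample; the function names are hypothetical.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((y - mu) ** 2 for y in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mu_hat = sum(values) / n
    return sum((y - mu_hat) ** 2 for y in values) / (n - 1)


# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 32 / 8
print(sample_variance(data))      # 32 / 7
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the gap shrinks as the sample grows.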
why do we use standard deviation
Variance is expressed in squared units (e.g. cm²), so we take its square root to get a measure of spread in the original units of the data.
standard deviation formula
𝑠 = √(𝑠²)
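A small sketch of the units argument, assuming hypothetical heights measured in centimetres: the sample variance comes out in cm², and its square root is back in cm.

```python
import math
import statistics

# Hypothetical heights in centimetres.
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]

# Sample variance (divides by n - 1); its unit is cm squared.
variance_cm2 = statistics.variance(heights_cm)

# The standard deviation is the square root of the variance; its unit is cm.
sd_cm = math.sqrt(variance_cm2)

print(variance_cm2, sd_cm)
```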
What is a bar plot and when is it used?
A bar plot represents frequency information across discrete categories or groups.
The height of each bar corresponds to the count or proportion of observations
why are pie charts useful
Pie charts are useful for displaying frequency distributions across different groups.
What is a histogram and what does it show?
A histogram is used to display continuous data by grouping values into bins.
The x-axis represents data bins, and the y-axis represents frequency.
It helps visualize the center, spread, and skewness of the data.
how to find the median in a histogram
the median is the point where 50% of the area of a histogram is to the left and 50% to the right
what is skewness
skewness is a measure of asymmetry about the mean.
How can you tell if data is skewed using a histogram?
Right (positive) skewed: Long right tail, mean > median.
Left (negative) skewed: Long left tail, mean < median.
Symmetric distribution: Mean ≈ median.
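The mean-versus-median comparison can be sketched as a rough check. It is only a rule of thumb, not a formal skewness statistic; the function name `skew_direction` and the `tolerance` parameter are hypothetical.

```python
import statistics


def skew_direction(values, tolerance=0.0):
    """Rule-of-thumb skew check: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean - median > tolerance:
        return "right"  # long right tail pulls the mean above the median
    if median - mean > tolerance:
        return "left"   # long left tail pulls the mean below the median
    return "symmetric"


print(skew_direction([1, 2, 2, 3, 3, 3, 4, 20]))
print(skew_direction([1, 17, 18, 18, 19, 19, 20]))
print(skew_direction([1, 2, 3, 4, 5]))
```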
How do you convert frequency to density in a histogram?
Density in interval 𝑖 = Frequency in interval 𝑖 / (Width of interval 𝑖 × Total number of observations)
This standardizes the histogram so that the total area sums to 1, making it easier to compare different distributions.
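A sketch of the conversion with unequal bin widths, which is where density matters most. The function name `histogram_density`, the data and the bin edges are assumptions for illustration; each bin includes its left edge, and the last bin also includes its right edge.

```python
def histogram_density(values, edges):
    """Return the density of each bin: count / (bin width * total count)."""
    n = len(values)
    last_bin = len(edges) - 2
    densities = []
    for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        # Count values in [left, right); the last bin also includes right.
        count = sum(
            1 for v in values
            if left <= v < right or (i == last_bin and v == right)
        )
        densities.append(count / ((right - left) * n))
    return densities


# Hypothetical data and bin edges (bins of width 2, 2 and 6).
values = [1, 2, 2, 3, 5, 6, 7, 9]
edges = [0, 2, 4, 10]

densities = histogram_density(values, edges)

# Total area = sum of density * bin width over all bins.
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
area = sum(d * w for d, w in zip(densities, widths))

print(densities, area)
```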
What information does a box plot convey?
the lower limit of the box is the 25th percentile, the upper limit is the 75th percentile
the box spans the IQR
Median is a line inside the box.
Whiskers extend to the most extreme values that lie within 1.5×IQR of the box (or to the minimum and maximum if no outlier rule is applied).
Outliers are plotted beyond the whiskers.
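The numbers behind a box plot can be sketched as follows, assuming the common 1.5×IQR rule for whiskers. Plotting libraries may compute quartiles slightly differently, and the data are invented.

```python
import statistics

# Hypothetical sample with one unusually large value.
data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])

# Box: lower quartile, median, upper quartile.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences: the furthest a whisker may reach under the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is plotted as an outlier.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```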
what do notched box plots include
Notched box plots show a confidence interval for the median.
what are violin plots
A violin plot combines a box plot with a smoothed, sideways histogram:
Displays the median (red dot) and quartiles (box).
Shows the distribution shape to understand data spread.
When should you use a cross-tabulation?
Used when both variables are qualitative or discrete with a small number of values.
Helps summarize relationships between categorical variables.
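A cross-tabulation is a frequency table over pairs of values. The sketch below counts made-up (industry, rating) records; the category names are borrowed from the nominal and ordinal examples, but the records themselves are hypothetical.

```python
from collections import Counter

# Hypothetical records: (industry, rating) pairs.
records = [
    ("retail", "good"),
    ("retail", "poor"),
    ("construction", "good"),
    ("retail", "good"),
    ("manufacturing", "fair"),
]

# Count how often each combination of the two variables occurs.
table = Counter(records)

industries = sorted({industry for industry, _ in records})
ratings = ["poor", "fair", "good"]  # keep the natural ordering

# One row per industry, one column per rating.
for industry in industries:
    row = [table[(industry, rating)] for rating in ratings]
    print(industry, row)
```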
How can histograms or box plots compare two variables?
If one variable is continuous and the other is discrete, use side-by-side histograms or box plots to compare groups.
What is a scatter plot used for?
A scatter plot is used to visualize the relationship between two continuous variables by plotting:
Response variable on the y-axis
Explanatory variable on the x-axis
This helps identify trends, correlations, and patterns in data.
What is a quilt plot?
A quilt plot is used for summarizing relationships between three continuous variables
The x and y axes form a grid of sections.
Each grid square is colored based on the average value of a third variable (e.g., water depth).
Useful for spatial analysis and heat maps.
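The grid-averaging step behind a quilt plot can be sketched without any plotting. The function name `quilt_grid`, the square cell size and the points are assumptions for illustration; a plotting library would then colour each cell by its average.

```python
from collections import defaultdict


def quilt_grid(points, cell_size):
    """Average the third variable z within each (x, y) grid square."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, z in points:
        # Identify the grid square that contains the point.
        cell = (int(x // cell_size), int(y // cell_size))
        sums[cell] += z
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical (x, y, depth) measurements.
points = [(0.5, 0.5, 10.0), (0.7, 0.2, 20.0), (1.5, 0.5, 30.0)]

averages = quilt_grid(points, cell_size=1.0)
print(averages)
```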
what can be seen from a random component
values might follow a recognisable distribution (e.g. Normal)
used to decide if the chosen fixed component is useful
What are the two components of data partitioning in linear models?
Fixed Component
Represents the systematic part of the data
Can be complex (e.g., includes multiple predictors)
Random Component
Represents random variation or error
Often follows a recognizable distribution (e.g., Normal)
Helps assess whether the fixed component is useful
Measurement = Fitted Value + Residual (the residual can be positive or negative)
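The partition can be checked numerically. This sketch assumes a straight-line least-squares fit as the fixed component, which is only one possible choice, and the data are invented. Each residual is the measurement minus its fitted value, so adding the two back together recovers the measurement.

```python
# Hypothetical explanatory (x) and response (y) values.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope and intercept for a straight line.
slope = (
    sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    / sum((xi - x_mean) ** 2 for xi in x)
)
intercept = y_mean - slope * x_mean

# Fixed component: the fitted values from the line.
fitted = [intercept + slope * xi for xi in x]

# Random component: what is left over after the fit.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for yi, fi, ri in zip(y, fitted, residuals):
    print(yi, round(fi, 3), round(ri, 3))
```

Looking at the residuals, for example in a histogram, is how one would judge whether they follow a recognisable distribution.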
What are key visual summaries in data analysis?
Histograms (distribution)
Box plots (spread and outliers)
Scatter plots (relationships)