Examining numerical data Flashcards
Scatterplot
A graphical representation of data for two numerical variables, where each point represents a single case. It helps visualize relationships between variables.
Dot Plot
A simple graphical display of a single numerical variable where each data point is represented by a dot, often stacked to show frequency.
Mean
The mean, or average, is a measure of the center of a data distribution. It is calculated by summing all observations and dividing by the number of observations. The mean of a sample is written x̄.
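A minimal sketch of the calculation, using a small hypothetical sample (the values and variable names are illustrative, not from the source):

```python
# Hypothetical sample of five observations.
observations = [2, 4, 4, 5, 10]

# Sample mean: sum of all observations divided by the number of observations.
x_bar = sum(observations) / len(observations)
print(x_bar)  # 5.0
```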
Histogram
A histogram is a graphical representation of data where observations are grouped into bins, and the frequency of observations in each bin is represented by the height of a bar. It provides an overview of the distribution of numerical data, especially useful for large datasets.
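A rough text sketch of the binning step, assuming a hypothetical data set and an arbitrarily chosen bin width of 5. Each bin here includes its left edge and excludes its right edge, which is one common convention rather than the only one:

```python
from collections import Counter

# Hypothetical non-negative observations and an assumed bin width.
values = [1, 2, 2, 3, 7, 8, 8, 9, 12, 14]
bin_width = 5

# Label each observation by the left edge of its bin: [0, 5), [5, 10), [10, 15).
counts = Counter((v // bin_width) * bin_width for v in values)

# Draw one row per bin; the row length plays the role of the bar height.
for left in sorted(counts):
    print(f"[{left}, {left + bin_width}): {'#' * counts[left]}")
```

Changing `bin_width` changes how much detail the display shows, so it is worth trying more than one value.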
Data Density
How concentrated the data are in different regions of a histogram. Higher bars mark the regions where observations are more common.
Right Skewed
A data distribution with a longer tail on the right side, meaning most values are concentrated on the left.
Left Skewed
A data distribution with a longer tail on the left side, meaning most values are concentrated on the right.
Symmetric
A data distribution where the left and right sides are approximately mirror images, with no long tail on either side.
Mode
A prominent peak in a distribution, representing the most frequent value or range of values in a data set.
Unimodal
A distribution with a single prominent peak.
Bimodal
A distribution with two prominent peaks.
Multimodal
A distribution with more than two prominent peaks.
Deviation
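The distance of an observation from the mean, computed as the observation minus the mean (xᵢ - x̄). It is positive for observations above the mean and negative for observations below it.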
Standard Deviation
A measure of variability that describes how far the typical observation is from the mean, calculated as the square root of the variance.
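A minimal sketch of how deviations lead to the standard deviation, using a hypothetical sample. It assumes the sample variance, which divides the sum of squared deviations by n - 1; the card itself does not say which divisor is intended:

```python
import math

# Hypothetical sample.
observations = [2, 4, 4, 5, 10]
n = len(observations)
x_bar = sum(observations) / n

# Deviation of each observation from the mean.
deviations = [x - x_bar for x in observations]

# Sample variance: sum of squared deviations divided by n - 1 (assumed convention).
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)
print(deviations, variance, sd)
```

The deviations always sum to zero, which is why they are squared before averaging.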
Box Plot
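A graphical summary of a data set using five statistics (the lower whisker end, first quartile, median, third quartile, and upper whisker end) while also plotting unusual observations individually. The whisker ends equal the minimum and maximum only when no observations are flagged as outliers.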
Median
The median is the value that splits the data in half, with 50% of the observations falling below and 50% falling above it. If the number of observations is even, the median is the average of the two middle values. If the number of observations is odd, the median is the middle value itself.
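The odd and even cases can be sketched as follows. The function name `median` and the sample values are illustrative only:

```python
def median(values):
    """Return the middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```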
Interquartile Range (IQR)
The interquartile range (IQR) is the distance between the first quartile (Q1) and the third quartile (Q3): IQR = Q3 - Q1. It spans the middle 50% of the data and measures the variability in the central portion of the data set.
First Quartile (Q1)
The first quartile (Q1) is the 25th percentile of the data, meaning that 25% of the data points fall below this value.
Third Quartile (Q3)
The third quartile (Q3) is the 75th percentile of the data, meaning that 75% of the data points fall below this value.
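A short sketch of computing Q1, Q3, and the IQR with Python's `statistics.quantiles` (available from Python 3.8). Software packages use several slightly different quartile conventions; this example relies on the function's default method, so other tools may give slightly different values on the same hypothetical data:

```python
import statistics

# Hypothetical sample of eleven observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

# IQR: distance between the third and first quartiles.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```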
Whiskers
Whiskers extend out from each end of the box and attempt to capture the data outside the box (the middle 50%), but they reach no farther than 1.5 × IQR beyond the first or third quartile. Each whisker ends at the most extreme observation within that reach. Whiskers help show the spread of the data; any data points beyond them are plotted individually and considered outliers.
Outliers
Outliers are data points that lie beyond the whiskers of a box plot, meaning they are unusually distant from the rest of the data. They are often identified as points that fall more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.
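The 1.5 × IQR rule can be sketched as below, assuming a hypothetical sample and the default quartile method of `statistics.quantiles`. The names `lower_limit` and `upper_limit` are illustrative labels for the farthest points the whiskers may reach:

```python
import statistics

# Hypothetical sample with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Farthest reach allowed for each whisker.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations within the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
lower_whisker = min(inside)
upper_whisker = max(inside)

# Anything beyond the limits is flagged as an outlier.
outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(lower_whisker, upper_whisker, outliers)
```

Note that the upper whisker stops at 10, the largest observation inside the limit, rather than at the limit itself.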
Robust Statistics
Robust statistics, such as the median and interquartile range (IQR), are resistant to the influence of extreme observations. These statistics are less affected by outliers or unusual data points, making them more stable in the presence of extreme values compared to the mean and standard deviation.
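A small illustration of that resistance, using two hypothetical samples that differ only in their largest value:

```python
import statistics

# Hypothetical samples: the second replaces the largest value with an extreme one.
typical = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 100]

# How much each statistic moves when the extreme value is introduced.
mean_shift = statistics.mean(with_extreme) - statistics.mean(typical)
median_shift = statistics.median(with_extreme) - statistics.median(typical)

print(mean_shift, median_shift)
```

The mean moves by 19 while the median does not move at all.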
Transformation
A transformation is a rescaling of data using a function, such as taking the logarithm, square root, or inverse of a variable. It is particularly useful for strongly skewed data: it can pull in a long tail, reduce the influence of extreme observations, and make the data easier to model. For example, a logarithmic transformation can make right-skewed data more symmetric and can straighten a nonlinear relationship in a scatterplot, revealing patterns that were hard to see on the original scale.
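A minimal sketch of a log transformation on hypothetical right-skewed data. It assumes all values are positive, since the logarithm is undefined otherwise, and uses base 10 purely for readability:

```python
import math
import statistics

# Hypothetical right-skewed values spanning several orders of magnitude.
skewed = [1, 10, 100, 1000, 10000]

# Apply a base-10 logarithm to every observation.
logged = [math.log10(x) for x in skewed]

# Before: the mean sits far above the median, a sign of right skew.
print(statistics.mean(skewed), statistics.median(skewed))

# After: the mean and median are close, as expected for symmetric data.
print(statistics.mean(logged), statistics.median(logged))
```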
Intensity Map
An intensity map is a geographic visualization where colors represent varying values of a numerical variable across different locations. It is used to display trends and patterns in data, particularly for variables that have spatial characteristics, like poverty rate, unemployment rate, or homeownership rate. Intensity maps help identify geographic trends and generate research hypotheses, though they are not ideal for pinpointing precise values for specific locations.