Exploring Data Flashcards
Descriptive Methods
Ways that data are organized and summarized
Categorical/Discrete Variable
A variable that can take on only a finite or countable set of distinct values (categories or counts)
Continuous Variable
A variable that can take on any value within an interval
Univariate Data
Data consisting of a single measurement (one variable) on each individual
Bivariate Data
Data consisting of two measurements (two variables) on each individual
Frequency (f)
The number of times an observation occurs
Relative Frequency (rf)
The ratio of a frequency to the total number of observations (n) (rf = f/n)
Cumulative Frequency (cf)
Gives the number of observations less than or equal to a specified value
Frequency Distribution Table
A table giving all possible values of a variable and their frequencies
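As a rough illustration of the four cards above, the sketch below builds a frequency distribution table with f, rf = f/n, and cf for each observed value. The function name `frequency_table` and the sibling counts are made up for this example, and values that never occur in the data are simply left out of the table.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(observations):
    """Return rows of (value, f, rf, cf), sorted by value."""
    counts = Counter(observations)
    n = len(observations)
    values = sorted(counts)
    freqs = [counts[v] for v in values]
    cum_freqs = list(accumulate(freqs))  # running total of f
    return [(v, f, f / n, cf) for v, f, cf in zip(values, freqs, cum_freqs)]

# Hypothetical data: number of siblings reported by 10 students
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]
table = frequency_table(siblings)

print(f"{'value':>5} {'f':>3} {'rf':>5} {'cf':>3}")
for value, f, rf, cf in table:
    print(f"{value:>5} {f:>3} {rf:>5.2f} {cf:>3}")
```

The last cf always equals n, and the rf column sums to 1, which makes a quick check on a hand-built table.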
Center
Describes the point around which the data points are spread
Spread
Describes how the data points are spread (Broadness/Narrowness of the Distribution)
Shape
Describes what the distribution looks like
Symmetric Distribution
The left half of the distribution looks the same as the right half
Left-Skewed Distribution
When the left half of the distribution extends further from the center than its right half
Right-Skewed Distribution
When the right half of the distribution extends further from the center than its left half
Clusters/Gaps
Describes whether there are gaps in the data and whether the data tend to cluster around particular values in the distribution
Outliers
Observations that are surprisingly different from the rest of the data
Population
The entire group of individuals or things that we are interested in
Sample
The part of the population that is being studied
Mean (mu or X bar)
The sum of all values in a data set divided by the number of values; mu for a population, X bar for a sample (Is affected by outliers)
Median (Q2 or M)
The middle value of the ordered measurements, with half of the data at or below it and half at or above it (Not affected by outliers)
Range
The difference between the largest and smallest measurements in a data set (Is affected by outliers)
Interquartile Range (IQR)
The range of the middle 50% of the data (IQR = Q3 - Q1); used along with the median when describing distributions
Standard Deviation (Sigma or S)
Measures the typical distance of the data points from the mean; Sigma for a population, S for a sample (Is also affected by outliers); used along with the mean when describing distributions
Variance (Sigma Squared or S Squared)
The square of the standard deviation
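The measures of center and spread above can be compared side by side. This is a minimal sketch using Python's standard `statistics` module; the helper `summarize` and the quiz scores are hypothetical. Quartile conventions differ between textbooks and software, so the IQR here follows the module's default method and may not match a hand calculation exactly.

```python
import statistics

def summarize(data):
    """Return common measures of center and spread for a list of numbers."""
    # Quartile conventions vary; this uses the default ("exclusive")
    # method of statistics.quantiles.
    q1, _q2, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "stdev": statistics.stdev(data),        # sample standard deviation (S)
        "variance": statistics.variance(data),  # sample variance (S squared)
    }

# Hypothetical quiz scores, then the same scores plus one extreme value
scores = [2, 4, 4, 5, 6, 7, 8]
with_outlier = scores + [40]

before = summarize(scores)
after = summarize(with_outlier)
for name in before:
    print(f"{name:>8}: {before[name]:7.2f} -> {after[name]:7.2f}")
```

Adding the single value 40 moves the mean, range, and standard deviation far more than it moves the median and IQR, which is why the median and IQR are usually paired when a distribution has outliers.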
Quartiles
The three values (Q1, Q2, Q3) that divide an ordered data set into four equal parts
Percentiles
The values that divide an ordered set of values into 100 equal parts
Standardized Scores (z-scores)
Tell how many standard deviations away from the mean a specific data point is (z = (measurement - mean) / standard deviation)
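A small sketch of standardizing a data set, assuming the data are treated as a whole population so that the population standard deviation is the divisor; swap in `statistics.stdev` for a sample. The function name `z_scores` and the values are illustrative only.

```python
import statistics

def z_scores(data):
    """Return (x - mean) / standard deviation for every x in data."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD; use statistics.stdev for a sample
    return [(x - mean) / sd for x in data]

# Hypothetical measurements with mean 5 and population SD 2
values = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(values)

for x, z in zip(values, standardized):
    print(f"x = {x}: z = {z:+.2f}")
```

A negative z-score marks a point below the mean and a positive one a point above it; here the value 9 sits 2 standard deviations above the mean.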
Pearson’s Correlation Coefficient
A numeric measure of the degree and direction of the linear relation between two quantitative variables
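One common way to compute the coefficient, sketched from the usual sums of squares and cross products. The function name `pearson_r` and the study-hours data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and quiz score for five students
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
print(f"r = {r:.3f}")
```

The sign of r gives the direction of the linear relation and its magnitude (between 0 and 1) gives the strength; r says nothing about curved relations.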
Linear Regression Model/Equation
An equation that gives a straight-line relationship between two variables (Y = Beta 0 + Beta 1(X) + e, where Beta 0 is the intercept, Beta 1 is the slope, and e is the random error)
Least Squares Regression Line
The line that minimizes the sum of the squared residuals
Coefficient of Determination
Measures the proportion (often reported as a percent) of the variation in y that is explained by its linear relation with x (r^2)
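The regression cards above can be tied together in a short sketch: estimate the intercept and slope by least squares, then compute the coefficient of determination as 1 - SSE/SST. The function names and the data are assumptions made for illustration.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Proportion of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data: hours studied and quiz score for five students
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

b0, b1 = least_squares_fit(hours, score)
r2 = r_squared(hours, score, b0, b1)
print(f"predicted score = {b0:.2f} + {b1:.2f} * hours, r^2 = {r2:.2f}")
```

For these made-up numbers the line explains about 60% of the variation in the scores; for a straight-line fit, r^2 is also the square of Pearson's correlation coefficient.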
Influential Observation
An observation that strongly affects a statistic, such as the slope of a regression line, so that removing it would change the result markedly
Residual Plot
A plot of residuals versus the predicted values of y (used to assess the fit of a model)
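To show what a residual plot is built from, the sketch below fits a least squares line to deliberately curved data and lists each predicted value with its residual (observed minus predicted). The helper name `residuals_vs_predicted` and the data are hypothetical, and a real plot would put these pairs on a scatterplot.

```python
def residuals_vs_predicted(xs, ys):
    """Fit a least squares line and return (predicted, residual) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    # residual = observed y minus predicted y
    return [(p, y - p) for p, y in zip(predicted, ys)]

# Hypothetical curved data: y = x squared
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

pairs = residuals_vs_predicted(xs, ys)
for predicted, residual in pairs:
    print(f"predicted = {predicted:6.2f}, residual = {residual:+.2f}")
```

The residuals run positive, negative, then positive again. A curved pattern like this suggests that a straight line is a poor fit, whereas residuals scattered randomly around zero support the linear model.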
Transformation
A function applied to one or both variables, used to achieve linearity
Log Transformation
Used to linearize the regression model when the relationship between Y and X suggests a model with a consistently increasing slope (Z = ln(Y))
Square Root Transformation
Used when the spread of observations increases with the mean (Z = Square Root of Y = Y^(1/2))
Reciprocal Transformation
Used to minimize the effect of large values of X (Z = 1/Y = Y^(-1))
Square Transformation
Used when the slope of the relation consistently decreases as the independent variable increases (Z = Y^2)
Power Transformation
Used if the relation between dependent and independent variables is modeled by Y = aX^b; taking ln(Y) and ln(X) gives the linear form ln(Y) = ln(a) + b ln(X)
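A sketch of the power transformation in practice: take natural logs of both variables, fit a least squares line to the transformed points, and convert the intercept back to a. The function name `fit_power_model` is made up, the data are generated from Y = 3X^2 purely for illustration, and all values must be positive for the logs to exist.

```python
import math

def fit_power_model(xs, ys):
    """Estimate a and b in Y = a * X**b from a line fitted to (ln X, ln Y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                          # slope of the log-log line
    a = math.exp(mean_ly - b * mean_lx)    # intercept is ln(a)
    return a, b

# Hypothetical data that follow Y = 3 * X**2 exactly
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

a, b = fit_power_model(xs, ys)
print(f"Y is approximately {a:.2f} * X^{b:.2f}")
```

The slope of the log-log line estimates the exponent b, and the intercept estimates ln(a). With real, noisy data the estimates would only approximate the underlying values.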
Conditional Relative Frequency
The relative frequency of one category given that the other category has occurred
Association
Measures the degree of relation between two categorical variables
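Conditional relative frequencies can be computed from paired categorical observations, as in this sketch. The function name, the survey categories, and the counts are all hypothetical.

```python
from collections import Counter

def conditional_relative_frequencies(pairs):
    """Given (group, response) pairs, return {group: {response: rf within group}}."""
    joint = Counter(pairs)
    group_totals = Counter(group for group, _ in pairs)
    return {
        group: {
            response: joint[(g, response)] / group_totals[group]
            for (g, response) in joint
            if g == group
        }
        for group in group_totals
    }

# Hypothetical survey: class year and how each student travels to school
survey = (
    [("junior", "car")] * 3 + [("junior", "bus")] * 7
    + [("senior", "car")] * 8 + [("senior", "bus")] * 2
)

cond = conditional_relative_frequencies(survey)
for group, dist in cond.items():
    for response, rf in dist.items():
        print(f"rf({response} given {group}) = {rf:.2f}")
```

In this made-up survey, 30% of juniors but 80% of seniors travel by car. Conditional relative frequencies that differ this much between groups are an informal sign of association between the two categorical variables.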