Week 3 - Scatterplots and Correlation and Least Squares Regression Flashcards
Response Variable
Measures an outcome of a study. It is the variable measured or observed to see how it changes in response to another variable, the explanatory variable. It is the outcome or effect that researchers are trying to understand or predict, and it is often called the dependent variable.
Explanatory Variable
May help explain or influence changes in the response variable. It is used to explain or predict changes in another variable, the response variable. In a cause-and-effect relationship it is the variable thought to be the cause. It is also called the independent variable or predictor variable, depending on the context.
Scatterplot
Shows the relationship between two quantitative variables measured on the same individuals. The values of one variable are on the horizontal axis, the other on the vertical axis. Each individual appears as a point whose position on the x and y axes corresponds to its values of the two variables, which allows visual identification of trends, patterns, and potential correlations.
Direction
The movement of the overall pattern of the graph. The overall trend of the relationship between two variables, indicating whether they tend to increase or decrease together: positive (both variables move in the same direction) or negative (the variables move in opposite directions). It is often represented by the sign of the correlation coefficient, with a positive sign for a positive direction and a negative sign for a negative direction.
Form
The shape of any pattern (linear, curved, etc.). The form of a relationship refers to whether the relationship is straight or curved. A straight relationship is called linear because it approximates a straight line.
Strength
How closely the points follow a clear form. The degree of the relationship between two variables, indicating how closely the data points cluster around a line of best fit and thus how well one variable can predict the other. For linear relationships it is usually measured with the correlation coefficient, with values closer to +1 or -1 signifying a stronger relationship.
DOFS
An acronym for “Direction, Outliers, Form, and Strength,” the key aspects to consider when describing the relationship between two variables on a scatterplot: the direction of the association, the presence of outliers, the overall pattern (linear or curved), and the strength of the relationship between the variables. Used to help describe a scatterplot in context.
Direction:
Whether the relationship is positive (as one variable increases, the other increases) or negative (as one variable increases, the other decreases).
Outliers:
Data points that significantly deviate from the overall pattern.
Form:
Whether the relationship appears linear (forming a straight line) or non-linear (curved).
Strength:
How closely the data points cluster around the line of best fit, indicating the degree of association between the variables.
Positive Association
When above-average values of one variable tend to occur with above-average values of the other variable, and below-average values also tend to occur together. Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
Negative Association
When above-average values of one variable tend to occur with below-average values of the other variable, and vice versa. When one variable increases, the other tends to decrease, so the two variables move in opposite directions.
Correlation
Denoted by the letter r. Measures the direction and strength of the linear relationship between two quantitative variables. It can be described as either strong or weak, and as either positive or negative, and it always lies between -1 and +1. It expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). Note: Correlation does not imply causation.
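As a rough illustration of how r can be computed, here is a minimal sketch in plain Python. The function name `correlation` and the `hours`/`scores` data are made up for this example and are not part of the flashcards.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = correlation(hours, scores)
print(round(r, 3))  # positive and close to +1: a strong positive linear relationship
```

A value near +1 or -1 says the points lie close to a straight line; it says nothing about whether one variable causes the other.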
Regression Line
A line that describes how a response variable (y) changes as an explanatory variable (x) changes. A straight line that represents the best-fit relationship between a dependent variable and an independent variable, allowing you to predict the value of the dependent variable from a known value of the independent variable. It shows the trend of the data points in a scatterplot by minimizing the overall distance between the line and the data points.
Predicted Value (ŷ, or “y-hat”)
A value obtained by substituting a particular value of x into a regression equation. It is the estimated value of the dependent variable (y) that the regression line gives for a specific value of the independent variable (x).
Slope
The amount y is predicted to change given a one-unit change in x. The rate of change in a linear relationship between two variables, calculated as the change in the dependent variable (y) divided by the change in the independent variable (x). It is often described as “rise over run” on a graph and is a key component of a regression line.
Y-Intercept
The predicted value of y when x = 0. It is the point where the regression line crosses the y-axis, giving the predicted value of the dependent variable when the independent variable equals zero. It is a key component in interpreting a regression line.
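To make the three cards above concrete, here is a small sketch in Python. The equation ŷ = 46 + 6x and the helper name `predict` are hypothetical, chosen only for illustration.

```python
def predict(intercept, slope, x):
    """Predicted value y-hat = intercept + slope * x."""
    return intercept + slope * x

# Hypothetical regression equation: y-hat = 46 + 6x
INTERCEPT = 46.0
SLOPE = 6.0

y_hat_at_4 = predict(INTERCEPT, SLOPE, 4)  # substitute x = 4 into the equation
y_hat_at_0 = predict(INTERCEPT, SLOPE, 0)  # prediction at x = 0 is the y-intercept
# Predicted change in y for a one-unit increase in x is the slope
change_per_unit = predict(INTERCEPT, SLOPE, 5) - predict(INTERCEPT, SLOPE, 4)

print(y_hat_at_4, y_hat_at_0, change_per_unit)
```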
Extrapolation
Using a regression line to predict a value of y using a value of x far outside the interval of observed x values. Extrapolation is often inaccurate. It predicts values outside the range of known data by extending the patterns observed within the data and assuming the current trend will continue beyond the data set’s boundaries, much like forecasting future values from historical patterns.
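One simple way to guard against extrapolation is to check whether an x value falls inside the observed range before predicting. This is a sketch only: the helper `is_extrapolation` and the data are hypothetical, and it flags any x outside the range, not just values "far" outside.

```python
def is_extrapolation(x, observed_xs):
    """True when x lies outside the interval of observed x values."""
    return x < min(observed_xs) or x > max(observed_xs)

observed_hours = [1, 2, 3, 4, 5]  # hypothetical observed x values

print(is_extrapolation(3, observed_hours))   # inside the observed interval
print(is_extrapolation(20, observed_hours))  # well beyond the observed interval
```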
Residual
The difference between the observed value of the response variable and the value predicted by the regression line (residual = observed y - predicted y). It measures how much a data point deviates from the model’s prediction.
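A short Python sketch of the calculation, observed minus predicted, for each point. The data and the line ŷ = 46 + 6x are hypothetical values used only for this example.

```python
def residuals(xs, ys, intercept, slope):
    """Residual for each point: observed y minus predicted y."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data and a hypothetical regression line y-hat = 46 + 6x
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
res = residuals(hours, scores, 46.0, 6.0)
print(res)  # positive: point above the line; negative: point below the line
```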
Least-squares regression line
A line which minimizes the sum of the squared residuals. A straight line that best fits a set of data points on a scatterplot, determined by minimizing the sum of the squared vertical distances between the data points and the line. It is often used to predict the value of a dependent variable from an independent variable, and it is an appropriate summary when the data show a linear relationship between the two variables. The term least squares is used because this line has the smallest possible sum of squared errors (residuals).
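The sketch below fits a least-squares line using the standard formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x), then checks that a slightly different line has a larger sum of squared residuals. Function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

intercept, slope = least_squares_line(hours, scores)
best_sse = sum_squared_residuals(hours, scores, intercept, slope)
# Any other line, here one with a slightly steeper slope, does worse
other_sse = sum_squared_residuals(hours, scores, intercept, slope + 0.5)
print(intercept, slope, best_sse, other_sse)
```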
Residual Plot
A scatterplot plotting the residuals against the explanatory variable. It displays the residuals (the differences between the actual values of the dependent variable and the values predicted by the regression line) on the vertical axis and the independent variable on the horizontal axis. It is used to assess how well a linear regression model fits the data: a random scatter indicates a good fit, and any noticeable pattern suggests potential problems with the model.
Coefficient of Determination (r²)
The fraction of the variation in the response variable y which is accounted for by the regression line. It measures how well a regression model explains the variation in the dependent variable based on the independent variable(s), expressed as a value between 0 and 1, where 1 signifies a perfect fit. It is commonly referred to as “r-squared”.
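A minimal sketch of the calculation in Python, using r² = 1 - (sum of squared residuals) / (total variation in y) for a least-squares line. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after fitting the line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r2 = r_squared(hours, scores)
print(round(r2, 3))  # share of the variation in scores accounted for by the line
```

For a least-squares line with one explanatory variable, this value equals the square of the correlation r.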
Outlier
In regression, an observation which lies outside the overall pattern of the data. More generally, a data point that differs markedly from the majority of other values in a dataset, standing out as unusually high or low compared to the typical value. In a broader context, an outlier is an individual that is markedly different from the norm in some respect.
Influential Observation
An observation which would change the results of the regression significantly if removed. Outliers in the x direction are often very influential on the regression line. Excluding such a point noticeably changes the estimated coefficients or the overall fit of the regression line, because it exerts a disproportionate influence on the model due to its extreme values or its location relative to the other data points.
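One way to see influence is to fit the line with and without a suspect point and compare the slopes. The sketch below does this in Python with made-up data in which the last point is an outlier in the x direction. The helper `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data; the last point (x = 15) is far from the other x values
xs = [1, 2, 3, 4, 5, 15]
ys = [52, 60, 61, 70, 77, 60]

_, slope_all = fit_line(xs, ys)                   # fit using every point
_, slope_without = fit_line(xs[:-1], ys[:-1])     # fit with the last point removed
print(round(slope_all, 3), round(slope_without, 3))
```

A large change in slope after removing a single point suggests that the point is influential.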