Data Unit 2 Test Flashcards
Scatter Plots:
Graph used to determine if there is a relationship between two variables. The independent variable is on the horizontal axis and the dependent variable is on the vertical axis.
Line of Best Fit (trend line)
A straight line that passes as close as possible to all the points in a scatter plot.
The line is drawn with the following criteria in mind:
- The line passes through as many points as possible.
- There are evenly distributed points above and below the line (the sum of the perpendicular distances for all the points above the line should equal that of the points below it).
- Ignore outliers, whenever possible.
- Consider the origin as a possible point (e.g. extrapolate to time 0)
Outliers:
Data points that lie far from the majority of the other data. They can skew a regression analysis, especially when the data set is small. More information should be sought about an outlier before including or excluding it from the analysis.
Correlation:
In data analysis, one variable may be affected by another. When a change in the independent variable affects the dependent variable, there is a correlation between them.
To describe a correlation, we use 3 attributes:
- Linear (clear line), Non-linear (curved), No correlation (scattered points)
- Positive (positively sloped), Negative (negatively sloped)
- Strength; Strong (or high), Moderate (or medium), or Weak
Linear Correlation:
Variables have a linear correlation if the changes in one variable tend to be proportional to changes in the other variable.
Correlation Coefficient:
Gives a quantitative measure of the strength of a linear correlation, that is, a measure of how close the points of a scatter plot are to the line of best fit. It is the covariance of the two variables divided by the product of their standard deviations.
Negative Linear Correlation Values
Strong is -1 to -0.67
Moderate is -0.67 to -0.33
Weak is -0.33 to 0
Positive Linear Correlation Values
Weak is 0 to 0.33
Moderate is 0.33 to 0.67
Strong is 0.67 to 1
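The ranges above can be turned into a small lookup. This is only a sketch: the function name describe_correlation is made up for illustration, and because the ranges share their endpoints (0.33 and 0.67), the code assumes a boundary value belongs to the stronger category. Treating r = 0 as "no linear correlation" is also an assumption of this sketch.

```python
def describe_correlation(r):
    """Label a correlation coefficient using the ranges on these cards."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        # Assumption: exactly 0 is reported as no linear correlation.
        return "no linear correlation"

    size = abs(r)
    if size >= 0.67:      # boundary assigned to the stronger label (assumption)
        strength = "strong"
    elif size >= 0.33:
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.77))
print(describe_correlation(-0.5))
print(describe_correlation(0.1))
```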
Steps for using the Correlation Coefficient Formula
- Create a chart with 5 columns (x, y, x^2, y^2, xy)
- Calculate the sums of each column (by adding each value)
- Plug in the values and solve with BEDMAS
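The three steps above can be sketched in code. The cards do not write the formula out, so this assumes the usual sums form, r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²)). The minutes and cost values are made-up sample data, not values from the unit.

```python
import math


def correlation_coefficient(xs, ys):
    """Compute r from the five column sums (x, y, x^2, y^2, xy)."""
    n = len(xs)

    # Step 2: the sum of each column in the chart.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Step 3: plug the sums into the formula.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical sample data, for illustration only.
minutes = [1, 2, 3, 4, 5]
cost = [2, 4, 5, 4, 5]

r = correlation_coefficient(minutes, cost)
print(round(r, 4))
```

For this sample the sums are Σx = 15, Σy = 20, Σx² = 55, Σy² = 86 and Σxy = 66, so r = 30 / sqrt(1500), roughly 0.77.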
Linear Regressions in Desmos
- Click on the + sign and insert a table
- Type in your x and y values
- On the next line, type y1~mx1+b; the r and r^2 values will appear
You can find the equation of the line of best fit by
subbing in the m and b that Desmos gives you. Just remember to write x after the m, because the form is y = mx + b.
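To check the m and b by hand, the textbook least-squares formulas can be used. This is a sketch of the standard calculation, not Desmos's own code; the function name best_fit_line and the sample data are made up for illustration.

```python
def best_fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = mx + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Slope, then y-intercept, from the column sums.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


# Hypothetical sample data, for illustration only.
minutes = [1, 2, 3, 4, 5]
cost = [2, 4, 5, 4, 5]

m, b = best_fit_line(minutes, cost)
print(f"y = {m:.2f}x + {b:.2f}")
```

For this sample the slope is 30 / 50 = 0.6 and the intercept is (20 - 9) / 5 = 2.2, so the line is y = 0.6x + 2.2.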
A positive relationship between two variables will sound like this:
as the # of minutes increases, the cost also increases.
To find unknown y values that are within our data set we can either
1) sub that x into the equation and solve or 2) hover over the x value on Desmos and get the y.
Interpolation is
estimating values within the range of our data set.
Extrapolation is
estimating values beyond the range of our data set.
To find unknown x values that are within our data set we can either
1) sub in the y and solve or 2) hover over the trendline as close to the y value as possible.
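Option 1 in both cards (subbing into the equation) can be sketched as below. The values of m and b and the x data are hypothetical, and the helper names are made up for illustration. The last helper checks whether an x value falls inside the data, which is what separates interpolation from extrapolation.

```python
# Hypothetical line of best fit y = mx + b and the x values it came from.
m = 0.6
b = 2.2
x_data = [1, 2, 3, 4, 5]


def predict_y(x):
    """Sub x into y = mx + b."""
    return m * x + b


def solve_for_x(y):
    """Sub y into y = mx + b and rearrange: x = (y - b) / m."""
    return (y - b) / m


def is_interpolation(x):
    """True if x lies within the range of the collected data."""
    return min(x_data) <= x <= max(x_data)


print(predict_y(3))         # unknown y for a known x
print(solve_for_x(5.2))     # unknown x for a known y
print(is_interpolation(3))  # 3 is inside the data, so this is interpolation
print(is_interpolation(8))  # 8 is beyond the data, so this is extrapolation
```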
A point can be rejected as an outlier if
1) There was an error collecting the data.
ex. Kathy collected data on the heights of students and their arm spans.
When recording the data, she mistakenly recorded a height of 181 cm as 101 cm. (Thus, an error in data collection.)
2) Outside factors affected the data.
ex. A company records its total revenue each month.
Last month, the workers at the company went on strike for two weeks.
The presence of an outlier can affect the
analysis of data
Other factors, such as
sample size and composition also need to be considered when analyzing data.
Linear regression is only appropriate
if the data appears to be linear.