Regression Flashcards
Regression
Regression is a method in machine learning that models a target value based on independent
predictors. It is essentially a statistical tool for finding the relationship between a dependent
variable and one or more independent variables. It comes into play in forecasting and in
investigating cause-and-effect relationships between variables.
Regression techniques differ based on:
- The number of independent variables
- The type of relationship between the independent and dependent variable
Data used in regression
Regression is generally performed when the dependent variable is of a continuous data type. The
independent variables, however, can be of any data type: continuous, nominal/categorical, etc.
What regression methods do
Regression methods find the line that best describes the relationship between the dependent
variable and the predictors, that is, the line with the least error. In regression, the dependent
variable is a function of the independent variables, the coefficients, and the error term.
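A minimal sketch of the idea above, assuming the simplest case of one predictor and a straight-line model y = intercept + slope * x + error. The function name `fit_line` and the data values are hypothetical, chosen only for illustration; the fit uses ordinary least squares, which is one common way to define "least error".

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: one independent variable (xs), one dependent variable (ys).
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

intercept, slope = fit_line(xs, ys)

# The error term: what is left over after the fitted line is subtracted.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Here the coefficients are the intercept and slope, and the residuals play the role of the error term.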
Correlation
Correlation is a measure of the strength of a linear relationship between two quantitative variables
(e.g. price and sales).
- Correlation is positive when the values increase together
- Correlation is negative when one value decreases as the other increases
Correlation can take a value from -1 to 1:
- 1 is a perfect positive correlation
- 0 is no correlation (the values don’t seem linked at all)
- -1 is a perfect negative correlation
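The three reference values can be reproduced with a small sketch. The helper `pearson_r` and all data values are hypothetical, written here only to illustrate the card; it assumes the correlation meant is Pearson's.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

price = [1, 2, 3, 4, 5]        # hypothetical values
rising = [2, 4, 6, 8, 10]      # increases together with price
falling = [10, 8, 6, 4, 2]     # decreases as price increases
unrelated = [2, 1, 4, 1, 2]    # no linear link with price

r_pos = pearson_r(price, rising)      # perfect positive correlation
r_neg = pearson_r(price, falling)     # perfect negative correlation
r_zero = pearson_r(price, unrelated)  # no linear correlation

print(r_pos, r_neg, r_zero)
```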
Cross tabs
Cross tabs help us establish a relationship between two variables. This relationship is exhibited in a tabular form.
Column percentages
These are percentages calculated within the columns, so that each column’s
percentages add up to 100%.
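As a rough illustration, column percentages can be computed from raw counts as sketched below. The survey records and the variable names are made up for this example.

```python
from collections import Counter

# Hypothetical records: (row variable, column variable)
records = [
    ("yes", "urban"), ("yes", "urban"), ("no", "urban"), ("no", "urban"),
    ("yes", "rural"), ("no", "rural"), ("no", "rural"), ("no", "rural"),
]

counts = Counter(records)                 # cell counts of the cross tab
rows = sorted({r for r, _ in records})    # row categories
cols = sorted({c for _, c in records})    # column categories

# Total count in each column, used as the denominator.
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Each cell as a percentage of its own column.
col_pct = {
    (r, c): 100 * counts[(r, c)] / col_totals[c]
    for r in rows
    for c in cols
}

for r in rows:
    print(r, [f"{c}: {col_pct[(r, c)]:.0f}%" for c in cols])
```

Within each column the percentages sum to 100, which is what distinguishes column percentages from row percentages.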
Cross tabs when the variables are not ordered
When neither variable is ordered, we can simply refer to the strength of the
association without discussing its direction.
Scatterplots
- A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric
variables.
- The position of each dot on the horizontal and vertical axis indicates values for an individual
data point.
- Scatter plots are used to observe relationships between variables.
What type of correlation is shown here?
This is a negative correlation. As we move along the x-axis toward larger values,
the points move down, which means the y-values are decreasing.
Pearson’s r
- The Pearson correlation coefficient is used to measure the strength of a linear association between
two variables.
- A value of r = 1 means a perfect positive correlation, and a value of r = -1 means a
perfect negative correlation.
Requirements for Pearson’s correlation coefficient are as follows:
- Scale of measurement should be interval or ratio
- Variables should be approximately normally distributed
- The association should be linear
- There should be no outliers in the data
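The outlier requirement can be seen in a small sketch: a single extreme point is enough to change r drastically. The helper `pearson_r` and the data are hypothetical and exist only for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Five points lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# The same data plus one outlier at (6, 0).
r_outlier = pearson_r(xs + [6], ys + [0])

print(f"without outlier: r = {r_clean:.2f}")    # r = 1.00
print(f"with outlier:    r = {r_outlier:.2f}")  # r = 0.14
```

One added point moves r from a perfect correlation to a weak one, which is why outliers should be checked before Pearson's r is interpreted.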
What does this test do?
Pearson’s r
- The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a
measure of the strength of a linear association between two variables and is denoted by ‘r’.
- Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two
variables.
- The Pearson correlation coefficient, r, indicates how far away all these data points are from
this line of best fit (i.e., how well the data points fit this new model/line of best fit).
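For reference, the usual sample formula for r is given below. The symbols are not in the original cards and are assumptions of this sketch: x_i and y_i are the paired observations, n is the number of pairs, and the bars denote sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator captures how x and y vary together; the denominator rescales it so that r always lies between -1 and +1.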
What values can the Pearson correlation coefficient take?
- The Pearson correlation coefficient, r, can take a range of values from +1 to -1.
- A value of 0 indicates that there is no association between the two variables.
- A value greater than 0 indicates a positive association; that is, as the value of one variable
increases, so does the value of the other variable.
- A value less than 0 indicates a negative association; that is, as the value of one variable
increases, the value of the other variable decreases.