VL 10 Flashcards
What’s the difference between Correlation and Regression?
Correlation measures the strength and direction of the linear relationship between two variables, while regression models the relationship between a dependent variable and one or more independent variables to make predictions.
Correlation
* description of an undirected relationship between two or more variables
* how strong it is
* direction is not known, not existing, or we are simply not interested
* phones in household and baby deaths
Regression
* description of a directed relationship between two or more variables
* one variable influences the other
* smoking and cancer
* weight and height
* model to describe the relationship
* model to predict one variable
Name three different fits for numerical variables?
- linear
- Sigmoid
- Exponential
What are the aims of regression?
- looking for a trend: linear, sigmoid, exponential
- curve fitting: which model is most similar to the data
- prediction: predict the response variable Y from X (and others)
- standard curve: assays
Simple linear Regression?
Simple linear regression is a statistical method that models the linear relationship between two variables: one dependent variable and one independent variable. It finds the best-fitting straight line to represent this relationship and can be used for prediction.
- most common regression type
- method to find the best straight line through a cloud of data points
- one variable (independent) is used to predict a second (dependent)
remember:
the best-fit line is for the sample;
the best fit for the population is given by the 95% CI (confidence band)
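A minimal R sketch (with simulated data and made-up variable names x and y) of how such a fit could look with lm():

    set.seed(1)
    x <- rnorm(30)                        # predictor (independent variable)
    y <- 1 + 2 * x + rnorm(30, sd = 0.5)  # response (dependent variable)
    fit <- lm(y ~ x)                      # fit the best straight line
    summary(fit)                          # intercept, slope, R^2, ...
    confint(fit)                          # 95% CI for intercept and slope
    predict(fit, newdata = data.frame(x = 1.5), interval = "confidence")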
How to find the best Regression line?
- target: minimize the deviations from the line
- residuals are the deviations of the data points from the line
- the sum of squared residuals should be minimized over all possible lines
–> method of least squares
What is Ordinary Least Squares (OLS)?
Ordinary Least Squares (OLS) is a method used in linear regression to find the best-fitting line by minimizing the sum of squared differences between observed data and predicted values. It provides the coefficients for the line that represents the relationship between the variables.
The line: y = α + βx
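As an illustrative sketch with invented numbers, the OLS slope and intercept can be computed by hand and compared with lm():

    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
    beta  <- cov(x, y) / var(x)        # slope that minimizes the squared residuals
    alpha <- mean(y) - beta * mean(x)  # intercept
    c(intercept = alpha, slope = beta)
    coef(lm(y ~ x))                    # same values from R's built-in OLS fit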
Error and Residual
Error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean).
Residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example a sample mean)
–> Error is here not really an error, but just individual variation!
The residuals plot looks like a scatter plot (points more scattered), while the regression plot should look like a scatter plot with the points lying along a line.
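A small R sketch (simulated data) that draws both plots side by side:

    set.seed(2)
    x <- rnorm(40)
    y <- 1 + 2 * x + rnorm(40, sd = 0.5)
    fit <- lm(y ~ x)
    par(mfrow = c(1, 2))
    plot(x, y, main = "Regression plot")               # points along a line
    abline(fit)
    plot(x, residuals(fit), main = "Residuals plot")   # scattered around zero
    abline(h = 0, lty = 2)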
What is regression fallacy?
The regression fallacy occurs when natural fluctuation and regression toward the mean are mistaken for the effect of some cause, leading to erroneous interpretations.
The regression fallacy or regressive fallacy assigns a cause-effect relationship even though none exists. Unfortunately, people sometimes use this fallacy to “find” a causal explanation for extreme values generated by a process in which natural fluctuations exist. So the fallacy fails to account for the existence of natural fluctuations and regression toward the mean.
How to predict one variable (Y) from an other (X)?
- equation: Y = a + bX
– a is the intercept of the line at the Y axis, or regression constant
– b is the gradient, slope, or regression coefficient
– Y is a value for the outcome
– X is a value for the predictor
- multiple regression (multiple predictor variables P, Q, R):
– Y = a + bP + cQ + dR
More than two variables
Y = a + bP + cQ + dR + …
* Y dependent variable
* a intercept
* b coefficient for variable P
* c coefficient for variable Q
* d coefficient for variable R
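A hedged R sketch of such a multiple regression; the predictors P, Q, R and the data are simulated only to mirror the formula above:

    set.seed(3)
    d <- data.frame(P = rnorm(50), Q = rnorm(50), R = rnorm(50))
    d$Y <- 2 + 1.5 * d$P - 0.5 * d$Q + 0.8 * d$R + rnorm(50, sd = 0.3)
    fit <- lm(Y ~ P + Q + R, data = d)
    coef(fit)                                   # a (intercept) and b, c, d
    predict(fit, newdata = data.frame(P = 1, Q = 0, R = -1))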
What is Akaike’s “An Information Criterion” (AIC) and when to use it?
AIC is based on the principle of finding the balance between model fit and model complexity. It quantifies the trade-off between the goodness of fit of a model and the number of parameters in the model. The main idea behind AIC is to select the model that best explains the data while penalizing models with a larger number of parameters, thus avoiding overfitting.
When to use AIC:
AIC is commonly used in situations where you have multiple candidate models, and you want to select the best one among them. It is particularly useful when dealing with complex models that may have more parameters than the data can reliably estimate. AIC helps prevent overfitting by favoring simpler models that still provide a good fit to the data.
Researchers and data analysts can use AIC to compare different models and choose the one that strikes the best balance between goodness of fit and model complexity. By doing so, they can obtain more reliable and interpretable results in their statistical analyses.
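A minimal R sketch (simulated data) comparing two candidate models by AIC; the lower value indicates the better trade-off between fit and complexity:

    set.seed(4)
    d <- data.frame(P = rnorm(50), Q = rnorm(50))
    d$Y <- 2 + 1.5 * d$P + rnorm(50)    # Q has no real effect here
    fit1 <- lm(Y ~ P, data = d)
    fit2 <- lm(Y ~ P + Q, data = d)
    AIC(fit1, fit2)                     # the extra parameter in fit2 is penalized
    step(fit2, trace = 0)               # stepwise model selection guided by AIC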
What are Cross-validation and Overfitting?
If you put all your data into the process of model building and then try to predict some of the data that were used to build the model, you are overfitting. The model is adapted to your data, but it is possibly not a general model. You must do cross-validation: some part of the data will be used for model building (training set), some to test the reliability of your model (test set); you predict their values and compare them with the real ones.
How to do cross-validation?
- leave one out: build the model without one sample, predict that sample, repeat this for every sample (see the sketch below)
- splitting: split your data into n-sets, build the model with n-1 sets, predict the one, repeat this for every set
- random splits: randomly split your data into training and test sets, repeat this again and again
- also cross validation can lead to overfitting if you refine your model again and again
- overfitting is a big problem in model fitting
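A minimal leave-one-out sketch in R (simulated data), illustrating the first strategy in the list above:

    set.seed(5)
    d <- data.frame(x = rnorm(30))
    d$y <- 1 + 2 * d$x + rnorm(30)
    pred <- numeric(nrow(d))
    for (i in seq_len(nrow(d))) {
      fit <- lm(y ~ x, data = d[-i, ])                 # train without sample i
      pred[i] <- predict(fit, newdata = d[i, , drop = FALSE])
    }
    mean((pred - d$y)^2)                               # cross-validated error (MSE)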
What is a good model for a classification/regression task?
A decision tree is a machine learning algorithm that recursively divides data into subsets based on features, creating a tree-like structure. It uses a series of binary decisions to classify or predict outcomes for new data based on the path through the tree. The process continues until a stopping criterion is met, and predictions are made at the leaf nodes of the tree. Decision trees are interpretable but can overfit data if not pruned or optimized properly.
- decision trees often work well only for a few variables (up to 10-12)
- for more variables:
–> random forests - library(randomForest) (see the sketch below)
–> support vector machines
–> neural networks
–> many others …
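A hedged R sketch of a classification tree and a random forest on the built-in iris data (assumes the rpart and randomForest packages are installed):

    library(rpart)                                 # recursive partitioning trees
    library(randomForest)
    tree <- rpart(Species ~ ., data = iris)
    print(tree)                                    # the split rules, root to leaves
    predict(tree, iris[1:3, ], type = "class")     # class predictions at the leaves
    rf <- randomForest(Species ~ ., data = iris)   # many trees on bootstrap samples
    print(rf)                                      # out-of-bag error estimate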
How to interpret a decision tree?
Interpreting the Decision Tree:
A decision tree is interpretable due to its tree-like structure. To interpret it, follow the path from the root node to the leaf node for a specific prediction.
Each internal node represents a decision based on a feature, and each branch corresponds to the outcome of that decision.
The leaf nodes represent the final prediction or decision for a given combination of feature values.
The path through the tree that leads to a leaf node tells you the decision process for a specific prediction.
Interpreting the decision tree involves understanding how the features are used to make decisions and predictions. The split points and thresholds in the internal nodes indicate the values of the features that led to certain branches. The more samples in a leaf node that belong to a specific class (in classification) or have similar target values (in regression), the more confident the model is in making predictions for that specific path.
In summary, building a decision tree involves preparing the data, training the model, and evaluating its performance. Interpreting the decision tree involves following the paths through the tree to understand how features contribute to decisions and predictions. Decision trees are valuable tools for interpretable and understandable machine learning models, making them especially useful for cases where interpretability is important.