Session 3.1 Flashcards
Data science tasks can be split into two groups
- Unsupervised Methods
there is no specific target variable - Supervised Methods
there is a specific target variable
Unsupervised Methods
- Affinity grouping
- Similarity Matching
- Clustering
- Sentiment Analysis
Supervised Methods
- Predictive Modeling
- Causal (Explanatory) Modeling
Predictive modeling
is the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations.
Predictive modeling is a method for estimating an unknown value of interest, which is called target
Causal (Explanatory) Modeling
is the use of statistical models for explaining how the world works (by testing causal explanations)
Why Empirical Explanation and Empirical Prediction Differ?
- Explanatory models are based on underlying causal relationships between theoretical constructs while predictive models rely on associations between measurable variables.
- Explanatory modeling seeks to minimize model bias (i.e., specification error) to obtain the most accurate representation of the underlying theoretical model, predictive modeling seeks to minimize the combination of model bias and sampling variance (how much does the model change with new data; more on this in a future session)
Linear Regression is…
an approach for modeling the relationship between a dependent variable and one or more explanatory variables
Logistic regression can also be used for classification
- If the target variable is binary, we can interpret the dependent variable as a Yes/No outcome
- The outcome of a model resulting from a logistic regression can be interpreted as a probability
Decision Trees
- Trees create a segmentation of the data
- Each node in the tree contains a test of an attribute
- Each path eventually terminates at a leaf
- Each leaf corresponds to a segment, and the attributes and values along the path give the characteristics
- Each leaf contains a value for the target variable
The process of recursively segmenting the population stops when at least one of the following conditions is met:
- All elements of a segment belong to the same class (special case: the segment contains only one element)
- The maximum allowed tree depth has been reached
- Using more attributes does not “help”
How to evaluate if a model is a good model? How do we compare two models?
It is important to consider carefully what would we like to achieve when building a model
This seems simple but many times we see some statistics reported without a clear understanding why this is the right statistic
Now: Basic performance metrics
Later: More sensible performance metrics
Accuracy is …
the proportion of correct decisions made by the classifier.
Accuracy is a popular metric because …
it is very simple to calculate
Error rate is …
the proportion or wrong decisions made by the classifier.