Tidy Modeling Flashcards
Models are mathematical tools that can describe a system and capture
relationships in the data given to them
Predicting future events, determining between-group differences, map-based visualizations, and pattern discovery are all
Purposes for which models can be used
The utility of a model hinges on its ability to be
Reductive (reduce complex relationships to simpler terms)
Purpose of a descriptive model
Describe or illustrate characteristics of some data
Descriptive models need not have a purpose other than visually emphasizing an artifact in the data (T/F)
True
Producing a decision for a research question or exploring a specific hypothesis is the goal of
Inferential models
An inferential model starts with a predefined hypothesis about a population and produces a
Statistical conclusion (rejection of hypothesis, interval estimate, etc.)
Inferential modeling techniques typically produce a __________ output
Probabilistic (p-value, CI, posterior probability)
To compute probabilistic outputs, probabilistic assumptions must be made about the data and the underlying processes that generated the data because
The quality of statistical modeling is highly dependent on the pre-defined assumptions and how well the data fit them
The primary goal of predictive models is that the predicted values have
The highest possible fidelity to the true value of the new data
The problem type addressed by predictive models is
Estimation
In predictive models, more interest is vested in the predicted value than
Why the predicted value is what it is
Predictive models can include measures of uncertainty (T/F)
True
Most important factor affecting predictive models…
How the model was developed
Predictive mechanistic models produce a model equation that
Depends on assumptions
In predictive mechanistic models, data are used to estimate…
Unknown parameters of the model equation to generate predictions
In predictive mechanistic models, differential equations are set based on
The model’s assumptions
Unlike inferential models, predictive mechanistic models allow for data-driven statements on how well the model performs based on
How well it predicts the existing data
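The mechanistic cards above can be illustrated with a minimal sketch. It assumes a hypothetical first-order decay process, dy/dt = -k * y, whose solution is y(t) = y0 * exp(-k * t). The function name, the data, and the choice of least squares on the log scale are illustrative assumptions, not part of the source material.

```python
import math

def fit_exponential_decay(times, values):
    """Estimate y0 and k in y(t) = y0 * exp(-k * t) by least squares on log(y)."""
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    # Slope of log(y) against t is -k; the intercept is log(y0).
    slope = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs)) / sum(
        (t - t_mean) ** 2 for t in times
    )
    intercept = l_mean - slope * t_mean
    return math.exp(intercept), -slope

# Hypothetical, noise-free measurements generated with y0 = 100 and k = 0.5.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [100.0 * math.exp(-0.5 * t) for t in times]

y0_hat, k_hat = fit_exponential_decay(times, values)

# The model equation with estimated parameters generates predictions,
# which can then be compared with the existing data.
predicted = [y0_hat * math.exp(-k_hat * t) for t in times]
```

The form of the equation comes from the assumptions, and the data only supply the unknown parameters y0 and k.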
Empirically-driven models are created with _____ assumptions
Vague
Empirically-driven modeling is most associated with _______ learning
Machine
KNN modeling is an
Empirically-driven predictive model
How does KNN work?
Given reference data, a new sample is predicted by using the outcome values of the K most similar samples in the reference set
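A minimal sketch of the KNN idea for a numeric outcome, assuming Euclidean distance as the similarity measure and the mean of the neighbors' outcomes as the prediction. Both choices, the function name, and the data are illustrative assumptions.

```python
import math

def knn_predict(reference, new_point, k=3):
    """Predict a numeric outcome as the mean outcome of the k nearest reference samples."""
    # reference is a list of (features, outcome) pairs with equal-length feature tuples.
    by_distance = sorted(reference, key=lambda pair: math.dist(pair[0], new_point))
    nearest = by_distance[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)

# Hypothetical reference set: two predictors and one numeric outcome per sample.
reference = [
    ((1.0, 1.0), 10.0),
    ((2.0, 1.5), 12.0),
    ((8.0, 9.0), 40.0),
    ((9.0, 8.5), 42.0),
]

# The two closest reference samples have outcomes 10.0 and 12.0.
prediction = knn_predict(reference, (1.5, 1.2), k=2)
```

No model equation is assumed here. The prediction is defined entirely by the reference data and the choice of K.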
In predictive models, if the structure of the model is good, then
The predictions would be close to the actual values
Three types of models
Descriptive, inferential, and predictive
Two types of predictive models
Mechanistic and empirically-driven
An ordinary linear regression (OLR) model is descriptive when
Restricted smoothing splines (similar to LOESS) are used to describe trends in data using OLR with specialized terms
OLR is inferential when
Statistical results (p-values, for example) are used for inference
OLR is predictive when
A simple linear regression produces accurate predictions
KNN should not be used for inference because
Its nature makes the math required for inference intractable
The predictive capacity of descriptive and inferential models should not be ignored because they model
How variables relate to the probability of outcomes
Predictive performance relates to how close the model’s
Fitted values are to the observed data
Whether a model is appropriate cannot be determined by ______ alone
Statistical significance
Unsupervised models learn patterns, clusters, or other characteristics of data (understand relationships between variables) but lack
An outcome (dependent variable)
Examples: principal component analysis (PCA), clustering, and autoencoders
Supervised models have an outcome variable. Examples are…
Linear regression, neural networks, etc.
Two sub-categories of supervised models
Regression (predicts a numeric outcome)
Classification (predicts an outcome that is an ordered or unordered set of qualitative values)
Outcomes (what is being predicted) are also known as…
Labels, endpoints, or dependent variables
Independent variables (used to make predictions) also known as…
Predictors, features, or covariates
Exploratory data analysis shows
How variables are related to each other (distributions, typical ranges, etc.)
During EDA, two main questions should be answered, which are
…
How did I come by these data?
Are the data relevant to the problem?
Performance metrics should be identified prior to
The analysis process
Phases of Modeling
EDA (iterate between numerical analysis and visualization)
Feature engineering (use existing variables to create new variables)
Model tuning and selection (specifying or optimizing the structural parameters of models)
Model evaluation (assess model performance, examine residual plots)
A main effect is a model term that contains a
Single predictor variable
Root mean squared error (RMSE) measures the performance of regression models using the differences (residuals) between the
Observed and predicted values
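As a small worked sketch of the RMSE card, the residuals are squared, averaged, and square-rooted. The function name and the numbers are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Hypothetical observed outcomes and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# Residuals are 1, 0, -1, -2, so the mean squared residual is 6 / 4 = 1.5.
error = rmse(observed, predicted)
```

RMSE is expressed in the same units as the outcome, and smaller values mean predictions closer to the observed data.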
The primary approach for empirical model validation is to split the existing pool of data into two distinct sets, which are
Training set - majority of data; used to build model
Test set - determines whether model is successful (should only be looked at once, or it becomes part of the modeling process)
Simple random sampling is the most common method used to
Split data into training and test sets
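A minimal sketch of splitting by simple random sampling, using only the Python standard library. The 80/20 proportion, the seed, and the function name are assumptions chosen for illustration. The source only says the training set holds the majority of the data.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=123):
    """Split rows into training and test sets by simple random sampling."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]         # copy so the original order is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data pool: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
```

Every row lands in exactly one of the two sets, so the test set stays independent of the data used to build the model.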