UEUE7 Flashcards

1
Q

HOW TO CONDUCT MODEL VALIDATION

A

Model validation is a critical step in assessing the performance and reliability of a prediction model. Here are steps you can follow to conduct model validation:

Splitting the Data: Divide your dataset into two parts: a training set and a validation (or testing) set. Typically, you allocate a larger portion for training (e.g., 70-80%) and the rest for validation.

Training the Model: Use the training set to train your prediction model. This involves feeding the algorithm your data and allowing it to learn the patterns and relationships between the input variables and the target variable.

Validation Metrics Selection: Choose appropriate metrics to evaluate the model’s performance. Common metrics include accuracy, precision, recall, F1-score for classification tasks, and metrics like mean squared error (MSE), root mean squared error (RMSE), and R-squared for regression tasks.

Model Evaluation: Apply the trained model to the validation set to make predictions. Compare these predictions with the actual values in the validation set using the selected evaluation metrics.

Cross-Validation (Optional): In cases where data is limited, consider techniques like k-fold cross-validation. This involves splitting the data into k subsets, training the model k times (each time using k-1 subsets as training data and the remaining subset as validation data), and averaging the performance to get a more robust evaluation.

Adjustment and Improvement: Analyze the model’s performance metrics. If the model is not performing well, consider adjusting hyperparameters, using different algorithms, or engineering features differently. Iterate this process until you achieve satisfactory performance.

Final Evaluation: Once you’re satisfied with the model’s performance on the validation set, you can assess its performance on completely unseen data, which simulates how it would perform in the real world. This could be a holdout dataset that hasn’t been used in any part of the training or validation.

Remember, model validation is an iterative process, and it’s essential to ensure that the model doesn’t overfit the training data (perform well on training but poorly on unseen data) or underfit (perform poorly on both training and validation).
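
The workflow above can be made concrete in a few lines. Below is a minimal sketch using scikit-learn on a synthetic dataset; the 80/20 split, logistic regression model, and metric choices are illustrative assumptions, not the only options.

```python
# A minimal sketch of the validation steps above, assuming scikit-learn;
# the dataset is synthetic and the model/split choices are arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 1: split into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: train the model on the training set
model = LogisticRegression().fit(X_train, y_train)

# Steps 3-4: evaluate validation-set predictions with the chosen metrics
preds = model.predict(X_val)
print("accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))

# Step 5 (optional): k-fold cross-validation for a more robust estimate
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```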

2
Q

WHAT IS BACKWARD VARIABLE SELECTION

A

Backward Selection:
1. Start with All Features: Begin by including all potential features in the model.
2. Fit a Model: Train the model using all features.
3. Feature Elimination: Identify the least significant feature (based on p-values, AIC, BIC, or other criteria).
4. Remove Feature: Exclude the identified feature from the model.
5. Iteration: Repeat steps 2-4 until a stopping criterion is met (e.g., no remaining statistically insignificant features, or reaching a predetermined number of features).
This process helps in eliminating less relevant features, resulting in a more parsimonious model.

Stepwise Selection (Forward and Backward Combined):

Forward Selection:
  • Start with an empty model.
  • Iteratively add features one at a time, choosing the most significant feature based on predefined criteria (e.g., lowest p-value).
  • Continue adding features until a stopping criterion is met (e.g., no more statistically significant features to add).

Backward Elimination: perform steps similar to backward selection:
  • Start with a model that includes all features.
  • Iteratively remove the least significant feature based on predetermined criteria.

Hybrid Stepwise: combine forward selection and backward elimination:
  • Initially perform forward selection, adding significant features.
  • Then switch to backward elimination, removing insignificant features.
  • Repeat until the stopping criterion is met.

These stepwise methods aim to iteratively add or remove features to improve the model’s performance or simplicity based on statistical criteria, such as p-values, AIC, BIC, or other measures of model fit.

However, it’s important to note that stepwise selection methods have limitations, including potential overfitting, and may not always yield the best model. They might miss important interactions between variables or choose variables based on chance correlations in the data.

While these techniques automate feature selection, manual assessment and domain knowledge can often complement these methods for more robust model building.
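
As an illustration, here is a minimal sketch of backward elimination driven by p-values, assuming statsmodels; the simulated data and the 0.05 threshold are arbitrary choices.

```python
# Backward elimination by p-value: fit, drop the least significant
# predictor, refit, and stop when all remaining predictors are significant.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 2 * X["x0"] - X["x1"] + rng.normal(size=200)  # only x0 and x1 truly matter

features = list(X.columns)
while features:
    fit = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvals = fit.pvalues.drop("const")   # p-value of each remaining feature
    worst = pvals.idxmax()              # least significant feature
    if pvals[worst] < 0.05:             # stopping criterion: all significant
        break
    features.remove(worst)              # eliminate it and refit

print("selected features:", features)
```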

3
Q

G) Briefly give the outline of a study that could be used to assess the efficacy and safety of a personalized treatment strategy

A

Study Outline:
1. Objective:
Define the primary objective: Are you aiming to demonstrate improved efficacy or safety compared to standard treatments? Or are you focusing on individualized responses or biomarkers?
2. Participant Selection:
Inclusion Criteria: Define specific criteria for patient selection based on the treatment strategy’s personalized aspects (e.g., genetic markers, biomarkers, disease characteristics).
Exclusion Criteria: Exclude individuals who might confound the study results or pose risks due to specific conditions or treatments.
3. Randomization and Control:
Consider whether a randomized controlled trial (RCT) design is feasible or whether other study designs, such as an adaptive trial or observational study, might be more appropriate.
Implement control groups, which could involve standard treatment, placebo, or an alternative intervention.
4. Intervention:
Clearly define the personalized treatment strategy being evaluated.
Ensure proper implementation and monitoring of the personalized approach, which might involve tailored dosages, treatment schedules, or specific interventions based on individual characteristics.
5. Outcomes and Measures:
Determine primary and secondary endpoints that reflect both efficacy and safety aspects.
Efficacy measures could include disease progression, response rates, symptom relief, or survival rates.
Safety measures might encompass adverse events, toxicity, or side effects.
6. Data Collection and Analysis:
Collect relevant data points before, during, and after treatment. These might include biomarker levels, genetic profiles, clinical assessments, and patient-reported outcomes.
Use appropriate statistical methods to analyze the data, considering the personalized nature of the treatment and potential subgroup analyses.
7. Ethical Considerations and Informed Consent:
Ensure ethical approval and compliance with regulatory requirements.
Obtain informed consent from participants, clearly explaining the personalized nature of the treatment and potential risks and benefits.
8. Monitoring and Follow-Up:
Implement monitoring strategies to oversee treatment adherence, assess outcomes, and manage adverse events.
Plan for follow-up assessments to evaluate long-term efficacy and safety.
9. Interpretation and Reporting:
Interpret the study results, considering both efficacy and safety outcomes for the personalized treatment strategy.
Discuss findings in the context of individualized responses and implications for broader clinical practice.
10. Publication and Dissemination:
Share study findings through peer-reviewed publications and presentations at scientific conferences.
Ensure clear communication of results to healthcare providers, patients, and relevant stakeholders.
A well-designed study assessing the efficacy and safety of a personalized treatment strategy should address the unique characteristics of individual patients while providing rigorous evidence to support its effectiveness and safety in a broader clinical context.

4
Q

Prognostic Factors

A

Prognostic factors are variables or characteristics that are associated with the predicted outcome or course of a condition or disease. Prognostic factors are used to estimate the likelihood of different outcomes and to help guide treatment decisions. They play an important role in determining the overall prognosis for a patient.

5
Q

prediction models

A

Prediction models are statistical models that use various variables and factors to predict or estimate the likelihood of a certain outcome or event. These models are developed based on available data and can be used to make predictions about future outcomes or to assess the risk of a particular outcome occurring. In the context of prognostic factors, prediction models can be used to calculate the probability of different outcomes based on the identified prognostic factors.

6
Q

Multivariable model

A

A multivariable model, also known as a multiple regression model, is a statistical model that includes multiple independent variables or predictors to estimate the relationship between those variables and a dependent variable. In other words, it examines how multiple factors or variables collectively contribute to predicting an outcome or event.

7
Q

regression modeling

A

In regression-based modeling, a mathematical equation, called a regression model, is developed to represent the relationship between the variables. The equation takes the form of Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε, where Y is the dependent variable, X₁, X₂, …, Xₚ are the independent variables, β₀, β₁, β₂, …, βₚ are the regression coefficients that represent the effects of the independent variables, and ε is the error term.

The regression coefficients indicate how a unit change in an independent variable affects the dependent variable, holding other variables constant. The error term represents the variability or unexplained part of the dependent variable.

Regression-based modeling involves estimating the regression coefficients using statistical techniques such as ordinary least squares (OLS). The model is fitted to the data to find the best-fitting line or curve that represents the relationship between the variables. The goodness of fit of the model is assessed using various metrics, such as R-squared, adjusted R-squared, and p-values.

Regression-based modeling can be used for various purposes, including prediction, hypothesis testing, and understanding the relationships between variables. It is widely used in fields such as economics, social sciences, finance, and healthcare to analyze and interpret complex data and make informed decisions based on the relationships observed in the data.
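
To make the equation concrete, here is a minimal sketch of fitting Y = β₀ + β₁X₁ + β₂X₂ + ε with statsmodels; the data and true coefficients are simulated for illustration.

```python
# Fit a two-predictor regression model and inspect the estimated
# coefficients and goodness of fit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=100)

fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)     # estimates of beta0, beta1, beta2
print(fit.rsquared)   # goodness of fit
```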

8
Q

Ordinary least squares

A

Ordinary Least Squares (OLS) is a statistical method used in regression analysis to estimate the parameters of a linear regression model. The goal of OLS is to find the best-fitting line or curve that minimizes the sum of the squared differences between the observed values and the predicted values by the model.

In OLS, the dependent variable is assumed to be a linear combination of the independent variables, with an added error term. The method calculates the regression coefficients that represent the effect of each independent variable on the dependent variable. These coefficients are estimated by minimizing the sum of the squared residuals, which are the differences between the observed and predicted values.

The method is called “ordinary” to distinguish it from extensions such as weighted or generalized least squares. OLS is commonly used when the dependent variable and the independent variables are continuous and there is no violation of the assumptions of linearity, independence, homoscedasticity, and normality of the errors.

OLS provides estimates of the regression coefficients, as well as measures of their precision and significance. These estimates are used to make predictions, conduct hypothesis tests, and assess the overall fit of the model. OLS is widely used in various fields, including economics, social sciences, finance, and healthcare, for analyzing and interpreting data and making statistical inferences.

Avoid including every single available predictor, which can lead to noise and overfitting
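
For intuition, the OLS estimates can be computed directly from the normal equations, β̂ = (XᵀX)⁻¹Xᵀy. A minimal numpy sketch on simulated data:

```python
# OLS via the normal equations: beta_hat minimises the sum of squared
# residuals between observed and predicted values.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=100)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
print("estimates:", beta_hat)
print("residual sum of squares:", residuals @ residuals)
```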

9
Q

complete case analysis

A

Complete case analysis

  • Discards observations with missing data in any predictor
  • Never good for any analysis
  • Can substantially reduce sample size
  • Affects precision
  • Makes strong assumptions about missingness, leading to biased results if they fail

Complete case analysis, also known as listwise deletion or casewise deletion, is a method of handling missing data in statistical analysis. In this approach, any case or observation that has missing values on any variable of interest is completely excluded from the analysis.

With complete case analysis, only the cases that have complete information on all variables are retained for analysis. This can lead to a reduced sample size and potential loss of statistical power. It assumes that the missing data are missing completely at random (MCAR) and that the complete cases are a representative subset of the original sample.

One advantage of complete case analysis is its simplicity. It does not require imputation or other complex techniques to handle missing data. However, it may not be appropriate if the missingness is related to the variables being analyzed or if the missing data are not missing completely at random.

It is important to carefully consider the missing data mechanism and the potential implications of using complete case analysis before applying this method in data analysis.

Avoid complete case analysis; one alternative is a ‘missing’ category, though it has problems of its own:

‘Missing’ category

  • Linked to many problems
  • One missing indicator per predictor?
  • Overall missing indicator?
  • Affects sample size (additional predictors = more events needed)
  • Can produce more biased results than doing a complete case
  • Is often associated with the outcome
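
A minimal pandas sketch contrasting the two approaches on a hypothetical dataset (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [55, 63, np.nan, 48, np.nan],
                   "bp": [120, np.nan, 140, 135, 150]})

# Complete case analysis: drop any row with a missing predictor
complete = df.dropna()
print(len(df), "->", len(complete), "rows")  # sample size shrinks

# 'Missing' category: one indicator per predictor, plus a filled-in value
indicators = df.isna().astype(int).add_suffix("_missing")
df_ind = pd.concat([df.fillna(df.mean()), indicators], axis=1)
print(df_ind.columns.tolist())  # extra predictors mean more events are needed
```
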
10
Q

multiple imputation

A

Multiple imputation <- PREFERRED

  • Replaces missing values with plausible ones using the available data
  • Retains your sample size
  • No additional predictors added
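
A minimal sketch of this idea with scikit-learn’s IterativeImputer, which is modelled on chained-equations (MICE-style) imputation; the data are simulated, and in a full analysis the model would be fitted to each imputed dataset and the results pooled (e.g., with Rubin’s rules).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(size=X.shape) < 0.1] = np.nan  # introduce ~10% missingness

# sample_posterior=True draws imputed values rather than plugging in means,
# so repeating with different seeds yields multiple imputed datasets
imputations = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
print(len(imputations), "imputed datasets, each of shape", imputations[0].shape)
```
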
11
Q

Univariable screening

A

What issues might we be introducing by running multiple statistical tests between the outcome and each predictor?

Multiple testing!

Avoid omitting predictors based on lack of univariable/univariate/unadjusted associations with the outcome.

Important predictors can be excluded and unimportant predictors can be included

Type 1 error
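
A minimal simulation of the problem, assuming scipy: with 20 predictors that are pure noise, roughly one will pass p < 0.05 by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(size=100)
noise = rng.normal(size=(100, 20))  # no predictor is truly related to y

pvals = [stats.pearsonr(noise[:, j], y)[1] for j in range(20)]
print("noise predictors passing p < 0.05:", sum(p < 0.05 for p in pvals))
```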

12
Q

Penalised methods

A

For very large predictor numbers, we can explore penalised methods such as LASSO and elastic net to help with predictor selection

These are shrinkage methods that include predictors into the model but shrink their coefficients toward 0

Penalized methods, such as LASSO (Least Absolute Shrinkage and Selection Operator) and elastic net, are statistical modeling techniques used for predictor selection and regularization. These methods are particularly useful when dealing with datasets that have a large number of predictors or variables.

In traditional regression analysis, all predictors are included in the model, which can lead to overfitting and unstable estimates, especially when the number of predictors is large compared to the sample size. Penalized methods address this issue by imposing a penalty on the size of the regression coefficients, effectively shrinking them towards zero. This helps in selecting a subset of predictors that are most relevant for predicting the outcome of interest.

LASSO is a penalized regression method that performs both variable selection and coefficient shrinkage. It encourages sparsity by forcing some regression coefficients to exactly zero, effectively removing irrelevant predictors from the model. This makes LASSO particularly useful when there is a suspicion that only a subset of predictors is truly associated with the outcome.

Elastic net is a combination of LASSO and ridge regression, which introduces a second penalty term to the objective function. The elastic net penalty allows for variable selection while also handling correlated predictors more effectively than LASSO alone.

Penalized methods can help improve prediction accuracy, enhance model interpretability, and reduce overfitting in the presence of high-dimensional data. They are widely used in various fields, including healthcare, finance, genomics, and social sciences, where datasets often have a large number of predictors.

It is important to note that penalized methods require careful tuning of the penalty parameter to achieve optimal results. Cross-validation techniques are commonly used to select the appropriate value of the penalty parameter and assess the performance of the penalized model.
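
A minimal sketch with scikit-learn, using cross-validation to tune the penalty; the synthetic data have 50 candidate predictors of which only 5 matter.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

# LASSO: cross-validated penalty; many coefficients shrink exactly to 0
lasso = LassoCV(cv=5).fit(X, y)
print("non-zero coefficients:", (lasso.coef_ != 0).sum(), "of", X.shape[1])

# Elastic net: mixes the LASSO and ridge penalties (l1_ratio tunes the mix)
enet = ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8]).fit(X, y)
print("chosen l1_ratio:", enet.l1_ratio_, "alpha:", enet.alpha_)
```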

13
Q

**WE WANT TO KEEP CONTINUOUS DATA CONTINUOUS AND NOT CHANGE IT AT ALL.**

Keep continuous predictors continuous, but do not assume linearity!

Consider transformations

A

Polynomials are mathematical expressions that involve variables and coefficients, raised to a power. In the context of regression analysis, polynomials are used to model non-linear relationships between predictors and the outcome variable.

A polynomial term is created by raising a predictor variable to a power. For example, a quadratic polynomial includes the terms x and x², a cubic adds x³, and so on, where x is the predictor variable. These terms allow curved, non-linear relationships to be captured in the regression model.

Polynomials can be useful when there is a suspicion that the relationship between a predictor and the outcome is not linear. By including polynomial terms in the model, we can capture more complex relationships and improve the model’s fit to the data.

In addition to quadratic (second-degree) polynomials, other types of polynomials can also be used, such as cubic (third-degree) polynomials or higher-order polynomials. The choice of polynomial degree depends on the nature of the relationship between the predictor and the outcome variable, as well as the available data.

It is important to note that coefficients on polynomial terms do not have the usual one-unit-change interpretation: because several terms involving the same predictor change together, the effect of a one-unit change in the predictor depends on the predictor’s current value.

Overall, polynomials are a flexible tool in regression analysis that allow for the modeling of non-linear relationships between predictors and outcomes. They provide a way to capture more complex patterns and improve the accuracy of the regression model.
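
A minimal sketch of adding a quadratic term with scikit-learn; the data are simulated so the true relationship really is quadratic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 0.5 * x[:, 0] + 2.0 * x[:, 0] ** 2 + rng.normal(size=200)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)  # columns: x, x^2
fit = LinearRegression().fit(X_poly, y)
print("coefficients (x, x^2):", fit.coef_, "intercept:", fit.intercept_)
```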

14
Q

knots

A

Spline functions are mathematical functions that are used to approximate or interpolate data points. They are commonly employed in regression analysis to capture non-linear relationships between variables. A spline function is composed of several polynomial segments, and “knots” are the points where these segments connect.

In the context of predictive modeling using splines, the number and placement of knots become crucial. The choice of knots affects the flexibility and smoothness of the fitted curve. Too few knots may result in an overly simplistic model that fails to capture the underlying complexity of the data, while too many knots may lead to overfitting, capturing noise in the data rather than the underlying pattern.

When using restricted cubic splines, it is generally recommended to use a smaller number of knots for smaller sample sizes. This is because adding more knots creates additional parameters that need to be estimated, increasing the complexity of the model. With a smaller sample size, estimating more parameters can lead to overfitting and unstable results.

The selection of the optimal number of knots is often based on statistical criteria, such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), which balance model complexity and goodness of fit. These criteria can help determine the appropriate number of knots for a given sample size and dataset.

Overall, when using restricted cubic splines, it is important to consider the trade-off between flexibility and complexity, and to choose an appropriate number of knots based on the sample size and the specific characteristics of the data.

15
Q

restricted cubic splines

A

A restricted cubic spline is a type of spline function commonly used in statistical modeling, particularly in regression analysis. It is a smoother version of a cubic spline, but with added constraints to improve stability and interpretability. The primary purpose of using restricted cubic splines in predictive models is to capture non-linear relationships between predictor variables and the response variable.

Here are some key points about restricted cubic splines in prediction models:

Cubic Splines vs. Restricted Cubic Splines:

Cubic splines are flexible but can produce oscillations or wild fluctuations in the fitted curve.
Restricted cubic splines constrain the fit to be linear beyond the boundary knots, avoiding extreme behavior in the tails of the predictor variable’s distribution.
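
A minimal sketch of fitting a restricted (natural) cubic spline, assuming the patsy and statsmodels packages; the variable names and df=4 choice are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 80, 300)})
df["risk"] = np.log(df["age"]) + rng.normal(scale=0.1, size=300)  # non-linear truth

# cr() is patsy's natural cubic regression spline basis, which is constrained
# to be linear beyond the boundary knots; df controls the number of knots
fit = smf.ols("risk ~ cr(age, df=4)", data=df).fit()
print(fit.aic, fit.bic)  # criteria that can guide the choice of knots
```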

16
Q

regression-based models

A

Fit the full model

  • Requires clinical knowledge
  • Forces you to think about the predictors

Data-driven methods

  • Forward selection (AVOID)
  • Backward selection
  • Stepwise selection

17
Q

calibration

A

Calibration is the agreement between observed and predicted risks.

  • Graphically: calibration plot (preferred approach)
  • Supplement the calibration plot with estimates of the calibration-in-the-large (CITL) and the calibration slope
  • Good calibration would yield a ≈ 0 (CITL) and b ≈ 1 (slope)
  • If you calculate these, as is, in the development data, a = 0 and b = 1 by definition; if they do not, something has gone wrong (i.e., they are not interesting in the development data, only in validation data)

CALIBRATION SLOPE

  • Calculated by fitting a ‘calibration model’ to the data:

ln(p / (1 − p)) = a + b × linear predictor

where b is the estimate of the calibration slope.

  • If b < 1: the model is overfitted; predictions are too extreme (too high and too low at the extremes)
  • If b > 1: the model is underfitted; predictions are not varied enough (not low enough or high enough at the extremes)

Overfitted model (calibration slope < 1)
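
A minimal sketch of estimating the calibration slope and CITL in validation data, assuming statsmodels; the data are simulated so that predictions are too extreme (true slope below 1).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
lp = rng.normal(size=2000)                         # model's linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-0.7 * lp)))   # true risks less extreme than predicted

# Calibration model: ln(p/(1-p)) = a + b * linear predictor
slope_fit = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
print("calibration slope b:", slope_fit.params[1])  # < 1 suggests overfitting

# CITL: intercept fitted with the linear predictor as an offset
citl_fit = sm.Logit(y, np.ones((len(lp), 1)), offset=lp).fit(disp=0)
print("CITL a:", citl_fit.params[0])                # ~ 0 indicates good mean calibration
```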

18
Q

discrimination

A

Discrimination

  • Discrimination is the ability of the model to differentiate between individuals with and without the outcome
  • The area under the receiver operating characteristic (ROC) curve, or concordance measure, gives the probability that, for a randomly selected pair of individuals, the model assigns a higher probability to the individual who experiences the event (and/or who had a shorter survival time for time-to-event models)
  • c-statistic for binary outcomes (e.g., logistic regression)
  • c-index for time-to-event outcomes (e.g., Cox regression) [Harrell’s c-index most commonly used]
  • A value of 0.5 indicates the model is no better than tossing a coin; a value of 1.0 indicates perfect discrimination

c-statistic for logistic regression (~0.5)
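
A minimal sketch of computing the c-statistic for a binary outcome, assuming scikit-learn; predicted risks and outcomes are simulated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
risk = rng.uniform(size=500)       # predicted probabilities from a model
y = rng.binomial(1, risk)          # outcomes generated from those risks

print("c-statistic:", roc_auc_score(y, risk))  # 0.5 = coin toss, 1.0 = perfect
```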

19
Q

calibration and c-statistic

A

C-Statistic (Concordance Statistic):

Definition: The C-statistic, also known as the area under the receiver operating characteristic curve (AUC-ROC), evaluates the discriminatory power of a model.

Key Points:

  • Binary Classification: The C-statistic is commonly used for binary classification problems, where the model predicts the probability of belonging to the positive class.
  • ROC Curve: The ROC curve is a graphical representation of the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. The AUC-ROC represents the area under this curve.
  • Interpretation: A C-statistic of 0.5 indicates no discriminatory power (equivalent to random chance), while a value of 1.0 represents perfect discrimination.
  • Limitations: The C-statistic may not capture certain aspects of model performance, especially in situations with imbalanced datasets or when false positives and false negatives have different consequences.

In summary, calibration assesses the accuracy of predicted probabilities, while the C-statistic evaluates the discriminatory power of a model. Both metrics provide valuable insights into different aspects of model performance, and it is common to use them together for a comprehensive assessment. A well-calibrated model with good discriminatory power is generally desirable in predictive modeling.
20
Q

internal validation

A

Internal validation methods, including resampling techniques like cross-validation and bootstrapping, as well as splitting strategies, are employed to assess the performance of predictive models using the data at hand. Each of these internal validation approaches is discussed below:

  1. Train-Test Split:
    Method:
    The dataset is divided into two subsets: one for training the model and the other for testing the model.
    Common splits include 80% for training and 20% for testing.
    Advantages:
    Simple and computationally efficient.
    Provides a quick estimate of model performance.
    Considerations:
    Randomness in the split can impact results.
    The performance may vary based on the specific split.
  2. Cross-Validation:
    Method:
    The dataset is divided into k folds, and the model is trained and tested k times.
    Each fold serves as the test set exactly once, and the average performance across all folds is calculated.
    Advantages:
    Provides a more robust estimate of model performance.
    Reduces the impact of a single random split.
    Considerations:
    Computationally more expensive than a single train-test split.
  3. Leave-One-Out Cross-Validation (LOOCV):
    Method:
    A special case of k-fold cross-validation where k is equal to the number of observations.
    In each iteration, one observation is used as the test set, and the model is trained on the remaining observations.
    Advantages:
    Provides a nearly unbiased estimate of model performance.
    Considerations:
    Can be computationally expensive, especially for large datasets.
  4. Bootstrapping:
    Method:
    Samples are drawn with replacement from the dataset to create multiple bootstrap samples.
    Each sample is used to train and test the model, and the average performance is calculated.
    Advantages:
    Allows for estimating the uncertainty of model performance.
    Resampling provides an effective way to assess model stability.
    Considerations:
    Computationally intensive.
  5. Time Series Split:
    Method:
    When dealing with time-series data, the dataset is split into training and test sets in a way that respects the chronological order of observations.
    Advantages:
    Suitable for time-dependent models.
    Considerations:
    Ensures that the model is evaluated on unseen future data.
Summary:

  • Resampling techniques: cross-validation and bootstrapping resample the dataset to obtain multiple training and testing subsets, allowing for a more robust evaluation of model performance.
  • Splitting strategies: train-test split and time series split divide the dataset into training and testing sets using different strategies.
  • Trade-offs: the choice of internal validation method depends on factors such as dataset size, computational resources, and the desire for a more robust estimate of model performance.

Selecting an appropriate internal validation strategy is essential to obtain reliable insights into a model’s generalization performance and to identify potential issues such as overfitting or underfitting. It is often recommended to use a combination of these techniques for a comprehensive evaluation.
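
A minimal sketch of a bootstrap check of a model’s c-statistic, assuming scikit-learn; note this is a simplified illustration of resampling, not the full optimism-corrected bootstrap used in formal internal validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))          # sample rows with replacement
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    aucs.append(roc_auc_score(y, model.predict_proba(X)[:, 1]))  # score on original data

print("mean AUC:", np.mean(aucs))
print("2.5th-97.5th percentile:", np.percentile(aucs, [2.5, 97.5]))
```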
