Reducing Complexity in Data in Microsoft Azure Flashcards
A model uses three features to make predictions. Two of the features are categorical and hold three unique values each. Upon evaluation of the model, we identify the model has poor performance. Why is this?
Overfitting
Underfitting
Algorithm choice
Underfitting
Which of these common model evaluation metrics incorporates feature set complexity?
MSE - Mean Squared Error
BIC - Bayesian Informaton Criterion
AUC - Area Under ROC Curve
RMSE - Root Mean Square error
BIC
You want to remove categorical features that are unlikely to add value to your model. What would be an effective first step?
Hashing categorical features
One-hot encode categorical variables
Identify columns with no or low variance
Identify columns with no or low variance
Removing highly correlated features is an important step for which algorithm family?
k-means
Regression
Trees
Regression
Which of the following is NOT a commonly used kernel in kPCA?
Polynomial
Sigmoid
Radial Basis
Beta
Beta
High cardinality categorical variables have:
A large number of rows with a single label
High correlation with another feature
Labels with long names
A large number of distinct labels
A large number of distinct labels - High cardinality means that the column contains a large number of totally unique values.
What does Linear Discriminant Analysis (LDA) aim to maximize?
Difference between groups
Difference between observations
Sparse representation
Value compression
Difference between groups
k-means clustering can be classified as:
Supervised, parametric
Unsupervised, parametric
Unsupervised, non-parametric
Supervised, non-parametric
Unsupervised, non-parametric
Nonparametric statistics refers to a statistical method in which the data are not assumed to come from prescribed models that are determined by a small number of parameters; examples of such models include the normal distribution model and the linear regression model. Nonparametric statistics sometimes uses data that is ordinal, meaning it does not rely on numbers, but rather on a ranking or order of sorts.
What Python library might you use to build an autoencoder?
Tensorflow
PyStan
Scikit-learn
Codec
Tensorflow
What must you ensure about your features before you perform PCA?
Features are not collinear
Features are similarly scaled
Features are normally distributed
Features have high variance
Features are similarly scaled