Test 2 Flashcards
The goal of PCA is to find a high-dimensional representation of the data that maintains as much information as possible.
F
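For the record, PCA looks for a low-dimensional representation that preserves as much variance as possible. A minimal sketch, assuming scikit-learn and NumPy (the toy data and names are illustrative):

```python
# Minimal PCA sketch: project 4-D toy data down to 2 components
# and check how much variance the low-dimensional view retains.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy data: 100 rows, 4 features

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)           # the low-dimensional representation

print(X_low.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)   # variance share kept by each PC
```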
Machine learning algorithms are readily available in our tools, such as Altair AI Studio, Azure ML Studio, Python libraries, etc. As a result, it is no longer important to understand the mathematical principles and assumptions behind the algorithms.
F
Linear algebra, calculus, optimization, and probability theory are the four mathematical fields mentioned in the textbook.
T
A derivative is a measure of the sensitivity of a function to changes in the function’s input(s).
T
Vectors and matrices are the building blocks in linear algebra.
T
The transpose of a matrix is the matrix with its rows and columns inverted.
T
Master data management does not require data governance.
F
Data classification is a way to define the various levels of confidentiality/security required by the organization.
T
Symmetric encryption is a good way for a bank to authenticate your account credentials.
F
Information security involves protecting data from unauthorized access.
T
Master data is the contextual data about the organization/entity used to increase the informativeness of transaction data.
T
Modeling in data science involves creating representations of real-world phenomena.
T
In information security, defense in depth is the concept that an organization uses multiple layers of security to protect sensitive/valuable data assets.
T
Multiple linear regression improves the power of your analysis by quantifying the cumulative effect of all features.
T
While doing our bivariate analysis, we found that the following attributes had the following R-square values with respect to the label attribute:
age: 0.29
education: 0.14
experience: 0.16
Given this information, we should expect that a regression model that includes these same three attributes against the same label will have an R-square value of 0.59.
F
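Why F: when predictors are correlated they share explained variance, so bivariate R-square values overlap rather than add. A sketch with made-up correlated data (all numbers and names are illustrative, assuming NumPy and scikit-learn):

```python
# Bivariate R-square values do not simply sum to the multivariate R-square:
# correlated predictors explain overlapping slices of the label's variance.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
age = rng.normal(40, 10, n)
experience = age - 22 + rng.normal(0, 3, n)   # strongly correlated with age
education = rng.normal(14, 2, n)
label = 1000 * age + 800 * experience + 500 * education \
        + rng.normal(0, 20000, n)

for name, col in [("age", age), ("education", education),
                  ("experience", experience)]:
    r2 = LinearRegression().fit(col.reshape(-1, 1), label) \
                           .score(col.reshape(-1, 1), label)
    print(f"{name}: R^2 = {r2:.2f}")

X = np.column_stack([age, education, experience])
full = LinearRegression().fit(X, label).score(X, label)
print(f"all three: R^2 = {full:.2f}  (not the sum of the above)")
```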
R-squared represents how well the regression model explains the variance in the label value.
T
Multiple linear regression (MLR) is but one of many algorithms used for multivariate modeling.
T
Scaling techniques such as LogNormal, MinMax, Z-score, Tanh, and Logistic are used to adjust the values of numeric variables.
T
K-means clustering is more robust to outliers than k-medoids clustering.
F
Cluster analysis is a form of supervised learning.
F
Lower values of the Calinski-Harabasz criterion indicate a better clustering solution.
F
K-medoids clustering identifies an actual data point for each cluster that is most centrally located.
T
Clustering requires us to specify a label attribute.
F
A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model.
T
Structured data includes text or image data.
F
Machine learning models can be and often are used without concern for interpretation.
T
The coefficient of determination (R-squared) indicates how well the predicted values explain the variation in the observed values.
T
Multiple linear regression differs from linear regression because it is generally used to predict categorical values.
F
The primary purpose of statistics is to make precise predictions about new observations.
F
The Mean Absolute Error (MAE) is less sensitive to outliers than the RMSE.
T
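A tiny worked example of why T: squaring inside the RMSE magnifies large errors, so a single outlier moves the RMSE far more than the MAE (toy error values, assuming NumPy):

```python
# One large outlier error moves RMSE much more than MAE,
# because squaring magnifies big deviations.
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0, 20.0])  # one outlier
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"MAE  = {mae:.2f}")    # 4.80
print(f"RMSE = {rmse:.2f}")   # 8.99
```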
The F-statistic is used to determine if the overall regression model is a good fit for the data.
T
Logistic regression requires numeric predictor attributes, so categorical attributes need to be converted to numeric attributes before analysis.
T
The Root Mean Squared Error (RMSE) is expressed in the same units as the dependent variable.
T
Clustering is a supervised learning method.
F
The adjusted coefficient of determination penalizes for the number of independent variables in the model.
T
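The usual penalty formula is adj R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. A small sketch (the sample values are illustrative):

```python
# Adjusted R-square shrinks as predictors are added without new information.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.59, n=100, p=3))   # ~0.577: mild penalty
print(adjusted_r2(0.59, n=100, p=30))  # ~0.412: heavier penalty
```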
Quantitative variables can be either discrete or continuous.
T
Machine learning involves implementing static algorithms to make predictions.
F
Bias refers to the error due to sensitivity to small fluctuations in the training set.
F
Deep learning is a subset of machine learning that uses neural networks with many layers.
T
Decision nodes are used in linear regression.
F
The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.
T
Underfitting occurs when the model does not adequately reflect the distribution of the training data.
T
When using clustering, the target variable does not have to be precisely defined at training time.
T
In the use phase, k-means algorithms classify new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.
F
Random Forests are relatively robust against worthless features.
T
Convolutional neural networks (CNNs) are an example of algorithmic generation of features.
F
Weak learners in ensemble methods may perform only slightly better than a random decision.
T
The Data Preparation phase in CRISP-DM includes selecting data and cleaning data.
T
Fliers in a Tukey box plot represent data values beyond the cap values.
T
Derived attributes are new attributes constructed from one or more existing attributes.
T
The empirical rule states that any data within three standard deviations of the mean is considered an outlier.
F
Data cleaning and transformation are one-time tasks and do not require iteration in a data science project.
F
Imputing the null value is usually a better option than using some type of average.
F
The Tukey box plot is a visual representation of the skewness of a distribution.
F
Outliers should always be removed from data sets.
F
The median is generally represented as a line within the quartile box in a Tukey box plot.
T
The caps in a Tukey box plot can only represent the true min/max values.
F
The empirical rule is relevant for categorical or binary data.
F
Missing data can be classified as Missing Completely at Random (MCAR).
T
The architecture of the original Watson AI was simple and relied on a single machine learning model.
F
Symbolic AI, also known as GOFAI, relies on machine learning algorithms.
F
Autonomous vehicles are considered the smartest robots built by mankind so far.
T
According to the author of the Papp et al. chapter on AI, machine learning can be used for purposes other than AI.
T
Early stage AI work focused exclusively on machine learning algorithms.
F
The first programming language introduced to help computers achieve symbolic intelligence was Python.
F
The architecture of AI solutions tends to become simpler over time.
F
One of the biggest problems in creating a good artificial general intelligence model is balancing the need to learn quickly from a small number of examples against the need to develop capabilities that apply generally.
T
The Universal Approximation Theorem states that an artificial neural network with one hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden units.
T
When modeling the crypto market, the authors were able to borrow a model for traditional equities markets and use it as is with excellent results.
T
Inductive biases help machine learning algorithms learn faster and more efficiently.
T
Model descriptions include a detailed description of the model and any special features.
T
The data mining engineer ranks the models according to evaluation criteria.
T
AI is just a fancy name for machine learning models.
F
All modeling techniques make the same assumptions about the data.
F
One disadvantage of using AI Studio’s automated feature selection operators is that the selection process is based on feature correlations and does not account for the chosen modeling algorithm.
F
When creating indicator features for a categorical attribute, you should create an indicator feature for each potential category value.
F
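Why F: k category values need only k - 1 indicator features, because the last one is implied by the others (the "dummy variable trap"). A sketch assuming pandas; the color column is made up:

```python
# One-hot encoding with the first indicator dropped: three colors
# need only two dummies, since "blue" is the implied baseline.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)   # columns: color_green, color_red
```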
One approach during the test design phase involves separating the data set into training and test sets.
T
Parameter settings are adjusted during the data preparation phase.
F
Classification models predict a numeric label value.
F
The lifecycle of a modelling and simulation project is linear and straightforward.
F
Black-Box models provide more insight into actual dynamics than White-Box models.
F
Verification answers the question, “Is the model developed right?”
T
The iterative process of modelling is depicted in the following figure:
Problem Formulation → Modeling Concept → Usefulness → Validation → Answer to the Problem
- Validation points back to Modeling Concept and Problem Formulation with "No"
- Usefulness points back to Modeling Concept and Problem Formulation with "No"
T
Calibration is used when parameter values are unknown and need to be estimated.
T
According to the textbook, ODE and SD are two of the most common macroscopic modeling techniques.
T
According to the textbook readings, modeling methods can be classified into two primary categories: macroscopic methods and microscopic methods.
T
The term “dynamic” in modelling refers to time-dependent components.
T
Partial differential equations (PDEs) are used to describe system behavior over time.
F
Documentation is crucial for the reproducibility of simulation models.
T
Validation answers the question, “Is the right model developed?”
T
Visualization is not necessary for documenting and validating simulation models.
F
What is the purpose of data cleaning in the data preparation phase?
To identify and correct errors or inconsistencies in the dataset.
Define feature engineering.
The process of creating new features from existing data to improve model performance.
How do you handle missing values in a dataset?
Techniques include imputation (filling in values) or deletion (removing missing data).
What types of visualizations are commonly used in EDA?
Histograms, box plots, and scatter plots to uncover patterns in data.
Why is correlation analysis important?
It helps examine relationships between variables, indicating potential dependencies.
What are outliers, and why should they be identified?
Outliers are data points that differ significantly from others; they can skew analysis results.
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data; unsupervised learning uses unlabeled data.
Name a common metric used for evaluating model performance.
Accuracy, precision, recall, F1-score, or ROC-AUC.
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns noise instead of the underlying pattern; it can be prevented by using regularization techniques and cross-validation.
What are ensemble methods in machine learning?
Techniques that combine multiple models to improve overall accuracy (e.g., Random Forest, Boosting).
What is the purpose of dimensionality reduction?
To simplify models while retaining essential information, often using techniques like PCA.
How is Natural Language Processing (NLP) used in data science?
NLP techniques are used to process and analyze textual data.
What is the significance of model monitoring?
To evaluate models against real-world data and ensure continued accuracy.
Why is documentation important in model deployment?
It maintains clear records of model specifications, performance metrics, and changes made.
How can models be updated post-deployment?
By retraining models with new data to adapt to changing conditions.
What are the potential impacts of bias in data science?
Bias can lead to unfair or inaccurate model predictions affecting decisions and outcomes.
What measures can be taken to ensure data privacy?
Implementing security measures and complying with regulations like GDPR.
Why is transparency important in machine learning models?
It promotes user trust by clarifying algorithms and decision-making processes.
Supervised Learning
You know the independent (input) and dependent (output) variables.
Unsupervised Learning
You do NOT know the labels for output variables. The model identifies patterns or groupings on its own.
CRISP-DM
Cross-Industry Standard Process for Data Mining. Steps:
Business Understanding: Define project objectives and requirements.
Data Understanding: Explore initial data, find quality issues, and insights.
Data Prep: Clean, format, and transform data for modeling.
Modeling: Select and apply appropriate algorithms.
Evaluation: Check that the model meets the business objectives.
Deployment: Put the model into use and plan monitoring.
Know each stage and tasks involved.
Regression Evaluation Metrics
Common metrics include:
P-Value: Tests whether a coefficient differs significantly from zero (variable significance).
R-Squared: Proportion of variance explained by the model.
Adjusted R-Squared: Adjusts R-Squared based on the number of predictors.
Understand each metric’s role in model evaluation.
Assumptions of MLR
Key assumptions include:
Linearity, independence, homoscedasticity, and normality of residuals.
Review each assumption for accurate model performance
Distance
Measures similarity or dissimilarity between data points, used in clustering and k-NN.
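Two common distance measures, sketched with NumPy (the points are arbitrary):

```python
# Euclidean (straight-line) and Manhattan (city-block) distance.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))
manhattan = np.sum(np.abs(a - b))
print(euclidean, manhattan)   # 5.0 7.0
```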
Probability
Likelihood of a particular outcome; foundational in predictive modeling and statistics.
Bayes' Theorem
Method to calculate conditional probability: P(A|B) = P(B|A)P(A) / P(B). Know the basic formula and its application.
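A worked example of the formula with illustrative numbers (a hypothetical diagnostic test; none of these rates come from the course material):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Hypothetical test: 99% sensitive, 95% specific, 1% prevalence.
p_a = 0.01                 # P(condition)
p_b_given_a = 0.99         # P(positive | condition)
p_b_given_not_a = 0.05     # P(positive | no condition)

# Total probability of a positive result.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(condition | positive) = {p_a_given_b:.3f}")   # ~0.167
```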
Naïve Bayes
Simplified application of Bayes' theorem that assumes feature independence. Used for classification tasks.
Entropy
Measurement of data disorder, specifically for labels.
Interpretation: How mixed positive and negative groups are in the dataset.
Information Gain
Reduction in entropy after splitting data; indicates improvement in data classification.
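A from-scratch sketch of both quantities for a binary label (the toy labels and split are made up; only the standard library is needed):

```python
# Entropy of a label set, and the information gain of a candidate split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["+", "+", "+", "+", "-", "-", "-", "-"]
left, right = ["+", "+", "+", "-"], ["+", "-", "-", "-"]
print(entropy(parent))                          # 1.0 (maximally mixed)
print(information_gain(parent, [left, right]))  # ~0.19: disorder reduced
```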
Data Prep Basics
Preprocessing steps:
Convert data to correct type.
Set target labels.
Create indicator and ordinal variables.
Address skewness.
Standardize or normalize data (e.g., Min-Max, Z-score).
Data Leaks
Using data that wouldn’t be available at prediction time, leading to overly optimistic models.
Accuracy
Ratio of correct predictions over total predictions.
Formula: (True Positives + True Negatives) / Total Predictions.
Missing Values and how to handle
Types:
Missing Completely at Random (MCAR): No pattern to missingness.
Missing at Random (MAR): Missingness related to observed data.
Missing Not at Random (MNAR): Missingness related to unobserved data.
Know strategies for each type.
Classification evaluation metrics
- accuracy
- AUC
- F-Score
confusion matrix
cross-tab table comparing predicted classes against actual classes, showing how many predictions of each class were right and wrong
- should be sized n x n because you don't always have a binomial label
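A sketch assuming scikit-learn, with a made-up 3-class example to show the n x n shape:

```python
# Confusion matrix for a 3-class label (3 x 3, not just 2 x 2),
# plus accuracy = correct predictions / total predictions (4/6 here).
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

labels = ["bird", "cat", "dog"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = actual class
print(accuracy_score(y_true, y_pred))                   # 0.666...
```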
clustering
Grouping data points into clusters based on similarity; an unsupervised learning technique.
k-NN
Classification or regression based on the ‘k’ closest points (neighbors) to a target.
k: any positive integer (the number of neighbors considered)
NN: nearest neighbors
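A minimal sketch assuming scikit-learn and its bundled Iris data; it also echoes the earlier 2-NN vs. 20-NN card, since small k tracks the training data more closely:

```python
# k-NN classification: compare a small and a large k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for k in (2, 20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: train={knn.score(X_train, y_train):.2f} "
          f"test={knn.score(X_test, y_test):.2f}")
```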
K-means
Partitioning data into ‘k’ clusters with each point assigned to the nearest cluster centroid.
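A minimal sketch assuming scikit-learn, on synthetic blobs; it also ties in the Calinski-Harabasz card above (higher scores indicate a better clustering):

```python
# K-means for several k, scored with the Calinski-Harabasz criterion.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: CH score = {calinski_harabasz_score(X, labels):.0f}")
```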
Decision Trees
Tree-structured classifiers that split data based on feature values to make predictions.
Correcting for skew/kurtosis
Log transformation, square root, or Box-Cox to normalize data distributions.
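A sketch assuming NumPy and SciPy, applied to a synthetic right-skewed sample:

```python
# Compare skewness before and after common corrective transformations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed data

print(f"raw:         skew = {stats.skew(x):.2f}")
print(f"log:         skew = {stats.skew(np.log(x)):.2f}")
print(f"square root: skew = {stats.skew(np.sqrt(x)):.2f}")
x_bc, _ = stats.boxcox(x)        # Box-Cox chooses its own lambda
print(f"box-cox:     skew = {stats.skew(x_bc):.2f}")
```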
standardizing/normalizing data values
Adjust data to a common scale.
Standardizing: Center data around 0 (Z-score).
Normalizing: Scale data between a set range (e.g., 0-1, Min-Max).
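A side-by-side sketch with NumPy (toy values):

```python
# Z-score standardization vs. Min-Max normalization of one column.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

z = (x - x.mean()) / x.std()              # mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())  # rescaled into [0, 1]
print(z.round(2))
print(mm.round(2))
```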
feature selection
Choosing the features that help model performance and eliminating the ones that hurt it
inductive logic
Inferring general patterns from specific observations (bottom-up approach)
deductive logic
Starts with general rules or observations to reach specific conclusions (top-down approach)