Test 2 Flashcards

1
Q

The goal of PCA is to find a high-dimensional representation of the data that maintains as much information as possible.

A

F

2
Q

Machine learning algorithms are readily available in our tools, such as Altair AI Studio, Azure ML Studio, Python libraries, etc. As a result, it is no longer important to understand the mathematical principles and assumptions behind the algorithms.

A

F

3
Q

Linear algebra, calculus, optimization, and probability theory are the four mathematical fields mentioned in the textbook.

A

T

4
Q

A derivative is a measure of the sensitivity of a function to changes in the function’s input(s).

A

T
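
A minimal sketch of this idea, assuming a central-difference approximation (the helper name `derivative` and the step size `h` are illustrative choices, not from the course material): the estimate shows how much the output changes per unit change in the input.

```python
def derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


# For f(x) = x**2 the exact derivative at x = 3 is 6.
slope = derivative(lambda x: x ** 2, 3.0)
print(round(slope, 4))
```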

5
Q

Vectors and matrices are the building blocks in linear algebra.

A

T

6
Q

The transpose of a matrix is the matrix with its rows and columns interchanged.

A

T
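
A small sketch of the transpose, assuming the matrix is stored as a list of rows (the helper name `transpose` is made up for illustration): each row of the input becomes a column of the output.

```python
def transpose(matrix):
    """Return the transpose of a matrix given as a list of equal-length rows."""
    # zip(*matrix) groups the i-th entry of every row, which is column i.
    return [list(column) for column in zip(*matrix)]


A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
A_T = transpose(A)     # 3 x 2
print(A_T)             # [[1, 4], [2, 5], [3, 6]]
```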

7
Q

Master data management does not require data governance.

A

F

8
Q

Data classification is a way to define the various levels of confidentiality/security required by the organization.

A

T

9
Q

Symmetric encryption is a good way for a bank to authenticate your account credentials.

A

F

10
Q

Information security involves protecting data from unauthorized access.

A

T

11
Q

Master data is the contextual data about the organization/entity used to increase the informativeness of transaction data.

A

T

12
Q

Modeling in data science involves creating representations of real-world phenomena.

A

T

12
Q

In information security, defense in depth is the concept that an organization uses multiple layers of security to protect sensitive/valuable data assets.

A

T

13
Q

Multiple linear regression improves the power of your analysis by quantifying the cumulative effect of all features.

A

T

14
Q

While doing our bivariate analysis, we found that the following attributes had the following r-square values with respect to the label attribute:
age: 0.29
education: 0.14
experience: 0.16
Given this information we should expect a regression model that includes these same three attributes against the same label will have an R-square value of 0.59.

A

F

15
Q

R² (R-squared) represents how well the regression model explains the variance in the label value.

A

T
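
A worked sketch of this card, assuming the usual definition R² = 1 - SS_res / SS_tot (the function name and the sample numbers are made up for illustration):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Variation the model fails to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the label around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]
r2 = r_squared(actual, predicted)
print(round(r2, 2))  # 0.95
```

A model that always predicts the mean of the label scores 0 under this definition, which is why R² reads as "share of variance explained".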

16
Q

Multiple linear regression (MLR) is but one of many algorithms used for multivariate modeling.

A

T

17
Q

Scaling techniques such as LogNormal, MinMax, Z-score, Tanh, and Logistic are used to adjust the values of numeric variables.

A

T

18
Q

K-means clustering is more robust to outliers than k-medoids clustering.

A

F

19
Q

Cluster analysis is a form of supervised learning.

A

F

20
Q

Lower values of the Calinski-Harabasz criterion indicate a better clustering solution.

A

F

20
Q

K-medoids clustering identifies an actual data point for each cluster that is most centrally located.

A

T

21
Q

Clustering requires us to specify a label attribute.

A

F

22
Q

A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model.

A

T

23
Q

Structured data includes text or image data.

A

F

24
Q

Machine learning models can be and often are used without concern for interpretation.

A

T

25
Q

The coefficient of determination (R-squared) indicates how well the predicted values explain the variation in the observed values.

A

T

26
Q

Multiple linear regression differs from linear regression because it is generally used to predict categorical values.

A

F

26
Q

The primary purpose of statistics is to make precise predictions about new observations.

A

F

27
Q

The Mean Absolute Error (MAE) is less sensitive to outliers than the RMSE.

A

T
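
A quick sketch of why this holds, using made-up numbers: RMSE squares each error before averaging, so one large miss dominates it, while MAE grows only linearly with that miss.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [10, 10, 10, 10]
small_errors = [11, 9, 11, 9]   # every prediction is off by 1
one_outlier = [11, 9, 11, 0]    # the last prediction is off by 10

print(mae(actual, small_errors), rmse(actual, small_errors))   # 1.0 1.0
print(mae(actual, one_outlier), rmse(actual, one_outlier))     # MAE 3.25, RMSE about 5.07
```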

27
Q

The F-statistic is used to determine if the overall regression model is a good fit for the data.

A

T

28
Q

Logistic regression requires numeric predictor attributes, so categorical attributes need to be converted to numeric attributes before analysis.

A

T

29
Q

The Root Mean Squared Error (RMSE) is expressed in the same units as the dependent variable.

A

T

30
Q

Clustering is a supervised learning method.

A

F

31
Q

The adjusted coefficient of determination penalizes for the number of independent variables in the model.

A

T
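
A sketch of the penalty, assuming the common textbook formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors (the formula and the sample numbers are not taken from the card itself):

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared: shrinks R-squared as predictors are added."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same raw R-squared and sample size, different numbers of predictors.
few = adjusted_r_squared(0.80, n_samples=50, n_predictors=3)
many = adjusted_r_squared(0.80, n_samples=50, n_predictors=10)
print(round(few, 4), round(many, 4))
```

With the raw R² held fixed, the model with more predictors gets the lower adjusted value.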

32
Q

Quantitative variables can be either discrete or continuous.

A

T

33
Q

Machine learning involves implementing static algorithms to make predictions.

A

F

34
Q

Bias refers to the error due to sensitivity to small fluctuations in the training set.

A

F

35
Q

Deep learning is a subset of machine learning that uses neural networks with many layers.

A

T

37
Q

Decision nodes are used in linear regression.

A

F

37
Q

The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.

A

T

38
Q

Underfitting occurs when the model does not adequately reflect the distribution of the training data.

A

T

39
Q

When using clustering, the target variable does not have to be precisely defined at training time.

A

T

40
Q

In the use phase, k-means algorithms classify new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.

A

F

41
Q

Random Forests are relatively robust against worthless features.

A

T

42
Q

Convolutional neural networks (CNNs) are an example of algorithmic generation of features.

A

F

43
Q

Weak learners in ensemble methods may perform only slightly better than a random decision.

A

T

44
Q

The Data Preparation phase in CRISP-DM includes selecting data and cleaning data.

A

T

45
Q

Fliers in a Tukey box plot represent data values beyond the cap values.

A

T

46
Q

Derived attributes are new attributes constructed from one or more existing attributes.

A

T

46
Q

The empirical rule states that any data within three standard deviations of the mean is considered an outlier.

A

F

47
Q

Data cleaning and transformation are one-time tasks and do not require iteration in a data science project.

A

F

48
Q

Imputing the null value is usually a better option than using some type of average.

A

F

49
Q

The Tukey box plot is a visual representation of the skewness of a distribution.

A

F

50
Q

Outliers should always be removed from data sets.

A

F

51
Q

The median is generally represented as a line within the quartile box in a Tukey box plot.

A

T

52
Q

The caps in a Tukey box plot can only represent the true min/max values.

A

F

53
Q

The empirical rule is relevant for categorical or binary data.

A

F

54
Q

Missing data can be classified as Missing Completely at Random (MCAR).

A

T

55
Q

The architecture of the original Watson AI was simple and relied on a single machine learning model.

A

F

56
Q

Symbolic AI, also known as GOFAI, relies on machine learning algorithms.

A

F

57
Q

Autonomous vehicles are considered the smartest robots built by mankind so far.

A

T

58
Q

According to the author of the Papp et al. chapter on AI, machine learning can be used for purposes other than AI.

A

T

58
Q

Early stage AI work focused exclusively on machine learning algorithms.

A

F

59
Q

The first programming language introduced to help computers achieve symbolic intelligence was Python.

A

F

60
Q

The architecture of AI solutions tends to become simpler over time.

A

F

61
Q

One of the biggest problems in creating a good artificial generalized intelligence model is figuring out how to balance the need to be able to learn quickly from a small number of examples and the need to develop capabilities that apply generally.

A

T

62
Q

The Universal Approximation Theorem states that an artificial neural network with one hidden layer can approximate any mathematical function.

A

T

62
Q

When modeling the crypto market, the authors were able to borrow a model for traditional equities markets and use it as is with excellent results.

A

T

63
Q

Inductive biases help machine learning algorithms learn faster and more efficiently.

A

T

63
Q

Model descriptions include a detailed description of the model and any special features.

A

T

63
Q

The data mining engineer ranks the models according to evaluation criteria.

A

T

64
Q

AI is just a fancy name for machine learning models.

A

F

65
Q

All modeling techniques make the same assumptions about the data.

A

F

66
Q

One disadvantage of using AI Studio’s automated feature selection operators is that the selection process is based on feature correlations and does not account for the chosen modeling algorithm.

A

F

67
Q

When creating indicator features for a categorical attribute, you should create an indicator feature for each potential category value.

A

F
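
A sketch of the usual alternative, assuming the standard k-1 encoding where one category serves as the baseline (the helper name and the `is_` column prefix are made up for illustration): the baseline needs no column of its own because it is implied when every other indicator is 0.

```python
def indicator_features(values, baseline):
    """Encode a categorical column as k-1 indicator (0/1) features.

    The baseline category gets no feature; it is the row of all zeros.
    """
    categories = sorted(set(values) - {baseline})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


colors = ["red", "green", "blue", "green"]
encoded = indicator_features(colors, baseline="blue")
for color, row in zip(colors, encoded):
    print(color, row)
```

Three category values yield two indicator features here; a third would be perfectly predictable from the other two.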

68
Q

One approach during the test design phase involves separating the data set into training and test sets.

A

T
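
A minimal sketch of such a split, assuming a simple shuffled holdout (the function name, the 25% test fraction, and the seed are illustrative choices, not prescribed by the course):

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 15 5
```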

69
Q

Parameter settings are adjusted during the data preparation phase.

A

F

70
Q

Classification models predict a numeric label value.

A

F

71
Q

The lifecycle of a modelling and simulation project is linear and straightforward.

A

F

72
Q

Black-Box models provide more insight into actual dynamics than White-Box models.

A

F

73
Q

Verification answers the question, “Is the model developed right?”

A

T

74
Q

The iterative process of modelling is depicted in the following figure:
Problem Formulation –> Modeling Concept –> Usefulness –> Validation –> Answer to the Problem

  • Validation pointing to Modeling Concept and Problem Formulation with “No”
  • Usefulness pointing to Modeling Concept and Problem Formulation with “No”

A

T

75
Q

Calibration is used when parameter values are unknown and need to be estimated.

A

T

76
Q

According to the textbook, ODE and SD are two of the most common microscopic modeling techniques.

A

T

77
Q

According to the textbook readings, modeling methods can be classified into two primary categories: macroscopic methods and microscopic methods.

A

T

78
Q

The term “dynamic” in modelling refers to time-dependent components.

A

T

79
Q

Partial differential equations (PDEs) are used to describe systems behavior over time.

A

F

80
Q

Documentation is crucial for the reproducibility of simulation models.

A

T

81
Q

Validation answers the question, “Is the right model developed?”

A

T

82
Q

Visualization is not necessary for documenting and validating simulation models.

A

F

83
Q

What is the purpose of data cleaning in a data preparation phase?

A

To identify and correct errors or inconsistencies in the dataset.

84
Q

Define feature engineering.

A

The process of creating new features from existing data to improve model performance.

85
Q

How do you handle missing values in a dataset?

A

Techniques include imputation (filling in values) or deletion (removing missing data).

86
Q

What types of visualizations are commonly used in EDA?

A

Histograms, box plots, and scatter plots to uncover patterns in data.

87
Q

Why is correlation analysis important?

A

It helps examine relationships between variables, indicating potential dependencies.

88
Q

What are outliers, and why should they be identified?

A

Outliers are data points that differ significantly from others; they can skew analysis results.

89
Q

What is the difference between supervised and unsupervised learning?

A

Supervised learning uses labeled data; unsupervised learning uses unlabeled data.

90
Q

Name a common metric used for evaluating model performance.

A

Accuracy, precision, recall, F1-score, or ROC-AUC.

91
Q

What is overfitting, and how can it be prevented?

A

Overfitting occurs when a model learns noise instead of the underlying pattern; it can be prevented by using regularization techniques and cross-validation.

92
Q

What are ensemble methods in machine learning?

A

Techniques that combine multiple models to improve overall accuracy (e.g., Random Forest, Boosting).

93
Q

What is the purpose of dimensionality reduction?

A

To simplify models while retaining essential information, often using techniques like PCA.

94
Q

How is Natural Language Processing (NLP) used in data science?

A

NLP techniques are used to process and analyze textual data.

95
Q

What is the significance of model monitoring?

A

To evaluate models against real-world data and ensure continued accuracy.

96
Q

Why is documentation important in model deployment?

A

It maintains clear records of model specifications, performance metrics, and changes made.

97
Q

How can models be updated post-deployment?

A

By retraining models with new data to adapt to changing conditions.

98
Q

What are the potential impacts of bias in data science?

A

Bias can lead to unfair or inaccurate model predictions affecting decisions and outcomes.

99
Q

What measures can be taken to ensure data privacy?

A

Implementing security measures and complying with regulations like GDPR.

100
Q

Why is transparency important in machine learning models?

A

It promotes user trust by clarifying algorithms and decision-making processes.

101
Q

Supervised Learning

A

You know the independent (input) and dependent (output) variables.

102
Q

Unsupervised Learning

A

You do NOT know the labels for output variables. The model identifies patterns or groupings on its own.

103
Q

CRISP-DM

A

Cross-Industry Standard Process for Data Mining. Steps covered here (three of the six phases):

Data Understanding: Explore initial data, find quality issues, and insights.
Data Prep: Clean, format, and transform data for modeling.
Modeling: Select and apply appropriate algorithms.
Know each stage and the tasks involved.

104
Q

Regression Evaluation Metrics

A

Common metrics include:

P-Value: Indicates whether a predictor's coefficient is statistically significant.
R-Squared: Proportion of variance explained by the model.
Adjusted R-Squared: Adjusts R-Squared based on the number of predictors.
Understand each metric’s role in model evaluation.

105
Q

Assumptions of MLR

A

Key assumptions include:

Linearity, independence, homoscedasticity, and normality of residuals.
Review each assumption for accurate model performance.

106
Q

Distance

A

Measures similarity or dissimilarity between data points, used in clustering and k-NN.

107
Q

Probability

A

Likelihood of a particular outcome; foundational in predictive modeling and statistics.

108
Q

Bayes' Theorem

A

Method to calculate conditional probability. Know the basic formula and its application.
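
A worked sketch of the basic formula, P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded over A and not-A. The spam-filter probabilities are invented for illustration, and the function name is hypothetical.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Total probability of the evidence B.
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 20% of mail is spam, the word "free" appears in
# 60% of spam and in 5% of legitimate mail.
p_spam_given_free = bayes_posterior(prior=0.20, likelihood=0.60, likelihood_given_not=0.05)
print(round(p_spam_given_free, 2))  # 0.75
```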

109
Q

Naïve Bayes

A

Classifier based on Bayes' theorem that assumes the features are independent of one another given the class. Used for classification tasks.

110
Q

Entropy

A

Measurement of data disorder, specifically for labels.

Interpretation: How mixed positive and negative groups are in the dataset.

111
Q

Information Gain

A

Reduction in entropy after splitting data; indicates improvement in data classification.
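
A small sketch tying entropy and information gain together, assuming base-2 entropy over label counts (the function names and the toy labels are made up for illustration):

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["+", "+", "+", "+", "-", "-", "-", "-"]   # evenly mixed: entropy 1.0
left = ["+", "+", "+", "-"]
right = ["+", "-", "-", "-"]
gain = information_gain(parent, [left, right])
print(round(entropy(parent), 3), round(gain, 3))    # 1.0 0.189
```

A split that separated the two classes perfectly would bring the child entropy to 0 and the gain to the full 1.0 bit.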

112
Q

Data Prep Basics

A

Preprocessing steps:

Convert data to correct type.
Set target labels.
Create indicator and ordinal variables.
Address skewness.
Standardize or normalize data (e.g., Min-Max, Z-score).

113
Q

Data Leaks

A

Using data that wouldn’t be available at prediction time, leading to overly optimistic models.

114
Q

Accuracy

A

Ratio of correct predictions over total predictions.

Formula: (True Positives + True Negatives) / Total Predictions.
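
A quick sketch of the formula with invented counts; the error rate from the earlier card is simply 1 minus this value.

```python
def accuracy(tp, tn, fp, fn):
    """(True Positives + True Negatives) / Total Predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up counts for 100 predictions.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
error_rate = 1 - acc
print(acc, round(error_rate, 2))  # 0.85 0.15
```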

115
Q

Missing values and how to handle them

A

Types:
Missing Completely at Random (MCAR): No pattern to missingness.
Missing Not at Random (MNAR): Missingness related to unobserved data.

Know strategies for each type.

116
Q

Classification evaluation metrics

A

  • Accuracy
  • AUC
  • F-score

117
Q

confusion matrix

A

Cross-tabulation of predicted versus actual label values, showing how many predictions for each class were right and how many were wrong.
- Sized n x n for a label with n classes, because the label is not always binominal (two-class).
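
A sketch of an n x n confusion matrix for a three-class label, assuming rows are actual classes and columns are predicted classes (tools differ on this orientation; the data and function name are made up):

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


classes = ["cat", "dog", "bird"]
actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

Correct predictions sit on the diagonal (4 of 6 here); every off-diagonal cell is a specific kind of mistake.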

118
Q

clustering

A

Grouping data points into clusters based on similarity; an unsupervised learning technique.

119
Q

k-NN

A

Classification or regression based on the ‘k’ closest points (neighbors) to a target.

k: the number of neighbors consulted (any positive integer)
nn: nearest neighbors
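
A minimal k-NN classification sketch, assuming Euclidean distance and a simple majority vote (other distance measures and weighted votes are also used; the training points are invented):

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """training: list of (point, label) pairs; returns the majority label
    among the k training points closest to the query point."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_classify(training, (2, 2), k=3))  # A
print(knn_classify(training, (6, 5), k=3))  # B
```

Nothing is fitted ahead of time: the training instances themselves are the model, and all the work happens when a new point is classified.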

120
Q

K-means

A

Partitioning data into ‘k’ clusters with each point assigned to the nearest cluster centroid.
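
A compact sketch of the assign-then-update loop, assuming Euclidean distance, hand-picked starting centroids, and a fixed number of iterations (real tools usually pick starting points randomly and stop on convergence):

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Unlike k-medoids, the resulting centroids are means and need not coincide with any actual data point.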

121
Q

Decision Trees

A

Tree-structured classifiers that split data based on feature values to make predictions.

122
Q

Correcting for skew/kurtosis

A

Log transformation, square root, or Box-Cox to normalize data distributions.

123
Q

standardizing/normalizing data values

A

Adjust data to a common scale.

Standardizing: Center data around 0 with a standard deviation of 1 (Z-score).
Normalizing: Scale data between a set range (e.g., 0-1, Min-Max).
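
A side-by-side sketch of both rescalings, assuming the population standard deviation for the Z-score (some tools use the sample version instead; the ages are made up):

```python
import statistics


def z_score(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]


def min_max(values, low=0.0, high=1.0):
    """Normalize: rescale linearly into the range [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]
print([round(z, 3) for z in z_score(ages)])
print(min_max(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```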

124
Q

feature selection

A

Choosing the features that help the model and eliminating those that hurt its performance.

125
Q

inductive logic

A

Inferring general patterns from specific observations (bottom-up approach).

126
Q

deductive logic

A

Starting from general rules or premises to reach specific conclusions (top-down approach).