Test 2 Flashcards
The goal of PCA is to find a high-dimensional representation of the data that maintains as much information as possible.
F
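For the record, PCA looks for a low-dimensional representation that preserves as much variance as possible. A minimal sketch, assuming scikit-learn and NumPy (the toy data and names are illustrative):

```python
# Minimal PCA sketch: project 4-D toy data down to 2 components
# and check how much variance the low-dimensional view retains.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # toy data: 100 rows, 4 features

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)           # the low-dimensional representation

print(X_low.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)   # variance share kept by each PC
```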
Machine learning algorithms are readily available in our tools, such as Altair AI Studio, Azure ML Studio, Python libraries, etc. As a result, it is no longer important to understand the mathematical principles and assumptions behind the algorithms.
F
Linear algebra, calculus, optimization, and probability theory are the four mathematical fields mentioned in the textbook.
T
A derivative is a measure of the sensitivity of a function to changes in the function’s input(s).
T
Vectors and matrices are the building blocks in linear algebra.
T
The transpose of a matrix is the matrix with its rows and columns inverted.
T
Master data management does not require data governance.
F
Data classification is a way to define the various levels of confidentiality/security required by the organization.
T
Symmetric encryption is a good way for a bank to authenticate your account credentials.
F
Information security involves protecting data from unauthorized access.
T
Master data is the contextual data about the organization/entity used to increase the informativeness of transaction data.
T
Modeling in data science involves creating representations of real-world phenomena.
T
In information security, defense in depth is the concept that an organization uses multiple layers of security to protect sensitive/valuable data assets.
T
Multiple linear regression improves the power of your analysis by quantifying the cumulative effect of all features.
T
While doing our bivariate analysis, we found that the following attributes had the following R-square values with respect to the label attribute:
age: 0.29
education: 0.14
experience: 0.16
Given this information, we should expect that a regression model that includes these same three attributes against the same label will have an R-square value of 0.59.
F
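Why F: when predictors are correlated they share explained variance, so bivariate R-square values overlap rather than add. A sketch with made-up correlated data (all numbers and names are illustrative, assuming NumPy and scikit-learn):

```python
# Bivariate R-square values do not simply sum to the multivariate R-square:
# correlated predictors explain overlapping slices of the label's variance.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
age = rng.normal(40, 10, n)
experience = age - 22 + rng.normal(0, 3, n)   # strongly correlated with age
education = rng.normal(14, 2, n)
label = 1000 * age + 800 * experience + 500 * education \
        + rng.normal(0, 20000, n)

for name, col in [("age", age), ("education", education),
                  ("experience", experience)]:
    r2 = LinearRegression().fit(col.reshape(-1, 1), label) \
                           .score(col.reshape(-1, 1), label)
    print(f"{name}: R^2 = {r2:.2f}")

X = np.column_stack([age, education, experience])
full = LinearRegression().fit(X, label).score(X, label)
print(f"all three: R^2 = {full:.2f}  (not the sum of the above)")
```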
R-squared represents how well the regression model explains the variance in the label value.
T
Multiple linear regression (MLR) is but one of many algorithms used for multivariate modeling.
T
Scaling techniques such as LogNormal, MinMax, Z-score, Tanh, and Logistic are used to adjust the values of numeric variables.
T
K-means clustering is more robust to outliers than k-medoids clustering.
F
Cluster analysis is a form of supervised learning.
F
Lower values of the Calinski-Harabasz criterion indicate a better clustering solution.
F
K-medoids clustering identifies an actual data point for each cluster that is most centrally located.
T
Clustering requires us to specify a label attribute.
F
A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model.
T
Structured data includes text or image data.
F
Machine learning models can be and often are used without concern for interpretation.
T
The coefficient of determination (R-squared) indicates how well the predicted values explain the variation in the observed values.
T
Multiple linear regression differs from linear regression because it is generally used to predict categorical values.
F
The primary purpose of statistics is to make precise predictions about new observations.
F
The Mean Absolute Error (MAE) is less sensitive to outliers than the RMSE.
T
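A tiny worked example of why T: squaring inside the RMSE magnifies large errors, so a single outlier moves the RMSE far more than the MAE (toy error values, assuming NumPy):

```python
# One large outlier error moves RMSE much more than MAE,
# because squaring magnifies big deviations.
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 1.0, 20.0])  # one outlier
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"MAE  = {mae:.2f}")    # 4.80
print(f"RMSE = {rmse:.2f}")   # 8.99
```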
The F-statistic is used to determine if the overall regression model is a good fit for the data.
T
Logistic regression requires numeric predictor attributes, so categorical attributes need to be converted to numeric attributes before analysis.
T
The Root Mean Squared Error (RMSE) is expressed in the same units as the dependent variable.
T
Clustering is a supervised learning method.
F
The adjusted coefficient of determination penalizes for the number of independent variables in the model.
T
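The usual penalty formula is adj R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. A small sketch (the sample values are illustrative):

```python
# Adjusted R-square shrinks as predictors are added without new information.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of observations, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.59, n=100, p=3))   # ~0.577: mild penalty
print(adjusted_r2(0.59, n=100, p=30))  # ~0.412: heavier penalty
```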
Quantitative variables can be either discrete or continuous.
T
Machine learning involves implementing static algorithms to make predictions.
F
Bias refers to the error due to sensitivity to small fluctuations in the training set.
F
Deep learning is a subset of machine learning that uses neural networks with many layers.
T
Decision nodes are used in linear regression.
F
The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.
T
Underfitting occurs when the model does not adequately reflect the distribution of the training data.
T
When using clustering, the target variable does not have to be precisely defined at training time.
T
In the use phase, k-means algorithms classify new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.
F
Random Forests are relatively robust against worthless features.
T
Convolutional neural networks (CNNs) are an example of algorithmic generation of features.
F
Weak learners in ensemble methods may perform only slightly better than a random decision.
T
The Data Preparation phase in CRISP-DM includes selecting data and cleaning data.
T
Fliers in a Tukey box plot represent data values beyond the cap values.
T
Derived attributes are new attributes constructed from one or more existing attributes.
T
The empirical rule states that any data within three standard deviations of the mean is considered an outlier.
F
Data cleaning and transformation are one-time tasks and do not require iteration in a data science project.
F
Imputing the null value is usually a better option than using some type of average.
F
The Tukey box plot is a visual representation of the skewness of a distribution.
F
Outliers should always be removed from data sets.
F
The median is generally represented as a line within the quartile box in a Tukey box plot.
T
The caps in a Tukey box plot can only represent the true min/max values.
F
The empirical rule is relevant for categorical or binary data.
F
Missing data can be classified as Missing Completely at Random (MCAR).
T
The architecture of the original Watson AI was simple and relied on a single machine learning model.
F
Symbolic AI, also known as GOFAI, relies on machine learning algorithms.
F
Autonomous vehicles are considered the smartest robots built by mankind so far.
T
According to the author of the Papp et al. chapter on AI, machine learning can be used for purposes other than AI.
T
Early stage AI work focused exclusively on machine learning algorithms.
F
The first programming language introduced to help computers achieve symbolic intelligence was Python.
F
The architecture of AI solutions tends to become simpler over time.
F
One of the biggest problems in creating a good artificial general intelligence model is balancing the need to learn quickly from a small number of examples against the need to develop capabilities that apply generally.
T
The Universal Approximation Theorem states that an artificial neural network with one hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden units.
T
When modeling the crypto market, the authors were able to borrow a model for traditional equities markets and use it as is with excellent results.
T
Inductive biases help machine learning algorithms learn faster and more efficiently.
T
Model descriptions include a detailed description of the model and any special features.
T
The data mining engineer ranks the models according to evaluation criteria.
T
AI is just a fancy name for machine learning models.
F
All modeling techniques make the same assumptions about the data.
F
One disadvantage of using AI Studio’s automated feature selection operators is that the selection process is based on feature correlations and does not account for the chosen modeling algorithm.
F
When creating indicator features for a categorical attribute, you should create an indicator feature for each potential category value.
F
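Why F: k category values need only k - 1 indicator features, because the last one is implied by the others (the "dummy variable trap"). A sketch assuming pandas; the color column is made up:

```python
# One-hot encoding with the first indicator dropped: three colors
# need only two dummies, since "blue" is the implied baseline.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)   # columns: color_green, color_red
```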
One approach during the test design phase involves separating the data set into training and test sets.
T
Parameter settings are adjusted during the data preparation phase.
F
Classification models predict a numeric label value.
F
The lifecycle of a modelling and simulation project is linear and straightforward.
F
Black-Box models provide more insight into actual dynamics than White-Box models.
F
Verification answers the question, “Is the model developed right?”
T
The iterative process of modelling is depicted in the following figure:
Problem Formulation → Modeling Concept → Usefulness → Validation → Answer to the Problem
- Validation points back to Modeling Concept and Problem Formulation with "No"
- Usefulness points back to Modeling Concept and Problem Formulation with "No"
T
Calibration is used when parameter values are unknown and need to be estimated.
T
According to the textbook, ODE and SD are two of the most common macroscopic modeling techniques.
T
According to the textbook readings, modeling methods can be classified into two primary categories: macroscopic methods and microscopic methods.
T
The term “dynamic” in modelling refers to time-dependent components.
T
Partial differential equations (PDEs) are used to describe system behavior over time.
F
Documentation is crucial for the reproducibility of simulation models.
T
Validation answers the question, “Is the right model developed?”
T
Visualization is not necessary for documenting and validating simulation models.
F
What is the purpose of data cleaning in the data preparation phase?
To identify and correct errors or inconsistencies in the dataset.
Define feature engineering.
The process of creating new features from existing data to improve model performance.
How do you handle missing values in a dataset?
Techniques include imputation (filling in values) or deletion (removing missing data).
What types of visualizations are commonly used in EDA?
Histograms, box plots, and scatter plots to uncover patterns in data.
Why is correlation analysis important?
It helps examine relationships between variables, indicating potential dependencies.
What are outliers, and why should they be identified?
Outliers are data points that differ significantly from others; they can skew analysis results.
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data; unsupervised learning uses unlabeled data.
Name a common metric used for evaluating model performance.
Accuracy, precision, recall, F1-score, or ROC-AUC.
What is overfitting, and how can it be prevented?
Overfitting occurs when a model learns noise instead of the underlying pattern; it can be prevented by using regularization techniques and cross-validation.
What are ensemble methods in machine learning?
Techniques that combine multiple models to improve overall accuracy (e.g., Random Forest, Boosting).
What is the purpose of dimensionality reduction?
To simplify models while retaining essential information, often using techniques like PCA.
How is Natural Language Processing (NLP) used in data science?
NLP techniques are used to process and analyze textual data.
What is the significance of model monitoring?
To evaluate models against real-world data and ensure continued accuracy.
Why is documentation important in model deployment?
It maintains clear records of model specifications, performance metrics, and changes made.
How can models be updated post-deployment?
By retraining models with new data to adapt to changing conditions.
What are the potential impacts of bias in data science?
Bias can lead to unfair or inaccurate model predictions affecting decisions and outcomes.
What measures can be taken to ensure data privacy?
Implementing security measures and complying with regulations like GDPR.
Why is transparency important in machine learning models?
It promotes user trust by clarifying algorithms and decision-making processes.
Supervised Learning
You know the independent (input) and dependent (output) variables.
Unsupervised Learning
You do NOT know the labels for output variables. The model identifies patterns or groupings on its own.
CRISP-DM
Cross-Industry Standard Process for Data Mining. Steps:
Business Understanding: Define project objectives and requirements.
Data Understanding: Explore initial data, find quality issues, and insights.
Data Prep: Clean, format, and transform data for modeling.
Modeling: Select and apply appropriate algorithms.
Evaluation: Check that the model meets the business objectives.
Deployment: Put the model into use and plan monitoring.
Know each stage and tasks involved.
Regression Evaluation Metrics
Common metrics include:
P-Value: Tests whether a coefficient differs significantly from zero (variable significance).
R-Squared: Proportion of variance explained by the model.
Adjusted R-Squared: Adjusts R-Squared based on the number of predictors.
Understand each metric’s role in model evaluation.
Assumptions of MLR
Key assumptions include:
Linearity, independence, homoscedasticity, and normality of residuals.
Review each assumption for accurate model performance
Distance
Measures similarity or dissimilarity between data points, used in clustering and k-NN.
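Two common distance measures, sketched with NumPy (the points are arbitrary):

```python
# Euclidean (straight-line) and Manhattan (city-block) distance.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))
manhattan = np.sum(np.abs(a - b))
print(euclidean, manhattan)   # 5.0 7.0
```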
Probability
Likelihood of a particular outcome; foundational in predictive modeling and statistics.
Bayes' Theorem
Method to calculate conditional probability: P(A|B) = P(B|A)P(A) / P(B). Know the basic formula and its application.
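A worked example of the formula with illustrative numbers (a hypothetical diagnostic test; none of these rates come from the course material):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Hypothetical test: 99% sensitive, 95% specific, 1% prevalence.
p_a = 0.01                 # P(condition)
p_b_given_a = 0.99         # P(positive | condition)
p_b_given_not_a = 0.05     # P(positive | no condition)

# Total probability of a positive result.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(condition | positive) = {p_a_given_b:.3f}")   # ~0.167
```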
Naïve Bayes
Simplified application of Bayes' theorem that assumes feature independence. Used for classification tasks.
Entropy
Measurement of data disorder, specifically for labels.
Interpretation: How mixed positive and negative groups are in the dataset.
Information Gain
Reduction in entropy after splitting data; indicates improvement in data classification.
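A from-scratch sketch of both quantities for a binary label (the toy labels and split are made up; only the standard library is needed):

```python
# Entropy of a label set, and the information gain of a candidate split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["+", "+", "+", "+", "-", "-", "-", "-"]
left, right = ["+", "+", "+", "-"], ["+", "-", "-", "-"]
print(entropy(parent))                          # 1.0 (maximally mixed)
print(information_gain(parent, [left, right]))  # ~0.19: disorder reduced
```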
Data Prep Basics
Preprocessing steps:
Convert data to correct type.
Set target labels.
Create indicator and ordinal variables.
Address skewness.
Standardize or normalize data (e.g., Min-Max, Z-score).
Data Leaks
Using data that wouldn’t be available at prediction time, leading to overly optimistic models.
Accuracy
Ratio of correct predictions over total predictions.
Formula: (True Positives + True Negatives) / Total Predictions.
Missing Values and how to handle
Types:
Missing Completely at Random (MCAR): No pattern to missingness.
Missing at Random (MAR): Missingness related to observed data.
Missing Not at Random (MNAR): Missingness related to unobserved data.
Know strategies for each type.
Classification evaluation metrics
- accuracy
- AUC
- F-Score
confusion matrix
cross-tab table comparing predicted classes against actual classes, showing how many predictions of each class were right and wrong
- should be sized n x n because you don't always have a binomial label
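A sketch assuming scikit-learn, with a made-up 3-class example to show the n x n shape:

```python
# Confusion matrix for a 3-class label (3 x 3, not just 2 x 2),
# plus accuracy = correct predictions / total predictions (4/6 here).
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

labels = ["bird", "cat", "dog"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = actual class
print(accuracy_score(y_true, y_pred))                   # 0.666...
```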
clustering
Grouping data points into clusters based on similarity; an unsupervised learning technique.
k-NN
Classification or regression based on the ‘k’ closest points (neighbors) to a target.
k: any positive integer (the number of neighbors considered)
NN: nearest neighbors
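A minimal sketch assuming scikit-learn and its bundled Iris data; it also echoes the earlier 2-NN vs. 20-NN card, since small k tracks the training data more closely:

```python
# k-NN classification: compare a small and a large k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for k in (2, 20):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: train={knn.score(X_train, y_train):.2f} "
          f"test={knn.score(X_test, y_test):.2f}")
```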
K-means
Partitioning data into ‘k’ clusters with each point assigned to the nearest cluster centroid.
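A minimal sketch assuming scikit-learn, on synthetic blobs; it also ties in the Calinski-Harabasz card above (higher scores indicate a better clustering):

```python
# K-means for several k, scored with the Calinski-Harabasz criterion.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: CH score = {calinski_harabasz_score(X, labels):.0f}")
```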
Decision Trees
Tree-structured classifiers that split data based on feature values to make predictions.
Correcting for skew/kurtosis
Log transformation, square root, or Box-Cox to normalize data distributions.
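A sketch assuming NumPy and SciPy, applied to a synthetic right-skewed sample:

```python
# Compare skewness before and after common corrective transformations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed data

print(f"raw:         skew = {stats.skew(x):.2f}")
print(f"log:         skew = {stats.skew(np.log(x)):.2f}")
print(f"square root: skew = {stats.skew(np.sqrt(x)):.2f}")
x_bc, _ = stats.boxcox(x)        # Box-Cox chooses its own lambda
print(f"box-cox:     skew = {stats.skew(x_bc):.2f}")
```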
standardizing/normalizing data values
Adjust data to a common scale.
Standardizing: Center data around 0 (Z-score).
Normalizing: Scale data between a set range (e.g., 0-1, Min-Max).
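A side-by-side sketch with NumPy (toy values):

```python
# Z-score standardization vs. Min-Max normalization of one column.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

z = (x - x.mean()) / x.std()              # mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())  # rescaled into [0, 1]
print(z.round(2))
print(mm.round(2))
```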
feature selection
Choosing the features that help model performance and eliminating the ones that hurt it
inductive logic
Inferring general patterns from specific observations (bottom-up approach)
deductive logic
Starts with general rules or observations to reach specific conclusions (top-down approach)