Test 2 Flashcards
The goal of PCA is to find a high-dimensional representation of the data that maintains as much information as possible.
F
Machine learning algorithms are readily available in our tools, such as Altair AI Studio, Azure ML Studio, Python libraries, etc. As a result, it is no longer important to understand the mathematical principles and assumptions behind the algorithms.
F
Linear algebra, calculus, optimization, and probability theory are the four mathematical fields mentioned in the textbook.
T
A derivative is a measure of the sensitivity of a function to changes in the function’s input(s).
T
Vectors and matrices are the building blocks in linear algebra.
T
The transpose of a matrix is the matrix with its rows and columns inverted.
T
Master data management does not require data governance.
F
Data classification is a way to define the various levels of confidentiality/security required by the organization.
T
Symmetric encryption is a good way for a bank to authenticate your account credentials.
F
Information security involves protecting data from unauthorized access.
T
Master data is the contextual data about the organization/entity used to increase the informativeness of transaction data.
T
Modeling in data science involves creating representations of real-world phenomena.
T
In information security, defense in depth is the concept that an organization uses multiple layers of security to protect sensitive/valuable data assets.
T
Multiple linear regression improves the power of your analysis by quantifying the cumulative effect of all features.
T
While doing our bivariate analysis, we found that the following attributes had these R-squared values with respect to the label attribute:
age: 0.29
education: 0.14
experience: 0.16
Given this information, we should expect that a regression model including these same three attributes against the same label will have an R-squared value of 0.59.
F
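A quick numeric sketch of why the answer is false: when predictors are correlated, their bivariate R-squared values overlap and do not simply add up in a multiple regression. The data below is synthetic and invented for illustration, not the course's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical correlated predictors (illustration only)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # strongly correlated with x1
y = x1 + rng.normal(size=n)

def r_squared(cols, y):
    # Ordinary least squares with an intercept; return R^2
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_x1 = r_squared([x1], y)          # bivariate R^2 for x1 alone
r2_x2 = r_squared([x2], y)          # bivariate R^2 for x2 alone
r2_both = r_squared([x1, x2], y)    # multiple regression R^2
# Because x1 and x2 explain overlapping variance,
# r2_both comes out far below r2_x1 + r2_x2
```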
R2 represents how well the regression model explains the variance in the label value.
T
Multiple linear regression (MLR) is but one of many algorithms used for multivariate modeling.
T
Scaling techniques such as LogNormal, MinMax, Z-score, Tanh, and Logistic are used to adjust the values of numeric variables.
T
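For the scaling card, a minimal sketch of two of the listed techniques (Min-Max and Z-score) on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # made-up numeric attribute

minmax = (x - x.min()) / (x.max() - x.min())  # rescales to the [0, 1] range
zscore = (x - x.mean()) / x.std()             # rescales to mean 0, std dev 1
```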
K-means clustering is more robust to outliers than k-medoids clustering.
F
Cluster analysis is a form of supervised learning.
F
Lower values of the Calinski-Harabasz criterion indicate a better clustering solution.
F
K-medoids clustering identifies an actual data point for each cluster that is most centrally located.
T
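The two cards above can be illustrated with a toy cluster containing one outlier (values invented for illustration): the k-means centroid is a synthetic point dragged toward the outlier, while the medoid is an actual, centrally located data point.

```python
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0],
                [0.0, 1.0], [10.0, 10.0]])    # last point is an outlier

centroid = pts.mean(axis=0)                   # k-means centre: not an actual data point
pair_d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
medoid = pts[pair_d.sum(axis=1).argmin()]     # k-medoids centre: an actual data point
# The centroid is dragged to [2.75, 2.75] by the outlier;
# the medoid stays inside the tight cluster
```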
Clustering requires us to specify a label attribute.
F
A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model.
T
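A small sketch of why a 2-nearest neighbor model overfits more than a 20-nearest neighbor model: with noisy labels (synthetic data, invented here), a tiny k nearly memorizes the training set, while a large k smooths the noise away. Training accuracy is used as a rough proxy for memorization.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400
X = rng.uniform(0, 10, size=(n, 1))
true = (X[:, 0] < 5).astype(int)            # simple underlying rule
flip = rng.random(n) < 0.3                  # 30% label noise
y = np.where(flip, 1 - true, true)

def knn_train_accuracy(k):
    # Evaluate each training point against the whole training set (self included)
    d = np.abs(X[:, 0][:, None] - X[:, 0][None, :])
    votes = y[np.argsort(d, axis=1)[:, :k]]  # labels of the k nearest points
    preds = (votes.mean(axis=1) >= 0.5).astype(int)
    return (preds == y).mean()

acc_k2 = knn_train_accuracy(2)
acc_k20 = knn_train_accuracy(20)
# k=2 memorizes much of the noisy labeling (higher training accuracy);
# k=20 averages it out and generalizes better
```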
Structured data includes text or image data.
F
Machine learning models can be, and often are, used without concern for interpretation.
T
The coefficient of determination (R-squared) indicates how well the predicted values explain the variation in the observed values.
T
Multiple linear regression differs from linear regression because it is generally used to predict categorical values.
F
The primary purpose of statistics is to make precise predictions about new observations.
F
The Mean Absolute Error (MAE) is less sensitive to outliers than the RMSE.
T
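A quick arithmetic check on the MAE/RMSE card (residual values made up for illustration): squaring the errors makes RMSE jump far more than MAE when a single outlier appears.

```python
import numpy as np

def mae(err):
    return np.mean(np.abs(err))

def rmse(err):
    return np.sqrt(np.mean(err ** 2))

clean = np.array([1.0, 1.0, 1.0, 1.0])   # residuals without an outlier
dirty = np.array([1.0, 1.0, 1.0, 10.0])  # one large residual added

# MAE rises 1.0 -> 3.25; RMSE rises 1.0 -> ~5.07, a much larger jump
```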
The F-statistic is used to determine if the overall regression model is a good fit for the data.
T
Logistic regression requires numeric predictor attributes, so categorical attributes need to be converted to numeric attributes before analysis.
T
The Root Mean Squared Error (RMSE) is expressed in the same units as the dependent variable.
T
Clustering is a supervised learning method.
F
The adjusted coefficient of determination penalizes for the number of independent variables in the model.
T
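The adjusted R-squared penalty can be seen directly from its standard formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). The R-squared, sample size, and predictor counts below are invented to show the effect:

```python
def adjusted_r2(r2, n, p):
    # n observations, p independent variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 and sample size; more predictors -> a lower adjusted value
low_p = adjusted_r2(0.60, 50, 3)     # ~0.574
high_p = adjusted_r2(0.60, 50, 10)   # ~0.497
```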
Quantitative variables can be either discrete or continuous.
T
Machine learning involves implementing static algorithms to make predictions.
F
Bias refers to the error due to sensitivity to small fluctuations in the training set.
F
Deep learning is a subset of machine learning that uses neural networks with many layers.
T
Decision nodes are used in linear regression.
F
The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.
T
Underfitting occurs when the model does not adequately reflect the distribution of the training data.
T
When using clustering, the target variable does not have to be precisely defined at training time.
T
In the use phase, k-means algorithms classify new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.
F
Random Forests are relatively robust against worthless features.
T
Convolutional neural networks (CNNs) are an example of algorithmic generation of features.
F
Weak learners in ensemble methods may perform only slightly better than a random decision.
T
The Data Preparation phase in CRISP-DM includes selecting data and cleaning data.
T
Fliers in a Tukey box plot represent data values beyond the cap values.
T
Derived attributes are new attributes constructed from one or more existing attributes.
T
The empirical rule states that any data within three standard deviations of the mean is considered an outlier.
F
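For the empirical-rule card, the usual convention is the reverse of the statement: values *beyond* three standard deviations of the mean are flagged as outliers. A sketch with invented values:

```python
import numpy as np

# 20 typical values near 10, plus one extreme value (all invented)
data = np.concatenate([np.tile([9.9, 10.0, 10.1, 10.0], 5), [25.0]])

z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3]   # flag values beyond three standard deviations
# Only the extreme value 25.0 is flagged; the typical values fall well within 3 sd
```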
Data cleaning and transformation are one-time tasks and do not require iteration in a data science project.
F
Imputing the null value is usually a better option than using some type of average.
F