Sample Q from SOA Flashcards
- Determine which of the following statements is/are true.
I. The number of clusters must be pre-specified for both K-means and
hierarchical clustering.
II. The K-means clustering algorithm is less sensitive to the presence of
outliers than the hierarchical clustering algorithm.
III. The K-means clustering algorithm requires random assignments while the
hierarchical clustering algorithm does not.
(A) I only (B) II only (C) III only (D) I, II, and III (E) The correct answer is not given by (A), (B), (C), or (D)
I is false because the number of clusters is pre-specified in the K-means algorithm but not for the hierarchical algorithm.
II is also false because both algorithms force each observation to a cluster so that both may be heavily distorted by the presence of outliers.
III is true because K-means begins with a random initial assignment (random starting centroids or cluster labels), while hierarchical clustering is deterministic.
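The random-assignment point in statement III can be seen in a minimal K-means sketch (illustrative numpy code on made-up data, not a production implementation):

```python
import numpy as np

# Minimal K-means sketch: the key point for statement III is the random
# initialization step, which hierarchical clustering does not need.
def kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Random step: initial centroids are drawn at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every observation to its nearest centroid ...
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # ... then move each centroid to the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two well-separated blobs of 10 points each.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5, 0.1, (10, 2))])
labels = kmeans(X, k=2)
```

A different seed can give a different (possibly worse) clustering, which is why K-means is typically restarted several times.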
- Consider the following statements:
I. Principal Component Analysis (PCA) provides low-dimensional linear
surfaces that are closest to the observations.
II. The first principal component is the line in p-dimensional space that is
closest to the observations.
III. PCA finds a low dimension representation of a dataset that contains as
much variation as possible.
IV. PCA serves as a tool for data visualization.
Determine which of the statements are correct.
(A) Statements I, II, and III only
(B) Statements I, II, and IV only
(C) Statements I, III, and IV only
(D) Statements II, III, and IV only
(E) Statements I, II, III, and IV are all correct
Statement I is correct – Principal components provide low-dimensional linear surfaces
that are closest to the observations.
Statement II is correct – The first principal component is the line in p-dimensional space
that is closest to the observations.
Statement III is correct – PCA finds a low dimension representation of a dataset that
contains as much variation as possible.
Statement IV is correct – PCA serves as a tool for data visualization.
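The low-dimensional representation in statements I–III can be sketched with numpy's SVD on simulated data (the scaling matrix below is made up to give the three directions clearly different variances):

```python
import numpy as np

# PCA via the SVD of the centered data matrix. The rows of Vt are the
# principal component loadings; projecting onto the first few gives the
# low-dimensional representation referred to in the statements.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                     # coordinates of each observation on the PCs
var_explained = s**2 / np.sum(s**2)    # proportion of variance per component
```

The first component captures the direction of greatest variance, so `var_explained` is in decreasing order and sums to one.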
- Consider the following statements:
I. The proportion of variance explained by an additional principal
component never decreases as more principal components are added.
II. The cumulative proportion of variance explained never decreases as more
principal components are added.
III. Using all possible principal components provides the best understanding
of the data.
IV. A scree plot provides a method for determining the number of principal
components to use.
Determine which of the statements are correct. (A) Statements I and II only (B) Statements I and III only (C) Statements I and IV only (D) Statements II and III only (E) Statements II and IV only
Statement I is incorrect – The proportion of variance explained by an additional principal
component decreases or stays the same as more principal components are added.
Statement II is correct – The cumulative proportion of variance explained increases or
stays the same as more principal components are added.
Statement III is incorrect – We want to use the least number of principal components
required to get the best understanding of the data.
Statement IV is correct – Typically, the number of principal components is chosen
based on a scree plot.
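Statements I and II can be checked numerically (a hedged numpy sketch on simulated four-variable data): the per-component proportion of variance explained (PVE) is nonincreasing, while the cumulative PVE never decreases. Plotting `pve` against the component number would give the scree plot of statement IV.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first

pve = s**2 / np.sum(s**2)   # per-component PVE -- nonincreasing (statement I is false)
cum_pve = np.cumsum(pve)    # cumulative PVE -- never decreases (statement II)
```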
- Determine which of the following pairs of distribution and link function is
most appropriate for modeling whether a person is hospitalized or not.
(A) Normal distribution, identity link function
(B) Normal distribution, logit link function
(C) Binomial distribution, linear link function
(D) Binomial distribution, logit link function
(E) It cannot be determined from the information given.
- The intent is to model a binary outcome, thus a classification model is desired. In
GLM, this is equivalent to binomial distribution. The link function should be one
that restricts values to the range zero to one. Of the linear and logit link
functions, only the logit has this property.
- Determine which of the following statements describe the advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of
least squares.
I. Doing so will likely result in a simpler model
II. Doing so will likely improve prediction accuracy
III. The results are likely to be easier to interpret
(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The correct answer is not given by (A), (B), (C), or (D)
- Key: D
Alternative fitting procedures will tend to remove the irrelevant variables from the
predictors, thus resulting in a simpler and easier to interpret model. Accuracy will likely be improved due to reduction in variance.
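The "simpler model" point can be illustrated with the lasso in scikit-learn (a hedged sketch; the data, with only 2 of 10 predictors relevant, and the penalty strength are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [3.0, -2.0]           # only the first 2 of 10 predictors matter
y = X @ beta + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# OLS keeps every predictor; the lasso zeroes out most irrelevant ones,
# giving a simpler, easier-to-interpret model with lower variance.
n_zero_ols = int(np.sum(np.isclose(ols.coef_, 0.0)))
n_zero_lasso = int(np.sum(np.isclose(lasso.coef_, 0.0, atol=1e-8)))
```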
- Determine which of the following statements about random forests is/are true?
I. If the number of predictors used at each split is equal to the total number
of available predictors, the result is the same as using bagging.
II. When building a specific tree, the same subset of predictor variables is
used at each split.
III. Random forests are an improvement over bagging because the trees are
decorrelated.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: C
II is false because with random forest a new subset of predictors is selected for each split.
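In scikit-learn the per-split subset size is the `max_features` parameter, which makes statements I and II concrete (a hedged sketch on simulated data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=None considers all 5 predictors at every split, which is
# bagging (statement I). A smaller max_features draws a *fresh* random
# subset at every split, not once per tree (statement II is false), and
# this decorrelates the trees (statement III).
bagging_like = RandomForestRegressor(max_features=None, random_state=0).fit(X, y)
decorrelated = RandomForestRegressor(max_features=2, random_state=0).fit(X, y)
```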
- Determine which of the following statements is true.
(A) Linear regression is a flexible approach
(B) Lasso is more flexible than a linear regression approach
(C) Bagging is a low flexibility approach
(D) There are methods that have high flexibility and are also easy to interpret
(E) None of (A), (B), (C), or (D) are true
- Key: E
A is false, linear regression is considered inflexible because the number of possible
models is restricted to a certain form.
B is false, the lasso determines the subset of variables to use while linear regression
allows the analyst discretion regarding adding or removing variables.
C is false, bagging provides additional flexibility.
D is false, there is a tradeoff between being flexible and easy to interpret.
- Determine which of the following statements is/are true for a simple linear
relationship, y = β₀ + β₁x + ε.
I. If ε = 0 , the 95% confidence interval is equal to the 95% prediction
interval.
II. The prediction interval is always at least as wide as the confidence
interval.
III. The prediction interval quantifies the possible range for E(y | x).
(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: E
I is true. The prediction interval includes the irreducible error, but in this case it is zero.
II is true. Because it includes the irreducible error, the prediction interval is at least as
wide as the confidence interval.
III. is false. It is the confidence interval that quantifies this range.
- From an investigation of the residuals of fitting a linear regression by ordinary
least squares it is clear that the spread of the residuals increases as the predicted
values increase. Observed values of the dependent variable range from 0 to 100.
Determine which of the following statements is/are true with regard to transforming the
dependent variable to make the variance of the residuals more constant.
I. Taking the logarithm of one plus the value of the dependent variable may
make the variance of the residuals more constant.
II. A square root transformation may make the variance of the residuals more
constant.
III. A logit transformation may make the variance of the residuals more
constant.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: B
Adding a constant to the dependent variable avoids the problem of the logarithm of zero being negative infinity, and in general a log transformation may make the variance more constant, so I is true. Power transformations with a power less than one, such as the square root transformation, may also make the variance more constant, so II is true. A logit transformation requires that the variable take on values between 0 and 1 and hence cannot be used here, so III is false.
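The variance-stabilizing effect of the log and square-root transformations can be seen on simulated heteroscedastic data (the group means and spreads below are made up to mimic spread growing with the predicted value):

```python
import numpy as np

rng = np.random.default_rng(0)
low = rng.normal(loc=5, scale=0.8, size=5000)    # low mean, small spread
high = rng.normal(loc=80, scale=12, size=5000)   # high mean, large spread

# Both transformations compress large values more than small ones, so the
# ratio of spreads between the two groups shrinks after transforming.
ratio_raw = high.std() / low.std()
ratio_log = np.log1p(high).std() / np.log1p(low).std()     # log(1 + y), statement I
ratio_sqrt = np.sqrt(high).std() / np.sqrt(low).std()      # sqrt(y), statement II
```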
- Determine which of the following statements is applicable to K-means clustering
and is not applicable to hierarchical clustering.
(A) If two different people are given the same data and perform one iteration of the
algorithm, their results at that point will be the same.
(B) At each iteration of the algorithm, the number of clusters will be greater than the
number of clusters in the previous iteration of the algorithm.
(C) The algorithm needs to be run only once, regardless of how many clusters are
ultimately decided to use.
(D) The algorithm must be initialized with an assignment of the data points to a
cluster.
(E) None of (A), (B), (C), or (D) meet the stated criterion.
- Key: D
(A) For K-means the initial cluster assignments are random. Thus different people can
have different clusters, so the statement is not true for K-means clustering. It is true for
hierarchical clustering.
(B) For K-means the number of clusters is set in advance and does not change as the
algorithm is run. For hierarchical clustering the number of clusters is determined after the
algorithm is completed.
(C) For K-means the algorithm needs to be re-run if the number of clusters is changed.
This is not the case for hierarchical clustering.
(D) This is true for K-means clustering. Agglomerative hierarchical clustering starts with
each data point being its own cluster.
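Points (C) and (D) can be illustrated with scipy's agglomerative clustering (a hedged sketch on made-up data): the algorithm starts with every point as its own cluster, runs once, and the resulting dendrogram can be cut at any number of clusters, whereas K-means would need a re-run for each K.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))

# One run produces the full dendrogram: 12 singleton clusters merged
# pairwise in 11 steps, with no initial assignment required.
Z = linkage(X, method="complete")

# The same dendrogram is cut at different cluster counts after the fact.
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k4 = fcluster(Z, t=4, criterion="maxclust")
```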
- An analyst is modeling the probability of a certain phenomenon occurring. The
analyst has observed that the simple linear model currently in use results in
predicted values less than zero and greater than one.
Determine which of the following is the most appropriate way to address this issue.
(A) Limit the data to observations that are expected to result in predicted values
between 0 and 1.
(B) Consider predicted values below 0 as 0 and values above 1 as 1.
(C) Use a logit function to transform the linear model into only predicting values
between 0 and 1.
(D) Use the canonical link function for the Poisson distribution to transform the linear
model into only predicting values between 0 and 1.
(E) None of the above
- Key: C
(A) is not appropriate because removing data will likely bias the model estimates.
(B) is not appropriate because altering data will likely bias the model estimates.
(C) is correct.
(D) is not appropriate because the canonical link function for the Poisson distribution is the logarithm, which will not
restrict values to the range zero to one.
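The contrast between (C) and (D) comes down to the range of the inverse link function, which a short numpy check makes explicit:

```python
import numpy as np

eta = np.linspace(-5, 5, 101)       # the linear predictor can be any real number

# Inverse of the logit link: maps every real value into (0, 1), so (C) works.
p_logit = 1 / (1 + np.exp(-eta))

# Inverse of the log link (Poisson canonical): maps into (0, inf) and can
# exceed 1, so (D) does not solve the problem.
p_log = np.exp(eta)
```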
- A random walk is expressed as
Y(t) = Y (t-1) + c for t = 1,2, …
where
E(c) = µ(c) and Var(c) = σ(c)²
Determine which of the following statements is/are true with respect to a random walk model.
I. If µ(c) ≠ 0, then the random walk is nonstationary in the mean.
II. If σ(c)² = 0, then the random walk is nonstationary in the variance.
III. If σ(c)² > 0, then the random walk is nonstationary in the variance.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: C
I is true because the mean E(Y(t)) = Y(0) + t·µ(c) depends on t.
II is false because the variance Var(Y(t)) = t·σ(c)² = 0 does not depend on t.
III is true because the variance Var(Y(t)) = t·σ(c)² depends on t.
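These moment formulas can be checked by simulating many paths of the walk (a hedged sketch; mu_c, sigma_c, and the horizon are arbitrary): the cross-path mean and variance both grow with t, matching t·µ(c) and t·σ(c)².

```python
import numpy as np

rng = np.random.default_rng(0)
mu_c, sigma_c, T, n_paths = 0.5, 1.0, 200, 2000

# Each row is one path: Y(t) = Y(t-1) + c with Y(0) = 0.
c = rng.normal(mu_c, sigma_c, size=(n_paths, T))
y = np.cumsum(c, axis=1)

mean_early, mean_late = y[:, 9].mean(), y[:, 199].mean()   # approx t * mu_c
var_early, var_late = y[:, 9].var(), y[:, 199].var()       # approx t * sigma_c**2
```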
- Determine which of the following statements concerning decision tree pruning
is/are true.
I. The recursive binary splitting method can lead to overfitting the data.
II. A tree with more splits tends to have lower variance.
III. When using the cost complexity pruning method, α = 0 results in a very
large tree.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: C
I is true because the method optimizes with respect to the training set, but may perform
poorly on the test set.
II is false because additional splits tend to increase variance by adding to the
complexity of the model.
III is true because with α = 0 there is no penalty on tree size, so only the training error is minimized, producing a very large tree.
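The role of α in statement III maps to scikit-learn's `ccp_alpha` parameter (a hedged sketch on simulated data): with zero penalty the tree grows until it nearly memorizes the noisy training data, while a positive penalty prunes it back.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# ccp_alpha = 0: only training error matters -> very large tree (overfits).
full = DecisionTreeRegressor(ccp_alpha=0.0, random_state=0).fit(X, y)

# ccp_alpha > 0: cost-complexity pruning trades training error for size.
pruned = DecisionTreeRegressor(ccp_alpha=0.05, random_state=0).fit(X, y)
```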
- Determine which of the following considerations may make decision trees
preferable to other statistical learning methods.
I. Decision trees are easily interpretable.
II. Decision trees can be displayed graphically.
III. Decision trees are easier to explain than linear regression methods.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: E
All three statements are true. See Section 8.1 of An Introduction to Statistical Learning.
The statement that trees are easier to explain than linear regression methods may not be
obvious. For those familiar with regression but just learning about trees, the reverse may
be the case. However, for those not familiar with regression, relating the dependent
variable to the independent variables, especially if the dependent variable has been
transformed, can be difficult to explain.
- Principal component analysis is applied to a large data set with four variables.
Loadings for the first four principal components are estimated.
Determine which of the following statements is/are true with respect to the loadings.
I. The loadings are unique.
II. For a given principal component, the sum of the squares of the loadings
across the four variables is one.
III. Together, the four principal components explain 100% of the variance.
(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).
- Key: D
I is false because the loadings are unique only up to a sign flip.
II is true. Principal components are designed to maximize variance. If there are no
constraints on the magnitude of the loadings, the variance can be made arbitrarily large.
The PCA algorithm’s constraint is that the sum of the squares of the loadings equals 1.
III is true because four components can capture all the variation in four variables,
provided there are at least four data points (note that the problem states that the data set is
large).
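Statements II and III can be verified numerically with numpy (a hedged sketch; the random 4×4 mixing matrix just makes the four variables correlated): each loading vector has unit sum of squares, and the four components together reproduce the total variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated variables
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Statement II: each row of Vt is a loading vector with squared norm 1
# (the PCA normalization constraint).
loading_norms = (Vt**2).sum(axis=1)

# Statement III: the component variances (s**2) sum to the total variance
# of the centered data, so four components explain 100% of it.
total_component_var = (s**2).sum()
total_data_var = (Xc**2).sum()
```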