quizzes Flashcards
Which of the following would be the least effective way to represent a color (e.g., “Pink”) in a dataset used in a predictive modeling task?
a) As a single numeric value of a color temperature scale (Kelvin)
b) As a one-hot nominal value
c) As an ordinal value based on its rank in an alphanumeric sorting of all colors
d) Based on three separate numerical values for Red, Green, and Blue (RGB)
As an ordinal value based on its rank in an alphanumeric sorting of all colors. A color's alphabetical rank has no meaningful relationship to any fundamental property of the color.
Consider a dataset with the following structure:
city | state | date | temp |
Berk | CA | 01/25/18 | 11 |
Assume we want to transform this dataset into one with only the features State, Month, and Temperature, where State is represented by the longitude and latitude of the state's capital, Month is one-hot encoded, and Temperature is left as a numeric value. How many total features (columns) would this dataset have?
15
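The count can be checked by listing the columns explicitly. This is a minimal sketch; the column names are hypothetical and chosen only for illustration.

```python
# Hypothetical column names for the transformed dataset.
state_features = ["capital_longitude", "capital_latitude"]  # State -> 2 numeric columns
month_features = [f"month_{m}" for m in range(1, 13)]       # one-hot: one column per month
temperature_features = ["temp"]                             # left as a single numeric column

columns = state_features + month_features + temperature_features
total_features = len(columns)
print(total_features)
```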
Sum of Squares Error (SSE) can be used with K-means clustering to:
(check all that apply)
K = number of clusters
n = number of data points being clustered
a) Choose a value of K based on the heuristic of the “elbow” method
b) Choose between different clusterings (for a fixed K) produced by starting with different random K-means centroids
c) Find the best K by choosing the K with the minimum SSE for values of K from 1 to n
a & b
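A small sketch of why (c) fails, using a hypothetical four-point dataset and hand-picked cluster labels rather than an actual K-means run: SSE can rank two clusterings with the same K, but it always reaches 0 when K = n, so minimizing it over K simply selects K = n.

```python
def sse(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        for m in members:
            total += sum((m[d] - centroid[d]) ** 2 for d in range(dim))
    return total

# Hypothetical data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

sse_k1 = sse(points, [0, 0, 0, 0])       # K = 1
sse_k2_good = sse(points, [0, 0, 1, 1])  # K = 2, pairs grouped correctly
sse_k2_bad = sse(points, [0, 1, 0, 1])   # K = 2, a poor clustering with the same K
sse_kn = sse(points, [0, 1, 2, 3])       # K = n, every point is its own cluster

print(sse_k1, sse_k2_good, sse_k2_bad, sse_kn)
```

The sharp drop from K = 1 to the good K = 2 clustering is the kind of "elbow" that option (a) refers to.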
What is the range of the silhouette score?
[-1, 1]
How could a data point have a silhouette coefficient of 0?
If the data point's mean distance to the points in its own cluster equals its mean distance to the points in the nearest other cluster
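The range and the zero case both follow from the usual definition s = (b - a) / max(a, b), where a is the point's mean distance to the other points in its own cluster and b is its mean distance to the points in the nearest other cluster. A minimal sketch, assuming a and b are already computed and not both zero:

```python
def silhouette_coefficient(a, b):
    """Silhouette coefficient of one point.

    a: mean distance to the other points in the point's own cluster
    b: mean distance to the points in the nearest other cluster
    Assumes max(a, b) > 0.
    """
    return (b - a) / max(a, b)

on_the_boundary = silhouette_coefficient(2.0, 2.0)  # equally close to both clusters
well_clustered = silhouette_coefficient(1.0, 4.0)   # much closer to its own cluster
misassigned = silhouette_coefficient(4.0, 1.0)      # closer to the other cluster

print(on_the_boundary, well_clustered, misassigned)
```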
How many different assignments of data points to clusters are there given n data points and K clusters? Assume a data point can only belong to a single cluster.
K^n
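Each of the n points independently picks one of K cluster labels, which gives K^n. The sketch below verifies this by brute force for small, arbitrarily chosen values of n and K. It counts labeled assignments and allows clusters to be empty.

```python
from itertools import product

n, K = 4, 3  # small illustrative values

# Every assignment is a length-n tuple of cluster labels in {0, ..., K-1}.
assignments = list(product(range(K), repeat=n))
count = len(assignments)

print(count, K ** n)
```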
The plot below depicts data points for a dataset of 10 credit card seeking individuals, 6 of whom are considered to be a high credit risk and 4 of whom are considered to be a low credit risk.
What is the starting Gini impurity (index) of this dataset given credit risk as the target?
0.48
If there were equal numbers of low and high credit risk individuals, what would the Gini impurity of the dataset be without any splits?
0.5
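Both answers come from the Gini impurity formula 1 - sum(p_i^2), where p_i is the fraction of data points in class i. A minimal sketch; the 5/5 split is just one example of a balanced dataset:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of data points in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

gini_6_high_4_low = gini_impurity([6, 4])  # 1 - (0.6^2 + 0.4^2)
gini_balanced = gini_impurity([5, 5])      # 1 - (0.5^2 + 0.5^2)

print(round(gini_6_high_4_low, 2), gini_balanced)
```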
If you were creating a decision tree based on this dataset using the C4.5 or CART algorithm, the first step would be to choose an attribute and split point that best partitioned the data points by the target value.
According to the credit risk plot, which attribute and split point would be the best choice among the following options?
Age with a split point of 35
Given enough depth (splits), a decision tree can successfully classify any training dataset with 100% accuracy.
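False. If two data points have identical feature values but different labels, no sequence of splits can separate them.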
Assume you are building an image classification neural network to classify an image as a dog, cat, or turtle. The images are 32x32 pixels and serialized into a vector of 1024 features per image. Assume there is only one hidden layer between the input and output layers. The hidden layer has 10 neurons (nodes). Ignoring bias terms, what is the total number of weights for this network?
10,270
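In a fully connected network without biases, each pair of adjacent layers contributes (inputs x outputs) weights. A short sketch of that arithmetic, assuming one output neuron per class:

```python
# Layer sizes: 32x32 input pixels, 10 hidden neurons, 3 output classes.
layer_sizes = [32 * 32, 10, 3]

# One weight per connection between consecutive layers; bias terms are ignored.
weights_per_layer = [a * b for a, b in zip(layer_sizes, layer_sizes[1:])]
total_weights = sum(weights_per_layer)

print(weights_per_layer, total_weights)
```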
Using a sigmoid as the activation function in the output layer of a binary classifier, which output value would denote the highest uncertainty in the class prediction?
0.5
What input value into the sigmoid function would produce the highest uncertainty output value?
0
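The sigmoid is 1 / (1 + e^(-z)), so an input of 0 maps to exactly 0.5, the midpoint between the two classes. A minimal sketch with arbitrarily chosen test inputs:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real input to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

most_uncertain = sigmoid(0.0)     # exactly halfway between the two classes
confident_positive = sigmoid(6.0)
confident_negative = sigmoid(-6.0)

print(most_uncertain, confident_positive, confident_negative)
```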
A binary classifier needs to answer the question: “Does the patient have lung cancer?” The table below shows the labels and predictions for a validation dataset. Compute the precision of these predictions:
Sample Number | Actual | Predicted |
1 | Normal | Cancer |
2 | Cancer | Cancer |
3 | Cancer | Cancer |
4 | Normal | Normal |
5 | Cancer | Normal |
Assume “Cancer” represents the positive class, and “Normal” represents the negative class.
Please round your answer to the 2nd decimal place.
[note: precision is a value between 0 and 1]
0.67
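Precision is TP / (TP + FP): of the samples predicted positive, the fraction that are actually positive. The sketch below recomputes it from the five rows of the table, with "Cancer" as the positive class.

```python
actual = ["Normal", "Cancer", "Cancer", "Normal", "Cancer"]
predicted = ["Cancer", "Cancer", "Cancer", "Normal", "Normal"]
positive = "Cancer"

# True positives: predicted positive and actually positive.
true_positives = sum(a == positive and p == positive for a, p in zip(actual, predicted))
# False positives: predicted positive but actually negative.
false_positives = sum(a != positive and p == positive for a, p in zip(actual, predicted))

precision = true_positives / (true_positives + false_positives)
print(round(precision, 2))
```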
In which of the following prediction scenarios would it be appropriate to apply AUC as the metric?
When predicting a binary label with a probabilistic prediction