Mid-Term Exam Flashcards
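List and describe the four levels of measurement.
Nominal: categories with no inherent order (e.g. colors)
Ordinal: categories with a meaningful order, but no defined distance between them
Interval: ordered values with meaningful distances but no true zero (e.g. temperature in Celsius)
Ratio: interval values with a true zero, so ratios are meaningful (e.g. height)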
Python includes built-in data types for lists, sets, and dicts?
True
The median of a set of numbers is defined as …
The middle value.
Put the numbers in order.
If two numbers are in the middle, compute their mean.
It is a good measure of the center of a distribution because it is robust to outliers.
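A minimal sketch of the median steps above, checked against the standard library. The function name `median` and the sample numbers are illustrative assumptions.

```python
import statistics


def median(values):
    """Return the middle value, averaging the two middle values if needed."""
    ordered = sorted(values)          # step 1: put the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 1, 5])        # 5
even_result = median([7, 1, 5, 3])    # (3 + 5) / 2 = 4.0
library_result = statistics.median([7, 1, 5, 3])  # same answer from the stdlib
```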
All data are quantitative?
False
Ordinal values allow us to measure distances
False
Nominal values allow us to order different data points
False
Explain the difference between list and dict values in python.
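Python lists are ordered, variable-length arrays (not linked lists); values are accessed by integer position.
Python dictionaries are resizable hash tables that map unique keys to values; since Python 3.7 they preserve insertion order.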
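A short illustration of the difference, using made-up names and scores: a list is indexed by position, a dict is looked up by key.

```python
# A list: ordered, accessed by integer position, duplicates allowed.
scores = [88, 92, 75]
scores.append(60)            # grows at the end
second = scores[1]           # 92

# A dict: accessed by key, each key appears at most once.
grades = {"alice": 88, "bob": 92}
grades["carol"] = 75         # adds a new key
grades["alice"] = 90         # overwrites the existing key, no duplicate is created
```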
if statements in Python can include additional clauses using the elseif keyword.
False (the keyword is elif)
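For reference, a small sketch of an if statement with an additional clause; Python spells the keyword `elif`. The function name `describe` is a made-up example.

```python
def describe(n):
    """Classify a number by sign using if / elif / else."""
    if n < 0:
        return "negative"
    elif n == 0:              # the extra clause uses elif
        return "zero"
    else:
        return "positive"


labels = [describe(-3), describe(0), describe(4)]
```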
What does lambda do in Python?
lambda creates a small anonymous function from a single expression.
It is often used inline, for example map(lambda x: x + 2, xs).
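A runnable sketch of a lambda passed to map; the list `numbers` is an illustrative assumption.

```python
numbers = [1, 2, 3]

# map applies the anonymous function to every element; list() collects the results.
plus_two = list(map(lambda x: x + 2, numbers))   # [3, 4, 5]

# The same function written with def, for comparison.
def add_two(x):
    return x + 2

same_result = [add_two(x) for x in numbers]
```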
Data formats are used as a way to share data between systems?
True
Data are measurements of a phenomenon
True
Data are the same as the thing being measured
False
Systems store formatted data
True
CSV files are like a tab from a spreadsheet
True
numpy arrays have no data type
False
A NumPy array (ndarray) is an N-dimensional array whose elements all share one data type.
NumPy arrays have a .all() method that returns true if any of the elements are true
False (.all() requires every element to be true; .any() checks whether any element is true)
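A quick check of the two methods on a sample boolean array (the array contents are an assumption made for illustration):

```python
import numpy as np

flags = np.array([True, False, True])

all_true = flags.all()   # False: not every element is true
any_true = flags.any()   # True: at least one element is true
```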
What does the argmax function do in numpy?
Returns the indices of the maximum values along an axis
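A small sketch of argmax with and without an axis, on a made-up 2x3 array:

```python
import numpy as np

values = np.array([[1, 9, 3],
                   [7, 2, 8]])

flat_index = np.argmax(values)            # 1: position of 9 in the flattened array
per_column = np.argmax(values, axis=0)    # [1, 0, 1]: row index of the max in each column
per_row = np.argmax(values, axis=1)       # [1, 2]: column index of the max in each row
```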
Explain the relationship between a Series and a DataFrame in pandas?
A Series is the data structure for a single column of a DataFrame.
A DataFrame can be thought of as a collection of Series that share the same index.
What is an “index” on a pandas dataframe?
The index holds the labels that identify each row of the DataFrame; pandas uses it to look up and align data.
You can see the first several rows of a dataframe using
df.first()
False (use df.head())
Given two series, a and b, does a == b return true if they’re equivalent and false otherwise? If not, what does it produce instead?
No. The == operator compares the two Series element by element.
It produces a boolean Series with one True/False per position; use a.equals(b) to get a single True/False.
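A minimal sketch of what the comparison produces, using two made-up Series with the same index:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 5, 3])

elementwise = a == b        # boolean Series: True, False, True
all_equal = a.equals(b)     # one answer for the whole Series
same_as_copy = a.equals(a.copy())
```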
Compare and contrast supervised and unsupervised machine learning
Supervised learning trains on labeled data, where the correct answer for each example is known.
Unsupervised learning works with unlabeled data and looks for structure (such as clusters) without known answers.
Compare and contrast regression and classification
Regression predicts a continuous numeric value.
Classification predicts which discrete class a sample belongs to.
The line separating the positive class and negative class is called the
decision boundary
In order to apply machine learning, we must first convert our data into a numeric format
True
Define precision and recall.
Precision is true positive / (true positive + false positive)
Recall is true positive / (true positive + false negative)
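A minimal sketch of both metrics for binary labels (1 = positive, 0 = negative). The function name `precision_recall` and the sample labels are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall from actual and predicted 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)   # tp=2, fp=1, fn=2
```

Here precision is 2/3 (two of the three positive predictions were right) and recall is 1/2 (two of the four actual positives were found).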
What does a confusion matrix display?
A confusion matrix displays the count of each possible outcome when classifying data, comparing the actual condition with the prediction.
                   | Predicted positive             | Predicted negative
Condition positive | True positive                  | False negative (type 2 error)
Condition negative | False positive (type 1 error)  | True negative
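One way to tally the four cells for binary labels, as a sketch; `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for 0/1 labels."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1          # correctly flagged positive
        elif actual == 1:
            counts["fn"] += 1          # missed positive (type 2 error)
        elif predicted == 1:
            counts["fp"] += 1          # false alarm (type 1 error)
        else:
            counts["tn"] += 1          # correctly rejected negative
    return counts


y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
```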
F1-score is a combination of which of the following metrics:
precision and recall
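The F1-score is the harmonic mean of the two metrics. A small sketch, with `f1_score` as an illustrative local function (not the scikit-learn one):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)    # 0.5
lopsided = f1_score(1.0, 0.1)    # pulled toward the weaker metric
```

Because it is a harmonic mean, a model with perfect precision but poor recall still gets a low F1.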
What is the visual effect of modifying the cluster_std when using make_blobs() to generate synthetic classification data?
Changing cluster_std changes the spread of the points around each cluster center: a smaller value gives tight, well-separated blobs, and a larger value gives wider blobs that may overlap.
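A rough way to see the effect numerically rather than visually, assuming scikit-learn is installed. The helper `blob_spread`, the single center at the origin, and the two cluster_std values are choices made for this sketch.

```python
from sklearn.datasets import make_blobs


def blob_spread(cluster_std):
    """Generate one blob around the origin and return the standard deviation of its points."""
    X, _ = make_blobs(
        n_samples=500,
        centers=[[0.0, 0.0]],       # one fixed center so only the spread varies
        cluster_std=cluster_std,
        random_state=0,
    )
    return X.std()


tight_spread = blob_spread(0.5)     # points hug the center
wide_spread = blob_spread(3.0)      # points scatter much further out
```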
How is the cost function used during the training process?
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function.
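A minimal sketch of that idea: gradient descent minimizing a sum-of-squared-errors cost for a one-feature linear model. The data, learning rate, and iteration count are assumptions chosen so the example converges.

```python
# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def cost(w, b):
    """Sum of squared errors of the line y = w*x + b over the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def gradient(w, b):
    """Partial derivatives of the cost with respect to w and b."""
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return dw, db


w, b = 0.0, 0.0
learning_rate = 0.01
initial_cost = cost(w, b)
for _ in range(5000):
    dw, db = gradient(w, b)
    w -= learning_rate * dw        # step downhill along each parameter
    b -= learning_rate * db
final_cost = cost(w, b)
```

After training, w and b approach the slope and intercept that generated the data, and the cost falls toward zero.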
During training of a linear regressor, the user must specify the bias parameter in advance.
False
What is a residual in linear regression?
It is the difference between the observed value and the value predicted by the model.
How is the residual sum of squares (RSS) cost function for linear regression defined?
The sum of the squared residuals. It measures the discrepancy between the data and the model's estimates.
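A worked example with made-up observed and predicted values, taking a residual to be observed minus predicted:

```python
y_observed = [3.0, 5.0, 7.5]
y_predicted = [2.5, 5.0, 8.0]

# One residual per data point: observed minus predicted.
residuals = [obs - pred for obs, pred in zip(y_observed, y_predicted)]   # [0.5, 0.0, -0.5]

# RSS: square each residual, then add them up.
rss = sum(r ** 2 for r in residuals)   # 0.25 + 0.0 + 0.25 = 0.5
```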
In linear regression, our regressor learns a decision boundary that’s shaped like a bowl (or upside-down dome).
False
What is the role of the logistic function in logistic regression?
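It maps the model's linear output to a value between 0 and 1, which is read as the probability of the positive class and then thresholded to pick a class.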
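A short sketch of the logistic (sigmoid) function. The 0.5 threshold in `classify` is a common default assumed here, not something stated in the cards.

```python
import math


def logistic(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if logistic(z) >= threshold else 0


midpoint = logistic(0.0)                    # 0.5
low, high = logistic(-5.0), logistic(5.0)   # close to 0 and close to 1
```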
What is the name of the cost function we learned for logistic regression?
Cross-entropy (also called log loss)
The cost function for logistic regression penalizes all wrong answers equally
False
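To see why wrong answers are not penalized equally, here is a sketch of the binary cross-entropy for a single example. The probabilities are made up; the true label is 1 in every case.

```python
import math


def cross_entropy(y, p):
    """Binary cross-entropy for one example with true label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


good = cross_entropy(1, 0.9)      # confident and right: small cost
mild = cross_entropy(1, 0.4)      # slightly wrong: moderate cost
severe = cross_entropy(1, 0.01)   # confident and wrong: large cost
```

The more confident the wrong prediction, the larger the penalty.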
Define the basic algorithm for kNN classification.
For an unlabeled sample s, we look at the closest labeled samples to determine what the class of s should be.
1) Find the k nearest known training points
2) Take a majority vote of their labels (averaging is used for kNN regression)
3) Assign the winning class to s
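The steps above can be sketched in plain Python as follows. Euclidean distance, the function name `knn_classify`, and the toy points are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1) Find the k nearest training points (Euclidean distance).
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # 2) Majority vote, 3) assign the winning class.
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_classify(train_points, train_labels, (0.5, 0.5), k=3)
near_far_cluster = knn_classify(train_points, train_labels, (5.5, 5.5), k=3)
```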
What parameter performs regularization in kNN classification? If you want to increase variance in a kNN classifier, should this parameter get larger or smaller? Why?
k is the parameter that performs regularization.
To increase the variance, you should reduce k.
Increasing k acts like a regularizer and tends toward a smoother, more biased model.
Explain cross-validation. Why do we use cross-validation?
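Cross-validation splits the training data into k folds, trains on k-1 of them, validates on the held-out fold, and rotates so that each fold is used for validation once. We use it to estimate how well the model generalizes and to detect overfitting (memorizing the data), which is especially useful when the dataset is limited.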
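A minimal sketch of how the data can be broken up for k-fold cross-validation. The helper `k_fold_indices` is hypothetical, and the split is unshuffled for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (training_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if i < remainder else 0)
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        folds.append((training, validation))
        start = stop
    return folds


folds = k_fold_indices(10, 3)   # validation folds of size 4, 3, 3
```

Each sample lands in exactly one validation fold, so every data point is used for both training and evaluation across the k rounds.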
In class we discussed that there are three basic contributors to model error. What are they?
Overfitting (high variance)
Underfitting (high bias)
Not enough data
What is the purpose of regularization in machine learning?
The purpose of regularization is to encourage a simpler model.
What part of the learning process is modified when we apply regularization to a logistic or linear regression model?
We modify the cost function by adding a penalty term on the weights, so that training favors smaller weights.
L1 regularization is the sum of the square roots of the weights in your model
False (it is the sum of the absolute values of the weights)
L2 regularization is the sum of the squares of the weights of your model
True
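A small sketch of the two penalties and how one might be added to a cost. The weights, the strength `lam`, and the `data_cost` value are made-up numbers for illustration.

```python
weights = [0.5, -2.0, 0.0, 3.0]

l1_penalty = sum(abs(w) for w in weights)    # 0.5 + 2.0 + 0.0 + 3.0 = 5.5
l2_penalty = sum(w ** 2 for w in weights)    # 0.25 + 4.0 + 0.0 + 9.0 = 13.25


def regularized_cost(data_cost, penalty, lam):
    """Original cost plus the penalty scaled by the regularization strength lam."""
    return data_cost + lam * penalty


# Hypothetical unregularized cost of 10.0 with regularization strength 0.5.
cost_with_l2 = regularized_cost(10.0, l2_penalty, lam=0.5)   # 10.0 + 0.5 * 13.25 = 16.625
```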
L1 regularization will result in sparser models
True
Sparser models means that there are fewer training samples used during the learning process
False
Hyperparameters are adjusted outside the training procedure of the model
True