Mid-Term Exam Flashcards by Jason Miller

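List and describe the four levels of measurement.

Nominal: named categories with no inherent order (e.g. eye color).
Ordinal: categories with a meaningful order, but the distances between them are not meaningful (e.g. survey ratings).
Interval: ordered values with meaningful distances but no true zero (e.g. temperature in Celsius).
Ratio: interval values with a true zero, so ratios are meaningful (e.g. height or weight).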

Python includes built-in data types for lists, sets, and dicts?

True

The median of a set of numbers is defined as …

The middle value.
Put the numbers in order.
If two numbers are in the middle, compute their mean.

It is a good measure of the center of a distribution and is less affected by outliers than the mean.
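
The steps above can be sketched in plain Python. This is a minimal illustration; the function name median is chosen here for the example and is not from the course material.

```python
def median(values):
    """Return the middle value of a collection of numbers."""
    ordered = sorted(values)          # step 1: put the numbers in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: two middle values, so take their mean
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 1, 3]))       # 3
print(median([4, 1, 3, 2]))    # 2.5
```

Python's standard library also provides statistics.median, which does the same job.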

All data are quantitative?

False

Ordinal values allow us to measure distances

False

Nominal values allow us to order different data points

False

Explain the difference between list and dict values in python.

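Python lists are ordered, variable-length arrays (not linked lists); items are looked up by integer position.

Python dictionaries are resizable hash tables that map unique keys to values; items are looked up by key. Since Python 3.7 they also preserve insertion order.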
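
A small sketch of the difference, using made-up example data (the names scores and ages are purely illustrative):

```python
# A list: ordered, indexed by integer position.
scores = [90, 75, 88]
scores.append(60)            # lists grow as needed
first_score = scores[0]      # lookup by position

# A dict: a hash table mapping unique keys to values.
ages = {"ann": 31, "bob": 27}
ages["ann"] = 32             # same key: the old value is replaced
ages["cat"] = 40             # new key: a new entry is added
ann_age = ages["ann"]        # lookup by key
```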

if statements in Python can include additional clauses using the elseif keyword.

False (the keyword is elif)
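
For reference, a minimal example of the if / elif / else form (the function name sign is only for illustration):

```python
def sign(n):
    """Describe whether n is positive, negative, or zero."""
    if n > 0:
        return "positive"
    elif n < 0:          # Python spells it elif, not elseif
        return "negative"
    else:
        return "zero"
```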

What does lambda do in Python?

lambda creates a small anonymous (unnamed) function from a single expression.

It can be used when performing a map, e.g. map(lambda x: x + 2, xs)
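
A short runnable version of that idea, with an example list named numbers chosen for illustration:

```python
numbers = [1, 2, 3]

# map applies the lambda to every element; list() collects the results.
plus_two = list(map(lambda x: x + 2, numbers))

print(plus_two)  # [3, 4, 5]
```

A list comprehension, [x + 2 for x in numbers], gives the same result and is often preferred in Python code.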

Data formats are used as a way to share data between systems?

True

Data are measurements of a phenomenon

True

Data are the same as the thing being measured

False

Systems store formatted data

True

CSV files are like a single tab (sheet) from a spreadsheet

True

NumPy arrays have no data type

False

A NumPy array is an n-dimensional array whose elements all share a single data type (its dtype).
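
A quick sketch, assuming NumPy is installed; the variable names are illustrative only:

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.5, 2.0])
grid = np.zeros((2, 3))        # a 2-dimensional array of zeros

print(ints.dtype)              # an integer dtype (exact width is platform dependent)
print(floats.dtype)            # float64
print(grid.ndim, grid.shape)   # 2 (2, 3)
```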

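NumPy arrays have a .all() method that returns True if any of the elements are true

False

.all() returns True only if every element is true; .any() returns True if at least one element is true.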
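
A small check of the two methods, assuming NumPy is installed (the array flags is example data):

```python
import numpy as np

flags = np.array([True, False, True])

print(flags.any())   # True: at least one element is true
print(flags.all())   # False: not every element is true
```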

What does the argmax function do in numpy?

Returns the indices of the maximum values along an axis
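
A brief illustration, assuming NumPy is installed; the array values is made-up data. With no axis given, argmax returns the index into the flattened array.

```python
import numpy as np

values = np.array([[1, 9, 3],
                   [7, 2, 8]])

flat_index = values.argmax()         # index of the largest value in the flattened array
per_column = values.argmax(axis=0)   # row index of the maximum in each column
per_row = values.argmax(axis=1)      # column index of the maximum in each row

print(flat_index)   # 1
print(per_column)   # [1 0 1]
print(per_row)      # [1 2]
```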

Explain the relationship between a Series and a DataFrame in pandas?

A Series is the data structure for a single column of a DataFrame.

A DataFrame can be viewed as a collection of Series that share the same index.
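
A minimal sketch of that relationship, assuming pandas is installed; the column names and values are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({"height": [1.6, 1.8], "weight": [60, 80]})

# Selecting one column of a DataFrame gives back a Series.
heights = df["height"]

print(type(heights).__name__)           # Series
print(heights.index.equals(df.index))   # True: the column shares the frame's index
```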

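What is an “index” on a pandas DataFrame?

The index is the set of labels that identify the rows of a DataFrame (or the elements of a Series). It is used to look up rows and to align data between objects.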
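
A short example of label-based lookup through the index, assuming pandas is installed; the fruit prices are made-up data:

```python
import pandas as pd

prices = pd.Series([1.5, 2.0, 0.5], index=["apple", "pear", "plum"])

print(list(prices.index))    # ['apple', 'pear', 'plum']
pear_price = prices.loc["pear"]   # look up a value by its index label
print(pear_price)            # 2.0
```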

You can see the first several rows of a dataframe using df.first()

False (use df.head())

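Given two Series, a and b, does a == b return True if they’re equivalent and False otherwise? If not, what does it produce instead?

No. The == operator compares the two Series element by element.

It returns a Boolean Series of the same length, with True wherever the paired elements are equal. To get a single True/False for the whole Series, use a.equals(b) or (a == b).all().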
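
A small demonstration of element-wise comparison, assuming pandas is installed; a and b here are example Series with the same default index:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 0, 3])

elementwise = a == b          # a Boolean Series, one entry per element
print(elementwise.tolist())   # [True, False, True]

print(a.equals(b))            # False: a single answer for the whole Series
print((a == b).all())         # False: True only if every element matched
```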

Compare and contrast supervised and unsupervised machine learning

Supervised learning is machine learning on data for which we have the answers (labeled data); the model learns to predict the label.

Unsupervised learning is machine learning on data with no labels; the model looks for structure, such as clusters, in the data itself.

Compare and contrast regression and classification

Regression predicts a continuous numeric value from the input features.

Classification predicts which of a discrete set of classes an input belongs to.

The line separating the positive class and negative class is called the

decision boundary

In order to apply machine learning, we must first convert our data into a numeric format

True

Define precision and recall.

Precision is true positive / (true positive + false positive)
Recall is true positive / (true positive + false negative)
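
Both definitions as a minimal sketch. The counts tp, fp and fn below are invented example numbers, not results from any real classifier.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really is positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really is positive, the fraction that was found."""
    return tp / (tp + fn)


tp, fp, fn = 8, 2, 4            # hypothetical counts
print(precision(tp, fp))        # 0.8
print(recall(tp, fn))           # 8 / 12
```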

What does a confusion matrix display?

A confusion matrix tabulates a classifier's predictions against the true labels, counting each combination:

                   | Predicted positive            | Predicted negative
Condition positive | True positive                 | False negative (type II error)
Condition negative | False positive (type I error) | True negative
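
One way to count the four cells for a binary problem, sketched in plain Python. The function name confusion_counts and the label lists are illustrative assumptions; labels are taken to be 1 (positive) and 0 (negative).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1      # positive, predicted positive
        elif a == 1 and p == 0:
            counts["fn"] += 1      # positive, but missed
        elif a == 0 and p == 1:
            counts["fp"] += 1      # negative, but flagged positive
        else:
            counts["tn"] += 1      # negative, predicted negative
    return counts


actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0]
print(confusion_counts(actual, predicted))
# {'tp': 2, 'fn': 1, 'fp': 1, 'tn': 2}
```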

F1-score is a combination of which of the following metrics:

precision and recall
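
Specifically, F1 is the harmonic mean of the two. A minimal sketch, with the function name f1_score chosen for the example and made-up input values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


example_f1 = f1_score(0.8, 0.5)   # hypothetical precision and recall
print(example_f1)                 # about 0.615
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by being strong on only one of them.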

What is the visual effect of modifying the cluster_std when using make_blobs() to generate synthetic classification data?

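cluster_std sets the standard deviation of each blob. Increasing it spreads the points further from their cluster center, so the blobs look wider and overlap more; decreasing it pulls the points in toward the center, giving tighter, better-separated blobs.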
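
The effect can be imitated without scikit-learn. The sketch below does not call make_blobs; it is a standard-library stand-in that draws Gaussian points around one center, and the helper names make_blob and mean_distance are hypothetical.

```python
import math
import random


def make_blob(center, std, n, rng):
    """Draw n 2-D points around center with the given standard deviation."""
    cx, cy = center
    return [(rng.gauss(cx, std), rng.gauss(cy, std)) for _ in range(n)]


def mean_distance(points, center):
    """Average Euclidean distance from the points to the center."""
    return sum(math.dist(p, center) for p in points) / len(points)


rng = random.Random(0)                       # fixed seed for repeatability
tight = make_blob((0, 0), 0.5, 500, rng)     # small standard deviation
wide = make_blob((0, 0), 2.0, 500, rng)      # large standard deviation

# The wide blob's points sit much further from the center on average.
print(mean_distance(tight, (0, 0)) < mean_distance(wide, (0, 0)))  # True
```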

How is the cost function used during the training process?

A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function.
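
As a rough sketch of that loop, the code below fits a line y = w*x + b by gradient descent on the sum of squared errors. The data, learning rate and step count are assumptions picked for the example, and both the weight w and the bias b are learned rather than set by hand.

```python
def cost(w, b, xs, ys):
    """Sum of squared errors of the line w*x + b over the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def gradient_step(w, b, xs, ys, learning_rate):
    """Move w and b one small step downhill on the cost surface."""
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return w - learning_rate * grad_w, b - learning_rate * grad_b


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]            # generated from y = 2x + 1

w, b = 0.0, 0.0              # arbitrary starting guess
starting_cost = cost(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.01)

print(round(w, 3), round(b, 3))   # 2.0 1.0
```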

During training of a linear regressor, the user must specify the bias parameter in advance.

False

What is a residual in linear regression?

It is the difference between an observed value and the value the model predicts for it.

How is the residual sum of squares (RSS) cost function for linear regression defined?

The sum of the squared residuals: RSS = sum over i of (y_i - ŷ_i)^2, where y_i is the observed value and ŷ_i is the predicted value. It is a measure of the discrepancy between the data and an estimation model.
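
A worked example in plain Python, using invented observed and predicted values:

```python
def residuals(observed, predicted):
    """Observed value minus predicted value, for each sample."""
    return [y - y_hat for y, y_hat in zip(observed, predicted)]


def rss(observed, predicted):
    """Residual sum of squares."""
    return sum(r ** 2 for r in residuals(observed, predicted))


observed = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

print(residuals(observed, predicted))   # [0.5, 0.0, -1.0]
print(rss(observed, predicted))         # 1.25
```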

In linear regression, our regressor learns a decision boundary that's shaped like a bowl (or upside-down dome).

False

What is the role of the logistic function in logistic regression?

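It maps the model's linear output to a value between 0 and 1, which is read as the probability of the positive class. Thresholding that probability (commonly at 0.5) gives the 0 or 1 prediction.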
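
A minimal sketch of the logistic (sigmoid) function and a thresholded prediction. The helper predict_class and its 0.5 default threshold are assumptions for illustration.

```python
import math


def logistic(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if logistic(z) >= threshold else 0


print(logistic(0))          # 0.5
print(predict_class(2))     # 1
print(predict_class(-2))    # 0
```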

What is the name of the cost function we learned for logistic regression?

Cross-entropy (also called log loss)

The cost function for logistic regression penalizes all wrong answers equally

False
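
To see why, here is a sketch of the binary cross-entropy for a single sample, where y is the true label (0 or 1) and p is the predicted probability of class 1. The probabilities used are made-up examples.

```python
import math


def cross_entropy(y, p):
    """Binary cross-entropy loss for one sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# The true label is 1 in all three cases.
slightly_wrong = cross_entropy(1, 0.4)      # wrong, but not confident
confidently_wrong = cross_entropy(1, 0.01)  # wrong and very confident
correct = cross_entropy(1, 0.9)             # right and fairly confident

# The more confident the wrong answer, the larger the penalty.
print(confidently_wrong > slightly_wrong > correct)   # True
```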

Define the basic algorithm for kNN classification.

For an unlabeled sample s, we look at the closest labeled training samples to determine what the class of s should be.
1) Find the k nearest labeled training points to s.
2) Take a majority vote over the labels of those k points.
3) Assign the winning class to s.
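
The three steps can be sketched as follows. This is a minimal illustration using Euclidean distance and made-up 2-D training points; the function name knn_classify is hypothetical, and ties are not given any special handling.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (point, label) pairs.
    """
    # 1) find the k nearest labeled training points
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # 2) take a majority vote over their labels
    votes = Counter(label for _, label in nearest)
    # 3) assign the winning class
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]

print(knn_classify(train, (2, 2), k=3))   # red
print(knn_classify(train, (6, 5), k=3))   # blue
```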

What parameter performs regularization in kNN classification? If you want to increase variance in a kNN classifier, should this parameter get larger or smaller? Why?

k is the parameter that performs regularization. To increase the variance, you should reduce k. Increasing k acts like a regularizer and tends towards a more biased, smoother model.

Explain cross-validation. Why do we use cross-validation?

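Cross-validation splits the training data into several folds, trains on all but one fold, validates on the held-out fold, and repeats so that each fold is held out once. We use it to estimate how well a model generalizes and to catch overfitting (memorizing the training data), which is especially useful when the available data set is limited.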
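
A sketch of how the folds might be formed, working only with sample indices. The function name k_fold_splits is an assumption for this example, and no shuffling is done, which a real workflow would usually add.

```python
def k_fold_splits(n_samples, k):
    """Return a list of (training_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    fold_size, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        # the first `extra` folds get one additional sample
        stop = start + fold_size + (1 if fold < extra else 0)
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        splits.append((training, validation))
        start = stop
    return splits


for training, validation in k_fold_splits(6, 3):
    print(training, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample lands in the validation set exactly once, so every data point is used for both training and evaluation across the folds.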

In class we discussed that there are three basic contributors to model error. What are they?

Overfitting (high variance)
Underfitting (high bias)
Not enough data

What is the purpose of regularization in machine learning?

The purpose of regularization is to encourage a simpler model.

What part of the learning process is modified when we apply regularization to a logistic or linear regression model?

The cost function is modified: a penalty on the size of the weights is added to it, so minimizing the cost also favors smaller weights.

L1 regularization is the sum of the square roots of the weights in your model

False (L1 is the sum of the absolute values of the weights)

L2 regularization is the sum of the squares of the weights of your model

True
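
Both penalties, and how one might be added to a cost, as a minimal sketch. The weights, the data cost of 10.0 and the strength lam = 0.1 are invented example values, and regularized_cost is a hypothetical helper name.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w * w for w in weights)


def regularized_cost(data_cost, weights, lam, penalty):
    """Original cost plus lam times the chosen weight penalty."""
    return data_cost + lam * penalty(weights)


weights = [3.0, -4.0, 0.0]
print(l1_penalty(weights))    # 7.0
print(l2_penalty(weights))    # 25.0

total = regularized_cost(10.0, weights, lam=0.1, penalty=l2_penalty)
```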

L1 regularization will result in sparser models

True

A sparser model means that fewer training samples are used during the learning process

False (a sparser model has more weights that are exactly zero)

Hyperparameters are adjusted outside the training procedure of the model

True