Mid-Term Exam Flashcards

1
Q

List and describe the four levels of measurement.

A

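Nominal: named categories with no inherent order (e.g. eye color)
Ordinal: categories with a meaningful order, but the gaps between them cannot be measured (e.g. survey ratings)
Interval: ordered values with equal, measurable spacing but no true zero (e.g. temperature in Celsius)
Ratio: interval values with a true zero, so ratios are meaningful (e.g. height)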

2
Q

Python includes built-in data types for lists, sets, and dicts.

A

True

3
Q

The median of a set of numbers is defined as …

A

The middle value of the ordered data.
Put the numbers in order and take the one in the middle.
If two numbers are in the middle, take their mean.

It is a good measure of the center of a distribution because extreme values do not pull it.
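
The steps above can be sketched in a few lines of Python. The helper name median_by_hand is made up for illustration; the standard library's statistics.median is shown alongside it as a cross-check.

```python
from statistics import median


def median_by_hand(values):
    """Sort, then take the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [7, 1, 3]
even = [7, 1, 3, 5]
print(median_by_hand(odd), median(odd))    # 3 3
print(median_by_hand(even), median(even))  # 4.0 4.0
```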

4
Q

All data are quantitative?

A

False

5
Q

Ordinal values allow us to measure distances

A

False

6
Q

Nominal values allow us to order different data points

A

False

7
Q

Python includes built-in data types for lists, sets, and dicts.

A

True

8
Q

Explain the difference between list and dict values in python.

A

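Python lists are ordered, variable-length arrays (not linked lists); items are accessed by integer position.

Python dictionaries are resizable hash tables that map unique keys to values; items are accessed by key. Since Python 3.7 they preserve insertion order.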
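
A short sketch of the difference in practice; the variable names and values are made up for illustration.

```python
# A list is an ordered sequence accessed by integer position.
scores = [88, 92, 75]
scores.append(60)            # lists grow as needed
first_score = scores[0]

# A dict maps unique, hashable keys to values.
ages = {"ann": 31, "bob": 27}
ages["ann"] = 32             # assigning to an existing key overwrites its value
ages["cy"] = 40              # assigning to a new key adds an entry

print(first_score, len(scores))   # 88 4
print(ages["ann"], list(ages))    # 32 ['ann', 'bob', 'cy']
```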

9
Q

if statements in Python can include additional clauses using the elseif keyword.

A

False. The keyword is elif.

10
Q

What does lambda do in Python?

A

lambda creates a small anonymous function from a single expression.

It is often passed to functions such as map, e.g. map(lambda x: x + 2, xs)
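
A minimal runnable version of that idea; the list contents are made up for illustration.

```python
numbers = [1, 2, 3]

# map applies the anonymous function to every item; list() collects the results.
plus_two = list(map(lambda x: x + 2, numbers))
print(plus_two)  # [3, 4, 5]

# lambdas are also handy as a sort key.
by_length = sorted(["pear", "fig", "banana"], key=lambda s: len(s))
print(by_length)  # ['fig', 'pear', 'banana']
```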

11
Q

Data formats are used as a way to share data between systems?

A

True

12
Q

Data are measurements of a phenomenon

A

True

13
Q

Data are the same as the thing being measured

A

False

14
Q

Systems store formatted data

A

True

15
Q

CSV files are like a tab from a spreadsheet

A

True

16
Q

numpy arrays have no data type

A

False

A NumPy array is an n-dimensional array whose elements all share a single data type (dtype).
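
A quick way to see this is to inspect the dtype attribute. The arrays below are made up for illustration; the exact integer width (for example int64) can differ between platforms, so the sketch only checks the kind of type.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5])
mixed = np.array([1, 2.5])      # the int is upcast so every element is a float
grid = np.zeros((2, 3))         # a 2-dimensional array of floats

print(np.issubdtype(ints.dtype, np.integer))  # True
print(floats.dtype, mixed.dtype)              # float64 float64
print(grid.ndim, grid.shape)                  # 2 (2, 3)
```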

17
Q

NumPy arrays have a .all() method that returns True if any of the elements are true

A

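False. .all() returns True only if every element is true; .any() returns True if at least one element is true.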
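
A small check of the difference between the two methods, using a made-up boolean array:

```python
import numpy as np

flags = np.array([True, False, True])

print(flags.all())  # False: not every element is true
print(flags.any())  # True: at least one element is true
```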

18
Q

What does the argmax function do in numpy?

A

Returns the indices of the maximum values along an axis (or the index into the flattened array if no axis is given).
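
For example, on a small made-up 2 x 3 array:

```python
import numpy as np

values = np.array([[1, 9, 3],
                   [7, 2, 8]])

flat_index = values.argmax()         # position of the max in the flattened array
per_column = values.argmax(axis=0)   # row index of the max in each column
per_row = values.argmax(axis=1)      # column index of the max in each row

print(flat_index)   # 1
print(per_column)   # [1 0 1]
print(per_row)      # [1 2]
```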

19
Q

Explain the relationship between a Series and a DataFrame in pandas?

A

A Series is the data structure for a single column of a DataFrame.

A DataFrame can be thought of as a collection of Series that share the same index.
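
Selecting one column shows the relationship directly. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "age": [31, 27]})

ages = df["age"]                      # selecting one column gives a Series
print(type(ages).__name__)            # Series
print(ages.name)                      # age
print(ages.index.equals(df.index))    # True: the column shares the frame's index
```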

20
Q

What is an “index” on a pandas dataframe?

A

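An index is the set of labels attached to the rows of a DataFrame (or the entries of a Series). It is used to look up rows and to align data between objects.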
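
A minimal sketch with made-up row labels, showing the index being used for lookup with .loc:

```python
import pandas as pd

df = pd.DataFrame({"age": [31, 27]}, index=["ann", "bob"])

print(list(df.index))          # ['ann', 'bob']
bob_age = df.loc["bob", "age"]  # look a row up by its index label
print(bob_age)                 # 27
```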

21
Q

You can see the first several rows of a dataframe using

df.first()

A

False. Use df.head().
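
The usual way to peek at the top of a frame is head, sketched here on a made-up ten-row frame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_five = df.head()     # with no argument, head returns the first 5 rows
first_three = df.head(3)   # pass a number to choose how many rows

print(len(first_five))            # 5
print(list(first_three["n"]))     # [0, 1, 2]
```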

22
Q

Given two series, a and b, does a == b return true if they’re equivalent and false otherwise? If not, what does it produce instead?

A

No. The == operator compares the two Series element by element and returns a new Series of booleans, one per position.

To get a single True or False, use a.equals(b) or (a == b).all().
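
A short demonstration with two made-up Series that share the same default index:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 0, 3])

elementwise = a == b           # a boolean Series, one entry per position
print(list(elementwise))       # [True, False, True]

all_equal = (a == b).all()     # collapse to a single answer
same = a.equals(b)             # single True/False for the whole Series
print(all_equal, same)         # False False
```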

23
Q

Compare and contrast supervised and unsupervised machine learning

A

Supervised learning trains on labeled data: each example comes with the answer we want the model to predict.

Unsupervised learning works with unlabeled data: there is no known answer, so the model looks for structure such as clusters.

24
Q

Compare and contrast regression and classification

A

Regression predicts a continuous numeric value (for example, a price) from input features.

Classification predicts which of a discrete set of classes a sample belongs to.

25
Q

The line separating the positive class and negative class is called the

A

decision boundary

26
Q

In order to apply machine learning, we must first convert our data into a numeric format

A

True

27
Q

Define precision and recall.

A

Precision is true positive / (true positive + false positive)

Recall is true positive / (true positive + false negative)
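
Both definitions translate directly into code. The function names and the counts below are made up for illustration.

```python
def precision(tp, fp):
    """Of everything predicted positive, what fraction really was positive?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really was positive, what fraction did we find?"""
    return tp / (tp + fn)


tp, fp, fn = 8, 2, 4
print(precision(tp, fp))          # 0.8
print(round(recall(tp, fn), 3))   # 0.667
```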

28
Q

What does a confusion matrix display?

A

A confusion matrix tabulates how a classifier's predictions compare with the true labels, counting each combination of actual and predicted class.

                   | Predicted positive             | Predicted negative
Actual positive    | True positive                  | False negative (type 2 error)
Actual negative    | False positive (type 1 error)  | True negative
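
The four cells can be counted by comparing actual and predicted labels pair by pair. This is a minimal sketch for binary 0/1 labels; the helper name confusion_counts and the data are made up for illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = Counter(tp=0, fn=0, fp=0, tn=0)
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1      # actually positive, predicted positive
        elif a and not p:
            counts["fn"] += 1      # actually positive, predicted negative
        elif not a and p:
            counts["fp"] += 1      # actually negative, predicted positive
        else:
            counts["tn"] += 1      # actually negative, predicted negative
    return counts


actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0]
cells = confusion_counts(actual, predicted)
print(cells["tp"], cells["fn"], cells["fp"], cells["tn"])  # 2 1 1 2
```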

29
Q

F1-score is a combination of which of the following metrics:

A

precision and recall
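
Specifically, F1 is the harmonic mean of the two. A minimal sketch follows; returning 0.0 when both inputs are zero is a convention assumed here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0   # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.5, 0.5))             # 0.5
print(round(f1_score(0.8, 0.5), 3))   # 0.615
```

Because it is a harmonic mean, F1 is pulled towards the smaller of the two values, so a model cannot score well by excelling at only one of them.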

30
Q

What is the visual effect of modifying the cluster_std when using make_blobs() to generate synthetic classification data?

A

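cluster_std sets the standard deviation of each blob. Increasing it spreads the points further from their cluster center, so the blobs look wider and may overlap; decreasing it pulls the points in towards the center, giving tight, well-separated blobs.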
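
The effect can be checked numerically without plotting. This sketch assumes scikit-learn is installed; the centers, sample count, and the two cluster_std values are made up for illustration.

```python
from sklearn.datasets import make_blobs

centers = [[0.0, 0.0], [6.0, 6.0]]

# Same centers and seed, different spread.
X_tight, y_tight = make_blobs(n_samples=300, centers=centers,
                              cluster_std=0.5, random_state=0)
X_loose, y_loose = make_blobs(n_samples=300, centers=centers,
                              cluster_std=3.0, random_state=0)

# Measure how far the points of class 0 scatter around their center.
spread_tight = X_tight[y_tight == 0].std()
spread_loose = X_loose[y_loose == 0].std()
print(spread_tight < spread_loose)  # True
```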

31
Q

How is the cost function used during the training process?

A

A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function.
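
As a rough sketch of that loop, the code below fits a line y = w*x + b by gradient descent. It uses the mean of the squared errors rather than the sum, which only rescales the gradient; the function name, learning rate, step count, and data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Minimize mean squared error of y = w*x + b with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current guess on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

Note that both the weight w and the bias b are learned by the loop; neither is supplied in advance.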

32
Q

During training of a linear regressor, the user must specify the bias parameter in advance.

A

False

33
Q

What is a residual in linear regression?

A

The difference between an observed value and the value the model predicts for it (observed minus predicted).

34
Q

How is the residual sum of squares (RSS) cost function for linear regression defined?

A

The sum of the squared residuals: RSS = sum over all samples of (observed - predicted)^2. It is a measure of the discrepancy between the data and an estimation model.
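
In code, with a residual taken as observed minus predicted, the calculation looks like this. The function names and numbers are made up for illustration.

```python
def residuals(observed, predicted):
    """Observed value minus predicted value, for each sample."""
    return [y - y_hat for y, y_hat in zip(observed, predicted)]


def rss(observed, predicted):
    """Residual sum of squares."""
    return sum(r ** 2 for r in residuals(observed, predicted))


observed = [3, 5, 7]
predicted = [2, 5, 9]
print(residuals(observed, predicted))  # [1, 0, -2]
print(rss(observed, predicted))        # 5
```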

35
Q

In linear regression, our regressor learns a decision boundary that’s shaped like a bowl (or upside-down dome).

A

False

36
Q

What is the role of the logistic function in logistic regression?

A

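It squashes the model's linear output into a value between 0 and 1, which is read as the probability of the positive class. Comparing that probability to a threshold (commonly 0.5) gives the predicted class.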
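
A minimal sketch of the logistic (sigmoid) function and a thresholded prediction follows. The helper names and the 0.5 threshold are assumptions for illustration, and this simple form can overflow for very large negative inputs.

```python
import math


def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


def predict_class(z, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if logistic(z) >= threshold else 0


print(logistic(0))                             # 0.5
print(round(logistic(4), 3))                   # 0.982
print(predict_class(2), predict_class(-2))     # 1 0
```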

37
Q

What is the name of the cost function we learned for logistic regression?

A

Cross-entropy (also called log loss)

38
Q

The cost function for logistic regression penalizes all wrong answers equally

A

False
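
The unequal penalty is easy to see with the binary cross-entropy formula. In this sketch the function name log_loss is made up, and p is the predicted probability of the positive class, assumed to lie strictly between 0 and 1.

```python
import math


def log_loss(y_true, p):
    """Binary cross-entropy for one sample; y_true is 0 or 1, 0 < p < 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# The true label is 1 in every case below.
print(round(log_loss(1, 0.9), 3))    # 0.105  (confident and right)
print(round(log_loss(1, 0.4), 3))    # 0.916  (mildly wrong)
print(round(log_loss(1, 0.01), 3))   # 4.605  (confident and wrong)
```

Both of the last two predictions fall on the wrong side of 0.5, but the confident mistake costs about five times as much.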

39
Q

Define the basic algorithm for kNN classification.

A

For an unlabeled sample s, look at the closest labeled training samples to decide what the class of s should be.

1) Find the k training points nearest to s
2) Take a majority vote over the labels of those k points
3) Assign the winning class to s
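
The three steps can be sketched in plain Python. The function name knn_predict, the use of Euclidean distance, and the toy data are assumptions for illustration; math.dist needs Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # 1) Find the k nearest training points (Euclidean distance).
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # 2) Majority vote, 3) assign the winning class.
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))  # a
print(knn_predict(points, labels, (5.5, 5.5)))  # b
```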

40
Q

What parameter performs regularization in kNN classification? If you want to increase variance in a kNN classifier, should this parameter get larger or smaller? Why?

A

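k, the number of neighbors, is the parameter that performs regularization.

To increase the variance, reduce k: with few neighbors, each prediction depends on a handful of nearby points, so the model follows noise in the training data.

Increasing k acts like a regularizer and tends towards a smoother, more biased model.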

41
Q

Explain cross-validation. Why do we use cross-validation?

A

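Cross-validation splits the available data into k folds, trains on k-1 of them, evaluates on the held-out fold, and repeats so that every fold is used once for validation. We use it when data is limited: it gives a more reliable estimate of how the model performs on unseen data and helps reveal when a model is overlearning (memorizing) the training set.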
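
One common form is k-fold cross-validation. The sketch below only produces the index splits; the function name k_fold_indices is made up, and shuffling the data first is left out for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # Spread any leftover samples over the first few folds.
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(6, 3))
for training, validation in splits:
    print(training, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

A model would be trained and scored once per split, and the scores averaged.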

42
Q

In class we discussed that there are three basic contributors to model error. What are they?

A

Bias (error from underfitting)
Variance (error from overfitting)
Irreducible error (noise in the data)

43
Q

What is the purpose of regularization in machine learning?

A

The purpose of regularization is to encourage a simpler model, which reduces overfitting and helps the model generalize to new data.

44
Q

What part of the learning process is modified when we apply regularization to a logistic or linear regression model?

A

The cost function is modified: a penalty term based on the size of the weights is added to it, so training balances fitting the data against keeping the weights small.

45
Q

L1 regularization is the sum of the square roots of the weights in your model

A

False. L1 regularization is the sum of the absolute values of the weights.

46
Q

L2 regularization is the sum of the squares of the weights of your model

A

True
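
The L1 and L2 penalties can be written side by side. This is a minimal sketch: the function names, the example weights, and the strength parameter are made up for illustration, and some formulations scale the L2 term by 1/2.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squares of the weights."""
    return sum(w * w for w in weights)


def regularized_cost(data_cost, weights, strength, penalty):
    """Add the weighted penalty term to the unregularized cost."""
    return data_cost + strength * penalty(weights)


weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights))  # 2.5
print(l2_penalty(weights))  # 4.25
total = regularized_cost(10.0, weights, 0.1, l2_penalty)  # roughly 10.425
```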

47
Q

L1 regularization will result in sparser models

A

True

48
Q

Sparser models mean that there are fewer training samples used during the learning process

A

False. Sparsity means that more of the model's weights are exactly zero.

49
Q

Hyperparameters are adjusted outside the training procedure of the model

A

True