Jupyter Notebook 1.1 Simple Examples Flashcards

1
Q

What is classification?

A

Assigning each data point to a class is called classification.

Classification is a supervised machine learning method.

A prediction task is a classification task when the target variable is categorical, e.g. yes/no (binary classification).

  • Diabetes competition (binary classification)
    Task : Classify data sampled from a study of more than 177,000 subjects
    Input : BMI, blood pressure, cholesterol, and various lifestyle information; 21 features in total.
    Output: A label of “Positive” or “Negative”, i.e. whether the person has diabetes or not.
  • Spam filtering (spam/not spam)
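A minimal runnable sketch of binary classification, using scikit-learn’s LogisticRegression on synthetic data (the dataset and sizes here are invented for illustration, not the real diabetes data):

```python
# Binary-classification sketch on synthetic data (not the real competition data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 200 synthetic samples with 21 features, mirroring the card above
X, y = make_classification(n_samples=200, n_features=21, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)               # learn from labeled examples
preds = clf.predict(X[:5])  # each prediction is a class label, 0 or 1
print(preds)
```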
2
Q

In a dataset:

What is each row called?
What is each column called?

A

Each row is called a sample.

Each column is called a feature.

3
Q

What are DataFrames?

A

DataFrames are a fundamental data structure in Pandas, offering a 2-dimensional, size-mutable, and heterogeneous tabular structure with labeled axes (rows and columns). In simpler terms, it’s a way to store data in a table format, similar to an Excel sheet, which makes data manipulation and analysis more intuitive.
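A small sketch of that idea (the column names here are made up):

```python
import pandas as pd

# A DataFrame is a 2-D labeled table whose columns may hold different types.
df = pd.DataFrame({
    "age": [63, 45, 52],           # integers
    "bmi": [28.1, 31.4, 24.9],     # floats
    "smoker": ["no", "yes", "no"], # strings
})
print(df.shape)   # (3, 3): 3 rows (samples) x 3 columns (features)
print(df.dtypes)  # per-column types show the heterogeneity
```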

4
Q

How do we list available features and labels?

A

features = diabetes_dataset['feature_names']
print(f"Features: {features}")
Output: Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

print(f"Labels: {diabetes_dataset['target_names']}")
Output: Labels: ['setosa' 'versicolor' 'virginica']
(Those labels actually belong to the iris dataset, not diabetes, but that’s how you find them.)
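For a self-contained version, the iris dataset (where those labels actually come from) can be loaded directly; scikit-learn datasets are Bunch objects that support dictionary-style access:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris["feature_names"])  # the input columns
print(iris["target_names"])   # ['setosa' 'versicolor' 'virginica']
```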

5
Q

What method or code is used to remove rows from our DataFrames that contain missing values (NaN)?

A

diabetes_data = diabetes_data.dropna(axis=0)

This code removes any rows from the DataFrame diabetes_data that contain missing values (NaN). Specifically:

  • dropna() is a method that removes missing data (NaN values).
  • axis=0 specifies that the operation should be performed on rows. If there is at least one NaN value in a row, that entire row will be removed from the DataFrame.
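A quick sketch of the effect on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bmi": [28.1, np.nan, 24.9],
    "bp":  [80.0, 95.0, np.nan],
})
cleaned = df.dropna(axis=0)         # drop every row containing any NaN
print(len(df), "->", len(cleaned))  # 3 -> 1: only the fully-filled row survives
```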
6
Q

How do you display a list of all columns in a DataFrame?

A

Use the columns property of the DataFrame. For example:
melbourne_data.columns

7
Q

What does the dropna(axis=0) method do in pandas?

A

It removes any rows from the DataFrame that contain missing values (NaN).

8
Q

How do you select the “prediction target” in a DataFrame?

A

Use dot-notation to select the desired column. For example:
y = melbourne_data.Price

9
Q

What are “features” in a machine learning model?

A

Features are the columns in your dataset that are used as inputs to predict the target variable.

10
Q

How do you select multiple features in a DataFrame?

A

Use a list of column names inside brackets. For example:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
(Note: ‘Lattitude’ and ‘Longtitude’ are misspelled in the original dataset, so the code must match those column names.)
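Putting the last two cards together on a tiny stand-in frame (the real Melbourne CSV isn’t loaded here, so the values are invented):

```python
import pandas as pd

# Invented stand-in for the Melbourne housing data
melbourne_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Landsize": [156.0, 134.0, 120.0],
    "Price": [1035000.0, 1465000.0, 850000.0],
})

y = melbourne_data.Price                               # target: a Series
X = melbourne_data[["Rooms", "Bathroom", "Landsize"]]  # features: a DataFrame
print(type(y).__name__, type(X).__name__)  # Series DataFrame
```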

11
Q

What are the four key steps to building and using a machine learning model?

A
  1. Define the model
  2. Fit the model with training data
  3. Predict using the model
  4. Evaluate the model’s performance
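The four steps above, sketched with a DecisionTreeRegressor on synthetic data (substitute your own X and y):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

model = DecisionTreeRegressor(random_state=1)  # 1. Define the model
model.fit(X, y)                                # 2. Fit on training data
preds = model.predict(X)                       # 3. Predict
mae = mean_absolute_error(y, preds)            # 4. Evaluate
print(mae)  # measured on the training data itself, so this is optimistic
```

Note that evaluating on the same data used for fitting gives an over-optimistic score; a held-out test set (see train_test_split below in this deck) gives an honest one.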
12
Q

What is the purpose of specifying a random_state in a model?

A

It ensures reproducible results by controlling the randomness in model training.

13
Q

How do you fit a decision tree model in scikit-learn?

A

Define and fit the model using the following code:
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

14
Q

How do you make predictions with a fitted model in scikit-learn?

A

Use the predict() function. For example:
melbourne_model.predict(X.head())

15
Q

What is overfitting?

A

Overfitting occurs when a model becomes too complex and captures noise in the training data, leading to poor generalization on new data. It fits the training data too well but struggles with unseen data.
Essentially the model is “remembering” the data rather than learning patterns.

Reasons for overfitting:
- The model is too complex (e.g. an unrestricted decision tree)
- It memorizes training examples instead of learning general patterns
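One way to see overfitting, sketched with an unrestricted decision tree on noisy synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(deep.score(X_train, y_train))  # ~1.0: the tree memorizes the noise
print(deep.score(X_test, y_test))    # noticeably lower on unseen data
```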

16
Q

What is underfitting?

A

Underfitting occurs when a model is too simple to capture patterns in the data, resulting in poor performance on both training and test data.

Reasons for underfitting:
- The model is too simple
- Too few or irrelevant features

17
Q

How does the train_test_split function work?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

A

It randomly splits X and y into matching training and test subsets; by default, 25% of the samples are held out for testing. Setting random_state=42 makes the split reproducible.
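A sketch of the default behavior on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# By default, 25% of the samples go to the test set.
print(X_train.shape, X_test.shape)  # (75, 4) (25, 4)
```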
18
Q

How can we measure the significance of each feature in making predictions in random forests?

A

Random forests provide a measure called feature importance:
importances = rf.feature_importances_
print(importances)
Output: array([0.07449143, 0.27876091, 0.08888318, 0.07157507, 0.07091345,
       0.15805822, 0.11822478, 0.13909297])
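A self-contained version on synthetic data (the rf above is assumed to be an already-fitted random forest; here one is fitted from scratch, so the numbers differ from the card’s output):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

importances = rf.feature_importances_
print(importances)        # one non-negative score per feature
print(importances.sum())  # the scores are normalized to sum to 1.0
```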

19
Q

What is permutation importance?

A

Permutation importance is a method to measure the impact of a feature on a model’s performance by randomly shuffling the feature’s values and seeing how much the model’s accuracy decreases.
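scikit-learn implements this as permutation_importance; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature column 10 times and record the drop in R^2.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean score drop per feature
```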