Jupyter Notebook 1.1 Simple Examples Flashcards

Question 1

Q

What is classification?

Answer

A

Classification is the task of assigning each data point to one of a set of predefined classes.

Classification is a supervised machine learning method.

A prediction task is a classification task when the target variable is categorical, for example yes/no.

Example: Diabetes competition (binary classification)
Task: Classify data sampled from a study of more than 177000 subjects.
Input: BMI, blood pressure, cholesterol, and various lifestyle information; 21 features in total.
Output: A label of “Positive” or “Negative”, i.e. whether the person has diabetes or not.

Another binary example: spam / not spam.
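
A minimal sketch of binary classification, assuming scikit-learn is installed. The data is synthetic (generated with make_classification), not the diabetes competition data, and DecisionTreeClassifier is only one possible choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 5 features, two classes (0 and 1).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit a classifier, then assign a class to each of the first five data points.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predicted_classes = clf.predict(X[:5])
print(predicted_classes)
```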

Question 2

Q

In a dataset:

What is each row called?
What is each column called?

Answer

A

Each row is called a sample.

Each column is called a feature.

Question 3

Q

What are DataFrames?

Answer

A

DataFrames are a fundamental data structure in Pandas, offering a 2-dimensional, size-mutable, and heterogeneous tabular structure with labeled axes (rows and columns). In simpler terms, it’s a way to store data in a table format, similar to an Excel sheet, which makes data manipulation and analysis more intuitive.
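
A small sketch of these properties, using made-up values (the column names and numbers below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data: each row is a sample, each column is a feature.
df = pd.DataFrame(
    {
        "age": [34, 51, 47],
        "bmi": [22.5, 30.1, 27.8],
        "smoker": ["no", "yes", "no"],
    }
)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # heterogeneous: each column has its own type

# Size-mutable: a column can be added after the DataFrame is created.
df["bp"] = [80.0, 95.0, 88.0]
print(df.columns.tolist())
```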

Question 4

Q

How do we list available features and labels?

Answer

A

features = diabetes_dataset['feature_names']
print(f"Features: {features}")
Output: Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

print(f"Labels: {diabetes_dataset['target_names']}")
Output: Labels: ['setosa' 'versicolor' 'virginica']
Note: the labels shown here are obviously not the diabetes labels (they are the iris class names), but this is how you look them up in a dataset that provides 'target_names'.

Question 5

Q

What method or code is used to remove rows that contain missing values (NaN) from a DataFrame?

Answer

A

diabetes_data = diabetes_data.dropna(axis=0)

This code removes any rows from the DataFrame diabetes_data that contain missing values (NaN). Specifically:

dropna() is a method that removes missing data (NaN values).
axis=0 specifies that the operation should be performed on rows. If there is at least one NaN value in a row, that entire row will be removed from the DataFrame.
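
A minimal sketch of this behaviour on a toy DataFrame (the values are made up). Note that dropna() returns a new DataFrame, which is why the result is assigned to a variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in two of the four rows.
toy = pd.DataFrame(
    {"bmi": [22.5, np.nan, 27.8, 31.0], "bp": [80.0, 95.0, np.nan, 90.0]}
)

cleaned = toy.dropna(axis=0)  # drop every row that has at least one NaN

print(len(toy), len(cleaned))  # the original DataFrame is left unchanged
print(cleaned.index.tolist())  # surviving rows keep their original labels
```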

Question 6

Q

How do you display a list of all columns in a DataFrame?

Answer

A

Use the columns property of the DataFrame. For example:
melbourne_data.columns

Question 7

Q

What does the dropna(axis=0) method do in pandas?

Answer

A

It removes any rows from the DataFrame that contain missing values (NaN).

Question 8

Q

How do you select the “prediction target” in a DataFrame?

Answer

A

Use dot-notation to select the desired column. For example:
y = melbourne_data.Price

Question 9

Q

What are “features” in a machine learning model?

Answer

A

Features are the columns in your dataset that are used as inputs to predict the target variable.

Question 10

Q

How do you select multiple features in a DataFrame?

Answer

A

Use a list of column names inside brackets. For example:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
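
A short sketch of selecting a target and a set of features, using a hypothetical stand-in for melbourne_data with made-up values. Bracket notation is shown next to dot-notation because it also works for column names that are not valid Python identifiers.

```python
import pandas as pd

# Hypothetical stand-in for melbourne_data; the values are made up.
housing = pd.DataFrame(
    {
        "Rooms": [2, 3, 4],
        "Bathroom": [1, 2, 2],
        "Landsize": [150.0, 320.0, 610.0],
        "Price": [850000.0, 1200000.0, 1650000.0],
    }
)

print(housing.columns)  # list every column name

y = housing["Price"]  # same result as housing.Price
X = housing[["Rooms", "Bathroom", "Landsize"]]  # a list of names selects several columns

print(type(y).__name__, type(X).__name__)
```

Selecting one column gives a Series, while selecting with a list of names gives a DataFrame.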

Question 11

Q

What are the four key steps to building and using a machine learning model?

Answer

A

Define the model
Fit the model with training data
Predict using the model
Evaluate the model’s performance
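
The four steps can be sketched as follows. This is an illustration on synthetic data, assuming scikit-learn; the decision tree and the mean absolute error metric are example choices, not the only options.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=200)

# Hold out part of the data so the evaluation uses unseen samples.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)          # 1. define
model.fit(X_train, y_train)                            # 2. fit
val_predictions = model.predict(X_val)                 # 3. predict
val_mae = mean_absolute_error(y_val, val_predictions)  # 4. evaluate
print(val_mae)
```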

Question 12

Q

What is the purpose of specifying a random_state in a model?

Answer

A

It ensures reproducible results by controlling the randomness in model training.
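
A small sketch of the effect, assuming scikit-learn and synthetic data: two random forests built with the same random_state produce identical predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only to have something to fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(0, 0.1, size=100)

# Same data and same random_state: the random choices made during
# training (bootstrap samples, feature subsets) are repeated exactly.
model_a = RandomForestRegressor(n_estimators=10, random_state=1).fit(X, y)
model_b = RandomForestRegressor(n_estimators=10, random_state=1).fit(X, y)

preds_a = model_a.predict(X[:5])
preds_b = model_b.predict(X[:5])
print(np.array_equal(preds_a, preds_b))
```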

Question 13

Q

How do you fit a decision tree model in scikit-learn?

Answer

A

Define and fit the model using the following code:
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

Question 14

Q

How do you make predictions with a fitted model in scikit-learn?

Answer

A

Use the predict() method of the fitted model. For example:
melbourne_model.predict(X.head())

Question 15

Q

What is overfitting?

Answer

A

Overfitting occurs when a model becomes too complex and captures noise in the training data, leading to poor generalization on new data. It fits the training data too well but struggles with unseen data.
Essentially, the model is “memorizing” the training data rather than learning general patterns.

Reasons for overfitting:
- Model is too complex
- It memorizes noise in the training data instead of general patterns
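
One way to see overfitting is to compare the error on the training data with the error on held-out data. The sketch below uses synthetic noisy data and a decision tree with no depth limit; both are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# With no depth limit the tree can grow until it reproduces every training point.
deep_tree = DecisionTreeRegressor(random_state=0)
deep_tree.fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, deep_tree.predict(X_train))
val_mae = mean_absolute_error(y_val, deep_tree.predict(X_val))
print(train_mae, val_mae)
```

A training error near zero next to a clearly larger validation error is the typical sign that the model has fit the noise.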

Question 16

Q

What is underfitting?

Answer

A

Underfitting occurs when a model is too simple to capture patterns in the data, resulting in poor performance on both training and test data.

Reasons for underfitting:
- Model is too simple
- Too few or irrelevant features

Question 17

Q

How does the train_test_split function work?
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Answer

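A

It splits X and y into a training set and a test set, shuffling the rows first and keeping each sample paired with its target. When neither test_size nor train_size is given, 25% of the samples go to the test set. Setting random_state=42 makes the shuffle, and therefore the split, reproducible.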
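
A minimal sketch of the call from the question on toy arrays (the data here is hypothetical). It relies on scikit-learn's default of placing 25% of the samples in the test set when no size is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each; row i is [2*i, 2*i + 1].
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)  # 15 training rows and 5 test rows
print(y_test)                       # rows are shuffled before splitting
```

Each row of X stays paired with its entry in y, and every sample ends up in exactly one of the two sets.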

Question 18

Q

How can we measure the significance of each feature in making predictions in random forests?

Answer

A

Random forests provide a measure called feature importance, available on a fitted model through the feature_importances_ attribute:
importances = rf.feature_importances_
print(importances)
array([0.07449143, 0.27876091, 0.08888318, 0.07157507, 0.07091345,
       0.15805822, 0.11822478, 0.13909297])
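
A self-contained sketch, assuming scikit-learn and synthetic data in which the first feature has the strongest effect on the target. The variable name rf follows the snippet above; everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: feature 0 matters most, feature 1 a little,
# features 2 and 3 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 4 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X, y)

importances = rf.feature_importances_
print(importances)        # one value per feature
print(importances.sum())  # the values are normalized to sum to 1
```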

Question 19

Q

What is permutation importance?

Answer

A

Permutation importance is a method to measure the impact of a feature on a model’s performance by randomly shuffling that feature’s values and seeing how much the model’s score (for example, accuracy) drops. A large drop means the model relies heavily on that feature.
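
A sketch using permutation_importance from sklearn.inspection, assuming synthetic data in which only the first feature affects the target. The random forest model and n_repeats=10 are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 influences the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5 * X[:, 0] + rng.normal(0, 0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # average drop in score per feature
```

Computing the importance on held-out data rather than on the training data shows which features matter for generalization.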

(19 cards)