Jupyter Notebook 1.1 Simple Examples Flashcards
What is classification?
Assigning each data point to a class is called classification.
Classification is a supervised machine learning method.
The prediction task is a classification task when the target variable is categorical, for example yes/no.
- Diabetes competition (binary classification)
Task: Classify data sampled from a study of more than 177,000 subjects.
Input: BMI, blood pressure, cholesterol, and various lifestyle information; 21 features in total.
Output: A label of “Positive” or “Negative”, i.e. whether the person has diabetes or not.
- Spam / not spam (binary classification)
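As a rough illustration of binary classification, the sketch below trains a scikit-learn classifier on a handful of samples. The feature values, the labels, and the variable names are all made up for illustration and are not taken from the diabetes study.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [bmi, systolic_blood_pressure].
X = [[22.0, 115], [24.5, 120], [31.0, 140], [35.2, 150], [28.9, 135], [20.1, 110]]
# One made-up class label per row.
y = ["Negative", "Negative", "Positive", "Positive", "Positive", "Negative"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# The model assigns each new data point to one of the known classes.
predictions = clf.predict([[33.0, 145], [21.0, 112]])
print(predictions)
```

The target here takes only two values, so this is binary classification. A target with three or more categories would make it multiclass classification.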
In a dataset:
What is each row called?
What is each column called?
Each row is called a sample.
Each column is called a feature.
What are DataFrames?
DataFrames are a fundamental data structure in pandas: a 2-dimensional, size-mutable, heterogeneous table with labeled axes (rows and columns). In simpler terms, a DataFrame stores data in a table, similar to an Excel sheet, which makes data manipulation and analysis more intuitive.
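A minimal sketch of those properties, using a small table with invented column names and values:

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame({
    "age": [34, 51, 46],
    "bmi": [22.4, 30.1, 27.8],
    "smoker": ["no", "yes", "no"],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # heterogeneous: each column has its own type

# Size-mutable: a column can be added after creation.
df["bp"] = [118, 142, 130]
print(df)
```

Each of the three rows is one sample, and each column is one feature.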
How do we list available features and labels?
features = diabetes_dataset['feature_names']
print(f"Features: {features}")
Output: Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(f"Labels: {diabetes_dataset['target_names']}")
Output: Labels: ['setosa' 'versicolor' 'virginica']
These are obviously not the correct labels for the diabetes data (they are the iris classes), but this is how you look them up.
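The lookup can be tried on scikit-learn's bundled datasets. This sketch assumes diabetes_dataset comes from load_diabetes(), which the card does not show, and loads the iris data alongside it to reproduce the labels printed above. A dataset with a numeric target may not provide target_names at all, so the code uses .get() to avoid an error in that case.

```python
from sklearn.datasets import load_diabetes, load_iris

# Assumption: the card's diabetes_dataset was created like this.
diabetes_dataset = load_diabetes()
iris_dataset = load_iris()

print(f"Features: {diabetes_dataset['feature_names']}")

# The returned objects behave like dictionaries, so .get() returns None
# instead of raising an error when a key is absent.
print(f"Diabetes labels: {diabetes_dataset.get('target_names')}")
print(f"Iris labels: {iris_dataset['target_names']}")
```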
What method or code is used to remove rows that contain missing values (NaN) from a DataFrame?
diabetes_data = diabetes_data.dropna(axis=0)
This code removes any rows from the DataFrame diabetes_data that contain missing values (NaN). Specifically:
- dropna() is a method that removes missing data (NaN values).
- axis=0 specifies that the operation should be performed on rows. If there is at least one NaN value in a row, that entire row will be removed from the DataFrame.
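A small sketch of the effect, using a made-up stand-in for diabetes_data with two missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one NaN.
diabetes_data = pd.DataFrame({
    "bmi": [22.4, np.nan, 27.8, 31.0],
    "bp": [118.0, 142.0, np.nan, 135.0],
})

# Count the missing values per column before dropping anything.
print(diabetes_data.isna().sum())

# Keep only the rows that have no missing values.
cleaned = diabetes_data.dropna(axis=0)
print(cleaned)
```

Only the rows with index 0 and 3 survive. Note that dropna() returns a new DataFrame, which is why the card assigns the result back to diabetes_data.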
How do you display a list of all columns in a DataFrame?
Use the columns property of the DataFrame. For example:
melbourne_data.columns
What does the dropna(axis=0) method do in pandas?
It removes any rows from the DataFrame that contain missing values (NaN).
How do you select the “prediction target” in a DataFrame?
Use dot-notation to select the desired column. For example:
y = melbourne_data.Price
What are “features” in a machine learning model?
Features are the columns in your dataset that are used as inputs to predict the target variable.
How do you select multiple features in a DataFrame?
Use a list of column names inside brackets. For example:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
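Selecting the target and the features can be sketched together. The DataFrame below is a hypothetical stand-in for melbourne_data with invented values and only some of the columns.

```python
import pandas as pd

# Hypothetical stand-in for melbourne_data; all values are invented.
melbourne_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1, 2, 2],
    "Landsize": [150, 320, 600],
    "Price": [850000, 1200000, 1600000],
})

# Bracket form, equivalent to melbourne_data.Price.
y = melbourne_data["Price"]

# A list of column names selects several columns at once.
melbourne_features = ["Rooms", "Bathroom", "Landsize"]
X = melbourne_data[melbourne_features]

print(type(y).__name__, y.shape)
print(type(X).__name__, X.shape)
```

A single column comes back as a Series, while a list of columns comes back as a DataFrame. The bracket form also works for column names that contain spaces or clash with DataFrame attributes, where dot-notation does not.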
What are the four key steps to building and using a machine learning model?
- Define the model
- Fit the model with training data
- Predict using the model
- Evaluate the model’s performance
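The four steps can be sketched end to end. The data is synthetic, and holding out a validation set with train_test_split and scoring with mean absolute error are choices made for this sketch rather than something the cards prescribe.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration: two features, one noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(0, 1.0, size=200)

# Hold out a quarter of the rows (the default) so evaluation uses unseen data.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)        # 1. define
model.fit(train_X, train_y)                          # 2. fit
val_predictions = model.predict(val_X)               # 3. predict
mae = mean_absolute_error(val_y, val_predictions)    # 4. evaluate
print(f"Validation MAE: {mae:.2f}")
```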
What is the purpose of specifying a random_state in a model?
It ensures reproducible results by fixing the seed of the random number generator, so the same code and data produce the same model on every run.
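A short sketch of that reproducibility, on synthetic data invented for illustration. It fixes random_state both when splitting the data and when defining the model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

# Same random_state, same shuffle: both calls return identical splits.
train_a, val_a, y_train_a, y_val_a = train_test_split(X, y, random_state=1)
train_b, val_b, y_train_b, y_val_b = train_test_split(X, y, random_state=1)
same_split = np.array_equal(val_a, val_b)

# Same random_state, same data: both trees make identical predictions.
model_a = DecisionTreeRegressor(random_state=1).fit(train_a, y_train_a)
model_b = DecisionTreeRegressor(random_state=1).fit(train_b, y_train_b)
same_predictions = np.array_equal(model_a.predict(val_a), model_b.predict(val_b))

print(same_split, same_predictions)
```

The particular number passed as random_state does not matter; what matters is using the same one each time.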
How do you fit a decision tree model in scikit-learn?
Define and fit the model using the following code:
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
How do you make predictions with a fitted model in scikit-learn?
Use the predict() method of the fitted model. For example:
melbourne_model.predict(X.head())
What is overfitting?
Overfitting occurs when a model becomes too complex and captures noise in the training data, leading to poor generalization on new data. It fits the training data too well but struggles with unseen data.
Essentially, the model is memorizing the training data rather than learning general patterns.
Reasons for overfitting:
- The model is too complex
- It memorizes noise and quirks of the training data instead of general patterns
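One way to see overfitting is to compare the error on the training data with the error on held-out data as a decision tree is allowed to grow. This is a sketch on synthetic data; the max_leaf_nodes values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration: a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

results = {}
# None means the tree may grow until every training sample has its own leaf.
for max_leaf_nodes in (5, 50, None):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[max_leaf_nodes] = (train_mae, val_mae)
    print(f"max_leaf_nodes={max_leaf_nodes}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

The unrestricted tree reproduces the training targets exactly, noise included, yet its validation error stays well above zero. That gap between training and validation error is the signature of overfitting.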