Scikit-learn Flashcards
How do you bring in data from scikit-learn?
from sklearn.datasets import load_iris
How do you generate random data?
from sklearn.datasets import make_blobs
What's the notation for make blobs?
X, y = make_blobs(
n_samples=150, n_features=2,
centers=3, cluster_std=0.5,
shuffle=True)
What do X and y represent in make blobs?
X is the data, y are the labels
What are the outputs of make_blobs?
X (the samples) and y (the labels); the centers are only returned as well if return_centers=True
How do you specify the standard deviation within clusters?
cluster_std =
How do you randomly allocate clusters within the dataset?
shuffle = True
How do you import KMeans?
from sklearn.cluster import KMeans
How do you get the centroid locations?
model.cluster_centers_
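A minimal sketch (variable names are my own) fitting KMeans on make_blobs data and reading the centroids:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate toy data with three clusters
X, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

# Fit KMeans with the matching number of clusters
model = KMeans(n_clusters=3, random_state=0)
model.fit(X)

print(model.cluster_centers_)   # one row of coordinates per centroid
print(model.labels_)            # cluster assignment for each sample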
What is the notation for make blobs?
make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
For make blobs is it clusters and center_std or centers and cluster_std?
centers and cluster_std
For make blobs is it center or centers
centers
For KMeans what is the notation?
KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')
Which way around for Kmeans and make blobs?
KMeans = n_clusters
make blobs = centers
How do you import the scaler to normalise the data?
from sklearn.preprocessing import StandardScaler
How do you normalise the data?
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(df1)
How do you import DBSCAN?
from sklearn.cluster import DBSCAN
How do you iterate to find a sensible value of epsilon (without nearest neighbour)?
for epsilon in np.arange(0.1, 1, 0.1):
    model = DBSCAN(eps=epsilon)
    y = model.fit(df1)
    print(epsilon, np.unique(y.labels_)[-1])
Gives you the highest cluster label (one less than the number of clusters, since labels start at 0) for a given value of epsilon
Can you apply unique on y.labels_?
Not with .unique() - labels_ is a NumPy array, so use np.unique(y.labels_)
How do you get the unique labels for labels of dbscan?
np.unique(y.labels_)
Can you do fit_transform on DBSCAN?
No
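A minimal sketch, assuming a feature array X - DBSCAN has no fit_transform, so use fit (then read labels_) or fit_predict:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)   # cluster labels; noise points are labelled -1
# equivalently: db.fit(X); labels = db.labels_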
How do you bring in the iris data?
from sklearn.datasets import load_iris
What does standard scaler return?
A numpy array
How do you return a dataframe from standard scaler?
pd.DataFrame(data=scaler.fit_transform(df1), columns=df1.columns)
How do you import nearest neighbours?
from sklearn.neighbors import NearestNeighbors
How do you work out the optimal value of epsilon in DBSCAN?
Work out at which point the value of the nearest neighbour jumps the most dramatically:
neigh = NearestNeighbors(n_neighbors = 2)
nbrs = neigh.fit(df1)
distances, indices = nbrs.kneighbors(df1)
distances = np.sort(distances, axis = 0)
distances = distances[:,1]
plt.plot(distances)
What does nbrs.kneighbors return?
distances and indices
How do you sort a numpy array?
np.sort(distances,axis = 0)
Fit or fit transform for kneighbors?
fit
What is the import statement for decision trees?
from sklearn.tree import DecisionTreeRegressor
What is the import statement for mean absolute error?
from sklearn.metrics import mean_absolute_error
What is the import statement for train / test / split?
from sklearn.model_selection import train_test_split
How do you unpack train test split outputs?
train_X, val_X, train_y, val_y = train_test_split(X, y)
What are the steps to build a model?
- Import the data
- Build X and y
- Remove categorical data (if applicable)
- Split the data into test and train
- Remove null values or impute missing values
- Assign a modelling object to a variable
- Fit the model
- Predict values
- Calculate the MAE
- Process the test dataset (drop columns etc)
- Predict on the test dataset
(See the sketch below for a minimal example.)
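A minimal sketch of those steps - the file name, the 'target' column name and the model choice are placeholders, not from a real dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# 1. Import the data (path is a placeholder)
df = pd.read_csv('train.csv')

# 2. Build X and y ('target' is a placeholder column name)
y = df['target']
X = df.drop('target', axis=1)

# 3. Remove categorical data (if applicable)
X = X.select_dtypes(exclude=['object'])

# 4. Split the data into train and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# 5. Impute missing values (fit on the training data only)
imputer = SimpleImputer()
train_X = pd.DataFrame(imputer.fit_transform(train_X), columns=train_X.columns)
val_X = pd.DataFrame(imputer.transform(val_X), columns=val_X.columns)

# 6. Assign a modelling object, fit, predict, score
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)
preds = model.predict(val_X)
print(mean_absolute_error(val_y, preds))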
What are the options for dealing with categorical variables?
1) Drop
2) Ordinal encoding
3) One-Hot encoding
How do you select only number columns?
df1.select_dtypes(include='number')
How do you select only categorical columns?
df1.select_dtypes(include='object')
What’s the import for ordinal encoder?
from sklearn.preprocessing import OrdinalEncoder
What are the steps for using one hot encoding?
Import
Create the encoding variable
Create columns with the one hot encoder object
Reinsert the index
Select only the numerical columns
Add the one hot encoded columns to the numerical columns
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
What are the steps to use ordinal encoding?
Create a copy of the dataset
Build the encoder object into a variable
For the categorical columns, use the encoder
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
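score_dataset isn't defined in these notes; one plausible sketch, assuming it fits a model and returns the validation MAE (the choice of RandomForestRegressor and n_estimators is an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit on the training fold, score MAE on the validation fold
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)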
What are the steps for building a pipeline?
- Assign the numerical and categorical columns
- Build the numerical transformer
- Build the categorical transformer
- Build the preprocessor
- Build the model object
- Build an object that has the preprocessor and model as steps of a pipeline
- Fit and predict the model
(A sketch of the numerical transformer and model object follows below; the other pieces are shown in the next cards.)
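The numerical transformer and model object aren't shown in the cards below, so here is a minimal sketch (the imputation strategy and the choice of model are assumptions):
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Preprocessing for numerical data: impute missing values
numerical_transformer = SimpleImputer(strategy='constant')

# Model object (model type and n_estimators are assumptions)
model = RandomForestRegressor(n_estimators=100, random_state=0)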
What is the format of the preprocessor?
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
What is the format of the categorical processor?
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
How do you actually use the preprocessor and model?
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
How do you select categorical columns?
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
How do you select numerical columns?
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]
What is the import statement for cross validation?
from sklearn.model_selection import cross_val_score
How do you get the scores from cross validation?
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
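scores is a NumPy array with one MAE per fold (negated by the scoring, hence the -1 above); a typical follow-up is to average them:
print("MAE scores per fold:", scores)
print("Average MAE:", scores.mean())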
Do you need training and validation steps if you use cross validation?
No - cross-validation splits the data into folds internally, so a separate validation set isn't needed
Should you use cross validation for large datasets?
No - it is most useful for small datasets, where a single validation split would be noisy; for large datasets a single split is usually enough and much faster
How is the error calculated in cross validation?
Via the scoring argument, e.g. scoring='neg_mean_absolute_error' (the MAE is negated, so multiply by -1 to get positive errors)
What are the key parameters in XGBRegressor?
n_estimators and learning_rate
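A minimal sketch combining the two (the values are typical starting points, not tuned; X_train and y_train are assumed to exist as above):
from xgboost import XGBRegressor

# More trees plus a smaller learning rate; pair with early stopping to avoid overfitting
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=0)
my_model.fit(X_train, y_train)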
How can you stop XGBRegressor overfitting?
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
What is target leakage?
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.
What is train test contamination?
You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
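A minimal sketch of avoiding that contamination with StandardScaler (X_train and X_valid are assumed to exist as above): fit on the training data only, then transform both splits.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fitting uses training data only
X_valid_scaled = scaler.transform(X_valid)       # validation data is only transformed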