Scikit-learn Flashcards
How do you bring in data from scikit-learn?
from sklearn.datasets import load_iris
How do you generate random data?
from sklearn.datasets import make_blobs
What's the notation for make blobs?
X, y = make_blobs(
n_samples=150, n_features=2,
centers=3, cluster_std=0.5,
shuffle=True)
What do X and y represent in make blobs?
X is the data, y are the labels
What are the outputs of make_blobs?
X (the samples) and y (the labels); the centers are only returned as well if return_centers=True
How do you specify the standard deviation within clusters?
cluster_std =
How do you randomly allocate clusters within the dataset?
shuffle = True
How do you import KMeans?
from sklearn.cluster import KMeans
How do you get the centroid locations?
model.cluster_centers_
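A minimal sketch (variable names are my own) fitting KMeans on make_blobs data and reading the centroids:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate toy data with three clusters
X, y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

# Fit KMeans with the matching number of clusters
model = KMeans(n_clusters=3, random_state=0)
model.fit(X)

print(model.cluster_centers_)   # one row of coordinates per centroid
print(model.labels_)            # cluster assignment for each sample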
What is the notation for make blobs?
make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
For make blobs is it clusters and center_std or centers and cluster_std?
centers and cluster_std
For make blobs is it center or centers
centers
For KMeans what is the notation?
KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')
Which way around for Kmeans and make blobs?
KMeans = n_clusters
make blobs = centers
How do you import the scaler to normalise the data?
from sklearn.preprocessing import StandardScaler
How do you normalise the data?
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(df1)
How do you import DBSCAN?
from sklearn.cluster import DBSCAN
How do you iterate to find a sensible value of epsilon (without nearest neighbour)?
for epsilon in np.arange(0.1, 1, 0.1):
    model = DBSCAN(eps=epsilon)
    y = model.fit(df1)
    print(epsilon, np.unique(y.labels_)[-1])
Gives you the highest cluster label (one less than the number of clusters, since labels start at 0) for a given value of epsilon
Can you apply unique on y.labels_?
Not with .unique() - labels_ is a NumPy array, so use np.unique(y.labels_)
How do you get the unique labels for labels of dbscan?
np.unique(y.labels_)
Can you do fit_transform on DBSCAN?
No
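A minimal sketch, assuming a feature array X - DBSCAN has no fit_transform, so use fit (then read labels_) or fit_predict:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)   # cluster labels; noise points are labelled -1
# equivalently: db.fit(X); labels = db.labels_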
How do you bring in the iris data?
from sklearn.datasets import load_iris
What does standard scaler return?
A numpy array
How do you return a dataframe from standard scaler?
pd.DataFrame(data=scaler.fit_transform(df1), columns=df1.columns)
How do you import nearest neighbours?
from sklearn.neighbors import NearestNeighbors
How do you work out the optimal value of epsilon in DBSCAN?
Work out at which point the value of the nearest neighbour jumps the most dramatically:
neigh = NearestNeighbors(n_neighbors = 2)
nbrs = neigh.fit(df1)
distances, indices = nbrs.kneighbors(df1)
distances = np.sort(distances, axis = 0)
distances = distances[:,1]
plt.plot(distances)
What does nbrs.kneighbors return?
distances and indices
How do you sort a numpy array?
np.sort(distances,axis = 0)
Fit or fit transform for kneighbors?
fit
What is the import statement for decision trees?
from sklearn.tree import DecisionTreeRegressor
What is the import statement for mean absolute error?
from sklearn.metrics import mean_absolute_error
What is the import statement for train / test / split?
from sklearn.model_selection import train_test_split
How do you unpack train test split outputs?
train_X, val_X, train_y, val_y = train_test_split(X, y)
What are the steps to build a model?
- Import the data
- Build X and y
- Remove categorical data (if applicable)
- Split the data into test and train
- Remove null values or impute missing values
- Assign a modelling object to a variable
- Fit the model
- Predict values
- Calculate the MAE
- Process the test dataset (drop columns etc)
- Predict on the test dataset
(See the sketch below for a minimal example.)
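A minimal sketch of those steps - the file name, the 'target' column name and the model choice are placeholders, not from a real dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# 1. Import the data (path is a placeholder)
df = pd.read_csv('train.csv')

# 2. Build X and y ('target' is a placeholder column name)
y = df['target']
X = df.drop('target', axis=1)

# 3. Remove categorical data (if applicable)
X = X.select_dtypes(exclude=['object'])

# 4. Split the data into train and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# 5. Impute missing values (fit on the training data only)
imputer = SimpleImputer()
train_X = pd.DataFrame(imputer.fit_transform(train_X), columns=train_X.columns)
val_X = pd.DataFrame(imputer.transform(val_X), columns=val_X.columns)

# 6. Assign a modelling object, fit, predict, score
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)
preds = model.predict(val_X)
print(mean_absolute_error(val_y, preds))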
What are the options for dealing with categorical variables?
1) Drop
2) Ordinal encoding
3) One-Hot encoding
How do you select only number columns?
df1.select_dtypes(include='number')
How do you select only categorical columns?
df1.select_dtypes(include='object')
What’s the import for ordinal encoder?
from sklearn.preprocessing import OrdinalEncoder
What are the steps for using one hot encoding?
Import
Create the encoding variable
Create columns with the one hot encoder object
Reinsert the index
Select only the numerical columns
Add the one hot encoded columns to the numerical columns
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
What are the steps to use ordinal encoding?
Create a copy of the dataset
Build the encoder object into a variable
For the categorical columns, use the encoder
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
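score_dataset isn't defined in these notes; one plausible sketch, assuming it fits a model and returns the validation MAE (the choice of RandomForestRegressor and n_estimators is an assumption):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit on the training fold, score MAE on the validation fold
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)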
What are the steps for building a pipeline?
- Assign the numerical and categorical columns
- Build the numerical transformer
- Build the categorical transformer
- Build the preprocessor
- Build the model object
- Build an object that has the preprocessor and model as steps of a pipeline
- Fit and predict the model
(A sketch of the numerical transformer and model object follows below; the other pieces are shown in the next cards.)
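The numerical transformer and model object aren't shown in the cards below, so here is a minimal sketch (the imputation strategy and the choice of model are assumptions):
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

# Preprocessing for numerical data: impute missing values
numerical_transformer = SimpleImputer(strategy='constant')

# Model object (model type and n_estimators are assumptions)
model = RandomForestRegressor(n_estimators=100, random_state=0)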
What is the format of the preprocessor?
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
What is the format of the categorical processor?
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
How do you actually use the preprocessor and model?
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
How do you select categorical columns?
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
How do you select numerical columns?
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]
What is the import statement for cross validation?
from sklearn.model_selection import cross_val_score
How do you get the scores from cross validation?
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
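scores is a NumPy array with one MAE per fold (negated by the scoring, hence the -1 above); a typical follow-up is to average them:
print("MAE scores per fold:", scores)
print("Average MAE:", scores.mean())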
Do you need training and validation steps if you use cross validation?
No - cross-validation splits the data into folds internally, so a separate validation set isn't needed
Should you use cross validation for large datasets?
No - it is most useful for small datasets, where a single validation split would be noisy; for large datasets a single split is usually enough and much faster
How is the error calculated in cross validation?
Via the scoring argument, e.g. scoring='neg_mean_absolute_error' (the MAE is negated, so multiply by -1 to get positive errors)
What are the key parameters in XGBRegressor?
n_estimators and learning_rate
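A minimal sketch combining the two (the values are typical starting points, not tuned; X_train and y_train are assumed to exist as above):
from xgboost import XGBRegressor

# More trees plus a smaller learning rate; pair with early stopping to avoid overfitting
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=0)
my_model.fit(X_train, y_train)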
How can you stop XGBRegressor overfitting?
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
What is target leakage?
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.
What is train test contamination?
You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
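A minimal sketch of avoiding that contamination with StandardScaler (X_train and X_valid are assumed to exist as above): fit on the training data only, then transform both splits.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fitting uses training data only
X_valid_scaled = scaler.transform(X_valid)       # validation data is only transformed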