Scikit-learn Flashcards

1
Q

How do you bring in data from scikit-learn?

A

from sklearn.datasets import load_iris

2
Q

How do you generate random data?

A

from sklearn.datasets import make_blobs

3
Q

What's the notation for make_blobs?

A

X, y = make_blobs(
    n_samples=150, n_features=2,
    centers=3, cluster_std=0.5,
    shuffle=True)

4
Q

What do X and y represent in make_blobs?

A

X is the data (the samples); y is the labels.

5
Q

What are the outputs of make_blobs?

A

X (the samples), y (the labels), and optionally centers (only when return_centers=True)
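
A minimal sketch of unpacking all three outputs (note return_centers=True; the variable names are illustrative):

from sklearn.datasets import make_blobs

X, y, centers = make_blobs(n_samples=150, n_features=2,
                           centers=3, return_centers=True)
print(X.shape, y.shape, centers.shape)  # (150, 2) (150,) (3, 2)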

6
Q

How do you specify the standard deviation within clusters?

A

cluster_std =

7
Q

How do you randomly shuffle the samples within the dataset?

A

shuffle=True

8
Q

How do you import KMeans?

A

from sklearn.cluster import KMeans

9
Q

How do you get the centroid locations?

A

model.cluster_centers_
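
A minimal sketch in context (data from make_blobs as above; variable names are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=3)
model = KMeans(n_clusters=3)
model.fit(X)
print(model.cluster_centers_)  # one row of coordinates per centroid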

10
Q

What is the notation for make_blobs?

A

make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)

11
Q

For make_blobs, is it clusters and center_std, or centers and cluster_std?

A

centers and cluster_std

12
Q

For make_blobs, is it center or centers?

A

centers

13
Q

For KMeans, what is the notation?

A

KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')

14
Q

Which way around for KMeans and make_blobs?

A

KMeans = n_clusters

make_blobs = centers

15
Q

How do you import the class used to normalise the data?

A

from sklearn.preprocessing import StandardScaler

16
Q

How do you normalise the data?

A

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(df1)

17
Q

How do you import DBSCAN?

A

from sklearn.cluster import DBSCAN

18
Q

How do you iterate to find a sensible value of epsilon (without nearest neighbour)?

A

import numpy as np
from sklearn.cluster import DBSCAN

for epsilon in np.arange(0.1, 1, 0.1):
    model = DBSCAN(eps=epsilon)
    y = model.fit(df1)
    print(epsilon, np.unique(y.labels_)[-1])

This prints the highest cluster label for each value of epsilon (labels start at 0 and -1 marks noise, so it tracks the number of clusters found).
19
Q

Can you call .unique() on y.labels_?

A

No – labels_ is a NumPy array, which has no .unique() method; use np.unique(y.labels_) instead.

20
Q

How do you get the unique labels from DBSCAN?

A

np.unique(y.labels_)

21
Q

Can you do fit_transform on DBSCAN?

A

No – DBSCAN has no transform method; use fit or fit_predict.
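
A minimal sketch using fit_predict instead (df1 is an illustrative DataFrame of numeric features):

from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.3)
labels = model.fit_predict(df1)  # one cluster label per row; -1 marks noise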

22
Q

How do you bring in the iris data?

A

from sklearn.datasets import load_iris

23
Q

What does StandardScaler return?

A

A numpy array

25
Q

How do you return a dataframe from standard scaler?

A

pd.DataFrame(data=scaler.fit_transform(df1), columns=df1.columns)

26
Q

How do you import nearest neighbours?

A

from sklearn.neighbors import NearestNeighbors

27
Q

How do you work out the optimal value of epsilon in DBSCAN?

A

Work out the point at which the distance to the nearest neighbour jumps most dramatically:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df1)
distances, indices = nbrs.kneighbors(df1)

# Column 0 is each point's zero distance to itself, so keep column 1,
# the distance to the nearest neighbour, sorted in ascending order
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.plot(distances)

28
Q

What does nbrs.kneighbors return?

A

distances and indices

29
Q

How do you sort a numpy array?

A

np.sort(distances, axis=0)

30
Q

Fit or fit_transform for NearestNeighbors?

A

fit

31
Q

What is the import statement for decision trees?

A

from sklearn.tree import DecisionTreeRegressor

32
Q

What is the import statement for mean absolute error?

A

from sklearn.metrics import mean_absolute_error

33
Q

What is the import statement for train / test / split?

A

from sklearn.model_selection import train_test_split

34
Q

How do you unpack train test split outputs?

A

train_X, val_X, train_y, val_y = train_test_split(X, y)

35
Q

What are the steps to build a model?

A
  1. Import the data
  2. Build X and y
  3. Remove categorical data (if applicable)
  4. Split the data into test and train
  5. Remove null values or impute missing values
  6. Assign a modelling object to a variable
  7. Fit the model
  8. Predict values
  9. Calculate the MAE
  10. Process the test dataset (drop columns etc.)
  11. Predict on the test dataset
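
A minimal sketch of these steps (the file name, column names, and model choice are illustrative):

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')                             # 1. import the data
y = df['target']                                          # 2. build X and y
X = df.drop('target', axis=1).select_dtypes('number')     # 3. drop categorical data
train_X, val_X, train_y, val_y = train_test_split(X, y)   # 4. split into train and test
fill_values = train_X.mean()                              # 5. impute missing values
train_X = train_X.fillna(fill_values)                     #    (using training statistics only)
val_X = val_X.fillna(fill_values)
model = DecisionTreeRegressor()                           # 6. assign the modelling object
model.fit(train_X, train_y)                               # 7. fit the model
preds = model.predict(val_X)                              # 8. predict values
print(mean_absolute_error(val_y, preds))                  # 9. calculate the MAE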
37
Q

What are the options for dealing with categorical variables?

A

1) Drop
2) Ordinal encoding
3) One-Hot encoding
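
For option 1, a minimal sketch (X_train is an illustrative DataFrame):

# Drop the object-dtype (categorical) columns entirely
drop_X_train = X_train.select_dtypes(exclude=['object'])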

38
Q

How do you select only numeric columns?

A

df1.select_dtypes('number')

39
Q

How do you select only categorical columns?

A

df1.select_dtypes('object')

40
Q

What’s the import for ordinal encoder?

A

from sklearn.preprocessing import OrdinalEncoder

41
Q

What are the steps for using one-hot encoding?

A

Import
Create the encoding variable
Create columns with the one hot encoder object
Reinsert the index
Select only the numerical columns
Add the one hot encoded columns to the numerical columns

from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
42
Q

What are the steps to use ordinal encoding?

A

Create a copy of the dataset
Build the encoder object into a variable
For the categorical columns, use the encoder

from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

# score_dataset is a helper (defined elsewhere) that fits a model and returns the MAE
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

43
Q

What are the steps for building a pipeline?

A
  1. Assign the numerical and categorical columns
  2. Build the numerical transformer
  3. Build the categorical transformer
  4. Build the preprocessor
  5. Build the model object
  6. Build an object that has the preprocessor and model as steps of a pipeline
  7. Fit and predict the model
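
A minimal sketch tying these steps together (numerical_cols, categorical_cols, and the data splits are assumed to exist already; the model choice is illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# 2. numerical transformer: impute missing values
numerical_transformer = SimpleImputer(strategy='constant')
# 3. categorical transformer: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# 4. preprocessor: route each column list to its transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])
# 5.-7. model, pipeline, fit and predict
model = DecisionTreeRegressor()
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)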
44
Q

What is the format of the preprocessor?

A

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

45
Q

What is the format of the categorical transformer?

A

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

46
Q

How do you actually use the preprocessor and model?

A
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
47
Q

How do you select categorical columns?

A

categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]

48
Q

How do you select numerical columns?

A

numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]

49
Q

What is the import statement for cross validation?

A

from sklearn.model_selection import cross_val_score

50
Q

How do you get the scores from cross validation?

A

scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

51
Q

Do you need training and validation steps if you use cross validation?

A

No – cross_val_score splits the data into folds internally, so a separate validation split is unnecessary.

52
Q

Should you use cross validation for large datasets?

A

No – it is best suited to small datasets, where the extra computation is affordable; for large datasets a single validation split is usually sufficient.

53
Q

How is the error calculated in cross validation?

A

scoring='neg_mean_absolute_error'

54
Q

What are the key parameters in XGBRegressor?

A

n_estimators and learning_rate
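
A minimal sketch (the parameter values are illustrative starting points to tune):

from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=500, learning_rate=0.05)
my_model.fit(X_train, y_train)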

55
Q

How can you stop XGBRegressor overfitting?

A

my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

56
Q

What is target leakage?

A

Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.

57
Q

What is train test contamination?

A

Train-test contamination: you can corrupt the validation process in subtle ways if the validation data affects the preprocessing behaviour. If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
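
A minimal sketch of the safe pattern with StandardScaler (variable names are illustrative): fit the preprocessing on the training split only, then apply it to the validation split.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaling statistics on training data only
X_valid_scaled = scaler.transform(X_valid)      # reuse those statistics; no refitting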