1 - 50 Flashcards
Data Preprocessing
Data Exploration =>
Cleaning (Duplicates, Missing data, Outliers, Scaling, Encoding, Discretizing, Creating new features) =>
Feature Selection (Feature Correlation, Modelling, Feature selection, Remodel).
1) More than 30% of values missing: consider dropping the feature (or the row).
2) Less than 30% of values missing: impute a value that makes sense, e.g. median, mean, or mode.
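A minimal sketch of both rules, assuming a toy pandas DataFrame with hypothetical age and city columns (the 30% threshold is the rule of thumb above, not a library setting):
import pandas as pd

df = pd.DataFrame({"age": [25, None, 30, 35, 40],           # 20% missing
                   "city": ["A", None, None, "B", None]})   # 60% missing
missing_ratio = df.isna().mean()                                # share of missing values per column
df = df.drop(columns=missing_ratio[missing_ratio > 0.3].index)  # > 30% missing → drop the column
df["age"] = df["age"].fillna(df["age"].median())                # < 30% missing → impute with the median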
sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
Supervised Learning → Linear Model. A method for building a linear classifier that estimates the probability that an object belongs to each class. Yes or no, True or False, cold or warm: a class label rather than an exact number.
👉 Advantages
💡 Interpretable and explainable
💡 Less prone to overfitting when using regularization
💡 Applicable for multi-class predictions
👉 Disadvantages
💡 Assumes linearity between inputs and outputs
💡 Can overfit with small, high-dimensional data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
model.predict(X)
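Since the description mentions class probabilities, a short usage note; model is the fitted estimator from the snippet above and X_new is a hypothetical matrix of new samples:
model.predict_proba(X_new)   # probability of each class for every row of X_new
model.predict(X_new)         # hard class labels derived from those probabilities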
sklearn.metrics.mean_squared_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', squared=True)
Used to evaluate a model. Mean Squared Error (MSE). The measure of how close a fitted line is to the data points.
from sklearn.metrics import mean_squared_error

Y_true = [1, 1, 2, 2, 4]
Y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]
mean_squared_error(Y_true, Y_pred) 👉 Output: 0.21606
sklearn.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None)
Neighbors-based learning.
from sklearn.neighbors import NearestNeighbors

samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors([[0, 0, 1.3]], 2, return_distance=False)
Provides the functionality for unsupervised and supervised neighbors-based learning methods.
sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
C-Support Vector Classification. Multi-class classification is handled with a one-vs-one scheme: a separate classifier is trained on the data of every pair of classes.
from sklearn import svm

X = [[0], [1], [2], [3]]
y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, y)
dec = clf.decision_function([[1]])
dec.shape[1] 👉 6
sklearn.model_selection.cross_validate()
Compared with train_test_split, this gives a more reliable estimate of the classifier's quality.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

model = LinearRegression()
cv_results = cross_validate(model, X, y, cv=5, scoring=['max_error', 'r2', 'neg_mean_absolute_error', 'neg_mean_squared_error'])
sklearn.model_selection.KFold(n_splits=5, *, shuffle=False, random_state=None)
Splits the whole dataset into K equally sized "folds"; each fold is used once for testing the model and K-1 times for training it.
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
cv = KFold(n_splits=3, shuffle=True, random_state=0)  # random_state requires shuffle=True
for train_index, test_index in cv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Randomly splits the data into a training set and a test set.
from sklearn.model_selection import train_test_split

data = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10]]
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=0)
→ x_train is: [[10, 10], [2, 2], [7, 7], [8, 8], [4, 4], [1, 1], [6, 6]]
→ y_train is: [10, 2, 7, 8, 4, 1, 6]
→ x_test is: [[3, 3], [9, 9], [5, 5]]
→ y_test is: [3, 9, 5]
pandas.DataFrame.agg(func=None, axis=0, *args, **kwargs)
Applies a function, or a list of functions / function names, along one of the DataFrame's axes; the default is axis=0, the index (row) axis.
# binarize the hypothetical "quality rating" column: ratings below 6 → 0, otherwise → 1
new_df["quality rating"].agg(lambda x: 0 if x < 6 else 1)
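A small sketch of the DataFrame-level usage that the description refers to; the column names a and b are made up:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
df.agg(["min", "max"])                             # list of function names → one result row per function, per column
df.agg(lambda col: col.max() - col.min(), axis=0)  # a single function applied down each column (axis=0)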
sklearn.preprocessing.OrdinalEncoder(*, categories='auto', dtype=np.float64, handle_unknown='error', unknown_value=None, encoded_missing_value=nan)
Encodes categorical features as an integer array.
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.transform([['Female', 3], ['Male', 1]]) 👉 array([[0., 2.], [1., 0.]])
sklearn.metrics.r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average', force_finite=True)
Used to evaluate a model.
👉 Use R^2 when:
….. The unit of the error is not important
….. You want to compare different datasets.
….. It can take a negative value (when the model is worse than simply predicting the mean).
from sklearn.metrics import r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_score(y_true, y_pred) 👉 0.9486
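To back up the note above that R^2 can be negative, a tiny made-up case where the predictions are worse than simply predicting the mean of y_true:
from sklearn.metrics import r2_score

y_true = [1, 2, 3]
y_pred = [3, 1, 2]   # worse than always predicting the mean (2.0)
r2_score(y_true, y_pred) 👉 -2.0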
sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)
Evaluate the accuracy of classification.
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
print(confusion_matrix(y_true, y_pred))
👉 [[2 0 0]
    [0 0 1]
    [1 0 2]]
sklearn.metrics.mean_absolute_error(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
Used to evaluate a model.
Mean Absolute Error (MAE):
…. 👉 Less sensitive to outliers
…. 👉 Use when you do not want to over-penalize outliers
…. 👉 Use MAE when all errors, large or small, have equal importance.
from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred) 👉 0.5
sklearn.metrics.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Used to evaluate a model.
Measures the model's ability to avoid false alarms (False Positives): of everything predicted as positive, how much is actually positive.
from sklearn.metrics import precision_score

y_true = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(round(precision_score(y_true, y_pred), 2)) 👉 0.67
sklearn.metrics.max_error(y_true, y_pred)
Max Error (ME): the biggest error the model makes when predicting.
from sklearn.metrics import max_error

y_true = [3, 2, 7, 1]
y_pred = [4, 2, 7, 1]
max_error(y_true, y_pred) 👉 1
sklearn.metrics.roc_curve(y_true, y_score, *, pos_label=None, sample_weight=None, drop_intermediate=True)
Computes the points of the ROC curve, which is used to assess the quality of a binary classifier: it shows the true positive rate against the false positive rate at different decision thresholds.
from sklearn.metrics import roc_curve

y = [1, 1, 2, 2]
pred = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y, pred, pos_label=2)
👉 fpr: [0.  0.5 0.5 1. ]
👉 tpr: [0.5 0.5 1.  1. ]
👉 thresholds: [0.8 0.4 0.35 0.1]
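A minimal plotting sketch (assuming matplotlib is available), reusing the fpr and tpr arrays from the snippet above:
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance level")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()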
sklearn.model_selection.learning_curve(estimator, X, y, *, groups=None, train_sizes=array([0.1, 0.33, 0.55, 0.78, 1.]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=None, pre_dispatch='all', verbose=0, shuffle=False, random_state=None, error_score=nan, return_times=False, fit_params=None)
Can help to find the right amount of training data to fit the model with a good bias-variance trade-off.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

train_sizes = [25, 50, 75, 100, 250, 500, 750, 1000, 1150]
train_sizes, train_scores, test_scores = learning_curve(estimator=LinearRegression(), X=X, y=y, train_sizes=train_sizes, cv=5)
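A hedged sketch of how the returned scores are usually turned into a learning-curve plot (matplotlib assumed; averaging over the CV folds is the common convention, not something the card specifies):
import matplotlib.pyplot as plt

plt.plot(train_sizes, train_scores.mean(axis=1), label="train score")       # mean over the 5 CV folds
plt.plot(train_sizes, test_scores.mean(axis=1), label="validation score")
plt.xlabel("Training set size")
plt.ylabel("Score")
plt.legend()
plt.show()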
sklearn.linear_model.SGDRegressor(loss='squared_error', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)
Performs linear regression using (stochastic) gradient descent.
loss = 👉 squared_error, 👉 huber, 👉 epsilon_insensitive, 👉 squared_epsilon_insensitive
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate

sgd = SGDRegressor(loss='squared_error')
sgd_model_cv = cross_validate(sgd, X, y, cv=10, scoring=['r2', 'max_error'])
r2 = sgd_model_cv['test_r2'].mean()
max_error = abs(sgd_model_cv['test_max_error']).max()
sklearn.metrics.recall_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
Used to evaluate a model. Computes the recall: intuitively, the classifier's ability to find all the positive samples. It only looks at the positive class (guessing the 1s).
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(round(recall_score(y_true, y_pred), 2)) 👉 0.8
y_true = [0, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1]
print(round(recall_score(y_true, y_pred), 2)) 👉 1.0
sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
KNN regression: the target is predicted by local interpolation of the targets of the nearest neighbors in the training set.
from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)
print(neigh.predict([[1.5]])) 👉 [0.5]