351 - 400 Flashcards
git pull
Download changes from a remote repo to your local machine, the opposite of push. The pull command merges the fetched commits automatically, without letting you review them first.
git pull origin master
git pull origin benchmark_name
sklearn.mixture.GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
Unsupervised Learning → Clustering — A probabilistic model for modeling normally distributed clusters within a dataset.
👉 use cases
✅ Customer segmentation
✅ Recommendation systems
👉 Advantages
1. Computes a probability for an observation belonging to a cluster
2. Can identify overlapping clusters
3. More accurate results compared to K-means
👉 Disadvantages
1. Requires complex tuning
2. Requires setting the number of expected mixture components or clusters
import numpy as np
from sklearn.mixture import GaussianMixture
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
gm.means_ 👉 array([[10., 2.], [ 1., 2.]])
gm.predict([[0, 0], [12, 3]]) 👉 array([1, 0])
apyori.apriori()
Unsupervised Learning → Association — Rule-based approach that identifies the most frequent itemset in a given dataset where prior knowledge of frequent itemset properties is used.
👉 use cases
1. Product placements
2. Recommendation engines
3. Promotion optimization
👉 Advantages
1. Results are intuitive and interpretable
2. Exhaustive approach as it finds all rules based on confidence and support
👉 Disadvantages
1. Generates many uninteresting itemsets
2. Computationally and memory intensive.
3. Results in many overlapping item sets
from apyori import apriori
transactions = [['beer', 'nuts'], ['beer', 'cheese']]
results = list(apriori(transactions))
sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)
Unsupervised Learning → Clustering → Hierarchical Clustering — A “bottom-up” approach where each data point is treated as its own cluster—and then the closest two clusters are merged together iteratively.
👉 use cases
✅ Fraud detection
✅ Document clustering based on similarity
👉 Advantages
1. There is no need to specify the number of clusters
2. The resulting dendrogram is informative
👉 Disadvantages
1. Doesn’t always result in the best clustering
2. Not suitable for large datasets due to high complexity
import numpy as np
from sklearn.cluster import AgglomerativeClustering
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
clustering = AgglomerativeClustering().fit(X)
clustering.labels_ 👉 array([1, 1, 1, 0, 0, 0])
xgboost.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
Supervised Learning → Tree-Based Models — Gradient Boosting algorithm that is efficient & flexible.
👉 use cases
✅ Churn prediction
✅ Claims processing in insurance
👉 Advantages
✅ Provides accurate results
✅ Captures nonlinear relationships
👉 Disadvantages
✅ Hyperparameter tuning can be complex
✅ It does not perform well on sparse datasets
import xgboost as xgb
xgbr = xgb.XGBRegressor(verbosity=0)
xgbr.fit(xtrain, ytrain)  # xtrain, ytrain: training features and targets
score = xgbr.score(xtrain, ytrain)
print("Training score: ", score) 👉 Training score: 0.9738225090795732
sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, normalize='deprecated', copy_X=True, max_iter=None, tol=0.001, solver='auto', positive=False, random_state=None)
Supervised Learning → Linear Model — It penalizes features that have low predictive power by shrinking their coefficients toward zero (L2 regularization).
Can be used for classification or regression.
👉 use cases
✅ Predictive maintenance for automobiles
✅ Sales revenue prediction
👉 Advantages
✅ Less prone to overfitting
✅ Best suited where data suffer from multicollinearity
✅ Explainable & interpretable
👉 Disadvantages
✅ All the predictors are kept in the final model
✅ Doesn’t perform feature selection
import numpy as np
from sklearn.linear_model import Ridge
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
clf = Ridge(alpha=1.0)
clf.fit(X, y)
sklearn.linear_model.ElasticNet(alpha=1.0, *, l1_ratio=0.5, fit_intercept=True, normalize='deprecated', precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
Linear regression with combined L1 and L2 priors as the regularizer.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
X, y = make_regression(n_features=2, random_state=0)
regr = ElasticNet(random_state=0)
regr.fit(X, y)
print(regr.coef_) 👉 [18.83816048 64.55968825]
print(regr.intercept_) 👉 1.451...
print(regr.predict([[0, 0]])) 👉 [1.451...]
Unsupervised Learning
Associated with learning without supervision or training. In unsupervised learning, the algorithms are trained on data that is neither labeled nor classified; the agent needs to learn from patterns without corresponding output values (a sketch follows the list below).
👉 Clustering
✅ K-Means
✅ Hierarchical Clustering
✅ Gaussian Mixture Models
👉 Association
✅ Apriori algorithm
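A minimal sketch of the idea, assuming scikit-learn's KMeans (one of the clustering algorithms listed above); the tiny dataset and cluster count are made up for illustration:
from sklearn.cluster import KMeans
import numpy as np

# Unlabeled data: no target values are given to the algorithm
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# The model discovers two groups on its own
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned cluster centers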
lightgbm.LGBMRegressor and lightgbm.LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=None, importance_type='split', **kwargs)
Supervised Learning → Tree-Based Models — A gradient boosting framework that is designed to be more efficient than other implementations.
👉 use cases
✅ Predicting flight time for an airline
✅ Predicting cholesterol levels based on health data
👉 Advantages
✅ Can handle large amounts of data
✅ Computational efficiency & fast training speed
✅ Low memory usage
👉 Disadvantages
✅ Can overfit due to leaf-wise splitting and high sensitivity
✅ Hyperparameter tuning can be complex
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(objective='multiclass', random_state=5)
lgbm.fit(X, y)  # X, y: training features and labels
y_pred = lgbm.predict(X_test)  # X_test: held-out features
Supervised learning
Type of machine learning in which the machine learns from known, labeled datasets (sets of training examples) and then predicts the output for new data; a sketch follows the list below.
👉 Linear Regression
👉 Logistic Regression
👉 Ridge Regression
👉 Lasso Regression
👉 Decision Tree
👉 Random Forests
👉 Gradient Boosting Regression
👉 XGBoost
👉 LightGBM Regressor
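A minimal sketch of the supervised setup, using LinearRegression from the list above; the toy data (y = 2x) is made up for illustration:
from sklearn.linear_model import LinearRegression
import numpy as np

# Known inputs X and known outputs y — the "supervision"
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))  # 👉 [10.]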
matplotlib.pyplot.boxplot(x, notch=None, sym=None, vert=None, whis=None, positions=None, widths=None, patch_artist=None, bootstrap=None, usermedians=None, conf_intervals=None, meanline=None, showmeans=None, showcaps=None, showbox=None, showfliers=None, boxprops=None, labels=None, flierprops=None, medianprops=None, meanprops=None, capprops=None, whiskerprops=None, manage_ticks=True, autorange=False, zorder=None, *, data=None)
Draw a box and whisker plot. A boxplot is designed to show a distribution, but the chart is unique because, in addition to the distribution, it shows the median, quartiles, minimum, maximum, and outliers.
👉 Outliers are values that stand out very strongly from the rest of your data.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize=(10, 7))
plt.boxplot(data)
plt.show()
matplotlib.pyplot.grid(visible=None, which='major', axis='both', **kwargs)
Add grid lines to the plot.
Display grid lines for both axes:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.title("Sports Watch Data")
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
Display grid lines only for the x-axis (same data, grid restricted to one axis):
plt.plot(x, y)
plt.grid(axis='x')
plt.show()
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Draw points (markers) in a diagram. By default, the plot() function draws a line from point to point.
👉 Parameter 1 is an array containing the points on the x-axis.
👉 Parameter 2 is an array containing the points on the y-axis.
import numpy as np
import matplotlib.pyplot as plt
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
Draw two points in the diagram, one at position (1, 3) and one at position (8, 10):
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints, 'o')
plt.show()
Plotting without x-points:
ypoints = np.array([3, 8, 1, 10, 5, 7])
plt.plot(ypoints)
plt.show()
matplotlib.pyplot.annotate(text, xy, *args, **kwargs)
Annotate the point xy with text. In short, it places an arrow with a text label at the desired point on the plot.
import numpy as np
import matplotlib.pyplot as plt
fig, geeeks = plt.subplots()
t = np.arange(0.0, 5.0, 0.001)
s = np.cos(3 * np.pi * t)
line = geeeks.plot(t, s, lw=2)
geeeks.annotate('Local Max', xy=(3.3, 1), xytext=(3, 1.8), arrowprops=dict(facecolor='green', shrink=0.05))
geeeks.set_ylim(-2, 2)
plt.show()
matplotlib.pyplot.bar and matplotlib.pyplot.barh(x, height, width=0.8, bottom=None, *, align='center', data=None, **kwargs)
Make a bar plot: bars show values for categories. bar() draws vertical bars; barh() draws horizontal bars.
x = np.array(["A", "B", "C", "D"]) y = np.array([3, 8, 1, 10]) plt. bar(x,y) plt. show()
x = np.array(["A", "B", "C", "D"]) y = np.array([3, 8, 1, 10]) plt. barh(x, y) plt. show()
x = np.array(["A", "B", "C", "D"]) y = np.array([3, 8, 1, 10]) plt. bar(x, y, color = "hotpink") plt. show()
x = np.array(["A", "B", "C", "D"]) y = np.array([3, 8, 1, 10]) plt. bar(x, y, width = 0.1) plt. show()
sklearn.inspection.permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)
The algorithm evaluates the importance of each feature in predicting the target. If the score drops when a feature is shuffled, it is considered important.
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression().fit(X, y)  # Fit model (X, y: features and labels)
permutation_score = permutation_importance(log_model, X, y, n_repeats=10)
Holdout Method
Split the data into a training set and a test set (e.g., 70% | 30%); see the sketch below.
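A minimal sketch of the holdout split, assuming scikit-learn's train_test_split (a standard helper, though not shown elsewhere in these cards); the array is made up for illustration:
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # 👉 (7, 2) (3, 2)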
Euclidean distance
The straight-line distance between two points: the Euclidean distance between two points in Euclidean space is defined as the length of the line segment connecting them. d = √[(x₂ - x₁)² + (y₂ - y₁)²]
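The same formula as a minimal NumPy sketch; the two points are made up for illustration:
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# √[(x2 - x1)² + (y2 - y1)²]
d = np.sqrt(np.sum((p2 - p1) ** 2))
print(d)  # 👉 5.0 (a 3-4-5 triangle)
np.linalg.norm(p2 - p1) computes the same value.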
F-Test
Will tell you if a group of variables is jointly significant.
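A minimal sketch of the overall regression F-test, assuming statsmodels (not used elsewhere in these cards, so treat the library choice as an assumption); the data is random, for illustration only:
import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
X = rng.randn(50, 3)             # three candidate predictors
y = 2 * X[:, 0] + rng.randn(50)  # only the first predictor actually matters

model = sm.OLS(y, sm.add_constant(X)).fit()
# The F-statistic and its p-value test whether the predictors are jointly significant
print(model.fvalue, model.f_pvalue)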