251 - 301 Flashcards
pandas.DataFrame.plot(*args, **kwargs)
Method to make plots of a Series or DataFrame.
The kind of plot to produce:
- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'density' : same as 'kde'
- 'area' : area plot
- 'pie' : pie plot
- 'scatter' : scatter plot (DataFrame only)
- 'hexbin' : hexbin plot (DataFrame only)
df.plot()
df_pop_ceb.plot(kind="scatter", x="Year", y="Total Urban Population")
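A quick sketch of a few of the kinds listed above (df here is a small made-up frame, not the df_pop_ceb from the example):

import pandas as pd

df = pd.DataFrame({"Year": [2000, 2005, 2010],
                   "Total Urban Population": [2.8, 3.2, 3.6]})
df.plot(kind="line", x="Year", y="Total Urban Population")
df.plot(kind="bar", x="Year", y="Total Urban Population")
df.plot(kind="hist", y="Total Urban Population")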
Underfitting
Underfitting is the inverse of overfitting: the statistical model or machine learning algorithm is too simplistic to accurately represent the data. A sign of underfitting is high bias and low variance in the current model or algorithm (the inverse of overfitting: low bias and high variance).
Overfitting
The phenomenon in which the fitted model explains the examples from the training set well but performs relatively poorly on examples that did not take part in training (examples from the test set). In other words, the model memorizes a huge number of possible examples instead of learning to notice their characteristic features.
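A minimal sketch illustrating both cards (synthetic noisy cubic data; the degree choices 1 and 15 are illustrative): the degree-1 model underfits (high error on both sets), while the degree-15 model overfits (low train error, higher test error):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X.ravel() ** 3 + rng.normal(scale=3, size=60)        # noisy cubic
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):                                   # underfit vs. overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # train error
          mean_squared_error(y_te, model.predict(X_te)))  # test error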
string.punctuation
A string containing all ASCII punctuation characters.
import string

result = string.punctuation
print(result)
👉 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
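A common use (a small sketch; the sample string is made up): stripping punctuation from a string with str.translate:

import string

text = "Hello, world! How's it going?"
print(text.translate(str.maketrans('', '', string.punctuation)))
👉 Hello world Hows it going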
nltk.tokenize.word_tokenize()
Divides a string into a list of substrings. For example, tokenizers can be used to find the words and punctuation in a string.
from nltk.tokenize import word_tokenize
s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
word_tokenize(s)
👉 ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
nltk.stem.WordNetLemmatizer()
The process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings context to the words.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))            👉 rocks : rock
print("corpora :", lemmatizer.lemmatize("corpora"))        👉 corpora : corpus
print("better :", lemmatizer.lemmatize("better", pos="a")) 👉 better : good
nltk.corpus.stopwords
Commonly used words (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore.
import nltk
from nltk.corpus import stopwords

print(stopwords.words('english'))
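A typical use (a sketch, combining with word_tokenize from the card above; the sample sentence is made up): filtering stopwords out of a token list:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is a sample sentence about the weather")
print([w for w in tokens if w.lower() not in stop_words])
👉 ['sample', 'sentence', 'weather']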
pmdarima.arima.ndiffs(x, alpha=0.05, test='kpss', max_d=2, **kwargs)
Estimate ARIMA differencing term, d. Perform a test of stationarity for different levels of d to estimate the number of differences required to make a given time series stationary.
from pmdarima.arima.utils import ndiffs

ndiffs(df['linearized'])
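Since df is not defined on this card, a self-contained sketch (a random walk, which normally needs one difference to become stationary):

import numpy as np
from pmdarima.arima.utils import ndiffs

rng = np.random.RandomState(0)
walk = np.cumsum(rng.normal(size=200))  # random walk: non-stationary
print(ndiffs(walk))                     # typically 👉 1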
sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
Naive Bayes classifier for multinomial models. The multinomial Naive Bayes classifier is suitable for classification with discrete features (for example, word counts for text classification).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict(X[2:3]))  👉 [3]
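A sketch closer to the text-classification use the description mentions (a tiny made-up corpus; word counts come from CountVectorizer):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money", "meeting at noon",
        "win free money now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # word-count features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money now"])))  # likely 👉 ['spam']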
sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)
Latent Dirichlet Allocation with online variational Bayes algorithm.
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)
lda.transform(X[-2:])
👉 array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
         [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])
mlxtend.plotting.plot_decision_regions(X, y, clf=svm, legend=2)
Visualize the decision regions of a classifier. A function for plotting decision regions of classifiers in 1 or 2 dimensions.
from sklearn import datasets
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target
svm = SVC(C=0.5, kernel='linear')
svm.fit(X, y)

# Plotting decision regions
plot_decision_regions(X, y, clf=svm, legend=2)
tf.keras.utils.to_categorical(y, num_classes=None, dtype='float32')
Converts a class vector (integers) to a binary class matrix (one-hot encoding) as a NumPy array.
import tensorflow as tf

a = tf.keras.utils.to_categorical([0, 1, 2, 3], num_classes=4)
a = tf.constant(a, shape=[4, 4])
print(a)
tf.keras.activations
Built-in activation functions:
- relu() - Applies the rectified linear unit activation function.
- tanh() - Hyperbolic tangent activation function.
- sigmoid() - For CLASSIFICATION TASKS. (CLASSIFICATION WITH 2 CLASSES)
- softmax() - For CLASSIFICATION TASKS. Converts numbers into probabilities that sum to 1. (CLASSIFICATION WITH 14 CLASSES)
- linear() - For REGRESSION TASKS. (REGRESSION WITH 1 or 16 OUTPUTS)
- deserialize() - Returns activation function given a string identifier.
- elu() - Exponential Linear Unit.
- exponential() - Exponential activation function.
- gelu() - Applies the Gaussian error linear unit (GELU) activation function.
- get() - Returns function.
- hard_sigmoid() - Hard sigmoid activation function.
- selu() - Scaled Exponential Linear Unit (SELU).
- serialize() - Returns the string identifier of an activation function.
- softplus() - Softplus activation function, softplus(x) = log(exp(x) + 1).
- softsign() - Softsign activation function, softsign(x) = x / (abs(x) + 1).
- swish() - Swish activation function, swish(x) = x * sigmoid(x).
👉 Regression with 1 output:
model = Sequential()
model.add(layers.Dense(10, activation='relu', input_dim=100))
model.add(...)
model.add(layers.Dense(1, activation='linear'))

👉 Classification with 8 classes:
model = Sequential()
model.add(layers.Dense(10, activation='relu', input_dim=100))
model.add(...)
model.add(layers.Dense(8, activation='softmax'))
tf.keras.metrics.AUC(num_thresholds=200, curve='ROC', summation_method='interpolation', name=None, dtype=None, thresholds=None, multi_label=False, num_labels=None, label_weights=None, from_logits=False)
Approximates the AUC (Area under the curve) of the ROC or PR curves.
m = tf.keras.metrics.AUC(num_thresholds=3)
m.update_state([0, 0, 1, 1], [0, 0.5, 0.3, 0.9])
# tp = [2, 1, 0], fp = [2, 0, 0], fn = [0, 1, 2], tn = [0, 2, 2]
# tp_rate = recall = [1, 0.5, 0], fp_rate = [1, 0, 0]
# auc = ((((1+0.5)/2)*(1-0)) + (((0.5+0)/2)*(0-0))) = 0.75
m.result().numpy()  👉 0.75
tf.math.square(x, name=None)
Computes square of x element-wise.
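A quick sketch:

import tensorflow as tf

tf.math.square(tf.constant([-2., 0., 3.]))  👉 [4., 0., 9.]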
tf.math.reduce_mean(input_tensor, axis=None, keepdims=False, name=None)
Computes the mean of elements across dimensions of a tensor.
x = tf.constant([[1., 1.], [2., 2.]])
tf.reduce_mean(x)     👉 1.5
tf.reduce_mean(x, 0)  👉 [1.5, 1.5]
tf.reduce_mean(x, 1)  👉 [1., 2.]
sklearn.preprocessing.Normalizer(norm='l2', *, copy=True)
Normalize samples individually to unit norm. Each sample (i.e., each row of the data matrix) with at least one non-zero component is rescaled independently of the other samples so that its norm (l1, l2, or inf) equals one.
from sklearn.preprocessing import Normalizer

X = [[4, 1, 2, 2],
     [1, 3, 9, 3],
     [5, 7, 5, 1]]
transformer = Normalizer().fit(X)
transformer.transform(X)
👉 array([[0.8, 0.2, 0.4, 0.4],
         [0.1, 0.3, 0.9, 0.3],
         [0.5, 0.7, 0.5, 0.1]])
tf.keras.layers.Normalization(axis=-1, mean=None, variance=None, **kwargs)
This layer shifts and scales its inputs into a distribution centered around 0 with standard deviation 1.
import numpy as np
from tensorflow.keras.layers import Normalization

adapt_data = np.array([[1.], [2.], [3.], [4.], [5.]], dtype=np.float32)
input_data = np.array([[1.], [2.], [3.]], np.float32)
layer = Normalization()
layer.adapt(adapt_data)
layer(input_data)
tf.keras.regularizers
- L1: A regularizer that applies an L1 regularization penalty.
- L1L2: A regularizer that applies both L1 and L2 regularization penalties.
- L2: A regularizer that applies an L2 regularization penalty.
- OrthogonalRegularizer: A regularizer that encourages input vectors to be orthogonal to each other.
- Regularizer: Regularizer base class.
dense = tf.keras.layers.Dense(3, kernel_regularizer='l1_l2')
class L2Regularizer(tf.keras.regularizers.Regularizer):
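    # The original card stops at the class header; the body below is a
    # sketch of a minimal custom regularizer following the
    # tf.keras.regularizers.Regularizer subclassing pattern
    # (the 0.01 default rate is an assumption, not from the original).
    def __init__(self, l2=0.01):
        self.l2 = l2

    def __call__(self, x):
        return self.l2 * tf.math.reduce_sum(tf.math.square(x))

    def get_config(self):
        return {'l2': self.l2}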
Binarization
A common operation on text count data, in which the analyst may decide to consider only the presence or absence of a feature rather than, say, a quantified count of occurrences.
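A minimal sketch of one way to do this, using sklearn.preprocessing.Binarizer (the counts and the threshold of 0 are illustrative; for raw text, CountVectorizer(binary=True) achieves the same effect):

from sklearn.preprocessing import Binarizer

counts = [[3, 0, 1],
          [0, 2, 0]]
print(Binarizer(threshold=0).fit_transform(counts))  # values > 0 become 1
👉 [[1 0 1]
   [0 1 0]]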