ML Courses 2 (basic level) Flashcards

Question 1

Q

Numerical data

Answer

A

numerical data

Question 2

Q

A model processes data

Answer

A

a model processes data

Question 3

Q

Why must raw, unprocessed data values be transformed into trainable values? And what is feature engineering?

Answer

A

To ensure effective model training, raw dataset values must be transformed into trainable values within the feature vector.

This process is called feature engineering and it plays a crucial role in improving model performance.

Question 4

Q

Two common feature engineering techniques and a brief description of each.

Answer

A

1) Normalization converts numerical values into a standard range to ensure consistency across different features.

OR

The goal of normalization is to transform features to be on a similar scale.

2) Binning (Bucketing) groups numerical values into predefined ranges to simplify patterns in the data.

Question 5

Q

How to detect outliers?

Answer

A

1) Compare the delta between the 0th and 25th percentiles with the delta between the 75th and 100th percentiles. A large difference suggests the presence of outliers.

2) If the standard deviation is almost as high as the mean, it also suggests the presence of outliers.
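
Both checks can be sketched with Python's standard library. This is only a rough illustration: the function name `outlier_hints` and the sample values are hypothetical, and what counts as a "large" difference is left to the reader.

```python
import statistics

def outlier_hints(values):
    """Return the two rough outlier signals described on this card."""
    ordered = sorted(values)
    # quantiles(n=4) returns the 25th, 50th, and 75th percentiles.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    mean = statistics.mean(ordered)
    stdev = statistics.pstdev(ordered)
    return {
        "lower_delta": q1 - ordered[0],    # 0th to 25th percentile
        "upper_delta": ordered[-1] - q3,   # 75th to 100th percentile
        "stdev_to_mean": stdev / mean if mean else float("inf"),
    }

# Hypothetical sample: one value (500) sits far from the rest.
hints = outlier_hints([10, 12, 11, 13, 12, 14, 11, 500])
print(hints)
```

Here the upper delta is far larger than the lower delta and the standard deviation exceeds the mean, so both signals point to an outlier.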

Question 6

Q

What is standard deviation?

Answer

A

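The standard deviation measures how far values typically lie from the mean. It is the square root of the variance, where the variance is the average squared difference between the values and the mean.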
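
A minimal sketch of the population standard deviation: average the squared differences from the mean, then take the square root. The function name `population_stdev` and the sample values are illustrative only.

```python
import math

def population_stdev(values):
    mean = sum(values) / len(values)
    # Variance: the average squared difference from the mean.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # Standard deviation: the square root of the variance.
    return math.sqrt(variance)

stdev = population_stdev([2, 4, 4, 4, 5, 5, 7, 9])
print(stdev)  # 2.0
```

The standard library's `statistics.pstdev` computes the same quantity.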

Question 7

Q

Main normalization methods and when to use each

Answer

A

Four basic methods of normalization:

Linear Scaling (or just scaling) converts floating-point values into a standard range, such as 0 to 1 or -1 to 1.

This method is the best for features with stable ranges and few outliers.

Z-Score shows how far a value is from the mean, measured in standard deviations.

Z-Score Scaling is useful when the data follows a normal distribution or something close to it.

Log Scaling computes the logarithm of each value to compress large numbers and spread out small numbers.

Best for power law distributions (where small values dominate and large values drop off quickly).

Clipping limits extreme outliers to a specific maximum value to prevent them from distorting the model.
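
The four methods might be sketched as follows. The function names and sample data are illustrative; the sketch assumes non-constant input (so no division by zero) and strictly positive values for the logarithm. The `clip` helper also accepts a lower bound, which goes slightly beyond the maximum-only clipping described above.

```python
import math
import statistics

def linear_scale(values):
    # Map the observed minimum to 0 and the observed maximum to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # Number of standard deviations each value lies from the mean.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

def log_scale(values):
    # Natural logarithm; requires strictly positive values.
    return [math.log(v) for v in values]

def clip(values, lo, hi):
    # Cap every value to the range [lo, hi].
    return [min(max(v, lo), hi) for v in values]

data = [1.0, 10.0, 100.0, 1000.0]
scaled = linear_scale(data)
standardized = z_score(data)
logged = log_scale(data)
clipped = clip(data, 1.0, 100.0)  # [1.0, 10.0, 100.0, 100.0]
```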

Question 8

Q

Feature binning: what is it and when should it be applied?

Answer

A

Binning (also called bucketing) is a technique that groups numerical values into discrete bins or categories.

Binning is a great alternative to scaling or clipping when the relationship between the feature and the label is non-linear or when the feature values are clustered.
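
A minimal sketch of equal-width binning, assuming the bin boundaries are derived from the observed minimum and maximum. The function name `equal_width_bins` and the temperature values are hypothetical.

```python
def equal_width_bins(values, num_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin `num_bins`, so cap the index.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

temps = [0, 5, 12, 18, 25, 31, 40]
bins = equal_width_bins(temps, 4)
print(bins)  # [0, 0, 1, 1, 2, 3, 3]
```

Each value is replaced by the index of its bin, so the model sees a category rather than a raw number.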

Question 9

Q

What is quantile bucketing, and how does it differ from equal-width bucketing?

Answer

A

Quantile bucketing ensures that each bucket contains approximately the same number of examples, unlike equal-width bucketing, which can result in some buckets having too many or too few examples.
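
The difference shows up clearly on skewed data. The sketch below is one possible implementation, not a reference one; the function names and the sample values are assumptions made for illustration.

```python
import bisect
import statistics
from collections import Counter

def quantile_buckets(values, num_buckets):
    # Cut points that split the data into groups of roughly equal size.
    boundaries = statistics.quantiles(values, n=num_buckets, method="inclusive")
    return [bisect.bisect_right(boundaries, v) for v in values]

def equal_width_buckets(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    return [min(int((v - lo) / width), num_buckets - 1) for v in values]

skewed = [1, 2, 3, 4, 5, 6, 7, 100]
by_quantile = Counter(quantile_buckets(skewed, 4))
by_width = Counter(equal_width_buckets(skewed, 4))
print(by_quantile)  # two examples in each of the four buckets
print(by_width)     # seven examples in bucket 0, one in bucket 3
```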

Question 10

Q

Data cleaning

Answer

A

data scrubbing

Question 11

Q

Categories of problematic examples in a dataset

Answer

A

Many examples in datasets are unreliable due to one or more of the following problems:

Omitted values
Duplicate examples
Out-of-range feature values
Bad labels
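
The first three problems can often be found programmatically; bad labels usually need manual review or extra information. A rough sketch, assuming examples are dictionaries with a hypothetical `age` feature and a known valid range:

```python
def find_problems(rows, valid_range):
    lo, hi = valid_range
    seen = set()
    problems = []
    for index, row in enumerate(rows):
        value = row.get("age")
        if value is None:
            problems.append((index, "omitted value"))
        elif not lo <= value <= hi:
            problems.append((index, "out of range"))
        # A row is a duplicate if an identical row appeared earlier.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((index, "duplicate"))
        seen.add(key)
    return problems

rows = [
    {"age": 34, "label": "yes"},
    {"age": None, "label": "no"},
    {"age": 34, "label": "yes"},
    {"age": 240, "label": "no"},
]
problems = find_problems(rows, valid_range=(0, 120))
print(problems)  # [(1, 'omitted value'), (2, 'duplicate'), (3, 'out of range')]
```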

Question 12

Q

What is encoding?

Answer

A

Encoding ensures categorical data is represented in a way that models can understand and train on.

Question 13

Q

What is one-hot encoding?

Answer

A

One-hot encoding is a technique used to convert categorical data into a numerical format that machine learning models can understand.

How it works:
1. Each category is represented by a vector of N elements, where N is the number of categories.
2. Exactly one element in the vector is set to 1, while all others are 0.
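
The two steps above can be sketched in a few lines. The function name `one_hot` and the color categories are illustrative.

```python
def one_hot(category, categories):
    # One element per known category; 1 marks the matching position.
    return [1 if c == category else 0 for c in categories]

colors = ["red", "green", "blue"]
encoded = one_hot("green", colors)
print(encoded)  # [0, 1, 0]
```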

Question 14

Q

What is an imbalanced dataset?

Answer

A

An imbalanced dataset occurs when one class (the majority) significantly outnumbers another class (the minority).

Question 15

Q

Techniques for handling imbalanced datasets

Answer

A

Techniques to handle imbalanced datasets:

Downsampling: randomly removes a subset of majority class examples to improve class balance.
Upweighting: assigns higher importance to minority class examples during training by increasing their example weight.
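
A minimal sketch of both techniques as described on this card. The function names, the `factor` and `weight` parameters, and the 90/10 sample dataset are assumptions made for illustration; training libraries typically accept such weights through their own per-example weight arguments.

```python
import random

def downsample(examples, majority_label, factor, seed=0):
    rng = random.Random(seed)
    majority = [e for e in examples if e["label"] == majority_label]
    minority = [e for e in examples if e["label"] != majority_label]
    # Keep a random 1/factor share of the majority class.
    kept = rng.sample(majority, len(majority) // factor)
    return kept + minority

def upweight(examples, minority_label, weight):
    # Attach a larger example weight to minority class examples.
    return [
        {**e, "weight": weight if e["label"] == minority_label else 1.0}
        for e in examples
    ]

dataset = [{"label": 0}] * 90 + [{"label": 1}] * 10
balanced = downsample(dataset, majority_label=0, factor=9)
weighted = upweight(dataset, minority_label=1, weight=9.0)
print(len(balanced))  # 20: ten majority and ten minority examples
```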

Question 16

Q

Step by step: how to use training data, validation data, and test data

Answer

A

A good workflow for development and testing:

Train the model on the training data.
Evaluate the model on the validation set.
Tweak the model according to the results on the validation set.
Pick the model that does best on the validation set.
Confirm the results on the test data.
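
This workflow depends on three disjoint splits. One way to produce them is sketched below; the function name `split_dataset` and the 70/15/15 proportions are assumptions, not values given on the card.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    shuffled = examples[:]
    # Shuffle first so each split is a random sample of the whole dataset.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

The test split is set aside until the final step, so tweaks guided by the validation set never see it.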

Question 17

Q

Characteristics of a good test or validation set

Answer

A

A good test or validation set should meet the following criteria:

Large enough to yield statistically significant testing results
Representative of the dataset as a whole
Reflects real-world data
No duplicates from the training set

Question 18

Q

What is model generalization?

Answer

A

Generalization is the opposite of overfitting. The goal is to create a model that generalizes well to new data.

Question 19

Q

What does regularization do?

Answer

A

Regularization controls how large the model weights can grow. A higher regularization rate keeps the model simpler and helps prevent overfitting.
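
The card does not name a specific method, so the sketch below assumes L2 regularization, one common choice: the sum of squared weights, scaled by the rate, is added to the loss. The function names and the weight values are illustrative.

```python
def l2_penalty(weights):
    # Large weights contribute disproportionately because they are squared.
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, rate):
    # The rate sets how strongly large weights are penalized.
    return data_loss + rate * l2_penalty(weights)

weights = [0.5, -2.0, 3.0]
low = regularized_loss(1.0, weights, rate=0.01)
high = regularized_loss(1.0, weights, rate=1.0)
print(low, high)
```

With a higher rate the penalty makes up most of the total loss, so training is pushed toward smaller weights.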

Question 20

Q

How does a neural network work?

Answer

A

How a neural network works, step by step:

First, input data goes into the first layer of neurons.
Each neuron multiplies each input by its weight (importance), sums the results, and adds a bias.
The result passes through an activation function, which decides:

whether to send the signal forward or not,
and how strong the signal should be.

The signals move forward to the next layer, then the next, and so on.
Finally, the neural network produces a result, for example, “this is a cat” or “this is spam”.
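
The forward pass described above can be sketched for a tiny two-layer network. The weights, biases, and choice of activation functions (ReLU, then sigmoid) are hypothetical values picked for illustration, not taken from a trained model.

```python
import math

def relu(x):
    # Passes positive signals through and blocks negative ones.
    return max(0.0, x)

def sigmoid(x):
    # Squashes any value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    # Each row of `weights` belongs to one neuron in the layer.
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

inputs = [1.0, 2.0]
hidden = dense(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = dense(hidden, [[1.0, -1.0]], [0.0], sigmoid)
print(hidden)  # [0.0, 2.0]: the first neuron's signal is blocked
print(output)
```

The final value can be read as a score between 0 and 1, for instance how strongly the network believes "this is spam".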

Question 21

Q

What is backpropagation?

Answer

A

Back-propagation is the most common training algorithm for neural networks.

It adjusts model weights using gradients to minimize the loss.
Libraries like Keras handle it automatically.
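
For a single linear neuron, backpropagation reduces to one application of the chain rule, which makes the idea easy to show by hand. The sketch below assumes a mean squared error loss; the function name, learning rate, and data are illustrative.

```python
def train_neuron(xs, ys, learning_rate=0.05, steps=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Forward pass: prediction error for every example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Backward pass: gradients of the mean squared error.
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Update step: move the weights against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# The data follows y = 2x + 1, so training should recover w near 2 and b near 1.
w, b = train_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back to the input.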

Question 22

Q

What is an embedding?

Answer

A

An embedding is a vector representation of data in embedding space. Generally speaking, a model finds potential embeddings by projecting the high-dimensional space of initial data vectors into a lower-dimensional space.
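
A toy illustration of why a low-dimensional embedding space is useful: related items end up close together. The two-dimensional vectors below are made up by hand for this example; real embeddings are learned by a model and usually have many more dimensions.

```python
import math

# Hypothetical 2-dimensional embedding table (not learned values).
embedding_table = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "car": [0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

near = cosine_similarity(embedding_table["cat"], embedding_table["kitten"])
far = cosine_similarity(embedding_table["cat"], embedding_table["car"])
print(near > far)  # True
```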

Question 23

Q