ML Courses 2 (basic level) Flashcards

1
Q

Data expressed as numbers (term)

A

numerical data

2
Q

The model handles the data

A

a model processes data

3
Q

Why must raw dataset values be transformed into trainable values? And what is feature engineering?

A

To ensure effective model training, raw dataset values must be transformed into trainable values within the feature vector.

This process is called feature engineering, and it plays a crucial role in improving model performance.

4
Q

Two common feature engineering techniques and a brief description of each.

A

1) Normalization converts numerical values into a standard range to ensure consistency across different features.

OR

The goal of normalization is to transform features to be on a similar scale.

2) Binning (Bucketing) groups numerical values into predefined ranges to simplify patterns in the data.

5
Q

How to detect outliers?

A

1) Compare the 0th-25th percentile delta with the 75th-100th percentile delta. A large difference suggests the presence of outliers.

2) If the standard deviation is almost as high as the mean, it also suggests the presence of outliers.
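
Both checks can be sketched with Python's standard library. This is only an illustration: the helper name `outlier_hints` and the sample values are made up, and what counts as a "large" difference is left to the reader.

```python
import statistics

def outlier_hints(values):
    """Return the two rough outlier signals for a numerical feature."""
    # With n=4, quantiles() returns the 25th, 50th and 75th percentiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return {
        "low_delta": q1 - min(values),    # 0th to 25th percentile
        "high_delta": max(values) - q3,   # 75th to 100th percentile
        "std_to_mean": std / mean,        # near or above 1 is suspicious
    }

# Hypothetical feature with one extreme value (500).
hints = outlier_hints([10, 12, 11, 13, 12, 11, 10, 500])
print(hints)
```

Here the upper delta is hundreds of times larger than the lower one, and the standard deviation exceeds the mean, so both signals point at the value 500.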

6
Q

What is standard deviation?

A

The standard deviation is the square root of the variance, where the variance is the average squared difference between the values and the mean.
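
A minimal sketch of the calculation, assuming the population form (dividing by the number of values rather than by one less). The function name is illustrative.

```python
import math

def standard_deviation(values):
    """Population standard deviation: square root of the variance."""
    mean = sum(values) / len(values)
    # Variance: the average squared difference from the mean.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

# Mean is 5, squared differences sum to 32, variance is 4.
result = standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(result)  # 2.0
```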

7
Q

The main normalization methods and when to use each

A

Four basic methods of normalization:

  1. Linear scaling (or simply scaling) converts floating-point values into a standard range, such as 0 to 1 or -1 to 1.

This method works best for features with stable ranges and few outliers.

  2. Z-score scaling expresses how far a value is from the mean, measured in standard deviations.

Z-score scaling is useful when the data follows a normal distribution or something close to it.

  3. Log scaling computes the logarithm of each value to compress large numbers and spread out small numbers.

Best for power law distributions (where small values dominate and large values drop off quickly).

  4. Clipping limits extreme outliers to a specific maximum value to prevent them from distorting the model.
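
The four methods can be sketched in plain Python. The function names are illustrative, the natural logarithm is an assumption (any base compresses large values), and `clip` takes a lower bound as well as the maximum for convenience.

```python
import math

def linear_scale(values):
    """Map values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def log_scale(values):
    """Take the natural log of each value; every value must be positive."""
    return [math.log(v) for v in values]

def clip(values, lower, upper):
    """Cap every value to the range lower..upper."""
    return [min(max(v, lower), upper) for v in values]

print(linear_scale([10, 20, 30]))         # [0.0, 0.5, 1.0]
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))  # first value is -1.5
print(clip([1, 50, 1000], 0, 100))        # [1, 50, 100]
```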
8
Q

Binning a feature: what it is and when to apply it.

A

Binning (also called bucketing) is a technique that groups numerical values into discrete bins or categories.

Binning is a good alternative to scaling or clipping when the relationship between the feature and the label is non-linear, or when the feature values are clustered.

9
Q

What is quantile bucketing, and how does it differ from equal-width bucketing?

A

Quantile bucketing ensures that each bucket contains approximately the same number of examples,

unlike equal-width bucketing, which can result in some buckets having too many or too few examples.
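
The difference shows up clearly on skewed data. The sketch below is one simple way to implement both schemes; the function names and sample values are made up, and the quantile version splits by rank, so tied values may land in neighbouring buckets.

```python
def equal_width_buckets(values, num_buckets):
    """Assign bucket indices using ranges of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    # The maximum would fall just past the last bucket, so cap the index.
    return [min(int((v - lo) / width), num_buckets - 1) for v in values]

def quantile_buckets(values, num_buckets):
    """Assign bucket indices so each bucket holds about the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, i in enumerate(order):
        buckets[i] = rank * num_buckets // len(values)
    return buckets

values = [1, 2, 3, 4, 5, 6, 7, 100]   # hypothetical skewed feature
print(equal_width_buckets(values, 4))  # [0, 0, 0, 0, 0, 0, 0, 3]
print(quantile_buckets(values, 4))     # [0, 0, 1, 1, 2, 2, 3, 3]
```

With equal widths, seven of the eight values crowd into the first bucket and two buckets stay empty; with quantiles, every bucket receives two values.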

10
Q

Data cleaning

A

data scrubbing

11
Q

Categories of problematic examples in a dataset

A

Many examples in datasets are unreliable due to one or more of the following problems:

Omitted values
Duplicate examples
Out-of-range feature values
Bad labels

12
Q

What is encoding?

A

Encoding represents categorical data in a numerical form that models can understand and train on.

13
Q

What is one-hot encoding?

A

One-hot encoding is a technique used to convert categorical data into a numerical format that machine learning models can understand.

How it works:
1. Each category is represented by a vector of N elements, where N is the number of categories.
2. Exactly one element in the vector is set to 1, while all others are 0.
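
A minimal sketch of the two steps, using a made-up list of colour categories. The helper name `one_hot` is illustrative, not taken from any library.

```python
def one_hot(category, categories):
    """Return a vector with a single 1 at the position of `category`."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if c == category else 0 for c in categories]

colors = ["red", "green", "blue"]  # N = 3 categories
print(one_hot("green", colors))    # [0, 1, 0]
```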

14
Q

What is an imbalanced dataset?

A

A dataset is imbalanced when one class (the majority class) significantly outnumbers another class (the minority class).

15
Q

Techniques for handling imbalanced datasets

A

Techniques to handle imbalanced datasets:

  1. Downsampling: randomly removes a subset of majority class examples to improve class balance.
  2. Upweighting: assigns higher importance to minority class examples during training by increasing their example weight.
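
Both techniques can be sketched on a toy dataset of `(feature, label)` pairs. The function names, the 90/10 class split and the weight of 9.0 are all assumptions chosen for illustration.

```python
import random

def downsample_majority(examples, majority_label, keep_count, seed=0):
    """Keep all minority examples and a random subset of the majority."""
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]
    return minority + rng.sample(majority, keep_count)

def example_weights(examples, minority_label, minority_weight):
    """Give minority examples a larger weight; all others get 1.0."""
    return [minority_weight if label == minority_label else 1.0
            for _, label in examples]

# Hypothetical dataset: 90 examples of class 0, 10 of class 1.
examples = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

balanced = downsample_majority(examples, majority_label=0, keep_count=10)
weights = example_weights(examples, minority_label=1, minority_weight=9.0)
print(len(balanced))  # 20 examples, 10 per class
print(sum(weights))   # 180.0, split evenly between the two classes
```

The weights would then be passed to whatever training routine is in use, for example as per-example weights in the loss.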
16
Q

Step by step: how to use training data, validation data, and test data

A

A good workflow for development and testing:

  1. Train the model on the training data.
  2. Evaluate the model on the validation set.
  3. Tweak the model according to the results on the validation set.
  4. Pick the model that does best on the validation set.
  5. Confirm the results on the test data.
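
The workflow presupposes three disjoint sets. One simple way to produce them is a shuffled split, sketched below; the 70/15/15 proportions and the function name are assumptions, not part of the card.

```python
import random

def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples and cut them into train, validation and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Steps 1 to 4 touch only `train` and `val`; `test` is held back until step 5 so that it gives an unbiased final check.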
17
Q

Characteristics of a good test or validation set

A

A good test or validation set should meet the following criteria:

  1. Large enough to yield statistically significant testing results.
  2. Representative of the dataset as a whole.
  3. Reflects real-world data.
  4. Contains no duplicates of examples in the training set.
18
Q

What is model generalization?

A

Generalization is a model's ability to make good predictions on new, previously unseen data; it is the opposite of overfitting. The goal is to create a model that generalizes well to new data.

19
Q

What does regularization do?

A

Regularization controls how large the model weights can grow. A higher regularization rate keeps the model simpler and helps prevent overfitting.
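
One common form is L2 regularization, which adds the sum of squared weights, multiplied by the rate, to the loss. The card does not say which form is meant, so treat the choice of L2 and the names below as assumptions.

```python
def l2_penalty(weights):
    """Sum of squared weights: grows quickly as weights get large."""
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, regularization_rate):
    """Loss the optimizer minimizes: fit to the data plus weight penalty."""
    return data_loss + regularization_rate * l2_penalty(weights)

weights = [3.0, -4.0]  # hypothetical weights, penalty is 9 + 16 = 25
print(regularized_loss(0.5, weights, regularization_rate=0.0))  # 0.5
print(regularized_loss(0.5, weights, regularization_rate=0.1))  # about 3.0
```

Raising the rate makes large weights more expensive, so training is pushed toward smaller weights and a simpler model.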

20
Q

How does a neural network work?

A

How a neural network works, step by step:

  1. Input data goes into the first layer of neurons.
  2. Each neuron multiplies each input by its weight (importance), sums the results, and adds a bias.
  3. The result passes through an activation function, which decides:
  • whether to send the signal forward or not,
  • and how strong the signal should be.
  4. The signals move forward to the next layer, then the next, and so on.
  5. Finally, the neural network produces a result - for example, “this is a cat” or “this is spam”.
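
The steps above can be sketched as a forward pass through a tiny network with two inputs, two hidden neurons and one output neuron. All weights and biases are made-up numbers, and ReLU and sigmoid are just two common choices of activation function.

```python
import math

def relu(z):
    """Pass positive signals through unchanged and block negative ones."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases, activation):
    """Compute the outputs of one layer; one row of weights per neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of the inputs plus the bias.
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

inputs = [1.0, 2.0]
hidden = layer_forward(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = layer_forward(hidden, [[1.0, 0.5]], [-1.0], sigmoid)
print(hidden)  # [0.0, 2.0]: the first neuron's signal is blocked by ReLU
print(output)  # [0.5]
```

The final number could be read as a probability, for example of the input being spam.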
21
Q

What is backpropagation?

A

Back-propagation is the most common training algorithm for neural networks.

  • It computes the gradient of the loss with respect to each weight, and those gradients are used to adjust the weights so that the loss decreases.
  • Libraries like Keras handle it automatically.
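
The idea can be shown on the smallest possible case: a single neuron with one weight, one bias, no activation, and a squared-error loss. This is a hand-written sketch of the chain rule, not how any particular library implements it; the training example and learning rate are made up.

```python
def train_step(w, b, x, y_true, learning_rate):
    """Run one forward pass, one backward pass, and one weight update."""
    # Forward pass: prediction and loss.
    y_pred = w * x + b
    loss = (y_pred - y_true) ** 2
    # Backward pass: chain rule from the loss back to each parameter.
    dloss_dpred = 2 * (y_pred - y_true)
    grad_w = dloss_dpred * x  # d(pred)/dw = x
    grad_b = dloss_dpred      # d(pred)/db = 1
    # Gradient descent: step against the gradient.
    return w - learning_rate * grad_w, b - learning_rate * grad_b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y_true=4.0, learning_rate=0.05)
    losses.append(loss)

print(losses[0])   # 16.0
print(losses[-1])  # very close to 0
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back toward the input.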
22
Q

What is an embedding?

A

An embedding is a vector representation of data in an embedding space. Generally speaking, a model finds potential embeddings by projecting the high-dimensional space of the initial data vectors into a lower-dimensional space.
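
As a rough illustration, projecting a one-hot vector through a weight matrix simply selects one row of that matrix, and that row is the embedding. The matrix values below are invented; in a real model they would be learned during training.

```python
def embed(one_hot_vector, embedding_matrix):
    """Multiply a one-hot vector by a matrix with one row per category."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vector[i] * embedding_matrix[i][j]
            for i in range(len(one_hot_vector)))
        for j in range(dim)
    ]

# Hypothetical matrix: 4 categories projected into 2 dimensions.
embedding_matrix = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
]

vector = embed([0, 0, 1, 0], embedding_matrix)
print(vector)  # [0.7, 0.3], the third row of the matrix
```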