Learning From Data Flashcards

(407 cards)

1
What is Structured Data?
Data that is organised in a predefined schema, for example data stored in a Relational Database.
2
What is Semi-Structured Data?
Data that has some structure, but not in fixed rows or columns. For example: JSON, XML or NoSQL databases.
3
What is Unstructured Data?
Data without a predefined structure, which doesn’t fit neatly in tables. For example: text, images, videos or PDFs.
4
What is Data Integration?
The practice of combining data from different sources into a single, coherent data store.
5
What is Common User Interface?
Manual data integration through a single user interface; controlled, but not scalable.
6
What is Middleware Data Integration?
Uses middleware software to bridge and facilitate communication between different systems; consistent, but needs maintenance.
7
What is Application-based integration?
Software applications locate, retrieve and integrate data by making data from different sources and systems compatible with one another.
8
What is uniform data access?
Virtual integration: provides a consistent view of data from diverse sources without moving or altering it, keeping the data in its original location. Lighter-weight, but may affect integrity.
9
What is Common data storage?
Retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository. Well suited to analysis, but storage is expensive.
10
What is the difference between supervised and unsupervised learning?
Supervised learning uses data with labelled outcomes, while unsupervised learning algorithms use data without labelled outcomes.
11
What is supervised learning?
Learning a mapping function from inputs x to outputs y, where x is the features and y is a label or target.
12
What is unsupervised learning?
Learning that aims to make sense of data and uncover patterns, without labelled outcomes.
13
In supervised learning, what are features?
The input variables (columns in the dataset).
14
What are model parameters?
Values that the model learns (e.g., coefficients in linear regression).
15
What’s the difference between ŷ and y?
ŷ is the model’s prediction; y is the true observed value.
16
When is Regression used in Supervised Learning?
When the target is a quantitative value.
17
When is Classification used in Supervised Learning?
When the target is qualitative or a class.
18
What is the equation for β₁ (the slope) in ŷ = β₁x + β₀ for OLS linear regression?
β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ((xᵢ - x̄)²)
19
What is the equation for β₀ (the intercept) in ŷ = β₁x + β₀ for OLS linear regression?
β₀ = ȳ − β₁x̄
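These two closed-form estimates as a minimal NumPy sketch (my own illustration, not part of the deck):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS fit for y ≈ β₁x + β₀."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta1, beta0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(ols_fit(x, y))  # slope ≈ 2.03, intercept ≈ 0.0
```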
20
What is the error in OLS regression?
The difference between actual value (y) and predicted value (ŷ): Error = y − ŷ
21
What is the Sum of Squared Errors (SSE)?
The sum of all squared differences between predicted and actual values.
22
How is Mean Squared Error (MSE) calculated?
MSE = SSE divided by the number of observations.
23
What does R² (Coefficient of Determination) tell us?
The proportion of total variation in the dependent variable explained by the model.
24
What is the goal of Ordinary Least Squares (OLS)?
To minimize the sum of squared prediction errors.
25
What is a prediction error in linear regression?
The difference between the predicted value (ŷ) and the actual value (y), i.e., ŷ − y.
26
What is the Total Sum of Squares (TSS)?
The total variation of the observed values around their mean: TSS = Σ(yᵢ − ȳ)².
27
What is the Explained Sum of Squares (ESS)?
The part of the total variation explained by the regression model: ESS = Σ(ŷᵢ − ȳ)².
28
What is the Residual Sum of Squares (RSS)?
The sum of the squared residuals (errors), i.e., the unexplained variation: RSS = Σ(yᵢ − ŷᵢ)².
29
How is R² calculated in regression?
R² = 1 − (RSS / TSS)
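A small NumPy sketch tying RSS and TSS to R² (illustrative, not from the deck):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation
    tss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1 - rss / tss                # proportion of variation explained
```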
30
What is the objective of OLS linear regression?
To find a line that minimizes the prediction error across all data points.
31
Why do we take partial derivatives in OLS?
To find the minimum of the squared error function.
32
What are the normal equations in OLS linear regression?
They are the two equations you get when you take derivatives of the error function and set them to 0. Solving them gives you β₀ and β₁.
33
What is the goal of a predictive model?
To generate accurate forecasts for future or unseen data.
34
Give an example of using regression for prediction.
Predicting whether a customer will default on a loan.
35
Give an example of using regression for interpretation.
Understanding how marketing spend affects sales revenue.
36
What risk comes from focusing only on prediction?
You might create a black-box model that lacks explainability
37
What should you always define before training a model?
The cost function you want to minimize.
38
What is the benefit of trying multiple models?
To compare them and choose the best based on performance metrics.
39
What are hyperparameters in model training?
Settings that control the learning process but are not learned from the data.
40
What’s a good workflow for modeling?
Choose cost function → Train multiple models → Compare → Interpret or predict.
41
What’s the main reason for using polynomial regression?
To model nonlinear relationships in data by adding higher-order terms of features.
42
Why is polynomial regression still considered linear?
Because it is linear in the parameters (weights), even if the features are nonlinear.
43
What is BIC (Bayes Information Criterion) used for?
To select the best model by balancing complexity and error.
44
What is the BIC formula?
BIC = n·ln(SSE) + p·ln(n) − n·ln(n), where p is the number of parameters, n the number of observations, and SSE the sum of squared errors.
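A direct transcription of this formula (a sketch; it assumes SSE > 0 so the log is defined):

```python
import numpy as np

def bic(sse, n, p):
    """BIC = n·ln(SSE) + p·ln(n) − n·ln(n); lower is better."""
    return n * np.log(sse) + p * np.log(n) - n * np.log(n)
```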
45
What does a lower BIC indicate?
A better trade-off between model complexity and goodness of fit.
46
What is overfitting?
When a model is too complex and captures noise in the data, it has a low training error but a high validation error.
47
What’s the purpose of splitting data into training and testing sets?
To train the model on one part and test it on unseen data to evaluate generalisation.
48
What does train_test_split(X, y, test_size=0.4) do?
Randomly splits 40% of the data into a test set and 60% into a training set.
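For example, in scikit-learn (the toy dataset here is my own choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_train.shape, X_test.shape)  # (60, 20) (40, 20)
```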
49
Why is k-fold CV better than a single train/test split?
It gives a more reliable estimate of model performance and reduces variance in the evaluation.
50
What is k-fold cross-validation?
A method that splits data into k subsets, trains on k–1, tests on the 1 left out, and averages results over k rounds.
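A minimal 5-fold cross-validation sketch with scikit-learn (the model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # performance averaged over the 5 folds
```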
51
What is stratified sampling in machine learning?
Sampling technique that maintains the same distribution of classes in training and test sets as in the full dataset.
52
What is underfitting?
When a model is too simple to capture the data’s patterns; both training and validation error are high.
53
Define bias in a model.
Bias is the error resulting from incorrect assumptions in the learning algorithm, leading to underfitting.
54
Define variance in a model.
Variance is the error resulting from sensitivity to small fluctuations in the training set, leading to overfitting.
55
What are the three sources of model error?
Bias, variance, and irreducible error (noise).
56
What is Irreducible Error?
Unavoidable errors due to randomness in data. Present even in the best model.
57
What does increasing the model complexity cause?
Lowers bias but raises variance.
58
What does decreasing the model complexity cause?
Raises bias and lowers variance.
59
What is the Bias-Variance tradeoff?
The goal is to find the sweet spot where the total error is minimised, balancing bias and variance.
60
What does regularisation do?
It adds a penalty to large model weights to reduce overfitting and control complexity.
61
What is LASSO Regression?
A regularisation method that uses the sum of the absolute value of the coefficients (L1 norm). It shrinks and eliminates some coefficients (feature selection).
62
What is Ridge Regression?
A regularisation method using the sum of the squared weights (L2 norm). It shrinks coefficients but does not set any to zero.
63
What’s the key difference between Ridge and LASSO regression?
LASSO can eliminate features by setting coefficients to zero; Ridge cannot.
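A sketch of that difference with scikit-learn (synthetic data where only the first feature matters; the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 is relevant

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients driven to zero
```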
64
What is the L1 norm?
The sum of the absolute values of the coefficients; used in LASSO.
65
What is the L2 norm?
The square root of the sum of squares of the coefficients; used in Ridge regression.
66
When should you use regularisation?
When your model is overfitting due to high variance.
67
What is Feature Selection?
Reducing the number of features in the equation, which can prevent overfitting.
68
What is data cleaning?
The process of detecting and correcting corrupt, inaccurate, or incomplete data before analysis.
69
What are missing values?
Expected data values that are absent, often shown as NaN, None, N/A, etc.
70
What causes missing values?
Human error, skipped questions, sensor failure, database issues, and more.
71
What are the three types of missing data?
MCAR (completely random), MAR (related to observed data), MNAR (related to missing data itself).
72
What is MCAR data?
When data is missing purely by chance; the probability of a missing value is equal for all units.
73
What is MAR data?
When some data objects are more likely to have missing values; the probability of a value being missing is related to the observed data, but not to the missing data itself.
74
What is MNAR data?
When missingness depends on the missing values themselves; the probability of a value being missing is related to the actual (unobserved) data.
75
What are the four approaches to dealing with missing data?
Keep as is, remove rows, remove columns, impute values.
76
When is keeping the data as is with missing values useful?
When sharing data with others, and when algorithms can handle missing values.
77
When is removing rows with missing data useful?
With MCAR data, and only when other strategies don't work.
78
When is removing columns useful with missing data?
When there is a large amount of data missing e.g. 25%, and it is not a critical column.
79
When is imputing values useful with missing data?
To minimize analytical bias. For MCAR use mean/median/mode; for MAR use a central tendency of a relevant data subgroup; for MNAR use regression analysis.
80
What is an outlier?
A data point that significantly deviates from other observations.
81
How can we detect outliers?
Using the IQR method, where IQR is Q3-Q1: values outside Q1 − 1.5×IQR or Q3 + 1.5×IQR.
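The IQR rule as a small NumPy sketch (my own illustration):

```python
import numpy as np

def iqr_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lo) | (x > hi)]

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 50])
print(iqr_outliers(x))  # [50]
```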
82
What should we do with outliers?
Leave them, cap them, log transform, or remove them (last resort).
83
When should you leave an outlier?
When the modelling used is robust against outliers, or detecting outliers is the goal.
84
When should you cap an outlier?
When analysis is sensitive to outliers.
85
When should you log transform an outlier?
When data is skewed, so some objects significantly deviate from the majority.
86
When should you remove data objects with outliers?
As a last resort, when other methods are inapplicable.
87
What are random and systematic errors?
Random ones are unavoidable fluctuations in data. Systematic ones are consistent repeatable errors that could be associated with the data source.
88
What is data standardisation?
Rescaling data to have mean 0 and standard deviation 1. For every feature, subtract the mean and divide by SD.
89
What is data normalisation?
Rescaling data to a [0, 1] range using min-max scaling. For every feature, subtract the minimum, and divide by the full range.
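Both rescalings in a few lines of NumPy (scikit-learn's StandardScaler and MinMaxScaler do the same job):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

standardised = (x - x.mean()) / x.std()           # mean 0, SD 1
normalised = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
print(standardised, normalised)
```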
90
When should we use log transformation?
For skewed data, or data that spans several orders of magnitude. For every feature, log it.
91
What is discretisation?
Converting continuous features into categorical by binning values.
92
What is smoothing?
Reducing noise in data using techniques like moving average.
93
What is moving average for smoothing?
Averaging data points in successive subsets.
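A moving average in one NumPy call (window size is illustrative):

```python
import numpy as np

x = np.array([1, 2, 8, 3, 4])
window = 3
smoothed = np.convolve(x, np.ones(window) / window, mode="valid")
print(smoothed)  # [3.67 4.33 5.0] — each value averages 3 successive points
```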
94
What are bar charts good for?
Categories
95
What are Line plots good for?
Trends
96
What are scatter plots good for?
To see relationships between two or more variables
97
What are heatmaps good for?
Visualising matrices or correlations
98
What is the goal of classification in machine learning?
To predict categorical outcomes (class labels) using input data.
99
What type of learning is classification?
Supervised learning; it is trained on labelled examples.
100
What is the typical train-test data split ratio?
80% training, 20% testing.
101
What does logistic regression model?
The probability of a binary outcome based on one or more features.
102
What is the output range of logistic regression?
Between 0 and 1 (probability).
103
What function does logistic regression use to map outputs?
The sigmoid function: σ(x) = 1 / (1 + e^(−x))
104
What kind of input/output does logistic regression handle?
Continuous input, binary output.
105
How do you evaluate a logistic regression model?
Using accuracy (correct predictions / all predictions), not R².
106
What is the logistic regression decision boundary typically set at?
0.5, meaning predict class 1 if probability > 0.5.
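A tiny sketch of the sigmoid plus the 0.5 threshold (w and b are illustrative, pre-trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x, threshold=0.5):
    p = sigmoid(w @ x + b)   # probability of class 1
    return int(p > threshold)

print(predict(np.array([2.0, -1.0]), 0.5, np.array([1.0, 0.5])))  # 1
```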
107
What is the Perceptron?
The first algorithm to classify data using a linear decision boundary.
108
What activation function does the Perceptron use?
A step function which outputs either 0 or 1.
109
What is the forward pass equation for the Perceptron?
ŷ = f(w·x + b), where w is the weight vector, x the feature vector, b the bias term, and f the step function.
110
What is the weight update rule in a perceptron?
w ← w + α(y − ŷ)x, where w is the weight vector, α the learning rate (a small positive number), y the true label, ŷ the predicted label, and x the input vector.
111
When are weights updated in perceptron training?
Only when the prediction is incorrect.
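The full training loop as a sketch (the learning rate and epoch count are illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """Perceptron rule: weights change only on misclassified examples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(w @ x_i + b > 0)   # step activation
            w += lr * (y_i - y_hat) * x_i  # zero update when prediction is correct
            b += lr * (y_i - y_hat)
    return w, b
```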
112
What is a major limitation of the Perceptron?
It can only solve problems that are linearly separable.
113
What problem does MLP solve?
It can handle non-linear data by adding hidden layers.
114
What is the role of hidden layers in MLP?
They allow the model to learn complex, non-linear relationships.
115
What does MLP stand for?
Multi-Layer Perceptron.
116
What kind of function is used by MLP instead of the step function?
Usually sigmoid, tanh, or ReLU functions.
117
What is gradient descent?
An algorithm for finding the minimum of a loss function by moving in the direction of steepest descent.
118
What is the main goal of gradient descent?
To minimise the loss (error) of a model by adjusting its parameters.
119
What is a loss function?
A function that quantifies how far off a model's predictions are from the true values.
120
What is the L2 loss function in machine learning?
The square root of the sum of squared differences between predicted and true values.
121
What is a gradient?
A vector of partial derivatives that points in the direction of greatest increase in a function.
122
How does gradient descent update parameters?
θ ← θ − α·∇J(θ), where α is the learning rate, θ the parameters, and ∇J(θ) the gradient of the loss function.
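The update rule in action on a one-parameter loss (a toy example of my own):

```python
# Minimise J(θ) = (θ − 3)², whose gradient is ∇J(θ) = 2(θ − 3).
theta, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad   # θ ← θ − α·∇J(θ)
print(theta)  # ≈ 3, the minimiser
```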
123
What does the learning rate control in gradient descent?
The size of the step taken toward the minimum during each update.
124
What’s the difference between stochastic and deterministic gradient descent?
Stochastic uses randomness in updates (e.g. using random samples); deterministic does not.
125
What is the confusion matrix used for?
To summarize the performance of a classification model by comparing predictions and actual values.
126
What are true positives (TP)?
Cases where the model correctly predicted the positive class.
127
What are false positives (FP)?
Cases where the model incorrectly predicted the positive class.
128
What are false negatives (FN)?
Cases where the model missed the actual positive class.
129
What are true negatives (TN)?
Cases where the model correctly predicted the negative class.
130
What is precision?
True positives / Total predicted positives. How many of the predicted positives were correct.
131
What is recall?
True positives / Total actual positives. How many of the actual positives were correctly predicted.
132
What is specificity?
True negatives / Total actual negatives. How many of the actual negatives were correctly predicted.
133
What is the F1 score?
F1 = 2 × (Precision × Recall) / (Precision + Recall)
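These three metrics from raw confusion-matrix counts (the counts are made up for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(p, r, f1(p, r))  # 0.8, 0.667, 0.727
```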
134
What does the ROC curve show?
It shows how true positive rate and false positive rate vary with decision threshold.
135
What is AUC?
Area under the ROC curve. It summarises classifier quality over all thresholds.
136
What does an AUC of 1.0 mean?
Perfect classifier.
137
What does an AUC of 0.5 mean?
Random guessing.
138
Why might high accuracy be misleading?
Because it doesn't account for class imbalance — a model could guess the majority class and still score high.
139
What trade-off does the ROC curve help you visualise?
The trade-off between sensitivity (TP rate) and specificity (1 - FP rate).
140
What does NLP stand for?
Natural Language Processing — how computers understand and analyse human language.
141
What does a bag-of-words model represent?
A document as a vector of word frequencies, ignoring grammar or word order.
142
What is tokenisation?
Splitting text into individual words (tokens).
143
What is stemming?
Reducing words to their root form by chopping off endings (e.g., "playing" → "play").
144
What is lemmatisation?
Reducing words to their dictionary base form using vocabulary (e.g., "better" → "good").
145
Why do we use stemming or lemmatisation?
To group together different forms of the same word for better text analysis.
146
What is a sparse matrix in NLP?
A matrix where most values are zero, since each document contains only a small subset of all possible words.
147
What is TF-IDF?
A statistic that measures how important a word is in a document relative to all documents.
148
What does TF stand for in TF-IDF?
Term Frequency — how often a word appears in a document.
149
What does IDF stand for in TF-IDF?
Inverse Document Frequency — how rare a word is across all documents.
150
Why do we use the IDF component in TF-IDF?
To downweight common words and upweight rare but meaningful ones.
151
Why is TF-IDF better than raw word counts?
It highlights words that are unique and informative in each document.
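A quick TF-IDF example with scikit-learn (the documents are my own toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # sparse document-term matrix
print(vec.get_feature_names_out())
print(X.toarray().round(2))      # rare, document-specific words score highest
```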
152
What is text classification?
The task of assigning categories to text, e.g. spam vs not spam
153
Why is tokenisation harder in some languages like Vietnamese?
Words may not be separated clearly by spaces
154
What is a corpus in NLP?
The collection of all text data being used
155
What does a high TF-IDF score indicate?
That a word is frequent in a document but rare elsewhere
156
What can vector representations of text be used for?
Classification, clustering, similarity analysis
157
What are bigrams and trigrams?
Word pairs and triples, e.g., “burn fire”, “black night sky”
158
What is a document-term matrix?
A matrix where each row is a document and each column is a word, with values indicating presence, count, or TF-IDF.
159
Why are sparse vectors problematic?
They are high-dimensional and hard to cluster.
160
What is Latent Semantic Indexing (LSI)?
An early method for topic modelling using linear algebra.
161
What is topic modelling?
Unsupervised technique to reduce dimensionality and extract hidden topics from documents.
162
What is Latent Dirichlet Allocation (LDA)?
A probabilistic method that models each document as a mixture of topics.
163
Give a real-world use case of topic modelling.
Automatically summarising what an article is about.
164
What is a downside of topic modelling?
It struggles with short texts and unseen documents.
165
What is a word embedding?
A dense vector representation of a word capturing its context and meaning.
166
How are word embeddings trained?
Using neural networks to predict a word from context (or vice versa).
167
What is the Skip-Gram model?
Predicts context words given a target word.
168
What is the CBOW model?
Predicts a target word given context words.
169
Why are dense vectors better than sparse ones?
They are low-dimensional and capture semantics.
170
What does it mean if two word vectors are close together?
The words have similar meanings or usage.
171
How do embeddings capture relationships?
Vector arithmetic can reveal analogies (e.g., king - man + woman ≈ queen).
172
What is polysemy?
When a word has multiple meanings
173
What are sense embeddings?
Vectors that represent specific senses of a word, not just the word itself.
174
What is a sentence embedding?
A vector that represents the meaning of a full sentence.
175
Name three applications of sentence embeddings.
Semantic similarity, translation, summarisation.
176
How are sentence embeddings built?
Using deeper neural networks like transformers.
177
What is the core idea behind sentence embeddings?
Use hidden layer vectors from a trained language model.
178
How are embeddings used in search engines?
To find documents similar in meaning to a query.
179
How do embeddings help with clustering?
Similar documents get grouped based on vector proximity.
180
What is the challenge with training sense embeddings?
Requires labelled data for each sense of a word.
181
How can embeddings work across languages?
By mapping similar words from different languages to nearby vectors.
182
What type of machine learning algorithm is KNN?
Supervised learning algorithm
183
What is KNN mostly used for?
Classification
184
What does KNN assume about data?
Similar data points exist close to each other in feature space
185
What metric does KNN use to measure similarity?
Distance (usually Euclidean)
186
What is the role of the parameter k in KNN?
It determines how many nearest neighbours to consider
187
How is the class of a new point predicted in KNN?
By the majority class of its k nearest neighbours
188
What kind of data does KNN store?
The entire training set
189
Why can outliers be a problem for KNN?
They can wrongly influence classification due to being close by accident
190
What is class imbalance in KNN?
When one class has many more examples than another, biasing results
191
How can we reduce the effect of class imbalance in KNN?
Use weighted KNN where closer neighbours have more influence
192
What happens if k is too small in KNN?
The model becomes too sensitive to noise (overfits)
193
What happens if k is too large in KNN?
It smooths out local patterns (underfits)
194
What is a common rule-of-thumb for choosing k?
Set k ≈ √(number of training samples)
195
Why should k be odd in binary classification?
To avoid ties between classes
196
What is weighted KNN?
A version where closer neighbours have more impact on the classification
197
What does weighted KNN use to assign importance to neighbours?
The inverse of the distance
198
Is KNN a model-based or instance-based algorithm?
Instance-based
199
How does KNN handle new data points?
It computes distance from the new point to all training points
200
Why is KNN slow with large datasets?
It compares against every data point at prediction time
201
What should be done to data before applying KNN, and why?
Normalisation, to ensure fair distance calculation
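Putting the KNN cards together in scikit-learn (the dataset and k are illustrative): scale first, then use distance-weighted neighbours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale for fair distances, then weighted KNN: closer neighbours count more.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
print(knn.fit(X_train, y_train).score(X_test, y_test))
```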
202
What does SVM stand for?
Support Vector Machine
203
Is SVM supervised or unsupervised?
Supervised
204
What types of tasks can SVM be used for?
Classification, regression, and clustering
205
What is the main goal of an SVM?
To find the optimal hyperplane that separates data points with the largest margin.
206
What is the margin in SVM?
The shortest distance between the decision boundary and the closest data points
207
What are support vectors in SVM?
The data points closest to the decision boundary that define it
208
What happens if the margin is maximised in SVM?
The classifier becomes more confident and generalises better.
209
Why is SVM sensitive to outliers?
Because a single outlier can affect the decision boundary significantly
210
What is a soft margin in SVM?
A margin that allows some misclassification to improve generalisation
211
What trade-off does soft margin involve in SVM?
Bias vs variance
212
What happens if the data is not linearly separable in SVM?
It is mapped to a higher-dimensional space where it may become linearly separable
213
Give an example of a transformation that helps linear separation in SVM.
f(x) = x²
214
What is the shape of the decision boundary in the transformed space in SVM?
A hyperplane
215
What is the kernel trick?
A way to compute dot products in high-dimensional space without explicitly transforming the data
216
Why is the kernel trick useful in SVM?
It reduces computational cost and makes SVMs feasible in high-dimensional spaces
217
What does a kernel function compute in SVM?
The dot product of two transformed vectors
218
Give the formal definition of a kernel function in SVM.
k(x, z) = ⟨f(x), f(z)⟩, where f maps input to feature space
219
What kind of result does a kernel function return in SVM?
A real number
220
What does the kernel matrix contain in SVM?
Pairwise similarity scores between all points in the dataset
221
What does the C parameter control in SVM?
The trade-off between maximising the margin and minimising classification error
222
What does gamma control in an SVM?
The complexity of the decision boundary (low gamma = smoother boundary)
223
What happens with very high gamma in SVM?
Model overfits (too complex)
224
What are the key components of SVM?
Support vectors, margin, hyperplane, kernel trick
225
Why is SVM powerful for classification?
It builds robust boundaries and handles non-linearity using kernel methods
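An RBF-kernel SVM on data that is not linearly separable (the C and gamma values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

# C trades margin width against misclassification; gamma sets boundary complexity.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.support_vectors_.shape)  # the points that define the boundary
```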
226
What is a decision tree?
A tree-like model that splits data into groups based on feature values
227
What kind of machine learning is a decision tree?
Supervised learning
228
What is the goal of a classification tree?
To predict categorical labels
229
What is the difference between a classification tree and a regression tree?
Classification predicts categories, regression predicts continuous values
230
What is a root node in a decision tree?
The first question or split
231
What is a leaf node?
A final classification outcome
232
How do you traverse a decision tree?
Follow left for True, right for False at each node
233
What is impurity in decision trees?
A measure of how mixed the class labels are in a group
234
What does Gini = 0 mean?
The node is pure (only one class)
234
What is Gini impurity?
1 - Σ(pᵢ²) where pᵢ is the proportion of class i
235
Why do we use Gini impurity?
To choose the best feature and threshold to split on
236
How do we compute total Gini for a split?
Weighted average of each branch's Gini
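Gini impurity and the weighted split impurity as plain Python (the class counts are made up):

```python
def gini(counts):
    """Gini impurity 1 − Σ pᵢ² for the class counts in one node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Weighted average of the two branches' Gini impurities."""
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return (n_l / n) * gini(left_counts) + (n_r / n) * gini(right_counts)

print(gini([10, 0]))               # 0.0 — a pure node
print(split_gini([8, 2], [1, 9]))  # 0.25 — impurity of a candidate split
```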
237
How do we choose the first split in a decision tree?
Try all features, pick the one with lowest Gini impurity
238
What if a feature is numeric in a decision tree?
Try thresholds between values (e.g., Age < 15)
239
What’s a good sign that a feature is useful?
It creates more pure (less impure) splits
240
What is overfitting in decision trees?
When the tree fits the training data too closely and fails on new data
241
What is pre-pruning?
Stopping the tree from growing too deep or splitting on small groups
242
What is post-pruning?
Trimming a grown tree to simplify it and reduce overfitting
243
What is alpha (α) in post-pruning?
The cost-complexity parameter that controls tree size
244
How do we choose the best alpha value?
Plot accuracy vs. alpha and pick the best performing point
245
Can decision trees provide feature importance?
Yes, based on how much each feature reduces impurity
245
What algorithm is used to compute feature importance?
CART (Classification and Regression Trees)
246
What are nodes in a network?
Entities like people, authors, or websites
247
What are edges in a network?
Relationships or interactions between nodes
248
What is network topology?
The structure of how nodes and edges are connected
249
What is a social network?
A network where nodes are people and edges represent social relationships
250
Name 3 types of proxy networks.
Retweet network, co-authorship network, citation network
251
What is a proxy network?
A network built from indirect interaction signals like retweets or mentions
252
What kind of edge does a co-occurrence network have?
An edge if two entities appear together in a context
253
What is degree centrality?
Number of edges connected to a node
254
What does high degree centrality mean?
The node is directly connected to many others
255
What is the difference between in-degree and out-degree?
In-degree counts incoming edges; out-degree counts outgoing ones
256
What is eigenvector centrality?
Centrality that considers the importance of neighbouring nodes
257
What does it mean if a node has high eigenvector centrality?
It's connected to many important nodes
258
Is eigenvector centrality better for directed or undirected networks?
Undirected networks
259
What is PageRank?
A version of eigenvector centrality for directed graphs, used by Google
260
What does PageRank consider?
Importance and out-degree of linked nodes
261
What is closeness centrality?
Measures how close a node is to all other nodes via shortest paths
262
What does low closeness centrality indicate?
A node can quickly interact with all others
263
What are the limitations of closeness centrality?
Doesn’t work well in disconnected networks
264
What is betweenness centrality?
Number of shortest paths that pass through a node
264
What does high betweenness centrality mean?
The node acts as a bridge or broker between others
264
What does betweenness centrality help identify?
Nodes that control flow of information
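All of these centralities are one call each in NetworkX (assuming the networkx package; the karate-club graph is a standard toy social network):

```python
import networkx as nx

G = nx.karate_club_graph()  # classic small social network

for name, scores in [("degree", nx.degree_centrality(G)),
                     ("eigenvector", nx.eigenvector_centrality(G)),
                     ("closeness", nx.closeness_centrality(G)),
                     ("betweenness", nx.betweenness_centrality(G)),
                     ("pagerank", nx.pagerank(G))]:
    top = max(scores, key=scores.get)
    print(f"{name}: most central node is {top}")
```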
265
What is unsupervised learning?
A type of machine learning where the model finds patterns in data without labels
266
How does unsupervised learning differ from supervised?
It doesn’t rely on labeled examples
267
What is the main task in unsupervised learning?
Clustering
268
What kind of learning is topic modelling?
Unsupervised learning
269
What is clustering?
Grouping similar data points into clusters
270
What are the steps of clustering?
Measure distance, compare points, group them
271
What does a "cluster" represent?
A group of similar data points
272
What does K in K-means represent?
The number of clusters
273
How does DBSCAN work?
It finds dense areas and expands clusters from them
274
What is hierarchical clustering?
A method that builds a tree-like hierarchy of clusters
274
What’s the difference between agglomerative and divisive clustering?
Agglomerative merges, divisive splits
275
What is hard clustering?
Each point belongs to one cluster
276
What is soft clustering?
Each point can belong to multiple clusters with probabilities
277
Which one is like logistic regression: hard or soft clustering?
Soft clustering
278
What is community detection?
Finding clusters in a network graph
279
What kind of data is used in community detection?
Network or graph data
280
Where is community detection useful?
In social networks and biology
281
What is Euclidean distance?
Straight-line (L2 norm) distance between points
282
What is Manhattan distance?
L1 norm – sum of absolute differences
283
What is Jaccard similarity?
The size of intersection divided by the size of union of two sets
284
What is KL divergence used for?
Comparing two probability distributions
285
What is Jensen-Shannon divergence?
A symmetric and smoother version of KL divergence
286
What two ingredients are needed for clustering?
Data representation and a distance metric
287
Why does clustering give different results on different data?
Different algorithms and representations behave differently
288
What is dimensionality reduction?
Reducing the number of features while preserving important information
289
Why do we reduce dimensionality?
To remove noise, reduce computation, and simplify analysis
290
Name two main approaches to dimensionality reduction.
Feature selection and feature extraction
291
What is feature selection?
Choosing a subset of the original features
292
What are filter methods?
Methods that select features based on simple metrics like variance
293
Give an example of a filter method.
Variance thresholding
294
What are wrapper methods?
Methods that evaluate features by training and comparing models
295
Give an example of a wrapper method.
Forward search or recursive feature elimination
296
What are embedded methods?
Feature selection built into the model (e.g. decision trees)
297
What is feature extraction?
Creating new features by combining the original ones
298
Which dimensionality reduction technique is a feature extraction method?
Principal Component Analysis (PCA)
299
What does PCA do?
Rotates data to align with directions of maximum variance
300
What are principal components?
New axes that are linear combinations of original features
301
Are principal components correlated?
No — they are orthogonal (uncorrelated)
302
What do eigenvalues represent in PCA?
How much variance each component captures
303
What is the first principal component (PC1)?
The direction with the highest variance
303
What do eigenvectors represent in PCA?
The directions (axes) of the new components
304
What is the first step in PCA?
Compute the covariance matrix
305
What is the second step in PCA?
Diagonalise the covariance matrix to get eigenvectors/eigenvalues
306
What is the third step in PCA?
Project the data onto the eigenvectors
307
What do you do with the last components?
Discard them if they have low variance
308
When does PCA work best?
When features are highly correlated
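PCA on two highly correlated features (synthetic data of my own):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])  # correlated features

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # PC1 captures nearly all the variance
```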
309
What does t-SNE stand for?
t-distributed Stochastic Neighbor Embedding
310
What is the goal of t-SNE?
Preserve local structure for visualisation
311
How does t-SNE work?
Matches high-D and low-D distance distributions
312
What type of clustering does t-SNE reveal?
Local clustering (groups of similar points)
313
What are t-SNE's weaknesses?
Distances between clusters are meaningless, and it does not scale well
314
What does UMAP stand for?
Uniform Manifold Approximation and Projection
315
What does UMAP aim to preserve?
Both local and global structure
316
How does UMAP compare to t-SNE?
Faster, more scalable, works for more than 3D
317
What are common uses of UMAP?
Visualising high-dimensional datasets like images or text
318
Why can’t we trust cluster size in t-SNE/UMAP?
Because distance and size are distorted
319
Can t-SNE/UMAP axes be interpreted?
No — the axes have no meaningful scale
320
Should you use t-SNE/UMAP for modelling?
No — they’re meant for visualisation only
321
What is a key benefit of PCA over t-SNE/UMAP?
PCA creates interpretable components with variance explained
322
What does “with enough parameters, you can fit an elephant” mean?
Complex models can match data even if it doesn't make sense
323
What structure does hierarchical clustering produce?
A dendrogram (tree)
324
What is a dendrogram?
A tree diagram showing how clusters are merged or split
325
What are singleton clusters?
Clusters with only one item
326
What is agglomerative clustering?
A bottom-up method that starts with individual points and merges them
326
What is the first step in agglomerative clustering?
Compute pairwise distances between all points
327
Why don’t we try every possible dendrogram?
The number of possibilities grows exponentially
328
What is single linkage?
Distance between the closest members of two clusters
329
What is complete linkage?
Distance between the farthest members of two clusters
330
What is average linkage?
Average distance between all members of the two clusters
331
What is centroid linkage?
Distance between the centroids of two clusters
332
What is Ward’s method?
Merge clusters that minimise total variance
333
What does DBSCAN stand for?
Density-Based Spatial Clustering of Applications with Noise
334
What is the goal of DBSCAN?
Identify dense clusters and mark sparse areas as noise
335
What does eps (ε) mean in DBSCAN?
The radius used to find neighbouring points
336
What does MinPts mean in DBSCAN?
The minimum number of points required to form a dense region
337
What is a core point in DBSCAN?
A point with ≥ MinPts in its ε-neighbourhood
338
What is a border point in DBSCAN?
A point with < MinPts, but in the ε-neighbourhood of a core point
339
What is a noise point in DBSCAN?
A point that is neither a core nor a border point
340
How are points in a cluster connected in DBSCAN?
Through chains of ε-neighbourhoods from core points
341
What shape clusters can DBSCAN find?
Arbitrary shapes (not limited to circles)
342
Does DBSCAN need to know the number of clusters?
No
343
What’s a major limitation of DBSCAN?
It struggles with varying cluster densities
344
How sensitive is DBSCAN to its parameters?
Very - eps and MinPts must be tuned carefully
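A DBSCAN sketch with scikit-learn (eps and min_samples are illustrative and would need tuning):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```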
345
When is hierarchical clustering most useful?
When you want to explore different clusterings at different granularity
346
What kind of algorithm is K-means?
A partitional clustering algorithm
347
What input does K-means require?
The number of clusters, K
348
What is the objective of K-means?
Minimise the sum of squared errors (SSE)
349
What is a centroid in K-means?
The average of all points in a cluster
350
What’s the first step in K-means?
Randomly pick K initial centroids
351
What’s the next step after assigning points in k means?
Recalculate the centroids
352
When does K-means stop?
When centroids or assignments no longer change
353
Name two weaknesses of K-means.
Sensitive to outliers, assumes spherical clusters
354
What preprocessing helps K-means?
Normalising or standardising data
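The K-means cards in one scikit-learn sketch (the blob data and K=3 are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # K-means is distance-based: scale first

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # the SSE that K-means minimises
```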
355
What does GMM stand for?
Gaussian Mixture Model
356
What does GMM assume about clusters?
Each cluster follows a Gaussian distribution
357
What does GMM use for assignment?
Soft assignment (probabilities)
358
What function does GMM maximise?
Log-likelihood
359
What cluster shape does K-means assume, and what does GMM assume?
Spherical for K-means; elliptical for GMM.
360
Which uses soft assignment, and which hard: K-means or GMM?
K-means: hard; GMM: soft.
361
Why is cluster validation important?
Clustering has no ground truth labels
362
What are the three types of validation?
External, internal, and relative
363
What is internal validation based on?
Measure whether points that should be close to/far from each other really are close/far.
364
What is external validation based on?
Measure how the clustering labels compare to externally supplied class labels.
365
What is relative validation based on?
Compare one clustering to another, to see if they agree.
366
What does a silhouette score of 1 mean?
Perfect separation and compactness
366
What does a silhouette score of -1 mean?
Bad clustering — point likely misclassified
367
What is the input and output of a computer vision system?
Input: Image; Output: High-level information like object detection or 3D reconstruction.
368
How is a color image represented in a matrix format?
As three matrices for Red, Green, and Blue channels.
369
Name three tasks computer vision systems can perform.
Object detection, image segmentation, 3D reconstruction.
370
What biological system inspired deep neural networks?
The visual cortex of the brain.
371
What is a perceptron?
A neuron-like model that applies weights to inputs, sums them, and applies an activation function.
372
Why do standard MLPs not scale well to large images?
Because each node needs too many weights, e.g., 10,000 for a 100×100 image.
373
What is a convolution in CNNs?
It’s the process of sliding a filter over the image and computing dot products to detect patterns.
374
What is a filter (or kernel) in CNN?
A small matrix used in CNNs to detect features like edges or textures.
375
What is a feature map?
The output matrix showing where a filter detected its pattern in the input image.
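A hand-rolled 2D convolution sketch (as in CNNs, this is really cross-correlation; the edge filter and image are made up):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel and take dot products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1, -1]])          # responds to horizontal intensity changes
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1]])
print(convolve2d(img, edge))        # feature map: strong response at the edge
```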
376
Why are CNNs better than MLPs for vision?
Fewer parameters, local pattern detection, and translation invariance.
377
Why is deep learning considered data-hungry?
It needs large amounts of data to generalize well.
378
What makes deep learning expensive?
High computational requirements, often needing GPUs and cloud computing.
379
Why is deep learning hard to interpret?
Neural networks function as "black boxes" with complex internal representations.
380
Can deep learning models represent uncertainty well?
No, they are easily overconfident and can be tricked.
381
What are some challenges in training deep learning models?
Choosing the right architecture, learning rate, and avoiding overfitting.
382
What is the purpose of ReLU in a CNN?
To add non-linearity by setting all negative values to zero.
383
What does ReLU(x) return?
max(0, x)
384
Why do we use pooling in CNNs?
To reduce the spatial dimensions and computation cost while keeping key features.
385
What is max pooling?
Taking the maximum value in a small patch of the feature map.
386
What is average pooling?
Taking the average of the values in a small patch.
387
What is padding in a CNN?
Adding extra borders (usually zeros) to retain image size after convolution.
388
What is stride in convolutional layers?
The step size of the filter as it moves across the image.
389
How does increasing stride affect feature maps?
It reduces their size faster, but may lose detail.
390
What’s a drawback of the tanh activation function?
Its derivatives can vanish, making learning slow.
391
What type of function is used at the output layer in classification tasks?
Softmax.
392
What does the softmax function output?
Probabilities for each class.
393
What’s a drawback of the relu activation function?
It discards all negative values.
394
What happens if the learning rate is too high?
The model may overshoot and not converge.
395
What happens if the learning rate is too low?
Learning is very slow or may get stuck.
396
What is momentum in optimisation?
A technique that helps accelerate gradient descent by smoothing updates.
397
What is a hyperparameter?
A setting defined before training that controls model structure or learning.
398
Give examples of CNN hyperparameters.
Filter size, stride, padding, learning rate, activation function, epochs, batch size.
399
What is the role of batch size in training?
It determines how many examples are processed at once in each training step.