Learning From Data Flashcards

(407 cards)

1
What is Structured Data?
Data that is organised in a predefined schema, for example data stored in a Relational Database.
2
What is Semi-Structured Data?
Data that has some structure, but not in fixed rows or columns. For example: JSON, XML or NoSQL databases.
3
What is Unstructured Data?
Data without a predefined structure, which doesn’t fit neatly in tables. For example: text, images, videos or PDFs.
4
What is Data Integration?
The practice of combining data from different sources into a single, coherent data store.
5
What is Common User Interface?
Manual data integration through a single user interface; controlled, but not scalable.
6
What is Middleware Data Integration?
Uses middleware software to bridge and facilitate communication between different systems; consistent, but needs maintenance.
7
What is Application-based integration?
Software applications locate, retrieve and integrate data by making data from different sources and systems compatible with one another.
8
What is uniform data access?
Virtual integration: provides a consistent view of data from diverse sources without moving or altering it, keeping the data in its original location. Lighter-weight, but may affect integrity.
9
What is Common data storage?
Retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository. Well suited to analysis, but storage is expensive.
10
What is the difference between supervised and unsupervised learning?
Supervised learning uses data with labelled outcomes, while unsupervised learning algorithms use data without labelled outcomes.
11
What is supervised learning?
Learning a mapping function from inputs x to outputs y, where x is the features and y is a label or target.
12
What is unsupervised learning?
Learning that aims to make sense of data and uncover patterns, without labelled outcomes.
13
In supervised learning, what are features?
The input variables (columns in the dataset).
14
What are model parameters?
Values that the model learns (e.g., coefficients in linear regression).
15
What’s the difference between ŷ and y?
ŷ is the model’s prediction; y is the true observed value.
16
When is Regression used in Supervised Learning?
When the target is a quantitative value.
17
When is Classification used in Supervised Learning?
When the target is qualitative or a class.
18
What is the equation for β₁ (the slope) in ŷ = β₁x + β₀ for OLS linear regression?
β₁ = Σ((xᵢ - x̄)(yᵢ - ȳ)) / Σ((xᵢ - x̄)²)
19
What is the equation for β₀ (the intercept) in ŷ = β₁x + β₀ for OLS linear regression?
β₀ = ȳ − β₁x̄
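These two closed-form estimates as a minimal NumPy sketch (my own illustration, not part of the deck):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS fit for y ≈ β₁x + β₀."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta1, beta0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(ols_fit(x, y))  # slope ≈ 2.03, intercept ≈ 0.0
```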
20
What is the error in OLS regression?
The difference between actual value (y) and predicted value (ŷ): Error = y − ŷ
21
What is the Sum of Squared Errors (SSE)?
The sum of all squared differences between predicted and actual values.
22
How is Mean Squared Error (MSE) calculated?
MSE = SSE divided by the number of observations.
23
What does R² (Coefficient of Determination) tell us?
The proportion of total variation in the dependent variable explained by the model.
24
What is the goal of Ordinary Least Squares (OLS)?
To minimize the sum of squared prediction errors.
25
What is a prediction error in linear regression?
The difference between the predicted value (ŷ) and the actual value (y), i.e., ŷ − y.
26
What is the Total Sum of Squares (TSS)?
The total variation of the observed values around their mean: TSS = Σ(yᵢ − ȳ)².
27
What is the Explained Sum of Squares (ESS)?
The part of the total variation explained by the regression model: ESS = Σ(ŷᵢ − ȳ)².
28
What is the Residual Sum of Squares (RSS)?
The sum of the squared residuals (errors), i.e., the unexplained variation: RSS = Σ(yᵢ − ŷᵢ)².
29
How is R² calculated in regression?
R² = 1 − (RSS / TSS)
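A small NumPy sketch tying RSS and TSS to R² (illustrative, not from the deck):

```python
import numpy as np

def r_squared(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation
    tss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1 - rss / tss                # proportion of variation explained
```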
30
What is the objective of OLS linear regression?
To find a line that minimizes the prediction error across all data points.
31
Why do we take partial derivatives in OLS?
To find the minimum of the squared error function.
32
What are the normal equations in OLS linear regression?
They are the two equations you get when you take derivatives of the error function and set them to 0. Solving them gives you β₀ and β₁.
33
What is the goal of a predictive model?
To generate accurate forecasts for future or unseen data.
34
Give an example of using regression for prediction.
Predicting whether a customer will default on a loan.
35
Give an example of using regression for interpretation.
Understanding how marketing spend affects sales revenue.
36
What risk comes from focusing only on prediction?
You might create a black-box model that lacks explainability
37
What should you always define before training a model?
The cost function you want to minimize.
38
What is the benefit of trying multiple models?
To compare them and choose the best based on performance metrics.
39
What are hyperparameters in model training?
Settings that control the learning process but are not learned from the data.
40
What’s a good workflow for modeling?
Choose cost function → Train multiple models → Compare → Interpret or predict.
41
What’s the main reason for using polynomial regression?
To model nonlinear relationships in data by adding higher-order terms of features.
42
Why is polynomial regression still considered linear?
Because it is linear in the parameters (weights), even if the features are nonlinear.
43
What is BIC (Bayes Information Criterion) used for?
To select the best model by balancing complexity and error.
44
What is the BIC formula?
BIC = n·ln(SSE) + p·ln(n) − n·ln(n), where p is the number of parameters, n the number of observations, and SSE the sum of squared errors.
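A direct transcription of this formula (a sketch; it assumes SSE > 0 so the log is defined):

```python
import numpy as np

def bic(sse, n, p):
    """BIC = n·ln(SSE) + p·ln(n) − n·ln(n); lower is better."""
    return n * np.log(sse) + p * np.log(n) - n * np.log(n)
```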
45
What does a lower BIC indicate?
A better trade-off between model complexity and goodness of fit.
46
What is overfitting?
When a model is too complex and captures noise in the data, it has a low training error but a high validation error.
47
What’s the purpose of splitting data into training and testing sets?
To train the model on one part and test it on unseen data to evaluate generalisation.
48
What does train_test_split(X, y, test_size=0.4) do?
Randomly splits 40% of the data into a test set and 60% into a training set.
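For example, in scikit-learn (the toy dataset here is my own choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_train.shape, X_test.shape)  # (60, 20) (40, 20)
```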
49
Why is k-fold CV better than a single train/test split?
It gives a more reliable estimate of model performance and reduces variance in the evaluation.
50
What is k-fold cross-validation?
A method that splits data into k subsets, trains on k–1, tests on the 1 left out, and averages results over k rounds.
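A minimal 5-fold cross-validation sketch with scikit-learn (the model and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # performance averaged over the 5 folds
```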
51
What is stratified sampling in machine learning?
Sampling technique that maintains the same distribution of classes in training and test sets as in the full dataset.
52
What is underfitting?
When a model is too simple to capture the data’s patterns; both training and validation error are high.
53
Define bias in a model.
Bias is the error resulting from incorrect assumptions in the learning algorithm, leading to underfitting.
54
Define variance in a model.
Variance is the error resulting from sensitivity to small fluctuations in the training set, leading to overfitting.
55
What are the three sources of model error?
Bias, variance, and irreducible error (noise).
56
What is Irreducible Error?
Unavoidable errors due to randomness in data. Present even in the best model.
57
What does increasing the model complexity cause?
Lowers bias but raises variance.
58
What does decreasing the model complexity cause?
Raises bias and lowers variance.
59
What is the Bias-Variance tradeoff?
The goal is to find the sweet spot where the total error is minimised, balancing bias and variance.
60
What does regularisation do?
It adds a penalty to large model weights to reduce overfitting and control complexity.
61
What is LASSO Regression?
A regularisation method that uses the sum of the absolute value of the coefficients (L1 norm). It shrinks and eliminates some coefficients (feature selection).
62
What is Ridge Regression?
A regularisation method using the sum of the squared weights (L2 norm). It shrinks coefficients but does not set any to zero.
63
What’s the key difference between Ridge and LASSO regression?
LASSO can eliminate features by setting coefficients to zero; Ridge cannot.
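A sketch of that difference with scikit-learn (synthetic data where only the first feature matters; the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 is relevant

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all coefficients shrunk, none exactly zero
print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients driven to zero
```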
64
What is the L1 norm?
The sum of the absolute values of the coefficients; used in LASSO.
65
What is the L2 norm?
The square root of the sum of squares of the coefficients; used in Ridge regression.
66
When should you use regularisation?
When your model is overfitting due to high variance.
67
What is Feature Selection?
Reducing the number of features in the equation, which can prevent overfitting.
68
What is data cleaning?
The process of detecting and correcting corrupt, inaccurate, or incomplete data before analysis.
69
What are missing values?
Expected data values that are absent, often shown as NaN, None, N/A, etc.
70
What causes missing values?
Human error, skipped questions, sensor failure, database issues, and more.
71
What are the three types of missing data?
MCAR (completely random), MAR (related to observed data), MNAR (related to missing data itself).
72
What is MCAR data?
When data is missing purely by chance; the probability of a missing value is equal for all units.
73
What is MAR data?
When some data objects are more likely to have missing values; the probability of a value being missing is related to the observed data, but not to the missing data itself.
74
What is MNAR data?
When missingness depends on the missing values themselves; the probability of a value being missing is related to the actual (unobserved) data.
75
What are the four approaches to dealing with missing data?
Keep as is, remove rows, remove columns, impute values.
76
When is keeping the data as is with missing values useful?
When sharing data with others, and when algorithms can handle missing values.
77
When is removing rows with missing data useful?
With MCAR data, and only when other strategies don't work.
78
When is removing columns useful with missing data?
When there is a large amount of data missing e.g. 25%, and it is not a critical column.
79
When is imputing values useful with missing data?
To minimize analytical bias. For MCAR use mean/median/mode; for MAR use a central tendency of a relevant data subgroup; for MNAR use regression analysis.
80
What is an outlier?
A data point that significantly deviates from other observations.
81
How can we detect outliers?
Using the IQR method, where IQR is Q3-Q1: values outside Q1 − 1.5×IQR or Q3 + 1.5×IQR.
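The IQR rule as a small NumPy sketch (my own illustration):

```python
import numpy as np

def iqr_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lo) | (x > hi)]

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 50])
print(iqr_outliers(x))  # [50]
```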
82
What should we do with outliers?
Leave them, cap them, log transform, or remove them (last resort).
83
When should you leave an outlier?
When the modelling used is robust against outliers, or detecting outliers is the goal.
84
When should you cap an outlier?
When analysis is sensitive to outliers.
85
When should you log transform an outlier?
When data is skewed, so some objects significantly deviate from the majority.
86
When should you remove data objects with outliers?
As a last resort, when other methods are inapplicable.
87
What are random and systematic errors?
Random ones are unavoidable fluctuations in data. Systematic ones are consistent repeatable errors that could be associated with the data source.
88
What is data standardisation?
Rescaling data to have mean 0 and standard deviation 1. For every feature, subtract the mean and divide by SD.
89
What is data normalisation?
Rescaling data to a [0, 1] range using min-max scaling. For every feature, subtract the minimum, and divide by the full range.
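Both rescalings in a few lines of NumPy (scikit-learn's StandardScaler and MinMaxScaler do the same job):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

standardised = (x - x.mean()) / x.std()           # mean 0, SD 1
normalised = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
print(standardised, normalised)
```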
90
When should we use log transformation?
For skewed data, or data that spans several orders of magnitude. For every feature, log it.
91
What is discretisation?
Converting continuous features into categorical by binning values.
92
What is smoothing?
Reducing noise in data using techniques like moving average.
93
What is moving average for smoothing?
Averaging data points in successive subsets.
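A moving average in one NumPy call (window size is illustrative):

```python
import numpy as np

x = np.array([1, 2, 8, 3, 4])
window = 3
smoothed = np.convolve(x, np.ones(window) / window, mode="valid")
print(smoothed)  # [3.67 4.33 5.0] — each value averages 3 successive points
```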
94
What are bar charts good for?
Categories
95
What are Line plots good for?
Trends
96
What are scatter plots good for?
To see relationships between two or more variables
97
What are heatmaps good for?
Visualising matrices or correlations
98
What is the goal of classification in machine learning?
To predict categorical outcomes (class labels) using input data.
99
What type of learning is classification?
Supervised learning; it is trained on labelled examples.
100
What is the typical train-test data split ratio?
80% training, 20% testing.
101
What does logistic regression model?
The probability of a binary outcome based on one or more features.
102
What is the output range of logistic regression?
Between 0 and 1 (probability).
103
What function does logistic regression use to map outputs?
The sigmoid function: σ(x) = 1 / (1 + e^(−x))
104
What kind of input/output does logistic regression handle?
Continuous input, binary output.
105
How do you evaluate a logistic regression model?
Using accuracy (correct predictions / all predictions), not R².
106
What is the logistic regression decision boundary typically set at?
0.5, meaning predict class 1 if probability > 0.5.
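A tiny sketch of the sigmoid plus the 0.5 threshold (w and b are illustrative, pre-trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x, threshold=0.5):
    p = sigmoid(w @ x + b)   # probability of class 1
    return int(p > threshold)

print(predict(np.array([2.0, -1.0]), 0.5, np.array([1.0, 0.5])))  # 1
```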
107
What is the Perceptron?
The first algorithm to classify data using a linear decision boundary.
108
What activation function does the Perceptron use?
A step function which outputs either 0 or 1.
109
What is the forward pass equation for the Perceptron?
ŷ = f(w·x + b), where w is the weight vector, x the feature vector, b the bias term, and f the step function.
110
What is the weight update rule in a perceptron?
w ← w + α(y − ŷ)x, where w is the weight vector, α the learning rate (a small positive number), y the true label, ŷ the predicted label, and x the input vector.
111
When are weights updated in perceptron training?
Only when the prediction is incorrect.
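The full training loop as a sketch (the learning rate and epoch count are illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """Perceptron rule: weights change only on misclassified examples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(w @ x_i + b > 0)   # step activation
            w += lr * (y_i - y_hat) * x_i  # zero update when prediction is correct
            b += lr * (y_i - y_hat)
    return w, b
```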
112
What is a major limitation of the Perceptron?
It can only solve problems that are linearly separable.
113
What problem does MLP solve?
It can handle non-linear data by adding hidden layers.
114
What is the role of hidden layers in MLP?
They allow the model to learn complex, non-linear relationships.
115
What does MLP stand for?
Multi-Layer Perceptron.
116
What kind of function is used by MLP instead of the step function?
Usually sigmoid, tanh, or ReLU functions.
117
What is gradient descent?
An algorithm for finding the minimum of a loss function by moving in the direction of steepest descent.
118
What is the main goal of gradient descent?
To minimise the loss (error) of a model by adjusting its parameters.
119
What is a loss function?
A function that quantifies how far off a model's predictions are from the true values.
120
What is the L2 loss function in machine learning?
The square root of the sum of squared differences between predicted and true values.
121
What is a gradient?
A vector of partial derivatives that points in the direction of greatest increase in a function.
122
How does gradient descent update parameters?
θ ← θ − α·∇J(θ), where α is the learning rate, θ the parameters, and ∇J(θ) the gradient of the loss function.
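The update rule in action on a one-parameter loss (a toy example of my own):

```python
# Minimise J(θ) = (θ − 3)², whose gradient is ∇J(θ) = 2(θ − 3).
theta, alpha = 0.0, 0.1
for _ in range(100):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad   # θ ← θ − α·∇J(θ)
print(theta)  # ≈ 3, the minimiser
```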
123
What does the learning rate control in gradient descent?
The size of the step taken toward the minimum during each update.
124
What’s the difference between stochastic and deterministic gradient descent?
Stochastic uses randomness in updates (e.g. using random samples); deterministic does not.
125
What is the confusion matrix used for?
To summarize the performance of a classification model by comparing predictions and actual values.
126
What are true positives (TP)?
Cases where the model correctly predicted the positive class.
127
What are false positives (FP)?
Cases where the model incorrectly predicted the positive class.
128
What are false negatives (FN)?
Cases where the model missed the actual positive class.
129
What are true negatives (TN)?
Cases where the model correctly predicted the negative class.
130
What is precision?
True positives / Total predicted positives. How many of the predicted positives were correct.
131
What is recall?
True positives / Total actual positives. How many of the actual positives were correctly predicted.
132
What is specificity?
True negatives / Total actual negatives. How many of the actual negatives were correctly predicted.
133
What is the F1 score?
F1 = 2 × (Precision × Recall) / (Precision + Recall)
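These three metrics from raw confusion-matrix counts (the counts are made up for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
print(p, r, f1(p, r))  # 0.8, 0.667, 0.727
```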
134
What does the ROC curve show?
It shows how true positive rate and false positive rate vary with decision threshold.
135
What is AUC?
Area under the ROC curve. It summarises classifier quality over all thresholds.
136
What does an AUC of 1.0 mean?
Perfect classifier.
137
What does an AUC of 0.5 mean?
Random guessing.
138
Why might high accuracy be misleading?
Because it doesn't account for class imbalance — a model could guess the majority class and still score high.
139
What trade-off does the ROC curve help you visualise?
The trade-off between sensitivity (TP rate) and specificity (1 - FP rate).
140
What does NLP stand for?
Natural Language Processing — how computers understand and analyse human language.
141
What does a bag-of-words model represent?
A document as a vector of word frequencies, ignoring grammar or word order.
142
What is tokenisation?
Splitting text into individual words (tokens).
143
What is stemming?
Reducing words to their root form by chopping off endings (e.g., "playing" → "play").
144
What is lemmatisation?
Reducing words to their dictionary base form using vocabulary (e.g., "better" → "good").
145
Why do we use stemming or lemmatisation?
To group together different forms of the same word for better text analysis.
146
What is a sparse matrix in NLP?
A matrix where most values are zero, since each document contains only a small subset of all possible words.
147
What is TF-IDF?
A statistic that measures how important a word is in a document relative to all documents.
148
What does TF stand for in TF-IDF?
Term Frequency — how often a word appears in a document.
149
What does IDF stand for in TF-IDF?
Inverse Document Frequency — how rare a word is across all documents.
150
Why do we use the IDF component in TF-IDF?
To downweight common words and upweight rare but meaningful ones.
151
Why is TF-IDF better than raw word counts?
It highlights words that are unique and informative in each document.
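A quick TF-IDF example with scikit-learn (the documents are my own toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # sparse document-term matrix
print(vec.get_feature_names_out())
print(X.toarray().round(2))      # rare, document-specific words score highest
```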
152
What is text classification?
The task of assigning categories to text, e.g. spam vs not spam
153
Why is tokenisation harder in some languages like Vietnamese?
Words may not be separated clearly by spaces
154
What is a corpus in NLP?
The collection of all text data being used
155
What does a high TF-IDF score indicate?
That a word is frequent in a document but rare elsewhere
156
What can vector representations of text be used for?
Classification, clustering, similarity analysis
157
What are bigrams and trigrams?
Word pairs and triples, e.g., “burn fire”, “black night sky”
158
What is a document-term matrix?
A matrix where each row is a document and each column is a word, with values indicating presence, count, or TF-IDF.
159
Why are sparse vectors problematic?
They are high-dimensional and hard to cluster.
160
What is Latent Semantic Indexing (LSI)?
An early method for topic modelling using linear algebra.
161
What is topic modelling?
Unsupervised technique to reduce dimensionality and extract hidden topics from documents.
162
What is Latent Dirichlet Allocation (LDA)?
A probabilistic method that models each document as a mixture of topics.
163
Give a real-world use case of topic modelling.
Automatically summarising what an article is about.
164
What is a downside of topic modelling?
It struggles with short texts and unseen documents.
165
What is a word embedding?
A dense vector representation of a word capturing its context and meaning.
166
How are word embeddings trained?
Using neural networks to predict a word from context (or vice versa).
167
What is the Skip-Gram model?
Predicts context words given a target word.
168
What is the CBOW model?
Predicts a target word given context words.
169
Why are dense vectors better than sparse ones?
They are low-dimensional and capture semantics.
170
What does it mean if two word vectors are close together?
The words have similar meanings or usage.
171
How do embeddings capture relationships?
Vector arithmetic can reveal analogies (e.g., king - man + woman ≈ queen).
172
What is polysemy?
When a word has multiple meanings
173
What are sense embeddings?
Vectors that represent specific senses of a word, not just the word itself.
174
What is a sentence embedding?
A vector that represents the meaning of a full sentence.
175
Name three applications of sentence embeddings.
Semantic similarity, translation, summarisation.
176
How are sentence embeddings built?
Using deeper neural networks like transformers.
177
What is the core idea behind sentence embeddings?
Use hidden layer vectors from a trained language model.
178
How are embeddings used in search engines?
To find documents similar in meaning to a query.
179
How do embeddings help with clustering?
Similar documents get grouped based on vector proximity.
180
What is the challenge with training sense embeddings?
Requires labelled data for each sense of a word.
181
How can embeddings work across languages?
By mapping similar words from different languages to nearby vectors.
182
What type of machine learning algorithm is KNN?
Supervised learning algorithm
183
What is KNN mostly used for?
Classification
184
What does KNN assume about data?
Similar data points exist close to each other in feature space
185
What metric does KNN use to measure similarity?
Distance (usually Euclidean)
186
What is the role of the parameter k in KNN?
It determines how many nearest neighbours to consider
187
How is the class of a new point predicted in KNN?
By the majority class of its k nearest neighbours
188
What kind of data does KNN store?
The entire training set
189
Why can outliers be a problem for KNN?
They can wrongly influence classification due to being close by accident
190
What is class imbalance in KNN?
When one class has many more examples than another, biasing results
191
How can we reduce the effect of class imbalance in KNN?
Use weighted KNN where closer neighbours have more influence
192
What happens if k is too small in KNN?
The model becomes too sensitive to noise (overfits)
193
What happens if k is too large in KNN?
It smooths out local patterns (underfits)
194
What is a common rule-of-thumb for choosing k?
Set k ≈ √(number of training samples)
195
Why should k be odd in binary classification?
To avoid ties between classes
196
What is weighted KNN?
A version where closer neighbours have more impact on the classification
197
What does weighted KNN use to assign importance to neighbours?
The inverse of the distance
198
Is KNN a model-based or instance-based algorithm?
Instance-based
199
How does KNN handle new data points?
It computes distance from the new point to all training points
200
Why is KNN slow with large datasets?
It compares against every data point at prediction time
201
What should be done to data before applying KNN, and why?
Normalisation, to ensure fair distance calculation
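Putting the KNN cards together in scikit-learn (the dataset and k are illustrative): scale first, then use distance-weighted neighbours:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale for fair distances, then weighted KNN: closer neighbours count more.
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
print(knn.fit(X_train, y_train).score(X_test, y_test))
```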
202
What does SVM stand for?
Support Vector Machine
203
Is SVM supervised or unsupervised?
Supervised
204
What types of tasks can SVM be used for?
Classification, regression, and clustering
205
What is the main goal of an SVM?
To find the optimal hyperplane that separates data points with the largest margin.
206
What is the margin in SVM?
The shortest distance between the decision boundary and the closest data points
207
What are support vectors in SVM?
The data points closest to the decision boundary that define it
208
What happens if the margin is maximised in SVM?
The classifier becomes more confident and generalises better.
209
Why is SVM sensitive to outliers?
Because a single outlier can affect the decision boundary significantly
210
What is a soft margin in SVM?
A margin that allows some misclassification to improve generalisation
211
What trade-off does soft margin involve in SVM?
Bias vs variance
212
What happens if the data is not linearly separable in SVM?
It is mapped to a higher-dimensional space where it may become linearly separable
213
Give an example of a transformation that helps linear separation in SVM.
f(x) = x²
214
What is the shape of the decision boundary in the transformed space in SVM?
A hyperplane
215
What is the kernel trick?
A way to compute dot products in high-dimensional space without explicitly transforming the data
216
Why is the kernel trick useful in SVM?
It reduces computational cost and makes SVMs feasible in high-dimensional spaces
217
What does a kernel function compute in SVM?
The dot product of two transformed vectors
218
Give the formal definition of a kernel function in SVM.
k(x, z) = ⟨f(x), f(z)⟩, where f maps input to feature space
219
What kind of result does a kernel function return in SVM?
A real number
220
What does the kernel matrix contain in SVM?
Pairwise similarity scores between all points in the dataset
221
What does the C parameter control in SVM?
The trade-off between maximising the margin and minimising classification error
222
What does gamma control in an SVM?
The complexity of the decision boundary (low gamma = smoother boundary)
223
What happens with very high gamma in SVM?
Model overfits (too complex)
224
What are the key components of SVM?
Support vectors, margin, hyperplane, kernel trick
225
Why is SVM powerful for classification?
It builds robust boundaries and handles non-linearity using kernel methods
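An RBF-kernel SVM on data that is not linearly separable (the C and gamma values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

# C trades margin width against misclassification; gamma sets boundary complexity.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.support_vectors_.shape)  # the points that define the boundary
```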
226
What is a decision tree?
A tree-like model that splits data into groups based on feature values
227
What kind of machine learning is a decision tree?
Supervised learning
228
What is the goal of a classification tree?
To predict categorical labels
229
What is the difference between a classification tree and a regression tree?
Classification predicts categories, regression predicts continuous values
230
What is a root node in a decision tree?
The first question or split
231
What is a leaf node?
A final classification outcome
232
How do you traverse a decision tree?
Follow left for True, right for False at each node
233
What is impurity in decision trees?
A measure of how mixed the class labels are in a group
234
What does Gini = 0 mean?
The node is pure (only one class)
234
What is Gini impurity?
1 - Σ(pᵢ²) where pᵢ is the proportion of class i
235
Why do we use Gini impurity?
To choose the best feature and threshold to split on
236
How do we compute total Gini for a split?
Weighted average of each branch's Gini
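Gini impurity and the weighted split impurity as plain Python (the class counts are made up):

```python
def gini(counts):
    """Gini impurity 1 − Σ pᵢ² for the class counts in one node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Weighted average of the two branches' Gini impurities."""
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return (n_l / n) * gini(left_counts) + (n_r / n) * gini(right_counts)

print(gini([10, 0]))               # 0.0 — a pure node
print(split_gini([8, 2], [1, 9]))  # 0.25 — impurity of a candidate split
```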
237
How do we choose the first split in a decision tree?
Try all features, pick the one with lowest Gini impurity
238
What if a feature is numeric in a decision tree?
Try thresholds between values (e.g., Age < 15)
239
What’s a good sign that a feature is useful?
It creates more pure (less impure) splits
240
What is overfitting in decision trees?
When the tree fits the training data too closely and fails on new data
241
What is pre-pruning?
Stopping the tree from growing too deep or splitting on small groups
242
What is post-pruning?
Trimming a grown tree to simplify it and reduce overfitting
243
What is alpha (α) in post-pruning?
The cost-complexity parameter that controls tree size
244
How do we choose the best alpha value?
Plot accuracy vs. alpha and pick the best performing point
245
Can decision trees provide feature importance?
Yes, based on how much each feature reduces impurity
245
What algorithm is used to compute feature importance?
CART (Classification and Regression Trees)
246
What are nodes in a network?
Entities like people, authors, or websites
247
What are edges in a network?
Relationships or interactions between nodes
248
What is network topology?
The structure of how nodes and edges are connected
249
What is a social network?
A network where nodes are people and edges represent social relationships
250
Name 3 types of proxy networks.
Retweet network, co-authorship network, citation network
251
What is a proxy network?
A network built from indirect interaction signals like retweets or mentions
252
What kind of edge does a co-occurrence network have?
An edge if two entities appear together in a context
253
What is degree centrality?
Number of edges connected to a node
254
What does high degree centrality mean?
The node is directly connected to many others
255
What is the difference between in-degree and out-degree?
In-degree counts incoming edges; out-degree counts outgoing ones
256
What is eigenvector centrality?
Centrality that considers the importance of neighbouring nodes
257
What does it mean if a node has high eigenvector centrality?
It's connected to many important nodes
258
Is eigenvector centrality better for directed or undirected networks?
Undirected networks
259
What is PageRank?
A version of eigenvector centrality for directed graphs, used by Google
260
What does PageRank consider?
Importance and out-degree of linked nodes
261
What is closeness centrality?
Measures how close a node is to all other nodes via shortest paths
262
What does low closeness centrality indicate?
A node can quickly interact with all others
263
What are the limitations of closeness centrality?
Doesn’t work well in disconnected networks
264
What is betweenness centrality?
Number of shortest paths that pass through a node
264
What does high betweenness centrality mean?
The node acts as a bridge or broker between others
264
What does betweenness centrality help identify?
Nodes that control flow of information
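All of these centralities are one call each in NetworkX (assuming the networkx package; the karate-club graph is a standard toy social network):

```python
import networkx as nx

G = nx.karate_club_graph()  # classic small social network

for name, scores in [("degree", nx.degree_centrality(G)),
                     ("eigenvector", nx.eigenvector_centrality(G)),
                     ("closeness", nx.closeness_centrality(G)),
                     ("betweenness", nx.betweenness_centrality(G)),
                     ("pagerank", nx.pagerank(G))]:
    top = max(scores, key=scores.get)
    print(f"{name}: most central node is {top}")
```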
265
What is unsupervised learning?
A type of machine learning where the model finds patterns in data without labels
266
How does unsupervised learning differ from supervised?
It doesn’t rely on labeled examples
267
What is the main task in unsupervised learning?
Clustering
268
What kind of learning is topic modelling?
Unsupervised learning
269
What is clustering?
Grouping similar data points into clusters
270
What are the steps of clustering?
Measure distance, compare points, group them
271
What does a "cluster" represent?
A group of similar data points
272
What does K in K-means represent?
The number of clusters
273
How does DBSCAN work?
It finds dense areas and expands clusters from them
274
What is hierarchical clustering?
A method that builds a tree-like hierarchy of clusters
274
What’s the difference between agglomerative and divisive clustering?
Agglomerative merges, divisive splits
275
What is hard clustering?
Each point belongs to one cluster
276
What is soft clustering?
Each point can belong to multiple clusters with probabilities
277
Which one is like logistic regression: hard or soft clustering?
Soft clustering
278
What is community detection?
Finding clusters in a network graph
279
What kind of data is used in community detection?
Network or graph data
280
Where is community detection useful?
In social networks and biology
281
What is Euclidean distance?
Straight-line (L2 norm) distance between points
282
What is Manhattan distance?
L1 norm – sum of absolute differences
283
What is Jaccard similarity?
The size of intersection divided by the size of union of two sets
284
What is KL divergence used for?
Comparing two probability distributions
285
What is Jensen-Shannon divergence?
A symmetric and smoother version of KL divergence
286
What two ingredients are needed for clustering?
Data representation and a distance metric
287
Why does clustering give different results on different data?
Different algorithms and representations behave differently
288
What is dimensionality reduction?
Reducing the number of features while preserving important information
289
Why do we reduce dimensionality?
To remove noise, reduce computation, and simplify analysis
290
Name two main approaches to dimensionality reduction.
Feature selection and feature extraction
291
What is feature selection?
Choosing a subset of the original features
292
What are filter methods?
Methods that select features based on simple metrics like variance
293
Give an example of a filter method.
Variance thresholding
294
What are wrapper methods?
Methods that evaluate features by training and comparing models
295
Give an example of a wrapper method.
Forward search or recursive feature elimination
296
What are embedded methods?
Feature selection built into the model (e.g. decision trees)
297
What is feature extraction?
Creating new features by combining the original ones
298
Which dimensionality reduction technique is a feature extraction method?
Principal Component Analysis (PCA)
299
What does PCA do?
Rotates data to align with directions of maximum variance
300
What are principal components?
New axes that are linear combinations of original features
301
Are principal components correlated?
No — they are orthogonal (uncorrelated)
302
What do eigenvalues represent in PCA?
How much variance each component captures
303
What is the first principal component (PC1)?
The direction with the highest variance
303
What do eigenvectors represent in PCA?
The directions (axes) of the new components
304
What is the first step in PCA?
Compute the covariance matrix
305
What is the second step in PCA?
Diagonalise the covariance matrix to get eigenvectors/eigenvalues
306
What is the third step in PCA?
Project the data onto the eigenvectors
307
What do you do with the last components?
Discard them if they have low variance
308
When does PCA work best?
When features are highly correlated
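PCA on two highly correlated features (synthetic data of my own):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])  # correlated features

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # PC1 captures nearly all the variance
```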
309
What does t-SNE stand for?
t-distributed Stochastic Neighbor Embedding
310
What is the goal of t-SNE?
Preserve local structure for visualisation
311
How does t-SNE work?
Matches high-D and low-D distance distributions
312
What type of clustering does t-SNE reveal?
Local clustering (groups of similar points)
313
What are t-SNE's weaknesses?
Distances between clusters are meaningless, and it does not scale well
314
What does UMAP stand for?
Uniform Manifold Approximation and Projection
315
What does UMAP aim to preserve?
Both local and global structure
316
How does UMAP compare to t-SNE?
Faster, more scalable, works for more than 3D
317
What are common uses of UMAP?
Visualising high-dimensional datasets like images or text
318
Why can’t we trust cluster size in t-SNE/UMAP?
Because distance and size are distorted
319
Can t-SNE/UMAP axes be interpreted?
No — the axes have no meaningful scale
320
Should you use t-SNE/UMAP for modelling?
No — they’re meant for visualisation only
321
What is a key benefit of PCA over t-SNE/UMAP?
PCA creates interpretable components with variance explained
322
What does “with enough parameters, you can fit an elephant” mean?
Complex models can match data even if it doesn't make sense
323
What structure does hierarchical clustering produce?
A dendrogram (tree)
324
What is a dendrogram?
A tree diagram showing how clusters are merged or split
325
What are singleton clusters?
Clusters with only one item
326
What is agglomerative clustering?
A bottom-up method that starts with individual points and merges them
326
What is the first step in agglomerative clustering?
Compute pairwise distances between all points
327
Why don’t we try every possible dendrogram?
The number of possibilities grows exponentially
328
What is single linkage?
Distance between the closest members of two clusters
329
What is complete linkage?
Distance between the farthest members of two clusters
330
What is average linkage?
Average distance between all members of the two clusters
331
What is centroid linkage?
Distance between the centroids of two clusters
332
What is Ward’s method?
Merge clusters that minimise total variance
333
What does DBSCAN stand for?
Density-Based Spatial Clustering of Applications with Noise
334
What is the goal of DBSCAN?
Identify dense clusters and mark sparse areas as noise
335
What does eps (ε) mean in DBSCAN?
The radius used to find neighbouring points
336
What does MinPts mean in DBSCAN?
The minimum number of points required to form a dense region
337
What is a core point in DBSCAN?
A point with ≥ MinPts in its ε-neighbourhood
338
What is a border point in DBSCAN?
A point with < MinPts, but in the ε-neighbourhood of a core point
339
What is a noise point in DBSCAN?
A point that is neither a core nor a border point
340
How are points in a cluster connected in DBSCAN?
Through chains of ε-neighbourhoods from core points
341
What shape clusters can DBSCAN find?
Arbitrary shapes (not limited to circles)
342
Does DBSCAN need to know the number of clusters?
No
343
What’s a major limitation of DBSCAN?
It struggles with varying cluster densities
344
How sensitive is DBSCAN to its parameters?
Very - eps and MinPts must be tuned carefully
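A DBSCAN sketch with scikit-learn (eps and min_samples are illustrative and would need tuning):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```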
345
When is hierarchical clustering most useful?
When you want to explore different clusterings at different granularity
346
What kind of algorithm is K-means?
A partitional clustering algorithm
347
What input does K-means require?
The number of clusters, K
348
What is the objective of K-means?
Minimise the sum of squared errors (SSE)
349
What is a centroid in K-means?
The average of all points in a cluster
350
What’s the first step in K-means?
Randomly pick K initial centroids
351
What’s the next step after assigning points in k means?
Recalculate the centroids
352
When does K-means stop?
When centroids or assignments no longer change
353
Name two weaknesses of K-means.
Sensitive to outliers, assumes spherical clusters
354
What preprocessing helps K-means?
Normalising or standardising data
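The K-means cards in one scikit-learn sketch (the blob data and K=3 are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # K-means is distance-based: scale first

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # the SSE that K-means minimises
```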
355
What does GMM stand for?
Gaussian Mixture Model
356
What does GMM assume about clusters?
Each cluster follows a Gaussian distribution
357
What does GMM use for assignment?
Soft assignment (probabilities)
358
What function does GMM maximise?
Log-likelihood
359
What cluster shape does K-means assume, and what does GMM assume?
Spherical for K-means; elliptical for GMM.
360
Which uses soft assignment, and which hard: K-means or GMM?
K-means: hard; GMM: soft.
361
Why is cluster validation important?
Clustering has no ground truth labels
362
What are the three types of validation?
External, internal, and relative
363
What is internal validation based on?
Measure whether points that should be close to/far from each other really are close/far.
364
What is external validation based on?
Measure how the clustering labels compare to externally supplied class labels.
365
What is relative validation based on?
Compare one clustering to another, to see if they agree.
366
What does a silhouette score of 1 mean?
Perfect separation and compactness
366
What does a silhouette score of -1 mean?
Bad clustering — point likely misclassified
367
What is the input and output of a computer vision system?
Input: Image; Output: High-level information like object detection or 3D reconstruction.
368
How is a color image represented in a matrix format?
As three matrices for Red, Green, and Blue channels.
369
Name three tasks computer vision systems can perform.
Object detection, image segmentation, 3D reconstruction.
370
What biological system inspired deep neural networks?
The visual cortex of the brain.
371
What is a perceptron?
A neuron-like model that applies weights to inputs, sums them, and applies an activation function.
372
Why do standard MLPs not scale well to large images?
Because each node needs too many weights, e.g., 10,000 for a 100×100 image.
373
What is a convolution in CNNs?
It’s the process of sliding a filter over the image and computing dot products to detect patterns.
374
What is a filter (or kernel) in CNN?
A small matrix used in CNNs to detect features like edges or textures.
375
What is a feature map?
The output matrix showing where a filter detected its pattern in the input image.
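A hand-rolled 2D convolution sketch (as in CNNs, this is really cross-correlation; the edge filter and image are made up):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel and take dot products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1, -1]])          # responds to horizontal intensity changes
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1]])
print(convolve2d(img, edge))        # feature map: strong response at the edge
```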
376
Why are CNNs better than MLPs for vision?
Fewer parameters, local pattern detection, and translation invariance.
377
Why is deep learning considered data-hungry?
It needs large amounts of data to generalize well.
378
What makes deep learning expensive?
High computational requirements, often needing GPUs and cloud computing.
379
Why is deep learning hard to interpret?
Neural networks function as "black boxes" with complex internal representations.
380
Can deep learning models represent uncertainty well?
No, they are easily overconfident and can be tricked.
381
What are some challenges in training deep learning models?
Choosing the right architecture, learning rate, and avoiding overfitting.
382
What is the purpose of ReLU in a CNN?
To add non-linearity by setting all negative values to zero.
383
What does ReLU(x) return?
max(0, x)
384
Why do we use pooling in CNNs?
To reduce the spatial dimensions and computation cost while keeping key features.
385
What is max pooling?
Taking the maximum value in a small patch of the feature map.
386
What is average pooling?
Taking the average of the values in a small patch.
387
What is padding in a CNN?
Adding extra borders (usually zeros) to retain image size after convolution.
388
What is stride in convolutional layers?
The step size of the filter as it moves across the image.
389
How does increasing stride affect feature maps?
It reduces their size faster, but may lose detail.
390
What’s a drawback of the tanh activation function?
Its derivatives can vanish, making learning slow.
391
What type of function is used at the output layer in classification tasks?
Softmax.
392
What does the softmax function output?
Probabilities for each class.
393
What’s a drawback of the relu activation function?
It discards all negative values.
394
What happens if the learning rate is too high?
The model may overshoot and not converge.
395
What happens if the learning rate is too low?
Learning is very slow or may get stuck.
396
What is momentum in optimisation?
A technique that helps accelerate gradient descent by smoothing updates.
397
What is a hyperparameter?
A setting defined before training that controls model structure or learning.
398
Give examples of CNN hyperparameters.
Filter size, stride, padding, learning rate, activation function, epochs, batch size.
399
What is the role of batch size in training?
It determines how many examples are processed at once in each training step.