Learning from data Flashcards

1
Q

Define data integration and the goal of data integration

A

Data integration is the practice of combining data from heterogeneous sources into a single coherent
data store.
Its goal is to provide users with consistent access to, and delivery of, data across a spectrum of
subjects and data structure types.

2
Q

Define a common user interface (manual data integration)

A

A hands-on approach where data
managers manually handle every step of the integration, from retrieval to presentation.

3
Q

Define middleware data integration

A

Uses middleware software to bridge and facilitate communication between different systems, especially between legacy and newer systems

4
Q

Define application-based data integration

A

Software applications locate, retrieve and integrate data by making
data from different sources and systems compatible with one another

5
Q

Define uniform data access

A

It provides a consistent view of data from diverse sources without moving or
altering it, keeping the data in its original location.

6
Q

Define common data storage (data warehousing)

A

It retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository.

7
Q

What is a pro and a con for a common user interface

A

Reduced cost, requires little maintenance, integrates a small number of data sources, user has total control.

Data must be handled manually at each stage, scaling a project up requires changing code, manual orchestration.

8
Q

What is a pro and a con for middleware data integration

A

Middleware software conducts the integration automatically, and the same way each time.

Middleware needs to be deployed and maintained.

9
Q

What is a pro and a con for application based integration

A

Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.

Requires specialist technical knowledge and
maintenance, complicated setup.

10
Q

What is a pro and a con for uniform data access

A

Lower storage requirements, provides a simplified view of the data to the end user, easier data access

Can compromise data integrity, data host systems are not designed to handle amount and frequency of data requests.

11
Q

What is a pro and a con for common data storage ( data warehousing )

A

Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity

Need to find a place to store a copy of the data, increased storage cost, and it requires technical experts to set up the integration and to oversee and maintain the data warehouse.

12
Q

What is the difference between supervised and unsupervised learning? Also what is Semi-Supervised learning?

A

Supervised learning algorithms use data with labelled outcomes while unsupervised learning algorithms use data without labelled outcomes.

Semi-supervised learning algorithms use both data with labelled outcomes and without labelled outcomes.

13
Q

What is the task of supervised learning?

A

The task is to learn a mapping function from possible inputs to outputs.

14
Q

What is the task of unsupervised learning

A

In unsupervised learning, our task is to try to “make sense of” data, as opposed to learning a mapping. This is as we have inputs but no associated responses.

15
Q

Strictly define a hyperparameter

A

A hyperparameter is a parameter that is not learned directly from the data but relates to
implementation

16
Q

Define the training and prediction phase in a ML model

A

In the training phase, an ML model learns the parameters that define the relationship between the features and the outcome variable. The more data, the better.

In the prediction phase, we take new observations, feed their values into the trained model, and obtain a prediction.

17
Q

What can we use to measure the quality of our predictions

A

Most models will define a loss function which is some quantitative measure
of how close our prediction is to the actual value. In addition there will also be an update rule that will determine how to update the model parameters

18
Q

Define the difference between regression and classification

A

Classification deals with assigning data to one of a set of predetermined, preexisting classes. Regression deals with continuous data, e.g. any continuous outcome like loss, revenue, or number of years; anything that can be answered with the question "how much?"

19
Q

Look over calculating linear regression coefficients by hand

A

https://ele.exeter.ac.uk/pluginfile.php/4546128/mod_resource/content/0/LfD-L2.pdf

20
Q

Define the mean squared error function and how it works

A

The Mean Squared Error (MSE) function is the sum of squared errors divided by the number of values: MSE = (1/n) * Σ(yᵢ - ŷᵢ)².

21
Q

What is the difference between explained and unexplained variation in a regression model?

A

The explained variation measures how much of the total variation is captured by the regression model, i.e., how much of the variation in
𝑦 can be explained by the independent variable(s) 𝑥. While the unexplained variation measures the variability in the dependent variable that is not captured by the regression model. It is also known as the error or residual variation.

Total variation is the unexplained variation added to the explained variation

22
Q

What is the primary objective for a prediction model?

A

The primary objective is to make the best prediction.

23
Q

What is the main focus for prediction models?

A

The focus is on performance metrics, which measure the quality of the model’s predictions.
Performance metrics usually involve some measure of closeness between ypred and y.
Without focusing on interpretability, we risk having a Black-box model.

24
Q

How can we determine the accuracy of a prediction model?

A

The closer the predicted values are to the observed values, the more accurate the prediction is.
The further the predicted values are from the observed values, the less accurate the prediction is.

25
Q

What is the primary objective for an interpretation model?

A

The primary objective is to train a model to find insights from the data.

26
Q

What is the primary focus for an interpretation model?

A

On the model’s coefficients or feature importance, which reveal which features are most influential in predicting the outcome.

27
Q

Why might we want to use polynomials in our linear regression models?

A

Polynomials can help to predict better - they can sometimes better fit the curvature of the actual data.
Polynomials can also help to find variables that explain variation in data better

28
Q

What kind of features can be captured by adding polynomial features to our linear regression models?

A

Higher order features - features that are created by transforming the original features (or predictors) in a non-linear way. Higher-order features help capture more complex, non-linear relationships between the independent variables and the dependent variable that would otherwise be missed by a simple linear model.
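A minimal sketch of this idea, assuming scikit-learn and NumPy are available (the synthetic data and the degree choice are illustrative only): polynomial features are generated from the original feature and an ordinary linear regression is fit on the expanded feature set.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=100)  # quadratic signal

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)             # columns: x and x^2 (a higher-order feature)

    model = LinearRegression().fit(X_poly, y)  # still a linear combination of features
    print(model.coef_)                         # roughly [-1, 0.5]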

29
Q

Note: adding polynomial features does not mean the algorithm is no longer linear regression. A non-linear relationship between one feature and another does not make the algorithm non-linear; the model is still a linear combination of the (transformed) features.

A
30
Q

What is the Bayes Information Criterion ( BIC )

A

The Bayesian Information Criterion (BIC) is a statistical measure used for model selection. It helps compare different models and identify the one that best balances goodness of fit with model complexity. Specifically, BIC penalizes models with more parameters to avoid overfitting, favoring simpler models when they adequately explain the data.

31
Q

What is the formula for the Bayes Information Criterion?

A

BIC = n * ln(SSE) - n * ln(n) + p * ln(n)

SSE = sum of squared errors
n = number of observations
p = number of parameters
ln = natural logarithm

32
Q

Interaction terms can be added to a linear regression modelling algorithm, what are interaction terms?

A

Interaction terms in linear regression modeling are terms added to the model to capture the combined effect of two (or more) predictor variables on the dependent variable. Interaction terms account for the situation where the effect of one predictor on the outcome depends on the level or value of another predictor.

33
Q

How can interaction terms be represented in our modelling algorithm

A

If there are two predictor variables, say x1 and x2, then the interaction term is represented by multiplying the two variables together. The resulting equation could look like this:

y = β0 + β1X1 + β2X2 + β3(X1 × X2) + ϵ

34
Q

Why should data be split into a training and testing split?

A

In machine learning, a training set is used to train or fit the model, allowing it to learn patterns and relationships from the data. A training set can also be used to learn the optimal parameters for a model

The testing set (or test set) is kept separate and used to evaluate the model’s performance on unseen data. This separation is crucial to prevent overfitting, where the model performs well on the training data but poorly on new, unseen data.

By using both a training and testing set, you ensure the model generalizes well and can make accurate predictions in real-world scenarios.

35
Q

How can we use the Test data set (also known as a holdout set) to measure model performance?

A

-Predict the labels with the model
-Compare the predictions with the actual values
-Measure the error

36
Q

Define cross validation in the test train split and why it should be used

A
  • Instead of using a single training and test set, we use cross validation to calculate the error across
    multiple training and test sets.
  • With cross validation, we split the data into multiple pairs of training and test sets.
  • Average the error across each one of the test set errors.
  • Performance measure will be more statistically significant.
37
Q

Define k-fold cross validation and how it works

A

A single parameter called k specifies the number of groups the data sample is to be split into.

  • Process:
    1. Shuffle the dataset.
    2. Split the dataset into k groups.
    3. For each unique group:
       a) Use the group as a test set.
       b) Use the remaining groups as a training set.
       c) Fit the model on the training set, then evaluate the model on the test set.
       d) Store the evaluation score, then discard the model.
    4. Summarise model performance using the model evaluation scores.
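A minimal sketch of this procedure, assuming scikit-learn (the iris dataset and logistic regression model are illustrative only); swapping KFold for StratifiedKFold gives the stratified variant discussed in the next cards.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)  # steps 1-2: shuffle, split into k groups
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # step 3: fit/evaluate per fold
    print(scores, scores.mean())                          # step 4: summarise the evaluation scores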
38
Q

Define stratified sampling

A

Stratified sampling is a sampling technique where the samples are selected in the same proportion as they appear in the population.

39
Q

Why should we implement stratified sampling in conjunction with k-fold cross validation?

A

Implementing stratified sampling in cross validation ensures that the training and test sets have the same proportion of the feature of interest as in the original dataset.
* By doing this with the target feature, we ensure that the cross validation result is a close approximation of the generalisation error

40
Q

If Training error and cross validation error metrics are both high what is that a sign of

A

Underfitting, which is where a model is too simple to capture the underlying patterns in the data

41
Q

If the training error is low but the cross validation error is high what is that a sign of?

A

Overfitting, which is where the model learns the training data instead of relationships in the data

42
Q

When we are creating a model we want both the training and cross validation errors to be low - what is a good approach to doing this?

A

A good approach: as soon as the cross validation error starts to increase, we stop making the model more complex.

43
Q

When the model is too complex, e.g. it has a high polynomial degree or too many layers, what is likely to occur?

A

The model is likely to just learn the training data and will start overfitting.

44
Q

What are the three sources of model error?

A

Bias
Variance
Irreducible error

45
Q

What is the bias in a model?

A

Bias is the tendency to miss the true values when predicting; a model with high bias will not be very accurate and will often predict incorrectly. We want bias to be as low as possible.

46
Q

What can high bias be the result of?

A
  • The model misrepresenting the data due to missing information.
  • An overly simple model, i.e., bias towards the simplicity of the model.
  • The model missing the real patterns in the data.

High bias is often linked to underfitting the training data.

47
Q

What is the variance in a model and what is it characterised by?

A

Variance is the tendency of predictions to fluctuate and is characterised by high sensitivity of output to small changes in input data

48
Q

What causes a model to have high variance

A

Variance is often caused by overly complex or poorly fit models, e.g. very high-degree polynomial models.
It is associated with overfitting.

49
Q

What is irreducible error in a model?

A

Irreducible error in a model refers to the portion of the total error that cannot be reduced or eliminated, no matter how well the model is designed or how much data is used. This error arises from inherent variability or noise in the data that the model cannot capture or explain.

50
Q

What are some sources of irreducible error?

A

Measurement errors in data collection.
Randomness in the underlying process being modeled.
Unpredictable factors that affect the outcome but are not included in the model.

It is impossible to perfectly model the majority of real world data. Thus, we have to be comfortable
with some measure of error.

51
Q

What is the bias-variance trade off?

A

Making model adjustments aimed at reducing bias can often end up increasing variance, and vice versa.

Analogous to the complexity tradeoff where we want to choose the right level of complexity to find the best model

The model should be complex enough not to underfit but not so complex that it overfits

“We search for a model that describes the feature-target relationship but is not so complex that it fits to spurious patterns.”

52
Q

How does the degree of a polynomial regression relate to the Bias-Variance tradeoff?

A

The higher the degree of a polynomial regression, the more complex that model is (lower bias, higher
variance).

  • At lower degrees: the predictions are too rigid to capture the curved pattern in the data (bias).
  • At higher degrees: the predictions fluctuate wildly because of the model’s sensitivity (variance).

As the degree of the polynomial increases:

Bias decreases, because the model becomes more flexible and better captures the training data.

Variance increases, because the model becomes more sensitive to noise and variations in the data.

The optimal model has sufficient complexity to describe the data without overfitting.

53
Q

Define a cost function

A

A cost function in machine learning is a mathematical function that measures the error or difference between the predicted output of a model and the actual target values. It serves as a quantitative measure of how well or poorly a machine learning model performs. The goal of training a model is to minimize this cost function.

54
Q

Define Linear Model Regularisation

A

To regularise linear models we can add a regularisation strength parameter directly into the cost function.

This parameter (λ) adds a penalty proportional to the size (magnitude) of each estimated model parameter.

  • When λ is large, stronger parameters are penalised more heavily. Thus, a more
    complex model will be penalised.
55
Q

How does the introduction of a regularisation term to a linear model effect the bias variance tradeoff?

A

The regularisation strength parameter λ allows us to manage the complexity tradeoff.
* More regularisation introduces a simpler model or more bias.
* Less regularisation makes the model more complex and increases variance.

If the model overfits (variance is too high), regularisation can improve generalisation error and reduce
variance.

56
Q

How is the penalty λ applied in Ridge regression?

A

In ridge regression, the penalty λ is applied proportionally to squared coefficient values.

  • This penalty imposes bias on the model and reduces variance
57
Q

How can we find the best value λ when performing regression on a linear model?

A

We can use cross validation - it’s best practice to scale features

58
Q

What is LASSO regression and how is the penalty λ applied

A
  • In LASSO (Least Absolute Shrinkage and Selection Operator), the penalty λ is applied proportionally
    to absolute coefficient values
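A minimal sketch of ridge and LASSO regression, assuming scikit-learn (the diabetes dataset and alpha grid are illustrative only). Cross validation is used to choose the regularisation strength (alpha here plays the role of λ), and features are scaled first as recommended in the earlier card.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)

    ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))  # L2 penalty
    lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))           # L1 penalty

    ridge.fit(X, y)
    lasso.fit(X, y)
    print(ridge[-1].coef_)    # coefficients shrunk towards zero
    print(lasso[-1].coef_)    # some coefficients driven exactly to zero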
59
Q

What is L1 and L2 regularisation?

A

LASSO and Ridge regression are also known as L1 regularisation and L2 regularisation, respectively

The names L1 and L2 regularisation come from the L1 and L2 norm of a vector w, respectively

60
Q

How can we calculate the norm for L1 regularisation?

A

The L1-norm is calculated as the sum of the absolute vector values, where the absolute value of a
scalar w is written |w|.

61
Q

How can we calculate the norm for L2 regularisation?

A

The L2-norm is calculated as the square root of the sum of squared vector values
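A minimal sketch, assuming NumPy, of how the two norms are computed for an example weight vector (the values are illustrative only).

    import numpy as np

    w = np.array([3.0, -4.0, 0.5])
    l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values -> 7.5
    l2 = np.sqrt(np.sum(w ** 2))    # L2 norm: square root of the sum of squares -> about 5.02
    print(l1, l2)
    print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))   # the same values via NumPy's helper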

62
Q

Why should we use L1 vs L2 regularisation?

A
63
Q

What is feature selection and why is it important?

A

Feature selection is figuring out which of our features are important to include in the model.

Reducing the number of features can prevent overfitting.
* For some models, fewer features can improve fitting time and/or results.
* Identifying the most important features can improve model interpretability.
* Feature selection can also be performed by removing features: remove features one at a time and measure the predictive results using cross validation; if eliminating a feature improves the cross validation results, or doesn't increase the error much, that feature can be removed.

64
Q

How does regularisation perform feature selection

A

Regularisation performs feature selection by shrinking the contribution of features.
* For L1-regularisation, this is accomplished by driving some coefficients to zero.

65
Q

What is data integration or data fusion?

A

Data integration (or data fusion) combines data so that it can all be analysed as if it were retrieved from a single source, enabling efficient analysis without worrying about disparate sources.

66
Q

What is data cleaning, and why is it necessary?

A

Data cleaning is the process of detecting and correcting corrupt, inaccurate, or incomplete records in a dataset. It ensures the dataset is reliable, accurate, and usable for analysis by removing or rectifying errors, outliers, and inconsistencies.

67
Q

What are missing values in a dataset, and how are they represented?

A

Missing values are data points expected in the dataset but absent. In pandas, they are often represented as NaN (Not a Number). Other representations include None, 9999, or N/A. Missing values can arise from human error, skipped survey questions, or database management issues.

68
Q

What are the three types of missing data?

A

Missing Completely At Random (MCAR): Missing values occur randomly, unrelated to observed or unobserved data (e.g., random sensor failure). - The probability of missing values is equal for all units

Missing At Random (MAR): probability of a missing value depends on observed data but not the missing data itself (e.g., sensor failure during high wind speeds).

Missing Not At Random (MNAR): the probability of a missing value is related to the actual (unobserved) missing data itself (e.g., tampered sensors near a polluting power plant).

69
Q

What are the four approaches to dealing with missing values?

A

Keep as-is:

Used when the tools or goals can handle missing values directly, such as the KNN algorithm.
This approach ensures the dataset remains intact and avoids introducing biases during preprocessing.

Remove rows with missing values:

Suitable for MCAR situations, where missing values are completely random.
Avoid in MAR and MNAR cases, as removing rows risks introducing bias by excluding specific subsets of data.

Remove columns with missing values:

Effective when missing rates exceed 25%, especially for non-critical attributes.
Avoid for critical attributes, as their removal can compromise the analysis.

Impute missing values:

Replace missing values with central tendencies (mean, median, or mode), subgroup averages (for MAR), or regression models (for MNAR).
Be cautious of introducing bias during imputation.
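A minimal sketch of the four approaches, assuming pandas and NumPy (the toy DataFrame and column names are illustrative only).

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "income": [30000, 42000, np.nan, 50000]})

    kept = df                                          # 1. keep as-is (the tool must tolerate NaN)
    dropped_rows = df.dropna(axis=0)                   # 2. remove rows with missing values (MCAR)
    dropped_cols = df.dropna(axis=1)                   # 3. remove columns with missing values
    imputed = df.fillna(df.median(numeric_only=True))  # 4. impute with a central tendency (median)
    print(imputed)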

70
Q

What are the Guiding principles when dealing with missing values and what considerations should we make

A

We should aim to preserve data and information and minimize bias introduction.

Additional considerations include:
-What are our analytical goals? (clustering, classification, etc.)
-What analytical tools are we using?
-What is the cause of our missing values?
-What type of missing values are we dealing with?

71
Q

What is an outlier, and what are the common causes?

A

An outlier is a data point that significantly differs from others. Causes include:

Errors: Data entry or measurement mistakes.
Legitimacy: True but extreme values that may skew results.
Fraud: Deliberate manipulation requiring scrutiny.
Random errors: Unavoidable fluctuations or inconsistencies in measurements due to chance.

72
Q

How can outliers be detected and managed?

A

Outliers can be detected using the interquartile range (IQR = Q3 - Q1):

Define outliers as values outside the range
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

Management strategies:

Do nothing (if the analysis is robust to outliers).

Replace with caps (the upper or lower bound): ideal when the analysis is sensitive to outliers and retaining all data objects is vital.

Apply a log transformation: for particularly skewed data.

Remove outliers: the worst option due to potential loss of information. It should only be done when other methods are inapplicable and when the data is correct but the outlier values are excessively distinct.
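A minimal sketch of IQR-based detection plus the capping and log-transform strategies, assuming pandas and NumPy (the toy series is illustrative only).

    import numpy as np
    import pandas as pd

    x = pd.Series([12, 13, 14, 15, 15, 16, 17, 18, 95])    # 95 is an obvious outlier
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = x[(x < lower) | (x > upper)]   # detection
    capped = x.clip(lower, upper)             # "replace with caps" strategy
    logged = np.log(x)                        # log transformation for skewed data
    print(outliers.tolist(), float(capped.max()))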

73
Q

What are systematic errors, and why are they challenging?

A

Systematic errors are consistent, repeatable errors linked to specific sources (e.g., faulty instruments). They are difficult to detect as they may go unnoticed in data but can cause bias in analyses. Outlier detection can sometimes help identify these errors.

74
Q

Why is data transformation necessary?

A

Data Transformation is the last stage of data preprocessing before using analytic tools. It ensures our dataset meets key
characteristics and is ready for analysis!

Data transformation adjusts data ranges to:

Meet model assumptions.
Improve algorithm stability.
Ensure fair contribution from all features.
Common techniques include standardization, normalization, log transformation, and categorical-to-numerical conversions.

75
Q

What is the difference between standardization and normalization?

A

Standardization: Rescales data to have a mean of 0 and a standard deviation of 1.

Normalization: Rescales data to a specific range, typically [0, 1].
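A minimal sketch contrasting the two rescalings, assuming scikit-learn (the toy values are illustrative only).

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [10.0]])
    standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
    normalized = MinMaxScaler().fit_transform(X)       # rescaled to the range [0, 1]
    print(standardized.ravel())
    print(normalized.ravel())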

76
Q

When is a log transformation used, and why?

A

Log transformation is used to address skewness and extreme values. It is particularly useful for:

Features varying over orders of magnitude.

Cases where ratios are more critical than absolute differences.

Formula: x' = log(x), where x' is the transformed value.

77
Q

Define an error in Data

A

A discrepancy or deviation in measured data from the actual or true values, which can arise due to various factors during data collection or measurement.

78
Q

What is discretisation

A

The process of converting continuous data (numerical values) into discrete intervals or categories. This transformation helps simplify the data, making it easier to analyze, interpret, or use in certain machine learning algorithms that perform better with categorical data.

79
Q

What are some techniques for transforming data from numerical to categorical and what is an advantage and disadvantage

A

Binary coding:
Categorical data is represented by a unique number for each category. These numbers are then converted into binary form, with each digit of the binary number given its own data column (feature).
-Works well with high-cardinality categorical variables and reduces dimensionality
-Does not preserve ordinal relationships

Ranking transformation (ordinal encoding): data is transformed based on the order or rank of the values; useful where hierarchies are significant.
-Preserves the natural ordering of the data
-Not suitable for nominal data and may lose information about relative distances

Attribute conversion: transforms existing attributes (features) into new attributes that are better suited to machine learning models. This is a broader technique that encompasses several types of transformations, including scaling, normalizing, and converting attributes into different representations.
-Can improve model performance by making data more interpretable
-Requires domain knowledge and risks introducing bias

80
Q

What is smoothing in data transformation, and how is it applied?

A

Smoothing reduces noise and fluctuations in data to reveal underlying trends. Common methods include:

Moving average: averaging data points in successive subsets to create smoothed values.

81
Q

What is the role of bar charts in data visualization?

A

Bar charts display categorical data distributions. Tips for effective use:

Avoid too many bars.

Use horizontal layouts for readability when necessary.

82
Q

When are line plots and scatter plots used in visualization?

A

Line plots: Best for showing trends over time. Avoid markers; use legends on curves.

Scatter plots: Visualize relationships between variables, often with multiple variables for richer insights.

83
Q

Why should pie charts be used cautiously?

A

Pie charts display proportions but are often misleading due to poor readability and comparison challenges. Alternative visualizations like bar charts are recommended.

84
Q

What is classification in machine learning?

A

Classification is a supervised learning task where an algorithm is trained on labeled data to predict the class of new data points. It assigns inputs (x) to predefined categories (y), such as predicting whether an email is spam or not. The classifier chooses the class with the highest predicted probability.

85
Q

How is the train-test split used in machine learning?

A

The dataset is split into two parts: the training set (usually 80%) is used to train the model, and the test set (usually 20%) is used to evaluate its performance. This ensures that the model is tested on unseen data to check its generalization ability.

86
Q

What are some common applications of classification

A

Applications include image classification (e.g., recognizing objects in images), spam detection in emails, medical diagnosis (e.g., disease prediction), and customer segmentation in marketing.

87
Q

How does logistic regression work?

A

Logistic regression is a statistical model used for binary classification. It predicts the probability that a given input x belongs to class 1 (y=1). It uses the logistic function to ensure the output is between 0 and 1. It is up to the developer/researcher to decide what threshold to use, but the default is 0.5 (50%).

88
Q

What kind of data does logistic regression handle?

A

Logistic regression handles continuous input data (x) and binary output data (y).

89
Q

How do you interpret the logistic regression model’s output?

A

The output of logistic regression is the probability of the input data belonging to class 1. A threshold (e.g., 0.5) is then applied to decide the final class label.

90
Q

What is the cost function used in logistic regression?

A

Logistic regression uses the log-loss (cross-entropy) cost function, which measures the difference between the predicted probabilities and the actual class labels.

91
Q

How is logistic regression implemented in scikit-learn?

A

Import the necessary modules: from sklearn.linear_model import LogisticRegression.

Split the data using train_test_split.

Create the classifier: log_reg = LogisticRegression().

Fit the model: log_reg.fit(X_train, y_train).

Predict and evaluate: log_reg.predict() and log_reg.score().
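The steps above written out as a runnable sketch, assuming scikit-learn (the breast cancer dataset and the max_iter setting are illustrative only).

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    log_reg = LogisticRegression(max_iter=5000)
    log_reg.fit(X_train, y_train)

    y_pred = log_reg.predict(X_test)         # class labels using the default 0.5 threshold
    proba = log_reg.predict_proba(X_test)    # probability of belonging to each class
    print(log_reg.score(X_test, y_test))     # accuracy on the held-out test set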

92
Q

What are the weaknesses of logistic regression?

A

Logistic regression assumes a linear relationship between the input features and the log-odds of the target variable. It can struggle with non-linear data or datasets with overlapping classes.

93
Q

What is the formula for logistic regression accuracy?

A

Accuracy = (True Positives + True Negatives) / sample size

94
Q

What is the perceptron algorithm?

A

The perceptron is a linear classifier that updates its weights based on classification errors. It iteratively adjusts weights to find a hyperplane(e.g. a line separating the data points if they were graphed) that separates the classes.

95
Q

How is a perceptron trained?

A

Initialize weights (w) to zero or small random values.

For each input (x), calculate the predicted output (y_pred) using a step function.

Update weights if there is a misclassification.

Repeat until convergence or for a fixed number of iterations.
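A minimal sketch of this update rule, assuming NumPy; the AND-function toy dataset, learning rate, and epoch count are illustrative only.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])                  # AND function: linearly separable

    w = np.zeros(X.shape[1])                    # initialise weights (and bias) to zero
    b = 0.0
    lr = 0.1

    for _ in range(20):                         # repeat for a fixed number of iterations
        for xi, target in zip(X, y):
            y_pred = 1 if xi @ w + b > 0 else 0    # step function
            update = lr * (target - y_pred)        # non-zero only on a misclassification
            w += update * xi
            b += update

    print(w, b, [1 if xi @ w + b > 0 else 0 for xi in X])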

96
Q

What are the limitations of the perceptron?

A

The perceptron can only solve problems where the data is linearly separable. It fails for non-linear datasets.

97
Q

What is the multi-layer perceptron (MLP)?

A

MLP is a type of neural network that can model non-linear relationships. It consists of multiple layers of neurons: input layer, one or more hidden layers, and an output layer. It uses activation functions to introduce non-linearity.

98
Q

How does the MLP overcome the limitations of perceptrons?

A

MLP uses hidden layers and non-linear activation functions (e.g., ReLU, sigmoid) to model complex, non-linear relationships in data. This allows it to solve problems like XOR.

99
Q

What is the activation function in an MLP?

A

An activation function introduces non-linearity into the model, enabling it to learn complex patterns. Common activation functions include sigmoid, tanh, and ReLU.

100
Q

What is the backpropagation algorithm?

A

Backpropagation is an algorithm used to train neural networks. It calculates the gradient of the loss function with respect to each weight using the chain rule and updates the weights to minimize the error.

101
Q

What are the strengths of MLPs?

A

MLPs are highly versatile and can model complex patterns. They are suitable for a variety of tasks, including classification, regression, and even unsupervised learning when paired with autoencoders.

102
Q

What are the common optimization techniques used in MLP training?

A

Common techniques include stochastic gradient descent (SGD), Adam optimizer, and learning rate schedules. These methods improve convergence and training efficiency.

103
Q

What are some practical considerations when using MLPs?

A

Choose the number of layers and neurons carefully to avoid overfitting or underfitting.

Use regularization techniques like dropout or L2 regularization.

Scale input data for faster convergence.

104
Q

What is the importance of the perceptron in the history of AI?

A

The perceptron is one of the first machine learning algorithms and laid the foundation for modern neural networks. It demonstrated the potential of learning machines but also highlighted the need for multi-layer architectures to handle non-linear problems.

105
Q

How can we use a cost function to help deal with class imbalance inaccuracies?

A

Adjust the cost function by assigning higher weights to minority classes and lower weights to majority classes. This penalizes the model more heavily for misclassifying the minority classes.

106
Q

What is gradient descent, and why is it used in machine learning?

A

Gradient descent is an iterative optimization algorithm used to minimize a loss function (error) by adjusting model parameters. It calculates the gradient of the loss function with respect to the parameters and moves in the opposite direction of the gradient to reduce the error

107
Q

What are the key steps in the gradient descent algorithm?

A

Start at a random point in parameter space.

Calculate the loss at the current point.

Take a step in the direction of steepest descent (opposite to the gradient).

Recalculate the loss at the new point.

Repeat until the loss is below a threshold or a maximum number of steps is reached.
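A minimal sketch of these steps for a one-parameter model, assuming NumPy: fitting the slope w in y ≈ w * x by minimising the mean squared error (the data, learning rate, and iteration count are illustrative only).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 3.0 * x + np.array([0.1, -0.2, 0.05, 0.0])   # the true slope is about 3

    w = 0.0                                   # starting point in parameter space
    lr = 0.01                                 # learning rate: the size of each step
    for _ in range(500):
        grad = np.mean(2 * (w * x - y) * x)   # gradient of the loss at the current point
        w -= lr * grad                        # step in the opposite direction of the gradient
    print(w)                                  # close to 3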

108
Q

What is the difference between stochastic and deterministic parameter-fitting methods?

A

Deterministic: The algorithm moves in a specific direction calculated by the gradient. It has no randomness.

Stochastic: The algorithm includes randomness by picking a nearby random point and checking the error iteratively.

109
Q

What is the L1 norm, and how is it related to optimization methods?

A

The L1 norm (Manhattan distance) is the sum of the absolute differences between predicted and actual values. Minimizing the L1 norm promotes sparsity in model parameters and is often used in optimization methods like Lasso regression to encourage simpler models by driving some coefficients to zero.

110
Q

What is the L2 norm, and how is it related to gradient descent?

A

The L2 norm (Euclidean distance) is a measure of the error between predicted values and actual values. Minimizing the L2 norm (squared error) is a common objective in gradient descent to optimize model parameters.

111
Q

What is the role of the gradient in gradient descent?

A

The gradient represents the direction and rate of the steepest increase of the loss function. Gradient descent takes steps in the opposite direction of the gradient to minimize the loss function.

112
Q

In one-dimensional parameter space, how is the optimal point found using gradient descent?

A

In 1D, the optimal point occurs where the derivative of the loss function with respect to the parameter equals zero: dL/dw = 0.

113
Q

How does gradient descent generalize to two or more dimensions

A

In higher dimensions, the gradient consists of partial derivatives with respect to each parameter. Gradient descent finds the direction of steepest descent by combining all partial derivatives into a gradient vector.
(Idk if we need to know this)

114
Q

What is a confusion matrix, and what is it used for?

A

A confusion matrix is a table used to evaluate the performance of a classification model. It displays the counts of true positives, true negatives, false positives, and false negatives.

115
Q

Define precision in the context of a confusion matrix.

A

Precision (Positive Predictive Value) is the proportion of correctly predicted positive observations out of all observations predicted as positive:

Formula:
Precision = True Positives / (True Positives + False Positives)

116
Q

Define recall (sensitivity) in the context of a confusion matrix.

A

Recall (Sensitivity) is the proportion of actual positive observations correctly classified as positive:

Formula:
Recall = True Positives / (True Positives + False Negatives)

117
Q

What is the trade-off between precision and recall?

A

A high recall indicates fewer false negatives but may include more false positives (lower precision). Conversely, high precision minimizes false positives but may miss true positives, lowering recall.

118
Q

What is the F1 score, and why is it useful?

A

The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). It is useful for balancing the trade-off between precision and recall.
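A minimal sketch computing these quantities from toy labels, assuming scikit-learn's metrics module (the label vectors are illustrative only).

    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall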

119
Q

What is the ROC curve, and what does it represent?

A

The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the decision threshold varies.

120
Q

What is sensitivity and specificity in binary classification?

A

Sensitivity (Recall): Proportion of true positives correctly identified.

Specificity: Proportion of true negatives correctly identified.

121
Q

What does the Area Under the Curve (AUC) represent in the ROC curve?

A

The AUC measures the overall performance of a classifier by quantifying the area under the ROC curve. A higher AUC indicates better classifier performance.

122
Q

How is grid search used for parameter fitting?

A

Grid search involves exhaustively trying all combinations of predefined parameter values to identify the combination that minimizes error or maximizes accuracy.
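A minimal sketch of grid search with cross validation, assuming scikit-learn (the SVC model, iris dataset, and parameter grid are illustrative only).

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

    search = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination in the grid
    search.fit(X, y)
    print(search.best_params_, search.best_score_)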

123
Q

How can imbalanced datasets affect classifier performance, and how can this issue be addressed?

A

Imbalanced datasets bias the model towards the majority class. Solutions include:

Sampling techniques (oversampling minority class, undersampling majority class).

Using metrics like F1 score or ROC-AUC that account for class imbalance.

124
Q

What happens when the classifier threshold is set very low?

A

Setting a low threshold increases sensitivity (high true positive rate) but reduces specificity, leading to more false positives.

125
Q

What happens when the classifier threshold is set very high?

A

Setting a high threshold increases specificity (low false positive rate) but reduces sensitivity, leading to more false negatives.

126
Q

Lecture 9

A
127
Q

What is Natural Language Processing (NLP)?

A

NLP is a field of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language. It often involves tasks such as text similarity, sentiment analysis, topic extraction, summarization, question answering, relationship extraction, and language generation.

128
Q

What is spam detection and how is it performed?

A

Spam detection is a classification problem where the input (X) is email text and the output (y) is whether the email is spam or not. A typical approach involves train-test splitting and identifying words indicative of spam. For example, phrases like “send us your password” are strong spam indicators.

129
Q

How does Bayesian spam detection work?

A

Bayesian spam detection calculates the probability of a phrase being spam based on the likelihood of each word occurring in spam emails. For example, P(“send us your password”|spam) = P(“send”|spam) × P(“us”|spam) × P(“your”|spam) × P(“password”|spam). Each word is treated independently.

130
Q

What are tokenization, stemming, and lemmatization, and why are they important?

A

Tokenization: Breaks a sentence into individual words or tokens.

Stemming: Reduces words to their root forms (e.g., “running” becomes “run”).

Lemmatization: Reduces words to their base or dictionary forms (e.g., “better” becomes “good”).
These methods group variations of words together to standardize text for analysis.

131
Q

Why is tokenization language-dependent?

A

Different languages have unique structures, making tokenization challenging.

132
Q

What is a bag-of-words representation, and how is it used?

A

Bag-of-words represents text as a frequency distribution of words without considering grammar or order. It is used in tasks like document classification and text clustering. Each document is represented as a vector of word frequencies or binary indicators.

133
Q

What is a sparse matrix representation in NLP?

A

A sparse matrix represents text where rows are documents (e.g., songs), columns are words, and values are word counts or binary indicators (1 if the word exists, 0 otherwise). This approach is memory-efficient as most entries are zero.

134
Q

How is PCA(Principle Component Analysis) applied to NLP data?

A

PCA reduces high-dimensional vector representations (e.g., word embeddings or bag-of-words) into lower dimensions for clustering and classification. It identifies key features while preserving variance in the data.

135
Q

What is TF-IDF(Term Frequency-Inverse Document Frequency), and how does it work?

A

TF-IDF identifies keywords by balancing term frequency in a document and inverse frequency across documents. Formula: TF-IDF = TF(term) × IDF(term). Higher scores indicate words more unique to a document.
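A minimal sketch of the bag-of-words and TF-IDF representations on a toy corpus, assuming scikit-learn (the documents are illustrative only).

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the metal band played loud metal",
            "the pop song was a love song",
            "the band played a love song"]

    counts = CountVectorizer().fit_transform(docs)   # sparse document-term count matrix
    tfidf = TfidfVectorizer()
    scores = tfidf.fit_transform(docs)               # TF x IDF weight per (document, term) pair
    print(tfidf.get_feature_names_out())
    print(scores.toarray().round(2))                 # words unique to a document score higher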

136
Q

How can word frequencies compare text corpora

A

Word frequencies in metal lyrics are compared to other genres by plotting a scatterplot of frequency ratios. Points deviating from the diagonal highlight words unique to a genre.

137
Q

How can TF-IDF be extended to bigrams and trigrams?

A

TF-IDF can calculate scores for word pairs (bigrams) or triples (trigrams) to identify multi-word expressions indicative of specific topics or genres. This captures context missed by single-word analysis.

138
Q

What are some applications of vector representations in NLP?

A

Vector representations enable tasks like clustering similar documents, classifying text, and semantic analysis. For example, vectors can identify similar songs or distinguish topics in a corpus.

139
Q

What are semantic challenges in NLP?

A

Synonymy: Different words with similar meanings (e.g., “big” vs. “large”).

Polysemy: Words with multiple meanings (e.g., “bank”).

Context dependency: Meaning influenced by surrounding words.

140
Q

Lecture 10

A
141
Q

What are the challenges with sparse document-term matrices?

A

Sparse matrices are high-dimensional with many zeros, making clustering and computational operations inefficient. Dimensionality reduction techniques, such as topic modeling, help address these challenges.

142
Q

What is topic modeling, and why is it useful?

A

Topic modeling is a statistical method to identify topics within a collection of documents. It reduces dimensionality and summarizes large datasets, providing insights like “this document is 80% sports and 20% education.”
As it is a clustering-style method, there is no need to know in advance what the topics are.

143
Q

What are two commonly used topic modeling techniques?

A

Latent Semantic Indexing (LSI): Uses singular value decomposition to identify patterns in relationships between terms and documents.

Latent Dirichlet Allocation (LDA): A probabilistic model that assumes documents are mixtures of topics and topics are distributions over words.

144
Q

What are the limitations of topic modeling?

A

Does not generalize well to unseen documents.

Sensitive to parameter choices and prone to overfitting.

Performs poorly on short texts.

145
Q

What is a word embedding, and why is it significant?

A

Word embedding represents words as dense vectors in a continuous vector space. They capture semantic and syntactic relationships, enabling tasks like clustering, similarity measurement, and downstream machine learning tasks.
In other words, semantic relationships between words become relationships between vectors in the vector space.
Words that are similar become vectors that are close together.

146
Q

How are word embeddings generated?

A

Word embeddings are created by training neural networks to predict the context of words:

Use one-hot encoded vectors as inputs.

Train a two-layer neural network with fewer hidden neurons than vocabulary size.

Minimize the error between predicted and actual next words.

The resulting vectors in the hidden layer are the embeddings.
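A minimal sketch of learning embeddings from a toy corpus, assuming the gensim library is available (the sentences and hyperparameter values are illustrative only); sg=1 selects the skip-gram objective and sg=0 the CBOW objective described in the next card.

    from gensim.models import Word2Vec

    sentences = [["the", "ship", "sailed", "across", "the", "sea"],
                 ["the", "boat", "sailed", "across", "the", "lake"],
                 ["the", "king", "ruled", "the", "country"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["ship"].shape)                  # a dense 50-dimensional vector
    print(model.wv.most_similar("ship", topn=2))   # nearby words in the embedding space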

147
Q

What is the difference between Continuous Bag of Words (CBOW) and Skip-gram models?

A

CBOW: Predicts a word based on its surrounding context.

Skip-gram: Predicts the context given a single word.

Both generate word embeddings, but Skip-gram is better for smaller datasets.

148
Q

What are some properties of word embeddings?

A

Similar words have similar vectors (e.g., “ship” and “boat”).

Capture semantic relationships (e.g., “king - man + woman = queen”).

Support multilingual embeddings, aligning words with similar meanings across languages.

149
Q

What is polysemy, and how does it affect word embeddings?

A

Polysemy refers to words with multiple meanings (e.g., “queen” as a monarch or a band). Standard embeddings mix these meanings, but sense embeddings can separate them by training with labeled sense-specific data.

150
Q

What are sense embeddings?

A

Sense embeddings represent different meanings of a word separately. For example, “queen” could have one vector for the monarchy and another for the band. These embeddings require labeled training data.

151
Q

What are sentence embeddings, and how are they different from word embeddings?

A

Sentence embeddings represent entire sentences as vectors. They capture the overall meaning of sentences, making them useful for tasks like clustering, paraphrase detection, and semantic similarity measurements.

152
Q

What are some applications of sentence embeddings?

A

Measuring semantic similarity.

Clustering similar texts.

Paraphrase mining.

Automatic translation.

Text summarization.

153
Q

How can sentence embeddings integrate text and images?

A

By aligning embeddings of text and images into the same vector space, models can perform tasks like caption generation and image-to-text retrieval.

154
Q

What is the core challenge addressed by topic modeling?

A

Reducing high-dimensional and sparse document-term matrices into meaningful and smaller representations that highlight topics.

155
Q

Why is it hard to cluster sparse vector representations of text?

A

Sparse vectors contain many zero values, making it challenging to calculate meaningful distances or similarities.

156
Q

How does dimensionality reduction improve text analysis?

A

It simplifies sparse matrices, reducing computational complexity and improving clustering and visualization of text data.

157
Q

What are semantic relationships in word embeddings? Provide examples.

A

Semantic relationships are meanings reflected as vector operations. Examples:

Gender: “king - man + woman = queen”

Verb tense: “run - ran”

Country-Capital: “France - Paris”

158
Q

How do neural networks predict next words in word embedding training?

A

Neural networks:

Input a word (one-hot vector).

Pass through a hidden layer.

Predict the next word (output vector).

Update weights to minimize prediction error.

159
Q

What is the importance of multilingual embeddings?

A

Multilingual embeddings align words with similar meanings across languages, enabling translation and cross-lingual tasks.

160
Q

What are the drawbacks of using topic models for short texts?

A

Short texts lack enough word diversity, making it difficult for models to reliably identify underlying topics.

161
Q

What are common dimensionality reduction methods besides topic modeling?

A

Principal Component Analysis (PCA).

Singular Value Decomposition (SVD).

Autoencoders.

These methods also reduce high-dimensional data into lower-dimensional representations.

162
Q

Lecture 11(Watch this if you cba)

A
163
Q

What type of algorithm is KNN, and what is its main assumption?

A

KNN is a supervised learning algorithm primarily used for classification problems. It assumes that similar data points will exist close to each other based on a distance metric (e.g., Euclidean distance).

164
Q

How does KNN determine the class of a new data point?

A

KNN determines the class of a new data point by identifying the k nearest neighbors and assigning the class based on the majority class among these neighbors. The parameter k is a hyperparameter of the algorithm.

165
Q

What is the role of the hyperparameter k in KNN?

A

The hyperparameter k defines the number of nearest neighbors considered for classifying a new data point. A smaller k may result in overfitting, while a larger k may oversimplify the classification.

166
Q

Describe the steps to classify a point using KNN.

A

Compute the distance of the new point to all other points in the dataset.

Select the k nearest neighbors.

Determine the class of the new point by majority voting among the selected neighbors.
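A minimal sketch of these steps, assuming scikit-learn (the iris dataset and k=5 are illustrative only). Features are scaled first because the distance metric is sensitive to feature ranges, and weights="distance" gives the weighted KNN variant discussed in a later card.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, weights="distance"))
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))   # distance-weighted vote among the 5 nearest neighbours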

167
Q

Why is KNN considered a non-parametric algorithm?

A

KNN is non-parametric because it makes no assumptions about the underlying data distribution and instead bases predictions directly on the data.

168
Q

How do outliers affect KNN?

A

Outliers can negatively influence KNN by skewing the decision boundary, especially if the outlier’s class differs from that of the majority of nearby points. This can lead to incorrect classifications.

169
Q

How does class imbalance impact KNN, and how can it be addressed?

A

Class imbalance can bias KNN towards the majority class, leading to poor performance for the minority class. This can be addressed by weighting the data points based on the inverse of their distance to the query point.

170
Q

How can you select the optimal value of k in KNN?

A

The optimal k can be selected by:

Iteratively evaluating performance metrics (e.g., precision, recall) on a test set while varying k.

Using the square root of the number of training samples as a heuristic for k.

For binary classification, choosing an odd value of k to avoid ties.

Maybe further research this to get a stronger understanding

171
Q

What is Weighted KNN, and when would you use it?

A

Weighted KNN assigns greater importance to closer neighbors by weighting their contribution inversely to their distance. It is useful when closer points are expected to be more relevant for classification.

172
Q

How does KNN differ from other machine learning algorithms?

A

KNN is an instance-based learning algorithm that constructs hypotheses directly from training instances. Its complexity grows with the dataset size, making it less efficient for large datasets. It is best suited for low-dimensional data and requires data normalization for consistent performance.

173
Q

Why is normalization important in KNN?

A

Normalization ensures all features contribute equally to the distance metric, preventing features with larger ranges from dominating the results.

174
Q

How can categorical data be handled in KNN?

A

Categorical data can be converted into numerical format using techniques like integer encoding. For example, “Overcast” could be encoded as 0, “Rainy” as 1, and “Sunny” as 2.

175
Q

How are features combined in a KNN dataset?

A

Features can be combined into a single dataset using tools like the zip function, which creates tuples of feature values for each observation.
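A minimal sketch (plain Python) of integer encoding plus zip, assuming the toy weather example used in these cards:

weather = ["Overcast", "Rainy", "Sunny", "Sunny"]
temperature = [20, 15, 30, 28]

encoding = {"Overcast": 0, "Rainy": 1, "Sunny": 2}      # integer-encode the categories
weather_encoded = [encoding[w] for w in weather]

features = list(zip(weather_encoded, temperature))      # one tuple per observation
print(features)   # [(0, 20), (1, 15), (2, 30), (2, 28)]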

176
Q

What are some key points about K nearest Neighbour classification?

A

KNN is a simple yet powerful supervised learning algorithm.

It is affected by class imbalance, outliers, and feature scaling.

Weighted KNN and optimal k selection can improve performance.

KNN is computationally expensive for large datasets but works well for small, low-dimensional datasets.

177
Q

Lecture 12 06/11

A
178
Q

What is a Support Vector Machine (SVM)?

A

A Support Vector Machine (SVM) is a supervised machine learning model used for classification, regression, and clustering problems. Its goal is to find a hyperplane that separates data points into distinct classes by maximizing the margin between the closest data points of different classes.

179
Q

What are the advantages of using SVMs?

A

Advantages of SVMs include:

Effective in high-dimensional spaces.

Versatile, as different kernel functions can be specified for decision functions.

Well-suited for both linearly separable and non-linearly separable data using the kernel trick.

Robust to overfitting when the dimensionality of data is higher than the number of samples.

180
Q

What are support vectors in SVM?

A

Support vectors are the data points closest to the hyperplane that influence its position and orientation. These points are critical in defining the decision boundary and maximizing the margin.

181
Q

What is the margin in SVM?

A

The margin is the distance between the hyperplane and the nearest data points from each class. SVM aims to maximize this margin to improve the generalization of the classifier.

182
Q

What is a maximum margin classifier?

A

A maximum margin classifier selects a decision boundary that maximizes the margin between classes. However, it can be sensitive to outliers, potentially resulting in misclassifications.

183
Q

How does SVM handle outliers?

A

SVM handles outliers by introducing a soft margin, which allows for some misclassifications. This approach improves the model’s robustness and generalization by balancing bias and variance.

184
Q

What is a soft margin in SVM?

A

A soft margin allows for misclassification of some data points to achieve a better tradeoff between maximizing the margin and minimizing classification error, especially in the presence of outliers.

185
Q

What is the purpose of nonlinear transformation in SVM?

A

Nonlinear transformation maps data from its original feature space to a higher-dimensional space, making it linearly separable in the transformed space. This transformation allows SVM to handle complex, non-linear relationships.

186
Q

How does the kernel trick work in SVM?

A

The kernel trick allows SVM to compute the relationships in a higher-dimensional feature space without explicitly transforming the data. It uses a kernel function to calculate the dot product in this space, reducing computational complexity.
^understand this more if time

187
Q

What is a kernel function?

A

A kernel function is a mathematical function that takes two input vectors from the original space and returns their dot product in the transformed feature space. Common kernel functions include:

Linear kernel

Polynomial kernel

Radial Basis Function (RBF) kernel

Sigmoid kernel

188
Q

How does the second-degree polynomial kernel function work?

A

The second-degree polynomial kernel computes the dot product in the transformed space by evaluating functions of the original components. For example, for inputs A1 and A2, it may use terms like (A1^2, A2^2, and A1×A2).
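A minimal numeric check (numpy assumed) that the second-degree polynomial kernel K(a, b) = (a · b)^2 equals a plain dot product in a transformed space built from those terms (with a sqrt(2) factor on the cross term so the identity is exact):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

kernel_value = np.dot(a, b) ** 2                         # (1*3 + 2*4)^2 = 121

def transform(v):
    # map (v1, v2) -> (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit_value = np.dot(transform(a), transform(b))      # also 121
print(kernel_value, explicit_value)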

189
Q

What is the bias-variance tradeoff in SVM?

A

Regularization parameter (C): Controls the tradeoff between achieving a low training error (low bias) and maintaining simplicity (low variance).

Gamma parameter: Defines the influence of a single training example, controlling model complexity and generalization.
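A minimal sketch (scikit-learn assumed) showing where C and gamma appear when fitting an RBF-kernel SVM; the dataset and parameter values are for illustration only:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C: larger values -> harder margin, lower bias, higher variance.
# gamma: larger values -> each training example has a more local influence.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy, for illustration only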

190
Q

What are the key takeaways about SVM?

A

SVMs are powerful for linear and nonlinear classification tasks.

Support vectors, hyperplanes, and margins are fundamental concepts.

Nonlinear transformations and kernel tricks enable handling complex data structures.

Practical applications like the IRIS dataset and face recognition illustrate SVM’s versatility.

191
Q

Lecture 13 11/11

A
192
Q

What is a decision tree?

A

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance outcomes, resource costs, and utility. A decision tree is one common way to display an algorithm that only contains conditional statements.

193
Q

What is the goal of decision tree learning?

A

The goal is to create a model that predicts the value of a target variable based on several input variables. This type of decision tree is called a classification tree.

194
Q

What is the difference between classification trees and regression trees?

A

Classification trees predict categorical outcomes (e.g., “yes” or “no”), while regression trees predict continuous numeric values. This lecture focuses exclusively on classification trees.

195
Q

How do you interpret a classification tree?

A

To interpret a classification tree, start at the root node and follow the branches based on whether conditions are true or false until reaching a leaf node. The path determines the classification. Typically, a “true” condition moves to the left branch, and a “false” condition moves to the right branch. Leaf nodes represent the final classification or prediction.

196
Q

What are the key components of a decision tree?

A

The key components are:

Root Node: The starting point of the tree.

Internal Nodes: Represent decisions based on feature values.

Branches: Show the outcome of decisions.

Leaf Nodes: Provide the final classification or decision.

197
Q

What is Gini impurity?

A

Gini impurity measures the likelihood of incorrectly classifying a randomly chosen element if its classification is based on the distribution of classes in a dataset. A Gini impurity of 0 indicates pure leaves, where all samples belong to a single class.

198
Q

Why is Gini impurity used in building decision trees?

A

Gini impurity quantifies the purity of a node and helps decide which feature to split on at each step of the tree construction. The feature with the lowest Gini impurity after splitting is chosen to maximize classification accuracy.

199
Q

How do you calculate the weighted Gini impurity for a split?

A

To calculate the total Gini impurity for a split:

Calculate the Gini impurity for each leaf node.

Weight each leaf’s impurity by the proportion of samples in that leaf relative to the total samples.

Sum the weighted impurities.
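A minimal sketch of this calculation in plain Python, using a hypothetical split with made-up class counts:

def gini(class_counts):
    # Gini impurity = 1 - sum of squared class proportions
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# Hypothetical split: left leaf has 8 "yes" / 2 "no", right leaf has 1 "yes" / 9 "no".
left, right = [8, 2], [1, 9]
n_left, n_right = sum(left), sum(right)
n_total = n_left + n_right

weighted = (n_left / n_total) * gini(left) + (n_right / n_total) * gini(right)
print(gini(left), gini(right), weighted)   # 0.32, 0.18, 0.25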

200
Q

What is pre-pruning in decision trees?

A

Pre-pruning, or early stopping, limits the growth of a decision tree to prevent overfitting by:

Setting a maximum tree depth.

Requiring a minimum number of samples for splitting a node.

Ensuring each leaf has a minimum number of samples. This controls the model complexity and improves generalization.

201
Q

What is post-pruning in decision trees?

A

Post-pruning simplifies an already built decision tree by removing nodes and subtrees that do not improve predictive performance. This is done by comparing error terms before and after removing nodes and retaining the simpler tree if it performs similarly or better.

202
Q

How is feature importance determined in decision trees?

A

Feature importance is calculated based on the reduction in impurity achieved by a feature across all splits where it is used. Features contributing the most to reducing impurity are considered the most important. This can be visualized through importance scores.

203
Q

What are common issues with decision trees, and how are they mitigated?

A

Overfitting: Pruning (pre- or post-pruning) or limiting tree depth.

Bias towards features with more levels: Use impurity measures like Information Gain Ratio.

Instability: Use ensemble methods like Random Forests or boosting.

204
Q

How do you split numeric features in decision trees?

A

Numeric features are split by:

Sorting feature values in ascending order.

Calculating potential split points as the average of adjacent values.

Evaluating the impurity at each split point and choosing the one with the lowest impurity.

205
Q

What are the steps to implement pre-pruning in practice?

A

Set parameters such as:

max_depth for maximum tree depth.

min_samples_split for the minimum samples required to split.

min_samples_leaf for the minimum samples per leaf.

Train the tree with these constraints.

Evaluate performance to ensure the tree generalizes well.
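A minimal sketch of these constraints with scikit-learn (assumed); the dataset and parameter values are for illustration only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # check generalization on held-out data
print(tree.feature_importances_)    # impurity-based feature importance scores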

206
Q

How does post-pruning improve decision tree performance?

A

Post-pruning removes redundant nodes or subtrees after the tree is built, simplifying the model and reducing overfitting. It compares error metrics before and after pruning and keeps the simpler structure if predictive accuracy is maintained or improved.

207
Q

What is the cost complexity pruning method?

A

Cost complexity pruning introduces a penalty parameter alpha to balance tree complexity and error. It minimizes the cost function R_alpha(T) = R(T) + alpha * |T|, where R(T) is the classification error of tree T and |T| is the number of leaf nodes. Trees are pruned iteratively, and alpha is chosen using cross-validation.
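A minimal sketch with scikit-learn (assumed), where the penalty parameter is exposed as ccp_alpha; the alpha value below is arbitrary:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)                    # candidate alpha values for this data

pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(pruned.get_n_leaves())              # typically fewer leaves than the unpruned tree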

208
Q

What is an instance based learning algorithm?

A

Instance-based learning algorithms are a category of machine learning algorithms that rely on the specific training instances (or examples) to make predictions. Instead of explicitly constructing a general model from the training data, these algorithms store the data and use it directly during the prediction phase. The predictions are typically based on a comparison between the new data point and the stored training examples.

Key Features:
Storage of Training Data: The algorithm retains all or most of the training data.
Similarity Measure: Predictions are made by comparing the new instance to the stored data using a similarity or distance metric, such as Euclidean distance.
Lazy Learning: They are often referred to as lazy learning algorithms because they defer the model-building process until a prediction is required. This contrasts with “eager learning,” where a model is built during training.

209
Q

Lecture 14 13/11

A
210
Q

What are the key components of a social network?

A

Social networks consist of:

Nodes: Represent individuals or entities (e.g., people).

Edges: Represent relationships or connections between nodes.

Topology: Refers to the structure of the network.

Communities: Groups within the network where nodes are densely connected.

Centrality: Measures the importance of nodes in the network.

211
Q

How do online social networks provide data for analysis?

A

Online social networks provide structured data through:

Ego networks: Networks focused on a specific node and its direct connections.

Metrics like modularity and centrality: Help identify communities and influential nodes. Tools like Gephi can be used for visualization and analysis.

212
Q

What types of alternative edges are used to construct proxy social networks?

A

Alternative edges include:

Retweets: Interactions on platforms like Twitter.

Mentions: Direct references to nodes.

Co-occurrence: Nodes appearing together in contexts such as articles.

Citations: References to other works or authors.

Co-authorship: Collaborations between authors.

213
Q

What does centrality measure in a network?

A

A node’s central position in the network.

Its importance, influence, and power within the network. Centrality is fundamental for identifying influential nodes.

214
Q

List and define the key types of centrality.

A

Degree Centrality: Measures the number of edges connected to a node.

In-degree: Number of incoming edges.

Out-degree: Number of outgoing edges.

Eigenvector Centrality: Considers the importance of a node’s neighbors. High centrality means a node is connected to other highly influential nodes.

PageRank: A variation of eigenvector centrality where importance is weighted by the out-degree of neighboring nodes.

Closeness Centrality: Measures the mean shortest path from a node to all other nodes.

Betweenness Centrality: Counts the number of shortest paths passing through a node, indicating its role as a bridge.
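A minimal sketch computing these measures with networkx (an assumption; the lecture only mentions Gephi for visualization) on a small built-in social network:

import networkx as nx

G = nx.karate_club_graph()   # a classic small social network

degree      = nx.degree_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
pagerank    = nx.pagerank(G)
closeness   = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)

# The node with the highest betweenness acts most strongly as a "bridge".
print(max(betweenness, key=betweenness.get))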

215
Q

Why is degree centrality important, and what does it measure?

A

Degree centrality measures the number of edges a node has. It is important for:

Evaluating a node’s influence (more edges = more connections).

Assessing access to information and prestige in directed networks.

216
Q

How does eigenvector centrality differ from degree centrality?

A

Eigenvector centrality accounts for the importance of a node’s neighbors. A node with fewer but influential connections may have higher centrality than a node with many insignificant connections.

217
Q

How does PageRank centrality work, and where is it applied?

A

PageRank centrality is a variant of eigenvector centrality. It assigns importance based on:

The importance of neighboring nodes.

Their out-degree (number of outgoing links). It is famously used by Google’s search engine to rank web pages.

218
Q

What does closeness centrality measure, and what are its limitations?

A

Closeness centrality measures the mean distance from a node to all others. It indicates how quickly information can spread from a node. Limitations include difficulty in comparing nodes across different components of disconnected networks.

219
Q

Why is betweenness centrality significant in a network?

A

Betweenness centrality measures the extent to which a node acts as a bridge in the network. It:

Indicates control over the flow of information.

Reflects a node’s power as a broker.

Can be extended to weight paths inversely, highlighting robustness.

220
Q

What scenarios favor different centrality measures?

A

Degree Centrality: Identifying highly connected nodes in undirected or simple networks.

Eigenvector/PageRank: Detecting nodes with influential neighbors in undirected/directed networks.

Closeness Centrality: Evaluating nodes’ ability to quickly spread information.

Betweenness Centrality: Identifying brokers or bottlenecks in communication.

221
Q

Who contributed significantly to the early study of networks?

A

Key contributors include:

JL Moreno (1934): Developed sociograms and analyzed social structures like friendship networks.

Leonhard Euler (1736): Solved the Königsberg Bridge Problem, founding graph theory.

Duncan Watts and Steven Strogatz (1998): Introduced small-world networks.

Albert-László Barabási (2003): Studied scale-free networks.

222
Q

What are some applications of network science in real-world scenarios?

A

Applications include:

Social network analysis to study influence and community detection.

Epidemiology for modeling disease spread.

Web science for ranking pages and analyzing internet structure.

Transportation networks to optimize flow and connectivity.

223
Q

What emergent properties can arise in complex networks?

A

Emergent properties include:

Community structure: Groups of densely connected nodes.

Resilience: Ability to withstand node or edge removal.

Synchronization: Coordinated behavior across nodes.

Power-law distribution: Characteristic of scale-free networks.

224
Q

There is going to be an exam question on networks; look particularly at measures of closeness.

A
225
Q

Lecture 15 18/11

A
226
Q

What is unsupervised learning?

A

Unsupervised learning is a type of machine learning where algorithms learn patterns and structures in data without labeled outcomes. It is used to group or segment data and identify hidden structures.

227
Q

In which scenarios is unsupervised learning applied?

A

Unsupervised learning is applied in:

Customer segmentation (e.g., identifying single parents, young party-goers).

Fraud detection (e.g., analyzing bank transactions, GPS logs, social media bots).

Identifying new animal species.

Creating classes needed for supervised classification tasks.

228
Q

What are the general steps of clustering?

A

The general steps of clustering include:

Iterating through all data points.

Measuring the distance or similarity between points.

Grouping data points into clusters where intra-cluster similarity is higher than inter-cluster similarity.

229
Q

Why is clustering considered unsupervised learning?

A

Clustering is unsupervised because it does not rely on labeled training data. Instead, it identifies inherent patterns or groupings in the dataset.

230
Q

What are key examples of clustering applications?

A

Key examples include:

Customer segmentation for marketing.

Identifying communities in networks.

Grouping similar text documents or images.

231
Q

What is K-means clustering?

A

K-means is a clustering algorithm that partitions data into K clusters. It assigns data points to the cluster with the nearest centroid, which is iteratively updated to minimize intra-cluster variance. The user specifies K.

232
Q

What is DBSCAN, and how does it differ from K-means?

A

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies core regions of high density and expands clusters from these regions. Unlike K-means, DBSCAN does not require specifying the number of clusters and can detect noise.

233
Q

What is hierarchical clustering?

A

Hierarchical clustering creates a tree of clusters (a dendrogram). It can be:

Agglomerative: Bottom-up approach where clusters are merged iteratively.

Divisive: Top-down approach where clusters are split iteratively.

234
Q

What is the difference between hard and soft clustering?

A

Hard clustering: Each data point belongs to one cluster (e.g., K-means).

Soft clustering: Data points can belong to multiple clusters with probabilities (e.g., Gaussian Mixture Models).

235
Q

Why are distance metrics important in clustering?

A

Distance metrics quantify the similarity or dissimilarity between data points, guiding the formation of clusters. The choice of metric affects the clustering results.

236
Q

What is the Euclidean distance?

A

The Euclidean distance (L2 norm) is the straight-line distance between two points in space. It is calculated as:
sqrt[(x2-x1)^2 + (y2-y1)^2]

237
Q

Lecture 16 20/11

A
238
Q

Why is dimensionality reduction important in data analysis?

A

Dimensionality reduction is important because it:

Removes noise from the data.

Focuses on the features or combinations of features that are actually important.

Reduces computational complexity by requiring less number-crunching, making the analysis more efficient.

239
Q

What are the two main approaches to dimensionality reduction?

A

The two main approaches are:

Feature selection: Identifies and retains the most important features in the data.

Feature extraction: Combines existing features to create new, informative features.

240
Q

What is variance thresholding, and when is it used?

A

Variance thresholding is a filter method used to eliminate features with low variance, as they typically contain less information. Steps:

Calculate the variance of each feature.

Drop features with variance below a set threshold.

Ensure features are on the same scale by normalizing or standardizing beforehand.
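A minimal sketch with scikit-learn's VarianceThreshold (assumed); the toy matrix has one constant feature that gets dropped:

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2.0, 1.0],
     [0, 1.5, 5.0],
     [0, 2.5, 3.0]]          # the first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)        # (3, 2): the constant feature was dropped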

241
Q

How does forward search work for feature selection?

A

Forward search is a wrapper method that:

Creates models using one feature at a time.

Selects the best-performing feature.

Iteratively adds one feature at a time to the selected set, testing performance.

Repeats until a predefined number of features are chosen.

242
Q

What is recursive feature elimination, and how does it differ from forward search?

A

Recursive feature elimination (RFE) is another wrapper method that:

Starts with all features.

Removes one feature at a time, building models with the remaining features.

Selects the best subset after each iteration.
RFE differs from forward search in that it begins with all features and systematically removes the least important ones, while forward search adds features incrementally.
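A minimal sketch of RFE with scikit-learn (assumed); the estimator and the number of features to keep are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the 2 retained features
print(rfe.ranking_)    # 1 = kept; larger numbers were eliminated earlier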

243
Q

What are embedded methods in feature selection?

A

Embedded methods perform feature selection during the model training process. An example is decision trees, which:

Split data based on feature importance (e.g., Gini impurity or information gain).

Naturally prioritize features that reduce uncertainty or variance the most.

244
Q

What is the core idea behind Principal Component Analysis (PCA)?

A

PCA transforms the data into a new coordinate system where:

Each new axis (principal component) is orthogonal to the others.

Principal components are ordered by the amount of variance they capture.

The first principal component captures the most variance, followed by the second, and so on.

245
Q

How does PCA handle highly correlated features?

A

PCA is particularly effective with highly correlated features because:

It combines them into fewer uncorrelated components.

It reduces redundancy while retaining most of the variance in the data.

246
Q

What is the worst-case scenario for PCA?

A

The worst-case scenario occurs when:

All variables are equally important and uncorrelated.

PCA still works but does not provide an informative reduction in dimensions.

247
Q

Describe the technical steps to perform PCA.

A

Steps for PCA:

Compute the covariance matrix of the data.

Diagonalize the covariance matrix to find its eigenvalues and eigenvectors.

Use the eigenvectors as principal components and eigenvalues to measure variance captured.

Transform the data into the new coordinate system using the principal components.

Retain the first K principal components for dimensionality reduction.
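A minimal numpy sketch (assumed) of these steps on synthetic data with one deliberately correlated feature:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make feature 3 highly correlated with feature 1

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # 2. eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]       # 3. order components by variance captured
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 2
Z = Xc @ eigvecs[:, :K]                 # 4-5. project onto the first K components
print(eigvals / eigvals.sum())          # fraction of variance per component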

248
Q

What is t-SNE, and what is its primary goal?

A

t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique. Its goal is to:

Preserve the pairwise similarities of data points when mapping high-dimensional data into 2D or 3D space.

Scatter points in a lower dimension while ensuring the distribution of distances resembles the original.

249
Q

What are the key steps in t-SNE?

A

Key steps in t-SNE:

Compute pairwise distances between points in high-dimensional space.

Convert distances into probabilities using a Gaussian distribution.

Map points to a lower-dimensional space and fit a t-distribution to distances.

Minimize the difference between the two distributions using gradient descent.

250
Q

What are the limitations of t-SNE?

A

Limitations of t-SNE include:

High memory usage, which makes it unsuitable for large datasets.

Lack of interpretability in distances between far-apart clusters.

Dependency on hyperparameters, which can affect clustering results.

251
Q

It is probably worth watching a video on t-SNE

A
252
Q

How does UMAP differ from t-SNE?

A

Differences between UMAP and t-SNE:

UMAP runs faster and uses less memory.

UMAP can preserve both local and global structures in the data.

UMAP allows embeddings in more than three dimensions.

253
Q

What are common problems with both t-SNE and UMAP?

A

Common problems include:

Heavy reliance on hyperparameter tuning.

Cluster sizes and distances between clusters are often meaningless.

Axes in the resulting visualizations are not interpretable.

254
Q

Why is it important to critically assess dimensionality reduction techniques?

A

Critical assessment is important to:

Determine whether the technique achieves the desired goals.

Identify where the method succeeds or fails.

Suggest improvements or better alternatives for the given data.

255
Q

What are some best practices when using dimensionality reduction methods?

A

Best practices include:

Understanding the assumptions and limitations of each method.

Preprocessing data (e.g., scaling, removing outliers) appropriately.

Experimenting with multiple methods to identify the best fit for the data and use case.

Using visualizations and domain knowledge to validate results.

256
Q

Lecture 17 25/11

A
257
Q

What is the main advantage of hierarchical clustering?

A

Hierarchical clustering is useful because it represents various degrees of similarity through a tree structure (dendrogram), allowing data to be partitioned at different levels. This is particularly effective if the data has an underlying tree structure.

258
Q

What are dendrograms in hierarchical clustering?

A

Dendrograms are tree diagrams that show a hierarchy of clusters. Each node represents a cluster, with single-node clusters called singletons. They visualize how clusters are merged or split at various levels.

259
Q

What is the “bottom-up” approach in hierarchical clustering?

A

Also called agglomerative clustering, it starts with individual data points as separate clusters and merges them iteratively based on a distance matrix until all points form a single cluster or a stopping criterion is met.

260
Q

Why can’t we use a brute force approach to create all possible dendrograms?

A

The number of dendrograms grows exponentially with the number of leaves, making brute force computation impractical for even moderately sized datasets.

261
Q

How is distance between clusters computed in simple linkage?

A

In simple (single) linkage, the distance between clusters is defined as the shortest distance between any two points, one from each cluster. However, this can lead to long, chain-like clusters.

262
Q

How does complete linkage define the distance between clusters?

A

Complete linkage defines the distance as the farthest distance between any two points in the clusters. It tends to break large clusters into smaller ones.

263
Q

What is average linkage in hierarchical clustering?

A

Average linkage calculates the distance between clusters as the average of all pairwise distances between points in the clusters. It is computationally intensive but provides balanced clusters.

264
Q

What is centroid linkage?

A

Centroid linkage measures the distance between clusters as the distance between their centroids. It is biased toward forming spherical clusters.

265
Q

What is Ward’s method in hierarchical clustering?

A

Ward’s method joins clusters only if the merge minimizes the increase in the total within-cluster variance, making it biased toward spherical clusters as well.
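A minimal sketch of agglomerative clustering with scipy (an assumption), showing where the linkage criteria from the previous cards are chosen:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")           # also: "single", "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
# dendrogram(Z) would draw the tree if matplotlib is available.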

266
Q

What are common applications of hierarchical clustering?

A

It is widely used in fields like biology (phylogenetic tree construction), document clustering, and image segmentation. In phylogenetics, it helps create evolutionary trees using methods like Maximum Likelihood Estimates and Bayesian Inference.

267
Q

How do you determine the “right” cutoff in a dendrogram?

A

The cutoff depends on the desired number of clusters or the natural clustering structure in the data. In some cases, it is visually apparent, while in others, it requires experimentation.

268
Q

What is the core idea of DBSCAN?

A

DBSCAN groups data into clusters based on local density. A point is part of a cluster if the density around it, defined by a specified radius (eps) and minimum points (MinPts), exceeds a threshold.

269
Q

How does DBSCAN define density?

A

Density is defined by two hyperparameters:

eps (epsilon): The radius within which points are considered neighbors.

MinPts: The minimum number of points required within the radius to form a dense region.
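A minimal sketch of DBSCAN with scikit-learn (assumed), where MinPts is called min_samples; the data and hyperparameter values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, 10.0]]])          # one isolated point

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)                        # cluster ids; -1 marks noise points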

270
Q

What are the types of points in DBSCAN?

A

Core points: Points with at least MinPts neighbors within eps.

Border points: Points with fewer than MinPts neighbors but within the neighborhood of a core point.

Noise points: Points that are neither core nor border points.

271
Q

What is the result of a DBSCAN clustering?

A

All points within a cluster are reachable from each other through chains of neighboring points, each step no longer than eps, forming clusters of arbitrary shape while identifying outliers (noise points).

272
Q

What are the strengths of DBSCAN?

A

DBSCAN can:

Handle clusters of different shapes and sizes.

Identify noise and outliers effectively.

Work well for spatial data and when clusters are non-spherical.

273
Q

What are the limitations of DBSCAN?

A

Limitations include:

Struggles with varying densities in the same dataset.

Highly sensitive to the choice of eps and MinPts.

Can fail to identify clusters when data density varies significantly.

274
Q

How do eps and MinPts interact in DBSCAN?

A

The choice of eps determines the size of the neighborhood, and MinPts sets the density threshold. Together, they control the size and shape of clusters. Improper settings can lead to over- or under-clustering.

275
Q

How can you determine optimal values for eps and MinPts IN DBSCAN?

A

Common techniques include:

Using a k-distance plot to identify a natural elbow in distances.

Experimenting with multiple values and evaluating cluster validity using metrics like silhouette score.

276
Q

Lecture 18 27/11

A
277
Q

What is the main advantage of partitional clustering methods over hierarchical clustering?

A

Partitional clustering methods create a set of non-nested partitions corresponding to clusters. They require fewer comparisons, reducing computational complexity from O(n^2) (in hierarchical clustering) to roughly O(Kn), where K is the number of clusters and n is the number of data points.

278
Q

What are the key steps of the K-means clustering algorithm?

A

The K-means algorithm involves the following steps:

Input the data points and a given number of clusters K.

Choose K random data points as the initial cluster centroids.

Assign each data point to the closest centroid.

Recompute the centroids using the current points in each cluster.

Check for a convergence or stopping criterion. If not met, repeat steps 3 to 5.
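A minimal sketch of these steps with scikit-learn's KMeans (assumed); the synthetic data and K = 2 are for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids after convergence
print(km.inertia_)           # sum of squared errors (SSE)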

279
Q

What are the possible stopping criteria in K-means clustering?

A

Possible stopping criteria include:

Few or no reassignments of data points to different clusters.

Minimal or no change in centroids.

Minimal or no change in the sum of squared errors (SSE).

280
Q

What are the advantages of K-means clustering?

A

Advantages of K-means include:

Efficiency: For n data points in d dimensions, K clusters, and up to t iterations, the runtime is O(t * K * n * d), making it scalable for large datasets.

Simplicity: It is easy to understand and implement.

281
Q

What are the limitations of K-means clustering?

A

Limitations of K-means include:

It only works if a centroid can be defined, which may not be possible for categorical data.

The need to specify the value of K.

Sensitivity to outliers.

Struggles with clusters of varying size, shape, or density.

Dependence on the initial choice of centroids.

282
Q

How can the limitations of K-means clustering be addressed?

A

Solutions include:
Pre-processing:

Normalize data (scale to [0, 1]) or standardize it (subtract mean, divide by standard deviation).

Eliminate outliers.

Post-processing:

Eliminate small clusters representing outliers.

Split clusters with high SSE or merge clusters with low SSE.

Alternatively, use more advanced clustering algorithms, such as Gaussian Mixture Models (GMM).

283
Q

What distinguishes Gaussian Mixture Models (GMM) from K-means clustering?

A

GMM represents clusters using probability distributions, fitting the mean (μ) and standard deviation (σ) of each Gaussian component. Unlike K-means, which makes hard assignments, GMM uses soft assignments, where a point belongs to each cluster with a certain probability. GMM assumes elliptical clusters, while K-means assumes spherical clusters.
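A minimal sketch contrasting the two assignment styles with scikit-learn (assumed), using synthetic data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(4, 1.0, (50, 2))])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)[:5])  # hard labels
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5])   # soft assignment: a probability per cluster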

284
Q

What is the role of the likelihood function in GMM?

A

The likelihood function in GMM measures the probability of data points given the Gaussian distributions. The algorithm maximizes the log-likelihood to fit the parameters (μ, σ, and π) of the model. This ensures the Gaussian components accurately represent the underlying data structure.

285
Q

What are the main clustering validation metrics?

A

Validation metrics include:

External validation: Measures how clustering labels align with ground-truth labels.

Internal validation: Evaluates clustering quality using internal measures like cohesion (within-cluster distances) and separation (between-cluster distances).

Relative validation: Compares multiple clustering results to identify the best fit.

286
Q

Between K-means and GMM, which method assigns points softly to clusters?

A

GMM assigns points softly based on a probability distribution, while K-means assigns points in a hard manner and definitively classifies them.

287
Q

Out of K-means and GMM, which clustering method assumes spherical clusters and which assumes elliptical?

A

K-means assumes spherical clusters and GMM assumes elliptical clusters.

288
Q

How is the Silhouette Coefficient used in clustering validation?

A

The Silhouette Coefficient measures clustering quality for each data point:

a: mean distance to the other points in the same cluster (cohesion).

b: mean distance to the points in the nearest other cluster (separation).

The coefficient is calculated as:

s = (b - a) / max(a, b)

Values range from -1 (poor clustering) to 1 (excellent clustering).
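A minimal sketch computing silhouette values with scikit-learn (assumed), on synthetic data clustered by K-means:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # average coefficient over all points
print(silhouette_samples(X, labels)[:5])  # per-point values, usable for a bar chart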

289
Q

How can the Silhouette Coefficient be visualized?

A

The coefficient can be visualized using a bar chart where each bar represents the silhouette value of a data point, grouped by clusters. The average silhouette score can also be plotted to summarize clustering quality.

290
Q

What are some common preprocessing steps for clustering?

A

Preprocessing steps include:

Normalization (scaling data to [0, 1]) or standardization (subtracting mean, dividing by standard deviation).

Removing or handling outliers to avoid skewed results.

Feature selection to focus on relevant variables.

291
Q

Why does K-means struggle with clusters of varying sizes, shapes, or densities?

A

K-means assumes clusters are spherical and equally sized, leading to poor performance when clusters have irregular shapes, varying sizes, or densities. Points near boundaries can also be misclassified due to hard assignments.

292
Q

Lecture 19

A
293
Q

What is the input and output of computer vision systems?

A

Input: Images (e.g., photographs, video frames).

Output: High-level information about people, objects, or 3D structures, such as:

Object detection

Image segmentation

3D image reconstruction

Terrain modeling and position tracking (e.g., NASA Spirit rover applications).

294
Q

How are images represented in computer vision?

A

Images are represented as matrices of pixel values.

For grayscale images: A single 2D matrix (e.g., 400 x 400 pixels).

For colored images: Three 2D matrices, one for each color channel (red, green, and blue).

295
Q

What inspired deep neural networks, particularly in computer vision?

A

The visual cortex of the brain is organized into layers, with information flowing from one layer to another.

Research by Hubel & Wiesel (1959): Neurons in the visual cortex respond to specific patterns, such as lines of particular orientations.

Convolutional Neural Networks (CNNs) emulate this layered structure and hierarchical feature detection

296
Q

What is a perceptron, and how does it process input?

A

A perceptron is the simplest type of neural network.

Input: A vector of features (e.g., x1, x2, …, xK).

Output: A binary result (y = 0 or 1).

Each input is weighted, summed, and passed through an activation function to determine the output.

297
Q

What are the limitations of representing images as high-dimensional vectors for neural networks?

A

Scalability: A 100 x 100 pixel image would require 10,000 parameters per node.

Sensitivity: Networks are not robust to small changes in input (e.g., image translation or rotation).

Inefficiency: Does not leverage spatial correlations between nearby pixels.

298
Q

How do convolutional neural networks (CNNs) address these limitations?

A

CNNs use filters (kernels) to identify patterns such as edges or textures in images.

Spatial relationships between pixels are preserved via convolutions.

Shared weights reduce the number of parameters, enhancing scalability.

Robust to transformations (e.g., shifts, rotations) due to hierarchical feature learning.

299
Q

What is a convolution in CNNs?

A

A convolution involves sliding a filter (kernel) across an image and performing element-wise multiplications (dot products) between the filter and the image patch.

Resulting values are stored in a feature map, which highlights the presence of specific features.

Positive values in the feature map indicate where the filter detects features.
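A minimal numpy sketch (assumed) of a single filter sliding over a tiny image to produce a feature map; the image and kernel values are made up to show a vertical edge being detected:

import numpy as np

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)   # responds to dark-to-bright vertical edges

H = image.shape[0] - kernel.shape[0] + 1
W = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        patch = image[i:i + 2, j:j + 2]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise product, then sum

print(feature_map)   # positive values mark where the vertical edge is detected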

300
Q

What is a feature map in CNNs?

A

A feature map represents the regions of an image where a particular feature (e.g., edges, lines) is detected.

Positive values: The feature is present in the corresponding area.

Negative or zero values: The feature is absent.

301
Q

What are the key advancements in computer vision since 1980?

A

Deeper architectures: Introduction of more layers and complex networks (e.g., ResNet, AlexNet).

Data availability: Large datasets (e.g., ImageNet) and resources (e.g., GPUs, cloud computing).

Software tools: Frameworks like TensorFlow, PyTorch, and Keras simplify implementation.

Deep learning: The rise of “deep learning” as a subfield in the 2000s.

302
Q

What are the limitations of deep learning in computer vision?

A

Data requirements: Requires large amounts of labeled data.

Computational intensity: Demands powerful GPUs or cloud resources.

Uncertainty: Struggles with representing uncertainty; easily fooled by adversarial examples.

Optimization challenges: Difficult to fine-tune architectures and learning methods.

Interpretability: Neural networks often function as black boxes, making it hard to understand decisions.

303
Q

How does CNN backpropagation work for learning filters?

A

Filters in CNNs are initialized randomly and updated using the backpropagation algorithm.

Loss gradients with respect to the filter weights are computed and used to adjust the weights to minimize the loss.

The process repeats over multiple training iterations.

304
Q

What role do GPUs play in deep learning for computer vision?

A

GPUs excel at parallel processing, which is critical for training deep learning models.

They accelerate matrix multiplications and convolutions, enabling faster training on large datasets.

305
Q

What is image segmentation, and why is it important?

A

Image segmentation divides an image into meaningful segments, such as identifying individual objects or regions.

Applications include medical imaging (e.g., tumor detection) and autonomous vehicles (e.g., road and obstacle detection).

306
Q

What is the next step after creating a feature map in CNNs?

A

Use feature maps as inputs to:

Pooling layers: Downsample feature maps to reduce dimensionality while retaining important information.

Fully connected layers: Combine extracted features for final predictions (e.g., classification).

Customizing networks: Optimize learning rates, activation functions, and other parameters.

307
Q

Why are pooling layers used in CNNs?

A

Pooling reduces the spatial size of feature maps, improving computational efficiency.

Types include max pooling (selects maximum value in a patch) and average pooling (computes the mean value).

Helps achieve translation invariance by focusing on dominant features.

308
Q

How does computer vision benefit from modern resources and tools?

A

Access to large annotated datasets enables better training.

Free frameworks (e.g., TensorFlow, PyTorch) simplify development.

Cloud services provide scalable infrastructure for model training and deployment.

309
Q

What challenges remain in computer vision and deep learning?

A

Creating models that generalize well across diverse datasets.

Developing interpretable neural networks to improve trust.

Efficiently managing large-scale data and computational costs.

Representing uncertainty to prevent overconfidence in predictions.

310
Q

Lecture 20

A
311
Q

What is a feature map in a CNN?

A

A feature map is the result of applying a filter (kernel) to an input image. The convolution operation involves sliding the filter over the image and computing a dot product to extract features like edges or textures.

312
Q

What does the bias term (b) do in a CNN?

A

The bias term shifts the result of the convolution operation, allowing the feature map to detect patterns with varying intensities.

313
Q

What is the ReLU function, and why is it used?

A

ReLU (Rectified Linear Unit) is defined as ReLU(x) = max(0, x). It introduces non-linearity, allowing the model to learn complex patterns. It also helps avoid the vanishing gradient problem seen in sigmoid and tanh activation functions.

314
Q

What is pooling, and why is it used in CNNs?

A

Pooling is a downsampling operation that reduces the spatial dimensions of feature maps, making computation more efficient and reducing overfitting. The most common types are:

Max-pooling: Takes the maximum value in a region.

Average-pooling: Takes the average value in a region.

315
Q

What is the impact of downsampling on CNN performance?

A

Downsampling:

Reduces computational complexity.

Retains important features while discarding irrelevant details.

Helps prevent overfitting by generalizing the model.

316
Q

How does the CNN output transition into a traditional ML problem?

A

After extracting features and downsampling, the CNN output is often flattened into a vector and fed into a classifier like:

Logistic regression.

Fully connected layers of a neural network for final predictions.

317
Q

What is padding in CNNs, and why is it used?

A

Padding adds extra pixels (usually zeros) around the input image, which:

Maintains the spatial dimensions after convolution.

Ensures edge information is preserved.

318
Q

What is stride, and how does it affect the output?

A

Stride is the step size by which the filter moves across the input image. A larger stride:

Reduces the spatial dimensions of the output.

Speeds up computation but may lose detailed features.

319
Q

What are some limitations of ReLU and tanh activation functions?

A

ReLU: Discards all negative values, potentially causing “dead neurons.”

Tanh: Its derivatives approach zero for extreme input values, leading to the vanishing gradient problem.

320
Q

How does the softmax function work in classifiers?

A

Softmax converts raw output scores into probabilities for each class using the formula:

softmax(z_i) = exp(z_i) / sum_j exp(z_j)

It ensures the output probabilities sum to 1, making it suitable for multi-class classification.
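A minimal numpy sketch (assumed) of this formula, with the usual max-subtraction for numerical stability:

import numpy as np

def softmax(scores):
    z = scores - np.max(scores)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx [0.659, 0.242, 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # probabilities sum to 1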

321
Q

What is the learning rate in gradient descent?

A

The learning rate (commonly denoted eta or alpha) controls how much the model parameters are updated during training.

322
Q

What are the effects of large and small learning rates?

A

Large learning rates: Can overshoot the optimal solution, missing the minima.

Small learning rates: Can make training very slow or get stuck in suboptimal solutions.

323
Q

What techniques are used to adjust the learning rate?

A

Techniques include:

Decaying the learning rate over time (e.g., step, exponential, or 1/t decay schedules).

Using advanced optimizers like Adam, Adagrad, RMSprop, or incorporating momentum.

324
Q

What are hyperparameters in a CNN? Provide examples.

A

Hyperparameters are user-set configurations that control the learning process. Examples include:

Filter size and number of filters.

Padding and stride values.

Learning rate and decay.

Dropout rate.

Number of epochs and batch size.

Activation functions.

Number of hidden layers and neurons per layer.

325
Q

Why is customization important in neural networks?

A