Learning from data Flashcards

1
Q

Define data integration and the goal of data integration

A

Data integration is the practice of combining data from heterogeneous sources into a single coherent
data store.
Its goal is to provide users with consistent access to, and delivery of, data across a spectrum of
subjects and data structure types.

2
Q

Define a common user interface (manual data integration)

A

A hands-on approach where data
managers manually handle every step of the integration, from retrieval to presentation.

3
Q

Define middleware data integration

A

Uses middleware software to bridge and facilitate communication between different systems, especially between legacy and newer systems

4
Q

Define application-based data integration

A

Software applications locate, retrieve and integrate data by making
data from different sources and systems compatible with one another

5
Q

Define uniform data access

A

It provides a consistent view of data from diverse sources without moving or
altering it, keeping the data in its original location.

6
Q

Define common data storage (data warehousing)

A

It retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository.

7
Q

What is a pro and a con for a common user interface

A

Reduced cost, requires little maintenance, integrates a small number of data sources, user has total control.

Data must be handled manually at each stage, scaling a project up requires changing code, manual orchestration.

8
Q

What is a pro and a con for middleware data integration

A

Middleware software conducts the integration automatically, and the same way each time.

Middleware needs to be deployed and maintained.

9
Q

What is a pro and a con for application based integration

A

Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.

Requires specialist technical knowledge and
maintenance, complicated setup.

10
Q

What is a pro and a con for uniform data access

A

Lower storage requirements, provides a simplified view of the data to the end user, easier data access

Can compromise data integrity, data host systems are not designed to handle amount and frequency of data requests.

11
Q

What is a pro and a con for common data storage ( data warehousing )

A

Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity

Need to find a place to store a copy of the data, increased storage cost, and it requires technical experts to set up the integration and to oversee and maintain the data warehouse.

12
Q

What is the difference between supervised and unsupervised learning? Also what is Semi-Supervised learning?

A

Supervised learning algorithms use data with labelled outcomes while unsupervised learning algorithms use data without labelled outcomes.

Semi-supervised learning algorithms use both data with labelled outcomes and without labelled outcomes.

13
Q

What is the task of supervised learning?

A

The task is to learn a mapping function from possible inputs to outputs.

14
Q

What is the task of unsupervised learning

A

In unsupervised learning, our task is to try to “make sense of” data, as opposed to learning a mapping. This is as we have inputs but no associated responses.

15
Q

Strictly define a hyperparameter

A

A hyperparameter is a parameter that is not learned directly from the data but relates to
implementation

16
Q

Define the training and prediction phase in a ML model

A

In the training phase, an ML model learns the parameters that define the relationship between the features and the outcome variable. The more data, the better.

In the prediction phase, we take new observations, feed their values into the trained model, and obtain a prediction.

17
Q

What can we use to measure the quality of our predictions

A

Most models will define a loss function which is some quantitative measure
of how close our prediction is to the actual value. In addition there will also be an update rule that will determine how to update the model parameters

18
Q

Define the difference between regression and classification

A

Classification deals with assigning data to one of a set of predetermined, preexisting classes. Regression deals with continuous data, e.g. any continuous outcome like loss, revenue, or number of years; anything that can be answered with the question "how much?"

19
Q

Look over calculating linear regression coefficients by hand

A

https://ele.exeter.ac.uk/pluginfile.php/4546128/mod_resource/content/0/LfD-L2.pdf

20
Q

Define the mean squared error function and how it works

A

The Mean Squared Error (MSE) function is the sum of squared errors divided by the number of values: MSE = (1/n) * Σ(yᵢ - ŷᵢ)².

21
Q

What is the difference between explained and unexplained variation in a regression model?

A

The explained variation measures how much of the total variation is captured by the regression model, i.e., how much of the variation in
𝑦 can be explained by the independent variable(s) 𝑥. While the unexplained variation measures the variability in the dependent variable that is not captured by the regression model. It is also known as the error or residual variation.

Total variation is the unexplained variation added to the explained variation

22
Q

What is the primary objective for a prediction model?

A

The primary objective is to make the best prediction.

23
Q

What is the main focus for prediction models?

A

The focus is on performance metrics, which measure the quality of the model’s predictions.
Performance metrics usually involve some measure of closeness between ypred and y.
Without focusing on interpretability, we risk having a Black-box model.

24
Q

How can we determine the accuracy of a prediction model?

A

The closer the predicted values are to the observed values, the more accurate the prediction is.
The further the predicted values are from the observed values, the less accurate the prediction is.

25
Q

What is the primary objective for an interpretation model?

A

The primary objective is to train a model to find insights from the data.

26
Q

What is the primary focus for an interpretation model?

A

On the model’s coefficients or feature importance, which reveal which features are most influential in predicting the outcome.

27
Q

Why might we want to use polynomials in our linear regression models?

A

Polynomials can help to predict better - they can sometimes better fit the curvature of the actual data.
Polynomials can also help to find variables that explain variation in data better

28
Q

What kind of features can be captured by adding polynomial features to our linear regression models?

A

Higher order features - features that are created by transforming the original features (or predictors) in a non-linear way. Higher-order features help capture more complex, non-linear relationships between the independent variables and the dependent variable that would otherwise be missed by a simple linear model.
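A minimal sketch of this idea, assuming scikit-learn and NumPy are available (the synthetic data and the degree choice are illustrative only): polynomial features are generated from the original feature and an ordinary linear regression is fit on the expanded feature set.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(100, 1))
    y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.3, size=100)  # quadratic signal

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)             # columns: x and x^2 (a higher-order feature)

    model = LinearRegression().fit(X_poly, y)  # still a linear combination of features
    print(model.coef_)                         # roughly [-1, 0.5]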

29
Q

Note: adding polynomial features does not mean the algorithm is no longer linear regression. A non-linear relationship between one feature and another does not make the algorithm non-linear; the model is still a linear combination of the (transformed) features.

A
30
Q

What is the Bayes Information Criterion ( BIC )

A

The Bayesian Information Criterion (BIC) is a statistical measure used for model selection. It helps compare different models and identify the one that best balances goodness of fit with model complexity. Specifically, BIC penalizes models with more parameters to avoid overfitting, favoring simpler models when they adequately explain the data.

31
Q

What is the formula for the Bayes Information Criterion?

A

BIC = n * ln(SSE) - n * ln(n) + p * ln(n)

SSE = sum of squared errors
n = number of observations
p = number of parameters
ln = natural logarithm

32
Q

Interaction terms can be added to a linear regression modelling algorithm, what are interaction terms?

A

Interaction terms in linear regression modeling are terms added to the model to capture the combined effect of two (or more) predictor variables on the dependent variable. Interaction terms account for the situation where the effect of one predictor on the outcome depends on the level or value of another predictor.

33
Q

How can interaction terms be represented in our modelling algorithm

A

If there are two predictor variables, say x1 and x2, then the interaction term is represented by multiplying the two variables together. The resulting equation could look like this:

y = β0 + β1X1 + β2X2 + β3(X1 × X2) + ϵ

34
Q

Why should data be split into a training and testing split?

A

In machine learning, a training set is used to train or fit the model, allowing it to learn patterns and relationships from the data. A training set can also be used to learn the optimal parameters for a model

The testing set (or test set) is kept separate and used to evaluate the model’s performance on unseen data. This separation is crucial to prevent overfitting, where the model performs well on the training data but poorly on new, unseen data.

By using both a training and testing set, you ensure the model generalizes well and can make accurate predictions in real-world scenarios.

35
Q

How can we use the Test data set (also known as a holdout set) to measure model performance?

A

-Predict the labels with the model
-Compare the predictions with the actual values
-Measure the error

36
Q

Define cross validation in the test train split and why it should be used

A
  • Instead of using a single training and test set, we use cross validation to calculate the error across
    multiple training and test sets.
  • With cross validation, we split the data into multiple pairs of training and test sets.
  • Average the error across each one of the test set errors.
  • Performance measure will be more statistically significant.
37
Q

Define k-fold cross validation and how it works

A

A single parameter called k specifies the number of groups the data sample is to be split into.

  • Process:
    1. Shuffle the dataset.
    2. Split the dataset into k groups.
    3. For each unique group:
       a) Use the group as a test set.
       b) Use the remaining groups as a training set.
       c) Fit the model on the training set, then evaluate the model on the test set.
       d) Store the evaluation score, then discard the model.
    4. Summarise model performance using the model evaluation scores.
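A minimal sketch of this procedure, assuming scikit-learn (the iris dataset and logistic regression model are illustrative only); swapping KFold for StratifiedKFold gives the stratified variant discussed in the next cards.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)  # steps 1-2: shuffle, split into k groups
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # step 3: fit/evaluate per fold
    print(scores, scores.mean())                          # step 4: summarise the evaluation scores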
38
Q

Define stratified sampling

A

Stratified sampling is a sampling technique where the samples are selected in the same proportion as they appear in the population.

39
Q

Why should we implement stratified sampling in conjunction with k-fold cross validation?

A

Implementing stratified sampling in cross validation ensures that the training and test sets have the same proportion of the feature of interest as in the original dataset.
* By doing this with the target feature, we ensure that the cross validation result is a close approximation of the generalisation error

40
Q

If Training error and cross validation error metrics are both high what is that a sign of

A

Underfitting, which is where a model is too simple to capture the underlying patterns in the data

41
Q

If the training error is low but the cross validation error is high what is that a sign of?

A

Overfitting, which is where the model learns the training data instead of relationships in the data

42
Q

When we are creating a model we want both the training and cross validation errors to be low - what is a good approach to doing this?

A

A good approach: as soon as the cross validation error starts to increase, we stop making the model more complex.

43
Q

When the model is too complex, e.g. it has a high polynomial degree or too many layers, what is likely to occur?

A

The model is likely to just learn the training data and will start overfitting.

44
Q

What are the three sources of model error?

A

Bias
Variance
Irreducible error

45
Q

What is the bias in a model?

A

Bias is the tendency to miss the true values when predicting; a model with high bias will not be very accurate and will often predict incorrectly. We want bias to be as low as possible.

46
Q

What can high bias be the result of?

A
  • The model misrepresenting the data due to missing information.
  • An overly simple model, i.e., bias towards the simplicity of the model.
  • The model missing the real patterns in the data.

High bias is often linked to underfitting the training data.

47
Q

What is the variance in a model and what is it characterised by?

A

Variance is the tendency of predictions to fluctuate and is characterised by high sensitivity of output to small changes in input data

48
Q

What causes a model to have high variance

A

Variance is often caused by overly complex or poorly fit models, e.g. very high-degree polynomial models.
It is associated with overfitting.

49
Q

What is irreducible error in a model?

A

Irreducible error in a model refers to the portion of the total error that cannot be reduced or eliminated, no matter how well the model is designed or how much data is used. This error arises from inherent variability or noise in the data that the model cannot capture or explain.

50
Q

What are some sources of irreducible error?

A

Measurement errors in data collection.
Randomness in the underlying process being modeled.
Unpredictable factors that affect the outcome but are not included in the model.

It is impossible to perfectly model the majority of real world data. Thus, we have to be comfortable
with some measure of error.

51
Q

What is the bias-variance trade off?

A

Making model adjustments aimed at reducing bias can often end up increasing variance, and vice versa.

Analogous to the complexity tradeoff where we want to choose the right level of complexity to find the best model

The model should be complex enough not to underfit but not so complex that it overfits

“We search for a model that describes the feature-target relationship but is not so complex that it fits to spurious patterns.”

52
Q

How does the degree of a polynomial regression relate to the Bias-Variance tradeoff?

A

The higher the degree of a polynomial regression, the more complex that model is (lower bias, higher
variance).

  • At lower degrees: the predictions are too rigid to capture the curved pattern in the data (bias).
  • At higher degrees: the predictions fluctuate wildly because of the model’s sensitivity (variance).

As the degree of the polynomial increases:

Bias decreases, because the model becomes more flexible and better captures the training data.

Variance increases, because the model becomes more sensitive to noise and variations in the data.

The optimal model has sufficient complexity to describe the data without overfitting.

53
Q

Define a cost function

A

A cost function in machine learning is a mathematical function that measures the error or difference between the predicted output of a model and the actual target values. It serves as a quantitative measure of how well or poorly a machine learning model performs. The goal of training a model is to minimize this cost function.

54
Q

Define Linear Model Regularisation

A

To regularise linear models we can add a regularisation strength parameter directly into the cost function.

This parameter (λ) adds a penalty proportional to the size (magnitude) of each estimated model parameter.

  • When λ is large, stronger parameters are penalised more heavily. Thus, a more
    complex model will be penalised.
55
Q

How does the introduction of a regularisation term to a linear model effect the bias variance tradeoff?

A

The regularisation strength parameter λ allows us to manage the complexity tradeoff.
* More regularisation introduces a simpler model or more bias.
* Less regularisation makes the model more complex and increases variance.

If the model overfits (variance is too high), regularisation can improve generalisation error and reduce
variance.

56
Q

How is the penalty λ applied in Ridge regression?

A

In ridge regression, the penalty λ is applied proportionally to squared coefficient values.

  • This penalty imposes bias on the model and reduces variance
57
Q

How can we find the best value λ when performing regression on a linear model?

A

We can use cross validation - it’s best practice to scale features

58
Q

What is LASSO regression and how is the penalty λ applied

A
  • In LASSO (Least Absolute Shrinkage and Selection Operator), the penalty λ is applied proportionally
    to absolute coefficient values
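A minimal sketch of ridge and LASSO regression, assuming scikit-learn (the diabetes dataset and alpha grid are illustrative only). Cross validation is used to choose the regularisation strength (alpha here plays the role of λ), and features are scaled first as recommended in the earlier card.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_diabetes(return_X_y=True)

    ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))  # L2 penalty
    lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))           # L1 penalty

    ridge.fit(X, y)
    lasso.fit(X, y)
    print(ridge[-1].coef_)    # coefficients shrunk towards zero
    print(lasso[-1].coef_)    # some coefficients driven exactly to zero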
59
Q

What is L1 and L2 regularisation?

A

LASSO and Ridge regression are also known as L1 regularisation and L2 regularisation, respectively

The names L1 and L2 regularisation come from the L1 and L2 norm of a vector w, respectively

60
Q

How can we calculate the norm for L1 regularisation?

A

The L1-norm is calculated as the sum of the absolute vector values, where the absolute value of a
scalar w is written |w|.

61
Q

How can we calculate the norm for L2 regularisation?

A

The L2-norm is calculated as the square root of the sum of squared vector values
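A minimal sketch, assuming NumPy, of how the two norms are computed for an example weight vector (the values are illustrative only).

    import numpy as np

    w = np.array([3.0, -4.0, 0.5])
    l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values -> 7.5
    l2 = np.sqrt(np.sum(w ** 2))    # L2 norm: square root of the sum of squares -> about 5.02
    print(l1, l2)
    print(np.linalg.norm(w, 1), np.linalg.norm(w, 2))   # the same values via NumPy's helper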

62
Q

Why should we use L1 vs L2 regularisation?

A
63
Q

What is feature selection and why is it important?

A

Feature selection is figuring out which of our features are important to include in the model.

Reducing the number of features can prevent overfitting.
* For some models, fewer features can improve fitting time and/or results.
* Identifying the most important features can improve model interpretability.
* Feature selection can also be performed by removing features: remove features one at a time and measure the predictive results using cross validation; if eliminating a feature improves the cross validation results, or doesn't increase the error much, that feature can be removed.

64
Q

How does regularisation perform feature selection

A

Regularisation performs feature selection by shrinking the contribution of features.
* For L1-regularisation, this is accomplished by driving some coefficients to zero.

65
Q

What is data integration or data fusion?

A

Data integration (or data fusion) combines data so that it can all be analysed as if it were retrieved from a single source, enabling efficient analysis without worrying about disparate sources.

66
Q

What is data cleaning, and why is it necessary?

A

Data cleaning is the process of detecting and correcting corrupt, inaccurate, or incomplete records in a dataset. It ensures the dataset is reliable, accurate, and usable for analysis by removing or rectifying errors, outliers, and inconsistencies.

67
Q

What are missing values in a dataset, and how are they represented?

A

Missing values are data points expected in the dataset but absent. In pandas, they are often represented as NaN (Not a Number). Other representations include None, 9999, or N/A. Missing values can arise from human error, skipped survey questions, or database management issues.

68
Q

What are the three types of missing data?

A

Missing Completely At Random (MCAR): Missing values occur randomly, unrelated to observed or unobserved data (e.g., random sensor failure). - The probability of missing values is equal for all units

Missing At Random (MAR): probability of a missing value depends on observed data but not the missing data itself (e.g., sensor failure during high wind speeds).

Missing Not At Random (MNAR): the probability of a missing value is related to the actual (unobserved) missing data itself (e.g., tampered sensors near a polluting power plant).

69
Q

What are the four approaches to dealing with missing values?

A

Keep as-is:

Used when the tools or goals can handle missing values directly, such as the KNN algorithm.
This approach ensures the dataset remains intact and avoids introducing biases during preprocessing.

Remove rows with missing values:

Suitable for MCAR situations, where missing values are completely random.
Avoid in MAR and MNAR cases, as removing rows risks introducing bias by excluding specific subsets of data.

Remove columns with missing values:

Effective when missing rates exceed 25%, especially for non-critical attributes.
Avoid for critical attributes, as their removal can compromise the analysis.

Impute missing values:

Replace missing values with central tendencies (mean, median, or mode), subgroup averages (for MAR), or regression models (for MNAR).
Be cautious of introducing bias during imputation.
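A minimal sketch of the four approaches, assuming pandas and NumPy (the toy DataFrame and column names are illustrative only).

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "income": [30000, 42000, np.nan, 50000]})

    kept = df                                          # 1. keep as-is (the tool must tolerate NaN)
    dropped_rows = df.dropna(axis=0)                   # 2. remove rows with missing values (MCAR)
    dropped_cols = df.dropna(axis=1)                   # 3. remove columns with missing values
    imputed = df.fillna(df.median(numeric_only=True))  # 4. impute with a central tendency (median)
    print(imputed)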

70
Q

What are the Guiding principles when dealing with missing values and what considerations should we make

A

We should aim to preserve data and information and minimize bias introduction.

Additional considerations include:
-What are our analytical goals? (clustering, classification, etc.)
-What analytical tools are we using?
-What is the cause of our missing values?
-What type of missing values are we dealing with?

71
Q

What is an outlier, and what are the common causes?

A

An outlier is a data point that significantly differs from others. Causes include:

Errors: Data entry or measurement mistakes.
Legitimacy: True but extreme values that may skew results.
Fraud: Deliberate manipulation requiring scrutiny.
Random errors: Unavoidable fluctuations or inconsistencies in measurements due to chance.

72
Q

How can outliers be detected and managed?

A

Outliers can be detected using the interquartile range (IQR = Q3 - Q1):

Define outliers as values outside the range
[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

Management strategies:

Do nothing (if the analysis is robust to outliers).

Replace with caps (the upper or lower bound): ideal when the analysis is sensitive to outliers and retaining all data objects is vital.

Apply a log transformation: for particularly skewed data.

Remove outliers: the worst option due to potential loss of information. It should only be done when other methods are inapplicable and when the data is correct but the outlier values are excessively distinct.
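A minimal sketch of IQR-based detection plus the capping and log-transform strategies, assuming pandas and NumPy (the toy series is illustrative only).

    import numpy as np
    import pandas as pd

    x = pd.Series([12, 13, 14, 15, 15, 16, 17, 18, 95])    # 95 is an obvious outlier
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = x[(x < lower) | (x > upper)]   # detection
    capped = x.clip(lower, upper)             # "replace with caps" strategy
    logged = np.log(x)                        # log transformation for skewed data
    print(outliers.tolist(), float(capped.max()))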

73
Q

What are systematic errors, and why are they challenging?

A

Systematic errors are consistent, repeatable errors linked to specific sources (e.g., faulty instruments). They are difficult to detect as they may go unnoticed in data but can cause bias in analyses. Outlier detection can sometimes help identify these errors.

74
Q

Why is data transformation necessary?

A

Data Transformation is the last stage of data preprocessing before using analytic tools. It ensures our dataset meets key
characteristics and is ready for analysis!

Data transformation adjusts data ranges to:

Meet model assumptions.
Improve algorithm stability.
Ensure fair contribution from all features.
Common techniques include standardization, normalization, log transformation, and categorical-to-numerical conversions.

75
Q

What is the difference between standardization and normalization?

A

Standardization: Rescales data to have a mean of 0 and a standard deviation of 1.

Normalization: Rescales data to a specific range, typically [0, 1].
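A minimal sketch contrasting the two rescalings, assuming scikit-learn (the toy values are illustrative only).

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [10.0]])
    standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
    normalized = MinMaxScaler().fit_transform(X)       # rescaled to the range [0, 1]
    print(standardized.ravel())
    print(normalized.ravel())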

76
Q

When is a log transformation used, and why?

A

Log transformation is used to address skewness and extreme values. It is particularly useful for:

Features varying over orders of magnitude.

Cases where ratios are more critical than absolute differences.

Formula: x' = log(x), where x' is the transformed value.

77
Q

Define an error in Data

A

A discrepancy or deviation in measured data from the actual or true values, which can arise due to various factors during data collection or measurement.

78
Q

What is discretisation

A

The process of converting continuous data (numerical values) into discrete intervals or categories. This transformation helps simplify the data, making it easier to analyze, interpret, or use in certain machine learning algorithms that perform better with categorical data.

79
Q

What are some techniques for transforming data from numerical to categorical and what is an advantage and disadvantage

A

Binary coding:
Categorical data is represented by a unique number for each category. These numbers are then converted into binary form, with each digit of the binary number given its own data column (feature).
-Works well with high-cardinality categorical variables and reduces dimensionality
-Does not preserve ordinal relationships

Ranking transformation (ordinal encoding): data is transformed based on the order or rank of the values; useful where hierarchies are significant.
-Preserves the natural ordering of the data
-Not suitable for nominal data and may lose information about relative distances

Attribute conversion: transforms existing attributes (features) into new attributes that are better suited to machine learning models. This is a broader technique that encompasses several types of transformations, including scaling, normalizing, and converting attributes into different representations.
-Can improve model performance by making data more interpretable
-Requires domain knowledge and risks introducing bias

80
Q

What is smoothing in data transformation, and how is it applied?

A

Smoothing reduces noise and fluctuations in data to reveal underlying trends. Common methods include:

Moving average: averaging data points in successive subsets to create smoothed values.

81
Q

What is the role of bar charts in data visualization?

A

Bar charts display categorical data distributions. Tips for effective use:

Avoid too many bars.

Use horizontal layouts for readability when necessary.

82
Q

When are line plots and scatter plots used in visualization?

A

Line plots: Best for showing trends over time. Avoid markers; use legends on curves.

Scatter plots: Visualize relationships between variables, often with multiple variables for richer insights.

83
Q

Why should pie charts be used cautiously?

A

Pie charts display proportions but are often misleading due to poor readability and comparison challenges. Alternative visualizations like bar charts are recommended.

84
Q

What is classification in machine learning?

A

Classification is a supervised learning task where an algorithm is trained on labeled data to predict the class of new data points. It assigns inputs (x) to predefined categories (y), such as predicting whether an email is spam or not. The classifier chooses the class with the highest predicted probability.

85
Q

How is the train-test split used in machine learning?

A

The dataset is split into two parts: the training set (usually 80%) is used to train the model, and the test set (usually 20%) is used to evaluate its performance. This ensures that the model is tested on unseen data to check its generalization ability.

86
Q

What are some common applications of classification

A

Applications include image classification (e.g., recognizing objects in images), spam detection in emails, medical diagnosis (e.g., disease prediction), and customer segmentation in marketing.

87
Q

How does logistic regression work?

A

Logistic regression is a statistical model used for binary classification. It predicts the probability that a given input x belongs to class 1 (y=1). It uses the logistic function to ensure the output is between 0 and 1. It is up to the developer/researcher to decide what threshold to use, but the default is 0.5 (50%).

88
Q

What kind of data does logistic regression handle?

A

Logistic regression handles continuous input data (x) and binary output data (y).

89
Q

How do you interpret the logistic regression model’s output?

A

The output of logistic regression is the probability of the input data belonging to class 1. A threshold (e.g., 0.5) is then applied to decide the final class label.

90
Q

What is the cost function used in logistic regression?

A

Logistic regression uses the log-loss (cross-entropy) cost function, which measures the difference between the predicted probabilities and the actual class labels.

91
Q

How is logistic regression implemented in scikit-learn?

A

Import the necessary modules: from sklearn.linear_model import LogisticRegression.

Split the data using train_test_split.

Create the classifier: log_reg = LogisticRegression().

Fit the model: log_reg.fit(X_train, y_train).

Predict and evaluate: log_reg.predict() and log_reg.score().
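The steps above written out as a runnable sketch, assuming scikit-learn (the breast cancer dataset and the max_iter setting are illustrative only).

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    log_reg = LogisticRegression(max_iter=5000)
    log_reg.fit(X_train, y_train)

    y_pred = log_reg.predict(X_test)         # class labels using the default 0.5 threshold
    proba = log_reg.predict_proba(X_test)    # probability of belonging to each class
    print(log_reg.score(X_test, y_test))     # accuracy on the held-out test set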

92
Q

What are the weaknesses of logistic regression?

A

Logistic regression assumes a linear relationship between the input features and the log-odds of the target variable. It can struggle with non-linear data or datasets with overlapping classes.

93
Q

What is the formula for logistic regression accuracy?

A

Accuracy = (True Positives + True Negatives) / sample size

94
Q

What is the perceptron algorithm?

A

The perceptron is a linear classifier that updates its weights based on classification errors. It iteratively adjusts weights to find a hyperplane(e.g. a line separating the data points if they were graphed) that separates the classes.

95
Q

How is a perceptron trained?

A

Initialize weights (w) to zero or small random values.

For each input (x), calculate the predicted output (y_pred) using a step function.

Update weights if there is a misclassification.

Repeat until convergence or for a fixed number of iterations.
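A minimal sketch of this update rule, assuming NumPy; the AND-function toy dataset, learning rate, and epoch count are illustrative only.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])                  # AND function: linearly separable

    w = np.zeros(X.shape[1])                    # initialise weights (and bias) to zero
    b = 0.0
    lr = 0.1

    for _ in range(20):                         # repeat for a fixed number of iterations
        for xi, target in zip(X, y):
            y_pred = 1 if xi @ w + b > 0 else 0    # step function
            update = lr * (target - y_pred)        # non-zero only on a misclassification
            w += update * xi
            b += update

    print(w, b, [1 if xi @ w + b > 0 else 0 for xi in X])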

96
Q

What are the limitations of the perceptron?

A

The perceptron can only solve problems where the data is linearly separable. It fails for non-linear datasets.

97
Q

What is the multi-layer perceptron (MLP)?

A

MLP is a type of neural network that can model non-linear relationships. It consists of multiple layers of neurons: input layer, one or more hidden layers, and an output layer. It uses activation functions to introduce non-linearity.

98
Q

How does the MLP overcome the limitations of perceptrons?

A

MLP uses hidden layers and non-linear activation functions (e.g., ReLU, sigmoid) to model complex, non-linear relationships in data. This allows it to solve problems like XOR.

99
Q

What is the activation function in an MLP?

A

An activation function introduces non-linearity into the model, enabling it to learn complex patterns. Common activation functions include sigmoid, tanh, and ReLU.

100
Q

What is the backpropagation algorithm?

A

Backpropagation is an algorithm used to train neural networks. It calculates the gradient of the loss function with respect to each weight using the chain rule and updates the weights to minimize the error.

101
Q

What are the strengths of MLPs?

A

MLPs are highly versatile and can model complex patterns. They are suitable for a variety of tasks, including classification, regression, and even unsupervised learning when paired with autoencoders.

102
Q

What are the common optimization techniques used in MLP training?

A

Common techniques include stochastic gradient descent (SGD), Adam optimizer, and learning rate schedules. These methods improve convergence and training efficiency.

103
Q

What are some practical considerations when using MLPs?

A

Choose the number of layers and neurons carefully to avoid overfitting or underfitting.

Use regularization techniques like dropout or L2 regularization.

Scale input data for faster convergence.

104
Q

What is the importance of the perceptron in the history of AI?

A

The perceptron is one of the first machine learning algorithms and laid the foundation for modern neural networks. It demonstrated the potential of learning machines but also highlighted the need for multi-layer architectures to handle non-linear problems.

105
Q

How can we use a cost function to help deal with class imbalance inaccuracies?

A

Adjust the cost function by assigning higher weights to minority classes and lower weights to majority classes. This penalizes the model more heavily for misclassifying the minority classes.

106
Q

What is gradient descent, and why is it used in machine learning?

A

Gradient descent is an iterative optimization algorithm used to minimize a loss function (error) by adjusting model parameters. It calculates the gradient of the loss function with respect to the parameters and moves in the opposite direction of the gradient to reduce the error

107
Q

What are the key steps in the gradient descent algorithm?

A

Start at a random point in parameter space.

Calculate the loss at the current point.

Take a step in the direction of steepest descent (opposite to the gradient).

Recalculate the loss at the new point.

Repeat until the loss is below a threshold or a maximum number of steps is reached.
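A minimal sketch of these steps for a one-parameter model, assuming NumPy: fitting the slope w in y ≈ w * x by minimising the mean squared error (the data, learning rate, and iteration count are illustrative only).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = 3.0 * x + np.array([0.1, -0.2, 0.05, 0.0])   # the true slope is about 3

    w = 0.0                                   # starting point in parameter space
    lr = 0.01                                 # learning rate: the size of each step
    for _ in range(500):
        grad = np.mean(2 * (w * x - y) * x)   # gradient of the loss at the current point
        w -= lr * grad                        # step in the opposite direction of the gradient
    print(w)                                  # close to 3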

108
Q

What is the difference between stochastic and deterministic parameter-fitting methods?

A

Deterministic: The algorithm moves in a specific direction calculated by the gradient. It has no randomness.

Stochastic: The algorithm includes randomness by picking a nearby random point and checking the error iteratively.

109
Q

What is the L1 norm, and how is it related to optimization methods?

A

The L1 norm (Manhattan distance) is the sum of the absolute differences between predicted and actual values. Minimizing the L1 norm promotes sparsity in model parameters and is often used in optimization methods like Lasso regression to encourage simpler models by driving some coefficients to zero.

110
Q

What is the L2 norm, and how is it related to gradient descent?

A

The L2 norm (Euclidean distance) is a measure of the error between predicted values and actual values. Minimizing the L2 norm (squared error) is a common objective in gradient descent to optimize model parameters.

111
Q

What is the role of the gradient in gradient descent?

A

The gradient represents the direction and rate of the steepest increase of the loss function. Gradient descent takes steps in the opposite direction of the gradient to minimize the loss function.

112
Q

In one-dimensional parameter space, how is the optimal point found using gradient descent?

A

In 1D, the optimal point occurs where the derivative of the loss function with respect to the parameter equals zero: dL/dw = 0.

113
Q

How does gradient descent generalize to two or more dimensions

A

In higher dimensions, the gradient consists of partial derivatives with respect to each parameter. Gradient descent finds the direction of steepest descent by combining all partial derivatives into a gradient vector.
(Idk if we need to know this)

114
Q

What is a confusion matrix, and what is it used for?

A

A confusion matrix is a table used to evaluate the performance of a classification model. It displays the counts of true positives, true negatives, false positives, and false negatives.

115
Q

Define precision in the context of a confusion matrix.

A

Precision (Positive Predictive Value) is the proportion of correctly predicted positive observations out of all observations predicted as positive:

Formula:
Precision = True Positives / (True Positives + False Positives)

116
Q

Define recall (sensitivity) in the context of a confusion matrix.

A

Recall (Sensitivity) is the proportion of actual positive observations correctly classified as positive:

Formula:
Recall = True Positives / (True Positives + False Negatives)

117
Q

What is the trade-off between precision and recall?

A

A high recall indicates fewer false negatives but may include more false positives (lower precision). Conversely, high precision minimizes false positives but may miss true positives, lowering recall.

118
Q

What is the F1 score, and why is it useful?

A

The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). It is useful for balancing the trade-off between precision and recall.
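A minimal sketch computing these quantities from toy labels, assuming scikit-learn's metrics module (the label vectors are illustrative only).

    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall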

119
Q

What is the ROC curve, and what does it represent?

A

The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the decision threshold varies.

120
Q

What is sensitivity and specificity in binary classification?

A

Sensitivity (Recall): Proportion of true positives correctly identified.

Specificity: Proportion of true negatives correctly identified.

121
Q

What does the Area Under the Curve (AUC) represent in the ROC curve?

A

The AUC measures the overall performance of a classifier by quantifying the area under the ROC curve. A higher AUC indicates better classifier performance.

122
Q

How is grid search used for parameter fitting?

A

Grid search involves exhaustively trying all combinations of predefined parameter values to identify the combination that minimizes error or maximizes accuracy.
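A minimal sketch of grid search with cross validation, assuming scikit-learn (the SVC model, iris dataset, and parameter grid are illustrative only).

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

    search = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination in the grid
    search.fit(X, y)
    print(search.best_params_, search.best_score_)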

123
Q

How can imbalanced datasets affect classifier performance, and how can this issue be addressed?

A

Imbalanced datasets bias the model towards the majority class. Solutions include:

Sampling techniques (oversampling minority class, undersampling majority class).

Using metrics like F1 score or ROC-AUC that account for class imbalance.

124
Q

What happens when the classifier threshold is set very low?

A

Setting a low threshold increases sensitivity (high true positive rate) but reduces specificity, leading to more false positives.

125
Q

What happens when the classifier threshold is set very high?

A

Setting a high threshold increases specificity (low false positive rate) but reduces sensitivity, leading to more false negatives.

126
Q

Lecture 9

A
127
Q

What is Natural Language Processing (NLP)?

A

NLP is a field of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language. It often involves tasks such as text similarity, sentiment analysis, topic extraction, summarization, question answering, relationship extraction, and language generation.

128
Q

What is spam detection and how is it performed?

A

Spam detection is a classification problem where the input (X) is email text and the output (y) is whether the email is spam or not. A typical approach involves train-test splitting and identifying words indicative of spam. For example, phrases like “send us your password” are strong spam indicators.

129
Q

How does Bayesian spam detection work?

A

Bayesian spam detection calculates the probability of a phrase being spam based on the likelihood of each word occurring in spam emails. For example, P(“send us your password”|spam) = P(“send”|spam) × P(“us”|spam) × P(“your”|spam) × P(“password”|spam). Each word is treated independently.

130
Q

What are tokenization, stemming, and lemmatization, and why are they important?

A

Tokenization: Breaks a sentence into individual words or tokens.

Stemming: Reduces words to their root forms (e.g., “running” becomes “run”).

Lemmatization: Reduces words to their base or dictionary forms (e.g., “better” becomes “good”).
These methods group variations of words together to standardize text for analysis.

131
Q

Why is tokenization language-dependent?

A

Different languages have unique structures, making tokenization challenging.

132
Q

What is a bag-of-words representation, and how is it used?

A

Bag-of-words represents text as a frequency distribution of words without considering grammar or order. It is used in tasks like document classification and text clustering. Each document is represented as a vector of word frequencies or binary indicators.

133
Q

What is a sparse matrix representation in NLP?

A

A sparse matrix represents text where rows are documents (e.g., songs), columns are words, and values are word counts or binary indicators (1 if the word exists, 0 otherwise). This approach is memory-efficient as most entries are zero.

134
Q

How is PCA(Principle Component Analysis) applied to NLP data?

A

PCA reduces high-dimensional vector representations (e.g., word embeddings or bag-of-words) into lower dimensions for clustering and classification. It identifies key features while preserving variance in the data.

135
Q

What is TF-IDF(Term Frequency-Inverse Document Frequency), and how does it work?

A

TF-IDF identifies keywords by balancing term frequency in a document and inverse frequency across documents. Formula: TF-IDF = TF(term) × IDF(term). Higher scores indicate words more unique to a document.
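A minimal sketch of the bag-of-words and TF-IDF representations on a toy corpus, assuming scikit-learn (the documents are illustrative only).

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the metal band played loud metal",
            "the pop song was a love song",
            "the band played a love song"]

    counts = CountVectorizer().fit_transform(docs)   # sparse document-term count matrix
    tfidf = TfidfVectorizer()
    scores = tfidf.fit_transform(docs)               # TF x IDF weight per (document, term) pair
    print(tfidf.get_feature_names_out())
    print(scores.toarray().round(2))                 # words unique to a document score higher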

136
Q

How can word frequencies compare text corpora

A

Word frequencies in metal lyrics are compared to other genres by plotting a scatterplot of frequency ratios. Points deviating from the diagonal highlight words unique to a genre.

137
Q

How can TF-IDF be extended to bigrams and trigrams?

A

TF-IDF can calculate scores for word pairs (bigrams) or triples (trigrams) to identify multi-word expressions indicative of specific topics or genres. This captures context missed by single-word analysis.

138
Q

What are some applications of vector representations in NLP?

A

Vector representations enable tasks like clustering similar documents, classifying text, and semantic analysis. For example, vectors can identify similar songs or distinguish topics in a corpus.

139
Q

What are semantic challenges in NLP?

A

Synonymy: Different words with similar meanings (e.g., “big” vs. “large”).

Polysemy: Words with multiple meanings (e.g., “bank”).

Context dependency: Meaning influenced by surrounding words.

140
Q

Lecture 10

A
141
Q

What are the challenges with sparse document-term matrices?

A

Sparse matrices are high-dimensional with many zeros, making clustering and computational operations inefficient. Dimensionality reduction techniques, such as topic modeling, help address these challenges.

142
Q

What is topic modeling, and why is it useful?

A

Topic modeling is a statistical method to identify topics within a collection of documents. It reduces dimensionality and summarizes large datasets, providing insights like “this document is 80% sports and 20% education.”
As it is a clustering-style method, there is no need to know in advance what the topics are.

143
Q

What are two commonly used topic modeling techniques?

A

Latent Semantic Indexing (LSI): Uses singular value decomposition to identify patterns in relationships between terms and documents.

Latent Dirichlet Allocation (LDA): A probabilistic model that assumes documents are mixtures of topics and topics are distributions over words.

144
Q

What are the limitations of topic modeling?

A

Does not generalize well to unseen documents.

Sensitive to parameter choices and prone to overfitting.

Performs poorly on short texts.

145
Q

What is a word embedding, and why is it significant?

A

Word embedding represents words as dense vectors in a continuous vector space. They capture semantic and syntactic relationships, enabling tasks like clustering, similarity measurement, and downstream machine learning tasks.
In other words, semantic relationships between words become relationships between vectors in the vector space.
Words that are similar become vectors that are close together.

146
Q

How are word embeddings generated?

A

Word embeddings are created by training neural networks to predict the context of words:

Use one-hot encoded vectors as inputs.

Train a two-layer neural network with fewer hidden neurons than vocabulary size.

Minimize the error between predicted and actual next words.

The resulting vectors in the hidden layer are the embeddings.
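A minimal sketch of learning embeddings from a toy corpus, assuming the gensim library is available (the sentences and hyperparameter values are illustrative only); sg=1 selects the skip-gram objective and sg=0 the CBOW objective described in the next card.

    from gensim.models import Word2Vec

    sentences = [["the", "ship", "sailed", "across", "the", "sea"],
                 ["the", "boat", "sailed", "across", "the", "lake"],
                 ["the", "king", "ruled", "the", "country"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["ship"].shape)                  # a dense 50-dimensional vector
    print(model.wv.most_similar("ship", topn=2))   # nearby words in the embedding space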

147
Q

What is the difference between Continuous Bag of Words (CBOW) and Skip-gram models?

A

CBOW: Predicts a word based on its surrounding context.

Skip-gram: Predicts the context given a single word.

Both generate word embeddings, but Skip-gram is better for smaller datasets.

148
Q

What are some properties of word embeddings?

A

Similar words have similar vectors (e.g., “ship” and “boat”).

Capture semantic relationships (e.g., “king - man + woman = queen”).

Support multilingual embeddings, aligning words with similar meanings across languages.

149
Q

What is polysemy, and how does it affect word embeddings?

A

Polysemy refers to words with multiple meanings (e.g., “queen” as a monarch or a band). Standard embeddings mix these meanings, but sense embeddings can separate them by training with labeled sense-specific data.

150
Q

What are sense embeddings?

A

Sense embeddings represent different meanings of a word separately. For example, “queen” could have one vector for the monarchy and another for the band. These embeddings require labeled training data.

151
Q

What are sentence embeddings, and how are they different from word embeddings?

A

Sentence embeddings represent entire sentences as vectors. They capture the overall meaning of sentences, making them useful for tasks like clustering, paraphrase detection, and semantic similarity measurements.

152
Q

What are some applications of sentence embeddings?

A

Measuring semantic similarity.

Clustering similar texts.

Paraphrase mining.

Automatic translation.

Text summarization.

153
Q

How can sentence embeddings integrate text and images?

A

By aligning embeddings of text and images into the same vector space, models can perform tasks like caption generation and image-to-text retrieval.

154
Q

What is the core challenge addressed by topic modeling?

A

Reducing high-dimensional and sparse document-term matrices into meaningful and smaller representations that highlight topics.

155
Q

Why is it hard to cluster sparse vector representations of text?

A

Sparse vectors contain many zero values, making it challenging to calculate meaningful distances or similarities.

156
Q

How does dimensionality reduction improve text analysis?

A

It simplifies sparse matrices, reducing computational complexity and improving clustering and visualization of text data.

157
Q

What are semantic relationships in word embeddings? Provide examples.

A

Semantic relationships are meanings reflected as vector operations. Examples:

Gender: “king - man + woman = queen”

Verb tense: “run - ran”

Country-Capital: “France - Paris”

158
Q

How do neural networks predict next words in word embedding training?

A

Neural networks:

Input a word (one-hot vector).

Pass through a hidden layer.

Predict the next word (output vector).

Update weights to minimize prediction error.

159
Q

What is the importance of multilingual embeddings?

A

Multilingual embeddings align words with similar meanings across languages, enabling translation and cross-lingual tasks.

160
Q

What are the drawbacks of using topic models for short texts?

A

Short texts lack enough word diversity, making it difficult for models to reliably identify underlying topics.

161
Q

What are common dimensionality reduction methods besides topic modeling?

A

Principal Component Analysis (PCA).

Singular Value Decomposition (SVD).

Autoencoders.

These methods also reduce high-dimensional data into lower-dimensional representations.

162
Q

Lecture 11(Watch this if you cba)

A
163
Q

What type of algorithm is KNN, and what is its main assumption?

A

KNN is a supervised learning algorithm primarily used for classification problems. It assumes that similar data points will exist close to each other based on a distance metric (e.g., Euclidean distance).

164
Q

How does KNN determine the class of a new data point?

A

KNN determines the class of a new data point by identifying the k nearest neighbors and assigning the class based on the majority class among these neighbors. The parameter k is a hyperparameter of the algorithm.

165
Q

What is the role of the hyperparameter k in KNN?

A

The hyperparameter k defines the number of nearest neighbors considered for classifying a new data point. A smaller k may result in overfitting, while a larger k may oversimplify the classification.

166
Q

Describe the steps to classify a point using KNN.

A

Compute the distance of the new point to all other points in the dataset.

Select the k nearest neighbors.

Determine the class of the new point by majority voting among the selected neighbors.
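A minimal sketch of these steps, assuming scikit-learn (the iris dataset and k=5 are illustrative only). Features are scaled first because the distance metric is sensitive to feature ranges, and weights="distance" gives the weighted KNN variant discussed in a later card.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=5, weights="distance"))
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))   # distance-weighted vote among the 5 nearest neighbours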

167
Q

Why is KNN considered a non-parametric algorithm?

A

KNN is non-parametric because it makes no assumptions about the underlying data distribution and instead bases predictions directly on the data.

168
Q

How do outliers affect KNN?

A

Outliers can negatively influence KNN by skewing the decision boundary, especially if the outlier’s class differs from that of the majority of nearby points. This can lead to incorrect classifications.

169
Q

How does class imbalance impact KNN, and how can it be addressed?

A

Class imbalance can bias KNN towards the majority class, leading to poor performance for the minority class. This can be addressed by weighting the data points based on the inverse of their distance to the query point.

170
Q

How can you select the optimal value of k in KNN?

A

The optimal k can be selected by:

Iteratively evaluating performance metrics (e.g., precision, recall) on a test set while varying k.

Using the square root of the number of training samples as a heuristic for k.

For binary classification, choosing an odd value of k to avoid ties.

Maybe further research this to get a stronger understanding

171
Q

What is Weighted KNN, and when would you use it?

A

Weighted KNN assigns greater importance to closer neighbors by weighting their contribution inversely to their distance. It is useful when closer points are expected to be more relevant for classification.

172
Q

How does KNN differ from other machine learning algorithms?

A

KNN is an instance-based learning algorithm that constructs hypotheses directly from training instances. Its complexity grows with the dataset size, making it less efficient for large datasets. It is best suited for low-dimensional data and requires data normalization for consistent performance.

173
Q

Why is normalization important in KNN?

A

Normalization ensures all features contribute equally to the distance metric, preventing features with larger ranges from dominating the results.

174
Q

How can categorical data be handled in KNN?

A

Categorical data can be converted into numerical format using techniques like integer encoding. For example, “Overcast” could be encoded as 0, “Rainy” as 1, and “Sunny” as 2.

175
Q

How are features combined in a KNN dataset?

A

Features can be combined into a single dataset using tools like the zip function, which creates tuples of feature values for each observation.
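A minimal sketch (plain Python) of integer encoding plus zip, assuming the toy weather example used in these cards:

weather = ["Overcast", "Rainy", "Sunny", "Sunny"]
temperature = [20, 15, 30, 28]

encoding = {"Overcast": 0, "Rainy": 1, "Sunny": 2}      # integer-encode the categories
weather_encoded = [encoding[w] for w in weather]

features = list(zip(weather_encoded, temperature))      # one tuple per observation
print(features)   # [(0, 20), (1, 15), (2, 30), (2, 28)]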

176
Q

What are some key points about K nearest Neighbour classification?

A

KNN is a simple yet powerful supervised learning algorithm.

It is affected by class imbalance, outliers, and feature scaling.

Weighted KNN and optimal k selection can improve performance.

KNN is computationally expensive for large datasets but works well for small, low-dimensional datasets.

177
Q

Lecture 12 06/11

A
178
Q

What is a Support Vector Machine (SVM)?

A

A Support Vector Machine (SVM) is a supervised machine learning model used for classification, regression, and clustering problems. Its goal is to find a hyperplane that separates data points into distinct classes by maximizing the margin between the closest data points of different classes.

179
Q

What are the advantages of using SVMs?

A

Advantages of SVMs include:

Effective in high-dimensional spaces.

Versatile, as different kernel functions can be specified for decision functions.

Well-suited for both linearly separable and non-linearly separable data using the kernel trick.

Robust to overfitting when the dimensionality of data is higher than the number of samples.

180
Q

What are support vectors in SVM?

A

Support vectors are the data points closest to the hyperplane that influence its position and orientation. These points are critical in defining the decision boundary and maximizing the margin.

181
Q

What is the margin in SVM?

A

The margin is the distance between the hyperplane and the nearest data points from each class. SVM aims to maximize this margin to improve the generalization of the classifier.

182
Q

What is a maximum margin classifier?

A

A maximum margin classifier selects a decision boundary that maximizes the margin between classes. However, it can be sensitive to outliers, potentially resulting in misclassifications.

183
Q

How does SVM handle outliers?

A

SVM handles outliers by introducing a soft margin, which allows for some misclassifications. This approach improves the model’s robustness and generalization by balancing bias and variance.

184
Q

What is a soft margin in SVM?

A

A soft margin allows for misclassification of some data points to achieve a better tradeoff between maximizing the margin and minimizing classification error, especially in the presence of outliers.

185
Q

What is the purpose of nonlinear transformation in SVM?

A

Nonlinear transformation maps data from its original feature space to a higher-dimensional space, making it linearly separable in the transformed space. This transformation allows SVM to handle complex, non-linear relationships.

186
Q

How does the kernel trick work in SVM?

A

The kernel trick allows SVM to compute the relationships in a higher-dimensional feature space without explicitly transforming the data. It uses a kernel function to calculate the dot product in this space, reducing computational complexity.
^understand this more if time

187
Q

What is a kernel function?

A

A kernel function is a mathematical function that takes two input vectors from the original space and returns their dot product in the transformed feature space. Common kernel functions include:

Linear kernel

Polynomial kernel

Radial Basis Function (RBF) kernel

Sigmoid kernel

188
Q

How does the second-degree polynomial kernel function work?

A

The second-degree polynomial kernel computes the dot product in the transformed space by evaluating functions of the original components. For example, for inputs A1 and A2, it may use terms like (A1^2, A2^2, and A1×A2).
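A minimal numeric check (numpy assumed) that the second-degree polynomial kernel K(a, b) = (a · b)^2 equals a plain dot product in a transformed space built from those terms (with a sqrt(2) factor on the cross term so the identity is exact):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

kernel_value = np.dot(a, b) ** 2                         # (1*3 + 2*4)^2 = 121

def transform(v):
    # map (v1, v2) -> (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit_value = np.dot(transform(a), transform(b))      # also 121
print(kernel_value, explicit_value)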

189
Q

What is the bias-variance tradeoff in SVM?

A

Regularization parameter (C): Controls the tradeoff between achieving a low training error (low bias) and maintaining simplicity (low variance).

Gamma parameter: Defines the influence of a single training example, controlling model complexity and generalization.
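A minimal sketch (scikit-learn assumed) showing where C and gamma appear when fitting an RBF-kernel SVM; the dataset and parameter values are for illustration only:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C: larger values -> harder margin, lower bias, higher variance.
# gamma: larger values -> each training example has a more local influence.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy, for illustration only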

190
Q

What are the key takeaways about SVM?

A

SVMs are powerful for linear and nonlinear classification tasks.

Support vectors, hyperplanes, and margins are fundamental concepts.

Nonlinear transformations and kernel tricks enable handling complex data structures.

Practical applications like the IRIS dataset and face recognition illustrate SVM’s versatility.

191
Q

Lecture 13 11/11

A
192
Q

What is a decision tree?

A

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance outcomes, resource costs, and utility. A decision tree is one common way to display an algorithm that only contains conditional statements.

193
Q

What is the goal of decision tree learning?

A

The goal is to create a model that predicts the value of a target variable based on several input variables. This type of decision tree is called a classification tree.

194
Q

What is the difference between classification trees and regression trees?

A

Classification trees predict categorical outcomes (e.g., “yes” or “no”), while regression trees predict continuous numeric values. This lecture focuses exclusively on classification trees.

195
Q

How do you interpret a classification tree?

A

To interpret a classification tree, start at the root node and follow the branches based on whether conditions are true or false until reaching a leaf node. The path determines the classification. Typically, a “true” condition moves to the left branch, and a “false” condition moves to the right branch. Leaf nodes represent the final classification or prediction.

196
Q

What are the key components of a decision tree?

A

The key components are:

Root Node: The starting point of the tree.

Internal Nodes: Represent decisions based on feature values.

Branches: Show the outcome of decisions.

Leaf Nodes: Provide the final classification or decision.

197
Q

What is Gini impurity?

A

Gini impurity measures the likelihood of incorrectly classifying a randomly chosen element if its classification is based on the distribution of classes in a dataset. A Gini impurity of 0 indicates pure leaves, where all samples belong to a single class.

198
Q

Why is Gini impurity used in building decision trees?

A

Gini impurity quantifies the purity of a node and helps decide which feature to split on at each step of the tree construction. The feature with the lowest Gini impurity after splitting is chosen to maximize classification accuracy.

199
Q

How do you calculate the weighted Gini impurity for a split?

A

To calculate the total Gini impurity for a split:

Calculate the Gini impurity for each leaf node.

Weight each leaf’s impurity by the proportion of samples in that leaf relative to the total samples.

Sum the weighted impurities.
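A minimal sketch of this calculation in plain Python, using a hypothetical split with made-up class counts:

def gini(class_counts):
    # Gini impurity = 1 - sum of squared class proportions
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# Hypothetical split: left leaf has 8 "yes" / 2 "no", right leaf has 1 "yes" / 9 "no".
left, right = [8, 2], [1, 9]
n_left, n_right = sum(left), sum(right)
n_total = n_left + n_right

weighted = (n_left / n_total) * gini(left) + (n_right / n_total) * gini(right)
print(gini(left), gini(right), weighted)   # 0.32, 0.18, 0.25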

200
Q

What is pre-pruning in decision trees?

A

Pre-pruning, or early stopping, limits the growth of a decision tree to prevent overfitting by:

Setting a maximum tree depth.

Requiring a minimum number of samples for splitting a node.

Ensuring each leaf has a minimum number of samples. This controls the model complexity and improves generalization.

201
Q

What is post-pruning in decision trees?

A

Post-pruning simplifies an already built decision tree by removing nodes and subtrees that do not improve predictive performance. This is done by comparing error terms before and after removing nodes and retaining the simpler tree if it performs similarly or better.

202
Q

How is feature importance determined in decision trees?

A

Feature importance is calculated based on the reduction in impurity achieved by a feature across all splits where it is used. Features contributing the most to reducing impurity are considered the most important. This can be visualized through importance scores.

203
Q

What are common issues with decision trees, and how are they mitigated?

A

Overfitting: Pruning (pre- or post-pruning) or limiting tree depth.

Bias towards features with more levels: Use impurity measures like Information Gain Ratio.

Instability: Use ensemble methods like Random Forests or boosting.

204
Q

How do you split numeric features in decision trees?

A

Numeric features are split by:

Sorting feature values in ascending order.

Calculating potential split points as the average of adjacent values.

Evaluating the impurity at each split point and choosing the one with the lowest impurity.

205
Q

What are the steps to implement pre-pruning in practice?

A

Set parameters such as:

max_depth for maximum tree depth.

min_samples_split for the minimum samples required to split.

min_samples_leaf for the minimum samples per leaf.

Train the tree with these constraints.

Evaluate performance to ensure the tree generalizes well.
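A minimal sketch of these constraints with scikit-learn (assumed); the dataset and parameter values are for illustration only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # check generalization on held-out data
print(tree.feature_importances_)    # impurity-based feature importance scores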

206
Q

How does post-pruning improve decision tree performance?

A

Post-pruning removes redundant nodes or subtrees after the tree is built, simplifying the model and reducing overfitting. It compares error metrics before and after pruning and keeps the simpler structure if predictive accuracy is maintained or improved.

207
Q

What is the cost complexity pruning method?

A

Cost complexity pruning introduces a penalty parameter alpha to balance tree complexity and error. It minimizes the cost function R_alpha(T) = R(T) + alpha * |T|, where R(T) is the classification error of tree T and |T| is the number of leaf nodes. Trees are pruned iteratively, and alpha is chosen using cross-validation.
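A minimal sketch with scikit-learn (assumed), where the penalty parameter is exposed as ccp_alpha; the alpha value below is arbitrary:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)                    # candidate alpha values for this data

pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(pruned.get_n_leaves())              # typically fewer leaves than the unpruned tree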

208
Q

What is an instance based learning algorithm?

A

Instance-based learning algorithms are a category of machine learning algorithms that rely on the specific training instances (or examples) to make predictions. Instead of explicitly constructing a general model from the training data, these algorithms store the data and use it directly during the prediction phase. The predictions are typically based on a comparison between the new data point and the stored training examples.

Key Features:
Storage of Training Data: The algorithm retains all or most of the training data.
Similarity Measure: Predictions are made by comparing the new instance to the stored data using a similarity or distance metric, such as Euclidean distance.
Lazy Learning: They are often referred to as lazy learning algorithms because they defer the model-building process until a prediction is required. This contrasts with “eager learning,” where a model is built during training.

209
Q

Lecture 14 13/11

A
210
Q

What are the key components of a social network?

A

Social networks consist of:

Nodes: Represent individuals or entities (e.g., people).

Edges: Represent relationships or connections between nodes.

Topology: Refers to the structure of the network.

Communities: Groups within the network where nodes are densely connected.

Centrality: Measures the importance of nodes in the network.

211
Q

How do online social networks provide data for analysis?

A

Online social networks provide structured data through:

Ego networks: Networks focused on a specific node and its direct connections.

Metrics like modularity and centrality: Help identify communities and influential nodes. Tools like Gephi can be used for visualization and analysis.

212
Q

What types of alternative edges are used to construct proxy social networks?

A

Alternative edges include:

Retweets: Interactions on platforms like Twitter.

Mentions: Direct references to nodes.

Co-occurrence: Nodes appearing together in contexts such as articles.

Citations: References to other works or authors.

Co-authorship: Collaborations between authors.

213
Q

What does centrality measure in a network?

A

A node’s central position in the network.

Its importance, influence, and power within the network. Centrality is fundamental for identifying influential nodes.

214
Q

List and define the key types of centrality.

A

Degree Centrality: Measures the number of edges connected to a node.

In-degree: Number of incoming edges.

Out-degree: Number of outgoing edges.

Eigenvector Centrality: Considers the importance of a node’s neighbors. High centrality means a node is connected to other highly influential nodes.

PageRank: A variation of eigenvector centrality where importance is weighted by the out-degree of neighboring nodes.

Closeness Centrality: Measures the mean shortest path from a node to all other nodes.

Betweenness Centrality: Counts the number of shortest paths passing through a node, indicating its role as a bridge.
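A minimal sketch computing these measures with networkx (an assumption; the lecture only mentions Gephi for visualization) on a small built-in social network:

import networkx as nx

G = nx.karate_club_graph()   # a classic small social network

degree      = nx.degree_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
pagerank    = nx.pagerank(G)
closeness   = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)

# The node with the highest betweenness acts most strongly as a "bridge".
print(max(betweenness, key=betweenness.get))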

215
Q

Why is degree centrality important, and what does it measure?

A

Degree centrality measures the number of edges a node has. It is important for:

Evaluating a node’s influence (more edges = more connections).

Assessing access to information and prestige in directed networks.

216
Q

How does eigenvector centrality differ from degree centrality?

A

Eigenvector centrality accounts for the importance of a node’s neighbors. A node with fewer but influential connections may have higher centrality than a node with many insignificant connections.

217
Q

How does PageRank centrality work, and where is it applied?

A

PageRank centrality is a variant of eigenvector centrality. It assigns importance based on:

The importance of neighboring nodes.

Their out-degree (number of outgoing links). It is famously used by Google’s search engine to rank web pages.

218
Q

What does closeness centrality measure, and what are its limitations?

A

Closeness centrality measures the mean distance from a node to all others. It indicates how quickly information can spread from a node. Limitations include difficulty in comparing nodes across different components of disconnected networks.

219
Q

Why is betweenness centrality significant in a network?

A

Betweenness centrality measures the extent to which a node acts as a bridge in the network. It:

Indicates control over the flow of information.

Reflects a node’s power as a broker.

Can be extended to weight paths inversely, highlighting robustness.

220
Q

What scenarios favor different centrality measures?

A

Degree Centrality: Identifying highly connected nodes in undirected or simple networks.

Eigenvector/PageRank: Detecting nodes with influential neighbors in undirected/directed networks.

Closeness Centrality: Evaluating nodes’ ability to quickly spread information.

Betweenness Centrality: Identifying brokers or bottlenecks in communication.

221
Q

Who contributed significantly to the early study of networks?

A

Key contributors include:

JL Moreno (1934): Developed sociograms and analyzed social structures like friendship networks.

Leonhard Euler (1736): Solved the Königsberg Bridge Problem, founding graph theory.

Duncan Watts and Steven Strogatz (1998): Introduced small-world networks.

Albert-László Barabási (2003): Studied scale-free networks.

222
Q

What are some applications of network science in real-world scenarios?

A

Applications include:

Social network analysis to study influence and community detection.

Epidemiology for modeling disease spread.

Web science for ranking pages and analyzing internet structure.

Transportation networks to optimize flow and connectivity.

223
Q

What emergent properties can arise in complex networks?

A

Emergent properties include:

Community structure: Groups of densely connected nodes.

Resilience: Ability to withstand node or edge removal.

Synchronization: Coordinated behavior across nodes.

Power-law distribution: Characteristic of scale-free networks.

224
Q

There is going to be an exam question on networks; look particularly at measures of closeness.

A
225
Q

Lecture 15 18/11

A
226
Q

What is unsupervised learning?

A

Unsupervised learning is a type of machine learning where algorithms learn patterns and structures in data without labeled outcomes. It is used to group or segment data and identify hidden structures.

227
Q

In which scenarios is unsupervised learning applied?

A

Unsupervised learning is applied in:

Customer segmentation (e.g., identifying single parents, young party-goers).

Fraud detection (e.g., analyzing bank transactions, GPS logs, social media bots).

Identifying new animal species.

Creating classes needed for supervised classification tasks.

228
Q

What are the general steps of clustering?

A

The general steps of clustering include:

Iterating through all data points.

Measuring the distance or similarity between points.

Grouping data points into clusters where intra-cluster similarity is higher than inter-cluster similarity.

229
Q

Why is clustering considered unsupervised learning?

A

Clustering is unsupervised because it does not rely on labeled training data. Instead, it identifies inherent patterns or groupings in the dataset.

230
Q

What are key examples of clustering applications?

A

Key examples include:

Customer segmentation for marketing.

Identifying communities in networks.

Grouping similar text documents or images.

231
Q

What is K-means clustering?

A

K-means is a clustering algorithm that partitions data into K clusters. It assigns data points to the cluster with the nearest centroid, which is iteratively updated to minimize intra-cluster variance. The user specifies K.

232
Q

What is DBSCAN, and how does it differ from K-means?

A

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies core regions of high density and expands clusters from these regions. Unlike K-means, DBSCAN does not require specifying the number of clusters and can detect noise.

233
Q

What is hierarchical clustering?

A

Hierarchical clustering creates a tree of clusters (a dendrogram). It can be:

Agglomerative: Bottom-up approach where clusters are merged iteratively.

Divisive: Top-down approach where clusters are split iteratively.

234
Q

What is the difference between hard and soft clustering?

A

Hard clustering: Each data point belongs to one cluster (e.g., K-means).

Soft clustering: Data points can belong to multiple clusters with probabilities (e.g., Gaussian Mixture Models).

235
Q

Why are distance metrics important in clustering?

A

Distance metrics quantify the similarity or dissimilarity between data points, guiding the formation of clusters. The choice of metric affects the clustering results.

236
Q

What is the Euclidean distance?

A

The Euclidean distance (L2 norm) is the straight-line distance between two points in space. It is calculated as:
sqrt[(x2-x1)^2 + (y2-y1)^2]

237
Q

Lecture 16 20/11

A
238
Q

Why is dimensionality reduction important in data analysis?

A

Dimensionality reduction is important because it:

Removes noise from the data.

Focuses on the features or combinations of features that are actually important.

Reduces computational complexity by requiring less number-crunching, making the analysis more efficient.

239
Q

What are the two main approaches to dimensionality reduction?

A

The two main approaches are:

Feature selection: Identifies and retains the most important features in the data.

Feature extraction: Combines existing features to create new, informative features.

240
Q

What is variance thresholding, and when is it used?

A

Variance thresholding is a filter method used to eliminate features with low variance, as they typically contain less information. Steps:

Calculate the variance of each feature.

Drop features with variance below a set threshold.

Ensure features are on the same scale by normalizing or standardizing beforehand.
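A minimal sketch with scikit-learn's VarianceThreshold (assumed); the toy matrix has one constant feature that gets dropped:

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2.0, 1.0],
     [0, 1.5, 5.0],
     [0, 2.5, 3.0]]          # the first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)        # (3, 2): the constant feature was dropped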

241
Q

How does forward search work for feature selection?

A

Forward search is a wrapper method that:

Creates models using one feature at a time.

Selects the best-performing feature.

Iteratively adds one feature at a time to the selected set, testing performance.

Repeats until a predefined number of features are chosen.

242
Q

What is recursive feature elimination, and how does it differ from forward search?

A

Recursive feature elimination (RFE) is another wrapper method that:

Starts with all features.

Removes one feature at a time, building models with the remaining features.

Selects the best subset after each iteration.
RFE differs from forward search in that it begins with all features and systematically removes the least important ones, while forward search adds features incrementally.
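A minimal sketch of RFE with scikit-learn (assumed); the estimator and the number of features to keep are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the 2 retained features
print(rfe.ranking_)    # 1 = kept; larger numbers were eliminated earlier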

243
Q

What are embedded methods in feature selection?

A

Embedded methods perform feature selection during the model training process. An example is decision trees, which:

Split data based on feature importance (e.g., Gini impurity or information gain).

Naturally prioritize features that reduce uncertainty or variance the most.

244
Q

What is the core idea behind Principal Component Analysis (PCA)?

A

PCA transforms the data into a new coordinate system where:

Each new axis (principal component) is orthogonal to the others.

Principal components are ordered by the amount of variance they capture.

The first principal component captures the most variance, followed by the second, and so on.

245
Q

How does PCA handle highly correlated features?

A

PCA is particularly effective with highly correlated features because:

It combines them into fewer uncorrelated components.

It reduces redundancy while retaining most of the variance in the data.

246
Q

What is the worst-case scenario for PCA?

A

The worst-case scenario occurs when:

All variables are equally important and uncorrelated.

PCA still works but does not provide an informative reduction in dimensions.

247
Q

Describe the technical steps to perform PCA.

A

Steps for PCA:

Compute the covariance matrix of the data.

Diagonalize the covariance matrix to find its eigenvalues and eigenvectors.

Use the eigenvectors as principal components and eigenvalues to measure variance captured.

Transform the data into the new coordinate system using the principal components.

Retain the first K principal components for dimensionality reduction.
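A minimal numpy sketch (assumed) of these steps on synthetic data with one deliberately correlated feature:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make feature 3 highly correlated with feature 1

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 1. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # 2. eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]       # 3. order components by variance captured
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 2
Z = Xc @ eigvecs[:, :K]                 # 4-5. project onto the first K components
print(eigvals / eigvals.sum())          # fraction of variance per component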

248
Q

What is t-SNE, and what is its primary goal?

A

t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique. Its goal is to:

Preserve the pairwise similarities of data points when mapping high-dimensional data into 2D or 3D space.

Scatter points in a lower dimension while ensuring the distribution of distances resembles the original.

249
Q

What are the key steps in t-SNE?

A

Key steps in t-SNE:

Compute pairwise distances between points in high-dimensional space.

Convert distances into probabilities using a Gaussian distribution.

Map points to a lower-dimensional space and fit a t-distribution to distances.

Minimize the difference between the two distributions using gradient descent.

250
Q

What are the limitations of t-SNE?

A

Limitations of t-SNE include:

High memory usage, which makes it unsuitable for large datasets.

Lack of interpretability in distances between far-apart clusters.

Dependency on hyperparameters, which can affect clustering results.

251
Q

It is probably worth watching a video on t-SNE

A
252
Q

How does UMAP differ from t-SNE?

A

Differences between UMAP and t-SNE:

UMAP runs faster and uses less memory.

UMAP can preserve both local and global structures in the data.

UMAP allows embeddings in more than three dimensions.

253
Q

What are common problems with both t-SNE and UMAP?

A

Common problems include:

Heavy reliance on hyperparameter tuning.

Cluster sizes and distances between clusters are often meaningless.

Axes in the resulting visualizations are not interpretable.

254
Q

Why is it important to critically assess dimensionality reduction techniques?

A

Critical assessment is important to:

Determine whether the technique achieves the desired goals.

Identify where the method succeeds or fails.

Suggest improvements or better alternatives for the given data.

255
Q

What are some best practices when using dimensionality reduction methods?

A

Best practices include:

Understanding the assumptions and limitations of each method.

Preprocessing data (e.g., scaling, removing outliers) appropriately.

Experimenting with multiple methods to identify the best fit for the data and use case.

Using visualizations and domain knowledge to validate results.

256
Q

Lecture 17 25/11

A
257
Q

What is the main advantage of hierarchical clustering?

A

Hierarchical clustering is useful because it represents various degrees of similarity through a tree structure (dendrogram), allowing data to be partitioned at different levels. This is particularly effective if the data has an underlying tree structure.

258
Q

What are dendrograms in hierarchical clustering?

A

Dendrograms are tree diagrams that show a hierarchy of clusters. Each node represents a cluster, with single-node clusters called singletons. They visualize how clusters are merged or split at various levels.

259
Q

What is the “bottom-up” approach in hierarchical clustering?

A

Also called agglomerative clustering, it starts with individual data points as separate clusters and merges them iteratively based on a distance matrix until all points form a single cluster or a stopping criterion is met.

260
Q

Why can’t we use a brute force approach to create all possible dendrograms?

A

The number of dendrograms grows exponentially with the number of leaves, making brute force computation impractical for even moderately sized datasets.

261
Q

How is distance between clusters computed in simple linkage?

A

In simple (single) linkage, the distance between clusters is defined as the shortest distance between any two points, one from each cluster. However, this can lead to long, chain-like clusters.

262
Q

How does complete linkage define the distance between clusters?

A

Complete linkage defines the distance as the farthest distance between any two points in the clusters. It tends to break large clusters into smaller ones.

263
Q

What is average linkage in hierarchical clustering?

A

Average linkage calculates the distance between clusters as the average of all pairwise distances between points in the clusters. It is computationally intensive but provides balanced clusters.

264
Q

What is centroid linkage?

A

Centroid linkage measures the distance between clusters as the distance between their centroids. It is biased toward forming spherical clusters.

265
Q

What is Ward’s method in hierarchical clustering?

A

Ward’s method joins clusters only if the merge minimizes the increase in the total within-cluster variance, making it biased toward spherical clusters as well.
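A minimal sketch of agglomerative clustering with scipy (an assumption), showing where the linkage criteria from the previous cards are chosen:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")           # also: "single", "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
# dendrogram(Z) would draw the tree if matplotlib is available.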

266
Q

What are common applications of hierarchical clustering?

A

It is widely used in fields like biology (phylogenetic tree construction), document clustering, and image segmentation. In phylogenetics, it helps create evolutionary trees using methods like Maximum Likelihood Estimates and Bayesian Inference.

267
Q

How do you determine the “right” cutoff in a dendrogram?

A

The cutoff depends on the desired number of clusters or the natural clustering structure in the data. In some cases, it is visually apparent, while in others, it requires experimentation.

268
Q

What is the core idea of DBSCAN?

A

DBSCAN groups data into clusters based on local density. A point is part of a cluster if the density around it, defined by a specified radius (eps) and minimum points (MinPts), exceeds a threshold.

269
Q

How does DBSCAN define density?

A

Density is defined by two hyperparameters:

eps (epsilon): The radius within which points are considered neighbors.

MinPts: The minimum number of points required within the radius to form a dense region.
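A minimal sketch of DBSCAN with scikit-learn (assumed), where MinPts is called min_samples; the data and hyperparameter values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2)),
               [[10.0, 10.0]]])          # one isolated point

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)                        # cluster ids; -1 marks noise points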

270
Q

What are the types of points in DBSCAN?

A

Core points: Points with at least MinPts neighbors within eps.

Border points: Points with fewer than MinPts neighbors but within the neighborhood of a core point.

Noise points: Points that are neither core nor border points.

271
Q

What is the result of a DBSCAN clustering?

A

All points within a cluster are reachable from each other through chains of neighboring points, each step no longer than eps, forming clusters of arbitrary shape while identifying outliers (noise points).

272
Q

What are the strengths of DBSCAN?

A

DBSCAN can:

Handle clusters of different shapes and sizes.

Identify noise and outliers effectively.

Work well for spatial data and when clusters are non-spherical.

273
Q

What are the limitations of DBSCAN?

A

Limitations include:

Struggles with varying densities in the same dataset.

Highly sensitive to the choice of eps and MinPts.

Can fail to identify clusters when data density varies significantly.

274
Q

How do eps and MinPts interact in DBSCAN?

A

The choice of eps determines the size of the neighborhood, and MinPts sets the density threshold. Together, they control the size and shape of clusters. Improper settings can lead to over- or under-clustering.

275
Q

How can you determine optimal values for eps and MinPts IN DBSCAN?

A

Common techniques include:

Using a k-distance plot to identify a natural elbow in distances.

Experimenting with multiple values and evaluating cluster validity using metrics like silhouette score.

276
Q

Lecture 18 27/11

A
277
Q

What is the main advantage of partitional clustering methods over hierarchical clustering?

A

Partitional clustering methods create a set of non-nested partitions corresponding to clusters. They require fewer comparisons, reducing computational complexity from O(n^2) (in hierarchical clustering) to roughly O(Kn), where K is the number of clusters and n is the number of data points.

278
Q

What are the key steps of the K-means clustering algorithm?

A

The K-means algorithm involves the following steps:

Input the data points and a given number of clusters K.

Choose K random data points as the initial cluster centroids.

Assign each data point to the closest centroid.

Recompute the centroids using the current points in each cluster.

Check for a convergence or stopping criterion. If not met, repeat steps 3 to 5.
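A minimal sketch of these steps with scikit-learn's KMeans (assumed); the synthetic data and K = 2 are for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids after convergence
print(km.inertia_)           # sum of squared errors (SSE)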

279
Q

What are the possible stopping criteria in K-means clustering?

A

Possible stopping criteria include:

Few or no reassignments of data points to different clusters.

Minimal or no change in centroids.

Minimal or no change in the sum of squared errors (SSE).

280
Q

What are the advantages of K-means clustering?

A

Advantages of K-means include:

Efficiency: For n data points in d dimensions, K clusters, and up to t iterations, the runtime is O(t * K * n * d), making it scalable for large datasets.

Simplicity: It is easy to understand and implement.

281
Q

What are the limitations of K-means clustering?

A

Limitations of K-means include:

It only works if a centroid can be defined, which may not be possible for categorical data.

The need to specify the value of K.

Sensitivity to outliers.

Struggles with clusters of varying size, shape, or density.

Dependence on the initial choice of centroids.

282
Q

How can the limitations of K-means clustering be addressed?

A

Solutions include:
Pre-processing:

Normalize data (scale to [0, 1]) or standardize it (subtract mean, divide by standard deviation).

Eliminate outliers.

Post-processing:

Eliminate small clusters representing outliers.

Split clusters with high SSE or merge clusters with low SSE.

Alternatively, use more advanced clustering algorithms, such as Gaussian Mixture Models (GMM).

283
Q

What distinguishes Gaussian Mixture Models (GMM) from K-means clustering?

A

GMM represents clusters using probability distributions, fitting the mean (μ) and standard deviation (σ) of each Gaussian component. Unlike K-means, which makes hard assignments, GMM uses soft assignments, where a point belongs to each cluster with a certain probability. GMM assumes elliptical clusters, while K-means assumes spherical clusters.
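A minimal sketch contrasting the two assignment styles with scikit-learn (assumed), using synthetic data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(4, 1.0, (50, 2))])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)[:5])  # hard labels
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5])   # soft assignment: a probability per cluster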

284
Q

What is the role of the likelihood function in GMM?

A

The likelihood function in GMM measures the probability of data points given the Gaussian distributions. The algorithm maximizes the log-likelihood to fit the parameters (μ, σ, and π) of the model. This ensures the Gaussian components accurately represent the underlying data structure.

285
Q

What are the main clustering validation metrics?

A

Validation metrics include:

External validation: Measures how clustering labels align with ground-truth labels.

Internal validation: Evaluates clustering quality using internal measures like cohesion (within-cluster distances) and separation (between-cluster distances).

Relative validation: Compares multiple clustering results to identify the best fit.

286
Q

Between K-means and GMM, which method assigns points softly to clusters?

A

GMM assigns points softly based on a probability distribution, while K-means assigns points in a hard manner and definitively classifies them.

287
Q

Out of K-means and GMM, which clustering method assumes spherical clusters and which assumes elliptical?

A

K-means assumes spherical clusters and GMM assumes elliptical clusters.

288
Q

How is the Silhouette Coefficient used in clustering validation?

A

The Silhouette Coefficient measures clustering quality for each data point:

a: mean distance to the other points in the same cluster (cohesion).

b: mean distance to the points in the nearest other cluster (separation).

The coefficient is calculated as:

s = (b - a) / max(a, b)

Values range from -1 (poor clustering) to 1 (excellent clustering).
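A minimal sketch computing silhouette values with scikit-learn (assumed), on synthetic data clustered by K-means:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # average coefficient over all points
print(silhouette_samples(X, labels)[:5])  # per-point values, usable for a bar chart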

289
Q

How can the Silhouette Coefficient be visualized?

A

The coefficient can be visualized using a bar chart where each bar represents the silhouette value of a data point, grouped by clusters. The average silhouette score can also be plotted to summarize clustering quality.

290
Q

What are some common preprocessing steps for clustering?

A

Preprocessing steps include:

Normalization (scaling data to [0, 1]) or standardization (subtracting mean, dividing by standard deviation).

Removing or handling outliers to avoid skewed results.

Feature selection to focus on relevant variables.

291
Q

Why does K-means struggle with clusters of varying sizes, shapes, or densities?

A

K-means assumes clusters are spherical and equally sized, leading to poor performance when clusters have irregular shapes, varying sizes, or densities. Points near boundaries can also be misclassified due to hard assignments.

292
Q

Lecture 19

A
293
Q

What is the input and output of computer vision systems?

A

Input: Images (e.g., photographs, video frames).

Output: High-level information about people, objects, or 3D structures, such as:

Object detection

Image segmentation

3D image reconstruction

Terrain modeling and position tracking (e.g., NASA Spirit rover applications).

294
Q

How are images represented in computer vision?

A

Images are represented as matrices of pixel values.

For grayscale images: A single 2D matrix (e.g., 400 x 400 pixels).

For colored images: Three 2D matrices, one for each color channel (red, green, and blue).

295
Q

What inspired deep neural networks, particularly in computer vision?

A

The visual cortex of the brain is organized into layers, with information flowing from one layer to another.

Research by Hubel & Wiesel (1959): Neurons in the visual cortex respond to specific patterns, such as lines of particular orientations.

Convolutional Neural Networks (CNNs) emulate this layered structure and hierarchical feature detection

296
Q

What is a perceptron, and how does it process input?

A

A perceptron is the simplest type of neural network.

Input: A vector of features (e.g., x1, x2, …, xK).

Output: A binary result (y = 0 or 1).

Each input is weighted, summed, and passed through an activation function to determine the output.

297
Q

What are the limitations of representing images as high-dimensional vectors for neural networks?

A

Scalability: A 100 x 100 pixel image would require 10,000 parameters per node.

Sensitivity: Networks are not robust to small changes in input (e.g., image translation or rotation).

Inefficiency: Does not leverage spatial correlations between nearby pixels.

298
Q

How do convolutional neural networks (CNNs) address these limitations?

A

CNNs use filters (kernels) to identify patterns such as edges or textures in images.

Spatial relationships between pixels are preserved via convolutions.

Shared weights reduce the number of parameters, enhancing scalability.

Robust to transformations (e.g., shifts, rotations) due to hierarchical feature learning.

299
Q

What is a convolution in CNNs?

A

A convolution involves sliding a filter (kernel) across an image and performing element-wise multiplications (dot products) between the filter and the image patch.

Resulting values are stored in a feature map, which highlights the presence of specific features.

Positive values in the feature map indicate where the filter detects features.
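A minimal numpy sketch (assumed) of a single filter sliding over a tiny image to produce a feature map; the image and kernel values are made up to show a vertical edge being detected:

import numpy as np

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)   # responds to dark-to-bright vertical edges

H = image.shape[0] - kernel.shape[0] + 1
W = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        patch = image[i:i + 2, j:j + 2]
        feature_map[i, j] = np.sum(patch * kernel)   # element-wise product, then sum

print(feature_map)   # positive values mark where the vertical edge is detected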

300
Q

What is a feature map in CNNs?

A

A feature map represents the regions of an image where a particular feature (e.g., edges, lines) is detected.

Positive values: The feature is present in the corresponding area.

Negative or zero values: The feature is absent.

301
Q

What are the key advancements in computer vision since 1980?

A

Deeper architectures: Introduction of more layers and complex networks (e.g., ResNet, AlexNet).

Data availability: Large datasets (e.g., ImageNet) and resources (e.g., GPUs, cloud computing).

Software tools: Frameworks like TensorFlow, PyTorch, and Keras simplify implementation.

Deep learning: The rise of “deep learning” as a subfield in the 2000s.

302
Q

What are the limitations of deep learning in computer vision?

A

Data requirements: Requires large amounts of labeled data.

Computational intensity: Demands powerful GPUs or cloud resources.

Uncertainty: Struggles with representing uncertainty; easily fooled by adversarial examples.

Optimization challenges: Difficult to fine-tune architectures and learning methods.

Interpretability: Neural networks often function as black boxes, making it hard to understand decisions.

303
Q

How does CNN backpropagation work for learning filters?

A

Filters in CNNs are initialized randomly and updated using the backpropagation algorithm.

Loss gradients with respect to the filter weights are computed and used to adjust the weights to minimize the loss.

The process repeats over multiple training iterations.

304
Q

What role do GPUs play in deep learning for computer vision?

A

GPUs excel at parallel processing, which is critical for training deep learning models.

They accelerate matrix multiplications and convolutions, enabling faster training on large datasets.

305
Q

What is image segmentation, and why is it important?

A

Image segmentation divides an image into meaningful segments, such as identifying individual objects or regions.

Applications include medical imaging (e.g., tumor detection) and autonomous vehicles (e.g., road and obstacle detection).

306
Q

What is the next step after creating a feature map in CNNs?

A

Use feature maps as inputs to:

Pooling layers: Downsample feature maps to reduce dimensionality while retaining important information.

Fully connected layers: Combine extracted features for final predictions (e.g., classification).

Customizing networks: Optimize learning rates, activation functions, and other parameters.

307
Q

Why are pooling layers used in CNNs?

A

Pooling reduces the spatial size of feature maps, improving computational efficiency.

Types include max pooling (selects maximum value in a patch) and average pooling (computes the mean value).

Helps achieve translation invariance by focusing on dominant features.

308
Q

How does computer vision benefit from modern resources and tools?

A

Access to large annotated datasets enables better training.

Free frameworks (e.g., TensorFlow, PyTorch) simplify development.

Cloud services provide scalable infrastructure for model training and deployment.

309
Q

What challenges remain in computer vision and deep learning?

A

Creating models that generalize well across diverse datasets.

Developing interpretable neural networks to improve trust.

Efficiently managing large-scale data and computational costs.

Representing uncertainty to prevent overconfidence in predictions.

310
Q

Lecture 20

A
311
Q

What is a feature map in a CNN?

A

A feature map is the result of applying a filter (kernel) to an input image. The convolution operation involves sliding the filter over the image and computing a dot product to extract features like edges or textures.

312
Q

What does the bias term (b) do in a CNN?

A

The bias term shifts the result of the convolution operation, allowing the feature map to detect patterns with varying intensities.

313
Q

What is the ReLU function, and why is it used?

A

ReLU (Rectified Linear Unit) is defined as ReLU(x) = max(0, x). It introduces non-linearity, allowing the model to learn complex patterns. It also helps avoid the vanishing gradient problem seen in sigmoid and tanh activation functions.

314
Q

What is pooling, and why is it used in CNNs?

A

Pooling is a downsampling operation that reduces the spatial dimensions of feature maps, making computation more efficient and reducing overfitting. The most common types are:

Max-pooling: Takes the maximum value in a region.

Average-pooling: Takes the average value in a region.

315
Q

What is the impact of downsampling on CNN performance?

A

Downsampling:

Reduces computational complexity.

Retains important features while discarding irrelevant details.

Helps prevent overfitting by generalizing the model.

316
Q

How does the CNN output transition into a traditional ML problem?

A

After extracting features and downsampling, the CNN output is often flattened into a vector and fed into a classifier like:

Logistic regression.

Fully connected layers of a neural network for final predictions.

317
Q

What is padding in CNNs, and why is it used?

A

Padding adds extra pixels (usually zeros) around the input image, which:

Maintains the spatial dimensions after convolution.

Ensures edge information is preserved.

318
Q

What is stride, and how does it affect the output?

A

Stride is the step size by which the filter moves across the input image. A larger stride:

Reduces the spatial dimensions of the output.

Speeds up computation but may lose detailed features.

319
Q

What are some limitations of ReLU and tanh activation functions?

A

ReLU: Discards all negative values, potentially causing “dead neurons.”

Tanh: Its derivatives approach zero for extreme input values, leading to the vanishing gradient problem.

320
Q

How does the softmax function work in classifiers?

A

Softmax converts raw output scores into probabilities for each class using the formula:

softmax(z_i) = exp(z_i) / sum_j exp(z_j)

It ensures the output probabilities sum to 1, making it suitable for multi-class classification.
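A minimal numpy sketch (assumed) of this formula, with the usual max-subtraction for numerical stability:

import numpy as np

def softmax(scores):
    z = scores - np.max(scores)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx [0.659, 0.242, 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # probabilities sum to 1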

321
Q

What is the learning rate in gradient descent?

A

The learning rate (commonly denoted eta or alpha) controls how much the model parameters are updated during training.

322
Q

What are the effects of large and small learning rates?

A

Large learning rates: Can overshoot the optimal solution, missing the minima.

Small learning rates: Can make training very slow or get stuck in suboptimal solutions.

323
Q

What techniques are used to adjust the learning rate?

A

Techniques include:

Decaying the learning rate over time (e.g., step, exponential, or 1/t decay schedules).

Using advanced optimizers like Adam, Adagrad, RMSprop, or incorporating momentum.

324
Q

What are hyperparameters in a CNN? Provide examples.

A

Hyperparameters are user-set configurations that control the learning process. Examples include:

Filter size and number of filters.

Padding and stride values.

Learning rate and decay.

Dropout rate.

Number of epochs and batch size.

Activation functions.

Number of hidden layers and neurons per layer.

325
Q

Why is customization important in neural networks?

A