Learning from data Flashcards
Define data integration and the goal of data integration
Data integration is the practice of combining data from heterogeneous sources into a single coherent
data store.
Its goal is to provide users with consistent access to, and delivery of, data across a spectrum of subjects and data structure types.
Define a common user interface (manual data integration)
A hands-on approach where data
managers manually handle every step of the integration, from retrieval to presentation.
Define middleware data integration
Uses middleware software to bridge and facilitate communication between different systems, especially between legacy and newer systems
Define application based data integration
Software applications locate, retrieve and integrate data by making
data from different sources and systems compatible with one another
Define uniform data access
It provides a consistent view of data from diverse sources without moving or
altering it, keeping the data in its original location.
Define common data storage (data warehousing)
It retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository.
What is a pro and a con for a common user interface
Reduced cost, requires little maintenance, integrates a small number of data sources, user has total control.
Data must be handled manually at each stage, scaling to larger projects requires changing code, manual orchestration.
What is a pro and a con for middleware data integration
Middleware software conducts the integration automatically and in the same way each time.
Middleware needs to be deployed and maintained.
What is a pro and a con for application based integration
Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.
Requires specialist technical knowledge and
maintenance, complicated setup.
What is a pro and a con for uniform data access
Lower storage requirements, provides a simplified view of the data to the end user, easier data access
Can compromise data integrity; data host systems are not designed to handle the volume and frequency of data requests.
What is a pro and a con for common data storage ( data warehousing )
Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity
Need to find a place to store a copy of the data, increases storage cost, require technical experts to set up the integration, oversee and maintain the data warehouse.
What is the difference between supervised and unsupervised learning? Also what is Semi-Supervised learning?
Supervised learning algorithms use data with labelled outcomes while unsupervised learning algorithms use data without labelled outcomes.
Semi-supervised learning algorithms use both data with labelled outcomes and without labelled outcomes.
What is the task of supervised learning?
The task is to learn a mapping function from possible inputs to outputs.
What is the task of unsupervised learning
In unsupervised learning, our task is to try to “make sense of” data, as opposed to learning a mapping. This is as we have inputs but no associated responses.
Strictly define a hyperparameter
A hyperparameter is a parameter that is not learned directly from the data but relates to
implementation
Define the training and prediction phase in a ML model
In the training phase a ML model learns the parameters that define the relationship between the features and the outcome variable. The more data the better.
In the prediction phase we get new observations, feed these values into our
trained model, and we have a prediction.
What can we use to measure the quality of our predictions
Most models will define a loss function which is some quantitative measure
of how close our prediction is to the actual value. In addition there will also be an update rule that will determine how to update the model parameters
Define the difference between regression and classification
Classification deals with assigning data points to one of a set of predetermined, pre-existing classes. Regression deals with continuous data, e.g. any continuous outcome such as loss, revenue, or number of years, or anything that can be answered with the question "how much?"
Look over calculating linear regression coefficients by hand
https://ele.exeter.ac.uk/pluginfile.php/4546128/mod_resource/content/0/LfD-L2.pdf
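As a minimal sketch with made-up numbers, the by-hand formulas for simple linear regression can be checked in Python: the slope is Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and the intercept is ȳ − b1·x̄.

```python
import numpy as np

# Illustrative data (made-up values for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Simple linear regression coefficients by hand:
# slope b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
# intercept b0 = y_mean - b1 * x_mean
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean
print(b0, b1)
```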
Define the mean squared error function and how it works
The Mean Squared Error (MSE) function is the sum of squared errors divided by the number of values: MSE = (1/n) Σ(yᵢ − ŷᵢ)².
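A minimal sketch of this definition in Python, with illustrative values only:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: sum of squared errors divided by the number of values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum((y_true - y_pred) ** 2) / len(y_true)

print(mse([3.0, 5.0, 2.5], [2.8, 5.4, 2.1]))  # small worked example
```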
What is the difference between explained and unexplained variation in a regression model?
The explained variation measures how much of the total variation is captured by the regression model, i.e., how much of the variation in
𝑦 can be explained by the independent variable(s) 𝑥. While the unexplained variation measures the variability in the dependent variable that is not captured by the regression model. It is also known as the error or residual variation.
Total variation is the unexplained variation added to the explained variation
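Using these definitions, a minimal numpy sketch (made-up values) of splitting total variation into explained and unexplained parts; the explained fraction is what is commonly reported as R²:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values (illustrative)
y_pred = np.array([3.2, 4.8, 7.1, 8.9])   # predictions from some fitted model

total = np.sum((y - y.mean()) ** 2)        # total variation
unexplained = np.sum((y - y_pred) ** 2)    # residual / error variation
explained = total - unexplained            # variation captured by the model
print(explained / total)                   # fraction of variation explained (R^2)
```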
What is the primary objective for a prediction model?
The primary objective is to make the best prediction.
What is the main focus for prediction models?
The focus is on performance metrics, which measure the quality of the model’s predictions.
Performance metrics usually involve some measure of closeness between ypred and y.
Without focusing on interpretability, we risk having a Black-box model.
How can we determine the accuracy of a prediction model?
The closer the predicted values are to the observed
values, the more accurate the prediction is.
The further the predicted values are from the observed values, the less accurate the prediction is.
What is the primary objective for an interpretation model?
The primary objective is to train a model to find insights from the data.
What is the primary focus for an interpretation model?
On the model’s coefficients or feature importance, which reveal which features are most influential in predicting the outcome.
Why might we want to use polynomials in our linear regression models?
Polynomials can help to predict better - they can sometimes better fit the curvature of the actual data.
Polynomials can also help to find variables that explain variation in data better
What kind of features can be captured by adding polynomial features to our linear regression models?
Higher order features - features that are created by transforming the original features (or predictors) in a non-linear way. Higher-order features help capture more complex, non-linear relationships between the independent variables and the dependent variable that would otherwise be missed by a simple linear model.
Note: adding polynomial features does not mean the algorithm is no longer linear regression. The non-linear relationship between one feature and another is not going to make the algorithm
non-linear, it is the algorithm itself that is still a linear combination of features.
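As a minimal scikit-learn sketch of this (synthetic quadratic data made up for illustration): the expanded features [1, x, x²] are still combined linearly, so the algorithm remains linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data with a curved (quadratic) relationship
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.3, size=100)

# Still linear regression: a linear combination of the features [1, x, x^2]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.score(X, y))
```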
What is the Bayes Information Criterion ( BIC )
The Bayesian Information Criterion (BIC) is a statistical measure used for model selection. It helps compare different models and identify the one that best balances goodness of fit with model complexity. Specifically, BIC penalizes models with more parameters to avoid overfitting, favoring simpler models when they adequately explain the data.
What is the formula for the Bayes Information Criterion?
BIC = n·ln(SSE) − n·ln(n) + p·ln(n)
SSE = sum of squared errors
n = number of observations
p = number of parameters
ln = natural logarithm
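A small helper function implementing the formula above, with made-up SSE/n/p values for illustration; lower BIC is better:

```python
import numpy as np

def bic(sse, n, p):
    """BIC = n*ln(SSE) - n*ln(n) + p*ln(n)
    sse: sum of squared errors, n: number of observations, p: number of parameters."""
    return n * np.log(sse) - n * np.log(n) + p * np.log(n)

# Compare two candidate models fitted to the same data
print(bic(sse=120.0, n=50, p=3))
print(bic(sse=118.0, n=50, p=6))  # a small fit gain rarely justifies extra parameters
```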
Interaction terms can be added to a linear regression modelling algorithm, what are interaction terms?
Interaction terms in linear regression modeling are terms added to the model to capture the combined effect of two (or more) predictor variables on the dependent variable. Interaction terms account for the situation where the effect of one predictor on the outcome depends on the level or value of another predictor.
How can interaction terms be represented in our modelling algorithm
If there are two predictor variables, say x1 and x2, then the interaction term is represented by multiplying the two variables together. The resulting equation could look like this:
y = β0 + β1x1 + β2x2 + β3(x1 × x2) + ε
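One way to construct such a term in practice is sketched below, assuming scikit-learn's PolynomialFeatures (a recent version, for get_feature_names_out) with interaction_only=True, so only the x1·x2 column is added:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative predictors x1 and x2
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])

# interaction_only=True adds the x1*x2 column without squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = poly.fit_transform(X)
print(poly.get_feature_names_out(["x1", "x2"]))  # ['x1' 'x2' 'x1 x2']
print(X_interact)
```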
Why should data be split into a training and testing split?
In machine learning, a training set is used to train or fit the model, allowing it to learn patterns and relationships from the data. A training set can also be used to learn the optimal parameters for a model
The testing set (or test set) is kept separate and used to evaluate the model’s performance on unseen data. This separation is crucial to prevent overfitting, where the model performs well on the training data but poorly on new, unseen data.
By using both a training and testing set, you ensure the model generalizes well and can make accurate predictions in real-world scenarios.
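A minimal sketch of an 80/20 split with scikit-learn's train_test_split (toy data made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations, 2 features
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# 80% used for training, 20% held out as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```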
How can we use the Test data set (also known as a holdout set) to measure model performance?
- Predict labels with the trained model.
- Compare the predictions with the actual values.
- Measure the error.
Define cross validation in the test train split and why it should be used
- Instead of using a single training and test set, we use cross validation to calculate the error across
multiple training and test sets. - With cross validation, we split the data into multiple pairs of training and test sets.
- Average the error across each one of the test set errors.
- Performance measure will be more statistically significant.
Define k-fold cross validation and how it works
A single parameter, k, specifies the number of groups the data sample is to be split into.
- Process:
1. Shuffle the dataset.
2. Split the dataset into k groups.
3. For each unique group:
a) Use the group as a test set.
b) Use the remaining groups as a training set.
c) Fit the model on the training set, then evaluate it on the test set.
d) Store the evaluation score, then discard the model.
4. Summarise model performance using the model evaluation scores.
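A minimal sketch of 5-fold cross validation with scikit-learn (synthetic data); cross_val_score runs the split/fit/evaluate loop and returns one score per fold, which is then averaged:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5-fold CV: fit on 4 folds, evaluate on the held-out fold, repeat, then average
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # average test-fold MSE
```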
Define stratified sampling
Stratified sampling is a sampling technique where the samples are selected in the same proportion as they appear in the population.
Why should we implement stratified sampling in conjunction with k-fold cross validation?
Implementing stratified sampling in cross validation ensures that the training and test sets have the same proportion of the feature of interest as in the original dataset.
* By doing this with the target feature, we ensure that the cross validation result is a close approximation of the generalisation error
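A minimal sketch assuming scikit-learn's StratifiedKFold, which keeps the class proportions of the target the same in every fold (synthetic imbalanced data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Each fold preserves the 90/10 class proportion of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```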
If Training error and cross validation error metrics are both high what is that a sign of
Underfitting, which is where a model is too simple to capture the underlying patterns in the data
If the training error is low but the cross validation error is high what is that a sign of?
Overfitting, which is where the model learns the training data (including its noise) instead of the underlying relationships in the data.
When we are creating a model we want both the training and cross validation errors to be low - what is a good approach to doing this?
Good approach: as soon as the cross validation error starts to increase, we stop making the model more complex.
When the model is too complex, e.g. it has a high polynomial degree, too many layers, etc., what is likely to occur?
The model is likely to just learn the training data and will start overfitting.
What are the three sources of model error?
Bias
Variance
Irreducible error
What is the bias in a model?
Bias is the tendency of predictions to miss the true values; a model with high bias will not be very accurate and will predict incorrectly much of the time. We want bias to be as low as possible.
What can high bias be the result of?
- The model misrepresenting the data, e.g. due to missing information.
- An overly simple model, i.e., bias towards the simplicity of the model.
- The model missing the real patterns in the data.
High bias is often linked to underfitting the training data.
What is the variance in a model and what is it characterised by?
Variance is the tendency of predictions to fluctuate and is characterised by high sensitivity of output to small changes in input data
What causes a model to have high variance
Variance is often caused by overly complex or poorly fit models e.g very high degree polynomial models.
It’s associated with overfitting
What is irreducible error in a model?
Irreducible error in a model refers to the portion of the total error that cannot be reduced or eliminated, no matter how well the model is designed or how much data is used. This error arises from inherent variability or noise in the data that the model cannot capture or explain.
What are some sources of irreducible error?
Measurement errors in data collection.
Randomness in the underlying process being modeled.
Unpredictable factors that affect the outcome but are not included in the model.
It is impossible to perfectly model the majority of real world data. Thus, we have to be comfortable
with some measure of error.
What is the bias-variance trade off?
Making model adjustments aimed at reducing bias can often end up increasing variance, and vice versa.
Analogous to the complexity tradeoff where we want to choose the right level of complexity to find the best model
The model should be complex enough not to underfit but not so complex that it overfits
“We search for a model that describes the feature target
relationship but not so complex that it fits to spurious patterns”
How does the degree of a polynomial regression relate to the Bias-Variance tradeoff?
The higher the degree of a polynomial regression, the more complex that model is (lower bias, higher
variance).
- At lower degrees: the predictions are too rigid to capture the curved pattern in the data (bias).
- At higher degrees: the predictions fluctuate wildly because of the model’s sensitivity (variance).
As the degree of the polynomial increases:
Bias decreases, because the model becomes more flexible and better captures the training data.
Variance increases, because the model becomes more sensitive to noise and variations in the data.
The optimal model has sufficient complexity to describe the data without overfitting.
Define a cost function
A cost function in machine learning is a mathematical function that measures the error or difference between the predicted output of a model and the actual target values. It serves as a quantitative measure of how well or poorly a machine learning model performs. The goal of training a model is to minimize this cost function.
Define Linear Model Regularisation
To regularize linear models we can add a regularization strength parameter directly into the cost function
This parameter (λ, lambda) adds a penalty proportional to the size (magnitude) of the estimated model parameters.
- When λ is large, large parameters are penalised more strongly. Thus, a more complex model will be penalised.
How does the introduction of a regularisation term to a linear model affect the bias-variance tradeoff?
The regularisation strength parameter λ allows us to manage the complexity tradeoff.
* More regularisation introduces a simpler model or more bias.
* Less regularisation makes the model more complex and increases variance.
If the model overfits (variance is too high), regularisation can improve generalisation error and reduce
variance.
How is the penalty λ applied in Ridge regression?
In ridge regression, the penalty λ is applied proportionally to squared coefficient values.
- This penalty imposes bias on the model and reduces variance
How can we find the best value λ when performing regression on a linear model?
We can use cross validation; it is best practice to scale the features first.
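A minimal sketch of this with scikit-learn, which calls λ "alpha": scale the features, then let RidgeCV pick the penalty by cross validation over an illustrative list of candidates.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Scale first (best practice), then pick the regularisation strength by CV
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]))
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)  # the selected penalty
```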
What is LASSO regression and how is the penalty λ applied
- In LASSO (Least Absolute Shrinkage and Selection Operator), the penalty λ is applied proportionally
to absolute coefficient values
What is L1 and L2 regularisation?
LASSO and Ridge regression are also known as L1 regularisation and L2 regularisation, respectively
The names L1 and L2 regularisation come from the L1 and L2 norm of a vector w, respectively
How can we calculate the norm for L1 regularisation?
The L1-norm is calculated as the sum of the absolute values of the vector's components: ||w||₁ = |w₁| + |w₂| + … + |wₙ|.
How can we calculate the norm for L2 regularisation?
The L2-norm is calculated as the square root of the sum of the squared vector components: ||w||₂ = √(w₁² + w₂² + … + wₙ²).
Why should we use L1 vs L2 regularisation?
L1 (LASSO) drives some coefficients exactly to zero, so it performs feature selection and produces sparse, more interpretable models. L2 (ridge) shrinks all coefficients smoothly towards zero without eliminating any, which tends to work better when many features each contribute a little.
What is feature selection and why is it important?
Feature selection is figuring out which of our features are important to include in the model.
Reducing the number of features can prevent overfitting.
* For some models, fewer features can improve fitting time and/or results.
* Identifying the most important features can improve model interpretability.
* Feature selection can also be performed by removing features.
* Remove features one at a time and measure the predictive results using cross validation; if eliminating a feature improves the cross validation results, or doesn't increase the error much, that feature can be removed.
How does regularisation perform feature selection
Regularisation performs feature selection by shrinking the contribution of features.
* For L1-regularisation, this is accomplished by driving some coefficients to zero.
What is data integration or data fusion?
Data integration (or data fusion) combines data so that it can all be treated as if it were retrieved from the same source, enabling efficient analysis without worrying about disparate sources.
What is data cleaning, and why is it necessary?
Data cleaning is the process of detecting and correcting corrupt, inaccurate, or incomplete records in a dataset. It ensures the dataset is reliable, accurate, and usable for analysis by removing or rectifying errors, outliers, and inconsistencies.
What are missing values in a dataset, and how are they represented?
Missing values are data points expected in the dataset but absent. In pandas, they are often represented as NaN (Not a Number). Other representations include None, 9999, or N/A. Missing values can arise from human error, skipped survey questions, or database management issues.
What are the three types of missing data?
Missing Completely At Random (MCAR): missing values occur randomly, unrelated to observed or unobserved data (e.g., random sensor failure); the probability of a value being missing is equal for all units.
Missing At Random (MAR): probability of a missing value depends on observed data but not the missing data itself (e.g., sensor failure during high wind speeds).
Missing Not At Random (MNAR): happens when we know exactly which data object will have missing values. In this case, the probability of missing values is related to the actual missing data.(e.g., tampered sensors near a polluting power plant).
What are the four approaches to dealing with missing values?
Keep as-is:
Used when the tools or goals can handle missing values directly, such as the KNN algorithm.
This approach ensures the dataset remains intact and avoids introducing biases during preprocessing.
Remove rows with missing values:
Suitable for MCAR situations, where missing values are completely random.
Avoid in MAR and MNAR cases, as removing rows risks introducing bias by excluding specific subsets of data.
Remove columns with missing values:
Effective when missing rates exceed 25%, especially for non-critical attributes.
Avoid for critical attributes, as their removal can compromise the analysis.
Impute missing values:
Replace missing values with central tendencies (mean, median, or mode), subgroup averages (for MAR), or regression models (for MNAR).
Be cautious of introducing bias during imputation.
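A minimal pandas sketch of two of these imputation strategies on made-up data: filling with a central tendency, and a MAR-style fill using subgroup means.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with missing values (NaN)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [30000, 42000, np.nan, 51000, 38000],
    "group": ["a", "a", "b", "b", "a"],
})

# Central-tendency imputation: fill missing ages with the median age
df["age"] = df["age"].fillna(df["age"].median())

# MAR-style imputation: fill missing income with the mean of the same subgroup
df["income"] = df.groupby("group")["income"].transform(lambda s: s.fillna(s.mean()))
print(df)
```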
What are the guiding principles when dealing with missing values, and what considerations should we make?
We should aim to preserve data and information and minimize bias introduction.
Additional considerations include:
- What are our analytical goals? (clustering, classification, etc.)
- What analytical tools are we using?
- What is the cause of our missing values?
- What type of missing values are we dealing with?
What is an outlier, and what are the common causes?
An outlier is a data point that significantly differs from others. Causes include:
Errors: Data entry or measurement mistakes.
Legitimacy: True but extreme values that may skew results.
Fraud: Deliberate manipulation requiring scrutiny.
Random errors: Unavoidable fluctuations or inconsistencies in measurements due to chance.
How can outliers be detected and managed?
Outliers can be detected using the interquartile range (IQR):
Define outliers as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1.
Management strategies:
- Do nothing (if the analysis is robust to outliers).
- Replace with caps (the upper or lower bound): ideal when the analysis is sensitive to outliers and retaining all data objects is vital.
- Apply a log transformation: for particularly skewed data.
- Remove outliers: the worst option due to potential loss of information. It should be done only when other methods are inapplicable and when the data is correct but the outlier values are excessively distinct.
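A minimal pandas sketch of the IQR rule and the "replace with caps" strategy, using made-up values:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 14, 95])  # 95 is the obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])     # detected outliers

# "Replace with caps": clip values to the IQR bounds instead of dropping them
print(s.clip(lower=lower, upper=upper))
```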
What are systematic errors, and why are they challenging?
Systematic errors are consistent, repeatable errors linked to specific sources (e.g., faulty instruments). They are difficult to detect as they may go unnoticed in data but can cause bias in analyses. Outlier detection can sometimes help identify these errors.
Why is data transformation necessary?
Data Transformation is the last stage of data preprocessing before using analytic tools. It ensures our dataset meets key
characteristics and is ready for analysis!
Data transformation adjusts data ranges to:
Meet model assumptions.
Improve algorithm stability.
Ensure fair contribution from all features.
Common techniques include standardization, normalization, log transformation, and categorical-to-numerical conversions.
What is the difference between standardization and normalization?
Standardization: Rescales data to have a mean of 0 and a standard deviation of 1.
Normalization: Rescales data to a specific range, typically [0, 1].
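A minimal scikit-learn sketch contrasting the two on a single illustrative feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])  # one illustrative feature

# Standardization: mean 0, standard deviation 1
print(StandardScaler().fit_transform(X).ravel())

# Normalization: rescaled to the range [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())
```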
When is a log transformation used, and why?
Log transformation is used to address skewness and extreme values. It is particularly useful for:
Features varying over orders of magnitude.
Cases where ratios are more critical than absolute differences.
Formula: x′ = log(x), where x′ is the transformed feature value.
Define an error in Data
A discrepancy or deviation in measured data from the actual or true values, which can arise due to various factors during data collection or measurement.
What is discretisation
The process of converting continuous data (numerical values) into discrete intervals or categories. This transformation helps simplify the data, making it easier to analyze, interpret, or use in certain machine learning algorithms that perform better with categorical data.
What are some techniques for transforming data from numerical to categorical and what is an advantage and disadvantage
Binary coding:
Categorical data is represented by a unique number for each category. These numbers are then converted into binary form, with each digit of the binary number getting its own data column (feature).
- Works well with high-cardinality categorical variables and reduces dimensionality.
- Will not preserve ordinal relationships.
Ranking transformation (ordinal encoding): data is transformed based on the order or rank of the values; useful where hierarchies are significant.
- Preserves the natural ordering of the data.
- Not suitable for nominal data and may lose information about relative distances.
Attribute conversion - Transforms existing attributes (features) into new attributes that are better suited for machine learning models. This is a broader technique that encompasses several types of transformations, including scaling, normalizing, and converting attributes into different representations.
-Can improve model performance by making data more interpretable
-requires domain knowledge and risks introducing bias
What is smoothing in data transformation, and how is it applied?
Smoothing reduces noise and fluctuations in data to reveal underlying trends. Common methods include:
Moving average: averaging data points in successive subsets to create smoothed values.
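A minimal pandas sketch of a moving average (the window size and values are illustrative):

```python
import pandas as pd

# Noisy illustrative series
s = pd.Series([3, 8, 2, 9, 4, 10, 3, 11, 5, 12])

# Moving average: average each point with its neighbours in a rolling window
print(s.rolling(window=3, center=True).mean())
```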
What is the role of bar charts in data visualization?
Bar charts display categorical data distributions. Tips for effective use:
Avoid too many bars.
Use horizontal layouts for readability when necessary.
When are line plots and scatter plots used in visualization?
Line plots: Best for showing trends over time. Avoid markers; use legends on curves.
Scatter plots: Visualize relationships between variables, often with multiple variables for richer insights.
Why should pie charts be used cautiously?
Pie charts display proportions but are often misleading due to poor readability and comparison challenges. Alternative visualizations like bar charts are recommended.
What is classification in machine learning?
Classification is a supervised learning task where an algorithm is trained on labeled data to predict the class of new data points. It assigns inputs (x) to predefined categories (y), such as predicting whether an email is spam or not. The classifier chooses the class with the highest predicted probability.
How is the train-test split used in machine learning?
The dataset is split into two parts: the training set (usually 80%) is used to train the model, and the test set (usually 20%) is used to evaluate its performance. This ensures that the model is tested on unseen data to check its generalization ability.
What are some common applications of classification
Applications include image classification (e.g., recognizing objects in images), spam detection in emails, medical diagnosis (e.g., disease prediction), and customer segmentation in marketing.
How does logistic regression work?
Logistic regression is a statistical model used for binary classification. It predicts the probability that a given input x belongs to class 1 (y=1). It uses the logistic function to ensure the output is between 0 and 1. It is up to the developer/researcher to decide which threshold to apply, but the default is 0.5 (50%).
What kind of data does logistic regression handle?
Logistic regression handles continuous input data (x) and binary output data (y).
How do you interpret the logistic regression model’s output?
The output of logistic regression is the probability of the input data belonging to class 1. A threshold (e.g., 0.5) is then applied to decide the final class label.
What is the cost function used in logistic regression?
Logistic regression uses the log-loss (cross-entropy) cost function, which measures the difference between the predicted probabilities and the actual class labels.
How is logistic regression implemented in scikit-learn?
Import the necessary modules: from sklearn.linear_model import LogisticRegression.
Split the data using train_test_split.
Create the classifier: log_reg = LogisticRegression().
Fit the model: log_reg.fit(X_train, y_train).
Predict and evaluate: log_reg.predict() and log_reg.score().
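Putting those steps together, a minimal runnable sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)         # class labels using the default 0.5 threshold
y_prob = log_reg.predict_proba(X_test)   # predicted probabilities for each class
print(log_reg.score(X_test, y_test))     # accuracy on the held-out test set
```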
What are the weaknesses of logistic regression?
Logistic regression assumes a linear relationship between the input features and the log-odds of the target variable. It can struggle with non-linear data or datasets with overlapping classes.
What is the formula for logistic regression accuracy?
Accuracy = (True positives + True negatives) / sample size
What is the perceptron algorithm?
The perceptron is a linear classifier that updates its weights based on classification errors. It iteratively adjusts weights to find a hyperplane (e.g. a line separating the data points if they were graphed) that separates the classes.
How is a perceptron trained?
Initialize weights (w) to zero or small random values.
For each input (x), calculate the predicted output (y_pred) using a step function.
Update weights if there is a misclassification.
Repeat until convergence or for a fixed number of iterations.
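A from-scratch sketch of this loop; the AND-gate toy data and the 0/1 step function are assumptions made for illustration:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron training loop for labels y in {0, 1}, following the steps above."""
    w = np.zeros(X.shape[1])   # initialise weights to zero
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_pred = 1 if np.dot(w, xi) + b > 0 else 0  # step function
            error = yi - y_pred
            if error != 0:                              # update only on misclassifications
                w += lr * error * xi
                b += lr * error
    return w, b

# Linearly separable toy data (an AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))
```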
What are the limitations of the perceptron?
The perceptron can only solve problems where the data is linearly separable. It fails for non-linear datasets.
What is the multi-layer perceptron (MLP)?
MLP is a type of neural network that can model non-linear relationships. It consists of multiple layers of neurons: input layer, one or more hidden layers, and an output layer. It uses activation functions to introduce non-linearity.
How does the MLP overcome the limitations of perceptrons?
MLP uses hidden layers and non-linear activation functions (e.g., ReLU, sigmoid) to model complex, non-linear relationships in data. This allows it to solve problems like XOR.
What is the activation function in an MLP?
An activation function introduces non-linearity into the model, enabling it to learn complex patterns. Common activation functions include sigmoid, tanh, and ReLU.
What is the backpropagation algorithm?
Backpropagation is an algorithm used to train neural networks. It calculates the gradient of the loss function with respect to each weight using the chain rule and updates the weights to minimize the error.
What are the strengths of MLPs?
MLPs are highly versatile and can model complex patterns. They are suitable for a variety of tasks, including classification, regression, and even unsupervised learning when paired with autoencoders.
What are the common optimization techniques used in MLP training?
Common techniques include stochastic gradient descent (SGD), Adam optimizer, and learning rate schedules. These methods improve convergence and training efficiency.
What are some practical considerations when using MLPs?
Choose the number of layers and neurons carefully to avoid overfitting or underfitting.
Use regularization techniques like dropout or L2 regularization.
Scale input data for faster convergence.
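A minimal scikit-learn sketch combining these points (scaled inputs, one small hidden layer, and L2 regularisation via the alpha parameter) on a non-linear toy problem:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A non-linearly-separable toy problem
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),  # scale inputs for faster convergence
    MLPClassifier(hidden_layer_sizes=(20,), activation="relu",
                  alpha=1e-3, max_iter=2000, random_state=0),
)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```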
What is the importance of the perceptron in the history of AI?
The perceptron is one of the first machine learning algorithms and laid the foundation for modern neural networks. It demonstrated the potential of learning machines but also highlighted the need for multi-layer architectures to handle non-linear problems.
How can we use a cost function to help deal with class imbalance inaccuracies?
Adjust the cost function by assigning higher weights to minority classes and lower weights to majority classes. This penalizes the model more heavily for misclassifying the minority classes.
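In scikit-learn this kind of cost-function weighting can be sketched with the class_weight option (synthetic imbalanced data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" weights the cost inversely to class frequency, so
# misclassifying the minority class is penalised more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```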
What is gradient descent, and why is it used in machine learning?
Gradient descent is an iterative optimization algorithm used to minimize a loss function (error) by adjusting model parameters. It calculates the gradient of the loss function with respect to the parameters and moves in the opposite direction of the gradient to reduce the error
What are the key steps in the gradient descent algorithm?
Start at a random point in parameter space.
Calculate the loss at the current point.
Take a step in the direction of steepest descent (i.e. opposite to the gradient).
Recalculate the loss at the new point.
Repeat until the loss is below a threshold or a maximum number of steps is reached.
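A minimal sketch of these steps on a one-dimensional loss L(w) = (w - 3)^2, whose gradient is 2(w - 3) and whose minimum is at w = 3; the starting point and learning rate are illustrative:

```python
# Gradient descent on L(w) = (w - 3)^2, gradient dL/dw = 2 * (w - 3)
w = 10.0               # start at an arbitrary point
learning_rate = 0.1
for step in range(100):
    grad = 2 * (w - 3)
    w -= learning_rate * grad      # step opposite to the gradient
    if abs(grad) < 1e-6:           # stop once the loss has flattened out
        break
print(w)  # approaches 3.0
```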
What is the difference between stochastic and deterministic parameter-fitting methods?
Deterministic: The algorithm moves in a specific direction calculated by the gradient. It has no randomness.
Stochastic: The algorithm includes randomness by picking a nearby random point and checking the error iteratively.
What is the L1 norm, and how is it related to optimization methods?
The L1 norm (Manhattan distance) is the sum of the absolute differences between predicted and actual values. Minimizing the L1 norm promotes sparsity in model parameters and is often used in optimization methods like Lasso regression to encourage simpler models by driving some coefficients to zero.
What is the L2 norm, and how is it related to gradient descent?
The L2 norm (Euclidean distance) is a measure of the error between predicted values and actual values. Minimizing the L2 norm (squared error) is a common objective in gradient descent to optimize model parameters.
What is the role of the gradient in gradient descent?
The gradient represents the direction and rate of the steepest increase of the loss function. Gradient descent takes steps in the opposite direction of the gradient to minimize the loss function.
In one-dimensional parameter space, how is the optimal point found using gradient descent?
In 1D, the optimal point occurs where the derivative of the loss function with respect to the parameter equals zero: dL/dw = 0.
How does gradient descent generalize to two or more dimensions
In higher dimensions, the gradient consists of partial derivatives with respect to each parameter. Gradient descent finds the direction of steepest descent by combining all partial derivatives into a gradient vector.
What is a confusion matrix, and what is it used for?
A confusion matrix is a table used to evaluate the performance of a classification model. It displays the counts of true positives, true negatives, false positives, and false negatives.
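A minimal scikit-learn sketch with made-up labels; note that scikit-learn's convention puts actual classes on the rows and predicted classes on the columns:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[3 1]   -> 3 true negatives, 1 false positive
#  [1 3]]  -> 1 false negative, 3 true positives
```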
Define precision in the context of a confusion matrix.
Precision (Positive Predictive Value) is the proportion of correctly predicted positive observations out of all observations predicted as positive:
Formula:
Precision = True Positives / (True Positives + False Positives)
Define recall (sensitivity) in the context of a confusion matrix.
Recall (Sensitivity) is the proportion of actual positive observations correctly classified as positive:
Formula:
Recall = True Positives / (True Positives + False Negatives)
What is the trade-off between precision and recall?
A high recall indicates fewer false negatives but may include more false positives (lower precision). Conversely, high precision minimizes false positives but may miss true positives, lowering recall.
What is the F1 score, and why is it useful?
The F1 score is the harmonic mean of precision and recall. It is useful for balancing the trade-off between precision and recall
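Using the same made-up labels as in the confusion matrix sketch above, all three metrics can be computed directly with scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```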
What is the ROC curve, and what does it represent?
The ROC curve (Receiver Operating Characteristic curve) shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the decision threshold varies.
What is sensitivity and specificity in binary classification?
Sensitivity (Recall): Proportion of true positives correctly identified.
Specificity: Proportion of true negatives correctly identified.
What does the Area Under the Curve (AUC) represent in the ROC curve?
The AUC measures the overall performance of a classifier by quantifying the area under the ROC curve. A higher AUC indicates better classifier performance.
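A minimal scikit-learn sketch on synthetic data; the ROC curve needs predicted probabilities or scores rather than hard class labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]          # scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points of the ROC curve
print(roc_auc_score(y_test, y_prob))              # area under the ROC curve
```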
How is grid search used for parameter fitting?
Grid search involves exhaustively trying all combinations of predefined parameter values to identify the combination that minimizes error or maximizes accuracy.
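A minimal GridSearchCV sketch on synthetic data; the parameter grid is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Exhaustively try every combination in param_grid using 5-fold cross validation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```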
How can imbalanced datasets affect classifier performance, and how can this issue be addressed?
Imbalanced datasets bias the model towards the majority class. Solutions include:
Sampling techniques (oversampling minority class, undersampling majority class).
Using metrics like F1 score or ROC-AUC that account for class imbalance.
What happens when the classifier threshold is set very low?
Setting a low threshold increases sensitivity (high true positive rate) but reduces specificity, leading to more false positives.
What happens when the classifier threshold is set very high?
Setting a high threshold increases specificity (low false positive rate) but reduces sensitivity, leading to more false negatives.
Lecture 9
What is Natural Language Processing (NLP)?
NLP is a field of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language. It often involves tasks such as text similarity, sentiment analysis, topic extraction, summarization, question answering, relationship extraction, and language generation.
What is spam detection and how is it performed?
Spam detection is a classification problem where the input (X) is email text and the output (y) is whether the email is spam or not. A typical approach involves train-test splitting and identifying words indicative of spam. For example, phrases like “send us your password” are strong spam indicators.
How does Bayesian spam detection work?
Bayesian spam detection calculates the probability of a phrase being spam based on the likelihood of each word occurring in spam emails. For example, P(“send us your password”|spam) = P(“send”|spam) × P(“us”|spam) × P(“your”|spam) × P(“password”|spam). Each word is treated independently.
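A toy sketch of that per-word product; all the probabilities are made-up numbers purely for illustration:

```python
# Made-up word-level spam probabilities P(word | spam)
p_word_given_spam = {"send": 0.3, "us": 0.2, "your": 0.25, "password": 0.4}

phrase = "send us your password"
p_phrase_given_spam = 1.0
for word in phrase.split():
    # Each word is treated independently; unseen words get a small default
    p_phrase_given_spam *= p_word_given_spam.get(word, 0.01)
print(p_phrase_given_spam)
```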
What are tokenization, stemming, and lemmatization, and why are they important?
Tokenization: Breaks a sentence into individual words or tokens.
Stemming: Reduces words to their root forms (e.g., “running” becomes “run”).
Lemmatization: Reduces words to their base or dictionary forms (e.g., “better” becomes “good”).
These methods group variations of words together to standardize text for analysis.
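A minimal sketch of all three using NLTK, assuming the library is installed and its punkt and wordnet resources are available:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")    # tokenizer models
nltk.download("wordnet")  # lemmatizer dictionary

text = "The children were running faster than the dogs"
tokens = nltk.word_tokenize(text)                            # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]            # stemming: "running" -> "run"
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # lemmatization: "children" -> "child"
print(tokens, stems, lemmas, sep="\n")
```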