Clustering Flashcards
What is generalisation in machine learning?
Generalisation in machine learning refers to the ability of a trained model to make accurate predictions on unseen data, i.e. data that the model has not encountered during training
Why is evaluating model performance on the training data problematic?
Evaluating model performance on the training data is problematic because it can lead to overfitting, where the model becomes too complex and adapts too well to the training data, resulting in poor performance on new, unseen data
What is bias in machine learning?
Bias refers to the difference between the model’s predictions and the true values or measurements. A model with high bias tends to underfit the data, meaning it is not complex enough to capture patterns in the data
What is variance in machine learning?
Variance refers to the variability or spread of the model’s predictions around the true values or measurements. A model with high variance tends to overfit the data, meaning it is too complex and captures noise in the training data
How do bias and variance affect model performance?
Bias and variance affect model performance by creating a trade-off between underfitting and overfitting. Models with high bias tend to underfit the data and have poor performance on both the training and test data, while models with high variance tend to overfit the data and have excellent performance on the training data but poor performance on the test data. Therefore, it is essential to strike a balance between bias and variance to achieve optimal model performance
What is model complexity in machine learning?
Model complexity refers to the level of sophistication or intricacy of the model in capturing the patterns or relationships in the data. A more complex model may have more parameters or features and represent more complex functions, while a simpler model has fewer parameters and features and represents simpler functions
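As a rough illustration (not part of the original cards), the sketch below fits polynomials of increasing degree to noisy data; the degree stands in for model complexity, and comparing training and test error shows the underfitting and overfitting regimes described above. The data, noise level, and degrees are arbitrary assumptions.

```python
# Illustrative sketch: polynomial degree as a proxy for model complexity.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 12)
x_test = np.linspace(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.shape)
y_test = np.sin(2 * np.pi * x_test)          # noise-free ground truth

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)              # fit a polynomial of this complexity
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")

# Degree 1 tends to underfit (high bias), a high degree tends to chase the
# noise (high variance), and a moderate degree balances the two.
```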
What are typical percentages of training data and test data used for models?
80% training data and 20% test data
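A minimal sketch of this 80/20 split, assuming scikit-learn is available; X and y are placeholder data used only for illustration.

```python
# Minimal sketch of an 80/20 train/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features (dummy data)
y = np.arange(50)                   # dummy targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42   # hold out 20% of the data for testing
)
print(len(X_train), len(X_test))    # 40 training samples, 10 test samples
```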
What is unsupervised learning in machine learning?
Unsupervised learning is a type of machine learning where the data is unlabelled and untagged, and the goal is to find patterns or structure in the data without any guidance or supervision.
What is dimensionality reduction in unsupervised learning?
Dimensionality reduction is a technique in unsupervised learning that reduces the number of features or variables in the data while preserving most of the relevant information. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are examples of dimensionality reduction techniques.
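A minimal sketch of PCA with scikit-learn, reducing ten illustrative features to two components; the random data is purely a stand-in.

```python
# Minimal sketch of dimensionality reduction with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples with 10 features

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```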
What are autoencoders in unsupervised learning?
Autoencoders are neural networks used in unsupervised learning to learn compressed representations of the data by encoding the data into a lower-dimensional space and decoding it back to its original form. Autoencoders can be used for dimensionality reduction, data compression, and image denoising.
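A minimal sketch of an autoencoder in PyTorch; the 784-dimensional inputs, layer sizes, and tiny training loop are illustrative assumptions rather than a prescribed architecture.

```python
# Minimal sketch: encode inputs into a 32-dimensional code and decode them back.
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32))          # compressed representation
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784))         # reconstruction

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                  # a dummy batch standing in for real data
for _ in range(5):                       # a few reconstruction steps
    optimiser.zero_grad()
    loss = loss_fn(model(x), x)          # compare the reconstruction with the input
    loss.backward()
    optimiser.step()
```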
What is clustering in unsupervised learning?
Clustering is a technique in unsupervised learning used to group a set of objects into clusters based on their similarities or dissimilarities. Exclusive or non-overlapping clustering techniques assign each object to only one cluster, while overlapping clustering techniques allow objects to belong to multiple clusters. Hierarchical and probabilistic clustering are examples of clustering techniques.
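A minimal sketch of exclusive (non-overlapping) clustering with k-means in scikit-learn, using toy data generated by make_blobs.

```python
# Minimal sketch of k-means clustering on toy data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)           # each point is assigned to exactly one cluster

print(labels[:10])                       # cluster index of the first ten points
print(kmeans.cluster_centers_)           # coordinates of the three cluster centres
```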
What are association rules in unsupervised learning?
Association rules are used in unsupervised learning to discover interesting relationships or patterns in the data. They are used in market basket analysis to find correlations between items purchased together and recommend items to customers based on their purchase history.
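A minimal sketch of the support and confidence measures behind association rules, computed directly on a toy set of market baskets; no association-rule library is assumed.

```python
# Minimal sketch of support and confidence for a market-basket rule.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: customers who buy bread also buy milk
print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # 2/3, roughly 0.67
```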
What is the notion of distance in machine learning?
The notion of distance in machine learning is used to measure the dissimilarity or similarity between objects based on their features or characteristics. The distance can be measured using various metrics, such as Euclidean distance, Manhattan distance, or cosine similarity.
How do we use distance to differentiate objects?
We use distance to differentiate objects by measuring the differences in their features or characteristics. For instance, an orange and a lime are both round, but the lime is smaller, so we can measure the distance between them as the difference in their radii, in centimeters for example.
How can we represent a binary variable using distance?
We can represent a binary variable using distance by encoding the variable as 0 when a property is absent and 1 when it is present; the distance between two objects is then 0 if they share the same value and 1 if they differ. For instance, a pepper is not hollow (0), while a bell pepper is hollow (1), so the distance between them on this variable is 1.
How can we represent a continuous variable using distance?
We can represent a continuous variable using distance by measuring the differences between the values of the variable for different objects. For instance, we can measure the volume of empty space inside a pepper and assign a distance based on the differences between the volumes for different peppers.
What are some examples of distance metrics used in machine learning?
Examples of distance metrics used in machine learning include Euclidean distance, Manhattan distance, cosine similarity, and Jaccard distance. These metrics measure the distance or dissimilarity between objects based on their features or characteristics.
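A minimal sketch of these metrics using SciPy's distance helpers on two small feature vectors; the vectors are arbitrary examples.

```python
# Minimal sketch of common distance metrics.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(a, b))   # straight-line distance
print(distance.cityblock(a, b))   # Manhattan distance (sum of absolute differences)
print(distance.cosine(a, b))      # 1 - cosine similarity; 0.0 here because the vectors are parallel
print(distance.jaccard([1, 0, 1], [1, 1, 0]))   # dissimilarity between two binary vectors
```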
What are vector operations in machine learning?
Vector operations in machine learning involve performing mathematical operations on vectors, which are arrays of numbers or values.
What is the basis for conducting vector operations?
Vector operations are conducted on a component-by-component basis. This means that each component or element of the vectors is treated independently and the operations are performed on them separately.
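A minimal sketch of component-by-component vector operations with NumPy; the two vectors are arbitrary examples.

```python
# Minimal sketch of element-wise vector operations.
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)         # [5. 7. 9.]    element-wise addition
print(u - v)         # [-3. -3. -3.] element-wise subtraction
print(u * v)         # [ 4. 10. 18.] element-wise multiplication
print(np.dot(u, v))  # 32.0          dot product: multiply component-wise, then sum
```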