Exam Flashcards

Question

What is the difference between supervised and unsupervised learning?

Answer 1

In supervised learning, the algorithm is trained on a labeled dataset, where each input has a corresponding output. The goal is to learn a mapping from inputs to outputs. In unsupervised learning, the algorithm is given unlabeled data and must find patterns or relationships within the data without explicit guidance on the output.

Answer 2

Fitting a line to describe the relationship between variables. “The goal of linear regression is to find the best-fitting linear relationship that can be used for making predictions.” Main idea: if classes of points can be separated by a line, a linear model can also be used to classify data points. Linear regression itself is best suited for problems where the goal is to predict a continuous value.
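
As a minimal sketch of "finding the best-fitting line", the ordinary least-squares slope and intercept for one input variable can be computed directly. The function name fit_line and the size/price numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: apartment size vs. price, constructed as price = 2 * size + 5
sizes = [30, 50, 70, 90]
prices = [65, 105, 145, 185]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 60 + intercept  # continuous prediction for an unseen size
```

The fitted line is then used to predict a continuous value for inputs that were not in the data.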

Answer 3

An algorithm used for classification tasks. Much as linear regression fits a line, a support vector machine (SVM) draws a line, but here the line separates data objects from different groups. The line is called a decision boundary, and it is placed so that the distance between the boundary and the nearest data point from either group is as large as possible.

Answer 4

Difference from linear regression: linear regression deals with a continuous output, for example predicting prices, while an SVM predicts a categorical label (is this email spam or not?). An SVM is designed to find the optimal decision boundary to separate different classes.

Answer 5

Groups of data points, usually with similar values.

Answer 6

K-means clustering is an unsupervised learning algorithm that falls under descriptive modeling. It iteratively works towards finding the optimal cluster centers (centroids) for a specified number of clusters. Each data point belongs to the cluster defined by its closest centroid.
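
The iteration can be sketched as two alternating steps: assign each point to its closest centroid, then move each centroid to the mean of its points. This is a simplified illustration with hand-picked starting centroids and made-up points, not a production implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate the assignment step and the centroid update step."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups, k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
```

In practice the starting centroids are usually chosen at random, and the result can depend on that choice.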

Answer 7

Density-based spatial clustering (as in DBSCAN): groups data points that are packed closely together.

Answer 8

Hierarchical clustering is a type of clustering algorithm used in unsupervised machine learning to group similar items into clusters. The term "hierarchical" is used because the algorithm creates a hierarchy of clusters. This clustering technique builds a tree-like structure of clusters, known as a dendrogram, which visually represents the relationships and similarities between different data points.

Answer 9

A data mining technique that identifies interesting relationships, patterns, and associations among a set of items in large datasets. For example: which products are frequently purchased together?

Answer 10

Support: a measure of how frequently a set of items appears in the dataset.

Answer 11

Confidence: if the rule Beef, Chicken → Apple has a confidence of 33%, it means that when beef and chicken are bought together, there is a 33% chance that apples are also in the shopping cart.

Answer 12

Lift measures how good a rule is. If the lift is > 1, the rule is better than guessing. If the lift is = 1, the rule is as good as guessing, and if the lift is < 1, the rule is worse than guessing (the items appear together less often than expected by chance).
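
Support, confidence, and lift can be computed directly from a list of transactions. The sketch below uses made-up shopping carts, chosen so that the rule Beef, Chicken → Apple has a confidence of one third.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical shopping carts
carts = [
    {"beef", "chicken", "apple"},
    {"beef", "chicken"},
    {"beef", "chicken", "milk"},
    {"apple", "milk"},
    {"chicken", "apple"},
    {"milk"},
]
rule_support = support(carts, {"beef", "chicken"})                 # 3 of 6 carts
rule_confidence = confidence(carts, {"beef", "chicken"}, {"apple"})
rule_lift = lift(carts, {"beef", "chicken"}, {"apple"})
```

Here apples are in half of all carts but in only a third of the beef-and-chicken carts, so the lift comes out below 1.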

Answer 13

Predictive

Answer 14

Descriptive

Answer 15

Nominal Attributes: Nominal attributes are categorical variables that represent different categories or groups with no inherent order or ranking among them. Nominal data can be represented by labels or names, and mathematical operations like addition or subtraction are not meaningful in this context. Ordinal Attributes: Ordinal attributes, on the other hand, represent categories with a clear order or ranking, but the intervals between the categories may not be uniform or meaningful. While there is a meaningful order among the categories, the differences between them may not be consistent.

Answer 16

Interval: the distance between each step is the same size, but there is no absolute zero; zero is arbitrary (0 °C is not the absence of temperature). Ratio: like interval, but zero is meaningful and indicates the absence of the property.

Answer 17

Cross-validation is a technique used to assess the performance of a predictive model. In a typical ML workflow the dataset is divided once into a training set and a test set. Cross-validation instead divides the data into several subsets (folds) and uses each subset in turn as the test set while the model is trained on the rest.
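
A minimal sketch of how k-fold cross-validation partitions the data: every sample lands in the test set exactly once. The helper name k_fold_splits is illustrative, and shuffling is left out to keep the example short.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 5))  # 5 folds over 10 samples
```

A model would be trained and scored once per split, and the scores averaged to give the performance estimate.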

Answer 18

The advantage of using cross-validation is that it is a more reliable way of estimating a model's performance. It provides a more accurate assessment of how well a model will do on unseen data. It also helps detect overfitting, which is when a model performs well on seen data and poorly on unseen data.

Answer 19

The training data is used to train the model; during training the model learns patterns, relationships and attributes. Predictions are made iteratively on the training data, and the difference between each prediction and the actual outcome is used to update the model's parameters (optimization). Training data is typically 70-80% of the dataset.
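
One way to picture the update loop is gradient descent on a one-variable linear model. This is a sketch only: the learning rate, the epoch count, and the data are assumptions chosen so that the loop converges.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by repeatedly correcting w and b from the prediction errors."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between each prediction and the actual outcome.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Move the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear_model(xs, ys)
```

After training, w and b end up close to the values used to generate the data.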

Answer 20

The test set is used to evaluate the performance of the trained model on new, unseen data. It simulates how the model will perform when applied to real-world scenarios.

Answer 21

The objective is to ensure that the proportion between classes is maintained correctly in different subsets of the data. This is done to prevent biases. For example, in cross-validation each subset that is created aims to have the same class proportions as the whole dataset. The same is true when splitting the data into training and test sets.
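
A stratified train/test split can be sketched by splitting each class separately and then combining the pieces. The function name, the 25% test fraction, and the spam/ham labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) that preserve class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Hypothetical imbalanced labels: 25% spam, 75% ham
labels = ["spam"] * 4 + ["ham"] * 12
train, test = stratified_split(labels)
spam_in_test = sum(labels[i] == "spam" for i in test)
```

With these labels the test set gets one spam and three ham examples, the same 1:3 ratio as the full dataset.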

Answer 22

Re-weighting adjusts the weights assigned to different instances or classes in a dataset. This is done to address class imbalances. It makes underrepresented classes more influential during learning and helps the model prioritize learning patterns from the minority class.
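
One common heuristic, used here as an assumption and not taken from the card, is to weight each class inversely to its frequency: n_samples / (n_classes * class_count).

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to how often it occurs."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical imbalanced dataset: 90 ham, 10 spam
labels = ["ham"] * 90 + ["spam"] * 10
weights = class_weights(labels)
```

The minority class gets the larger weight, so an error on a spam example counts for more during training than an error on a ham example.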

Answer 23

The idea behind Support Vector Regression (SVR) is to extend the concept of SVM to predict continuous outcomes rather than handle classification problems. The goal of SVR is the same as that of linear regression: to predict a continuous value.

Answer 24

Linear regression assumes a straight-line relationship, whereas SVR is more flexible and, with a non-linear kernel, can capture non-linear relationships. SVR is also less sensitive to outliers, that is, data points that deviate significantly from the overall pattern.

Answer 25

Linear regression is computationally less expensive. If real-time prediction is needed on a large dataset, linear regression might be preferable. It is easy to implement and interpret, and it works well if the data is not too noisy, is truly linear, or has few features.

Answer 26

The treatment group is a subset of individuals exposed to something whose effect we want to measure, such as a new interface or a new drug. The goal is to observe the impact of this "treatment". The control group serves as a comparison baseline: a subset of individuals who are not exposed to the "treatment". Comparing the results of the two groups makes it possible to infer causality.

Answer 27

An algorithm that can be used for both regression and classification.

Answer 28

In k-NN classification, the goal is to assign an object to a specific group or category. Imagine the object asking its nearby neighbours for advice on which group it belongs to. The "k" in k-NN represents how many neighbours it asks. The object then joins the group that the majority of its closest neighbours belong to. If it only asks one neighbour (k = 1), it simply joins the same group as that single closest neighbour.
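
The "ask the neighbours" idea can be sketched in a few lines. Euclidean distance and the toy points below are assumptions for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train is a list of (point, label) pairs; return the majority label of the k nearest."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points in two groups
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
         ((6, 6), "B"), ((7, 7), "B")]
label_k3 = knn_classify(train, (2, 1), k=3)  # three nearest neighbours vote
label_k1 = knn_classify(train, (6, 7), k=1)  # copies the single closest neighbour
```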

Answer 29

In a machine learning context, "distance" generally refers to a measure of dissimilarity or similarity between two data points in a feature space. The idea is to quantify how far apart or close together two instances are in the context of the given features.

Answer 30

1. Reduces variation: stemming groups together variations of the same word. This is its main objective. 2. Simplifies analysis: it simplifies the analysis of text data by focusing on the core meaning of words. 3. Computational efficiency: stemming can improve computational efficiency, since the reduced dimensionality makes subsequent processing faster.

Answer 31

1. Over-Stemming and Under-Stemming: Stemming algorithms may sometimes over-stem (remove too many letters, leading to loss of meaning) or under-stem (leave too many letters, failing to reduce related words to the same stem). 2. Loss of Interpretability: The stemmed words may not be easily interpretable, making it challenging to understand the original context. 3. Language Dependence: Stemming algorithms are language-dependent, and the effectiveness may vary across different languages.
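
A deliberately naive suffix stripper makes both the benefit and the pitfalls of stemming visible. The suffix list and the three-character minimum are made-up rules for this sketch and are far simpler than a real stemmer.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule set, for illustration only

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

grouped = {toy_stem(w) for w in ["jumping", "jumped", "jumps"]}  # variations collapse
over_stemmed = toy_stem("news")      # collides with the unrelated word "new"
under_stemmed = toy_stem("running")  # does not reduce to the stem of "run"
```

The three forms of "jump" collapse to one stem, "news" is over-stemmed to "new", and "running" is under-stemmed to "runn", so it no longer matches "run".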

Answer 32

Mean and median imputation are methods used to replace missing values in a dataset with the mean or median of the observed values for that variable. These techniques are commonly employed in data preprocessing to handle missing data, ensuring that the dataset remains suitable for analysis.
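
A minimal sketch of both strategies, with None standing in for a missing value. The function name and the ages are made up.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages with one missing entry and one extreme value
ages = [20, 22, None, 24, 90]
mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
```

The extreme value 90 pulls the mean fill up to 39, while the median fill is 23, which is why the median is often preferred for skewed data.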

Answer 33

Mean Absolute Error (MAE): the average absolute difference between the prediction and the actual outcome.

Answer 34

Mean Squared Error (MSE): take the difference between the actual and predicted values for each data point, square each of these differences, then take the average of the squared differences.
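
Both error metrics follow directly from their definitions. The values below are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Difference per data point, squared, then averaged."""
    differences = [a - p for a, p in zip(actual, predicted)]
    squared = [d ** 2 for d in differences]
    return sum(squared) / len(squared)

# Hypothetical values; the last prediction is off by 4
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
```

Here the MAE is 1.5 and the MSE is 4.5: the single large error dominates the MSE because it is squared.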

Answer 35

TP / (TP + FN)

Answer 36

TP / (TP + FP)

Answer 37

2 * (Recall * precision) / (Recall + precision)
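
The three formulas above, TP / (TP + FN) for recall, TP / (TP + FP) for precision, and their harmonic mean for the F1 score, can be computed from two label lists. The sketch assumes at least one true positive, so no division by zero occurs, and the labels are made up.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Count TP, FP and FN for the positive class, then apply the formulas."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Hypothetical labels: TP = 2, FP = 1, FN = 2
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
```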

Answer 38

Manhattan distance: abs(x1 - x2) + abs(y1 - y2)
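
As a small sketch, the Manhattan distance generalizes to any number of dimensions by summing the absolute differences per coordinate. The Euclidean distance is shown only for comparison, and the two points are made up.

```python
import math

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, -2)
manhattan = manhattan_distance(p, q)  # |1 - 4| + |2 - (-2)|
euclidean = math.dist(p, q)           # straight-line distance, for comparison
```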

Answer 39

Decision Trees (DTs) are a predictive supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Answer 40

Mean Squared Error (MSE) and Mean Absolute Error (MAE) are commonly used metrics for evaluating the performance of regression models. These metrics are useful when the goal is to predict a continuous numerical value, as opposed to a classification task. MAE is less sensitive to outliers than MSE because it doesn't square the differences.

Answer 41

Compute the distances from the new data point to all other points and choose the k shortest. If the votes are tied, e.g. [yes, no, yes, no], calculate the average distance for "yes" and for "no", and assign the point to the class with the shorter average distance.
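
Reading this card as a tie-break rule for k-NN, a sketch might look as follows. The function name, the Euclidean distance, and the points are assumptions; the average-distance rule is applied only among the classes tied for the most votes.

```python
import math
from collections import defaultdict

def knn_with_tiebreak(train, query, k):
    """Majority vote among the k nearest; ties go to the closer class on average."""
    nearest = sorted((math.dist(point, query), label)
                     for point, label in train)[:k]
    distances = defaultdict(list)
    for distance, label in nearest:
        distances[label].append(distance)
    most_votes = max(len(d) for d in distances.values())
    tied = [label for label, d in distances.items() if len(d) == most_votes]
    # With a single winner this returns it; otherwise the smallest mean distance wins.
    return min(tied, key=lambda label: sum(distances[label]) / len(distances[label]))

# Hypothetical points: with k = 4 the vote is [yes, no, no, yes], a 2-2 tie
train = [((0, 1), "yes"), ((0, 4), "yes"),
         ((2, 0), "no"), ((0, -2), "no"), ((9, 9), "no")]
result = knn_with_tiebreak(train, (0, 0), k=4)
```

In this example the "yes" neighbours are at distances 1 and 4 (average 2.5) and the "no" neighbours at 2 and 2 (average 2.0), so the tie goes to "no".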

Answer 42

Remove if missing.