Machine Learning Flashcards
Cosine Similarity
Measures the cosine of the angle between two vectors to determine how similar two items are, independent of the vectors' magnitudes.
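A minimal sketch of the computation in Python, assuming NumPy and two made-up example vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing in the same direction score ~1.0 regardless of length.
print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))
```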
Manhattan Distance
Calculates the distance between points in a grid-based layout as the sum of the absolute differences of their Cartesian coordinates.
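A quick sketch, assuming NumPy and two made-up grid points:

```python
import numpy as np

def manhattan_distance(p, q):
    # Sum of the absolute coordinate differences (the L1 norm of p - q).
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

print(manhattan_distance([1, 2], [4, 6]))  # |1-4| + |2-6| = 7
```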
Jaccard Similarity
Compares the similarity and diversity of sample sets, calculating the size of the intersection divided by the size of the union of the sets.
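A small sketch using Python sets and made-up example data:

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    # |A ∩ B| / |A ∪ B|; defined here as 0 when both sets are empty.
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard_similarity({"apple", "banana", "cherry"}, {"banana", "cherry", "date"}))  # 2 / 4 = 0.5
```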
Spearman’s Rank Correlation
A measure of rank correlation that assesses how well the relationship between two variables can be described using a monotonic function.
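A short sketch using SciPy's spearmanr on two made-up rankings:

```python
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
# Spearman's rho is the Pearson correlation computed on the ranks of the data.
rho, p_value = spearmanr(x, y)
print(rho)  # 0.8 for these example rankings
```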
K-Nearest Neighbors (KNN)
A classification algorithm that stores all cases and classifies new cases based on a majority vote of its k nearest neighbors.
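A toy sketch using scikit-learn (assumed available) and made-up 2-D points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters of labeled training points.
X_train = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Each query point is assigned the majority label of its 3 nearest neighbors.
print(knn.predict([[1, 2], [9, 8]]))  # expected: [0 1]
```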
Matrix Factorization
A collaborative filtering technique using decompositions like SVD to predict missing entries in a user-item interaction matrix.
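A simplified sketch using a plain truncated SVD on a made-up ratings matrix; real systems typically fit a regularized factorization on observed entries only rather than treating missing ratings as zeros:

```python
import numpy as np

# Toy user-item ratings matrix (0 marks a missing rating).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Rank-2 truncated SVD gives a low-rank approximation of R.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_hat, 2))  # values in formerly empty cells serve as rough predictions
```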
Content-Based Filtering
Recommends items based on their similarity to items previously liked by the user, using the features of the items themselves.
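A minimal sketch with scikit-learn, using TF-IDF features over made-up item descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions; suppose the user liked item 0.
items = [
    "action sci-fi space adventure",
    "romantic comedy in paris",
    "sci-fi thriller about space travel",
]
tfidf = TfidfVectorizer().fit_transform(items)
scores = cosine_similarity(tfidf[0], tfidf).ravel()
print(scores.argsort()[::-1])  # items ranked by similarity to the liked item (item 0 itself first)
```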
Cold Start Problem
A challenge in recommendation systems where there is insufficient data on new users or items to make accurate recommendations.
Item-to-Item Collaborative Filtering
A form of collaborative filtering based on calculating the similarity between items using ratings given by users.
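A bare-bones sketch on a made-up ratings matrix, comparing item columns with cosine similarity:

```python
import numpy as np

# Rows are users, columns are items (0 = not rated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def item_cosine(i, j):
    a, b = R[:, i], R[:, j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(item_cosine(0, 1), 2))  # high: items 0 and 1 are rated alike
print(round(item_cosine(0, 2), 2))  # low: items 0 and 2 attract different users
```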
Hamming Distance
Measures the distance between two strings of equal length by counting the number of positions at which the corresponding symbols differ.
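A short sketch in plain Python with an example pair of strings:

```python
def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("strings must be of equal length")
    # Count the positions where the corresponding symbols differ.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))  # 3
```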
Supervised Learning
A type of machine learning where the model is trained on a labeled dataset, learning to predict the output from the input data.
Unsupervised Learning
Learning from data that has not been labeled, categorized, or classified, aiming to identify significant patterns.
Regression
A statistical method used in machine learning for predicting continuous numerical outcomes from input variables, based on patterns in previous data.
Classification
A process in machine learning for categorizing data into predefined classes or categories.
Decision Trees
A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes.
Random Forest
An ensemble learning method for classification, regression, and other tasks that operates by constructing multiple decision trees at training time.
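A minimal sketch with scikit-learn (assumed available) on its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# 100 trees, each trained on a bootstrap sample of the data;
# predictions are aggregated across the trees (majority vote for classification).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))  # class predictions for the first three samples
```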
Neural Networks
Computing systems vaguely inspired by the biological neural networks that constitute animal brains, capable of pattern recognition and data classification.
Gradient Descent
An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
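A tiny worked example minimizing a one-dimensional quadratic, with a made-up learning rate:

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x = 0.0    # starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    grad = 2 * (x - 3)
    x -= lr * grad  # step in the direction of the negative gradient
print(round(x, 4))  # converges toward the minimizer x = 3
```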
Overfitting
A modeling error in machine learning where a function is too closely fitted to a limited set of data points and fails to generalize to new data.
Cross-Validation
A technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is commonly used when the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice.
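A minimal sketch of k-fold cross-validation with scikit-learn (assumed available):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on four folds, score on the held-out fold, repeat five times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average held-out accuracy
```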
Collaborative Filtering
Collaborative filtering is a technique used in recommendation systems to predict a user's preferences by collecting preference or taste information from many users. The underlying assumption is that if person A shares person B's opinion on one issue, A is more likely to share B's opinion on a different issue than to share the opinion of a randomly chosen person.
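A stripped-down sketch of the user-based idea on a made-up ratings matrix:

```python
import numpy as np

# Rows are users, columns are items (0 = unrated).
R = np.array([
    [5, 4, 1, 0],
    [5, 5, 2, 1],
    [1, 0, 5, 4],
], dtype=float)

def user_similarity(u, v):
    a, b = R[u], R[v]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# User 0 agrees with user 1 far more than with user 2, so user 1's opinion
# of the last (unrated) item is the better guess for user 0.
print(round(user_similarity(0, 1), 2), round(user_similarity(0, 2), 2))
```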
Pearson Correlation
Pearson correlation measures the linear relationship between two variables, providing a value between -1 and 1. A score of 1 means a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 means no linear correlation. It is commonly used in statistics to assess the strength and direction of the relationship between two continuous variables.
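A quick check with NumPy on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]
print(r)  # 1.0: a perfect positive linear relationship
```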
Euclidean Distance
Euclidean distance is the “straight-line” distance between two points in Euclidean space. In terms of data points, it represents the geometric distance in multidimensional space, calculated using the Pythagorean theorem. It is often used in clustering and classification to determine how similar or dissimilar data points are to each other.
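A short sketch with NumPy and an example pair of points:

```python
import numpy as np

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences (Pythagorean theorem).
    return np.sqrt(np.sum((np.asarray(p, float) - np.asarray(q, float)) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```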