Data Science Algorithms Flashcards
Linear Regression
Linear regression is a statistical method that is used to understand the relationship between two continuous variables. It assumes a linear relationship between the input variables (X) and a single output variable (Y).
Logistic Regression
Logistic regression is a classification algorithm used when the response variable is categorical. It models the probability that each input belongs to a particular category.
Decision Trees
Decision trees split the data into multiple sets based on different conditions. They are used in both classification and regression tasks and are easy to understand and interpret.
Random Forest
A Random Forest is an ensemble technique that uses many decision trees. It provides a robust prediction by averaging the predictions of individual decision trees.
Naive Bayes
This is a probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
K-Nearest Neighbors (KNN)
KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).
Support Vector Machines (SVM)
SVMs are a set of supervised learning methods used for classification and regression. They aim to find a hyperplane in an N-dimensional space that distinctly classifies the data points.
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
K-Means Clustering
K-Means is an unsupervised learning algorithm used to partition a dataset into K clusters. Each observation belongs to the cluster with the nearest mean.
Hierarchical Clustering
This is another unsupervised learning algorithm that is used to group together the objects that are similar to each other and dissimilar to the objects belonging to another cluster.
Gradient Boosting Algorithms
These are machine learning techniques for regression and classification problems, which produce a prediction model in the form of an ensemble of weak prediction models.
Deep Learning Algorithms
These are a set of algorithms that use artificial neural networks with multiple layers between the input and output. They can model complex non-linear relationships and are particularly powerful for large-scale and high-dimensional data.
Algorithmic Process
At its core, an algorithm is a step-by-step procedure or set of rules to be followed in calculations or other problem-solving operations. It’s the logic behind the analysis and how the objective will be achieved.
Data Inputs
These are the datasets that the algorithm will use. The quality and nature of the data can significantly affect the outcome. This can include structured data (like numerical data in databases) or unstructured data (like text or images).
Data Processing
The algorithm will usually need to clean and preprocess the data. This can involve dealing with missing values, outliers, or scaling the data to make it suitable for the algorithm to process.