Machine Learning Flashcards
Keyword/Topic
Definition
Encoding (One-Hot, Bag of Words)
Encoding techniques used to convert categorical data or text into numerical format. One-hot encoding represents each category as a binary vector. Bag of words represents text as a count of word occurrences.
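A minimal sketch, assuming a recent scikit-learn is installed; OneHotEncoder and CountVectorizer are its standard implementations, and the toy data is hypothetical:

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# One-hot: each category becomes its own binary column.
colors = [["red"], ["green"], ["blue"], ["green"]]
onehot = OneHotEncoder().fit_transform(colors).toarray()

# Bag of words: each document becomes a vector of word counts.
docs = ["the cat sat", "the cat and the dog"]
vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()
print(vec.get_feature_names_out(), counts)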
Colour Palettes
Diverging Color Palettes:
Used to show data that has a meaningful midpoint (e.g., positive vs. negative values). Colors transition from one hue to a neutral midpoint and then to another hue.
Sequential Color Palettes:
Designed for ordered data, where colors transition gradually from light to dark (or vice versa) to represent increasing values.
Categorical Color Palettes:
Used for distinct, non-ordered categories. Each category is represented by a unique color to ensure clear differentiation.
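An illustrative sketch, assuming matplotlib is available; the built-in colormaps "RdBu", "viridis", and "tab10" correspond to the three palette types:

import numpy as np
import matplotlib.pyplot as plt

corr = np.random.uniform(-1, 1, (5, 5))         # data with a meaningful midpoint
plt.imshow(corr, cmap="RdBu", vmin=-1, vmax=1)  # diverging palette centered on 0
plt.colorbar()
plt.show()
# "viridis" (sequential) suits ordered magnitudes;
# "tab10" (categorical) suits unordered class labels.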
CSV and HDF5
CSV is a plain-text format storing tabular data, while HDF5 is a binary format designed for large datasets, supporting hierarchical data structures.
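A minimal pandas sketch (assuming pandas is installed, plus PyTables for HDF5; the file names are hypothetical):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df.to_csv("data.csv", index=False)           # plain text, human-readable
df.to_hdf("data.h5", key="table", mode="w")  # binary, hierarchical (needs PyTables)
back = pd.read_hdf("data.h5", key="table")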
Stevens’ Scales
Nominal:
Categories with no inherent order (e.g., eye color, nationality).
Ordinal:
Categories with a meaningful order, but the intervals between them are not necessarily equal (e.g., ratings from "poor" to "excellent").
Interval:
Ordered categories with equal intervals, but no true zero point (e.g., temperature in Celsius or Fahrenheit).
Ratio:
Ordered categories with equal intervals and a true zero point, allowing for meaningful ratios (e.g., height, weight, age).
Spearman vs Pearson
Spearman measures rank correlation (monotonic relationships). Pearson measures linear correlation between two continuous variables.
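A quick SciPy sketch showing the difference on a monotonic but nonlinear relationship:

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11).astype(float)
y = x ** 3                       # monotonic, but not linear

r_pearson, _ = pearsonr(x, y)    # < 1: the relationship is not linear
r_spearman, _ = spearmanr(x, y)  # = 1: the ranks agree perfectly
print(r_pearson, r_spearman)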
Data Wrangling
The process of cleaning and transforming raw data into a usable format for analysis, including handling missing values, filtering, and merging datasets.
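A minimal pandas sketch covering all three steps (the table and column names are hypothetical):

import numpy as np
import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, np.nan, 30.0]})
users = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bo"]})

clean = (sales
         .fillna({"amount": sales["amount"].mean()})  # handle missing values
         .query("amount > 0")                         # filter rows
         .merge(users, on="id", how="left"))          # merge datasets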
MAR, MCAR, MNAR
MCAR (Missing Completely At Random):
The probability of a value being missing is unrelated to any variables, observed or unobserved.
MAR (Missing At Random):
Missing data depends on observed variables.
MNAR (Missing Not At Random):
Missing data depends on the missing values themselves.
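A small simulation can make the three mechanisms concrete (the variables age and income are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 1000)
income = 1000 * age + rng.normal(0, 5000, 1000)

mcar = rng.random(1000) < 0.2             # MCAR: pure chance
mar = age > 60                            # MAR: depends on observed age
mnar = income > np.quantile(income, 0.8)  # MNAR: depends on income itself
income_observed = np.where(mnar, np.nan, income)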
Misclassification
A classification error where a data point is assigned the wrong class label.
Variance
A measure of the spread of data points. High variance indicates a model is sensitive to fluctuations in the training data.
Bagging
An ensemble method that reduces variance by training multiple models on bootstrapped datasets and averaging their predictions.
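A minimal sketch with scikit-learn's BaggingClassifier, whose default base learner is a decision tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)
# Each of the 50 trees sees a different bootstrap sample; predictions are averaged.
clf = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))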
K-Means (Elbow Method)
A clustering algorithm. The Elbow Method helps determine the optimal number of clusters by plotting the sum of squared errors for different k values.
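A sketch of the Elbow Method with scikit-learn, printing the inertia (sum of squared errors) per k:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # plot k vs. inertia; the "elbow" marks a good k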
Boosting
1. Initialize the model with equal weights for all data points.
2. Train a weak learner (e.g., a decision tree).
3. Calculate errors and assign higher weights to misclassified points.
4. Train the next weak learner on the weighted data.
5. Combine all weak learners to form a strong model (see the sketch below).
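AdaBoost is the classic instance of this loop; a minimal scikit-learn sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)
# AdaBoost runs exactly the reweighting loop above, using decision stumps.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))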
Agglomerative Clustering
A bottom-up hierarchical clustering method where clusters are merged iteratively based on similarity.
Cluster Assumption
Assumes data points within the same cluster share similar properties, foundational in unsupervised learning.
Linkage Criteria (Single, Complete, Average)
The linkage criterion defines the distance between two clusters when merging: single linkage uses the minimum pairwise distance, complete linkage the maximum, and average linkage the mean of all pairwise distances, as shown below.
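A minimal scikit-learn sketch; the linkage argument selects one of the criteria above:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
# linkage may be "single", "complete", "average", or "ward"
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)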
Maximum Likelihood Estimator (MLE)
A method to estimate parameters by maximizing the likelihood function, representing the probability of observing the data given the model.
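A worked sketch for a biased coin: numerically maximizing the likelihood recovers the closed-form MLE, which is the sample mean:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
flips = rng.random(100) < 0.7  # coin with unknown bias p = 0.7

def neg_log_likelihood(p):
    return -(flips.sum() * np.log(p) + (~flips).sum() * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, flips.mean())  # the two estimates agree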
Regression vs Classification
Regression predicts continuous values; classification predicts discrete class labels.
Mean Square Error (MSE)
A loss function for regression that calculates the average squared difference between predicted and actual values.
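In symbols, MSE = (1/n) * sum((y_pred_i - y_true_i)^2); a tiny NumPy sketch:

import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0 + 4) / 3 ≈ 1.42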
K-Fold Cross-Validation
A model validation method that splits data into k subsets, training on k-1 and validating on the remaining fold iteratively.
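A minimal sketch with scikit-learn's cross_val_score, which handles the fold rotation internally:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy over the 5 held-out folds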
Gradient Descent & SGD
Optimization algorithms: Gradient Descent minimizes loss by iteratively adjusting weights along the negative gradient computed on the full dataset. Stochastic Gradient Descent (SGD) updates weights using one example (or a small mini-batch) at a time, which is cheaper per step but noisier.
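A NumPy sketch of both, fitting a one-parameter linear model to synthetic data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + rng.normal(0, 0.1, 100)

w, lr = 0.0, 0.1
for _ in range(100):                      # batch gradient descent on MSE
    grad = -2 * np.mean((y - w * x) * x)  # gradient of the mean squared error
    w -= lr * grad

i = rng.integers(100)                     # SGD: one random example per update
w -= lr * (-2 * (y[i] - w * x[i]) * x[i])
print(w)  # close to the true slope of 3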
L1 and L2 Regularization
L1 adds a penalty proportional to the absolute value of weights (sparse solutions). L2 penalizes the square of weights (weight decay).
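A sketch with scikit-learn's Lasso (L1) and Ridge (L2) showing the sparsity effect:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=3, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: many coefficients driven exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: coefficients shrunk, but rarely zero
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())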
K-Nearest Neighbors (KNN)
A classification/regression algorithm that predicts based on the majority vote or average of the k closest data points.
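A minimal scikit-learn sketch on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))  # majority vote among the 5 nearest neighbors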
Logistic Regression (Softmax)
A classification algorithm. Softmax generalizes logistic regression for multi-class classification.
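The softmax itself is a few lines of NumPy; it turns raw class scores into a probability distribution:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # non-negative, sums to 1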
No Free Lunch Theorem
No algorithm is universally best for all problems. Performance depends on the specific dataset.
Bias
Bias is the error introduced by approximating real-world problems with simplified models. High bias leads to underfitting.
Overfitting
A model captures noise in the training data, performing poorly on new data. Regularization and validation sets help mitigate this.
Validation Set
A subset of data used to tune model hyperparameters and assess performance during training.
Testing Set
A hold-out dataset used to evaluate the final performance of a trained model.
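A sketch of carving out training, validation, and testing sets with scikit-learn's train_test_split (a 60/20/20 split):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test) only once, at the end.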
Reduce Variance in Trees
Techniques like bagging, random forests, and pruning reduce overfitting and variance in decision trees.
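A sketch contrasting a single tree with a random forest (assuming scikit-learn):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)  # single tree: high variance
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# The forest averages many decorrelated trees; pruning a single tree
# (max_depth, ccp_alpha) is another way to limit variance.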