Machine Learning Flashcards
Keyword/Topic
Definition
Encoding (One-Hot, Bag of Words)
Techniques for converting categorical data or text into numerical format. One-hot encoding represents each category as a binary vector with a single 1 in that category's position. Bag of words represents a text as a vector of word counts, ignoring word order.
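A minimal sketch of both encodings in plain Python, assuming whitespace tokenization and an alphabetically sorted vocabulary. The helper names one_hot and bag_of_words are illustrative, not from any library.

```python
from collections import Counter

def one_hot(categories):
    """Map each category to a binary vector with a single 1."""
    vocab = sorted(set(categories))              # fixed column order (assumption)
    index = {c: i for i, c in enumerate(vocab)}
    vectors = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        vectors.append(row)
    return vocab, vectors

def bag_of_words(documents):
    """Represent each document as word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]   # naive tokenizer
    vocab = sorted({w for doc in tokenized for w in doc})
    counts = [Counter(doc) for doc in tokenized]
    return vocab, [[c[w] for w in vocab] for c in counts]

colors_vocab, colors_vectors = one_hot(["red", "green", "red", "blue"])
words_vocab, words_vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
```

Note that the bag-of-words vectors keep counts ("the" appears twice in the second text) but lose word order.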
Color Palettes
Diverging Color Palettes:
Used to show data that has a meaningful midpoint (e.g., positive vs. negative values). Colors transition from one hue to a neutral midpoint and then to another hue.
Sequential Color Palettes:
Designed for ordered data, where colors transition gradually from light to dark (or vice versa) to represent increasing values.
Categorical Color Palettes:
Used for distinct, non-ordered categories. Each category is represented by a unique color to ensure clear differentiation.
CSV and HDF5
CSV is a plain-text format storing tabular data, while HDF5 is a binary format designed for large datasets, supporting hierarchical data structures.
Stevens’ Scales
Nominal:
Categories with no inherent order (e.g., eye color, blood type).
Ordinal:
Categories with a meaningful order, but the intervals between them are not necessarily equal (e.g., survey ratings such as low, medium, high).
Interval:
Ordered numeric values with equal intervals, but no true zero point, so differences are meaningful and ratios are not (e.g., temperature in Celsius or Fahrenheit).
Ratio:
Ordered numeric values with equal intervals and a true zero point, allowing for meaningful ratios (e.g., height, weight, age).
Spearman vs Pearson
Spearman measures rank correlation (monotonic relationships). Pearson measures linear correlation between two continuous variables.
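A small illustration of the difference, written from the standard definitions: Spearman is computed here as Pearson applied to ranks. The data (y = x cubed) is a made-up example of a relationship that is monotonic but not linear.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]        # monotonic, but not linear
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because y always increases with x, Spearman is 1 (up to floating-point error), while Pearson is below 1 because the points do not fall on a straight line.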
Data Wrangling
The process of cleaning and transforming raw data into a usable format for analysis, including handling missing values, filtering, and merging datasets.
MAR, MCAR, MNAR
MCAR (Missing Completely At Random):
The probability that a value is missing is unrelated to any variable, observed or unobserved.
MAR (Missing At Random):
The probability that a value is missing depends only on observed variables, not on the missing values themselves.
MNAR (Missing Not At Random):
Missing data depends on the missing values themselves.
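The three mechanisms can be simulated on synthetic data. Everything in this sketch is an assumption made for illustration: the age and income variables, the drop probabilities, and the 3500 cutoff.

```python
import random

rng = random.Random(42)
n = 10_000
age = [rng.uniform(20, 70) for _ in range(n)]
income = [1000 + 50 * a + rng.gauss(0, 300) for a in age]   # income rises with age

def mean(values):
    return sum(values) / len(values)

# MCAR: every income value has the same 30% chance of being dropped.
mcar = [y for y in income if rng.random() > 0.3]

# MAR: income is dropped more often for older people (age is always observed).
mar = [y for a, y in zip(age, income) if not (a > 50 and rng.random() < 0.6)]

# MNAR: high incomes are dropped more often, so missingness depends on the
# very value that goes missing.
mnar = [y for y in income if not (y > 3500 and rng.random() < 0.6)]

true_mean = mean(income)
mcar_mean, mar_mean, mnar_mean = mean(mcar), mean(mar), mean(mnar)
```

In this setup the MCAR mean stays close to the true mean, while the MNAR mean is pulled down. The MAR mean is shifted too, but that shift can in principle be corrected using the observed age values.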
Misclassification
A classification error where a data point is assigned the wrong class label.
Variance
A measure of how far data points spread around their mean. In the bias-variance sense, a high-variance model is sensitive to fluctuations in the training data and tends to overfit.
Bagging
An ensemble method (bootstrap aggregating) that reduces variance by training multiple models on bootstrapped datasets and combining their predictions, by averaging for regression or majority vote for classification.
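A minimal sketch of bagging for regression. The base model (a least-squares line), the toy dataset, and the function names are assumptions chosen to keep the example short; any base learner could take its place.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:                      # degenerate resample: all x identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b

def bagged_predict(points, x_new, n_models=50, seed=0):
    """Average the predictions of models fit on bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) samples with replacement.
        sample = [rng.choice(points) for _ in points]
        a, b = fit_line(sample)
        predictions.append(a + b * x_new)
    return sum(predictions) / len(predictions)

# Toy data: y = 2x + 1 with a small alternating perturbation.
data = [(x, 2 * x + 1 + (0.5 if x % 2 == 0 else -0.5)) for x in range(10)]
prediction = bagged_predict(data, 4.0)
```

Each resampled model sees a slightly different dataset, and averaging smooths out their individual quirks. The result should land near the underlying value 2 * 4 + 1 = 9.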
K-Means (Elbow Method)
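A clustering algorithm that partitions data into k clusters by assigning each point to its nearest centroid and updating centroids until assignments stabilize. The Elbow Method helps choose the number of clusters by plotting the sum of squared errors (SSE) against k and picking the k at the bend of the curve, beyond which adding clusters yields little improvement.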
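A rough sketch of the Elbow Method on one-dimensional data. The evenly spaced initial centroids are an assumption made to keep the run deterministic; practical implementations usually use random or k-means++ initialization.

```python
def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, sse)."""
    data = sorted(values)
    # Deterministic init (assumption): k evenly spaced points from the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean (keep it if the cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, sse

points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]   # three obvious groups
sse_by_k = {k: kmeans_1d(points, k)[1] for k in range(1, 6)}
```

On this toy data the SSE drops sharply up to k = 3 and barely changes afterwards, so the elbow sits at k = 3.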
Boosting
1. Initialize the model with equal weights for all data points.
2. Train a weak learner (e.g., decision tree).
3. Calculate errors and assign higher weights to misclassified points.
4. Train the next weak learner on the weighted data.
5. Combine all weak learners to form a strong model.
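The steps above can be sketched as follows. The card does not name a specific algorithm, so the AdaBoost-style weight update, the one-dimensional decision stumps, and the toy labels are all assumptions made for illustration.

```python
from math import exp, log

def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` at or below the threshold, else the opposite."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                          # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = best_stump(xs, ys, weights)  # steps 2 and 4: weak learner
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)           # learner's vote strength
        ensemble.append((alpha, thr, pol))
        # Step 3: raise the weights of misclassified points, then renormalize.
        weights = [
            w * exp(-alpha * y * stump_predict(x, thr, pol))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Step 5: weighted vote of all weak learners."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]   # no single stump separates these labels
model = adaboost(xs, ys, n_rounds=3)
accuracy = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
```

No single stump gets more than 7 of these 10 points right, but the weighted vote of three stumps classifies all of them correctly.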
Agglomerative Clustering
A bottom-up hierarchical clustering method where clusters are merged iteratively based on similarity.
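A minimal sketch of the bottom-up merging loop on one-dimensional points. The linkage parameter, with its single, complete, and average options, and the function names are illustrative choices, not a library API.

```python
def linkage_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage criterion."""
    dists = [abs(p - q) for p in a for q in b]
    if linkage == "single":       # closest pair of points
        return min(dists)
    if linkage == "complete":     # farthest pair of points
        return max(dists)
    if linkage == "average":      # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, n_clusters, linkage="single"):
    """Start with one cluster per point and merge the closest pair repeatedly."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

points = [1.0, 1.5, 2.0, 8.0, 8.5, 20.0]
single_result = agglomerative(points, 3, linkage="single")
complete_result = agglomerative(points, 3, linkage="complete")
average_result = agglomerative(points, 3, linkage="average")
```

On well-separated data like this, all three linkage criteria produce the same three clusters. They tend to differ when clusters are elongated or close together.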
Cluster Assumption
The assumption that data points within the same cluster are likely to share the same label or similar properties. It underlies clustering and semi-supervised learning.
Single
Complete
Average