Preprocessing & EDA Flashcards
Augmentation
A data preprocessing technique used to artificially increase the size of a training dataset by applying various transformations to existing data samples. These transformations can include rotation, scaling, translation, cropping, and flipping, among others. Augmentation helps improve model generalization by exposing it to a wider range of variations in the input data.
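A minimal sketch of label-preserving image augmentation using plain NumPy (random flips and 90-degree rotations); in practice, libraries such as torchvision or albumentations offer richer transforms. The 8x8 array is a stand-in for a real image:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply simple, label-preserving transformations to a 2D image array."""
    if rng.random() < 0.5:        # random horizontal flip
        image = np.fliplr(image)
    k = rng.integers(0, 4)        # random rotation by 0, 90, 180, or 270 degrees
    return np.rot90(image, k)

# generate several augmented variants of a toy 8x8 "image"
image = np.arange(64).reshape(8, 8)
augmented = [augment(image) for _ in range(4)]
```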
Bar Charts
Bar charts are graphical representations of categorical data using rectangular bars. Each bar represents a discrete category or group, and its length or height reflects the frequency, relative frequency, or other numerical value associated with that category. Bar charts are useful for visually comparing the frequency or distribution of different categories and are commonly used for categorical data such as survey responses, product sales, or demographic characteristics. They are especially effective for displaying discrete data with a small number of categories.
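A minimal matplotlib sketch with hypothetical survey counts:

```python
import matplotlib.pyplot as plt

categories = ["Yes", "No", "Undecided"]   # hypothetical survey responses
counts = [42, 31, 12]

plt.bar(categories, counts)
plt.xlabel("Response")
plt.ylabel("Frequency")
plt.title("Survey responses")
plt.show()
```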
Binning
Binning (also called bucketing) is used when you have a numerical feature and want to convert it into a categorical one. The continuous range of values is partitioned into intervals (bins or buckets), often equal-width or quantile-based, and each value is replaced by its bin or expanded into one binary feature per bin. Binning is commonly used to simplify complex datasets, reduce noise, and handle outliers.
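A minimal pandas sketch on a hypothetical series of ages: pd.cut builds fixed-edge bins, pd.qcut builds quantile-based bins, and get_dummies turns the bins into binary features:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 46, 58, 61, 72])   # hypothetical values

# explicit bin edges with readable labels
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                   labels=["child", "young", "middle", "senior"])

# quantile-based bins (roughly equally populated buckets)
age_quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# one binary feature per bin, as described above
age_dummies = pd.get_dummies(age_group, prefix="age")
```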
Box Plots
Box plots, also known as box-and-whisker plots, are graphical representations of the distribution of numerical data through quartiles. They consist of a box that spans the interquartile range (IQR), with a line inside representing the median. “Whiskers” extend from the edges of the box to the minimum and maximum values within 1.5 times the IQR from the first and third quartiles, respectively. Potential outliers beyond the whiskers are often displayed as individual data points. Box plots are useful for visualizing the spread and skewness of data and identifying outliers in a dataset.
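A minimal matplotlib sketch with synthetic data; the default whiskers use the 1.5 * IQR rule described above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = np.append(rng.normal(loc=50, scale=10, size=200),  # synthetic measurements
                 [110, 5])                                 # two injected outliers

plt.boxplot(data)          # whis=1.5 by default
plt.ylabel("Value")
plt.title("Box plot with 1.5 * IQR whiskers")
plt.show()
```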
Broadcasting
Broadcasting is used to perform operations on arrays of different shapes efficiently. It allows arrays with different dimensions to be combined or operated on without explicit looping, improving both computational performance and code readability. When you perform an operation between arrays, NumPy (and similar libraries) automatically expands the smaller array's shape to match the larger one, conceptually stretching or replicating its elements so the two can be combined element-wise. Broadcasting enables seamless operations across multidimensional data structures, facilitating tasks such as batch processing, data augmentation, and model training.
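A minimal NumPy sketch: a (4,) vector and a (3, 1) column are both broadcast against a (3, 4) matrix without any explicit loop:

```python
import numpy as np

X = np.arange(12).reshape(3, 4)              # shape (3, 4)

col_means = X.mean(axis=0)                   # shape (4,)
X_centered = X - col_means                   # (3, 4) - (4,)   -> (3, 4)

row_bias = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
X_shifted = X + row_bias                     # (3, 4) + (3, 1) -> (3, 4)
```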
Cardinality
The number of unique values in a categorical variable or feature. High-cardinality variables have a large number of distinct categories, while low-cardinality variables have few. Cardinality is an important consideration in feature engineering and can impact model performance and complexity.
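A quick pandas check of cardinality on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "DE", "US", "FR", "PL", "DE"],   # low cardinality
    "user_id": [101, 102, 103, 104, 105, 106],         # high cardinality (all unique)
})

print(df.nunique())               # unique values per column
print(df["country"].nunique())    # cardinality of a single feature
```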
Categorical Plots
A type of graph used to visualize the distribution and relationships within categorical data. Categorical data is data that falls into distinct groups or categories.
Categorical plots are useful for:
Distribution: Showing how frequently each category occurs.
Comparison: Comparing different categories side-by-side.
Relationships: Investigating potential relationships between different categorical variables.
Common types of categorical plots
- Bar Plots
- Pie Charts
- Count Plots
- Box Plots
- Strip Plots
- Swarm Plots
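A minimal seaborn sketch using its built-in tips example dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.countplot(data=tips, x="day")                 # frequency of each category
plt.show()

sns.boxplot(data=tips, x="day", y="total_bill")   # numeric distribution per category
plt.show()
```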
Class imbalance
An unequal distribution of classes or categories in a classification dataset, where one class is significantly more prevalent than others. Class imbalance can lead to biased model predictions, as the model may have a tendency to favor the majority class and overlook minority classes. Addressing class imbalance often requires specific techniques such as resampling methods, cost-sensitive learning, or ensemble methods.
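A quick way to inspect class balance, assuming a hypothetical label series:

```python
import pandas as pd

y = pd.Series([0] * 950 + [1] * 50)     # hypothetical labels: 95% vs 5%

print(y.value_counts())                 # absolute counts per class
print(y.value_counts(normalize=True))   # class proportions
```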
Correlation Analysis
Statistical technique used to measure and assess the strength and direction of the relationship between two or more variables in a dataset. It quantifies the degree of association between variables using correlation coefficients, such as the Pearson correlation coefficient, Spearman rank correlation coefficient, or Kendall tau rank correlation coefficient. Correlation analysis helps identify patterns and dependencies among variables (although correlation alone does not establish causation), facilitating feature selection, model building, and predictive modeling in machine learning and data analysis.
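A minimal pandas/seaborn sketch with hypothetical numeric features:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({                      # hypothetical housing-style data
    "rooms": [2, 3, 3, 4, 5, 5],
    "area":  [40, 62, 58, 85, 110, 120],
    "price": [120, 180, 170, 250, 330, 360],
})

corr = df.corr(method="pearson")         # or "spearman", "kendall"
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```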
Data Balancing
Data balancing, also known as class imbalance correction or oversampling/undersampling, is a preprocessing technique used in machine learning to address imbalanced datasets where one class is significantly more prevalent than others. It involves modifying the dataset to ensure that each class is represented fairly during model training. Techniques for data balancing include random undersampling, random oversampling, Synthetic Minority Over-sampling Technique (SMOTE), and ensemble methods. Data balancing is crucial for improving the performance and fairness of classification models, particularly in applications where class distribution is skewed.
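A minimal sketch of random oversampling with scikit-learn's resample utility, assuming a hypothetical DataFrame with a binary label column; SMOTE (from the imbalanced-learn package) would be applied in a similar spot in the workflow:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({                          # hypothetical imbalanced data
    "feature": range(100),
    "label":   [0] * 90 + [1] * 10,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# randomly oversample the minority class (with replacement) to match the majority
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```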
Data Cleaning
Process within data preparation that involves identifying and addressing errors, inconsistencies, outliers, and missing values within a dataset to enhance its quality and reliability.
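A minimal pandas sketch on hypothetical messy data: dropping duplicates, treating implausible values as missing, and filling the gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({                                   # hypothetical messy data
    "age":   [25, 32, np.nan, 40, 40, 250],
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "d@x.com", "e@x.com"],
})

df = df.drop_duplicates()                             # remove duplicate rows
df.loc[~df["age"].between(0, 120), "age"] = np.nan    # treat implausible ages as missing
df["age"] = df["age"].fillna(df["age"].median())      # fill remaining gaps
```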
Data imputation
A process of filling in missing values in a dataset with estimated or predicted values. It is a common technique used to handle missing data before performing analysis or training machine learning models. Imputation methods can range from simple strategies like mean or median imputation to more complex techniques such as regression-based imputation or k-nearest neighbors imputation.
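A minimal scikit-learn sketch showing simple and k-nearest-neighbors imputation on a toy array:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

mean_imputer = SimpleImputer(strategy="mean")   # or "median", "most_frequent"
X_mean = mean_imputer.fit_transform(X)

knn_imputer = KNNImputer(n_neighbors=2)         # k-nearest neighbors imputation
X_knn = knn_imputer.fit_transform(X)
```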
Data Preprocessing
Techniques used to condition raw data before feeding it into a machine learning model. This includes tasks like scaling, normalization, transforming data types, feature engineering, and reducing the number of features (dimensionality reduction).
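A minimal sketch of a scikit-learn pipeline that couples scaling with a model, using random toy data; wrapping preprocessing in a Pipeline keeps the same transformations applied at training and prediction time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)                 # toy features
y = np.random.randint(0, 2, size=100)      # toy binary target

model = Pipeline([
    ("scale", StandardScaler()),           # preprocessing step
    ("clf", LogisticRegression()),         # estimator
])
model.fit(X, y)
print(model.predict(X[:5]))
```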
Data Quality Assessment
Process of evaluating the accuracy, completeness, consistency, and reliability of data to ensure that it meets the requirements of the intended use. It involves identifying and correcting errors, anomalies, and inconsistencies in the data, as well as assessing its fitness for specific purposes. Data quality assessment encompasses various techniques and methodologies, including data profiling, data cleansing, outlier detection, and validation. It is essential for ensuring the integrity and trustworthiness of data in decision-making, analysis, and modeling processes.
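A minimal pandas profiling sketch, assuming a hypothetical data.csv file:

```python
import pandas as pd

df = pd.read_csv("data.csv")          # hypothetical input file

df.info()                             # dtypes, non-null counts, memory usage
print(df.describe(include="all"))     # basic statistics per column
print(df.isna().sum())                # missing values per column
print(df.duplicated().sum())          # number of duplicate rows
```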
Data Sampling
Process of selecting a subset of observations or data points from a larger dataset to represent the population or distribution of interest. Sampling techniques can be random or non-random and may involve techniques such as simple random sampling, stratified sampling, systematic sampling, or cluster sampling. Data sampling is widely used in statistics, survey research, and machine learning for estimating population parameters, reducing computational complexity, and generating training datasets for model training and evaluation.
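A minimal sketch of simple random and stratified sampling, assuming a hypothetical DataFrame with a "label" column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                    # hypothetical dataset with a "label" column

simple = df.sample(n=1000, random_state=0)      # simple random sample

# stratified split: class proportions are preserved in both parts
train, test = train_test_split(df, test_size=0.2,
                               stratify=df["label"], random_state=0)
```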
Data Transformation
Process of converting or modifying raw data into a more suitable format for analysis, modeling, or visualization. It involves operations such as normalization, standardization, scaling, encoding, imputation, aggregation, and feature engineering. Data transformation aims to improve the quality, interpretability, and performance of data in machine learning, statistical analysis, and data-driven decision-making processes. It plays a crucial role in preprocessing pipelines, where it prepares the data for subsequent tasks such as modeling, clustering, or classification.
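A minimal scikit-learn sketch that standardizes a numeric column and one-hot encodes a categorical one in a single ColumnTransformer, on hypothetical data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                              # hypothetical raw data
    "income": [30_000, 52_000, 87_000, 41_000],
    "city":   ["Kraków", "Berlin", "Berlin", "Paris"],
})

transform = ColumnTransformer([
    ("scale",  StandardScaler(), ["income"]),    # standardize the numeric column
    ("encode", OneHotEncoder(),  ["city"]),      # one-hot encode the categorical column
])
X = transform.fit_transform(df)
```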
Data Visualization
Graphical representation of data and information to facilitate understanding and interpretation. It encompasses a wide range of techniques and tools for creating visual representations such as charts, graphs, maps, and dashboards. Data visualization is used to explore patterns, trends, and relationships in data, communicate insights, and support decision-making in various fields including business, science, and engineering. It plays a crucial role in exploratory data analysis, storytelling, and conveying complex information to diverse audiences.
Dealing with missing features
- removing rows or columns
- imputing values
- using domain knowledge to create derived features
- getting new and more data from source or other data sets
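A minimal pandas sketch of the first two strategies on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50_000, 60_000, np.nan]})

rows_dropped = df.dropna()                             # remove rows with any missing value
cols_dropped = df.dropna(axis=1)                       # remove columns with any missing value
imputed = df.fillna(df.median(numeric_only=True))      # impute with column medians
```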
Decoder
A component or algorithm that transforms encoded data or representations back into their original format or domain. Decoders are commonly used in autoencoders, generative models, and communication systems to recover information from compressed or encoded representations. In natural language processing, decoders are used in sequence-to-sequence models for generating output sequences from encoded input representations.
Density Plots
Density plots are best suited for continuous numeric data. A density plot is a smoothed version of a histogram, used to visualize the distribution of a continuous numerical variable. It shows the estimated probability density of the data. The plot consists of a curve that represents the probability density function (PDF) of the variable. The area under the curve always totals to 1.
- Peaks in the curve indicate regions where data points are more concentrated.
- Valleys represent areas where data is less frequent.
- The overall shape gives insights into the spread, skewness, and whether the distribution has multiple modes (peaks).
Density plots are useful for identifying distributions with multiple peaks, which histograms might obscure.
The smoothness of a density plot is controlled by a parameter called the bandwidth. Experimenting with different bandwidths can change the level of detail revealed.
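A minimal seaborn sketch on its built-in tips dataset; bw_adjust scales the bandwidth, so smaller values reveal more detail and larger values smooth more:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.kdeplot(data=tips, x="total_bill", bw_adjust=0.5, label="bw_adjust=0.5")
sns.kdeplot(data=tips, x="total_bill", bw_adjust=2.0, label="bw_adjust=2.0")
plt.legend()
plt.show()
```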
Encoder
A component or algorithm that converts raw input data into a suitable format for processing, analysis, or modeling. Encoders transform data from one representation to another, such as converting categorical variables into numerical representations or compressing high-dimensional data into low-dimensional embeddings. In deep learning, encoders are commonly used in autoencoders, sequence-to-sequence models, and neural network architectures to learn compact and informative representations of input data for downstream tasks such as classification, regression, and generation.
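A minimal PyTorch autoencoder sketch illustrating both the Encoder and Decoder entries: the encoder compresses the input into a latent code and the decoder reconstructs the input from it. The layer sizes are arbitrary examples:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # encoder: compress the input into a low-dimensional representation
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # decoder: reconstruct the original input from the latent code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # encoded representation
        return self.decoder(z)     # reconstruction

x = torch.rand(8, 784)             # a toy batch of flattened 28x28 inputs
reconstruction = AutoEncoder()(x)
```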
Encoding
A process of converting categorical variables or features into numerical representations that can be used as input for machine learning algorithms. Encoding allows categorical information to be effectively incorporated into machine learning models, which typically require numerical input data.
We can encode ordinal data (with an intrinsic order) or nominal data (without an intrinsic order). For ordinal data we often use label (ordinal) encoding; for nominal data, one-hot encoding. Other common encoding techniques include binary encoding, frequency encoding, and target encoding.
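A minimal sketch of ordinal and one-hot encoding on a hypothetical DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size":  ["small", "large", "medium", "small"],   # ordinal feature
    "color": ["red", "green", "blue", "green"],       # nominal feature
})

# label/ordinal encoding with an explicit category order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ord_enc.fit_transform(df[["size"]]).ravel()

# one-hot encoding for the nominal feature (sklearn's OneHotEncoder works similarly)
df = pd.get_dummies(df, columns=["color"])
```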
Feature Engineering
The problem of transforming raw data into a dataset of features is called feature engineering.
Feature Engineering is a process of creating new features or modifying existing features in a dataset to improve the performance of machine learning models. It involves selecting relevant features, transforming data, creating derived features, and reducing dimensionality. Effective feature engineering can enhance model interpretability, accuracy, and generalization to new data.
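A minimal pandas sketch of derived features built from hypothetical order data:

```python
import pandas as pd

df = pd.DataFrame({                                  # hypothetical transactions
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-03-02"]),
    "total":      [120.0, 80.0, 200.0],
    "n_items":    [4, 2, 5],
})

# derived features built from the raw columns
df["avg_item_price"] = df["total"] / df["n_items"]
df["order_month"] = df["order_date"].dt.month
df["is_weekend"] = df["order_date"].dt.dayofweek >= 5
```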
Feature Importance Analysis
Feature importance analysis is a set of techniques used to rank the features (input variables) in a machine learning model based on how much they contribute to the model’s predictions. Feature importance doesn’t mean causation. Highly important features may be correlated with other important features.
Common Techniques
1) Permutation Importance (a minimal sketch follows this technique list):
Shuffle the values of a single feature randomly.
Re-evaluate the model’s performance.
A large drop in performance indicates that the feature is important.
Repeat for all features to see relative importance.
2) Mean Decrease in Impurity (Tree-based Models):
For decision trees and random forests, calculate how much each feature decreases the impurity (e.g., Gini index or entropy) across the splits in the trees.
Features that create purer splits are assigned higher importance.
3) Coefficients (in Linear Models):
For linear models like linear regression and logistic regression, the magnitude of feature coefficients indicates a feature’s impact (assuming features are scaled properly).
4) Partial Dependence Plots (PDP):
Show the marginal effect of one feature on the predicted outcome.
Helps visualize how changes in a feature influence the prediction, even if the relationship is non-linear.
5) Information Gain:
Used in decision trees and similar models to determine the most informative features for splitting nodes.
It measures the reduction in entropy (or increase in information) achieved by splitting data based on a particular feature.
Features with higher information gain are considered more important for classification tasks.
6) SHAP Values (SHapley Additive exPlanations):
Provides a unified measure of feature importance based on game theory concepts.
It calculates the contribution of each feature to the difference between the actual prediction and the average prediction across all samples.
Positive SHAP values indicate features that increase the prediction, while negative values indicate features that decrease the prediction.
7) L1 (Lasso) Regularization:
In regularized linear models like Lasso Regression, features with non-zero coefficients after regularization are considered important.
L1 regularization encourages sparsity by penalizing the absolute values of the coefficients, effectively selecting a subset of the most important features.
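A minimal scikit-learn sketch of techniques 1 and 2 (permutation importance and mean decrease in impurity) on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

impurity_importance = model.feature_importances_   # mean decrease in impurity (technique 2)

# permutation importance on held-out data (technique 1)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean),
                key=lambda p: p[1], reverse=True)
print(ranked[:5])
```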
Libraries for Feature Importance
Scikit-learn (Python): Offers permutation importance and built-in methods for tree-based models.
ELI5 (Python): Great for explaining model predictions and visualizing feature importance.
DALEX (R): Provides a range of feature importance methods and explainers.
Feature selection
A process of choosing a subset of relevant features from a larger set of features in a dataset. It aims to reduce dimensionality, improve model performance, and enhance interpretability by focusing on the most informative features. Feature selection techniques include filter methods, wrapper methods, and embedded methods, which assess feature importance based on statistical measures, model performance, or feature relevance to the target variable. The main methods are listed below, followed by a short code sketch.
Filter Methods
- Information Gain: Measures how much information a feature provides about the target variable.
- Chi-Squared Test: Evaluates the independence between a feature and the target variable.
- Correlation Analysis: Identifies features that are highly correlated with the target variable or strongly correlated among themselves (potentially leading to redundancy).
- Variance Threshold: Removes features with low variance, as they are unlikely to carry much predictive information.
Wrapper Methods
- Forward Selection: Starts with an empty feature set and iteratively adds the feature that most improves model performance.
- Backward Elimination: Starts with all features and iteratively removes the least important feature until performance drops below a threshold.
- Recursive Feature Elimination (RFE): A variant of backward elimination that uses a model to rank feature importance and recursively eliminates the least important ones.
Embedded Methods
- Regularization (L1/Lasso, L2/Ridge): Penalizes model complexity, forcing coefficients of less important features towards zero.
- Decision Trees and Tree-Based Ensembles: Tree-based algorithms (like Random Forests) provide feature importance scores that can be used for selection.
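A minimal scikit-learn sketch of one filter method (SelectKBest with the ANOVA F-score) and one wrapper method (RFE) on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# filter method: keep the 10 features with the highest ANOVA F-score
filtered = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print(list(X.columns[filtered.get_support()]))

# wrapper method: recursive feature elimination driven by a linear model
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print(list(X.columns[rfe.get_support()]))
```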