Domain 2: Exploratory Data Analysis Flashcards
Why should you address missing data?
Models may fail to train properly, or they may produce biased predictions, if missing data is not addressed.
How do you identify missing data?
Various tools and functions can check for missing values; in pandas, for example, isna()/isnull() combined with sum() reports missing counts per column.
What are three options to handle missing data?
Imputation: replace missing values w/ a statistic
Dropping: remove rows/columns with missing values
Predicting: use another ML model to predict/fill in missing values
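A minimal pandas/scikit-learn sketch of the three options, using a hypothetical toy DataFrame (the age/income columns are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 38, 41],
                   "income": [52000, 61000, np.nan, 58000]})
print(df.isna().sum())  # identify missing values per column

# Imputation: replace missing values with a statistic (here, the median)
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# Dropping: remove rows (or columns) containing missing values
dropped = df.dropna()

# Predicting: fill missing values from similar rows (a simple model-based approach)
predicted = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)
```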
_____ refers to inaccurate, incomplete, or inconsistent data that can mislead or confuse your machine learning model.
Corrupt data
Ways to handle corrupt data.
Validation rules, cleanup, and outlier detection
_____ are commonly used words in a language that are often removed from text data before training text-based machine learning models. Examples include ‘the’, ‘is’, ‘at’, etc.
Stop words
Why should you remove stop words?
Removing stop words helps in reducing the dimensionality of the text data and increases the model’s focus on words with more significant meaning.
How do you remove stop words?
Identify them first, then filter them out
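A minimal sketch of identifying and filtering stop words, here using scikit-learn's built-in English stop-word list (NLTK or spaCy lists work the same way); the sentence is just an illustration:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "the cat is sitting at the window"
tokens = text.lower().split()

# Identify stop words, then filter them out
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(filtered)  # ['cat', 'sitting', 'window']
```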
_____ is about structuring data in a way that is suitable for the problem at hand and for the machine learning model to interpret.
Data formatting
What kind of service can help transform and format data?
AWS Glue
_____ is the process of scaling individual samples to have unit norm. This process can help in the performance of machine learning models.
Normalization
_____ is a strategy used to increase the diversity of data available for training models without actually collecting new data. This technique is particularly useful for tasks such as image and speech recognition, where input data can be modified slightly to create new training examples.
Data augmentation
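A minimal NumPy sketch of image-style augmentation on a stand-in array (a random 32x32 "image" used purely for illustration); libraries such as torchvision or Keras provide richer transforms:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real training image

# Slightly modified copies become new training examples
augmented = [
    np.fliplr(image),                                         # horizontal flip
    np.rot90(image),                                          # 90-degree rotation
    np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1),  # small random noise
]
```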
_____ is crucial when different features have different ranges. This ensures that the model treats each feature equally.
Feature scaling
What are two common methods of scaling?
Min-Max Scaling (Normalization) - transforms the data by scaling each feature to a given range, typically [0, 1]
Standardization (z-score normalization) - Centers the feature columns at mean 0 with standard deviation 1, so that the feature columns take the form of a normal distribution
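A minimal scikit-learn sketch of the two scaling methods on a tiny made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])  # two features with very different ranges

X_minmax = MinMaxScaler().fit_transform(X)  # each feature scaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each feature centered at mean 0, std 1
```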
T/F: Sometimes, you may have to label some data using human experts prior to developing an ML model.
True - in a case where you don’t have sufficient labeled data
What tool has an auto-labeling feature?
Amazon SageMaker Ground Truth, whose automated data labeling feature labels data automatically and routes low-confidence items to human labelers (for example, Mechanical Turk workers)
In natural language processing (NLP), text data requires conversion into _____ format before being inputted to machine learning algorithms.
numerical
What are some common techniques to convert text data into numerical format?
Bag of Words (BoW): Represents text by the frequency of each word.
Term Frequency-Inverse Document Frequency (TF-IDF): Considers the frequency of a term in relation to its frequency across multiple documents, reducing the influence of common terms.
Word Embeddings: Such as Word2Vec or GloVe, represent words as dense vectors in a continuous space where the distance between words conveys semantic similarity.
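A minimal scikit-learn sketch of BoW and TF-IDF on two toy documents (word embeddings such as Word2Vec would typically come from a library like gensim and are omitted here):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

bow = CountVectorizer().fit_transform(docs)    # Bag of Words: raw word counts
tfidf = TfidfVectorizer().fit_transform(docs)  # TF-IDF: down-weights common terms
print(bow.toarray())
print(tfidf.toarray())
```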
_____ data can be represented as a time-series of audio signals.
Speech
Common techniques to extract features include:
Mel-Frequency Cepstral Coefficients (MFCCs): Captures the short-term power spectrum of sound.
Spectrogram: A visual way to represent the signal strength, or “loudness”, of a signal over time at various frequencies that are present in a waveform.
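A minimal sketch of extracting both kinds of features, assuming the librosa library and using a synthetic 440 Hz tone in place of real speech audio:

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)  # synthetic 1-second tone as stand-in audio

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                      # MFCC features
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)   # spectrogram in dB
```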
_____ data requires features that can capture the visual information contained within.
Image
Techniques for feature extraction from image data:
Color Histograms: Captures the distribution of colors in an image.
Edge Detection: Detects significant transitions in color or intensity.
Convolutional Neural Networks (CNNs): Automatically discover the internal features from raw images.
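A minimal NumPy/SciPy sketch of the first two techniques on a stand-in random image (a CNN would normally be built with a deep learning framework and is not shown):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for a real RGB image

# Color histogram: distribution of pixel intensities in the red channel
hist_red, _ = np.histogram(image[:, :, 0], bins=16, range=(0, 1))

# Edge detection: Sobel filters highlight sharp intensity transitions
gray = image.mean(axis=2)
edges = np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))
```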
ImageNet: Used for training computer vision models.
LibriSpeech: An audio corpus for speech recognition research.
The Universal Dependencies: A collection of annotated text corpora in over 70 languages.
Popular public datasets
When working with public datasets, feature extraction methods are typically determined by _____.
the nature of the data provided
It involves choosing a subset of relevant features for model training.
Feature selection
Techniques for feature selection
Filter Methods: Use statistical measures to score the relevance of features.
Wrapper Methods: Evaluate multiple models and select the best subset of features.
Embedded Methods: Perform feature selection as part of the model training process (e.g., regularization).
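A minimal scikit-learn sketch of a filter method and a wrapper method on the built-in iris dataset (an embedded method would be, e.g., L1-regularized regression):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score features with a statistical test and keep the top k
X_filtered = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a model
X_wrapped = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit_transform(X, y)
```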
_____ is integral to preparing data for machine learning, as it helps transform raw data into manageable groups of inputs.
Feature extraction
Involves transforming raw data into features that better represent the underlying problem to predictive models, leading to improved model accuracy on unseen data.
Feature engineering
_____, also referred to as discretization, involves dividing continuous features into discrete bins or intervals, which can often lead to better performance for certain machine learning models that work better with categorical data.
Binning
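A minimal pandas sketch of binning a made-up continuous age column:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])

equal_width = pd.cut(ages, bins=3)   # three equal-width intervals
equal_freq = pd.qcut(ages, q=3)      # three equal-frequency (quantile) bins
labeled = pd.cut(ages, bins=[0, 18, 65, 120],
                 labels=["child", "adult", "senior"])  # hypothetical custom intervals
```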
In the context of text processing, _____ is the process of splitting text into individual terms or tokens. This is a critical step in the natural language processing (NLP) pipeline as it helps in preparing the text for embedding or feature extraction.
tokenization
_____are data points that fall far away from the majority of the data. They can skew the results of data analysis and model training. Handling this is essential to prevent them from having an undue influence on the model’s performance.
Outliers
Detection methods for outliers:
Standard deviation from the mean (z-score) and the interquartile range (IQR)
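A minimal NumPy sketch of both detection rules on synthetic data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 200), 120)  # 120 is an injected outlier

# Standard-deviation rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
sd_outliers = data[np.abs(z) > 3]

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```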
_____ are new features that are created from one or more existing features, typically to provide additional context to a model, or to highlight relationships between features that may not be readily apparent.
Synthetic features
_____ is a technique used to convert categorical variables into a form that could be provided to ML algorithms to do a better job in prediction. It creates a binary vector for each category of the feature.
One-hot encoding
_____ involves techniques that reduce the number of input variables in a dataset. When high, these datasets can be problematic for machine learning models—a phenomenon often referred to as the “curse of ______.”
Dimensionality reduction, dimensionality
Principal Component Analysis (PCA): A statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables known as principal components.
t-Distributed Stochastic Neighbor Embedding (t-SNE): A nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.
Feature selection techniques: Methods such as backward elimination, forward selection, and recursive feature elimination help in selecting the most important features for the model.
Techniques for reducing dimensionality of data
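A minimal scikit-learn sketch of PCA on the built-in iris dataset (t-SNE follows the same fit_transform pattern via sklearn.manifold.TSNE):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)             # project 4 correlated features onto 2 components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```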
T/F: binning might improve a model’s performance for a continuous variable that has a nonlinear relationship with the target variable, but it may also result in the loss of information.
True
T/F: Tokenization is essential for text analysis but not applicable for numerical data.
True
T/F: Handling outliers is crucial, but one must decide whether to remove them or adjust them based on the context.
True
T/F: Synthetic features can enhance model performance, but creating too many can lead to overfitting.
True
T/F: One-hot encoding can introduce sparsity into the dataset, which might not be optimal for all models.
True
T/F: Dimensionality reduction techniques like PCA can remove noise and reduce overfitting but can make the interpretation of the model more challenging.
True
_____ are used to display the relationship between two continuous variables. Each point on the graph represents the values of two variables for a particular observation.
Scatter plots
_____ graphs are used to represent data points collected or recorded at many successive times, often with equal intervals.
Time series
_____ are used to show the distribution of a dataset: how many times each value appears (i.e., frequency).
Histograms
_____ (also known as box-and-whisker plots) are used to show the distribution of quantitative data and to highlight the median, quartiles, and outliers within the dataset.
Box plots
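A minimal Matplotlib sketch of three of these plot types on random data (a time-series graph would simply be plt.plot with dates on the x-axis):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, y)      # scatter plot: relationship between two continuous variables
axes[1].hist(x, bins=20)   # histogram: frequency of values
axes[2].boxplot(x)         # box plot: median, quartiles, outliers
plt.show()
```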
_____ can be used for data pre-processing, data exploration, and visualization with the same plotting libraries that we’ve mentioned here or through other visualization tools like Seaborn, which works on top of Matplotlib.
Jupyter notebooks
_____ summarize the main features of a data set in quantitative terms.
Descriptive statistics
What kinds of measures might descriptive statistics include?
Mean, median, and mode - depict the center of the data
Standard deviation and variance - indicate how spread out the data is
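A minimal pandas sketch of these measures on a toy series:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.mean(), s.median(), s.mode().tolist())  # center: mean, median, mode
print(s.std(), s.var())                         # spread: standard deviation, variance
print(s.describe())                             # common summary statistics in one call
```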
_____ measures the strength and direction of the linear relationship between two variables.
Correlation
A correlation coefficient ranges between_____.
-1 and 1
A correlation coefficient close to 1 implies a _____ correlation (as one variable increases, the other tends to also increase)
strong positive
A correlation coefficient close to -1 implies a _____correlation (as one variable increases, the other tends to decrease).
strong negative
A correlation close to 0 suggests _______.
no linear relationship
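A minimal pandas sketch of computing correlation; the column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({"hours_studied": [1, 2, 3, 4, 5],
                   "exam_score": [52, 58, 65, 70, 78]})

print(df["hours_studied"].corr(df["exam_score"]))  # Pearson coefficient, between -1 and 1
print(df.corr())                                   # full correlation matrix
```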
The _____ is used in the context of hypothesis testing to measure the strength of the evidence against the null hypothesis. It quantifies the probability of observing the given sample data, or something more extreme, assuming the null hypothesis is true.
p-value
T/F: Summary statistics can also aid in feature engineering by providing insights into variable scales and distributions that can be normalized or standardized prior to modeling.
True
_____ is a form of unsupervised learning that is used to find structure in a dataset.
Cluster analysis
_____ is a method of cluster analysis that seeks to build a hierarchy of clusters. Observations are not assigned to clusters definitively but instead are linked to nearby clusters with the data ultimately represented as a tree.
Hierarchical clustering
Collecting and preparing the data.
Computing a distance matrix to assess the similarity between data points.
Constructing a dendrogram to represent the distance or dissimilarity between clusters.
Deciding on a threshold for cutting the dendrogram to define the number of clusters.
Example steps in hierarchical clustering
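A minimal SciPy sketch of those steps on synthetic 2-D data (the distance metric, linkage method, and cut threshold are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])  # two synthetic groups of points

distances = pdist(X)                      # distance matrix (condensed form)
Z = linkage(distances, method="average")  # build the hierarchy of clusters
dendrogram(Z)                             # tree of merges between clusters
plt.show()
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```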
____ involves evaluating the results of your clustering to ensure they are sensible and effectively capture the natural groupings within the data.
Diagnosis
Diagnostic methods might include _____ and _____, using measures such as ____.
evaluating intra-cluster homogeneity, inter-cluster separation, silhouette scores
The ____ is a heuristic used in determining the number of clusters in a dataset.
elbow method
The idea is to run the clustering for a range of cluster values (k) and calculate the sum of squared distances from each point to its assigned center.
elbow method
When plotted, the sum of squares will decrease as k increases, but the rate of decrease will sharply change at some point, creating an _____ in the graph.
elbow
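A minimal scikit-learn sketch of the elbow method on synthetic blob data (4 true clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Sum of squared distances (inertia) for a range of k values;
# plotting k vs. inertia shows the sharp change in slope (the "elbow") near k = 4
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
```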
Why is it important to choose the size of clusters in cluster analysis?
Small clusters might be too specific and might not generalize well, while overly large clusters may be too inclusive, failing to provide useful differentiation.
Define a range of possible cluster sizes.
Use a metric (e.g., silhouette score) to quantify the performance for each size.
Opt for the size that maximizes performance according to the chosen metric.
Steps to adjust cluster size efficiently
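A minimal scikit-learn sketch of those steps using the silhouette score on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Score each candidate cluster count and keep the best-performing one
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 10)}
best_k = max(scores, key=scores.get)
```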
_____ applies to the numerical columns in your dataset.
Normalization
T/F: As a best practice, remove any duplicate data/rows you may have in your dataset.
True
You center all the values in a column around its mean with unit standard deviation, which is useful when your data has a normal or close-to-normal distribution.
Standardization
If your data distribution is heavily skewed and you have a large number of outliers, consider using a _____ to first transform your data to a distribution that looks similar to a normal distribution and then use standardization.
log transformation
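A minimal sketch of that two-step approach on a made-up, heavily skewed column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

skewed = np.array([[1.0], [2.0], [3.0], [10.0], [200.0], [5000.0]])  # heavily right-skewed

logged = np.log1p(skewed)                              # log transform pulls in the long tail
standardized = StandardScaler().fit_transform(logged)  # then center at mean 0, std 1
```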
_____ refers to simply labeling the unique values in a categorical column with integers.
Label encoding
In _____, you can convert each value to its own column and assign a 1 or 0 depending on whether that row has that value.
One-hot encoding