Domain 2: Exploratory Data Analysis Flashcards

1
Q

Why should you address missing data?

A

Models may fail to train properly or may produce biased predictions if missing data is not addressed.

2
Q

How do you identify missing data?

A

Various tools and functions can check for missing values - for example, pandas' isnull()/isna() combined with sum() to count missing values per column.
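A minimal sketch with pandas (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [34, None, 29], "income": [52000, 61000, None]})

# Count missing values per column
print(df.isnull().sum())

# Fraction of rows with at least one missing value
print(df.isnull().any(axis=1).mean())
```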

3
Q

What are three options to handle missing data?

A

Imputation: replace missing values with a statistic (e.g., mean, median, or mode)
Dropping: remove rows/columns with missing values
Predicting: use another ML model to predict/fill in missing values
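A sketch of the first two options with pandas and scikit-learn; the columns are made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [34, None, 29, 41], "income": [52000, 61000, None, 48000]})

# Imputation: replace missing values with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Dropping: remove any row that contains a missing value
df_dropped = df.dropna()
```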

4
Q

_____ refers to inaccurate, incomplete, or inconsistent data that can mislead or confuse your machine learning model.

A

Corrupt data

5
Q

Ways to handle corrupt data.

A

Validation rules, cleanup, and outlier detection

6
Q

_____ are commonly used words in a language that are often removed from text data before training text-based machine learning models. Examples include ‘the’, ‘is’, ‘at’, etc.

A

Stop words

7
Q

Why should you remove stop words?

A

Removing stop words helps in reducing the dimensionality of the text data and increases the model’s focus on words with more significant meaning.

8
Q

How do you remove stop words?

A

Identify them first (e.g., against a standard stop-word list), then filter them out of the text
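A sketch of this in Python using scikit-learn's built-in English stop-word list (NLTK's stopwords corpus is a common alternative):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "the cat is at the door"

# Identify and filter out stop words, keeping the remaining tokens
tokens = [t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]
print(tokens)  # ['cat', 'door']
```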

9
Q

_____ is about structuring data in a way that is suitable for the problem at hand and for the machine learning model to interpret.

A

Data formatting

10
Q

What kind of service can help transform and format data?

A

AWS Glue

11
Q

_____ is the process of scaling individual samples to have unit norm. This process can improve the performance of machine learning models.

A

Normalization
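Scikit-learn's normalize implements this sample-wise scaling; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0], [1.0, 2.0]])

# Scale each sample (row) to unit L2 norm
X_unit = normalize(X, norm="l2")
print(np.linalg.norm(X_unit, axis=1))  # [1. 1.]
```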

12
Q

_____ is a strategy used to increase the diversity of data available for training models without actually collecting new data. This technique is particularly useful for tasks such as image and speech recognition, where input data can be modified slightly to create new training examples.

A

Data augmentation
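A toy sketch with NumPy; production pipelines usually rely on library transforms (random crops, rotations, color jitter):

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # placeholder image, height x width x channels

# Horizontal flip creates a new training example from the same image
flipped = image[:, ::-1, :]

# Adding small random noise is another simple augmentation
noisy = image + np.random.normal(0, 0.01, image.shape)
```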

13
Q

_____ is crucial when different features have different ranges. This ensures that the model treats each feature equally.

A

Feature scaling

14
Q

What are two common methods of scaling?

A

Min-Max Scaling (Normalization) - transforms features by scaling each one to a given range (commonly [0, 1])
Standardization (z-score normalization) - centers each feature column at mean 0 with standard deviation 1, the parameters of a standard normal distribution
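Both methods are one-liners in scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Min-Max Scaling: each feature mapped to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature centered at mean 0 with unit standard deviation
X_std = StandardScaler().fit_transform(X)
```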

15
Q

T/F: Sometimes, you may have to label some data using human experts prior to developing an ML model.

A

True - in a case where you don’t have sufficient labeled data

16
Q

What tool has an auto-labeling feature?

A

Amazon SageMaker Ground Truth - its automated data labeling feature uses machine learning to label data, routing only low-confidence items to human labelers (e.g., via Mechanical Turk)

17
Q

In natural language processing (NLP), text data requires conversion into _____ format before being input to machine learning algorithms.

A

numerical

18
Q

What are some common techniques to convert text data into numerical format?

A

Bag of Words (BoW): Represents text by the frequency of each word.
Term Frequency-Inverse Document Frequency (TF-IDF): Considers the frequency of a term in relation to its frequency across multiple documents, reducing the influence of common terms.
Word Embeddings: Techniques such as Word2Vec or GloVe represent words in a high-dimensional space where the distance between words conveys semantic similarity.
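A sketch of the first two techniques with scikit-learn (embeddings usually come from pretrained models rather than being computed inline):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag of Words: raw term counts per document
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: counts reweighted by inverse document frequency,
# down-weighting terms like "the" that appear in every document
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.toarray())
print(tfidf.toarray().round(2))
```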

19
Q

_____ data can be represented as a time-series of audio signals.

A

Speech

20
Q

Common techniques to extract features include:

A

Mel-Frequency Cepstral Coefficients (MFCCs): Captures the short-term power spectrum of sound.
Spectrogram: A visual way to represent the signal strength, or “loudness”, of a signal over time at various frequencies that are present in a waveform.
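A sketch using the third-party librosa library, a common choice for audio feature extraction; the file path is a placeholder:

```python
import numpy as np
import librosa

# Load an audio file (path is hypothetical)
y, sr = librosa.load("speech_sample.wav")

# MFCCs: 13 coefficients per frame capturing the short-term power spectrum
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Spectrogram: magnitude of the short-time Fourier transform, in decibels
spectrogram_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)))
```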

21
Q

_____ data requires features that can capture the visual information contained within.

A

Image

22
Q

Techniques for feature extraction from image data:

A

Color Histograms: Captures the distribution of colors in an image.
Edge Detection: Detects significant transitions in color.
Convolutional Neural Networks (CNNs): Automatically discover the internal features from raw images.
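A sketch of the first technique using plain NumPy (edge detection and CNNs would typically use OpenCV or a deep learning framework):

```python
import numpy as np

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # placeholder RGB image

# Color histogram: distribution of pixel intensities per channel
hist_per_channel = [
    np.histogram(image[:, :, c], bins=16, range=(0, 256))[0]
    for c in range(3)
]
features = np.concatenate(hist_per_channel)  # 48-dimensional feature vector
```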

23
Q

ImageNet: Used for training computer vision models.
LibriSpeech: An audio corpus for speech recognition research.
The Universal Dependencies: A collection of annotated text corpora in over 70 languages.

A

Popular public datasets

24
Q

When working with public datasets, feature extraction methods are typically determined by _____.

A

the nature of the data provided

25
Q

It involves choosing a subset of relevant features for model training.

A

Feature selection

26
Q

Techniques for feature selection

A

Filter Methods: Use statistical measures to score the relevance of features.
Wrapper Methods: Evaluate multiple models and select the best subset of features.
Embedded Methods: Perform feature selection as part of the model training process (e.g., regularization).
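Sketches of a filter method and a wrapper method with scikit-learn, using the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score features with an ANOVA F-test, keep the top 2
X_filtered = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)  # boolean mask of selected features
```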

27
Q

_____ is integral to preparing data for machine learning, as it helps transform raw data into manageable groups of inputs.

A

Feature extraction

28
Q

Involves transforming raw data into features that better represent the underlying problem to predictive models, leading to improved model accuracy on unseen data.

A

Feature engineering

29
Q

_____, also referred to as discretization, involves dividing continuous features into discrete bins or intervals, which can often lead to better performance for certain machine learning models that work better with categorical data.

A

Binning
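A minimal sketch with pandas, using made-up ages:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 67, 71])

# Equal-width bins: each interval spans the same range of values
width_bins = pd.cut(ages, bins=4)

# Equal-frequency (quantile) bins: each bin holds roughly the same count
quantile_bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```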

30
Q

In the context of text processing, _____ is the process of splitting text into individual terms or tokens. This is a critical step in the natural language processing (NLP) pipeline as it helps in preparing the text for embedding or feature extraction.

A

tokenization
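A minimal sketch using only the standard library; real pipelines often use NLTK or spaCy tokenizers:

```python
import re

text = "Tokenization splits text into individual terms, or tokens."

# Simple word-level tokenization: lowercase, then extract word characters
tokens = re.findall(r"[a-z0-9']+", text.lower())
print(tokens)  # ['tokenization', 'splits', 'text', ...]
```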

31
Q

_____ are data points that fall far away from the majority of the data. They can skew the results of data analysis and model training. Handling them is essential to prevent them from having an undue influence on the model’s performance.

A

Outliers

32
Q

Detection methods for outliers:

A

Standard deviation from the mean and interquartile range (IQR)
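A sketch of both detection methods with NumPy, using a made-up sample:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Standard-deviation rule: flag points more than 2 standard deviations from the mean
z_outliers = np.abs(x - x.mean()) > 2 * x.std()

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```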

33
Q

_____ are new features that are created from one or more existing features, typically to provide additional context to a model, or to highlight relationships between features that may not be readily apparent.

A

Synthetic features

34
Q

_____ is a technique used to convert categorical variables into a form that could be provided to ML algorithms to do a better job in prediction. It creates a binary vector for each category of the feature.

A

One-hot encoding
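A minimal sketch with pandas:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```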

35
Q

_____ involves techniques that reduce the number of input variables in a dataset. When the number of variables is high, datasets can be problematic for machine learning models—a phenomenon often referred to as the “curse of ______.”

A

Dimensionality reduction, dimensionality

36
Q

Principal Component Analysis (PCA): A statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables known as principal components.

t-Distributed Stochastic Neighbor Embedding (t-SNE): A nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

Feature selection techniques: Methods such as backward elimination, forward selection, and recursive feature elimination help in selecting the most important features for the model.

A

Techniques for reducing dimensionality of data
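A sketch of the first technique with scikit-learn on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project 4-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance each component retains
print(pca.explained_variance_ratio_)
```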

37
Q

T/F: binning might improve a model’s performance for a continuous variable that has a nonlinear relationship with the target variable, but it may also result in the loss of information.

A

True

38
Q

T/F: Tokenization is essential for text analysis but not applicable for numerical data.

A

True

39
Q

T/F: Handling outliers is crucial, but one must decide whether to remove them or adjust them based on the context.

A

True

40
Q

T/F: Synthetic features can enhance model performance, but creating too many can lead to overfitting.

A

True

41
Q

T/F: One-hot encoding can introduce sparsity into the dataset, which might not be optimal for all models

A

True

42
Q

T/F: Dimensionality reduction techniques like PCA can remove noise and reduce overfitting but can make the interpretation of the model more challenging.

A

True

43
Q

_____ are used to display the relationship between two continuous variables. Each point on the graph represents the values of two variables for a particular observation.

A

Scatter plots

44
Q

_____ graphs are used to represent data points collected or recorded at many successive times, often with equal intervals.

A

Time series

45
Q

_____ are used to show the distribution of a dataset: how many times each value appears (i.e., frequency).

A

Histograms

46
Q

_____ (also known as box-and-whisker plots) are used to show the distribution of quantitative data and to highlight the median, quartiles, and outliers within the dataset.

A

Box plots
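A matplotlib sketch producing all four chart types from the preceding cards, with synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, y)                          # scatter plot: two continuous variables
axes[0, 1].plot(np.cumsum(rng.normal(size=100)))  # time series: successive observations
axes[1, 0].hist(x, bins=20)                       # histogram: frequency distribution
axes[1, 1].boxplot([x, y])                        # box plot: median, quartiles, outliers
plt.tight_layout()
plt.show()
```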

47
Q

_____ can be used for data pre-processing, data exploration, and visualization with the same plotting libraries that we’ve mentioned here or through other visualization tools like Seaborn, which works on top of Matplotlib.

A

Jupyter notebooks

48
Q

_____ summarize the main features of a data set in quantitative terms.

A

Descriptive statistics

49
Q

What kinds of measures might descriptive statistics include?

A

Mean, median, and mode - depict the center of the data
Standard deviation and variance - indicate how spread out the data is
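In pandas, describe() reports most of these measures in one call:

```python
import pandas as pd

df = pd.DataFrame({"height": [160, 172, 168, 181, 175]})

# Count, mean, standard deviation, min/max, and quartiles per numeric column
print(df.describe())

# Individual measures
print(df["height"].mean(), df["height"].median(), df["height"].var())
```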

50
Q

_____ measures the strength and direction of the linear relationship between two variables.

A

Correlation

51
Q

A correlation coefficient ranges between _____.

A

-1 and 1

52
Q

A correlation coefficient close to 1 implies a _____ correlation (as one variable increases, the other tends to also increase)

A

strong positive

53
Q

A correlation coefficient close to -1 implies a _____ correlation (as one variable increases, the other tends to decrease).

A

strong negative

54
Q

A correlation close to 0 suggests _______.

A

no linear relationship
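A sketch computing pairwise correlation coefficients with pandas (the columns are contrived to show the two cases):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 8, 10],  # increases strongly with x
    "z": [3, 9, 1, 7, 5],   # little linear relationship with x
})

# Pearson correlation matrix; values range between -1 and 1
print(df.corr())
```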

55
Q

The _____ is used in the context of hypothesis testing to measure the strength of the evidence against the null hypothesis. It quantifies the probability of observing the given sample data, or something more extreme, assuming the null hypothesis is true.

A

p-value

56
Q

T/F: Summary statistics can also aid in feature engineering by providing insights into variable scales and distributions that can be normalized or standardized prior to modeling.

A

True

57
Q

_____ is a form of unsupervised learning that is used to find structure in a dataset.

A

Cluster analysis

58
Q

_____ is a method of cluster analysis that seeks to build a hierarchy of clusters. Observations are not assigned to clusters definitively but instead are linked to nearby clusters with the data ultimately represented as a tree.

A

Hierarchical clustering

59
Q

Collecting and preparing the data.
Computing a distance matrix to assess the similarity between data points.
Constructing a dendrogram to represent the distance or dissimilarity between clusters.
Deciding on a threshold for cutting the dendrogram to define the number of clusters.

A

Example steps in hierarchical clustering
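These steps map directly onto SciPy's hierarchical clustering functions; a sketch with placeholder data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # placeholder data

# Compute linkages (pairwise distances handled internally) with Ward's method
Z = linkage(X, method="ward")

# Dendrogram represents the dissimilarity at which clusters merge
dendrogram(Z)

# Cut the tree at a threshold to obtain a fixed number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```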

60
Q

____ involves evaluating the results of your clustering to ensure they are sensible and effectively capture the natural groupings within the data.

A

Diagnosis

61
Q

Diagnostic methods might include _____ and _____, using measures such as ____.

A

evaluating intra-cluster homogeneity, inter-cluster separation, silhouette scores

62
Q

The ____ is a heuristic used in determining the number of clusters in a dataset.

A

elbow method

63
Q

The idea is to run the clustering for a range of cluster values (k) and calculate the sum of squared distances from each point to its assigned center.

A

elbow method

64
Q

When plotted, the sum of squares will decrease as k increases, but the rate of decrease will sharply change at some point, creating an _____ in the graph.

A

elbow
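A sketch of the elbow method with scikit-learn's KMeans, whose inertia_ attribute is exactly this sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data

# Sum of squared distances for a range of k; look for the sharp bend when plotted
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
]
print(inertias)
```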

65
Q

Why is it important to choose the size of clusters in cluster analysis?

A

Small clusters might be too specific and might not generalize well, while overly large clusters may be too inclusive, failing to provide useful differentiation.

66
Q

Define a range of possible cluster sizes.
Use a metric (e.g., silhouette score) to quantify the performance for each size.
Opt for the size that maximizes performance according to the chosen metric.

A

Steps to adjust cluster size efficiently
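The same loop with the silhouette score as the metric, sketched with scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data

# Score each candidate cluster count and keep the best
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```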

67
Q

_____ applies to the numerical columns in your dataset.

A

Normalization

68
Q

T/F: As a best practice, remove any duplicate data/rows you may have in your dataset.

A

True

69
Q

You center all the values for a column around its mean with unit standard deviation, which is useful when your data has a normal or close-to-normal distribution.

A

Standardization

70
Q

If your data distribution is heavily skewed and you have a large number of outliers, consider using a _____ to first transform your data to a distribution that looks similar to a normal distribution and then use standardization.

A

log transformation
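A sketch with NumPy and scikit-learn (log1p handles zeros; the input values must be non-negative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])  # heavily skewed

# Log transformation first, then standardization
x_log = np.log1p(x)
x_scaled = StandardScaler().fit_transform(x_log)
```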

71
Q

_____ refers to simply labeling the unique values in a categorical column with integers.

A

Label encoding
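A minimal sketch with scikit-learn's LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

# Each unique value is mapped to an integer
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)           # [2 1 0 1]
print(encoder.classes_)  # ['blue' 'green' 'red']
```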

72
Q

In _____, you can convert each value to its own column and assign a 1 or 0 depending on whether that row has that value.

A

One-hot encoding
