Domain 2: Exploratory Data Analysis Flashcards

1
Q

Why should you address missing data?

A

Models may fail to train properly or may produce biased predictions if missing data is not addressed.

2
Q

How do you identify missing data?

A

Various tools and functions can check for missing values - for example, pandas' isnull()/isna() combined with sum() to count missing values per column.
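A minimal sketch with pandas (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [34, None, 29], "income": [52000, 61000, None]})

# Count missing values per column
print(df.isnull().sum())

# Fraction of rows with at least one missing value
print(df.isnull().any(axis=1).mean())
```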

3
Q

What are three options to handle missing data?

A

Imputation: replace missing values with a statistic (e.g., mean, median, or mode)
Dropping: remove rows/columns with missing values
Predicting: use another ML model to predict/fill in missing values
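A sketch of the first two options with pandas and scikit-learn; the columns are made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [34, None, 29, 41], "income": [52000, 61000, None, 48000]})

# Imputation: replace missing values with the column median
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Dropping: remove any row that contains a missing value
df_dropped = df.dropna()
```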

4
Q

_____ refers to inaccurate, incomplete, or inconsistent data that can mislead or confuse your machine learning model.

A

Corrupt data

5
Q

Ways to handle corrupt data.

A

Validation rules, cleanup, and outlier detection

6
Q

_____ are commonly used words in a language that are often removed from text data before training text-based machine learning models. Examples include ‘the’, ‘is’, ‘at’, etc.

A

Stop words

7
Q

Why should you remove stop words?

A

Removing stop words helps in reducing the dimensionality of the text data and increases the model’s focus on words with more significant meaning.

8
Q

How do you remove stop words?

A

Identify them first (e.g., against a standard stop-word list), then filter them out of the text
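A sketch of this in Python using scikit-learn's built-in English stop-word list (NLTK's stopwords corpus is a common alternative):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "the cat is at the door"

# Identify and filter out stop words, keeping the remaining tokens
tokens = [t for t in text.lower().split() if t not in ENGLISH_STOP_WORDS]
print(tokens)  # ['cat', 'door']
```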

9
Q

_____ is about structuring data in a way that is suitable for the problem at hand and for the machine learning model to interpret.

A

Data formatting

10
Q

What kind of service can help transform and format data?

A

AWS Glue

11
Q

_____ is the process of scaling individual samples to have unit norm. This process can improve the performance of machine learning models.

A

Normalization
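Scikit-learn's normalize implements this sample-wise scaling; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0], [1.0, 2.0]])

# Scale each sample (row) to unit L2 norm
X_unit = normalize(X, norm="l2")
print(np.linalg.norm(X_unit, axis=1))  # [1. 1.]
```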

12
Q

_____ is a strategy used to increase the diversity of data available for training models without actually collecting new data. This technique is particularly useful for tasks such as image and speech recognition, where input data can be modified slightly to create new training examples.

A

Data augmentation
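A toy sketch with NumPy; production pipelines usually rely on library transforms (random crops, rotations, color jitter):

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # placeholder image, height x width x channels

# Horizontal flip creates a new training example from the same image
flipped = image[:, ::-1, :]

# Adding small random noise is another simple augmentation
noisy = image + np.random.normal(0, 0.01, image.shape)
```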

13
Q

_____ is crucial when different features have different ranges. This ensures that the model treats each feature equally.

A

Feature scaling

14
Q

What are two common methods of scaling?

A

Min-Max Scaling (Normalization) - transforms features by scaling each one to a given range (commonly [0, 1])
Standardization (z-score normalization) - centers each feature column at mean 0 with standard deviation 1, the parameters of a standard normal distribution
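Both methods are one-liners in scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Min-Max Scaling: each feature mapped to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature centered at mean 0 with unit standard deviation
X_std = StandardScaler().fit_transform(X)
```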

15
Q

T/F: Sometimes, you may have to label some data using human experts prior to developing an ML model.

A

True - in a case where you don’t have sufficient labeled data

16
Q

What tool has an auto-labeling feature?

A

Amazon SageMaker Ground Truth - its automated data labeling feature uses machine learning to label data, routing only low-confidence items to human labelers (e.g., via Mechanical Turk)

17
Q

In natural language processing (NLP), text data requires conversion into _____ format before being input to machine learning algorithms.

A

numerical

18
Q

What are some common techniques to convert text data into numerical format?

A

Bag of Words (BoW): Represents text by the frequency of each word.
Term Frequency-Inverse Document Frequency (TF-IDF): Considers the frequency of a term in relation to its frequency across multiple documents, reducing the influence of common terms.
Word Embeddings: Techniques such as Word2Vec or GloVe represent words in a high-dimensional space where the distance between words conveys semantic similarity.
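A sketch of the first two techniques with scikit-learn (embeddings usually come from pretrained models rather than being computed inline):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Bag of Words: raw term counts per document
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: counts reweighted by inverse document frequency,
# down-weighting terms like "the" that appear in every document
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.toarray())
print(tfidf.toarray().round(2))
```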

19
Q

_____ data can be represented as a time-series of audio signals.

A

Speech

20
Q

Common techniques to extract features include:

A

Mel-Frequency Cepstral Coefficients (MFCCs): Captures the short-term power spectrum of sound.
Spectrogram: A visual way to represent the signal strength, or “loudness”, of a signal over time at various frequencies that are present in a waveform.
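A sketch using the third-party librosa library, a common choice for audio feature extraction; the file path is a placeholder:

```python
import numpy as np
import librosa

# Load an audio file (path is hypothetical)
y, sr = librosa.load("speech_sample.wav")

# MFCCs: 13 coefficients per frame capturing the short-term power spectrum
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Spectrogram: magnitude of the short-time Fourier transform, in decibels
spectrogram_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)))
```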

21
Q

_____ data requires features that can capture the visual information contained within.

A

Image

22
Q

Techniques for feature extraction from image data:

A

Color Histograms: Captures the distribution of colors in an image.
Edge Detection: Detects significant transitions in color.
Convolutional Neural Networks (CNNs): Automatically discover the internal features from raw images.
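A sketch of the first technique using plain NumPy (edge detection and CNNs would typically use OpenCV or a deep learning framework):

```python
import numpy as np

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)  # placeholder RGB image

# Color histogram: distribution of pixel intensities per channel
hist_per_channel = [
    np.histogram(image[:, :, c], bins=16, range=(0, 256))[0]
    for c in range(3)
]
features = np.concatenate(hist_per_channel)  # 48-dimensional feature vector
```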

23
Q

ImageNet: Used for training computer vision models.
LibriSpeech: An audio corpus for speech recognition research.
The Universal Dependencies: A collection of annotated text corpora in over 70 languages.

A

Popular public datasets

24
Q

When working with public datasets, feature extraction methods are typically determined by _____.

A

the nature of the data provided

25
Q

It involves choosing a subset of relevant features for model training.

A

Feature selection

26
Q

Techniques for feature selection

A

Filter Methods: Use statistical measures to score the relevance of features.
Wrapper Methods: Evaluate multiple models and select the best subset of features.
Embedded Methods: Perform feature selection as part of the model training process (e.g., regularization).
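Sketches of a filter method and a wrapper method with scikit-learn, using the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score features with an ANOVA F-test, keep the top 2
X_filtered = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)  # boolean mask of selected features
```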

27
Q

_____ is integral to preparing data for machine learning, as it helps transform raw data into manageable groups of inputs.

A

Feature extraction

28
Q

Involves transforming raw data into features that better represent the underlying problem to predictive models, leading to improved model accuracy on unseen data.

A

Feature engineering

29
Q

_____, also referred to as discretization, involves dividing continuous features into discrete bins or intervals, which can often lead to better performance for certain machine learning models that work better with categorical data.

A

Binning
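A minimal sketch with pandas, using made-up ages:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 67, 71])

# Equal-width bins: each interval spans the same range of values
width_bins = pd.cut(ages, bins=4)

# Equal-frequency (quantile) bins: each bin holds roughly the same count
quantile_bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```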

30
Q

In the context of text processing, _____ is the process of splitting text into individual terms or tokens. This is a critical step in the natural language processing (NLP) pipeline as it helps in preparing the text for embedding or feature extraction.

A

tokenization
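A minimal sketch using only the standard library; real pipelines often use NLTK or spaCy tokenizers:

```python
import re

text = "Tokenization splits text into individual terms, or tokens."

# Simple word-level tokenization: lowercase, then extract word characters
tokens = re.findall(r"[a-z0-9']+", text.lower())
print(tokens)  # ['tokenization', 'splits', 'text', ...]
```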

31
Q

_____ are data points that fall far away from the majority of the data. They can skew the results of data analysis and model training. Handling them is essential to prevent them from having an undue influence on the model’s performance.

A

Outliers

32
Q

Detection methods for outliers:

A

Standard deviation from the mean and interquartile range (IQR)
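A sketch of both detection methods with NumPy, using a made-up sample:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Standard-deviation rule: flag points more than 2 standard deviations from the mean
z_outliers = np.abs(x - x.mean()) > 2 * x.std()

# IQR rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```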

33
Q

_____ are new features that are created from one or more existing features, typically to provide additional context to a model, or to highlight relationships between features that may not be readily apparent.

A

Synthetic features

34
Q

_____ is a technique used to convert categorical variables into a form that could be provided to ML algorithms to do a better job in prediction. It creates a binary vector for each category of the feature.

A

One-hot encoding
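A minimal sketch with pandas:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```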

35
Q

_____ involves techniques that reduce the number of input variables in a dataset. When the number of variables is high, datasets can be problematic for machine learning models—a phenomenon often referred to as the “curse of ______.”

A

Dimensionality reduction, dimensionality

36
Q

Principal Component Analysis (PCA): A statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables known as principal components.

t-Distributed Stochastic Neighbor Embedding (t-SNE): A nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

Feature selection techniques: Methods such as backward elimination, forward selection, and recursive feature elimination help in selecting the most important features for the model.

A

Techniques for reducing dimensionality of data
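A sketch of the first technique with scikit-learn on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project 4-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance each component retains
print(pca.explained_variance_ratio_)
```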

37
Q

T/F: binning might improve a model’s performance for a continuous variable that has a nonlinear relationship with the target variable, but it may also result in the loss of information.

A

True

38
Q

T/F: Tokenization is essential for text analysis but not applicable for numerical data.

A

True

39
Q

T/F: Handling outliers is crucial, but one must decide whether to remove them or adjust them based on the context.

A

True

40
Q

T/F: Synthetic features can enhance model performance, but creating too many can lead to overfitting.

A

True

41
Q

T/F: One-hot encoding can introduce sparsity into the dataset, which might not be optimal for all models

A

True

42
Q

T/F: Dimensionality reduction techniques like PCA can remove noise and reduce overfitting but can make the interpretation of the model more challenging.

A

True

43
Q

_____ are used to display the relationship between two continuous variables. Each point on the graph represents the values of two variables for a particular observation.

A

Scatter plots

44
Q

_____ graphs are used to represent data points collected or recorded at many successive times, often with equal intervals.

A

Time series

45
Q

_____ are used to show the distribution of a dataset: how many times each value appears (i.e., frequency).

A

Histograms

46
Q

_____ (also known as box-and-whisker plots) are used to show the distribution of quantitative data and to highlight the median, quartiles, and outliers within the dataset.

A

Box plots
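A matplotlib sketch producing all four chart types from the preceding cards, with synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, y)                          # scatter plot: two continuous variables
axes[0, 1].plot(np.cumsum(rng.normal(size=100)))  # time series: successive observations
axes[1, 0].hist(x, bins=20)                       # histogram: frequency distribution
axes[1, 1].boxplot([x, y])                        # box plot: median, quartiles, outliers
plt.tight_layout()
plt.show()
```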

47
Q

_____ can be used for data pre-processing, data exploration, and visualization with the same plotting libraries that we’ve mentioned here or through other visualization tools like Seaborn, which works on top of Matplotlib.

A

Jupyter notebooks

48
Q

_____ summarize the main features of a data set in quantitative terms.

A

Descriptive statistics

49
Q

What kinds of measures might descriptive statistics include?

A

Mean, median, and mode - depict the center of the data
Standard deviation and variance - indicate how spread out the data is
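In pandas, describe() reports most of these measures in one call:

```python
import pandas as pd

df = pd.DataFrame({"height": [160, 172, 168, 181, 175]})

# Count, mean, standard deviation, min/max, and quartiles per numeric column
print(df.describe())

# Individual measures
print(df["height"].mean(), df["height"].median(), df["height"].var())
```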

50
Q

_____ measures the strength and direction of the linear relationship between two variables.

A

Correlation

51
Q

A correlation coefficient ranges between _____.

A

-1 and 1

52
Q

A correlation coefficient close to 1 implies a _____ correlation (as one variable increases, the other tends to also increase)

A

strong positive

53
Q

A correlation coefficient close to -1 implies a _____ correlation (as one variable increases, the other tends to decrease).

A

strong negative

54
Q

A correlation close to 0 suggests _______.

A

no linear relationship
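A sketch computing pairwise correlation coefficients with pandas (the columns are contrived to show the two cases):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 8, 10],  # increases strongly with x
    "z": [3, 9, 1, 7, 5],   # little linear relationship with x
})

# Pearson correlation matrix; values range between -1 and 1
print(df.corr())
```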

55
Q

The _____ is used in the context of hypothesis testing to measure the strength of the evidence against the null hypothesis. It quantifies the probability of observing the given sample data, or something more extreme, assuming the null hypothesis is true.

A

p-value

56
Q

T/F: Summary statistics can also aid in feature engineering by providing insights into variable scales and distributions that can be normalized or standardized prior to modeling.

A

True

57
Q

_____ is a form of unsupervised learning that is used to find structure in a dataset.

A

Cluster analysis

58
Q

_____ is a method of cluster analysis that seeks to build a hierarchy of clusters. Observations are not assigned to clusters definitively but instead are linked to nearby clusters with the data ultimately represented as a tree.

A

Hierarchical clustering

59
Q

Collecting and preparing the data.
Computing a distance matrix to assess the similarity between data points.
Constructing a dendrogram to represent the distance or dissimilarity between clusters.
Deciding on a threshold for cutting the dendrogram to define the number of clusters.

A

Example steps in hierarchical clustering
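These steps map directly onto SciPy's hierarchical clustering functions; a sketch with placeholder data:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # placeholder data

# Compute linkages (pairwise distances handled internally) with Ward's method
Z = linkage(X, method="ward")

# Dendrogram represents the dissimilarity at which clusters merge
dendrogram(Z)

# Cut the tree at a threshold to obtain a fixed number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```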

60
Q

____ involves evaluating the results of your clustering to ensure they are sensible and effectively capture the natural groupings within the data.

A

Diagnosis

61
Q

Diagnostic methods might include _____ and _____, using measures such as ____.

A

evaluating intra-cluster homogeneity, inter-cluster separation, silhouette scores

62
Q

The ____ is a heuristic used in determining the number of clusters in a dataset.

A

elbow method

63
Q

The idea is to run the clustering for a range of cluster values (k) and calculate the sum of squared distances from each point to its assigned center.

A

elbow method

64
Q

When plotted, the sum of squares will decrease as k increases, but the rate of decrease will sharply change at some point, creating an _____ in the graph.

A

elbow
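A sketch of the elbow method with scikit-learn's KMeans, whose inertia_ attribute is exactly this sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data

# Sum of squared distances for a range of k; look for the sharp bend when plotted
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
]
print(inertias)
```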

65
Q

Why is it important to choose the size of clusters in cluster analysis?

A

Small clusters might be too specific and might not generalize well, while overly large clusters may be too inclusive, failing to provide useful differentiation.

66
Q

Define a range of possible cluster sizes.
Use a metric (e.g., silhouette score) to quantify the performance for each size.
Opt for the size that maximizes performance according to the chosen metric.

A

Steps to adjust cluster size efficiently
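The same loop with the silhouette score as the metric, sketched with scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data

# Score each candidate cluster count and keep the best
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```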

67
Q

_____ applies to the numerical columns in your dataset.

A

Normalization

68
Q

T/F: As a best practice, remove any duplicate data/rows you may have in your dataset.

A

True

69
Q

You center all the values for a column around its mean with unit standard deviation, which is useful when your data has a normal or close-to-normal distribution.

A

Standardization

70
Q

If your data distribution is heavily skewed and you have a large number of outliers, consider using a _____ to first transform your data to a distribution that looks similar to a normal distribution and then use standardization.

A

log transformation
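A sketch with NumPy and scikit-learn (log1p handles zeros; the input values must be non-negative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])  # heavily skewed

# Log transformation first, then standardization
x_log = np.log1p(x)
x_scaled = StandardScaler().fit_transform(x_log)
```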

71
Q

_____ refers to simply labeling the unique values in a categorical column with integers.

A

Label encoding
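A minimal sketch with scikit-learn's LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

# Each unique value is mapped to an integer
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)           # [2 1 0 1]
print(encoder.classes_)  # ['blue' 'green' 'red']
```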

72
Q

In _____, you can convert each value to its own column and assign a 1 or 0 depending on whether that row has that value.

A

One-hot encoding
