Domain 2: Exploratory Data Analysis Flashcards
Why should you address missing data?
Models may fail to train properly, or may produce biased predictions, if missing data is not addressed.
How do you identify missing data?
Various tools and functions can check for missing values; for example, pandas provides isnull() and isna(), as sketched below.
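A minimal sketch of checking for missing values, assuming pandas and a hypothetical DataFrame (the card does not name a specific library):

import pandas as pd

# Hypothetical DataFrame with two missing values
df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 60000, None]})

# Count missing values per column: isnull() marks NaNs, sum() tallies them
print(df.isnull().sum())  # age: 1, income: 1
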
What are three options to handle missing data?
Imputation: replace missing values with a statistic such as the mean or median
Dropping: remove rows/columns with missing values
Predicting: use another ML model to predict and fill in missing values (the first two options are sketched below)
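A minimal sketch of imputation and dropping, again assuming pandas and a hypothetical DataFrame (model-based imputers such as scikit-learn's IterativeImputer cover the predicting option):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "income": [50000, 60000, None]})

# Imputation: fill each column's missing values with that column's mean
imputed = df.fillna(df.mean())

# Dropping: remove every row that contains at least one missing value
dropped = df.dropna()
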
_____ refers to inaccurate, incomplete, or inconsistent data that can mislead or confuse your machine learning model.
Corrupt data
What are some ways to handle corrupt data?
Validation rules, data cleanup, and outlier detection (an outlier-detection sketch follows)
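As one example, outliers can be flagged with a simple interquartile-range (IQR) rule; this sketch assumes pandas and made-up numbers:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 300])  # 300 looks like a data-entry error

# Flag points lying more than 1.5 IQRs outside the middle 50% of the data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # only 300 is flagged
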
_____ are commonly used words in a language that are often removed from text data before training text-based machine learning models. Examples include ‘the’, ‘is’, ‘at’, etc.
Stop words
Why should you remove stop words?
Removing stop words helps in reducing the dimensionality of the text data and increases the model’s focus on words with more significant meaning.
How do you remove stop words?
Identify them first (typically against a standard stop-word list), then filter them out of the tokenized text, as sketched below
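A minimal sketch; the tiny stop-word set here is illustrative, whereas in practice the list usually comes from a library such as NLTK or spaCy:

# Illustrative stop-word set; real lists come from NLTK, spaCy, etc.
stop_words = {"the", "is", "at", "a", "an", "on"}

text = "the cat is at the door"
tokens = text.split()

# Keep only tokens that are not stop words
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['cat', 'door']
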
_____ is about structuring data in a way that is suitable for the problem at hand and for the machine learning model to interpret.
Data formatting
What kind of service can help transform and format data?
AWS Glue
_____ is the process of scaling individual samples to have unit norm. This can improve the performance of some machine learning models.
Normalization
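A minimal sketch of unit-norm scaling, assuming scikit-learn (not named on the card) and made-up data:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Scale each row (sample) to unit L2 norm
X_unit = normalize(X, norm="l2")
print(X_unit)                          # [[0.6 0.8], [1.0 0.0]]
print(np.linalg.norm(X_unit, axis=1))  # every row now has norm 1.0
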
_____ is a strategy used to increase the diversity of data available for training models without actually collecting new data. This technique is particularly useful for tasks such as image and speech recognition, where input data can be modified slightly to create new training examples.
Data augmentation
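One of the simplest augmentations for image data is a horizontal flip; this sketch uses NumPy on a made-up array standing in for an image:

import numpy as np

# Hypothetical 2x3 grayscale "image"
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Flipping left-to-right yields a new training example with the same label
flipped = np.fliplr(image)
print(flipped)  # [[3 2 1], [6 5 4]]
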
_____ is crucial when different features have different ranges. This ensures that the model treats each feature equally.
Feature scaling
What are two common methods of scaling?
Min-Max Scaling (Normalization): transforms each feature by scaling it to a given range, typically [0, 1]
Standardization (z-score normalization): centers each feature at mean 0 with standard deviation 1 (this rescales the data but does not make it normally distributed); both methods are sketched below
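A minimal sketch of both methods, assuming scikit-learn and made-up data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Min-Max Scaling: each feature mapped to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: each feature centered at mean 0 with unit variance
print(StandardScaler().fit_transform(X))
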
T/F: Sometimes, you may have to label some data using human experts prior to developing an ML model.
True - in a case where you don’t have sufficient labeled data
What tool has an auto-labeling feature?
Amazon SageMaker Ground Truth; its automated data labeling feature labels data with a model and routes low-confidence items to human labelers, for example through Amazon Mechanical Turk
In natural language processing (NLP), text data requires conversion into _____ format before being fed to machine learning algorithms.
numerical
What are some common techniques to convert text data into numerical format?
Bag of Words (BoW): Represents text by the frequency of each word.
Term Frequency-Inverse Document Frequency (TF-IDF): Considers the frequency of a term in relation to its frequency across multiple documents, reducing the influence of common terms.
Word Embeddings: Such as Word2Vec or GloVe, represent words as dense vectors in a continuous space where the distance between words conveys semantic similarity. (BoW and TF-IDF are sketched below.)
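A minimal sketch of BoW and TF-IDF, assuming scikit-learn and three toy documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]

# Bag of Words: raw word counts per document
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: down-weights terms like "the" that appear in every document
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.toarray())
print(tfidf.toarray().round(2))
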
_____ data can be represented as a time-series of audio signals.
Speech
Common techniques for extracting features from speech data include:
Mel-Frequency Cepstral Coefficients (MFCCs): Capture the short-term power spectrum of sound (an MFCC sketch follows this list).
Spectrogram: A visual way to represent the signal strength, or “loudness”, of a signal over time at various frequencies that are present in a waveform.
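A minimal MFCC sketch, assuming the librosa library (not named on the card) and a synthetic tone in place of real speech:

import numpy as np
import librosa  # assumed library for audio feature extraction

# Synthetic one-second 440 Hz tone standing in for a speech signal
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)

# 13 MFCCs per frame summarize the short-term power spectrum
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
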
_____ data requires features that can capture the visual information contained within.
Image
Techniques for feature extraction from image data:
Color Histograms: Captures the distribution of colors in an image.
Edge Detection: Detects significant transitions in intensity or color.
Convolutional Neural Networks (CNNs): Automatically learn features from raw images. (A color-histogram sketch follows.)
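A minimal color (intensity) histogram sketch using NumPy on a random array standing in for an image:

import numpy as np

# Hypothetical 8x8 grayscale image with pixel values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8))

# Histogram of pixel intensities in 16 bins: a simple image feature vector
hist, _ = np.histogram(image, bins=16, range=(0, 256))
print(hist)  # one count per bin; counts sum to 64 pixels
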
Examples of _____ include:
ImageNet: Used for training computer vision models.
LibriSpeech: An audio corpus for speech recognition research.
Universal Dependencies: A collection of annotated text corpora in over 70 languages.
Popular public datasets
When working with public datasets, feature extraction methods are typically determined by _____.
the nature of the data provided
_____ involves choosing a subset of relevant features for model training.
Feature selection
Techniques for feature selection:
Filter Methods: Use statistical measures to score the relevance of features (a filter-method sketch follows this list).
Wrapper Methods: Evaluate multiple models and select the best subset of features.
Embedded Methods: Perform feature selection as part of the model training process (e.g., regularization).
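A minimal filter-method sketch, assuming scikit-learn and its built-in Iris dataset (wrapper methods correspond to tools like RFE, embedded methods to models like Lasso):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: score each feature with an ANOVA F-test, keep the best 2
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
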
_____ is integral to preparing data for machine learning, as it helps transform raw data into manageable groups of inputs.
Feature extraction
_____ involves transforming raw data into features that better represent the underlying problem to predictive models, leading to improved model accuracy on unseen data.
Feature engineering
_____, also referred to as discretization, involves dividing continuous features into discrete bins or intervals, which can often lead to better performance for certain machine learning models that work better with categorical data.
Binning
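A minimal binning sketch, assuming pandas and made-up ages and interval labels:

import pandas as pd

ages = pd.Series([5, 17, 25, 40, 70])

# Divide the continuous "age" feature into four discrete intervals
binned = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                labels=["child", "young adult", "middle age", "senior"])
print(binned.tolist())  # ['child', 'child', 'young adult', 'middle age', 'senior']
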