Big Data Projects Flashcards
Identify the steps in a data analysis project.
Conceptualization of the modeling task.
Data collection.
Data preparation and wrangling.
Data exploration.
Model training.
Conceptualization of the modeling task.
define the problem at hand
Data preparation and wrangling
cleaning the data set and preparing it for the model
Data exploration.
initial data analysis, feature selection, and feature engineering
Model training
determining the appropriate ML algorithm to use, evaluating the algorithm using a training data set, and tuning the model.
Steps of preparing and wrangling data.
This critical step involves cleansing and organizing raw data for use in a model.
Data cleansing
deals with reducing errors in the raw data
Data wrangling
involves preprocessing data for model use
Preprocessing includes
data transformation and scaling.
scaling
Conversion of data features to a common unit of measurement
Two common methods of scaling
normalization and standardization.
Normalization scales
normalized Xi = (Xi − Xmin) / (Xmax − Xmin)
Standardization Scales
standardized Xi = (Xi − μ) / σ
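The two scaling methods can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample data are made up, and it assumes the population standard deviation for σ and a feature with at least two distinct values (otherwise the denominators are zero).

```python
from statistics import mean, pstdev


def normalize(values):
    """Rescale values to the range 0..1: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Center and scale values: (Xi - mu) / sigma.

    Assumption: sigma is the population standard deviation (pstdev).
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical feature values, for illustration only.
data = [10.0, 20.0, 30.0, 40.0, 50.0]
normalized = normalize(data)      # smallest value maps to 0, largest to 1
standardized = standardize(data)  # result has mean 0 and standard deviation 1
```

Normalization is sensitive to outliers because it depends on the minimum and maximum; standardization depends on the mean and standard deviation instead.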
Text processing
cleansing and preprocessing of text-based data.
Text cleansing involves the following steps:
Remove HTML tags.
Remove punctuation.
Remove numbers.
Remove white spaces.
Cleansed text is then normalized using the following steps:
Lowercasing.
Removal of stop words
Stemming.
Lemmatization.
Stemming
converts all variations of a word into a common value
Lemmatization
conversion of inflected forms of a word into its lemma
Tokenization
is the process of splitting a sentence into tokens
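The cleansing steps, lowercasing, stop-word removal, and tokenization can be sketched with the Python standard library. This is only an illustration: the stop-word list is a tiny made-up set, the regular expressions are simple approximations, and stemming and lemmatization are left out because they usually rely on a dedicated NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def cleanse(text):
    """Apply the four cleansing steps in order."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # remove extra white space
    return text


def normalize_and_tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


raw = "<p>The price of 3 stocks   rose, and the market is up!</p>"
cleaned = cleanse(raw)
tokens = normalize_and_tokenize(cleaned)
print(cleaned)
print(tokens)
```

Splitting on white space is the simplest possible tokenizer; real projects often use tokenizers that also handle contractions, hyphens, and similar cases.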
Data exploration
seeks to evaluate the data set and determine the most appropriate way to configure it for model training
Steps in data exploration include the following:
Exploratory data analysis (EDA)
Feature selection.
Feature engineering
Exploratory data analysis (EDA)
involves looking at data descriptors such as patterns, trends, and relationships
Feature selection
is a process to select only the needed attributes of the data for ML model training.
Feature engineering
is the process of creating new features by transforming existing features
Data Exploration for Structured Data
With EDA, structured data is organized in rows (observations) and columns (features).
With feature selection, we try to select only the features that contribute to the out-of-sample predictive power of the model.
Feature Engineering (FE) involves optimizing and improving the selected features.
Model fitting errors can be caused by:
Size of the training sample.
Number of features.
The three tasks of model training are as follows:
1 - Method selection.
2 - Performance evaluation.
3 - Tuning.
Techniques to Measure Model Performance
1 - Error analysis.
2 - Receiver operating characteristic (ROC).
3 - Root mean square error (RMSE).
Error analysis.
Errors in classification problems can be false positives (type I error) or false negatives (type II error).
precision (P)
= TP / (TP + FP)
recall (R)
= TP / (TP + FN)
accuracy
= (TP + TN) / (TP + TN + FP + FN)
F1 score
= (2 × P × R) / (P + R)
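The four error-analysis metrics can be computed directly from confusion-matrix counts. The sketch below is illustrative: the function name and the counts are made up, and it assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts precision is 40/50 and recall is 40/45, so the model misses fewer actual positives than it raises false alarms; F1 sits between the two.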
Receiver operating characteristic (ROC)
Also used for classification problems, the ROC is a curve that plots the true positive rate (TPR) against the false positive rate (FPR), showing the tradeoff between FPs and TPs.
TPR
= TP / (TP + FN)
FPR
= FP / (FP + TN)
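Each point on an ROC curve is an (FPR, TPR) pair obtained at one classification threshold. The sketch below computes a few such points; the labels, scores, and thresholds are made up, and it assumes an observation is classified as positive when its score is at or above the threshold.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    return fpr, tpr


# Hypothetical true labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Lowering the threshold moves the point from (0, 0) toward (1, 1).
roc = [roc_point(labels, scores, t) for t in (1.0, 0.75, 0.5, 0.2, 0.0)]
for fpr, tpr in roc:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A curve that rises quickly toward the top-left corner (high TPR at low FPR) indicates a better classifier.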
Root mean square error (RMSE)
RMSE = √( Σ (predicted − actual)² / n )
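RMSE is the square root of the average squared difference between predicted and actual values. A minimal sketch, with a made-up function name and sample values, assuming both lists have the same nonzero length:

```python
import math


def rmse(predicted, actual):
    """Root mean square error: sqrt(sum((predicted - actual)^2) / n)."""
    n = len(actual)
    squared_errors = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_errors / n)


# Hypothetical predictions and observed values.
predicted = [2.0, 4.0, 6.0, 8.0]
actual = [1.0, 4.0, 7.0, 10.0]
error = rmse(predicted, actual)
print(f"RMSE = {error:.4f}")
```

Because the errors are squared before averaging, large individual errors weigh more heavily than small ones.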
Parameters
are estimated by the model (e.g., slope coefficients in a regression model) using an optimization technique on the training sample.
Hyperparameters
are specified by ML engineers, and are not dependent on the training sample.
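The difference between a parameter and a hyperparameter can be illustrated with a one-variable regression through the origin. In this sketch the slope is the parameter estimated from the training sample, while the penalty is a hyperparameter chosen beforehand. The penalized least-squares setup, the function name, and the data are assumptions made for illustration only.

```python
def fit_slope(x, y, penalty):
    """Estimate slope b minimizing sum((y - b*x)^2) + penalty * b^2.

    The closed-form solution is b = sum(x*y) / (sum(x^2) + penalty).
    `penalty` is a hyperparameter: it is set by the user, not estimated.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


# Hypothetical training sample where y = 2x exactly.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

slope_no_penalty = fit_slope(x, y, penalty=0.0)   # ordinary least squares
slope_penalized = fit_slope(x, y, penalty=14.0)   # slope shrunk toward zero
print(slope_no_penalty, slope_penalized)
```

Tuning would mean trying several penalty values and keeping the one that performs best on data held out from training.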
Ceiling analysis
Ceiling analysis is an evaluation and tuning of each of the components in the entire model-building pipeline.