Big Data Projects Flashcards by Paul Formica

Identify the steps in a data analysis project.

Conceptualization of the modeling task.
Data collection.
Data preparation and wrangling.
Data exploration.
Model training.

Conceptualization of the modeling task.

define the problem at hand

Data preparation and wrangling

cleaning the data set and preparing it for the model

Data exploration.

feature selection and engineering and initial data analysis

Model training

determining the appropriate ML algorithm to use, evaluating the algorithm using a training data set, and tuning the model.

steps of preparing and wrangling data.

This critical step involves cleansing and organizing raw data for use in a model.

Data cleansing

deals with reducing errors in the raw data

Data wrangling

involves preprocessing data for model use

Preprocessing includes

data transformation and scaling.

scaling

Conversion of data features to a common unit of measurement

Two common methods of scaling

normalization and standardization.

Normalization scales

normalized X=(X−Xmin)/(Xmax−Xmin)
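
The normalization formula can be sketched in a few lines of Python. This is a minimal illustration, not a library routine; the function name `normalize` and the sample values are assumptions made for the example.

```python
def normalize(values):
    """Min-max normalization: rescale each value to the range [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # The formula divides by (Xmax - Xmin), so constant data cannot be scaled.
        raise ValueError("normalization is undefined when all values are equal")
    return [(x - x_min) / (x_max - x_min) for x in values]


print(normalize([10, 20, 25, 50]))  # [0.0, 0.25, 0.375, 1.0]
```

The smallest value maps to 0 and the largest to 1, so a single extreme observation compresses every other value toward one end of the range.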

Standardization Scales

standardized Xi = (Xi − μ) / σ
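
Standardization subtracts the mean and divides by the standard deviation. The sketch below uses Python's standard `statistics` module. The function name `standardize` is hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption, since the card does not say which one is intended.

```python
import statistics


def standardize(values):
    """Center each value on the mean and scale by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The sample data has mean 5 and population standard deviation 2, so each result is the number of standard deviations a value lies from the mean.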

Text processing

cleansing and preprocessing of text-based data.

Text cleansing involves the following steps:

Remove HTML tags.
Remove punctuation.
Remove numbers.
Remove white spaces.
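
The four cleansing steps above can be sketched with Python's standard `re` and `string` modules. This is a simplified illustration that deletes the unwanted characters outright; the function name `cleanse_text` and the sample sentence are assumptions, and the tag pattern is a rough approximation rather than a full HTML parser.

```python
import re
import string


def cleanse_text(raw):
    """Apply the four cleansing steps in order."""
    text = re.sub(r"<[^>]+>", " ", raw)  # 1. remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    text = re.sub(r"\d+", " ", text)  # 3. remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # 4. collapse extra white space
    return text


print(cleanse_text("<p>Revenue rose 12%,  beating   forecasts!</p>"))
# Revenue rose beating forecasts
```

Tags are removed first because the later punctuation step would strip the angle brackets and leave the tag names behind as ordinary words.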

Cleansed text is then normalized using the following steps:

Lowercasing.
Removal of stop words.
Stemming.
Lemmatization.
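
The normalization steps above can be illustrated with a toy pipeline. Everything here is an assumption made for the example: the tiny `STOP_WORDS` set, the `LEMMAS` lookup table, and the crude suffix-stripping `naive_stem` are hypothetical stand-ins for the much larger word lists and rule sets that real text-processing libraries use.

```python
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # hypothetical, tiny list
LEMMAS = {"were": "be", "was": "be", "better": "good"}  # hypothetical lookup table


def naive_stem(word):
    """Crude suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_text(cleansed):
    """Lowercase, drop stop words, then lemmatize known words and stem the rest."""
    tokens = cleansed.lower().split()  # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # removal of stop words
    return [LEMMAS[t] if t in LEMMAS else naive_stem(t) for t in tokens]


print(normalize_text("The analysts were revising earnings forecasts"))
# ['analyst', 'be', 'revis', 'earning', 'forecast']
```

The output shows the difference between the two techniques: the lookup maps "were" to the real word "be", while stemming reduces "revising" to "revis", which is not a dictionary word.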

Stemming

converts all variations of a word into a common value

Lemmatization

conversion of inflected forms of a word into its lemma

tokenization

is the process of splitting a sentence into tokens
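
A minimal sketch of tokenization, assuming that a token is simply a run of word characters. The function name `tokenize` is hypothetical, and real tokenizers handle contractions, hyphens, and symbols with more care.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens, discarding punctuation."""
    return re.findall(r"\w+", sentence)


print(tokenize("Rates rose, and bond prices fell."))
# ['Rates', 'rose', 'and', 'bond', 'prices', 'fell']
```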

Data exploration

seeks to evaluate the data set and determine the most appropriate way to configure it for model training

Steps in data exploration include the following:

Exploratory data analysis (EDA).
Feature selection.
Feature engineering.

Exploratory data analysis (EDA)

involves looking at data descriptors

Feature selection

is a process to select only the needed attributes of the data for ML model training.

Feature engineering

is the process of creating new features by transforming existing ones.

Data Exploration for Structured Data

With EDA, structured data is organized in rows (observations) and columns (features). With feature selection, we try to select only the features that contribute to the out-of-sample predictive power of the model. Feature Engineering (FE) involves optimizing and improving the selected features.

Model fitting errors can be caused by:

Size of the training sample. Number of features.

The three tasks of model training are as follows:

1 - Method selection. 2 - Performance evaluation. 3 - Tuning.

Techniques to Measure Model Performance

1 - Error analysis. 2 - Receiver operating characteristic (ROC). 3 - Root mean square error (RMSE).

Error analysis.

Errors in classification problems can be false positives (type I error) or false negatives (type II error).

precision (P)

= TP / (TP + FP)

recall (R)

= TP / (TP + FN)

accuracy

= (TP + TN) / (TP + TN + FP + FN)

F1 score

= (2 × P × R) / (P + R)
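
The four error-analysis formulas can be computed together from the confusion-matrix counts. This is a minimal sketch; the function name `classification_metrics` and the sample counts are assumptions made for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```

With these counts, precision is 40/50 = 0.8 and accuracy is 85/100 = 0.85. The sketch does not guard against zero denominators, which occur when the model never predicts the positive class or the sample contains no positives.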

Receiver operating characteristic (ROC)

Also used for classification problems, the ROC is a curve that plots the tradeoff between the false positive rate (FPR) and the true positive rate (TPR).

TPR

= TP / (TP + FN)

FPR

= FP / (FP + TN)
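
An ROC curve is traced by recomputing TPR and FPR at different classification thresholds. The sketch below assumes a model that outputs a score per observation and classifies it as positive when the score is at or above the threshold; the function name `roc_points`, the scores, and the labels are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold; labels are 1 (positive) or 0 (negative)."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # hypothetical model scores
labels = [1, 1, 0, 1, 0, 0]              # hypothetical true classes
for fpr, tpr in roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold of 0.0 labels everything positive, giving the point (1, 1), and a threshold of 1.0 labels everything negative, giving (0, 0). Intermediate thresholds fill in the curve between those corners.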

Root mean square error (RMSE)

RMSE = √( Σ (predicted − actual)² / n )
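
As a small worked example, RMSE can be computed as the square root of the mean squared difference between predictions and actual values. The function name `rmse` and the sample numbers are assumptions made for the illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over n paired observations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    squared_errors = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_errors / n)


# Errors are 1, 0, and -2, so RMSE = sqrt((1 + 0 + 4) / 3).
print(round(rmse([3, 5, 7], [2, 5, 9]), 4))  # 1.291
```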

Parameters

are estimated by the model (e.g., slope coefficients in a regression model) using an optimization technique on the training sample.

Hyperparameters

are specified by ML engineers, and are not dependent on the training sample.
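
The distinction between parameters and hyperparameters can be seen in a toy regression fitted by gradient descent. This is an illustrative sketch only: the function name `fit_slope`, the no-intercept model, and the default settings are assumptions. The slope is a parameter estimated from the training sample, while `learning_rate` and `epochs` are hyperparameters chosen before training.

```python
def fit_slope(xs, ys, learning_rate=0.01, epochs=500):
    """Fit y = slope * x by gradient descent on mean squared error."""
    slope = 0.0  # parameter: estimated from the training sample
    n = len(xs)
    for _ in range(epochs):  # hyperparameter: number of passes over the data
        gradient = sum(2 * (slope * x - y) * x for x, y in zip(xs, ys)) / n
        slope -= learning_rate * gradient  # hyperparameter: step size
    return slope


# Hypothetical training sample generated by y = 2x.
print(round(fit_slope([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 2.0
```

Changing the training sample changes the estimated slope, whereas the learning rate and epoch count stay whatever the engineer set them to. Tuning means trying different hyperparameter values and comparing the resulting model performance.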

Ceiling analysis

Ceiling analysis is an evaluation and tuning of each of the components in the entire model-building pipeline.
