Big Data Projects Flashcards

1
Q

Identify the steps in a data analysis project.

A

Conceptualization of the modeling task.
Data collection.
Data preparation and wrangling.
Data exploration.
Model training.

2
Q

Conceptualization of the modeling task.

A

define the problem at hand

3
Q

Data preparation and wrangling

A

cleaning the data set and preparing it for the model

4
Q

Data exploration.

A

feature selection and engineering and initial data analysis

5
Q

Model training

A

determining the appropriate ML algorithm to use, evaluating the algorithm using a training data set, and tuning the model.

6
Q

steps of preparing and wrangling data.

A

A critical step that involves cleansing and organizing raw data for use in a model.

7
Q

Data cleansing

A

deals with reducing errors in the raw data

8
Q

Data wrangling

A

involves preprocessing data for model use

9
Q

Preprocessing includes

A

data transformation and scaling.

10
Q

scaling

A

Conversion of data features to a common unit of measurement

11
Q

Two common methods of scaling

A

normalization and standardization.

12
Q

Normalization scales

A

normalized X=(X−Xmin)/(Xmax−Xmin)
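
A minimal sketch of the normalization formula above, assuming a single numeric feature held in a plain Python list. The function name `min_max_normalize` and the sample values are hypothetical, chosen only for illustration.

```python
def min_max_normalize(values):
    """Rescale values to the [0, 1] range: (X - Xmin) / (Xmax - Xmin)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # The formula divides by zero when every value is identical.
        raise ValueError("normalization is undefined when all values are equal")
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical feature values, for illustration only.
incomes = [20, 30, 50, 100]
normalized = min_max_normalize(incomes)
print(normalized)  # [0.0, 0.125, 0.375, 1.0]
```

The smallest value maps to 0 and the largest to 1, so a single extreme observation compresses every other value toward one end of the range.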

13
Q

Standardization Scales

A

standardized Xi = (Xi − μ) / σ
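
A minimal sketch of standardization, where each value has the mean μ subtracted and is then divided by the standard deviation σ. It assumes σ is the population standard deviation (`statistics.pstdev`); the card does not say whether the population or sample version is intended. The function name and data are hypothetical.

```python
import statistics


def standardize(values):
    """Center values on the mean and scale by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standardization is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical feature values with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = standardize(scores)
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike normalization, the result is not bounded to a fixed range; it is centered on 0 with a standard deviation of 1.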

14
Q

Text processing

A

cleansing and preprocessing of text-based data.

15
Q

Text cleansing involves the following steps:

A

Remove HTML tags.
Remove punctuation.
Remove numbers.
Remove extra white spaces.
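
The four cleansing steps above can be sketched with regular expressions from Python's standard `re` module. This is one possible ordering under simple assumptions (tags are well-formed, and numbers are dropped rather than replaced by a placeholder); the function name `cleanse_text` and the sample string are hypothetical.

```python
import re


def cleanse_text(raw):
    """Apply the four cleansing steps to a raw text string."""
    text = re.sub(r"<[^>]+>", " ", raw)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation (anything that is
                                          # not a word character or white space)
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"\s+", " ", text)      # collapse runs of white space
    return text.strip()                   # drop leading/trailing white space


sample = "<p>Revenue rose 12%,   beating estimates!</p>"
cleansed = cleanse_text(sample)
print(cleansed)  # Revenue rose beating estimates
```

Each removed item is replaced by a space first, so words on either side of a tag or punctuation mark do not get glued together before the white space is collapsed.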

16
Q

Cleansed text is then normalized using the following steps:

A

Lowercasing.
Removal of stop words
Stemming.
Lemmatization.

17
Q

Stemming

A

converts all variations of a word into a common value

18
Q

Lemmatization

A

conversion of inflected forms of a word into its lemma

19
Q

tokenization

A

is the process of splitting a sentence into tokens
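
A rough sketch of how lowercasing, tokenization, stop-word removal, and stemming fit together. The stop-word list is a tiny illustrative sample, and `naive_stem` is a toy suffix stripper written for this example, not a real stemming algorithm. Lemmatization is left out because it needs a dictionary of word forms.

```python
# Tiny illustrative stop-word list; real projects use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def naive_stem(token):
    """Toy stemmer (hypothetical): strip one common suffix if enough letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize_text(cleansed):
    """Lowercase, split into tokens, drop stop words, then stem each token."""
    tokens = cleansed.lower().split()  # tokenize on white space
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = normalize_text("The analysts are revising earnings forecasts")
print(tokens)  # ['analyst', 'revis', 'earning', 'forecast']
```

The stem "revis" is not a dictionary word, which illustrates the difference from lemmatization: a lemmatizer would map "revising" to its lemma "revise".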

20
Q

Data exploration

A

seeks to evaluate the data set and determine the most appropriate way to configure it for model training

21
Q

Steps in data exploration include the following:

A

Exploratory data analysis (EDA)
Feature selection.
Feature engineering

22
Q

Exploratory data analysis (EDA)

A

involves looking at data descriptors such as summary statistics, heat maps, and word clouds

23
Q

Feature selection

A

is a process to select only the needed attributes of the data for ML model training.

24
Q

Feature engineering

A

is the process of creating new features by transforming, decomposing, or combining existing features

25
Data Exploration for Structured Data
With EDA, structured data is organized in rows (observations) and columns (features). With feature selection, we try to select only the features that contribute to the out-of-sample predictive power of the model. Feature Engineering (FE) involves optimizing and improving the selected features.
26
Model fitting errors can be caused by:
Size of the training sample (too small a sample). Number of features (too few can underfit the data; too many can overfit it).
27
The three tasks of model training are as follows:
1 - Method selection. 2 - Performance evaluation. 3 - Tuning.
28
Techniques to Measure Model Performance
1 - Error analysis. 2 - Receiver operating characteristic (ROC). 3 - Root mean square error (RMSE).
29
Error analysis.
Errors in classification problems can be false positives (type I error) or false negatives (type II error).
30
precision (P)
= TP / (TP + FP)
31
recall (R)
= TP / (TP + FN)
32
accuracy
= (TP + TN) / (TP + TN + FP + FN)
33
F1 score
= (2 × P × R) / (P + R)
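
The four error-analysis formulas above can be computed directly from confusion-matrix counts. This is a minimal sketch that assumes every denominator is non-zero; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 0.80 and recall is about 0.89; the F1 score (about 0.84) is their harmonic mean, so it sits closer to the lower of the two.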
34
Receiver operating characteristic (ROC)
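Also used for classification problems, the ROC is a curve that plots the tradeoff between false positives and true positives: the false positive rate (FPR) on the x-axis against the true positive rate (TPR) on the y-axis.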
35
TPR
= TP / (TP + FN)
36
FPR
= FP / (FP + TN)
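
A small sketch of how points on an ROC curve arise: each classification threshold yields one (FPR, TPR) pair using the two formulas above. The scores, labels, and thresholds are made-up illustration data, and the sketch assumes both classes are present so neither denominator is zero.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold; label 1 is the positive class."""
    points = []
    for threshold in thresholds:
        predicted = [s >= threshold for s in scores]
        pairs = list(zip(predicted, labels))
        tp = sum(1 for p, y in pairs if p and y == 1)
        fp = sum(1 for p, y in pairs if p and y == 0)
        fn = sum(1 for p, y in pairs if not p and y == 1)
        tn = sum(1 for p, y in pairs if not p and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels, thresholds=[0.7, 0.5, 0.2])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold classifies more observations as positive, so both rates can only stay the same or rise; here the points move from (0.00, 0.67) to (0.33, 0.67) to (0.67, 1.00).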
37
Root mean square error (RMSE)
RMSE = √( Σ (predicted − actual)² / n )
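
A minimal sketch of the RMSE calculation for continuous predictions: square each error, average over the n observations, then take the square root. The function name and the data are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predictions and outcomes; squared errors are 1, 1, 0, 4.
predicted = [2.0, 4.0, 6.0, 8.0]
actual = [1.0, 5.0, 6.0, 10.0]
error = rmse(predicted, actual)
print(round(error, 4))  # 1.2247
```

Because errors are squared before averaging, the single miss of 2 contributes more to the result than the two misses of 1 combined.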
38
Parameters
are estimated by the model (e.g., slope coefficients in a regression model) using an optimization technique on the training sample.
39
Hyperparameters
are specified by ML engineers, and are not dependent on the training sample.
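
To make the distinction concrete, here is a sketch using a no-intercept regression with an L2 (ridge) penalty, chosen only because it has a one-line solution. The slope is a parameter estimated from the training sample; the `penalty` value is a hyperparameter chosen by the modeler before fitting. The function name and data are hypothetical.

```python
def fit_slope(x, y, penalty):
    """Slope b that minimizes sum((y - b*x)^2) + penalty * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


# Hypothetical training sample in which y is exactly 2 * x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# Same data, two hyperparameter settings, two different estimated parameters.
slope_unpenalized = fit_slope(x, y, penalty=0.0)
slope_penalized = fit_slope(x, y, penalty=14.0)
print(slope_unpenalized, slope_penalized)  # 2.0 1.0
```

Tuning consists of trying different hyperparameter values such as `penalty` and comparing the resulting model performance, whereas the slope is re-estimated from the data each time.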
40
Ceiling analysis
Ceiling analysis is an evaluation and tuning of each of the components in the entire model-building pipeline.