Big Data Projects Flashcards

1
Q

Steps in Big Data Analysis/Projects: Traditional with structured data.

A

**Conceptualize the task** -> Collect data -> Data preparation & processing -> Data exploration -> Model training.

2
Q

Steps in Big Data Analysis/Projects: Textual Big Data.

A

Text problem formulation -> Data curation -> **Text preparation and processing** -> Text exploration -> Classifier output.

3
Q

Preparation in structured data: Extraction

A

Creating a new variable from an already existing one for easing the analysis.

Example: Date of birth -> Age
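
The date of birth example can be sketched in Python as follows. This is only an illustration: the function name `age_on` and the sample dates are assumptions, not part of the original card.

```python
from datetime import date


def age_on(date_of_birth: date, as_of: date) -> int:
    """Extract an age in whole years from a date of birth (hypothetical helper)."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# One day before the 34th birthday the extracted age is still 33.
print(age_on(date(1990, 6, 15), date(2024, 6, 14)))  # 33
```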

4
Q

Preparation in structured data: Aggregation

A

Two or more variables aggregated into one single variable.

5
Q

Preparation in structured data: Filtration

A

Eliminate data rows which are not needed.

[We filter out the information that is not relevant]

Example: keep CFA Level II candidates only.

6
Q

Preparation in structured data: Selection

A

Eliminate data columns (variables) that are not needed.

7
Q

Preparation in structured data: Conversion

A

Converting variables to an appropriate data type: nominal, ordinal, integer, ratio, categorical.

8
Q

Cleansing structured data: Incomplete

A

Missing entries

9
Q

Cleansing structured data: Invalid

A

Outside a meaningful range

10
Q

Cleansing structured data: Inconsistent

A

Some data conflicts with other data.

11
Q

Cleansing structured data: Inaccurate

A

Not a true value

12
Q

Cleansing structured data: Non-uniform

A

Data not presented in an identical format.

American date (M/D/Y) vs European (D/M/Y)
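
A minimal sketch of resolving this kind of non-uniformity, assuming the source convention of each record is known in advance. The helper name `to_iso` and the choice of ISO 8601 as the common target format are illustrative assumptions.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"  # American convention: month/day/year
EU_FORMAT = "%d/%m/%Y"  # European convention: day/month/year


def to_iso(raw: str, source_format: str) -> str:
    """Parse a date string in a known source format and emit YYYY-MM-DD."""
    return datetime.strptime(raw, source_format).date().isoformat()


# The same raw string means two different dates depending on its origin.
print(to_iso("03/04/2024", US_FORMAT))  # 2024-03-04
print(to_iso("03/04/2024", EU_FORMAT))  # 2024-04-03
```

The ambiguity cannot be resolved from the string alone, which is why the sketch requires the source format as an explicit argument.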

13
Q

Cleansing structured data: Duplication

A

Multiple identical observations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Adjusting the range of a feature: Normalization

A

Rescales values into the range 0-1.

Sensitive to outliers.

(Xi - Xmin) / Range
= (Xi - Xmin) / (Xmax - Xmin)
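
The min-max formula can be sketched as below. The function name `normalize` and the sample values are assumptions chosen to show the outlier sensitivity mentioned on the card.

```python
def normalize(values):
    """Min-max normalization: (x - min) / (max - min) for every value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range is zero; normalization is undefined")
    return [(x - lo) / (hi - lo) for x in values]


# The single outlier (110) squeezes the other values toward 0.
print(normalize([10, 20, 30, 110]))
```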

15
Q

Adjusting the range of a feature: Standardization

A

Centers and rescales.

Requires a normal distribution.

(Xi - μ) / σ, where μ is the mean and σ is the standard deviation
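
A short sketch of the standardization formula follows. It assumes the population standard deviation (`statistics.pstdev`); the card does not say whether the population or sample version is intended, and the function name `standardize` is illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center on the mean and rescale by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population, not sample, deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; standardization is undefined")
    return [(x - mu) / sigma for x in values]


# This sample has mean 5 and population standard deviation 2.
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```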

16
Q

Performance evaluation graph: Precision formula

A

P= TP / (TP + FP)

Remember: the denominator is all predicted positives (TP + FP).

Precision is the ratio of correctly predicted positive classes to all predicted positive classes.
It is useful in situations where the cost of FP (Type I error) is high.

For example, when an expensive product fails quality inspection (predicted class 1) and is
scrapped, but it is actually perfectly good (actual class 0).

17
Q

Performance evaluation graph: Recall formula

A

TP / (TP + FN)

Remember: the denominator adds the opposite outcome (FN), giving all actual positives.

Recall, also known as sensitivity, is the ratio of correctly predicted positive classes to all actual
positive classes. It is useful in situations where the cost of FN (Type II error) is high.

For example, when an expensive product passes quality inspection (predicted class 0) and
is sent to the valued customer, but it is actually quite defective (actual class 1)

18
Q

Performance evaluation graph: Accuracy formula

A

(TP + TN) / (TP + FN + TN + FP)

Accuracy is the percentage of correctly predicted classes out of total predictions.

19
Q

Receiver operating characteristic (ROC): False Positive Rate Formula

A

FP / (FP + TN)

Statement / (Statement + Opposite)

20
Q

Receiver operating characteristic (ROC): True Positive Rate Formula

A

TP / (TP + FN)

Statement / (Statement + Opposite)
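
To illustrate how the two rates trace out a ROC curve, the sketch below computes one (FPR, TPR) point per classification threshold. The function name `roc_points`, the sample labels and scores, and the rule "predict class 1 when score >= threshold" are all assumptions made for the example.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    Assumes labels are 0/1, both classes are present, and an observation
    is predicted as class 1 when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if s >= t and y == 1)
        fp = sum(1 for y, s in pairs if s >= t and y == 0)
        fn = sum(1 for y, s in pairs if s < t and y == 1)
        tn = sum(1 for y, s in pairs if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for fpr, tpr in roc_points(labels, scores, thresholds=[0.0, 0.5, 1.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point toward (1, 1); raising it moves the point toward (0, 0).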

21
Q

In big data projects, which performance measure is most appropriate for regression methods?

A

RMSE

(Root Mean Square Error)
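
As a quick illustration, RMSE is the square root of the mean of the squared prediction errors. The function name `rmse` and the sample numbers below are assumptions.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0 and -2, so RMSE = sqrt((1 + 0 + 4) / 3).
print(rmse([3, 5, 7], [2, 5, 9]))
```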

22
Q

What is “trimming” in big data projects?

A

Removing the extreme observations on a feature in a data set, e.g. the bottom and top 1%.

23
Q

What is “Winsorization” in big data projects?

A

Replacing the extreme values in a data set with the maximum or minimum value of the remaining (non-extreme) observations.
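
The difference between trimming and winsorization can be sketched side by side. Both helpers are hypothetical, and the cutoff rule (drop or cap `int(n * fraction)` observations at each end) is one simple assumption among several possible percentile conventions.

```python
def trim(values, fraction=0.01):
    """Drop the lowest and highest `fraction` of observations (returns sorted data)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    k = int(len(values) * fraction)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]


def winsorize(values, fraction=0.01):
    """Cap the lowest and highest `fraction` of observations instead of dropping them."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    k = int(len(values) * fraction)
    ordered = sorted(values)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    # Values below `low` become `low`; values above `high` become `high`.
    return [min(max(x, low), high) for x in values]


data = list(range(1, 101))      # 100 observations: 1..100
print(len(trim(data)))          # trimming shrinks the data set
print(len(winsorize(data)))     # winsorization keeps every observation
```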

24
Q

Confusion Matrix: F1 Score Formula

A

(2 x P x R) / (P + R)

The F1 score is the harmonic mean of precision and recall.

The F1 score is more appropriate than accuracy when the class distribution in the dataset is unequal and it is necessary to measure the balance between precision and recall.

High scores on both accuracy and F1 suggest good model performance.
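
The precision, recall, accuracy, and F1 formulas from these cards can be combined in one small sketch. The function name and the confusion-matrix counts are made up for illustration; they were chosen to mimic an imbalanced dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four confusion-matrix metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical imbalanced data: 12 actual positives versus 88 actual negatives.
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is high because true negatives dominate, while F1 is noticeably lower, which is the situation the card describes.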

25
Q

Confusion Matrix Display

A

              Actual 1   Actual 0
Predicted 1   TP         FP
Predicted 0   FN         TN

26
Q

What is Mutual Information in big data projects?

A

How much info a token contributes to a class

Mutual Information (MI) measures how much information is contributed by a token to a class of text.

MI = 0: the token's distribution in all text classes is the same.

MI approaches 1: the token tends to occur more often in only one particular class of text.

27
Q

Feature Engineering

A

Final stage in Data Exploration

Numbers: differentiate among types of numbers

N-grams: multi-word patterns kept intact

Named entity recognition (NER): tags tokens with an entity class such as money, time, or organization.

28
Q

How to deal with Class Imbalance?

A

The majority class can be under-sampled and the minority class can be over-sampled.
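
A minimal sketch of random under-sampling and over-sampling, using only the standard library. The function name `rebalance`, the `target_size` parameter, and the fixed seed are assumptions; other resampling schemes exist.

```python
import random


def rebalance(majority, minority, target_size, seed=0):
    """Resample both classes to `target_size` observations each.

    The majority class is under-sampled without replacement (so
    target_size must not exceed its length); the minority class is
    over-sampled with replacement.
    """
    rng = random.Random(seed)
    under_sampled = rng.sample(majority, target_size)
    over_sampled = [rng.choice(minority) for _ in range(target_size)]
    return under_sampled, over_sampled


majority = list(range(100))   # 100 observations of the common class
minority = [100, 101, 102]    # 3 observations of the rare class
under, over = rebalance(majority, minority, target_size=10)
print(len(under), len(over))  # both classes now have 10 observations
```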

29
Q

Tokenization is the process of

A

Splitting a given text into separate words or characters.

A token is equivalent to a word, and tokenization is the process of splitting the text into separate tokens.

30
Q

The sequence of steps for text preprocessing is to produce

A

Tokens -> N-grams, from which a bag of words (BOW) is built -> the BOW is the input to a document term matrix (DTM).
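
The sequence can be sketched end to end as below. This is a simplified illustration: tokenization is plain lowercasing plus whitespace splitting (an assumption; real pipelines also remove punctuation, stop words, and so on), and all function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (whitespace splitting only)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Join each run of n consecutive tokens into a single n-gram."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, n=1):
    """Build one bag (term counts) per document, then tabulate the bags."""
    bags = [Counter(ngrams(tokenize(doc), n)) for doc in docs]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix


docs = ["Stock price rose", "Stock price fell"]
vocabulary, matrix = document_term_matrix(docs, n=2)
print(vocabulary)  # columns of the DTM (bigrams)
for row in matrix:
    print(row)     # one row per document
```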

31
Q

Big Data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 4 V's. What are the 4 V's?

A

Volume: refers to the quantity of data.

Variety: pertains to the array of available data sources.

Velocity: is the speed at which data is created (data in motion is hard to analyze compared to data at rest).

Veracity: relates to the credibility and reliability of different data sources.

32
Q

What is Exploratory Data Analysis (EDA), and which stage is it in?

A

Stage 4 (Data Exploration); it is the first step within that stage.

EDA is the preliminary step in data exploration. Exploratory graphs, charts, and other visualizations
such as heat maps and word clouds are designed to summarize and observe data.

33
Q

What is Feature Selection, and which stage is it in?

A

Stage 4 (Data Exploration); it is the second step within that stage.

Feature selection is a process whereby only pertinent features from the dataset are selected for ML model training.

34
Q

What is Feature Engineering, and which stage is it in?

A

Stage 4 (Data Exploration); it is the third and final step within that stage.

Feature engineering is a process of creating new features by changing or transforming existing features. Feature engineering techniques systematically alter, decompose, or combine existing features to produce more meaningful features.