Big Data Projects Flashcards
Steps in Big Data Analysis/Projects: Traditional with structured data.
**Conceptualize the task** -> Collect data -> Data preparation & processing -> Data exploration -> Model training.
Steps in Big Data Analysis/Projects: Textual Big Data.
Text problem formulation -> Data curation -> **Text preparation and processing** -> Text exploration -> Classifier output.
Preparation in structured data: Extraction
Creating a new variable from an already existing one to ease the analysis.
Example: Date of birth -> Age
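The date-of-birth example can be sketched in Python (the function name is illustrative):

```python
from datetime import date

def age_from_dob(dob: date, as_of: date) -> int:
    """Extract an 'age' feature from a date-of-birth value."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday hasn't occurred yet this year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_from_dob(date(1990, 6, 15), as_of=date(2024, 3, 1)))  # 33
```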
Preparation in structured data: Aggregation
Two or more variables aggregated into one single variable.
Preparation in structured data: Filtration
Eliminate data rows which are not needed.
[We filter out the observations that are not relevant]
Example: keep CFA Level 2 candidates only.
Preparation in structured data: Selection
Eliminating columns (features) that are not needed for the analysis.
Preparation in structured data: Conversion
Converting values to the appropriate type: nominal, ordinal, integer, ratio, categorical.
Cleansing structured data: Incomplete
Missing entries
Cleansing structured data: Invalid
Outside a meaningful range
Cleansing structured data: Inconsistent
Some data conflicts with other data.
Cleansing structured data: Inaccurate
Not a true value
Cleansing structured data: Non-uniform
Non-identical data formats
Example: American dates (M/D/Y) vs. European dates (D/M/Y)
Cleansing structured data: Duplication
Multiple identical observations
Adjusting the range of a feature: Normalization
Rescales into the range 0 to 1.
Sensitive to outliers.
(Xi - Xmin) / Range
(Xi - Xmin) / (Xmax - Xmin)
Adjusting the range of a feature: Standardization
Centers and Rescales
Requires a normal distribution.
(Xi - μ) / standard deviation
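Both rescaling methods above can be sketched in plain Python (function names are illustrative; the standard deviation here is the population version):

```python
def normalize(xs):
    """Min-max rescaling into [0, 1]; sensitive to outliers."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Center on the mean and rescale by the standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5  # population std dev
    return [(x - mu) / sigma for x in xs]

print(normalize([2, 4, 6, 8]))    # [0.0, 0.333..., 0.666..., 1.0]
print(standardize([2, 4, 6, 8]))  # centered on 0
```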
Performance evaluation: Precision formula
P= TP / (TP + FP)
Remember: the Denominator contains all Predicted positives.
Precision is the ratio of correctly predicted positive classes to all predicted positive classes.
Precision is useful in situations where the cost of an FP (Type I error) is high.
For example, when an expensive product fails quality inspection (predicted class 1) and is
scrapped, but it is actually perfectly good (actual class 0).
Performance evaluation: Recall formula
TP / (TP + FN)
Remember: Recall has the opposite outcome (FN) in the denominator.
Recall, also known as sensitivity, is the ratio of correctly predicted positive classes to all actual
positive classes. Recall is useful in situations where the cost of an FN (Type II error) is high.
For example, when an expensive product passes quality inspection (predicted class 0) and
is sent to the valued customer, but it is actually quite defective (actual class 1).
Performance evaluation: Accuracy formula
(TP + TN) / (TP + FN + TN + FP)
Is the percentage of correctly predicted classes out of total predictions.
Receiver operating characteristic (ROC): False Positive Rate formula
FP / (FP + TN)
Statement / (Statement + Opposite)
Receiver operating characteristic (ROC): True Positive Rate formula
TP / (TP + FN)
Statement / (Statement + Opposite)
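All of the formulas above come from the same four confusion-matrix counts. A minimal sketch in Python (the function name and example counts are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute performance metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate (sensitivity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)             # false positive rate
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "fpr": fpr}

m = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(m)  # precision 0.8, recall ~0.889, accuracy 0.85, fpr ~0.182
```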
In big data projects, which measure is most appropriate for the regression method?
RMSE
(Root Mean Square Error)
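A minimal RMSE sketch in Python (the example values are illustrative):

```python
def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more than small ones."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt(5/3) ~= 1.29
```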
What is “trimming” in big data projects?
Removing the bottom and top 1% of observations on a feature in a data set.
What is “Winsorization” in big data projects?
Replacing the extreme values in a data set with the same maximum or minimum value.
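Trimming and winsorization can be contrasted in a short Python sketch (the function names are illustrative; the cutoff percentage is a parameter, with 1% as the flashcard's default):

```python
def trim(xs, pct=0.01):
    """Trimming: drop the bottom and top pct of observations entirely."""
    s = sorted(xs)
    k = int(len(s) * pct)
    return s[k:len(s) - k] if k else s

def winsorize(xs, pct=0.01):
    """Winsorization: replace the extremes with the cutoff values
    instead of dropping them."""
    s = sorted(xs)
    k = int(len(s) * pct)
    lo, hi = s[k], s[-k - 1]
    return [min(max(x, lo), hi) for x in xs]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim(data, pct=0.1))       # [2, 3, 4, 5, 6, 7, 8, 9]
print(winsorize(data, pct=0.1))  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```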
Confusion Matrix: F1 Score Formula
(2 x P x R) / (P + R)
is the harmonic mean of precision and recall.
The F1 score is more appropriate than accuracy when the dataset has an unequal class distribution and it is necessary to balance precision and recall.
High scores on both of these metrics suggest good model performance.
Confusion Matrix Display

|                    | Actual Positive | Actual Negative |
|--------------------|-----------------|-----------------|
| Predicted Positive | TP              | FP              |
| Predicted Negative | FN              | TN              |
What is Mutual Information in big data projects?
How much info a token contributes to a class
Mutual Information (MI) Measures how much information is contributed by a token to a class of text.
MI = 0: The token's distribution is the same in all text classes.
MI = 1: The token tends to occur in only that particular class of text.
Feature Engineering
Final stage in Data Exploration
Numbers: Differentiate among types of numbers
N-Grams: Multi-word patterns kept intact
Named entity recognition (NER): Classes such as Money, Time, Organization.
How to deal with Class Imbalance?
The majority class can be under-sampled and the minority class can be over-sampled.
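A minimal sketch of that resampling idea in Python, using plain random sampling (the function name, target size, and example data are illustrative, not a specific library's method):

```python
import random

def rebalance(majority, minority, seed=42):
    """Under-sample the majority class (without replacement) and
    over-sample the minority class (with replacement) so both classes
    end up at the average of the two original sizes."""
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2
    under = rng.sample(majority, target)
    over = [rng.choice(minority) for _ in range(target)]
    return under, over

maj = list(range(900))    # 900 majority-class row indices
mino = list(range(100))   # 100 minority-class row indices
under, over = rebalance(maj, mino)
print(len(under), len(over))  # 500 500
```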
Tokenization is the process of
Splitting a given text into separate words or characters.
A token is equivalent to a word, and tokenization is the process of splitting the text into separate tokens.
The sequence of steps for text preprocessing is to produce
Tokens -> N-grams which to build a bag -> Input to a document term matrix.
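That pipeline can be sketched in plain Python (the toy documents and helper names are illustrative):

```python
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return text.lower().split()

def ngrams(tokens, n=2):
    """Build multi-word patterns (n-grams) from the token stream."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["net profit rose", "net loss widened"]
token_lists = [tokenize(d) for d in docs]

# Bag of words: the unique tokens (and bigrams) across all documents.
vocab = sorted({t for toks in token_lists for t in toks + ngrams(toks)})

# Document term matrix: one row per document, one column per term.
dtm = [[Counter(toks + ngrams(toks))[term] for term in vocab]
       for toks in token_lists]
print(vocab)
print(dtm)
```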
Big Data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 4 V's. What are the 4 V's?
Volume: refers to the quantity of data.
Variety: pertains to the array of available data sources.
Velocity: is the speed at which data is created (data in motion is harder to analyze than data at rest).
Veracity: related to the credibility and reliability of different data sources.
What is Exploratory Data Analysis (EDA), and in which stage is it in?
Stage 4 and first stage in Data Exploration
is the preliminary step in data exploration. Exploratory graphs, charts and other visualizations
such as heat maps and word clouds are designed to summarize and observe data.
What is Feature Selection, and in which stage is it in?
Stage 4 and Second stage in Data Exploration
is a process whereby only pertinent features from the dataset are selected for ML model training.
What is Feature Engineering, and in which stage is it in?
Stage 4 and Third and final stage in Data Exploration
is a process of creating new features by changing or transforming existing features. Feature Engineering techniques systematically alter, decompose, or combine existing features to produce more meaningful features.