Big Data Projects Flashcards
Steps in Big Data Analysis/Projects: Traditional with structured data.
**Conceptualize the task** -> Collect data -> Data preparation & processing -> Data exploration -> Model training.
Steps in Big Data Analysis/Projects: Textual Big Data.
Text problem formulation -> Data curation -> **Text preparation and processing** -> Text exploration -> Classifier output.
Preparation in structured data: Extraction
Creating a new variable from an already existing one to ease the analysis.
Example: Date of birth -> Age
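The date-of-birth example can be sketched in Python (the function name is illustrative):

```python
from datetime import date

def age_from_dob(dob: date, as_of: date) -> int:
    """Extract an 'age' feature from a date-of-birth value."""
    years = as_of.year - dob.year
    # Subtract one year if the birthday hasn't occurred yet this year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_from_dob(date(1990, 6, 15), as_of=date(2024, 3, 1)))  # 33
```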
Preparation in structured data: Aggregation
Two or more variables aggregated into one single variable.
Preparation in structured data: Filtration
Eliminate data rows which are not needed.
[We filter out the observations that are not relevant]
Example: keep CFA Level 2 candidates only.
Preparation in structured data: Selection
Eliminating columns (features) that are not needed for the analysis.
Preparation in structured data: Conversion
Converting values to the appropriate type: nominal, ordinal, integer, ratio, categorical.
Cleansing structured data: Incomplete
Missing entries
Cleansing structured data: Invalid
Outside a meaningful range
Cleansing structured data: Inconsistent
Some data conflicts with other data.
Cleansing structured data: Inaccurate
Not a true value
Cleansing structured data: Non-uniform
Non-identical data formats
Example: American dates (M/D/Y) vs. European dates (D/M/Y)
Cleansing structured data: Duplication
Multiple identical observations
Adjusting the range of a feature: Normalization
Rescales into the range 0 to 1.
Sensitive to outliers.
(Xi - Xmin) / Range
(Xi - Xmin) / (Xmax - Xmin)
Adjusting the range of a feature: Standardization
Centers and Rescales
Requires a normal distribution.
(Xi - μ) / standard deviation
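Both rescaling methods above can be sketched in plain Python (function names are illustrative; the standard deviation here is the population version):

```python
def normalize(xs):
    """Min-max rescaling into [0, 1]; sensitive to outliers."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Center on the mean and rescale by the standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5  # population std dev
    return [(x - mu) / sigma for x in xs]

print(normalize([2, 4, 6, 8]))    # [0.0, 0.333..., 0.666..., 1.0]
print(standardize([2, 4, 6, 8]))  # centered on 0
```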
Performance evaluation: Precision formula
P= TP / (TP + FP)
Remember: the Denominator contains all Predicted positives.
Precision is the ratio of correctly predicted positive classes to all predicted positive classes.
Precision is useful in situations where the cost of an FP (Type I error) is high.
For example, when an expensive product fails quality inspection (predicted class 1) and is
scrapped, but it is actually perfectly good (actual class 0).
Performance evaluation: Recall formula
TP / (TP + FN)
Remember: Recall has the opposite outcome (FN) in the denominator.
Recall, also known as sensitivity, is the ratio of correctly predicted positive classes to all actual
positive classes. Recall is useful in situations where the cost of an FN (Type II error) is high.
For example, when an expensive product passes quality inspection (predicted class 0) and
is sent to the valued customer, but it is actually quite defective (actual class 1).
Performance evaluation: Accuracy formula
(TP + TN) / (TP + FN + TN + FP)
Is the percentage of correctly predicted classes out of total predictions.
Receiver operating characteristic (ROC): False Positive Rate formula
FP / (FP + TN)
Statement / (Statement + Opposite)
Receiver operating characteristic (ROC): True Positive Rate formula
TP / (TP + FN)
Statement / (Statement + Opposite)
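All of the formulas above come from the same four confusion-matrix counts. A minimal sketch in Python (the function name and example counts are illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute performance metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the true positive rate (sensitivity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)             # false positive rate
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "fpr": fpr}

m = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(m)  # precision 0.8, recall ~0.889, accuracy 0.85, fpr ~0.182
```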
In big data projects, which measure is most appropriate for the regression method?
RMSE
(Root Mean Square Error)
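A minimal RMSE sketch in Python (the example values are illustrative):

```python
def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more than small ones."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt(5/3) ~= 1.29
```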
What is “trimming” in big data projects?
Removing the bottom and top 1% of observations on a feature in a data set.
What is “Winsorization” in big data projects?
Replacing the extreme values in a data set with the same maximum or minimum value.
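Trimming and winsorization can be contrasted in a short Python sketch (the function names are illustrative; the cutoff percentage is a parameter, with 1% as the flashcard's default):

```python
def trim(xs, pct=0.01):
    """Trimming: drop the bottom and top pct of observations entirely."""
    s = sorted(xs)
    k = int(len(s) * pct)
    return s[k:len(s) - k] if k else s

def winsorize(xs, pct=0.01):
    """Winsorization: replace the extremes with the cutoff values
    instead of dropping them."""
    s = sorted(xs)
    k = int(len(s) * pct)
    lo, hi = s[k], s[-k - 1]
    return [min(max(x, lo), hi) for x in xs]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim(data, pct=0.1))       # [2, 3, 4, 5, 6, 7, 8, 9]
print(winsorize(data, pct=0.1))  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```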
Confusion Matrix: F1 Score Formula
(2 x P x R) / (P + R)
is the harmonic mean of precision and recall.
The F1 score is more appropriate than accuracy when the dataset has an unequal class distribution and it is necessary to balance precision and recall.
High scores on both of these metrics suggest good model performance.
Confusion Matrix Display

|                    | Actual Positive | Actual Negative |
|--------------------|-----------------|-----------------|
| Predicted Positive | TP              | FP              |
| Predicted Negative | FN              | TN              |
What is Mutual Information in big data projects?
How much info a token contributes to a class
Mutual Information (MI) Measures how much information is contributed by a token to a class of text.
MI = 0: The token's distribution is the same in all text classes.
MI = 1: The token tends to occur in only that particular class of text.
Feature Engineering
Final stage in Data Exploration
Numbers: Differentiate among types of numbers
N-Grams: Multi-word patterns kept intact
Named entity recognition (NER): Classes such as Money, Time, Organization.
How to deal with Class Imbalance?
The majority class can be under-sampled and the minority class can be over-sampled.
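A minimal sketch of that resampling idea in Python, using plain random sampling (the function name, target size, and example data are illustrative, not a specific library's method):

```python
import random

def rebalance(majority, minority, seed=42):
    """Under-sample the majority class (without replacement) and
    over-sample the minority class (with replacement) so both classes
    end up at the average of the two original sizes."""
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2
    under = rng.sample(majority, target)
    over = [rng.choice(minority) for _ in range(target)]
    return under, over

maj = list(range(900))    # 900 majority-class row indices
mino = list(range(100))   # 100 minority-class row indices
under, over = rebalance(maj, mino)
print(len(under), len(over))  # 500 500
```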
Tokenization is the process of
Splitting a given text into separate words or characters.
A token is equivalent to a word, and tokenization is the process of splitting the text into separate tokens.
The sequence of steps for text preprocessing is to produce
Tokens -> N-grams which to build a bag -> Input to a document term matrix.
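That pipeline can be sketched in plain Python (the toy documents and helper names are illustrative):

```python
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return text.lower().split()

def ngrams(tokens, n=2):
    """Build multi-word patterns (n-grams) from the token stream."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ["net profit rose", "net loss widened"]
token_lists = [tokenize(d) for d in docs]

# Bag of words: the unique tokens (and bigrams) across all documents.
vocab = sorted({t for toks in token_lists for t in toks + ngrams(toks)})

# Document term matrix: one row per document, one column per term.
dtm = [[Counter(toks + ngrams(toks))[term] for term in vocab]
       for toks in token_lists]
print(vocab)
print(dtm)
```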
Big Data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 4 V's. What are the 4 V's?
Volume: refers to the quantity of data.
Variety: pertains to the array of available data sources.
Velocity: is the speed at which data is created (data in motion is harder to analyze than data at rest).
Veracity: related to the credibility and reliability of different data sources.
What is Exploratory Data Analysis (EDA), and in which stage is it in?
Stage 4 and first stage in Data Exploration
is the preliminary step in data exploration. Exploratory graphs, charts and other visualizations
such as heat maps and word clouds are designed to summarize and observe data.
What is Feature Selection, and in which stage is it in?
Stage 4 and Second stage in Data Exploration
is a process whereby only pertinent features from the dataset are selected for ML model training.
What is Feature Engineering, and in which stage is it in?
Stage 4 and Third and final stage in Data Exploration
is a process of creating new features by changing or transforming existing features. Feature Engineering techniques systematically alter, decompose, or combine existing features to produce more meaningful features.