Exploratory Data Analysis Flashcards
How to use Pandas and NumPy?
Use Pandas to explore and clean the data, then load it into NumPy arrays to feed into an ML algorithm
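A minimal sketch of that workflow. The DataFrame and its column names (`age`, `income`, `label`) are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000.0, 52000.0, None, 88000.0],
    "label": [0, 0, 1, 1],
})

# Explore and clean with Pandas.
print(df.describe())
df = df.dropna()          # drop the row with the missing income

# Hand plain NumPy arrays to the ML algorithm.
X = df[["age", "income"]].to_numpy()
y = df["label"].to_numpy()
```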
Data types
Numerical
Categorical
Ordinal
Numerical Data types
Discrete or Integer
- head count
Continuous:
infinite precision
- How much rain fell on a given day?
Categorical Data
Gender
Political Parties
…
Order doesn’t matter
doesn’t have an intrinsic numerical meaning
Ordinal Data
Mixture of categorical and numerical: categories that have a meaningful order
- Movie Ratings (4 Star movie)
Normal Distribution vs Probability Mass Function
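Continuous vs discrete data: the normal distribution is a probability density function over continuous values, while a probability mass function assigns probabilities to discrete values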
Poisson Distribution
works with Discrete data
Binomial Distribution
Discrete data with boolean (two-outcome) trials
The binomial distribution models the number of successes in a fixed number of independent yes/no trials, such as the number of heads in a series of coin flips.
Bernoulli Distribution
Special case of the binomial distribution with a single trial (n=1)
A binomial variable is the sum of n independent Bernoulli trials
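The relationship between the two distributions can be checked numerically. This sketch relies on the fact that the PMF of a sum of independent variables is the convolution of their PMFs. The function names and the values n = 5, p = 0.3 are illustrative choices.

```python
import math
import numpy as np

def bernoulli_pmf(p):
    """PMF of a single trial as [P(0), P(1)]."""
    return np.array([1.0 - p, p])

def binomial_pmf_by_summing(n, p):
    """PMF of the sum of n independent Bernoulli(p) trials, built by convolution."""
    pmf = np.array([1.0])  # the sum of zero trials is 0 with probability 1
    for _ in range(n):
        pmf = np.convolve(pmf, bernoulli_pmf(p))
    return pmf

def binomial_pmf_closed_form(n, p):
    """Textbook binomial PMF: C(n, k) * p^k * (1 - p)^(n - k) for k = 0..n."""
    return np.array([math.comb(n, k) * p**k * (1 - p)**(n - k)
                     for k in range(n + 1)])

n, p = 5, 0.3
summed = binomial_pmf_by_summing(n, p)
closed = binomial_pmf_closed_form(n, p)
print(np.allclose(summed, closed))
```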
Additive model
Seasonality + Trends + Noise
constant seasonal variation
Multiplicative model
Seasonality × Trends × Noise
Seasonal variation increases as the trend increases
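A small synthetic comparison of the two models. The series, the 12-step cycle, and the helper `seasonal_swing` are assumptions made for illustration; the noise term is left out so the result is deterministic.

```python
import numpy as np

t = np.arange(48)                       # 48 time steps (e.g. months, assumed)
trend = 100 + 5 * t                     # upward linear trend
season = np.sin(2 * np.pi * t / 12)     # 12-step seasonal cycle

additive = trend + 20 * season                # swing stays at +/- 20
multiplicative = trend * (1 + 0.2 * season)   # swing is 20% of the trend

def seasonal_swing(series, start):
    """Peak-to-trough range of the detrended series over one 12-step cycle."""
    window = slice(start, start + 12)
    detrended = series[window] - trend[window]
    return detrended.max() - detrended.min()

print(seasonal_swing(additive, 0), seasonal_swing(additive, 36))
print(seasonal_swing(multiplicative, 0), seasonal_swing(multiplicative, 36))
```

In the additive series the swing is the same in the first and last cycle; in the multiplicative series it grows with the trend.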
Amazon Athena
Presto under the hood
Serverless interactive queries of S3
Athena Supported formats
CSV JSON ORC Parquet Avro
unstructured, semi or structured
Can you integrate Athena with notebooks?
Yes, you can
- Jupyter
- Zeppelin
- RStudio
Does Athena charge for DDL?
No, DDL statements (CREATE/ALTER/DROP) are not charged
How to save money on Athena?
use Columnar format
ORC, Parquet
Is cross-account access in S3 possible?
Yes, it is. Configure it through the bucket policy
Can you do CSE-KMS in S3 for Athena results at rest?
yes you can
Athena Anti-Patterns?
Highly formatted reports / visualizations
- try QuickSight
ETL
- use Glue ETL
QuickSight?
Serverless
Data sources:
- Redshift
- Aurora / RDS
- Athena
- EC2
- Files: Excel; CSV, TSV; common log format
it does some limited ETL
SPICE
accelerate interactive queries
in-memory
QuickSight ML capabilities?
Anomaly Detection
Forecasting
Auto-narratives
using RANDOM_CUT_FOREST
How to draw hierarchical aggregations?
Tree Maps
Different node types in EMR?
Master node
Core node
- store data on HDFS
Task node
- Runs tasks and doesn’t host data
- spot instances sound good for this type
- cluster can continue with/without it
HDFS default block size
128 MB
How does EMRFS provide a consistent view of S3?
using DynamoDB
Can you add/remove core/task nodes on the fly in EMR?
yes you can
What are the Hadoop modules?
top: MapReduce - Spark
middle: YARN
bottom: HDFS
underlying all of this is Hadoop Core (Hadoop Common)
Spark
- how does it work in EMR?
- compare to map reduce?
- API
Runs on YARN, which negotiates cluster resources, and can read from HDFS
faster alternative to MapReduce
It uses a DAG (directed acyclic graph) to manage dependencies and schedule processing efficiently
API for Java, Python, Scala and R
Spark components
Spark SQL
MLlib
GraphX
Spark Streaming
Spark Core
Resilient Distributed Dataset (RDD)
What is taking the place of the lower-level Resilient Distributed Dataset?
DataFrame in Python
Dataset in Scala
MLlib capabilities
Classification
- Logistic Regression, Naive Bayes
Regression
Decision trees
Recommendation Engine (ALS)
Clustering (K-Means)
LDA (Topic modeling)
ML workflow utilities
SVD, PCA, Statistics
What does Zeppelin bring to the table with Spark?
Makes Spark feel like a data science tool
Run Spark code interactively (like in Spark Shell)
Execute SQL queries directly against SparkSQL
Visualize in charts and graphs
Compare EMR Notebook with Zeppelin
EMR Notebook backed up to S3
more integration with AWS
Provision/terminate cluster within notebook
Hosted inside a VPC
Access only via AWS Console
No charge for EMR customers
What does Kerberos provide?
Strong Authentication through secret key cryptography
What is the curse of dimensionality?
too many features lead to sparse data
Dimensionality Reduction methods
Principal Component Analysis (PCA)
K-Means
Methods to impute data
Mean
- replace with Mean of entire column
Median
- if outliers are there then use Median
Most Frequent Value
- for categorical: e.g. use the
Copy
- Summary for Description
KNN: suited to numerical data; use Hamming distance for categorical data
Deep Learning
- good for categorical data
Regression
- Linear or non-linear regression
- MICE (Multiple Imputation by Chained Equations)
Get more data
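A short sketch of the three simplest strategies above (mean, median, most frequent value) using Pandas. The toy columns `income` and `color` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 1000.0 is a deliberate outlier.
df = pd.DataFrame({
    "income": [30.0, 35.0, np.nan, 40.0, 1000.0],
    "color": ["red", "blue", "red", None, "red"],
})

# Mean: pulled upward by the outlier.
mean_filled = df["income"].fillna(df["income"].mean())

# Median: robust to the outlier.
median_filled = df["income"].fillna(df["income"].median())

# Most frequent value: works for categorical columns.
mode_filled = df["color"].fillna(df["color"].mode()[0])

print(mean_filled.iloc[2], median_filled.iloc[2], mode_filled.iloc[3])
```

The mean fills the gap with 276.25 while the median fills it with 37.5, which shows why the median is preferred when outliers are present.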
Unbalanced data?
Oversampling
- duplicate from minority class
- can be done at random
Undersampling
- remove from majority class
SMOTE
SMOTE
Synthetic Minority Over-sampling Technique
Artificially generate new samples of the minority class using nearest neighbors
- KNN
- Create new samples from KNN
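A simplified sketch of the idea, not a reference implementation: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line between them. The function name `smote_sketch`, its parameters, and the toy data are all assumptions for illustration.

```python
import numpy as np

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a 2-D array of minority samples."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from sample i to every minority sample.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # k nearest neighbours; position 0 of the sort is the sample itself
        # (distance 0), so it is skipped.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        # The new point lies on the segment between the sample and the neighbour.
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_sketch(minority, n_new=6)
print(new_points.shape)
```

Because each synthetic point is interpolated between two real minority samples, it always stays inside the region the minority class already occupies.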
Variance
Sigma squared (σ²):
average of the squared differences from the mean
Standard Deviation
Sigma:
Square root of variance
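Both definitions computed by hand on a small made-up sample. This uses the population form (divide by N), which is also NumPy's default for `np.var` and `np.std`.

```python
import numpy as np

data = np.array([1.0, 4.0, 5.0, 4.0, 8.0])

mean = data.mean()                        # 4.4
variance = ((data - mean) ** 2).mean()    # average squared difference from the mean
std_dev = np.sqrt(variance)               # square root of the variance

print(variance, std_dev)
```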
How to spot outliers?
Box-and-whisker plot, e.g. points more than 1.5 × the interquartile range beyond the quartiles
Standard deviation
AWS Proprietary algorithm, RANDOM_CUT_FOREST
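A sketch of the first two rules on made-up values. The 1.5 × IQR fence follows the card above; the 2-sigma threshold for the standard-deviation rule is an assumption, since the right multiple depends on the data.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

# Interquartile-range rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Standard-deviation rule: flag points more than 2 sigma from the mean
# (the threshold of 2 is an assumed choice).
z = (values - values.mean()) / values.std()
sd_outliers = values[np.abs(z) > 2]

print(iqr_outliers, sd_outliers)
```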
Binning
Bucket numerical values and make them categorical
covers up imprecision, uncertainty or errors in measurements
Quantile binning: an equal number of samples in each bin
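A sketch of fixed-width versus quantile binning with Pandas. The ages, bin edges, and labels are invented for illustration.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 41, 58, 63, 79])

# Fixed bins: each bucket covers a chosen age range, so counts can be uneven.
fixed = pd.cut(ages, bins=[0, 18, 40, 65, 120],
               labels=["child", "young adult", "adult", "senior"])

# Quantile binning: bucket edges are chosen so each bucket holds
# the same number of samples.
quantile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(fixed.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```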
Transforming
apply some functions to a feature for better training
features with an exponential trend may benefit from logarithmic transformation e.g. ln(x)
or x^2 or sqrt(x)
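A small demonstration of why the logarithm helps with an exponential trend: after the transform the values fall on a straight line. The constants 3.0 and 0.5 are arbitrary.

```python
import numpy as np

x = np.arange(1, 9)
feature = 3.0 * np.exp(0.5 * x)     # values with an exponential trend

log_feature = np.log(feature)       # ln(feature) = ln(3) + 0.5 * x

# Step sizes explode in the raw values but are constant after the transform.
raw_steps = np.diff(feature)
log_steps = np.diff(log_feature)

print(raw_steps.round(2))
print(log_steps.round(2))
```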
Encoding
common in deep learning
one-hot encoding
- create buckets for every category
- bucket has 1 for category and 0 for others
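One way to do this in Pandas is `get_dummies`; the `color` values are made up. `dtype=int` is passed so the buckets show 1/0 instead of True/False.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One column (bucket) per category: 1 for the row's category, 0 elsewhere.
one_hot = pd.get_dummies(colors, dtype=int)

print(one_hot)
```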
Scaling / Normalizing
imagine features such as age and income, whose ranges differ hugely!
normalize them so they are on a level playing field
basically to avoid giving more weight to larger magnitudes
scikit-learn has a preprocessing module (e.g. MinMaxScaler)
remember to scale back the result
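Min-max scaling written out in plain NumPy so the arithmetic is visible, including the step of scaling back. The age and income values are invented. scikit-learn's `MinMaxScaler` offers the same operations through `fit_transform` and `inverse_transform`.

```python
import numpy as np

# Columns: age (years), income (dollars); made-up values.
X = np.array([[25.0, 30000.0],
              [40.0, 60000.0],
              [55.0, 120000.0]])

col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min

X_scaled = (X - col_min) / col_range          # every column now spans [0, 1]
X_restored = X_scaled * col_range + col_min   # scale back to the original units

print(X_scaled)
```

Keep `col_min` and `col_range` from the training data around: they are needed both to scale new inputs the same way and to scale results back.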
Shuffling
avoid learning from residual signals in the training data that result from the order in which it was collected
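A sketch of shuffling features and labels together with NumPy. The toy arrays and the seed are arbitrary; the point is that one permutation is applied to both so each sample keeps its label.

```python
import numpy as np

X = np.arange(10).reshape(5, 2)     # 5 samples, 2 features
y = np.array([0, 0, 0, 1, 1])       # labels arrive sorted by class

rng = np.random.default_rng(42)     # seed chosen arbitrarily
order = rng.permutation(len(X))

# Apply the same permutation to features and labels so pairs stay aligned.
X_shuffled, y_shuffled = X[order], y[order]

print(order)
```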
SageMaker Ground Truth?
Manages humans to label data
efficient because it starts to develop its own model during the process, reducing the reliance on humans
Who are the labelers of SageMaker Ground Truth?
Mechanical Turk
Internal team
Professional labeling companies
Amazon Rekognition
Classify and feature extraction (tags) of images
Amazon Comprehend
generate sentiment or topics from texts
TF-IDF
Term Frequency and Inverse Document Frequency
we can use the log of IDF since word frequencies are distributed exponentially
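A bare-bones sketch of one common TF-IDF variant: term frequency is the term's share of the document's tokens, and IDF is the log of (number of documents / documents containing the term). Libraries often add smoothing, so their numbers can differ. The three documents and the function names are illustrative.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Share of the document's tokens that are this term."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    """Log of (documents in corpus / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

print(tf_idf("the", tokenized[0], tokenized))
print(tf_idf("cat", tokenized[0], tokenized))
```

"the" appears twice in the first document but also in a second document, so its score ends up lower than that of "cat", which appears once and nowhere else.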
Unigrams, bigrams and n-grams?
an extension of TF-IDF
e.g. I Love Certification
- unigrams: I, Love, Certification
- bigrams: I Love, Love Certification
...
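A tiny helper that produces the n-grams from the example sentence above. The function name `ngrams` and whitespace tokenization are assumptions for illustration.

```python
def ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I Love Certification"
unigrams = ngrams(sentence, 1)   # ['I', 'Love', 'Certification']
bigrams = ngrams(sentence, 2)    # ['I Love', 'Love Certification']

print(unigrams)
print(bigrams)
```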
For TF-IDF, what transformation do we apply?
Tokenize the content and then get the sparse vector
MICE?
Multiple Imputation by Chained Equations finds relationships between features and is one of the most advanced imputation methods available. Machine learning techniques such as KNN and deep learning are also good approaches.