Exploratory Data Analysis Flashcards
How do you use Pandas and NumPy together?
Use Pandas to explore and clean the data, then load it into NumPy arrays to feed an ML algorithm
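A minimal sketch of that workflow, assuming a small made-up dataset (the column names `age`, `income`, and `bought` are hypothetical, and mean imputation is just one possible cleaning step):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [40_000, 52_000, 61_000, None],
    "bought": [0, 1, 1, 0],
})

# Explore and clean with Pandas.
print(df.describe())
df = df.fillna(df.mean(numeric_only=True))  # fill gaps with each column's mean

# Hand plain NumPy arrays to the ML algorithm.
X = df[["age", "income"]].to_numpy(dtype=np.float64)  # feature matrix
y = df["bought"].to_numpy()                           # label vector
print(X.shape, y.shape)
```

Most ML libraries accept `X` and `y` in this shape: one row per sample, one column per feature.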
Data types
Numerical
Categorical
Ordinal
Numerical Data types
Discrete or Integer
- head count
Continuous
infinite precision
- How much rain fell on a given day?
Categorical Data
Gender
Political Parties
…
Order doesn't matter
No intrinsic numerical meaning
Ordinal Data
Mixture of categorical and numerical: categories that have a meaningful order
- Movie Ratings (4 Star movie)
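One way to see the difference between categorical and ordinal data is Pandas' `Categorical` type, which can be declared ordered or unordered. The sketch below uses made-up star ratings and party labels:

```python
import pandas as pd

# Ordinal: star ratings have an order, but the labels are still categories.
ratings = pd.Series(["3 stars", "5 stars", "1 star", "4 stars"])  # hypothetical data
order = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
ordinal = pd.Categorical(ratings, categories=order, ordered=True)

# Categorical (nominal): party labels have no order at all.
parties = pd.Categorical(["Green", "Blue", "Green"])  # hypothetical labels

print(ordinal.max())    # order-aware operations work on ordered categories
print(ordinal.codes)    # integer codes follow the declared order
print(parties.ordered)  # unordered by default
```

Calling `max()` on the unordered `parties` would raise a `TypeError`, which matches the card: order has no meaning for plain categorical data.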
Normal Distribution vs Probability Mass Function
Continuous data (described by a probability density function) vs discrete data (described by a probability mass function)
Poisson Distribution
works with Discrete data
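Because the Poisson distribution is discrete, it is described by a probability mass function: each count k gets its own probability, and those probabilities sum to 1. A small standard-library sketch, assuming a hypothetical rate of 3 events per interval:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # hypothetical average number of events per interval

# Probabilities for counts 0..49; the tail beyond that is negligible for lam = 3.
probs = [poisson_pmf(k, lam) for k in range(50)]
total = sum(probs)

print(f"P(X = 0) = {probs[0]:.4f}")
print(f"sum over k = {total:.6f}")
```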
Binomial Distribution
Discrete data with boolean (two-outcome) results
The binomial distribution models the number of successes in n independent trials that each have two possible outcomes, such as repeated coin flips.
Bernoulli Distribution
Special case of the binomial distribution with a single trial (n=1)
A binomial random variable is the sum of n independent Bernoulli trials with the same success probability
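The relationship between the two cards can be checked numerically. In the sketch below, the helper names (`bernoulli`, `binomial_draw`, `binomial_pmf`) and the choice of n=10, p=0.5 are illustrative assumptions:

```python
import math
import random

def bernoulli(p: float, rng: random.Random) -> int:
    """One trial: 1 (success) with probability p, else 0."""
    return 1 if rng.random() < p else 0

def binomial_draw(n: int, p: float, rng: random.Random) -> int:
    """One binomial sample, built as the sum of n Bernoulli trials."""
    return sum(bernoulli(p, rng) for _ in range(n))

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact P(X = k) for n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

rng = random.Random(0)  # fixed seed so the run is repeatable
n, p = 10, 0.5          # e.g. ten fair coin flips
draws = [binomial_draw(n, p, rng) for _ in range(20_000)]

empirical = draws.count(5) / len(draws)  # share of samples with exactly 5 heads
exact = binomial_pmf(5, n, p)            # 252 / 1024
print(f"empirical {empirical:.3f} vs exact {exact:.3f}")
```

With n=1 the same formula reduces to the Bernoulli case: `binomial_pmf(1, 1, p)` is just `p`.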
Additive model
Seasonality + Trends + Noise
Constant seasonal variation
Multiplicative model
Seasonality × Trends × Noise
Seasonal variation increases as the trend increases
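The difference between the two models shows up in the size of the seasonal swing over time. The sketch below builds both from the same made-up linear trend and a 12-period sine season; noise is left out so the result is deterministic, and all numbers are illustrative assumptions:

```python
import math

periods, season_len = 48, 12
trend = [100 + 5 * t for t in range(periods)]  # hypothetical upward trend
wave = [math.sin(2 * math.pi * t / season_len) for t in range(periods)]

# Additive: seasonality is added to the trend (fixed swing of +/- 10).
additive = [tr + 10 * w for tr, w in zip(trend, wave)]
# Multiplicative: seasonality scales the trend (swing of +/- 10 percent).
multiplicative = [tr * (1 + 0.1 * w) for tr, w in zip(trend, wave)]

def seasonal_swing(series, start):
    """Max minus min of (series - trend) over one season starting at `start`."""
    detrended = [series[t] - trend[t] for t in range(start, start + season_len)]
    return max(detrended) - min(detrended)

add_first, add_last = seasonal_swing(additive, 0), seasonal_swing(additive, 36)
mul_first, mul_last = seasonal_swing(multiplicative, 0), seasonal_swing(multiplicative, 36)

print(f"additive swing:       first year {add_first:.1f}, last year {add_last:.1f}")
print(f"multiplicative swing: first year {mul_first:.1f}, last year {mul_last:.1f}")
```

The additive swing stays the same from the first year to the last, while the multiplicative swing grows along with the trend.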
Amazon Athena
Presto under the hood
Serverless interactive queries of S3
Athena Supported formats
CSV JSON ORC Parquet Avro
Data can be unstructured, semi-structured, or structured
Can you integrate Athena with notebooks?
Yes, for example:
- Jupyter
- Zeppelin
- RStudio
Does Athena charge for DDL?
No, DDL statements (CREATE / ALTER / DROP) are not charged
How to save money on Athena?
Use columnar formats
ORC, Parquet
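The saving comes from Athena billing by the amount of data scanned: a columnar file lets a query read only the columns it needs. A rough back-of-the-envelope sketch, where the price per TB, the table size, and the column counts are all assumptions (check current pricing for your region), and compression is ignored:

```python
PRICE_PER_TB_USD = 5.0  # assumption, not an official quote

def scan_cost(tb_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimated query cost for a given amount of scanned data."""
    return tb_scanned * price_per_tb

table_tb = 1.0       # hypothetical table size
total_columns = 20   # hypothetical, assumed to be equally sized
columns_needed = 2   # columns the query actually selects

# Row formats (CSV, JSON): the whole table is scanned.
row_format_cost = scan_cost(table_tb)
# Columnar formats (ORC, Parquet): only the selected columns are scanned.
columnar_cost = scan_cost(table_tb * columns_needed / total_columns)

print(f"row format: ${row_format_cost:.2f}, columnar: ${columnar_cost:.2f}")
```

In practice columnar formats also compress well, so the real gap is often larger than this simple ratio suggests.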
Is cross-account access to S3 possible?
Yes, by granting it in the S3 bucket policy
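A bucket policy for this might look roughly like the sketch below, built as a Python dict and serialized to JSON. The account ID, bucket name, and `Sid` are placeholders, and the exact set of actions a given setup needs is an assumption to verify against the AWS documentation:

```python
import json

OTHER_ACCOUNT_ID = "111122223333"  # placeholder account ID
BUCKET = "example-athena-data"     # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountAthenaRead",  # hypothetical statement ID
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{OTHER_ACCOUNT_ID}:root"},
            # Read-only access: locate the bucket, list it, and fetch objects.
            "Action": ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",    # bucket-level actions
                f"arn:aws:s3:::{BUCKET}/*",  # object-level actions
            ],
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON is what would be attached to the bucket in the account that owns the data.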
Can you use CSE-KMS to encrypt Athena query results at rest in S3?
Yes
Athena Anti-Patterns?
Highly formatted reports / visualizations
- try QuickSight
ETL
- use Glue ETL
QuickSight?
Serverless
Data sources:
- Redshift
- Aurora / RDS
- Athena
- EC2
- Files: Excel, CSV, TSV, common log format
It does some limited ETL
SPICE
accelerate interactive queries
in-memory
QuickSight ML capabilities?
Anomaly Detection
Forecasting
Auto-narratives
Anomaly detection and forecasting use Random Cut Forest (RCF)