Exploratory Data Analysis Flashcards

1
Q

What are the 3 major types of data?

A

-Numerical
-Categorical
-Ordinal

2
Q

In time series data, what are trend and seasonality?

A

A trend is the overall long-term direction in which the data points change as time passes, while seasonality is a pattern that repeats at a regular interval (for example, every week or every year)
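
To make the distinction concrete, here is a minimal sketch that separates the two components of a synthetic series. The series, the period of 4 and the helper names are all assumptions made for illustration; it uses a centered moving average, which is one common way to estimate a trend, not the only one.

```python
# Synthetic series (hypothetical data): a linear trend plus a pattern
# that repeats every 4 time steps.
PERIOD = 4
seasonal_pattern = [3.0, -1.0, -3.0, 1.0]  # sums to zero over one cycle
series = [2.0 * t + seasonal_pattern[t % PERIOD] for t in range(24)]


def centered_moving_average(values, period):
    """Estimate the trend by averaging over one full seasonal cycle.

    Assumes an even period: the window is period + 1 points wide, with
    half weight on the two end points, so the seasonal pattern cancels.
    """
    half = period // 2
    trend = {}
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]
        total = sum(window) - 0.5 * window[0] - 0.5 * window[-1]
        trend[t] = total / period
    return trend


def seasonal_means(values, trend, period):
    """Average the detrended values at each position within the cycle."""
    buckets = [[] for _ in range(period)]
    for t, trend_value in trend.items():
        buckets[t % period].append(values[t] - trend_value)
    return [sum(b) / len(b) for b in buckets]


trend = centered_moving_average(series, PERIOD)
seasonality = seasonal_means(series, trend, PERIOD)
```

Here `trend` recovers the steady rise of 2 per step, and `seasonality` recovers the repeating 4-step pattern that was added on top of it.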

3
Q

What are the data formats supported by Amazon Athena?

A

-CSV
-JSON
-Parquet
-Avro
-ORC

4
Q

How does Athena’s billing work?

A

-You pay per query, based on the amount of data scanned (about $5 per TB scanned)
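
As a rough worked example of scan-based billing, the sketch below estimates the cost of a query. The price of $5 per TB, the 10 MB per-query minimum and the use of binary terabytes are assumptions; check the current pricing for your region before relying on the numbers.

```python
PRICE_PER_TB_USD = 5.0               # assumed list price per TB scanned
MIN_BYTES_PER_QUERY = 10 * 1024**2   # assumed 10 MB minimum charge per query
BYTES_PER_TB = 1024**4               # assumed binary terabyte


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning a full 1 TB table versus reading only a tenth of it.
full_scan_cost = athena_query_cost(BYTES_PER_TB)
partial_scan_cost = athena_query_cost(BYTES_PER_TB // 10)
```

Because the charge depends on bytes scanned, columnar formats such as Parquet or ORC, which let a query read only the columns it needs, usually cost less than scanning the same data as CSV or JSON.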

5
Q

True or False: Data transfers between S3 and Athena are unencrypted

A

False, they are encrypted using TLS

6
Q

What data sources does Amazon QuickSight accept?

A

-Excel, CSV, TSV and log files (Local or S3)
-Athena
-Redshift
-RDS / Aurora
-EC2 hosted databases
-AWS IoT Analytics

7
Q

What is SPICE in a QuickSight context?

A

It is an in-memory, high-performance calculation engine used by QuickSight to perform its queries.

8
Q

How much SPICE memory does each QuickSight user have access to?

A

10GB

9
Q

What machine learning insights are available in QuickSight?

A

-Anomaly detection
-Forecasting
-Auto-narrative

10
Q

What is QuickSight Q?

A

It is a GenAI feature of QuickSight that can analyze your data and generate insights in response to questions you ask it

11
Q

What is a QuickSight paginated report?

A

It is a report that is created in a format designed to be printed

12
Q

True or False: QuickSight was not designed for performing ETL; Glue is recommended for that instead

A

True

13
Q

True or False: QuickSight supports user creation via IAM or email, but not 2FA

A

False, it does support 2FA

14
Q

True or False: You can restrict Athena access to S3 data through the use of bucket policies

A

True

15
Q

What is EMR?

A

It is an AWS service that lets you run a managed Hadoop framework on EC2 instances

16
Q

True or false: EMR includes Spark, Hive, Flink, Presto and HBase

17
Q

In the architecture of an EMR cluster, what are the existing node types?

A

An EMR cluster has the following node types:
-Master node: Manages the cluster
-Core nodes: Host HDFS data and run tasks
-Task nodes: Run tasks without storing any data (a good use of Spot Instances)

18
Q

True or False: EMR Serverless lets AWS scale your nodes automatically

19
Q

What are the types of EMR data storage?

A

-HDFS
-EMRFS: Access S3 as if it were HDFS
-Local File System
-EBS for HDFS

20
Q

What is the EMRFS Consistent View setting for EMR data storage?

A

An optional feature that uses DynamoDB to maintain consistency of S3 data

21
Q

What models are available in Spark MLlib?

A

-Classification: Logistic Regression and Naive Bayes
-Regression
-Decision trees
-Recommendation engine (ALS)
-Clustering (k-means)
-LDA (topic modelling)
-SVD, PCA, statistics
-ML workflow utilities (pipelines, feature transformation, persistence)

22
Q

What is the recommended size for EMR master nodes?

A

m5.large if <50 nodes, m5.xlarge otherwise

23
Q

What does HDFS mean?

A

Hadoop Distributed File System

24
Q

What is the difference between a Transient and a Long-running EMR cluster?

A

A transient cluster is automatically terminated after its tasks are run, while a Long-running one is terminated manually by the user

25
Q

True or False: When you terminate an EMR cluster that uses HDFS storage, the files are backed up to EBS

A

False, all data stored in HDFS is ephemeral and is lost when the cluster is terminated

26
Q

What are better ways of imputing missing data than using the mean?

A

-KNN (mean of the N closest entries)
-Deep learning (good for categorical data)
-Regression (advanced technique: MICE)
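
The KNN option can be sketched in a few lines. This is a minimal illustration under stated assumptions: the dataset, the column meanings and the `knn_impute` helper are hypothetical, the other columns are assumed complete and on comparable scales, and the distance is plain Euclidean.

```python
import math


def knn_impute(rows, target_col, k=3):
    """Fill None values in target_col with the mean of the k nearest complete rows."""
    feature_cols = [c for c in range(len(rows[0])) if c != target_col]
    complete = [r for r in rows if r[target_col] is not None]

    def distance(a, b):
        # Euclidean distance over the columns other than the one being imputed
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in feature_cols))

    imputed = []
    for row in rows:
        filled = list(row)
        if row[target_col] is None:
            neighbours = sorted(complete, key=lambda r: distance(row, r))[:k]
            filled[target_col] = sum(n[target_col] for n in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed


# Hypothetical data: (age, years_of_experience, salary_in_thousands)
data = [
    [25, 2, 40.0],
    [27, 3, 44.0],
    [26, 2, None],   # missing salary
    [50, 25, 120.0],
    [52, 28, 130.0],
]
filled = knn_impute(data, target_col=2, k=2)
```

The missing salary is filled from the two most similar rows (the 25- and 27-year-olds), whereas a plain column mean would be pulled upward by the two senior rows.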
27
Q

What is SMOTE?

A

It is an oversampling technique in which new artificial entries are generated for the minority class to correct an imbalanced dataset. It also undersamples the majority class.
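
The oversampling half of the idea can be sketched as follows: each synthetic point is placed at a random position on the line between a minority sample and one of its nearest minority-class neighbours. The `smote` helper, its parameters and the sample points are assumptions for illustration; this is a simplified sketch, not a library implementation, and it does not show the majority-class undersampling.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features each
minority_class = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority_class, n_new=6, k=2)
```

Because every new point lies between two existing minority samples, the synthetic entries stay inside the region the minority class already occupies rather than being exact duplicates.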
28
Q

What is a possible statistical algorithm that can be used to detect outliers?

A

Random Cut Forest

29
Q

What is binning?

A

Grouping multiple entries with similar values together.

30
Q

What is quantile binning?

A

It is a strategy where you perform binning based on the value distribution, making sure all bins contain the same number of entries.
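
A short sketch contrasting quantile binning with equal-width binning on skewed data. Both helper functions and the income values are hypothetical; the rank-based assignment used here can split tied values across neighbouring bins, which a production implementation would need to handle.

```python
def quantile_bin(values, n_bins):
    """Assign bin indices by rank so that every bin holds (almost) the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def equal_width_bin(values, n_bins):
    """Assign bin indices by splitting the value range into equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


incomes = [20, 22, 25, 27, 30, 32, 35, 200]  # skewed hypothetical data

by_quantile = quantile_bin(incomes, 4)
by_width = equal_width_bin(incomes, 4)
```

With the single large value in the data, equal-width binning puts seven of the eight entries into the first bin, while quantile binning places two entries in each of the four bins.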
31
Q

What is SageMaker Ground Truth?

A

It is an AWS service used for labelling data for ML

32
Q

What is the difference between SageMaker Ground Truth and Mechanical Turk?

A

Ground Truth is aimed specifically at data labelling, while Mechanical Turk covers any task that has to be performed manually by humans.

33
Q

True or False: SageMaker Ground Truth relies 100% on human labour.

A

False. As images are labelled, Ground Truth builds its own model and only requests human assistance for instances where it is not sure of its own classification.
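
The routing idea behind this answer can be sketched as a confidence threshold. This is only an illustration of the concept: the `route_for_labelling` function, the 0.9 threshold and the prediction tuples are hypothetical and are not part of any Ground Truth API.

```python
def route_for_labelling(predictions, threshold=0.9):
    """Split model predictions into auto-labelled items and items sent to humans.

    predictions: iterable of (item_id, predicted_label, confidence) tuples.
    threshold: hypothetical confidence cut-off below which a human is asked.
    """
    auto_labelled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labelled.append((item_id, label))
        else:
            needs_human.append(item_id)
    return auto_labelled, needs_human


# Hypothetical model output for three images
predictions = [
    ("img-1", "cat", 0.98),
    ("img-2", "dog", 0.55),
    ("img-3", "cat", 0.91),
]
auto, human = route_for_labelling(predictions)
```

Only the low-confidence item goes to a human labeller, which is why the share of manual work tends to fall as the model improves.
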
34
Q

What is SageMaker Ground Truth Plus?

A

It is a feature of Ground Truth where the workflow and labelling are managed by AWS experts and the labelled data is then delivered to the customer.

35
Q

What other AWS services can be used to generate training labels?

A

Amazon Rekognition (images) and Amazon Comprehend (text)