Exploratory Data Analysis Flashcards
What are the 3 major types of data?
-Numerical
-Categorical
-Ordinal
In time series data, what are trend and seasonality?
A trend is the overall long-term direction of change in the data points over time, while seasonality is a pattern that repeats at regular intervals as time passes
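To make the distinction concrete, here is a minimal sketch that separates a synthetic series into a trend and a seasonal pattern using a centered moving average. The function name `decompose`, the additive model, and the sample data are assumptions made for illustration. Real data would normally be handled by a dedicated time series library.

```python
from statistics import mean

def decompose(series, period):
    """Split an additive series into a moving-average trend and seasonal offsets.

    Sketch only: assumes an odd period so the averaging window can be centered.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2

    # Trend: average over one full period centered on each point,
    # which cancels out the repeating seasonal component.
    trend = {}
    for i in range(half, len(series) - half):
        trend[i] = mean(series[i - half:i + half + 1])

    # Seasonality: average of the detrended values at each position in the cycle.
    seasonal = []
    for phase in range(period):
        residuals = [series[i] - trend[i] for i in trend if i % period == phase]
        seasonal.append(mean(residuals))
    return trend, seasonal

# Synthetic data: a steady upward trend (2 per step) plus a pattern repeating every 3 steps.
pattern = [0, 3, -3]
series = [2 * t + pattern[t % 3] for t in range(12)]
trend, seasonal = decompose(series, period=3)
```

On this synthetic series, `trend` follows the steady upward line and `seasonal` holds the repeating offsets for each position in the cycle.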
What are the data formats supported by Amazon Athena?
-CSV
-JSON
-Parquet
-Avro
-ORC
How does Athena’s billing work?
-You pay per query, based on the amount of data scanned (about $5 per TB scanned)
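As a rough illustration of pay-per-scan billing, the sketch below estimates the cost of a query from the bytes it scans. The $5 per TB rate, the binary definition of a terabyte, and the function name `athena_query_cost` are assumptions. Actual prices vary by region and pricing details may differ.

```python
PRICE_PER_TB_USD = 5.0     # assumption: typical list price, varies by region
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte

def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

# Scanning a full 200 GB CSV table versus a columnar copy where
# the query only needs to read a tenth of the data.
full_scan = athena_query_cost(200 * 1024 ** 3)
columnar_scan = athena_query_cost(20 * 1024 ** 3)
```

Because the charge depends on bytes scanned, columnar formats such as Parquet or ORC tend to lower the bill when a query reads only a few columns.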
True or False: Data transfers between S3 and Athena are unencrypted
False, they are encrypted using TLS
What data sources does Amazon QuickSight accept?
-Excel, CSV, TSV and log files (Local or S3)
-Athena
-Redshift
-RDS / Aurora
-EC2 hosted databases
-AWS IoT Analytics
What is SPICE in a QuickSight context?
It is the in-memory, high-performance calculation engine that QuickSight uses to run its queries.
How much SPICE memory does each QuickSight user have access to?
10GB
What machine learning insights are available in QuickSight?
-Anomaly detection
-Forecasting
-Auto-narrative
What is QuickSight Q?
It is a generative AI feature of QuickSight that can analyze your data and answer questions asked in natural language with generated insights
What is a QuickSight paginated report?
It is a report laid out in a format designed to be printed
True or False: QuickSight was not designed for performing ETL, and Glue is recommended for that instead
True
True or False: QuickSight supports user creation via IAM or email, but not 2FA
False, it does support 2FA
True or False: You can restrict Athena's access to S3 data by using bucket policies
True
What is EMR?
It is an AWS service that lets you run a managed Hadoop framework on EC2 instances
True or False: EMR includes Spark, Hive, Flink, Presto and HBase
True
In the architecture of an EMR cluster, what are the existing node types?
In an EMR cluster, there are the following nodes:
-Master node: Manages the cluster
-Core nodes: Host HDFS data and run tasks
-Task nodes: Run tasks without storing any data (a good fit for Spot Instances)
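The three node roles can be sketched as the instance group definitions that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The helper name `build_instance_groups`, the group names, the instance type, and the counts are hypothetical. The sketch only builds the configuration and does not call AWS.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Build a list of EMR instance group definitions, one per node role."""
    groups = [
        {   # Master node: manages the cluster, so there is a single instance.
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {   # Core nodes: hold HDFS data, so they stay on on-demand capacity here.
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            # Task nodes: store no data, so losing a Spot instance loses no HDFS blocks.
            "Name": "Task",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups

groups = build_instance_groups(core_count=2, task_count=4)
```

Putting only the task group on Spot capacity reflects the card above: interrupted task nodes cost compute time but not stored data.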
True or False: EMR Serverless lets AWS scale your nodes automatically
True
What are the types of EMR data storage?
-HDFS
-EMRFS: Access S3 as if it were HDFS
-Local File System
-EBS for HDFS
What is the EMRFS Consistent View setting for EMR data storage?
An optional feature that uses DynamoDB to maintain consistency for data stored in S3
What models are available in Spark MLlib?
-Classification: Logistic Regression and Naive Bayes
-Regression
-Decision trees
-Recommendation engine (ALS)
-Clustering (k-means)
-LDA (topic modelling)
-SVD, PCA, statistics
-ML workflow utilities (pipelines, feature transformation, persistence)
What is the recommended size for EMR master nodes?
m5.large if <50 nodes, m5.xlarge otherwise
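The rule of thumb on this card can be written as a small helper. The function name `recommended_master_instance` is hypothetical, and the threshold and instance types are taken directly from the card rather than from any AWS API.

```python
def recommended_master_instance(node_count):
    """Return the master instance type suggested by the card's rule of thumb."""
    # Fewer than 50 nodes: a smaller master is enough to coordinate the cluster.
    return "m5.large" if node_count < 50 else "m5.xlarge"

small_cluster = recommended_master_instance(10)
large_cluster = recommended_master_instance(120)
```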
What does HDFS mean?
Hadoop Distributed File System
What is the difference between a transient and a long-running EMR cluster?
A transient cluster is terminated automatically once its tasks have run, while a long-running cluster stays up until the user terminates it manually
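In boto3's EMR `run_job_flow` call, this choice maps to the `KeepJobFlowAliveWhenNoSteps` flag inside the `Instances` argument. The sketch below only builds that fragment of the configuration. The helper name `lifecycle_settings` is hypothetical, and a real request would need many more fields.

```python
def lifecycle_settings(long_running):
    """Return the part of an EMR `Instances` config that controls cluster lifetime."""
    return {
        # False: the cluster shuts down once its steps finish (transient).
        # True: the cluster stays up until someone terminates it (long-running).
        "KeepJobFlowAliveWhenNoSteps": long_running,
    }

transient = lifecycle_settings(long_running=False)
persistent = lifecycle_settings(long_running=True)
```

A transient setup suits scheduled batch jobs, while a long-running cluster is a more natural fit for interactive or ad hoc work.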