Data Engineering Fundamentals Flashcards
What is Avro?
A row-based binary storage format that stores the schema alongside the data.
What is Parquet?
A columnar storage format optimized for analytics.
What does random sampling do?
It gives every item an equal chance of being selected.
What is stratified sampling?
It splits the population into subgroups (strata) and samples from each one, which ensures every subgroup is represented.
What is systematic sampling?
Selecting every Nth item.
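The three sampling methods above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the fixed seed, and the toy population are assumptions made for the example.

```python
import random

def simple_random_sample(items, k, seed=0):
    """Random sampling: every item has the same chance of being picked."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def stratified_sample(items, key, k_per_group, seed=0):
    """Stratified sampling: group items by key(item), then sample from every group."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(k_per_group, len(members))))
    return sample

def systematic_sample(items, n, start=0):
    """Systematic sampling: take every n-th item, beginning at index `start`."""
    return items[start::n]

population = list(range(1, 21))  # the numbers 1..20
print(simple_random_sample(population, 5))
print(stratified_sample(population, key=lambda x: x % 2, k_per_group=2))
print(systematic_sample(population, 5))  # [1, 6, 11, 16]
```

The stratified call uses odd/even as the subgroups, so the result always contains two odd and two even numbers.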
What is data skew?
An unequal distribution of data across partitions.
What can be done to address data skew?
Adaptive partitioning
Salting
Repartitioning
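Salting is the least obvious of these fixes, so here is a rough sketch of the idea: a random suffix is appended to each key so that one hot key turns into several distinct keys that land on different partitions. The partitioner, the partition count, and the salt count are toy assumptions for the example, not any particular engine's behavior.

```python
import random
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4  # hypothetical value; in practice tuned to the degree of skew

def partition_for(key):
    # Deterministic toy partitioner standing in for a real hash partitioner.
    return sum(key.encode()) % NUM_PARTITIONS

def salted(key, rng):
    # Append a random suffix so a single hot key becomes several distinct keys.
    return f"{key}_{rng.randrange(NUM_SALTS)}"

rng = random.Random(0)
keys = ["hot"] * 1000 + ["a", "b", "c"]  # one key dominates the data

before = Counter(partition_for(k) for k in keys)
after = Counter(partition_for(salted(k, rng)) for k in keys)

print("largest partition before salting:", max(before.values()))
print("largest partition after salting: ", max(after.values()))
```

When the salted key is used in a join, the other side of the join has to be expanded with every possible salt value so the keys still match.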
What does the YEAR() function in SQL do?
It extracts the year from a date value.
What does a pivot table do?
It turns row-level data into columnar data: the distinct values of one field become columns, usually with an aggregate in each cell.
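A small sketch of what a pivot does, written in plain Python so the mechanics are visible. The `pivot` helper, the field names, and the sales figures are made up for the example, and summing is just one possible aggregate.

```python
from collections import defaultdict

rows = [
    {"region": "East", "quarter": "Q1", "sales": 100},
    {"region": "East", "quarter": "Q2", "sales": 150},
    {"region": "West", "quarter": "Q1", "sales": 80},
    {"region": "West", "quarter": "Q2", "sales": 120},
    {"region": "East", "quarter": "Q1", "sales": 50},
]

def pivot(rows, index, columns, values):
    """One output row per `index` value, one column per `columns` value, cells summed."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[index]][row[columns]] += row[values]
    return {k: dict(v) for k, v in table.items()}

result = pivot(rows, index="region", columns="quarter", values="sales")
print(result)
# {'East': {'Q1': 150, 'Q2': 150}, 'West': {'Q1': 80, 'Q2': 120}}
```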
What is the default SQL join?
An inner join.
How does inner join work?
It returns only the rows from Table A that have a matching identifier in Table B.
How does a left outer join work?
It returns every row in Table A regardless of whether there is a match in Table B. Table B columns are filled in only where there is a match; otherwise they are NULL.
How does a right outer join work?
It returns every row in Table B regardless of whether there is a match in Table A. Table A columns are filled in only where there is a match; otherwise they are NULL. Opposite of a left join.
How does a full outer join work?
All rows from Table A and Table B are returned. Matching records have values from both tables; non-matching records have NULLs for the other table's columns.
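The join types above can be compared side by side with an in-memory SQLite database. This is only a sketch: the tables `a` and `b` and their rows are invented, and because older SQLite versions lack RIGHT and FULL OUTER JOIN, the right join is written as a left join with the tables swapped and the full outer join is assembled in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, city TEXT);
    INSERT INTO a VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO b VALUES (2, 'Oslo'), (3, 'Rome'), (4, 'Lima');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Inner join: only ids present in both tables (2 and 3).
inner = run("SELECT a.name, b.city FROM a JOIN b ON a.id = b.id ORDER BY a.id")

# Left outer join: every row of a, with NULL (None) where b has no match.
left = run("SELECT a.name, b.city FROM a LEFT JOIN b ON a.id = b.id ORDER BY a.id")

# Right outer join, written as a left join with the tables swapped.
right = run("SELECT a.name, b.city FROM b LEFT JOIN a ON a.id = b.id ORDER BY b.id")

# Full outer join: all left-join rows plus the rows that exist only in b.
full = left + [row for row in right if row[0] is None]

print(inner)  # [('Bob', 'Oslo'), ('Cy', 'Rome')]
print(left)   # [('Ann', None), ('Bob', 'Oslo'), ('Cy', 'Rome')]
print(right)  # [('Bob', 'Oslo'), ('Cy', 'Rome'), (None, 'Lima')]
print(full)   # [('Ann', None), ('Bob', 'Oslo'), ('Cy', 'Rome'), (None, 'Lima')]
```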
What does Regex do?
It matches text against a pattern.
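What is the PostgreSQL regex operator for a case-insensitive match?
~*
What is the PostgreSQL regex operator for a case-sensitive match?
~
What are the PostgreSQL regex operators for "does not match"?
!~ (case-sensitive) and !~* (case-insensitive)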
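For comparison, the same four behaviors can be approximated with Python's `re` module. The helper names are made up for the example, and Python's regex flavor differs from the database's in some details, so treat this as an analogy for simple patterns only.

```python
import re

def matches(text, pattern):
    # Case-sensitive match, analogous to: text ~ pattern
    return re.search(pattern, text) is not None

def imatches(text, pattern):
    # Case-insensitive match, analogous to: text ~* pattern
    return re.search(pattern, text, re.IGNORECASE) is not None

title = "Data Engineer"
print(matches(title, "^data"))       # False
print(imatches(title, "^data"))      # True
print(not matches(title, "^data"))   # True, analogous to !~
print(not imatches(title, "^data"))  # False, analogous to !~*
```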
In Git, how do I get files from the repository to my local workspace?
git pull
How would you initialize a new Git repository?
git init
What does git config do?
Sets configuration values for user info and aliases.
How do you clone or download a repository from an existing URL?
git clone
What does git status do?
It shows the state of your working directory and staging area: which files are modified, staged, or untracked. It only looks at your local repository.
How do you view commit logs in git?
git log
What does git branch do?
With no arguments, it lists your local branches.
How would you create a new branch?
git branch newBranchName
How do you switch branches?
git checkout branchname
How do you create a new branch and switch to it?
git checkout -b branchname
How do you delete a branch?
git branch -d branchname
How do you push your changes to the remote repository?
git push
What does git pull do?
It fetches changes from a remote repository branch and merges them into the current local branch.
What is a transition action in S3?
It is used to move objects from one storage class to another.
What are expiration actions in S3?
They are used to configure object expiration, so that objects are deleted after a set period of time.
Can lifecycle rules be created based on tags or prefixes?
Yes, on both.
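To make the lifecycle cards concrete, here is a sketch of a lifecycle configuration with one prefix-filtered rule and one tag-filtered rule. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; the rule IDs, prefix, tag, and day counts are hypothetical, and nothing here is sent to AWS.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # rule scoped by prefix
            # Transition actions: move objects to cheaper storage classes.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration action: delete objects after a year.
            "Expiration": {"Days": 365},
        },
        {
            "ID": "expire-temp-objects",       # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Tag": {"Key": "temp", "Value": "true"}},  # rule scoped by tag
            "Expiration": {"Days": 7},
        },
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```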
What is the storage class hierarchy for S3 (the order in which lifecycle transitions can move objects)?
Standard
Standard-IA
Intelligent-Tiering
One Zone-IA
Glacier Instant Retrieval
Glacier Flexible Retrieval
Glacier Deep Archive
What does S3 analytics do?
Helps you decide when to transition objects to the right storage class.
What are the targets for S3 event notifications?
Lambda, SNS, SQS, and EventBridge.