Data Engineering Fundamentals Flashcards
What is Avro?
A row-oriented binary storage format that stores the schema alongside the data.
What is Parquet?
Columnar storage optimized for analytics.
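To make the row versus column distinction concrete, here is a conceptual sketch in plain Python. It does not use any real Avro or Parquet library; the records and field names are made up for illustration.

```python
# Hypothetical records used only for illustration.
records = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 25.5},
    {"id": 3, "city": "Oslo", "amount": 7.25},
]

# Row-oriented layout (the Avro idea): each record is stored whole, one after another.
row_layout = [tuple(r.values()) for r in records]

# Column-oriented layout (the Parquet idea): all values of one field are stored together.
column_layout = {key: [r[key] for r in records] for key in records[0]}

# An analytical query such as "total amount" only needs to read one column.
total_amount = sum(column_layout["amount"])
print(total_amount)
```

Row layouts suit writing and reading whole records; column layouts suit scans that touch only a few fields.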
What does random sampling do?
It gives every member of the population an equal chance of being selected.
What is stratified sampling?
It splits the population into subgroups (strata) and samples from each one, which ensures every subgroup is represented.
What is systematic sampling?
Selecting every Nth item from an ordered list, typically starting from a randomly chosen position.
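The three sampling approaches on the cards above can be sketched with Python's standard library. This is a minimal illustration under assumed data: the population, the group labels, and the function names are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical population: 100 items, each tagged with a subgroup.
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def simple_random_sample(items, k):
    """Every item has the same chance of being picked."""
    return rng.sample(items, k)

def stratified_sample(items, key, fraction):
    """Sample the same fraction from each subgroup (at least one per subgroup)."""
    strata = defaultdict(list)
    for item in items:
        strata[item[key]].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(items, step):
    """Pick every step-th item after a random starting offset."""
    start = rng.randrange(step)
    return items[start::step]

random_picks = simple_random_sample(population, 10)
stratified_picks = stratified_sample(population, "group", 0.1)
systematic_picks = systematic_sample(population, 10)
```

With a 10% fraction, the stratified sample always contains 8 items from group A and 2 from group B. The simple random sample could, by chance, contain no group B items at all.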
What is data skew?
An uneven distribution of data across partitions, so some partitions hold far more records than others.
What can be done to address data skew?
Adaptive partitioning
Salting
Repartitioning
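Of the techniques listed, salting is the easiest to show in a few lines. The sketch below is engine-agnostic and rests on assumptions: the partition count, the salt range, the key names, and the CRC32-based partitioner are all stand-ins, not any particular framework's behavior.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count
NUM_SALTS = 4       # hypothetical salt range; tune to the degree of skew

def partition_for(key):
    """Deterministic hash partitioner, a stand-in for an engine's own partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Skewed input: one hot key dominates the data set.
keys = ["hot_customer"] * 1000 + ["cust_a", "cust_b", "cust_c"] * 10

# Without salting, every "hot_customer" record hashes to the same partition.
unsalted_counts = Counter(partition_for(k) for k in keys)

# Salting: append a rotating suffix so one logical key becomes several physical keys.
salted_keys = [f"{k}_{i % NUM_SALTS}" for i, k in enumerate(keys)]
salted_counts = Counter(partition_for(k) for k in salted_keys)

print("unsalted:", dict(unsalted_counts))
print("salted:  ", dict(salted_counts))
```

After salting, the hot key's records are hashed under four different physical keys, so they can land on different partitions; how evenly they spread depends on the hash. In a join, the other side must be duplicated once per salt value, and in an aggregation the partial results must be combined in a second step.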
What does the YEAR() function in SQL do?
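It extracts the year from a date value. It is available in dialects such as MySQL and SQL Server; standard SQL uses EXTRACT(YEAR FROM date_value).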
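For comparison, the same idea in Python is the `year` attribute of a `datetime.date`. The dates below are hypothetical sample values; the grouping step shows a common reason for extracting the year in the first place.

```python
from collections import Counter
from datetime import date

# Hypothetical order dates.
order_dates = [date(2023, 12, 31), date(2024, 1, 1), date(2024, 7, 15)]

# Equivalent of selecting the year of each date.
years = [d.year for d in order_dates]

# Equivalent of grouping by year and counting rows.
orders_per_year = Counter(years)
print(orders_per_year)
```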
What does a pivot table do?
It turns row-level data into columns: the distinct values of one field become column headers, and the cells hold aggregated values.
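A pivot can be sketched in plain Python without a spreadsheet or a dataframe library. The sales rows, region names, and the choice of sum as the aggregate are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical row-level sales data: one row per (region, quarter, amount).
rows = [
    ("North", "Q1", 100),
    ("North", "Q2", 150),
    ("South", "Q1", 80),
    ("North", "Q1", 20),
    ("South", "Q2", 60),
]

# Pivot: regions become rows, quarters become columns, amounts are summed.
pivot = defaultdict(lambda: defaultdict(int))
for region, quarter, amount in rows:
    pivot[region][quarter] += amount

quarters = sorted({q for _, q, _ in rows})
table = {region: [pivot[region][q] for q in quarters] for region in sorted(pivot)}

print("region", quarters)
for region, values in table.items():
    print(region, values)
```

The two North/Q1 rows collapse into a single cell holding 120, which is the aggregation step that distinguishes a pivot from a plain reshape.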
What is the default SQL join?
An inner join. A bare JOIN keyword is treated as INNER JOIN.
How does inner join work?
It returns only the rows from Table A that have a matching identifier in Table B, combined with the matching rows from Table B.
How does a left outer join work?
It returns every row in Table A regardless of whether there is a match in Table B. Columns from Table B are populated only where a match exists; otherwise they are NULL.
How does a right outer join work?
It returns every row in Table B regardless of whether there is a match in Table A. Columns from Table A are populated only where a match exists; otherwise they are NULL. It is the mirror image of a left outer join.
How does a full outer join work?
Every row from both Table A and Table B is returned. Matching rows are combined; unmatched rows have NULL in the columns that come from the other table.
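The four join types can be compared side by side with a small Python sketch. It assumes each id appears at most once per table (real SQL joins can produce several rows per key), and the tables, names, and `join` helper are hypothetical. `None` plays the role of SQL NULL.

```python
# Hypothetical tables keyed by customer id.
table_a = {1: "Alice", 2: "Bob", 3: "Cara"}        # customers
table_b = {2: "Laptop", 3: "Phone", 4: "Tablet"}   # orders by customer id

def join(a, b, how):
    """Return (id, a_value, b_value) rows; None stands in for SQL NULL."""
    if how == "inner":
        keys = a.keys() & b.keys()   # ids present in both tables
    elif how == "left":
        keys = a.keys()              # every id from Table A
    elif how == "right":
        keys = b.keys()              # every id from Table B
    elif how == "full":
        keys = a.keys() | b.keys()   # every id from either table
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, a.get(k), b.get(k)) for k in sorted(keys)]

inner_rows = join(table_a, table_b, "inner")
left_rows = join(table_a, table_b, "left")
right_rows = join(table_a, table_b, "right")
full_rows = join(table_a, table_b, "full")

for name, result in [("inner", inner_rows), ("left", left_rows),
                     ("right", right_rows), ("full", full_rows)]:
    print(name, result)
```

Id 1 exists only in Table A and id 4 only in Table B, so they show which rows each join type keeps and where the NULLs appear.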
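What does regex (a regular expression) do?
It matches text against a pattern, which lets you search, validate, or extract parts of a string.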
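A short example using Python's built-in `re` module shows both searching and extracting. The log line and the patterns are made up for illustration.

```python
import re

# Hypothetical log line.
log_line = "2024-03-15 ERROR user=alice code=500"

# Pattern: four digits, dash, two digits, dash, two digits (an ISO-style date).
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

match = date_pattern.search(log_line)
year, month, day = match.groups()

# findall returns every key=value pair as a (key, value) tuple.
pairs = re.findall(r"(\w+)=(\w+)", log_line)

print(year, month, day)
print(pairs)
```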