Chapter 3 - Databases and Data Acquisition Flashcards
What does an entity contain? What kind of database does it relate to?
An entity contains data about a single subject. Think of it as a noun that describes a person, place or thing. Used in relational databases.
What’s an entity instance?
It’s like a single record or instances of an entity subject. Like picking out one record in a table about pets.
What is the name given to describe the relationship between two entities?
Cardinality
Which types of entity relationships are comparatively complex and rare?
Unary and Ternary.
What is the difference between an entity instance and an entity attribute?
An entity attribute is akin to the column heading in a table. An entity instance is akin to a single record in a table, for example, the record about one particular customer. The record contains multiple entity attributes.
Entities become a separate _____ in a database
table
What two things together link data between two tables?
An associative table and a foreign keys
the F-keys link to the primary keys in the two related tables
A foreign key helps to enforce what?
foreign key enforces referential integriy, or how consistent the data is in related tables
What describes the modifications necessary to to create tables for storing data?
A schema
A table that contains a primary key in more than one column has a primary key called what?
A composite primary key
You need to store and retrieve simple data quickly that doesn’t have a lot of structure, which of the two non-relational databases would you choose?
Key-value or Column-family
Key-value
Name 2 high level key benefits of a column-family database?
1) DYNAMIC/FLEXIBLE - Column data can vary dynamically between key values
2) ABILITY TO SCALE - can process vast amounts of data
What kind of database is best at exploring relationships between pieces of data?
Graph databases
What database type do the below relate to and what are they?
1) NODES -
2) PROPERTIES -
3) RELATIONSHIPS -
GRAPH DATABASES
2) 1) NODES - these are the equivalent of an entity instance; a person or a thing.
2) PROPERTIES - these store attributes about a node
3) RELATIONSHIPS - these are arrows connecting nodes denoting their relationship to each other
What kind of database is often behind website features like related or RECOMMENDED items?
Graph database.
What database process is employed to reduce data redundancy?
Normalization
List the 5 rules of 1NF
1) KEY - a table without a primary key is not permitted
2) GROUPS - Repeating groups are not permitted
3) ORDER - Row order is not permitted
4) DATATYPES - Mixing of datatypes in a column is not permitted
5) VALUES - non-atomic values are not permitted
What level of normalization does this statement describe? “every attribute in a table should depend on the key, the whole key. and nothing but the key”?
3NF - third normal form
it doesn’t use a schema
it stores data in raw format
it requires specialist knowledge (data scientists) about the data to operate
a datalake
a subset of a datawarehouse that is data that serves a particular part or subject within the organization as opposed to data that serves the entire organziation
A Datamart
List 2 key features of the Star Schema design pattern
1) Denormalized to improve read performance over large datasets
2) Uses a central fact table with dimension tables as spokes.
Regarding OLAP databases, which schema design pattern would you pick if storage was a concern?
Snowflake schema
Regarding design patterns, which design pattern does a datawarehouse usually use and why and which design pattern does a datamart usually use and why?
1) Datawarehouses usually use snowflake schema design patterns because they’re supplied from many differences sources.
2) Datamarts usually use the Star schema because they represent a single data subject area.
When designing a dimension table, it is crucial to understand what to help you build it?
The questions that analysts will be asking about the the business or subject?
A dimension table provides additional context to a fact table. These could be information about people, or the price of a product. The data may change over time and you’ll need to add additional attributes to the table. Regardless of how quickly the data changes, what must you consider about the data when designing dimension tables?
whether you need to store the history/indicate whether the attribute is current or not
what are 3 approaches mentioned by the author to handle the time dimension in tables
1) using a start date / end date for the attribute
2) indicator flag (shows what data is current)
3) effective date - date the attribute became effective
you must consider the historic requirements of the data
Name the main disadvantage of using effective date approach in creation of dimension tables
Itcomplicates queries because you need to calculate date math
when you ask the question “what was the price on x date” it needs to calculate between the effective dates
What’s the key difference between ETL and ELT and what technology do they use to do it?
1) ETL transforms data before it enters the warehouse and often uses Python
2) ELT transforms data after it enters the warehouse and typically uses SQL
Describe initial load, delta load and incremental load and batch window
1) initial load is the first load of data into the DWH
2) the delta loads are subsequent loads after the initial load
3) incremental loads are the same as delta loads
4) batch window is the time period in which you have to extract data.
ETL and ELT are associated with what data acquisition concept?
Integration
Regarding data-collection, Qualtrics is an application you can use in what type of data collection method?
conducting Surveys
when acquiring qualitative data using observation, what should you be mindful of?
1) accounting for subconscious action
2) bias - both in the observer and observee
you must develop methods for observation to mitigate the two
What lets you actually create a database ready to accept data from a database design?
SQL’s Data Definition Language
What’s the difference between DDL and DML?
1) DDL is about creating database tables and objects
2) DML is about querying, altering and retrieving data from the database table
What are the four CRUD data manipulation actions and their association SQL Keywords?
1) Create - INSERT
2) Read - SELECT
3) Update - UPDATE
4) Delete DELETE
Which logical function could you use to convert a given value into a different name of your choice to help divide data into categories?
For example, you want to change the M and F letters in the Sex column to Boys and Girls.
The IFF function
IFF(boolean_expression, true_value, false_value)
List the 6 aggregate functions
1) Max
2) Min
3) Count
4) Sum
5) AVG
6) STDD
Preventing injection attacks and reducing several queries into a single query are benefits of what?
Parameterization
Name the key benefit of database indexes and one drawback
1) BENEFIT - they improve query speed
2) DRAWBACK - it slows down create, update and delete activity
You want to explore historic sales data. The sales history table is huge. What’s the issue with this and what solution could you emply to mitigate it?
Querying a very large table is inefficient.
It’s more efficient to create a temporary table using a SUBSET of data from the main table
What would you look at to troubleshoot query performance?
The execution plan
What should you look at to ensure you’re developing an efficient query?
the execution plan
Snowflake and Star schemas are also known as what label?
Multidimensional Schemas
Generally speaking, Fact tables contain ______ data and dimension tables contain _____________ data?
1) fact tables = quantitative data
2) dimension tables = qualitative data
Regarding data acquisition. ETL or ELT is associated with sourcing data from where?
a transactional system
What are the key differences between ETL and ELT when it comes to how they transform the data?
1) ETL uses external technology to transform the data
2) ELT uses the database itself to transform the data
What’s the name given to the concept of moving and combining data from different sources into a a data warehouse or datalake?
Data Integration
In data integration, API’s return transaction data as what? And return bulk data as what?
1) JSON files
2) CSVs
What were the 4 approaches mentioned in the book that mitigate performance issues for query optimization
1) Parameterization
2) Using data subsets and temp tables
3) Review execution plan with DB admin
4) Applying a database index
Why is adding a time dimension to star of snowflake tables a good idea?
It allows handling of current and historical data
which data collection method is good for validating business processes?
Observation
if you’re doing just ad-hoc analysis, what would you utilize?
data subsets and use temporary tables
Star and Snowflake Schemas are designed for what kind of database?
OLAP databases
Why is the the denormalized nature of OLAP databases result in quicker read performance?
Queries are simpler and data reads faster due to fewer table joins required due to the hub and spoke nature of the schema.
what determines the frequency of ETL data loads?
Business requirements
What defines the batch window for loading data into the datawarehouse?
Between the last transaction of the day and the first transaction of the following day
__________ translates the human-readable SQL into code the database understands
parsing
____________ reduces the number of times the database has to parse individual queries
parameterization
Large tables without indexes are inefficient. Applying a database index helps speed queries up. Ideally you want to include all columns in the index, but if you can’t, what should should you ensure when writing your select statement?
That the column you’re selecting is covered by an index
while indexing improves read speed of a query, it slows down what 3 activites?
create, update and delete
the indexing strategy needs to match the type of database, i.e. transactional or reporting
You find that a table where a non-key attribute is dependent on another non-key attribute. What form does this violate?
3NF
When the primary key of a table is indirectly related to a non-key attribute through a key-attribute, what violation is this called?
transitive violation
because the key is related to a non-key attribute indirectly.
this violates 3NF
Match the type of normalization to the statements below:
1) All non-key attributes must depend on the key
2) All non-key attributes must depend on the whole key and nothing else!
1) All non-key attributes must depend on the key - 2NF
2) All non-key attributes must depend on the whole key and nothing else! - 3NF
You may have multiple applications in your business that need to request the same data, each using the same SQL calls to the database. What issue arises from this and what method can make this more efficient?
Issue - SQL code has to be maintained in multiple places.
Solution - implement a single API call for the specific data needed from the database. All applications can call the API instead of maintaining their own SQL.
Frosty wants to use an web service enabled API. What must they obtain/purchase before they can use it?
an API key
Webscrapers use Python / R or webscrapers use API’s?
Webscrapers use Python / R
re: data collection/acquisition - the act of collecting primary source data from people or machines
is known as Observation
Why is the snowflake schema more efficient with data storage than star schema?
because it is more normalized, reducing data duplication
Fred is looking at where data comes from, how frequently it changes, and how long it needs to persist. What is he doing?
Considering the life-cycle of data
for managing data in a datawarehouse or datamart.
Which method of handling time dimensionality keeps all pricing data in a single place (as opposed to splitting it out into a different table) and avoids using date math to get the current data?
Indicator flag method
Which is the best choice if you need to get massive amounts of data from a transactional system into an OLAP system? ETL or ELT?
ELT
you extract data from the source system and place it into a staging area as flat file. What Integration approach is this?
ETL
what should take into account when designing the delta load strategy?
the batch window
slowly changing dimension - keeps current and historic
adds column to show previous value
Type 3
Slowly changing dimension - keep current state
Type 1
slowly changing dimension - keeps current and hisotrical changes
tracks change on new row
Type 2
Which symbol denotes the end of an SQL statement?
Semi-colon “;”
which key enforced referencial integrity?
Foreign key
s a set of protocols within a computer system that allows two unrelated systems to communicate.
cm
API
is an API that allows a hosted computer on a network to share data with a computer in the same hosted environment.
cm
web service API
DO NOT AGREE!