Chapter 3 - Databases and Data Acquisition Flashcards
What does an entity contain? What kind of database does it relate to?
An entity contains data about a single subject. Think of it as a noun that describes a person, place or thing. Used in relational databases.
What’s an entity instance?
It’s like a single record or instances of an entity subject. Like picking out one record in a table about pets.
What is the name given to describe the relationship between two entities?
Cardinality
Which types of entity relationships are comparatively complex and rare?
Unary and Ternary.
What is the difference between an entity instance and an entity attribute?
An entity attribute is akin to the column heading in a table. An entity instance is akin to a single record in a table, for example, the record about one particular customer. The record contains multiple entity attributes.
Entities become a separate _____ in a database
table
What two things together link data between two tables?
An associative table and a foreign keys
the F-keys link to the primary keys in the two related tables
A foreign key helps to enforce what?
foreign key enforces referential integriy, or how consistent the data is in related tables
What describes the modifications necessary to to create tables for storing data?
A schema
A table that contains a primary key in more than one column has a primary key called what?
A composite primary key
You need to store and retrieve simple data quickly that doesn’t have a lot of structure, which of the two non-relational databases would you choose?
Key-value or Column-family
Key-value
Name 2 high level key benefits of a column-family database?
1) DYNAMIC/FLEXIBLE - Column data can vary dynamically between key values
2) ABILITY TO SCALE - can process vast amounts of data
What kind of database is best at exploring relationships between pieces of data?
Graph databases
What database type do the below relate to and what are they?
1) NODES -
2) PROPERTIES -
3) RELATIONSHIPS -
GRAPH DATABASES
2) 1) NODES - these are the equivalent of an entity instance; a person or a thing.
2) PROPERTIES - these store attributes about a node
3) RELATIONSHIPS - these are arrows connecting nodes denoting their relationship to each other
What kind of database is often behind website features like related or RECOMMENDED items?
Graph database.
What database process is employed to reduce data redundancy?
Normalization
List the 5 rules of 1NF
1) KEY - a table without a primary key is not permitted
2) GROUPS - Repeating groups are not permitted
3) ORDER - Row order is not permitted
4) DATATYPES - Mixing of datatypes in a column is not permitted
5) VALUES - non-atomic values are not permitted
What level of normalization does this statement describe? “every attribute in a table should depend on the key, the whole key. and nothing but the key”?
3NF - third normal form
it doesn’t use a schema
it stores data in raw format
it requires specialist knowledge (data scientists) about the data to operate
a datalake
a subset of a datawarehouse that is data that serves a particular part or subject within the organization as opposed to data that serves the entire organziation
A Datamart
List 2 key features of the Star Schema design pattern
1) Denormalized to improve read performance over large datasets
2) Uses a central fact table with dimension tables as spokes.
Regarding OLAP databases, which schema design pattern would you pick if storage was a concern?
Snowflake schema
Regarding design patterns, which design pattern does a datawarehouse usually use and why and which design pattern does a datamart usually use and why?
1) Datawarehouses usually use snowflake schema design patterns because they’re supplied from many differences sources.
2) Datamarts usually use the Star schema because they represent a single data subject area.
When designing a dimension table, it is crucial to understand what to help you build it?
The questions that analysts will be asking about the the business or subject?
A dimension table provides additional context to a fact table. These could be information about people, or the price of a product. The data may change over time and you’ll need to add additional attributes to the table. Regardless of how quickly the data changes, what must you consider about the data when designing dimension tables?
whether you need to store the history/indicate whether the attribute is current or not
what are 3 approaches mentioned by the author to handle the time dimension in tables
1) using a start date / end date for the attribute
2) indicator flag (shows what data is current)
3) effective date - date the attribute became effective
you must consider the historic requirements of the data
Name the main disadvantage of using effective date approach in creation of dimension tables
Itcomplicates queries because you need to calculate date math
when you ask the question “what was the price on x date” it needs to calculate between the effective dates
What’s the key difference between ETL and ELT and what technology do they use to do it?
1) ETL transforms data before it enters the warehouse and often uses Python
2) ELT transforms data after it enters the warehouse and typically uses SQL
Describe initial load, delta load and incremental load and batch window
1) initial load is the first load of data into the DWH
2) the delta loads are subsequent loads after the initial load
3) incremental loads are the same as delta loads
4) batch window is the time period in which you have to extract data.
ETL and ELT are associated with what data acquisition concept?
Integration
Regarding data-collection, Qualtrics is an application you can use in what type of data collection method?
conducting Surveys
when acquiring qualitative data using observation, what should you be mindful of?
1) accounting for subconscious action
2) bias - both in the observer and observee
you must develop methods for observation to mitigate the two