Deck 1 Flashcards

Question 1

Q

Define a Star Schema

Answer

A

A data warehouse design consisting of one or more fact tables and a single table for each dimension. The dimension keys in the fact table are connected to the dimension tables through primary key and foreign key relationships.
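
As a minimal sketch of this definition, the following uses Python's built-in sqlite3 module to build a tiny star schema and query it. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per dimension (hypothetical names).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds a foreign key to each dimension plus a measure.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.5), (2, 10, 3.0);
""")

# A typical star-schema query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
```

Each dimension is reached from the fact table with a single join, which is what gives the schema its star shape.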

Question 2

Q

What is a Fact Table?

Answer

A

A fact table is a table of keys and measurements: foreign keys that reference the dimension tables, plus the numeric measures being analyzed.

Question 3

Q

Define a Snowflake Schema

Answer

A

Similar to the star schema, but it normalizes the dimensions, which splits their data into additional tables. The splitting reduces redundancy and saves storage space. A snowflake schema is easy to manage but more complex to design and understand.
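
A minimal sketch of the normalization step, again using Python's built-in sqlite3 module. The category attribute is moved out of the product dimension into its own sub-dimension table. All table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sub-dimension: category attributes live in their own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    -- Dimension: references the sub-dimension instead of repeating its text.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES (1, 'Widget', 1), (2, 'Gadget', 1), (3, 'App', 2);
    INSERT INTO fact_sales VALUES (1, 5.0), (2, 3.0), (3, 9.0), (1, 2.0);
""")

# Reaching the category name now takes two joins instead of one.
totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(totals)
```

The category name is stored once per category rather than once per product, at the cost of an extra join in the query.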

Question 4

Q

7 Key differences between Star and Snowflake Schema

Answer

A

A star schema contains just one dimension table per dimension, whereas a snowflake schema may have a dimension table and sub-dimension tables for one dimension.
A snowflake schema uses normalization, which eliminates data redundancy. A star schema is not normalized, which results in data redundancy.
A star schema is simple, easy to understand, and involves less intricate queries. A snowflake schema is harder to understand and involves more complex queries.
The data modeling approach used in a star schema is top-down, whereas a snowflake schema uses bottom-up.
A star schema uses fewer joins. A snowflake schema uses a larger number of joins.
A star schema consumes more space than a snowflake schema.
Queries against a star schema take less time to execute. A snowflake schema takes more time because of the additional joins.

Question 5

Q

What is a deadlock?

Answer

A

A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is waiting for another transaction in the set. None of the transactions can make progress in such a situation. The only remedy for this undesirable condition is for the system to take some drastic action, such as rolling back some of the transactions involved in the deadlock. There are two methods for dealing with deadlock:
1. Deadlock prevention
2. Deadlock detection and recovery
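
One common way to illustrate deadlock detection is a wait-for graph, in which a cycle corresponds to a set of transactions that are all waiting on each other. The sketch below assumes the graph is given as a plain dictionary. The function name find_deadlock and the transaction names T1, T2, T3 are hypothetical.

```python
def find_deadlock(wait_for):
    """Return one cycle of mutually waiting transactions, or None if there is none.

    wait_for maps each transaction to the transactions it is waiting on.
    """
    visiting, done = set(), set()
    path = []

    def visit(txn):
        visiting.add(txn)
        path.append(txn)
        for other in wait_for.get(txn, ()):
            if other in visiting:
                # Back edge found: the cycle is the path from `other` onward.
                return path[path.index(other):]
            if other not in done:
                cycle = visit(other)
                if cycle:
                    return cycle
        visiting.discard(txn)
        done.add(txn)
        path.pop()
        return None

    for txn in list(wait_for):
        if txn not in done:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None


# Hypothetical wait-for graph: T1 waits on T2, T2 on T3, T3 on T1.
deadlocked = find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]})
# Same transactions, but T3 waits on nobody, so there is no cycle.
no_deadlock = find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": []})
print(deadlocked, no_deadlock)
```

A recovery step would then pick one transaction from the returned cycle as the victim and roll it back.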

Question 6

Q

What is the definition of GDPR?

Answer

A

The General Data Protection Regulation (GDPR) is a legal framework that sets guidelines for the collection and processing of personal information from individuals who live in the European Union (EU).

Question 7

Q

What is the definition of CCPA?

Answer

A

The California Consumer Privacy Act (CCPA) is a comprehensive consumer protection law that took effect on January 1, 2020. In the wake of the CCPA’s passage, approximately 15 other states introduced their own CCPA-like privacy legislation, and similar proposals have been considered at the federal level.

Question 8

Q

What is a ‘Dirty Read’?

Answer

A

A dirty read (aka uncommitted dependency) occurs when a transaction is allowed to read data from a row that has been modified by another running transaction and not yet committed.
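
The toy simulation below illustrates the idea only. It is not how a real database engine implements isolation, and the ToyStore class and its methods are hypothetical.

```python
class ToyStore:
    """Toy single-value store with one in-flight writer (not a real database)."""

    def __init__(self, value):
        self.committed = value
        self.pending = None  # uncommitted write from a running transaction

    def write(self, value):
        self.pending = value

    def commit(self):
        if self.pending is not None:
            self.committed, self.pending = self.pending, None

    def rollback(self):
        self.pending = None

    def read(self, allow_dirty):
        # A dirty read returns the uncommitted value if one exists.
        if allow_dirty and self.pending is not None:
            return self.pending
        return self.committed


store = ToyStore(100)
store.write(50)                        # transaction A modifies the row, no commit yet
dirty = store.read(allow_dirty=True)   # transaction B sees the uncommitted 50
clean = store.read(allow_dirty=False)  # without dirty reads, B still sees 100
store.rollback()                       # A aborts, so the 50 that B saw never existed
after = store.read(allow_dirty=True)
print(dirty, clean, after)
```

The danger shows up after the rollback: the reader that allowed dirty reads acted on a value that was never committed.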

Question 9

Q

What is a Galaxy Schema?

Answer

A

A Galaxy Schema contains two or more fact tables that share dimension tables. It is also called a Fact Constellation Schema. The schema is viewed as a collection of stars, hence the name Galaxy Schema.

Question 10

Q

What is a Star Cluster Schema?

Answer

A

A star cluster schema combines attributes of the star schema and the snowflake schema.

Question 11

Q

What is Cardinality?

Answer

A

In the context of an entity-relationship diagram (ERD), cardinality refers to the number of instances of one entity that are allowed or required in a relationship with another entity, that is, how many rows of one entity can or must be linked to a row of another entity.

Two types:
Minimum - The minimum number of instances that are required in the relationship.
Maximum - The maximum number of instances that are allowed in the relationship.

Question 12

Q

What is Network Density?

Answer

A

The “Network Density” metric is commonly calculated as the number of actual connections divided by the number of possible connections. There are 9 actual connections and 56 possible connections in the example data, resulting in a Network Density value of 9 / 56 = .1607, which, depending on the context, could be considered low or high.
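
The arithmetic can be sketched as below. The figure of 56 possible connections is consistent with 8 members and directed connections (8 × 7 = 56); that reading is an assumption, as are the function names.

```python
def possible_connections(node_count, directed=True):
    """Number of possible connections between distinct nodes."""
    pairs = node_count * (node_count - 1)
    return pairs if directed else pairs // 2


def network_density(actual, possible):
    """Actual connections divided by possible connections."""
    return actual / possible


possible = possible_connections(8)       # assumes 8 members, directed connections
density = network_density(9, possible)   # 9 actual connections from the card
print(possible, round(density, 4))
```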

Question 13

Q

What is Network Centralization?

Answer

A

The “Network Centralization” metric tells us how “centered” the network is around the member(s) of the network with the highest number of connections. In a network with three members, this metric is of little value, but in a network with thousands or millions of connections, knowing the person or people the network is centralized around is meaningful to our understanding of the network. In the data driving my implementation, Jane is involved in four of the nine transactions, which would commonly be calculated as (4 / 9) = .444. This would be considered a high value in most cases, so you could say that the total network is highly centralized (around Jane).
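
A minimal sketch of the simple ratio used in this card (other, more formal definitions of centralization exist). The transaction list is hypothetical and built only so that Jane appears in four of nine transactions; every name other than Jane, Sally, Roger, and Ken is invented.

```python
from collections import Counter


def centralization(transactions):
    """Return the most-involved member and their share of all transactions."""
    involvement = Counter()
    for buyer, seller in transactions:
        # Count each member once per transaction they take part in.
        involvement.update({buyer, seller})
    member, count = involvement.most_common(1)[0]
    return member, count / len(transactions)


# Hypothetical data: nine transactions, four of which involve Jane.
transactions = [
    ("Jane", "Sally"), ("Jane", "Ann"), ("Bob", "Jane"), ("Cara", "Jane"),
    ("Ann", "Ken"), ("Bob", "Roger"), ("Sally", "Dev"), ("Cara", "Ken"),
    ("Dev", "Roger"),
]
member, share = centralization(transactions)
print(member, round(share, 3))
```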

Question 14

Q

What is Network Homophily?

Answer

A

The “Network Homophily” metric describes the degree to which connected nodes share similar characteristics, i.e. are connected nodes largely alike? The richer the source data is, the more important and interesting this metric becomes as the row count increases. This metric is of particular interest to marketers.

Question 15

Q

What is “In Degree”?

Answer

A

Switching to node-specific metrics: the “In Degree” metric is the count of incoming connections to a node from other nodes in the network. The “Out Degree” metric is the count of outgoing connections from a single node to other nodes in the network. These two metrics are often used to help analysts and marketers understand how “social” products within particular retail categories are with products in similar or different retail categories.
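
Both counts can be sketched from a list of directed connections. The edge list below is hypothetical; each pair is read as (source, target).

```python
from collections import Counter

# Hypothetical directed connections, written as (source, target).
edges = [("Sally", "Jane"), ("Roger", "Jane"), ("Jane", "Ken"), ("Jane", "Sally")]

out_degree = Counter(source for source, _ in edges)  # outgoing connections per node
in_degree = Counter(target for _, target in edges)   # incoming connections per node

print(in_degree["Jane"], out_degree["Jane"], in_degree["Roger"])
```

A Counter returns 0 for nodes it has never seen, so a node with no incoming connections reports an in degree of 0.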

Question 16

Q

What is “Betweenness”?

Answer

A

The “Betweenness” metric helps us understand how important a particular node is to the overall “performance” of the network from the perspective of a particular metric or class of metrics. The example data describes connections through “Sales”. If Sally and Roger had made huge sales to each other or to Jane, removing Jane from the network would lower the “total value” of the network because Roger and Sally are in the network by virtue of their relationships to Jane.
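
As a rough sketch, the standard shortest-path form of betweenness counts, for each node, the share of shortest paths between other pairs of nodes that pass through it. The code below assumes an undirected, unweighted network and ignores sales values. The three-member graph is hypothetical and mirrors the case where Sally and Roger are linked only through Jane.

```python
from collections import deque
from itertools import combinations


def bfs_counts(graph, source):
    """Return (distances, shortest-path counts) from source to each reachable node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                paths[neighbor] += paths[node]
    return dist, paths


def betweenness(graph):
    """Shortest-path betweenness for an undirected, unweighted graph."""
    info = {node: bfs_counts(graph, node) for node in graph}
    score = {node: 0.0 for node in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected at all
        for v in graph:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path when the distances add up exactly.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score


# Hypothetical network: Sally and Roger are connected only through Jane.
graph = {"Jane": ["Sally", "Roger"], "Sally": ["Jane"], "Roger": ["Jane"]}
scores = betweenness(graph)
print(scores)
```

Jane scores 1.0 because the only shortest path between Sally and Roger runs through her, while Sally and Roger score 0.0.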

Question 17

Q

What is “Closeness”?

Answer

A

The “Closeness” metric helps us understand how useful a given network member is for getting a message from outside the network circulated within the network as soon as possible. If an outside person wanted to circulate a message within the network described in the example data, the go-to person is Jane because she is directly connected (one hop away) to five other network members, who in turn are a hop away from the remaining network members (Roger and Ken).
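
One common way to compute closeness is the number of other reachable members divided by the sum of hop counts to them. The sketch below assumes an undirected, unweighted network. The edge list is hypothetical: it follows the shape described above (Jane one hop from five members, Roger and Ken two hops away), but the names Ann, Bob, Cara, and Dev and the specific links are invented.

```python
from collections import deque


def closeness(graph, source):
    """Reachable members divided by the total hop count from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical undirected connections.
edges = [("Jane", m) for m in ("Sally", "Ann", "Bob", "Cara", "Dev")]
edges += [("Ann", "Roger"), ("Bob", "Ken")]

graph = {}
for a, b in edges:
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

scores = {member: closeness(graph, member) for member in graph}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```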

Question 18

Q

What is “Eigenvector Centrality”?

Answer

A

The “Eigenvector Centrality” metric explains the degree to which a given node is connected to the most important nodes in the network. In a given network, an “introverted” member that has low “in degree” and “out degree” metrics and little or no “betweenness” or “closeness” could in fact be quite important due to its influence on members who are very well connected. If Jane is heavily influenced by Sally’s purchasing recommendations, Sally’s role in shaping the profile of the network is important given Jane’s position in the network as the most important buyer in the network.
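
A minimal power-iteration sketch, assuming an undirected, unweighted network. Each node's score is repeatedly replaced by its own score plus the sum of its neighbors' scores, then normalized. The graph is hypothetical: Sally and Ken each have a single connection, but Sally's is to the hub (Jane) while Ken's is to a peripheral member. The names Ann, Bob, and Cara are invented.

```python
def eigenvector_centrality(graph, iterations=200):
    """Approximate eigenvector centrality by power iteration."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Keep each node's own score and add its neighbors' scores. Including
        # the node's own score avoids oscillation on tree-like networks.
        updated = {
            node: scores[node] + sum(scores[n] for n in graph[node])
            for node in graph
        }
        norm = sum(value * value for value in updated.values()) ** 0.5
        scores = {node: value / norm for node, value in updated.items()}
    return scores


# Hypothetical undirected connections.
edges = [("Jane", "Sally"), ("Jane", "Ann"), ("Jane", "Bob"), ("Jane", "Cara"),
         ("Ann", "Roger"), ("Roger", "Ken")]

graph = {}
for a, b in edges:
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

scores = eigenvector_centrality(graph)
top = max(scores, key=scores.get)
print(top, scores["Sally"] > scores["Ken"])
```

Sally and Ken have the same number of connections, yet Sally ranks higher because her single connection is to the best-connected member.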