Data Management Flashcards

1
Q

What is a data model?

A

a simple representation that demonstrates how data structures, including their characteristics, relations, constraints, transformations form a database that solves a real-world business problem.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

business rule

A

a policy, procedure, or principle within a specific organization. business rules are used to define entities, attributes, relationships, and constraints.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

the basic features of the relational data model

A

entities, attributes, relationships

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Translate business rules into data models

A

a noun in a business rule will translate into an entity in the model, and a verb (active or passive) that associates the nouns will translate into a relationship among the entities

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

entity

A

An entity is a person, place, thing, concept, or event about which data will be collected and stored.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Attribute

A

An attribute is a characteristic of an entity. For example, a CUSTOMER entity would be described by attributes such as customer last name, customer first name, customer phone number, customer address, and customer credit limit.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Relationships

A

describes an association among entities. Data models use three types of relationships: one-to-many, many-to-many, and one-to-one. shorthand notations 1:M or 1..*, M:N or .., and 1:1 or 1..1, respectively.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Visualize One-to-many (1:M or 1..*) relationship.

A

An author has many books, but a book has one author 1:M AUTHOR publishes BOOKS
An invoice is created by one person, but a customer generates many invoices 1:M CUSTOMER generates INVOICES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Visualize Many-to-many (M:N or ..) relationship.

A

An employee may learn many job skills, and each job skill may be learned by many employees. “EMPLOYEE learns SKILL” as M:N.
A student can take many classes and each class can be taken by many students, M:N STUDENT takes CLASSES

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

One-to-one (1:1 or 1..1) relationship.

A

A retail company’s management structure may require that each of its stores be managed by a single employee. “EMPLOYEE manages STORE” is labeled 1:1.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Schema

A

The schema is the conceptual organization of the entire database as viewed by the database administrator.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Hierarchial model

A

The hierarchical model depicts a set of one-to-many (1:M) relationships between a parent and its children segments.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Network model

A

the user perceives the network database as a collection of records in 1:M relationships. However, unlike the hierarchical model, the network model allows a record to have more than one parent.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Subschema

A

The subschema defines the portion of the database “seen” by the application programs that actually produce the desired information from the data within the database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Relational model

A

A relation (sometimes called a table) as a two-dimensional structure composed of intersecting rows and columns. Each row in a relation is called a tuple. Each column represents an attribute.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

The Entity relationship model

A

The graphical representation of entities and their relationships in a database structure. ER models are normally represented in an entity relationship diagram (ERD)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Object-Oriented Model (OODM)

A

data and its relationships are contained in a single structure known as an object, forming the basis for the object-oriented database management system (OODBMS).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Why is an object said to have greater semantic content than an entity?

A

an object includes information about relationships between the facts within the object, as well as information about its relationships with other objects.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Class vs object

A

A class is a collection of similar objects with shared structure (attributes) and behavior (methods). An object is an abstraction of an entity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Extensible Markup Language (XML)

A

a markup language designed to store, transport, and structure data in a way that is both human-readable and machine-readable. allows users to define their own tags and data structures based on their needs. <tag></tag>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

table

A

A logical construct perceived to be a two-dimensional structure composed of intersecting rows (entities) and columns (attributes) that represents an entity set in the relational model.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

The 3 Vs

A

VOLUME: amounts of data being stored
VELOCITY: the speed which data grows but also the need to process
VARIETY: the fact that data is collected in multiple formats

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Internet of Things (IoT)

A

A web of Internet-connected devices constantly exchanging and collecting data over the Internet. IoT devices can be remotely managed and configured to collect data and interact with other devices on the Internet.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is Hadoop, and what are its basic components?

A

A Java-based, open-source file storage system that uses the write-once, read many model. This means that once the data is written, it cannot be modified.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

data

A

Raw facts, or facts that have not yet been processed to reveal their meaning to the end user.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

field

A

A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

record

A

A logically connected set of one or more fields that describes a person, place, or thing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

file

A

A collection of related records.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

what is data redundancy?

A

Exists when the same data is stored unnecessarily at different places, leading to poor data security, inconsistency, entry errors, and integretiy problems

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

The difference between data, info, and a database

A

the database system consists of logically related data stored in a single logical data repository. the current generation of DBMS software stores not only the data structures but also the relationships between those structures and the access paths to those structures

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What does a business need to manage a database?

A

Hardware, software, people, procedures or rules governing the design, and the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What is metadata?

A

The metadata describes the data characteristics and the set of relationships that links the data found within the database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Structured data vs. non-structured

A

Data that has been formatted to facilitate storage, use, and information generation in a predefined data model.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Information

A

the result of processing raw data to reveal its meaning. Data processing can be as simple as organizing data to reveal patterns or as complex as making forecasts or drawing inferences using statistical modeling.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

data dictionary

A

stores metadata—data about data. contains the data definition as well as their characteristics and relationships.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

Data Anomalies

A

Develops when not all of the required changes in the redundant data are made successfully.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

What are the characteristics of big data?

A

Velocity: the speed at which data emanates and changes
Value: The value that can be derived from access and analysis
Veracity: The discrepancies found in data
Volume: The sheer size of data generated every second
Variety: The combination of datatypes in the system

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

Data warehouse

A

centralized repository designed for structured data. Like a server that has all the data sets. Data is cleaned, transformed, and organized before being stored. Often used for historical analysis, reporting, and dashboards.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

Data lake

A

a centralized repository designed to store raw, unstructured, semi-structured, and structured data at scale. Supports a variety of formats like text, video, audio, JSON, and CSV.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

ETL

A

a data integration process that moves data from multiple sources into a destination system
Extract: Retrieve data from various sources like databases, APIs, or files.
Transform: Clean, standardize, and format the data (e.g., removing duplicates, converting data types).
Load: Store the transformed data into the destination system, typically a data warehouse.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

The Cloud

A

Contains structured, semi-structured, unstructured data, but doesn’t rely on local maintenance or a server. Uses tools like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

Multicloud

A

Uses several cloud storage and computing providers simultaneously

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

Hybrid Cloud

A

Uses both public cloud providers and a secure, private cloud

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

The relational model’s three components

A

A logical data structure represented by relations
A set of integrity rules to enforce that the data is consistent and remains consistent over time
A set of operations that defines how data is manipulated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

What are the characteristics of a relational table?

A

It is a two-dimensional structure composed of rows and columns.
Each table row (tuple) represents a single entity.
Each table column represents an attribute with a distinct name.
All values in a column must conform to the same data format.
Each column has a specific range of values known as the attribute domain.
The order of the rows and columns is immaterial to the DBMS.
Each table must have an attribute or combination of attributes that uniquely identifies each row.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

Determination and functional dependency

A

The state in which knowing the value of one attribute makes it possible to determine the value of another. Ex. Revenue - cost = profit. Known values revenue and cost determine profit . Such is applied to functional dependency, which means that the value of one or more attributes determines the value of one or more other attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

Entity Integrity

A

the condition in which each row (entity instance) in the table has its own known, unique identity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

What are the requirements of the primary key for entity integrity?

A

all of the values in the primary key must be unique and
no key attribute in the primary key can contain a null

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

Foreign Key

A

A foreign key (FK) is the primary key of one table that has been placed into another table to create a common attribute.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

Referential Integrity

A

a concept in relational databases that ensures the consistency and accuracy of data by maintaining valid relationships between tables. Thereby, it enforces rules about how foreign keys relate to primary keys

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

Relational algebra

A

defines the theoretical way of manipulating table contents using relational operators

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

SELECT (RESTRICT)

A

It yields values for all rows found in the table that satisfy a given condition.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

PROJECT

A

yields all values for selected attributes. It is also a unary operator, accepting only one table as input.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

UNION

A

combines all rows from two tables, excluding duplicate rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

INTERSECT

A

yields only the rows that appear in both tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

Product

A

yields all possible pairs of rows from two tables—also known as the Cartesian product. if one table has 6 rows and the other table has 3 rows, the PRODUCT yields a list composed of 6 × 3=18 rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

JOIN

A

allows information to be intelligently combined from two or more tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

Equijoin

A

links tables on the basis of an equality condition that compares specified columns of each table. The outcome of the equijoin does not eliminate duplicate columns,

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
59
Q

inner join

A

only returns matched records from the tables that are being joined. In an outer join, the matched pairs would be retained, and any unmatched values in the other table would be left null.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
60
Q

outer join

A

an “inner join plus.” The outer join still returns all of the matched records that the inner join returns, plus it returns the unmatched records from one of the tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
61
Q

What join is useful for uncovering referential integrity errors?

A

the outer joins. Such problems are created when foreign key values do not match the primary key values in the related table(s).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
62
Q

DIVIDE

A

is used to answer questions about one set of data being associated with all values of data in another set of data. For example which CUSTOMERS in col1 bought all 3 products in col2?

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
63
Q

system catalog

A

a detailed system data dictionary that describes all objects within the database, including data about table names, a table’s creator and creation date, authorized users, and access privileges. the system catalog tables can be queried just like any user/designer-created table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
64
Q

homonyms

A

similar-sounding words with different meanings, such as boar and bore.
For example, you might use C_NAME to label a customer name attribute in a CUSTOMER table and use C_NAME to label a consultant name attribute in a CONSULTANT table. Avoid this.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
65
Q

synonyms

A

indicates the use of different names to describe the same attribute. For example, car and auto.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
66
Q

what is a real test of redundancy?

A

if you delete an attribute and the original information can still be generated through relational algebra, the inclusion of that attribute would be redundant.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
67
Q

T or F: All redundancies must be deleted

A

False. Planned redundancies are common in good database design. Sometimes redundancies occur to maintain historical data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
68
Q

Index

A

an index is an ordered arrangement of keys and pointers. Each key points to the location of the data identified by the key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
69
Q

What are the four main NoSQL databases?

A

Key-value stores
Document Databases
Wide-Column Stores
Graph Databases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
70
Q

Graph databases

A

Definition: Use nodes, edges, and properties to represent relationships between data (e.g., Neo4j, Amazon Neptune).
Relational Potential:
Graph databases are inherently non-relational because they are optimized for querying relationships directly through graph structures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
71
Q

Wide-Column Stores

A

Store data in a column-oriented format where each row can have varying columns (e.g., Apache Cassandra, HBase). great for dealing with extremely large amounts of data where speed is of utmost importance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
72
Q

Document databases

A

Stores documents in often JSON or BSON formats. Don’t have a fixed schema or table relationship. Document stores prioritize flexibility and hierarchical data, making them better suited for unstructured data. Popular document storages are Apache’s CouchDB, MongoDB, and Azure Cosmos DB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
73
Q

Key-Value stores

A

Key-value stores lack relationships, constraints, and structured querying. Ideal for fast lookups and simple use cases. Every element is stored as a key-value pair consisting of an attribute name (“key”) and a value. Popular systems are Redis, DynamoDB, Oracle NoSQL. Useful for shopping carts, user preferences, user profiles.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
74
Q

Natural Keys

A

Data you already store that meets the requirement of a primary key: No Nulls, all values unique

75
Q

Composite Key

A

multiple attribute columns that together provide the unique identifier for a row and make up a composite key

76
Q

Surrogate Primary Key

A

keys that have no real world meaning. Their entire purpose is to create a unique column in a data table, Much like library card numbers, credit card numbers, driver license numbers

77
Q

Foreign Keys

A

Foreign key columns store the primary keys value of the rows that they’re related to. Because they need to store the same value, their data type needs to be the same as the primary key’s data type in the related table.

78
Q

Views

A

A view in SQL is a virtual table created based on a query. It does not store data but dynamically pulls it from the underlying tables.

79
Q

Normalization

A

Normalization is the process of removing redundancy from a database.
1st normal form: Ensures each column contains atomic (indivisible) values. Removes duplicate columns and ensures each row is unique.
2nd: Removes partial dependencies (no non-key column depends on part of a composite key)
3rd: removes transitive dependencies

80
Q

OLTP systems

A

Online Transaction Processing. Reduce data redundancy by usin more table relationships to increase database write speed. These databases store information quickly. For instance, an online storefront that’s primarily concerned with allowing customers to put items into their shopping cart and quickly process their payment information

81
Q

OLAP systems

A

OLAP stands for Online Analytical Processing, and databases designed with this model are primarily concerned with retrieval of information. designed to support analysis workloads

82
Q

What does a column have to achieve to be non-transitively dependent?

A

Its value is directly determined by the primary key.
It is not indirectly dependent on the primary key through another column.

83
Q

What database would you use for social networks, recommendation engines, fraud detection, supply chain management

A

Graph databases: Analyzing and querying complex relationships between entities.

84
Q

What database would you use for IoT applications, real-time analysis, log management, CMS

A

Wide-column stores: Handling large-scale, high-throughput applications with semi-structured data.

85
Q

What database would you use for CMS, CRMs, ecommerce, or mobile applications?

A

Document databases. Best For: Managing semi-structured or hierarchical data, such as JSON or XML.

86
Q

What databases would you use for caching, session management, feature flags like A/B testing, or leaderboard systems?

A

Key-value stores: Best For: Simple, fast lookups of key-value pairs.

87
Q

Visualize how to model a Primary Key in an ER model

A

TABLE NAME ( KEY_ATTRIBUTE 1 , ATTRIBUTE 2, ATTRIBUTE 3, … ATTRIBUTE K)

88
Q

Visualize how to model attributes of an entity in a ER

A

If entity is a STUDENT, then attributes could be STU_LNAME, STU_FNAME, STU_ETHNICITY

89
Q

What two conditions must be met before an entity can be classified as a weak entity?

A

The entity is existence-dependent; it cannot exist without the entity with which it has a relationship.
The entity has a primary key that is partially or totally derived from the parent entity in the relationship.

90
Q

Recursive relationships

A

a recursive relationship is one in which a relationship can exist between occurrences of the same entity set.

91
Q

What are cardinalities

A

Expresses the minimum and maximum number of entity occurrences associated with one occurrence of the related entity. Expressed as (x,y). Ex. A professor can take up to 4 classes (1,4).

92
Q

Composite Attributes

A

an attribute that can be further subdivided to yield additional attributes. For example, the attribute ADDRESS can be subdivided into street, city, state, and zip code.

93
Q

What are good modeling practices for multivalued attributes

A
  1. Create a new entity composed of the original multivalued attribute’s components
  2. Within the original entity, create several new attributes,one for each component of the original attribute
94
Q

Derived attributes

A

attributes that are computed, or derived from calculating two attributes together

95
Q

How do you define relationships between entities?

A

Using active or passive verbs. And it shows operation in both directions

A CUSTOMER generates many INVOICES
Each INVOICE generated by one CUSTOMER

96
Q

What is connectivity?

A

Used to describe the relationship classification such as 1:M, M:N, and 1:1

97
Q

What is a weak relationship?

A

exists if the primary key of the related entity (parent) does not appear as a foreign key in the child. By default, relationships are established by having the primary key of the parent entity appear as a foreign key (FK) on the related entity (also known as the child entity).

98
Q

Strong relationships

A

A strong (identifying) relationship exists when the primary kiey of the related entity (a foreign key) is a a primary key of the parent entity.

99
Q

What are multivalued attributes?

A

Multivalued attributes are attributes that can have many values. For instance, a person may have several college degrees, and a household may have several different phones, each with its own number.

100
Q

What is a composite entity?

A

This associative entity, also called a composite or bridge entity, is in a 1:M relationship with the parent entities and is composed of the primary key attributes of each parent entity.

101
Q

Steps to developing an ERD model?

A

Create a detailed narrative of the organization’s description of operations.
Identify the business rules based on the description of operations.
Identify the main entities and relationships from the business rules.
Develop the initial ERD.
Identify the attributes and primary keys that adequately describe the entities.
Revise and review the ERD.

102
Q

What 3 database requirements are often conflicting and must be addressed in design

A

Design standards.
Processing speed
Information requirements

103
Q

Single-valued attributes

A

A single-valued attribute is an attribute that can have only a single value. For example, a person can have only one Social Security number, and a manufactured part can have only one serial number.

104
Q

Simple attributes

A

A simple attribute is the opposite of a composite attribute in that it cannot be subdivided. For example, age, sex, and marital status would be classified as simple attributes.

105
Q

Why is an object an abstraction of an entity?

A

Abstractions simplify complexity by modeling only what matters for the system. an object is an abstraction of an entity because it simplifies a complex real-world concept into a model that can be represented and manipulated programmatically

106
Q

Entity Set

A

An entity set is a group of entities that belong to the same type and are represented by the same attributes. It is a fundamental concept in database design, particularly in the context of the Entity-Relationship (ER) model.

107
Q

T or F: To implement many-to-many (M:N) relationships in databases, you cross reference with primary and foreign keys.

A

False. Relational databases do not support direct many-to-many relationships between tables and require a join table.

108
Q

What is a join, junction, or cross-reference table

A

Contains two foreign keys that reference the primary keys of the two related tables.
Represents individual instances of the many-to-many relationship.

109
Q

Dimensional modeling

A

a data modeling technique used to organize and structure a data in a data warehouse is made up of two types of data: facts and dimensions.
Facts: Facts are usually things that can be measured and aggregated, such as profit, sales. These are usually stored in one table.
Dimensions: Stored in multiple tables, these provide additional contexts to facts such as month, product category.

110
Q

Facts Table (dimensional modeling)

A

Facts are individual pieces of data or information that we want to store and analyze in our data warehouse. These facts can be numerical or quantitative values. numbers of products sold, total sales amount, number of customer complaints

111
Q

Dimensions Table (dimensional modeling)

A

A dimension in the data warehousing is a collection of categories or attributes that describes the facts in your data.

112
Q

What schema is suitable for large amounts of data in a data warehouse, but has slower performance?

A

Snowflake Schema

113
Q

What schema is suitable for small amounts of data in a data warehouse, but has faster performance?

A

Star Schema

114
Q

Which type of data warehouse is faster, more reliable, store locally, but is expensive to setup and maintain, and limited in scalability?

A

On-premises warehouses

115
Q

Describe Cloud Warehouses

A

They have great scalability
More cost effective (follow CapEx model)
More accessible

116
Q

Describe primary key vs. indexes

A

While a primary key is a logical constraint that uniquely identifies each row in a table, an index is a database structure that improves the speed of data retrieval for specific queries. It is a performance optimization feature. Doesn’t require uniques.

117
Q

B-tree indexes

A

a tree data structure with a root and nodes. The root node is the index value that splits the range of values found in the index column. The top node of the subtree splits the value of the index column so that the values less than the node value are stored to the left branch of the tree, and values greater than the value in the node are stored to the right.

118
Q

Bitmap indexes

A

store a series of bits for indexed values. The number of bits used is the same as the number of distinct values in a column. For example, a column that has either a yes or no value would require two bits, one corresponding to the yes, and one corresponding to the no.

119
Q

Hash index

A

Hash functions take an arbitrary length data and map it to a fixed-size value. Hash values are designed so that different inputs will produce different outputs.

120
Q

Bloom filter indexes

A

bloom filters are especially useful when we’re querying arbitrary combinations or a large number of attributes. a bloom filter index is probabilistic, which means it’s not deterministic, we might get some results that aren’t actually fitting our filter criteria. But it’s really space efficient. It’s a lossy representation.

121
Q

Why is it important to know the Nested Loop, Hash Join, Merge Join, and subqueries

A

This allows you to turn on either a nest, hash, or merge to save time and space. When space and time is important, you can try either one.
And use EXPLAIN to calculate how much the join took up
The below code turns on either join in PostgreSQL
set enable_nestloop=true;
set enable_hashjoin=false;
Set enable_mergejoin=false;

122
Q

Schema vs. model

A

A “schema” refers to the detailed structure of data within a database, defining how tables, fields, and relationships are organized, while a “model” is a high-level conceptual blueprint that outlines the overall design of a database system, including the entities, attributes, and relationships between them

123
Q

Why is ELT sometimes preferred

A

It lets data engineers interact directly with raw data almost in real time with minimal processing and transportation time.

124
Q

ELT

A

Extract, load, transform: data is extracted from a source server or servers and is then transported immediately to the target location and loaded. No transformation occurs between these two steps.

125
Q

When do you want to use ETL?

A

you may have a need to transform sensitive information so it’s not sitting unmasked in a data lake or warehouse
Limit data access through filtering
Process data for migration to different servers

126
Q

Trigger

A

An extraction tooling created through PostgreSQL to create a new triggeron a specified function when certain actionsor parameters are met. Ex. creating a trigger that runs every time a rowof the table accounts is about to be updated.

CREATE TRIGGER check_update
BEFORE UPDATE ON accounts
FOR EACH ROW
EXECUTE FUNCTION
check_account_update():

127
Q

Change data capture

A

An extraction tool that only captures the data that has changed since the previous ETL operation. Debezium is just one example of an open source CDC platform

128
Q

data transformations

A

This process involves looking for errors, inconsistencies or other validation errors and stripping or restructuring data accordingly. Some examples of actions include deduplicating redundant or identical records, mapping values appropriately between the source and destination databases, performing data validation to ensure the records are of compatible data types and are aligned properly between the source and destination schemas or establishing key relationships across tables.

129
Q

T or F: Security and cost requires that you transform the data on source server, filtering out sensitive information and implementing secruity processes, therefore you should use ETL.

A

T: ETL you transform data on the source server before loading into the destination. This is slower than using in-warehouse data transformation (ELT).

130
Q

Incremental loading

A

Part of ETL/ELT, it evaluates the differences between the source and destination data sets to determine what data has changed since the last load. Any modified or added records are imported through streaming, batch, or small batch incremental loads.

131
Q

full load

A

the entire data set is loaded into the target data warehouse and information already existing is overwritten every time data is loaded.

132
Q

T or F: You should do a full load to restore historical data

A

F: Historical data is loss everytime you do a full load. Full loads should only be done in disasters or first time transfers.

133
Q

partial dependency

A

When a column depends on only part of the primary key, not the whole thing

Such as Class_Name being dependent on just class_ID, but there’s also a Student_ID in the table

134
Q

Transitive dependency

A

When a column depends on the primary key indirectly,

Such as Teach_Name depending on Class_ID through Class_Name. Teacher_Name needs to be placed in its own table.

135
Q

What are reasons to denormalize

A

Data warehouses routinely use 2NF structures
When you require higher processing speeds and less sspace

136
Q

Atomic attribute

A

When an attribute can no longer be subdivided

Such as EMP_NAME can be last_name and first_name

137
Q

Systems Development Life Cycle (SDLC)

A

divided into five phases; planning, analysis, detailed systems design, implementation, and maintenance. The SDLC is an iterative process rather than a sequential process.

138
Q

Database Life Cycle (DBLC)

A

Contains six phases: database initial study, database design, implementation and loading, testing and evaluation, operation, and maintenance and evolution.

139
Q

Conceptual design

A

a conceptual data model that describes the main data entities, attributes, relationships, and constraints of a given problem domain.

140
Q

Centralized design

A

design can be carried out and represented in a fairly simple database. typical of relatively simple, small databases and can be successfully done by a single database administrator or by a small, informal design team.

141
Q

Decentralized design

A

used when the system’s data component has a considerable number of entities and complex relations on which very complex operations are performed.

142
Q

Top-down design

A

Identifying the data sets and then defines the data elements for each of those sets. This process involves the identification of different entity types and the definition of each entity’s attributes.

143
Q

Bottom-up design

A

identifies the data elements (items) and then groups them together in data sets. In other words, it first defines attributes, and then groups them to form entities.

144
Q

Logical Design

A

an enterprise-wide database that is based on a specific data model but independent of physical-level details. logical design for a relational DBMS includes the specifications for the relations (tables), relationships, and constraints (in other words, domain definitions, data validations, and security views).

145
Q

Physical Design

A

the process of determining the data storage organization and data access characteristics of the database to ensure its integrity, security, and performance.

146
Q

Key performance indicators (KPIs)

A

Quantifiable numeric or scale-based measurements that assess the company’s effectiveness or success in reaching its strategic and operational goals.

147
Q

Operational Data

A

Operational data storage is optimized to support transactions that represent daily operations.

148
Q

decision support data

A

gives tactical and strategic business meaning to the operational data. Support data differs from operational data in three main areas: time span, granularity, and dimensionality.

149
Q

Data Mart

A

small, single-subject data warehouse subset that provides decision support to a small group of people. a data mart could be created from data extracted from a larger data warehouse f

150
Q

What are the three requirements for a decision support database

A

Database Schema, Extraction and Filtering, and database size

151
Q

data cube

A

Used in Multidimensional online analytical processing (MOLAP). The location of each data value in the data cube is a function of the x-, y-, and z-axes in a three-dimensional space. The three axes represent the dimensions of the data value.

152
Q

Online Analytical Processing

153
Q

Facts

A

Numeric measurements (values) that represent a specific business aspect or activity. Facts commonly used in business data analysis are units, costs, prices, and revenues.

154
Q

Dimensions

A

Qualifying characteristics that provide additional perspectives to a given fact. Dimensions provide descriptive characteristics about the facts through their attributes.

155
Q

What is drill-down analysis

A

Drill-down involves going from a summary or higher-level view of the data to a more detailed or specific view.

156
Q

How does a data analys usually look at facts?

A

Through a the dimension’s attributes. a data warehouse DBMS that is optimized for decision support first searches the smaller dimension tables before accessing the larger fact tables.

157
Q

What is this an example of?
Start with Yearly Sales:
2024: $1M
Drill-down to Quarterly Sales:
Q1: $250k, Q2: $300k, Q3: $200k, Q4: $250k
Drill further into Monthly Sales for Q1:
January: $100k, February: $80k, March: $70

A

Drill down analysis

158
Q

Database Performance Tuning

A

the goal of database performance is to execute queries as fast as possible. Therefore, database performance must be closely monitored and regularly tuned. Database performance tuning refers to a set of activities and procedures designed to reduce the response time of the database system

159
Q

What are database statistics, and why are they important?

A

statistics provide information about database size, number of records, average access time, number of requests serviced, and number of users with access rights. These statistics are then used to determine the best access strategy. Current-generation DBMSs are intelligent enough to determine the best type of index to use under certain circumstances

160
Q

How are database statistics obtained?

A

EXPLAIN ANALYZE;
ANALYZE;
VACCUM ANALYZE;
It is different from Oracle to IBM, to PostgreSQL

161
Q

the DBMS processes a query in three phases.

A

Parsing. The DBMS parses the SQL query and chooses the most efficient access/execution plan.
Execution. The DBMS executes the SQL query using the chosen execution plan.
Fetching. The DBMS fetches the data and sends the result set back to the client.

162
Q

What determines when to use an index?

A

Data sparsity refers to the number of different values a column could have.
Table size. Small tables don’t necessarily warrant indexes.

163
Q

Rule-based Optimizer

A

uses preset rules and points to determine the best approach to execute a query. The rules assign a “fixed cost” to each SQL operation; the costs are then added to yield the cost of the execution plan. F

164
Q

Cost-based optimizer

A

uses sophisticated algorithms based on statistics about the objects being accessed to determine the best approach to execute a query. In this case, the optimizer process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to determine the total cost of a given execution plan.

165
Q

Index guidelines

A

Create indexes for each single attribute used in a WHERE, HAVING, ORDER BY, or GROUP BY clause.
Declare primary and foreign keys
Declare indexes in join columns other than PK or FK

166
Q

What is data sparsity?

A

refers to how much a column’s values are repeated versus how many unique values it contains.

167
Q

Query optimization techniques

A

Use simple columns or literals as operands in a conditional expression
Numeric field comparisons are faster than character, date, and NULL comparisons.
Equality comparisons are generally faster than inequality
When using multiple conditional expressions, write the equality conditions first.
If you use multiple AND conditions, write the condition most likely to be false first.
Try to avoid using NOT whenever possible

168
Q

Scaling Up

A

keeping the same number of systems but migrating each system to a larger system: for example, changing from a server with 16 CPU cores and a 1 TB storage system to a server with 64 CPU

169
Q

Scaling Out

A

means that when the workload exceeds the capacity of a server, the workload is spread out across a number of servers. This is also referred to as clustering—creating a cluster of low-cost servers to share a workload

170
Q

Stream Processing

A

requires analysis of the data stream as it enters the system. In some situations, large volumes of data can enter the system at such a rapid pace that it is not feasible to try to store all of the data. The data must be processed and filtered as it enters the system to determine which data to keep and which data to discard.

171
Q

Feedback loop processing

A

Capturing the data, processing it into usable information, and then acting on that information is a feedback loop. processing to provide immediate results requires analyzing large amounts of data within just a few seconds ..

172
Q

Value (Big Data)

A

also called viability, refers to the degree to which the data can be analyzed to provide meaningful information that can add value to the organization.

173
Q

Visualization (Big Data)

A

the ability to graphically present the data in such a way as to make it understandable.

174
Q

Veracity (Big Data)

A

One of the keys to data modeling is that only the data that is of interest to the users should be included in the data model. Data that is not of value should not be recorded in any data store

175
Q

ACID

A

(atomicity, consistency, isolation, and durability):
four essential properties that ensure data integrity and reliability in databases, especially in transactional systems.

176
Q

Atomicity

A

A transaction is fully completed or fully rolled back—no partial updates.

177
Q

Consistency

A

The database moves from one valid state to another (no broken constraints).

178
Q

Isolation

A

Transactions are independent of each other—one transaction doesn’t affect another until it’s committed.

179
Q

Durability

A

Once committed, data is permanently saved (even if there’s a system crash).

180
Q

What are the four basic categories of NoSQL databases?

A

Key-value database
Document databases
Column-oriented databases
Graph databases

181
Q

Graph databases

A

interdependent queries about relationships that could take hours to run in a relational database are the forte of graph databases. Graph databases can complete these queries in seconds. In fact, you often encounter the phrase “minutes to milliseconds”

182
Q

Aggregate Aware

A

Key-value, document, and column family databases are aggregate aware. Aggregate aware means that the data is collected or aggregated around a central topic or entity.

183
Q

T or F: Hadoop is outdated

A

Both. It is on the decline with faster more efficient ways of combating cloud-based solutions, slow performance, high storage cofts, and design for realtime processing