Data Management Flashcards
What is a data model?
a simple representation that demonstrates how data structures, including their characteristics, relationships, constraints, and transformations, form a database that solves a real-world business problem.
business rule
a policy, procedure, or principle within a specific organization. Business rules are used to define entities, attributes, relationships, and constraints.
the basic features of the relational data model
entities, attributes, relationships
Translate business rules into data models
a noun in a business rule will translate into an entity in the model, and a verb (active or passive) that associates the nouns will translate into a relationship among the entities
entity
An entity is a person, place, thing, concept, or event about which data will be collected and stored.
Attribute
An attribute is a characteristic of an entity. For example, a CUSTOMER entity would be described by attributes such as customer last name, customer first name, customer phone number, customer address, and customer credit limit.
Relationships
describes an association among entities. Data models use three types of relationships: one-to-many, many-to-many, and one-to-one, with shorthand notations 1:M or 1..*, M:N or *..*, and 1:1 or 1..1, respectively.
Visualize One-to-many (1:M or 1..*) relationship.
An author has many books, but a book has one author: 1:M, AUTHOR publishes BOOKS
An invoice is created by one customer, but a customer generates many invoices: 1:M, CUSTOMER generates INVOICES
Visualize Many-to-many (M:N or *..*) relationship.
An employee may learn many job skills, and each job skill may be learned by many employees. “EMPLOYEE learns SKILL” as M:N.
A student can take many classes, and each class can be taken by many students: M:N, STUDENT takes CLASSES
One-to-one (1:1 or 1..1) relationship.
A retail company’s management structure may require that each of its stores be managed by a single employee. “EMPLOYEE manages STORE” is labeled 1:1.
Schema
The schema is the conceptual organization of the entire database as viewed by the database administrator.
Hierarchical model
The hierarchical model depicts a set of one-to-many (1:M) relationships between a parent and its children segments.
Network model
the user perceives the network database as a collection of records in 1:M relationships. However, unlike the hierarchical model, the network model allows a record to have more than one parent.
Subschema
The subschema defines the portion of the database “seen” by the application programs that actually produce the desired information from the data within the database.
Relational model
A relation (sometimes called a table) is a two-dimensional structure composed of intersecting rows and columns. Each row in a relation is called a tuple. Each column represents an attribute.
The Entity relationship model
The graphical representation of entities and their relationships in a database structure. ER models are normally represented in an entity relationship diagram (ERD)
Object-Oriented Data Model (OODM)
data and its relationships are contained in a single structure known as an object, forming the basis for the object-oriented database management system (OODBMS).
Why is an object said to have greater semantic content than an entity?
an object includes information about relationships between the facts within the object, as well as information about its relationships with other objects.
Class vs object
A class is a collection of similar objects with shared structure (attributes) and behavior (methods). An object is an abstraction of an entity
Extensible Markup Language (XML)
a markup language designed to store, transport, and structure data in a way that is both human-readable and machine-readable. allows users to define their own tags and data structures based on their needs. <tag></tag>
table
A logical construct perceived to be a two-dimensional structure composed of intersecting rows (entities) and columns (attributes) that represents an entity set in the relational model.
The 3 Vs
VOLUME: amounts of data being stored
VELOCITY: the speed at which data grows, but also the need to process that data quickly
VARIETY: the fact that data is collected in multiple formats
Internet of Things (IoT)
A web of Internet-connected devices constantly exchanging and collecting data over the Internet. IoT devices can be remotely managed and configured to collect data and interact with other devices on the Internet.
What is Hadoop, and what are its basic components?
A Java-based, open-source framework for distributed file storage and processing. Its basic components are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS uses the write-once, read-many model: once the data is written, it cannot be modified.
data
Raw facts, or facts that have not yet been processed to reveal their meaning to the end user.
field
A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.
record
A logically connected set of one or more fields that describes a person, place, or thing.
file
A collection of related records.
what is data redundancy?
Exists when the same data is stored unnecessarily in different places, leading to poor data security, inconsistency, data-entry errors, and integrity problems
The difference between data, info, and a database
data are raw facts; information is data processed to reveal meaning. The database system consists of logically related data stored in a single logical data repository; the current generation of DBMS software stores not only the data structures but also the relationships between those structures and the access paths to those structures
What does a business need to manage a database?
Hardware, software, people, procedures or rules governing the design, and the data
What is metadata?
The metadata describes the data characteristics and the set of relationships that links the data found within the database.
Structured data vs. non-structured
Structured data has been formatted to facilitate storage, use, and information generation in a predefined data model. Unstructured data exists in its raw, unorganized state, such as free-form text, video, and audio.
Information
the result of processing raw data to reveal its meaning. Data processing can be as simple as organizing data to reveal patterns or as complex as making forecasts or drawing inferences using statistical modeling.
data dictionary
stores metadata (data about data); contains the data definitions as well as their characteristics and relationships.
Data Anomalies
Develop when not all of the required changes to redundant data are made successfully.
What are the characteristics of big data?
Velocity: the speed at which data emanates and changes
Value: The value that can be derived from access and analysis
Veracity: The discrepancies found in data
Volume: The sheer size of data generated every second
Variety: The combination of datatypes in the system
Data warehouse
centralized repository designed for structured data. Like a server that has all the data sets. Data is cleaned, transformed, and organized before being stored. Often used for historical analysis, reporting, and dashboards.
Data lake
a centralized repository designed to store raw, unstructured, semi-structured, and structured data at scale. Supports a variety of formats like text, video, audio, JSON, and CSV.
ETL
a data integration process that moves data from multiple sources into a destination system
Extract: Retrieve data from various sources like databases, APIs, or files.
Transform: Clean, standardize, and format the data (e.g., removing duplicates, converting data types).
Load: Store the transformed data into the destination system, typically a data warehouse.
The Cloud
Contains structured, semi-structured, and unstructured data, but doesn't rely on local maintenance or a server. Uses tools like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud
Multicloud
Uses several cloud storage and computing providers simultaneously
Hybrid Cloud
Uses both public cloud providers and a secure, private cloud
The relational model’s three components
A logical data structure represented by relations
A set of integrity rules to enforce that the data is consistent and remains consistent over time
A set of operations that defines how data is manipulated
What are the characteristics of a relational table?
It is a two-dimensional structure composed of rows and columns.
Each table row (tuple) represents a single entity.
Each table column represents an attribute with a distinct name.
All values in a column must conform to the same data format.
Each column has a specific range of values known as the attribute domain.
The order of the rows and columns is immaterial to the DBMS.
Each table must have an attribute or combination of attributes that uniquely identifies each row.
Determination and functional dependency
The state in which knowing the value of one attribute makes it possible to determine the value of another. For example, revenue - cost = profit: the known values of revenue and cost determine profit. The same idea underlies functional dependency, which means that the value of one or more attributes determines the value of one or more other attributes
Entity Integrity
the condition in which each row (entity instance) in the table has its own known, unique identity.
What are the requirements of the primary key for entity integrity?
all of the values in the primary key must be unique and
no key attribute in the primary key can contain a null
Foreign Key
A foreign key (FK) is the primary key of one table that has been placed into another table to create a common attribute.
Referential Integrity
a concept in relational databases that ensures the consistency and accuracy of data by maintaining valid relationships between tables; it enforces rules about how foreign keys relate to primary keys
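A minimal SQL sketch (table and column names are illustrative, not from the source) showing entity integrity, a foreign key, and referential integrity rules declared together:
-- Parent table: the primary key enforces entity integrity (unique, non-null)
CREATE TABLE customer (
    cus_code  INT PRIMARY KEY,
    cus_lname VARCHAR(50) NOT NULL
);
-- Child table: the foreign key must match an existing customer
CREATE TABLE invoice (
    inv_number INT PRIMARY KEY,
    cus_code   INT NOT NULL REFERENCES customer (cus_code)
        ON UPDATE CASCADE    -- key changes propagate to invoices
        ON DELETE RESTRICT   -- a customer with invoices cannot be deleted
);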
Relational algebra
defines the theoretical way of manipulating table contents using relational operators
SELECT (RESTRICT)
It yields values for all rows found in the table that satisfy a given condition.
PROJECT
yields all values for selected attributes. It is also a unary operator, accepting only one table as input.
UNION
combines all rows from two tables, excluding duplicate rows.
INTERSECT
yields only the rows that appear in both tables.
Product
yields all possible pairs of rows from two tables, also known as the Cartesian product. If one table has 6 rows and the other has 3 rows, the PRODUCT yields a list composed of 6 × 3 = 18 rows.
JOIN
allows information to be intelligently combined from two or more tables.
Equijoin
links tables on the basis of an equality condition that compares specified columns of each table. The outcome of the equijoin does not eliminate duplicate columns.
inner join
only returns matched records from the tables being joined. In an outer join, by contrast, the matched pairs are retained and any unmatched values in the other table are left null.
outer join
an “inner join plus.” The outer join still returns all of the matched records that the inner join returns, plus it returns the unmatched records from one of the tables.
What join is useful for uncovering referential integrity errors?
Outer joins. Such problems arise when foreign key values do not match the primary key values in the related table(s).
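A hedged sketch of this check, reusing the illustrative customer and invoice tables from the sketch above and assuming the foreign key constraint is absent (otherwise orphans could not occur):
SELECT i.inv_number, i.cus_code
FROM invoice i
LEFT JOIN customer c ON i.cus_code = c.cus_code
WHERE c.cus_code IS NULL;  -- unmatched invoices reveal orphaned foreign keys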
DIVIDE
is used to answer questions about one set of data being associated with all values of data in another set. For example: which CUSTOMERs bought all three PRODUCTs?
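SQL has no DIVIDE operator; the classic emulation is a double NOT EXISTS. A sketch assuming hypothetical customer, product, and purchase tables:
SELECT c.cus_code
FROM customer c
WHERE NOT EXISTS (              -- no product exists ...
    SELECT 1 FROM product p
    WHERE NOT EXISTS (          -- ... that this customer failed to buy
        SELECT 1 FROM purchase pu
        WHERE pu.cus_code = c.cus_code
          AND pu.product_id = p.product_id));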
system catalog
a detailed system data dictionary that describes all objects within the database, including data about table names, a table’s creator and creation date, authorized users, and access privileges. the system catalog tables can be queried just like any user/designer-created table.
homonyms
similar-sounding words with different meanings, such as boar and bore; in a database, the use of the same name to label different attributes.
For example, you might use C_NAME to label a customer name attribute in a CUSTOMER table and also use C_NAME to label a consultant name attribute in a CONSULTANT table. Avoid this.
synonyms
indicates the use of different names to describe the same attribute. For example, car and auto.
what is a real test of redundancy?
if you delete an attribute and the original information can still be generated through relational algebra, the inclusion of that attribute would be redundant.
T or F: All redundancies must be deleted
False. Planned redundancies are common in good database design. Sometimes redundancies occur to maintain historical data.
Index
an index is an ordered arrangement of keys and pointers. Each key points to the location of the data identified by the key.
What are the four main NoSQL databases?
Key-value stores
Document Databases
Wide-Column Stores
Graph Databases
Graph databases
Definition: Use nodes, edges, and properties to represent relationships between data (e.g., Neo4j, Amazon Neptune).
Relational Potential:
Graph databases are inherently non-relational because they are optimized for querying relationships directly through graph structures.
Wide-Column Stores
Store data in a column-oriented format where each row can have varying columns (e.g., Apache Cassandra, HBase). Great for extremely large amounts of data where speed is of utmost importance.
Document databases
Store documents, often in JSON or BSON format, and have no fixed schema or table relationships. Document stores prioritize flexibility and hierarchical data, making them better suited for unstructured data. Popular document databases include Apache CouchDB, MongoDB, and Azure Cosmos DB
Key-Value stores
Key-value stores lack relationships, constraints, and structured querying. Ideal for fast lookups and simple use cases. Every element is stored as a key-value pair consisting of an attribute name (“key”) and a value. Popular systems are Redis, DynamoDB, Oracle NoSQL. Useful for shopping carts, user preferences, user profiles.
Natural Keys
Data you already store that meets the requirements of a primary key: no nulls, all values unique
Composite Key
multiple attribute columns that together provide the unique identifier for a row
Surrogate Primary Key
keys that have no real-world meaning; their entire purpose is to create a unique column in a data table, much like library card numbers, credit card numbers, and driver's license numbers
Foreign Keys
Foreign key columns store the primary key values of the rows they are related to. Because they need to store the same values, their data type needs to be the same as the primary key's data type in the related table.
Views
A view in SQL is a virtual table created based on a query. It does not store data but dynamically pulls it from the underlying tables.
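A minimal sketch of a view (the cus_credit_limit column is an assumption for illustration):
CREATE VIEW big_customers AS
SELECT cus_code, cus_lname
FROM customer
WHERE cus_credit_limit > 10000;
-- Queried like a table; rows are pulled from CUSTOMER at query time
SELECT * FROM big_customers;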
Normalization
Normalization is the process of removing redundancy from a database.
1st normal form: Ensures each column contains atomic (indivisible) values. Removes duplicate columns and ensures each row is unique.
2nd: Removes partial dependencies (no non-key column depends on part of a composite key)
3rd: removes transitive dependencies (see the sketch below)
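A sketch of a 3NF decomposition using the Class_Name/Teacher_Name example that appears later in this deck (all names illustrative):
-- Before: enrollment(student_id, class_id, class_name, teacher_name)
--   class_name depends on class_id alone (partial dependency);
--   teacher_name depends on class_id via class_name (transitive dependency)
-- After: each non-key column depends on the key, the whole key, and nothing but the key
CREATE TABLE teacher (teacher_id INT PRIMARY KEY, teacher_name VARCHAR(50));
CREATE TABLE class (
    class_id   INT PRIMARY KEY,
    class_name VARCHAR(50),
    teacher_id INT REFERENCES teacher (teacher_id)
);
CREATE TABLE enrollment (
    student_id INT,
    class_id   INT REFERENCES class (class_id),
    PRIMARY KEY (student_id, class_id)
);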
OLTP systems
Online Transaction Processing. These systems reduce data redundancy by using more table relationships to increase database write speed, storing information quickly. For instance, an online storefront primarily concerned with letting customers put items into their shopping carts and quickly processing their payment information
OLAP systems
OLAP stands for Online Analytical Processing; databases designed with this model are primarily concerned with retrieval of information and are designed to support analysis workloads
What does a column have to achieve to be non-transitively dependent?
Its value is directly determined by the primary key.
It is not indirectly dependent on the primary key through another column.
What database would you use for social networks, recommendation engines, fraud detection, supply chain management
Graph databases: Analyzing and querying complex relationships between entities.
What database would you use for IoT applications, real-time analysis, log management, CMS
Wide-column stores: Handling large-scale, high-throughput applications with semi-structured data.
What database would you use for CMS, CRMs, ecommerce, or mobile applications?
Document databases. Best For: Managing semi-structured or hierarchical data, such as JSON or XML.
What databases would you use for caching, session management, feature flags like A/B testing, or leaderboard systems?
Key-value stores: Best For: Simple, fast lookups of key-value pairs.
Visualize how to model a Primary Key in an ER model
TABLE NAME (KEY_ATTRIBUTE 1, ATTRIBUTE 2, ATTRIBUTE 3, … ATTRIBUTE K), with the primary key attribute conventionally underlined
Visualize how to model attributes of an entity in a ER
If entity is a STUDENT, then attributes could be STU_LNAME, STU_FNAME, STU_ETHNICITY
What two conditions must be met before an entity can be classified as a weak entity?
The entity is existence-dependent; it cannot exist without the entity with which it has a relationship.
The entity has a primary key that is partially or totally derived from the parent entity in the relationship.
Recursive relationships
a recursive relationship is one in which a relationship can exist between occurrences of the same entity set.
What are cardinalities
Expresses the minimum and maximum number of entity occurrences associated with one occurrence of the related entity, written as (x,y). For example, a professor teaches from one to four classes: (1,4).
Composite Attributes
an attribute that can be further subdivided to yield additional attributes. For example, the attribute ADDRESS can be subdivided into street, city, state, and zip code.
What are good modeling practices for multivalued attributes
- Create a new entity composed of the original multivalued attribute’s components
- Within the original entity, create several new attributes, one for each component of the original attribute
Derived attributes
attributes whose values are computed, or derived, from other attributes, such as calculating an age from a date of birth
How do you define relationships between entities?
Using active or passive verbs; the relationship reads in both directions:
A CUSTOMER generates many INVOICES
Each INVOICE is generated by one CUSTOMER
What is connectivity?
Used to describe the relationship classification such as 1:M, M:N, and 1:1
What is a weak relationship?
A weak (non-identifying) relationship exists if the primary key of the related (child) entity does not contain a primary key component of the parent entity. By default, relationships are established by having the primary key of the parent entity appear as a foreign key (FK) on the related entity (also known as the child entity).
Strong relationships
A strong (identifying) relationship exists when the primary key of the related (child) entity contains a primary key component of the parent entity; that is, the foreign key is part of the child's primary key.
What are multivalued attributes?
Multivalued attributes are attributes that can have many values. For instance, a person may have several college degrees, and a household may have several different phones, each with its own number.
What is a composite entity?
This associative entity, also called a composite or bridge entity, is in a 1:M relationship with each of the parent entities and is composed of the primary key attributes of each parent entity.
Steps to developing an ERD model?
Create a detailed narrative of the organization’s description of operations.
Identify the business rules based on the description of operations.
Identify the main entities and relationships from the business rules.
Develop the initial ERD.
Identify the attributes and primary keys that adequately describe the entities.
Revise and review the ERD.
What 3 database requirements are often conflicting and must be addressed in design
Design standards
Processing speed
Information requirements
Single-valued attributes
A single-valued attribute is an attribute that can have only a single value. For example, a person can have only one Social Security number, and a manufactured part can have only one serial number.
Simple attributes
A simple attribute is the opposite of a composite attribute in that it cannot be subdivided. For example, age, sex, and marital status would be classified as simple attributes.
Why is an object an abstraction of an entity?
Abstractions simplify complexity by modeling only what matters for the system. An object is an abstraction of an entity because it simplifies a complex real-world concept into a model that can be represented and manipulated programmatically.
Entity Set
An entity set is a group of entities that belong to the same type and are represented by the same attributes. It is a fundamental concept in database design, particularly in the context of the Entity-Relationship (ER) model.
T or F: To implement many-to-many (M:N) relationships in databases, you cross reference with primary and foreign keys.
False. Relational databases do not support direct many-to-many relationships between tables and require a join table.
What is a join, junction, or cross-reference table
Contains two foreign keys that reference the primary keys of the two related tables.
Represents individual instances of the many-to-many relationship (see the sketch below).
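A sketch of such a junction table, assuming hypothetical orders and product tables whose primary keys already exist:
CREATE TABLE order_line (
    order_id   INT REFERENCES orders (order_id),
    product_id INT REFERENCES product (product_id),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)  -- one row per order/product pair
);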
Dimensional modeling
a data modeling technique used to organize and structure data in a data warehouse; it is made up of two types of data: facts and dimensions.
Facts: things that can usually be measured and aggregated, such as profit or sales. These are usually stored in one table.
Dimensions: stored in multiple tables, these provide additional context for facts, such as month or product category.
Facts Table (dimensional modeling)
Facts are individual pieces of data or information that we want to store and analyze in our data warehouse. These are usually numerical or quantitative values, such as the number of products sold, total sales amount, or number of customer complaints.
Dimensions Table (dimensional modeling)
A dimension in data warehousing is a collection of categories or attributes that describe the facts in your data.
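A minimal star-schema sketch tying the two together (all table and column names illustrative):
-- Dimension tables provide the descriptive context
CREATE TABLE dim_product (product_key INT PRIMARY KEY, category VARCHAR(40));
CREATE TABLE dim_date (date_key INT PRIMARY KEY, month INT, year INT);
-- The fact table stores the measurable values, keyed by its dimensions
CREATE TABLE fact_sales (
    product_key  INT REFERENCES dim_product (product_key),
    date_key     INT REFERENCES dim_date (date_key),
    units_sold   INT,
    sales_amount NUMERIC(12,2)
);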
What schema is suitable for large amounts of data in a data warehouse, but has slower performance?
Snowflake Schema
What schema is suitable for small amounts of data in a data warehouse, but has faster performance?
Star Schema
Which type of data warehouse is faster, more reliable, and stores data locally, but is expensive to set up and maintain and limited in scalability?
On-premises warehouses
Describe Cloud Warehouses
They have great scalability
More cost-effective (they follow an OpEx model rather than requiring up-front CapEx)
More accessible
Describe primary key vs. indexes
While a primary key is a logical constraint that uniquely identifies each row in a table, an index is a database structure that improves the speed of data retrieval for specific queries. It is a performance optimization feature and does not require unique values.
B-tree indexes
a tree data structure with a root and nodes. Each node holds an index value that splits the range of values found in the indexed column: values less than the node's value are stored in the left branch of the tree, and values greater than it are stored in the right branch.
Bitmap indexes
store a series of bits for indexed values. The number of bits used is the same as the number of distinct values in a column. For example, a column that has either a yes or no value would require two bits, one corresponding to the yes, and one corresponding to the no.
Hash index
Hash functions take arbitrary-length data and map it to a fixed-size value; hash values are designed so that different inputs will usually produce different outputs.
Bloom filter indexes
bloom filters are especially useful when querying arbitrary combinations of a large number of attributes. A bloom filter index is probabilistic rather than deterministic: it may return some results that do not actually fit the filter criteria, but it is very space efficient. It is a lossy representation.
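A PostgreSQL sketch of declaring these index types on the illustrative customer table; note PostgreSQL has no user-declared bitmap index (Oracle's CREATE BITMAP INDEX does this), though it builds bitmap scans automatically:
-- B-tree is the default index type
CREATE INDEX idx_customer_lname ON customer USING btree (cus_lname);
-- Hash indexes support only equality comparisons
CREATE INDEX idx_customer_code ON customer USING hash (cus_code);
-- Bloom indexes require the bloom extension (supports int and text columns)
CREATE EXTENSION IF NOT EXISTS bloom;
CREATE INDEX idx_customer_bloom ON customer USING bloom (cus_code, cus_lname);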
Why is it important to know the Nested Loop, Hash Join, Merge Join, and subqueries
Knowing these join strategies lets you enable or disable the nested loop, hash, or merge join to save time and space; when time or space matters, you can try each one.
Use EXPLAIN to measure how much work the join required.
The code below toggles the join methods in PostgreSQL:
SET enable_nestloop = true;
SET enable_hashjoin = false;
SET enable_mergejoin = false;
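A usage sketch, reusing the illustrative customer and invoice tables from earlier: force the merge join, then let EXPLAIN ANALYZE report the plan and its actual cost:
SET enable_nestloop = false;
SET enable_hashjoin = false;
SET enable_mergejoin = true;
EXPLAIN ANALYZE                 -- prints the chosen plan plus actual run time
SELECT c.cus_lname, i.inv_number
FROM customer c
JOIN invoice i ON i.cus_code = c.cus_code;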
Schema vs. model
A “schema” refers to the detailed structure of data within a database, defining how tables, fields, and relationships are organized, while a “model” is a high-level conceptual blueprint that outlines the overall design of a database system, including the entities, attributes, and relationships between them.
Why is ELT sometimes preferred
It lets data engineers interact directly with raw data almost in real time with minimal processing and transportation time.
ELT
Extract, load, transform: data is extracted from a source server or servers and is then transported immediately to the target location and loaded. No transformation occurs between these two steps.
When do you want to use ETL?
you may have a need to transform sensitive information so it’s not sitting unmasked in a data lake or warehouse
Limit data access through filtering
Process data for migration to different servers
Trigger
An extraction tool in PostgreSQL that fires a specified function (the trigger) when certain actions occur or conditions are met. For example, creating a trigger that runs every time a row of the table accounts is about to be updated:
CREATE TRIGGER check_update
BEFORE UPDATE ON accounts
FOR EACH ROW
EXECUTE FUNCTION check_account_update();
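The trigger calls a trigger function that must be created first. A minimal sketch of what check_account_update() might contain (the body and the last_updated column are assumptions, not from the source):
CREATE FUNCTION check_account_update() RETURNS trigger AS $$
BEGIN
    NEW.last_updated := now();  -- e.g., stamp the row before the update lands
    RETURN NEW;                 -- returning NEW lets the UPDATE proceed
END;
$$ LANGUAGE plpgsql;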
Change data capture
An extraction tool that only captures the data that has changed since the previous ETL operation. Debezium is just one example of an open source CDC platform
data transformations
This process involves looking for errors, inconsistencies, or other validation problems and stripping or restructuring data accordingly. Examples include deduplicating redundant or identical records, mapping values between the source and destination databases, performing data validation to ensure records have compatible data types and align between the source and destination schemas, and establishing key relationships across tables.
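A small transformation sketch (the staging and clean tables are hypothetical): deduplicate records and enforce a basic validation rule while moving data out of staging:
INSERT INTO customer_clean (cus_code, cus_lname)
SELECT DISTINCT cus_code, cus_lname   -- deduplicate identical records
FROM customer_staging
WHERE cus_code IS NOT NULL;           -- validation: the key must be present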
T or F: Security and cost require that you transform the data on the source server, filtering out sensitive information and implementing security processes; therefore you should use ETL.
T: With ETL you transform data on the source server before loading it into the destination. This is slower than using in-warehouse data transformation (ELT).
Incremental loading
Part of ETL/ELT, it evaluates the differences between the source and destination data sets to determine what data has changed since the last load. Any modified or added records are imported through streaming, batch, or small batch incremental loads.
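A hedged sketch of a batch incremental load (the etl_watermark table and updated_at column are assumptions):
INSERT INTO warehouse_sales
SELECT *
FROM source_sales
WHERE updated_at > (SELECT max(loaded_through) FROM etl_watermark);
-- Afterward, advance the watermark so the next run picks up only newer rows
UPDATE etl_watermark SET loaded_through = now();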
full load
the entire data set is loaded into the target data warehouse, and existing information is overwritten every time data is loaded.
T or F: You should do a full load to restore historical data
F: Historical data is lost every time you do a full load. Full loads should only be done for disaster recovery or first-time transfers.
partial dependency
When a column depends on only part of a composite primary key, not the whole thing.
For example, Class_Name depending on just Class_ID when the primary key also includes Student_ID.
Transitive dependency
When a column depends on the primary key indirectly, through another non-key column.
For example, Teacher_Name depending on Class_ID through Class_Name; Teacher_Name needs to be placed in its own table.
What are reasons to denormalize
Data warehouses routinely use 2NF structures
When you require higher processing speeds and less space
Atomic attribute
When an attribute can no longer be subdivided.
For example, EMP_NAME is not atomic because it can be subdivided into a last name and a first name.
Systems Development Life Cycle (SDLC)
divided into five phases: planning, analysis, detailed systems design, implementation, and maintenance. The SDLC is an iterative process rather than a sequential process.
Database Life Cycle (DBLC)
Contains six phases: database initial study, database design, implementation and loading, testing and evaluation, operation, and maintenance and evolution.
Conceptual design
a conceptual data model that describes the main data entities, attributes, relationships, and constraints of a given problem domain.
Centralized design
design can be carried out and represented in a fairly simple database; typical of relatively simple, small databases, it can be successfully done by a single database administrator or by a small, informal design team.
Decentralized design
used when the system’s data component has a considerable number of entities and complex relations on which very complex operations are performed.
Top-down design
Identifies the data sets and then defines the data elements for each of those sets. This process involves identifying the different entity types and defining each entity's attributes.
Bottom-up design
identifies the data elements (items) and then groups them together in data sets. In other words, it first defines attributes, and then groups them to form entities.
Logical Design
an enterprise-wide database that is based on a specific data model but independent of physical-level details. logical design for a relational DBMS includes the specifications for the relations (tables), relationships, and constraints (in other words, domain definitions, data validations, and security views).
Physical Design
the process of determining the data storage organization and data access characteristics of the database to ensure its integrity, security, and performance.
Key performance indicators (KPIs)
Quantifiable numeric or scale-based measurements that assess the company’s effectiveness or success in reaching its strategic and operational goals.
Operational Data
Operational data storage is optimized to support transactions that represent daily operations.
decision support data
gives tactical and strategic business meaning to the operational data. Support data differs from operational data in three main areas: time span, granularity, and dimensionality.
Data Mart
small, single-subject data warehouse subset that provides decision support to a small group of people. A data mart could be created from data extracted from a larger data warehouse.
What are the three requirements for a decision support database
Database schema, data extraction and filtering, and database size
data cube
Used in Multidimensional online analytical processing (MOLAP). The location of each data value in the data cube is a function of the x-, y-, and z-axes in a three-dimensional space. The three axes represent the dimensions of the data value.
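In SQL, GROUP BY CUBE aggregates a fact across every combination of the listed dimensions, much like slicing a data cube along its axes (table and column names illustrative):
SELECT region, product, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY CUBE (region, product);  -- totals per region+product, per region, per product, and overall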
Facts
Numeric measurements (values) that represent a specific business aspect or activity. Facts commonly used in business data analysis are units, costs, prices, and revenues.
Dimensions
Qualifying characteristics that provide additional perspectives to a given fact. Dimensions provide descriptive characteristics about the facts through their attributes.
What is drill-down analysis
Drill-down involves going from a summary or higher-level view of the data to a more detailed or specific view.
How does a data analyst usually look at facts?
Through the dimensions' attributes. A data warehouse DBMS that is optimized for decision support first searches the smaller dimension tables before accessing the larger fact tables.
What is this an example of?
Start with Yearly Sales:
2024: $1M
Drill-down to Quarterly Sales:
Q1: $250k, Q2: $300k, Q3: $200k, Q4: $250k
Drill further into Monthly Sales for Q1:
January: $100k, February: $80k, March: $70k
Drill-down analysis
Database Performance Tuning
the goal of database performance is to execute queries as fast as possible. Therefore, database performance must be closely monitored and regularly tuned. Database performance tuning refers to a set of activities and procedures designed to reduce the response time of the database system
What are database statistics, and why are they important?
statistics provide information about database size, number of records, average access time, number of requests serviced, and number of users with access rights. These statistics are then used to determine the best access strategy. Current-generation DBMSs are intelligent enough to determine the best type of index to use under certain circumstances
How are database statistics obtained?
EXPLAIN ANALYZE;
ANALYZE;
VACUUM ANALYZE;
The exact commands differ among Oracle, IBM, and PostgreSQL; the forms above are PostgreSQL's.
the DBMS processes a query in three phases.
Parsing. The DBMS parses the SQL query and chooses the most efficient access/execution plan.
Execution. The DBMS executes the SQL query using the chosen execution plan.
Fetching. The DBMS fetches the data and sends the result set back to the client.
What determines when to use an index?
Data sparsity refers to the number of different values a column could have.
Table size. Small tables don’t necessarily warrant indexes.
Rule-based Optimizer
uses preset rules and points to determine the best approach to execute a query. The rules assign a “fixed cost” to each SQL operation; the costs are then added to yield the cost of the execution plan.
Cost-based optimizer
uses sophisticated algorithms based on statistics about the objects being accessed to determine the best approach to execute a query. In this case, the optimizer process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to determine the total cost of a given execution plan.
Index guidelines
Create indexes for each single attribute used in a WHERE, HAVING, ORDER BY, or GROUP BY clause.
Declare primary and foreign keys
Declare indexes in join columns other than PK or FK (see the sketch below)
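A sketch applying these guidelines to the illustrative invoice table (inv_date is an assumed column):
CREATE INDEX idx_invoice_date ON invoice (inv_date);  -- column used in WHERE/ORDER BY
CREATE INDEX idx_invoice_cus ON invoice (cus_code);   -- join column (the FK to customer)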
What is data sparsity?
refers to how much a column’s values are repeated versus how many unique values it contains.
Query optimization techniques
Use simple columns or literals as operands in a conditional expression
Numeric field comparisons are faster than character, date, and NULL comparisons.
Equality comparisons are generally faster than inequality
When using multiple conditional expressions, write the equality conditions first.
If you use multiple AND conditions, write the condition most likely to be false first.
Try to avoid using NOT whenever possible (see the sketch below)
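A sketch of the first guideline (inv_date is an assumed column): a function wrapped around a column defeats a plain index, while simple operands let the optimizer use one:
-- Slower: the expression on inv_date prevents a plain index from being used
SELECT * FROM invoice WHERE EXTRACT(YEAR FROM inv_date) = 2024;
-- Faster: simple operands can use an index on inv_date
SELECT * FROM invoice
WHERE inv_date >= DATE '2024-01-01' AND inv_date < DATE '2025-01-01';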
Scaling Up
keeping the same number of systems but migrating each system to a larger one: for example, changing from a server with 16 CPU cores and a 1 TB storage system to a server with 64 CPU cores.
Scaling Out
means that when the workload exceeds the capacity of a server, the workload is spread out across a number of servers. This is also referred to as clustering—creating a cluster of low-cost servers to share a workload
Stream Processing
requires analysis of the data stream as it enters the system. In some situations, large volumes of data can enter the system at such a rapid pace that it is not feasible to try to store all of the data. The data must be processed and filtered as it enters the system to determine which data to keep and which data to discard.
Feedback loop processing
Capturing data, processing it into usable information, and then acting on that information forms a feedback loop. Feedback loop processing that provides immediate results requires analyzing large amounts of data within just a few seconds.
Value (Big Data)
also called viability, refers to the degree to which the data can be analyzed to provide meaningful information that can add value to the organization.
Visualization (Big Data)
the ability to graphically present the data in such a way as to make it understandable.
Veracity (Big Data)
Refers to the trustworthiness of the data. One of the keys to data modeling is that only the data that is of interest to the users should be included in the data model; data that is not of value should not be recorded in any data store.
ACID
(atomicity, consistency, isolation, and durability):
four essential properties that ensure data integrity and reliability in databases, especially in transactional systems.
Atomicity
A transaction is fully completed or fully rolled back—no partial updates.
Consistency
The database moves from one valid state to another (no broken constraints).
Isolation
Transactions are independent of each other—one transaction doesn’t affect another until it’s committed.
Durability
Once committed, data is permanently saved (even if there’s a system crash).
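A sketch of atomicity and durability, reusing the accounts table from the trigger example (column names assumed):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- durable once committed, even across a crash; ROLLBACK would undo both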
What are the four basic categories of NoSQL databases?
Key-value database
Document databases
Column-oriented databases
Graph databases
Graph databases
interdependent queries about relationships that could take hours to run in a relational database are the forte of graph databases. Graph databases can complete these queries in seconds. In fact, you often encounter the phrase “minutes to milliseconds”
Aggregate Aware
Key-value, document, and column family databases are aggregate aware. Aggregate aware means that the data is collected or aggregated around a central topic or entity.
T or F: Hadoop is outdated
Both. It is on the decline, as faster and more efficient cloud-based alternatives address its slow performance, high storage costs, and a design not suited to real-time processing.