Data Management Flashcards
What is a data model?
a simple representation that demonstrates how data structures, including their characteristics, relationships, constraints, and transformations, form a database that solves a real-world business problem.
business rule
a policy, procedure, or principle within a specific organization. Business rules are used to define entities, attributes, relationships, and constraints.
the basic features of the relational data model
entities, attributes, relationships
Translate business rules into data models
a noun in a business rule will translate into an entity in the model, and a verb (active or passive) that associates the nouns will translate into a relationship among the entities
entity
An entity is a person, place, thing, concept, or event about which data will be collected and stored.
Attribute
An attribute is a characteristic of an entity. For example, a CUSTOMER entity would be described by attributes such as customer last name, customer first name, customer phone number, customer address, and customer credit limit.
Relationships
describes an association among entities. Data models use three types of relationships: one-to-many, many-to-many, and one-to-one, with shorthand notations 1:M or 1..*, M:N or *..*, and 1:1 or 1..1, respectively.
Visualize One-to-many (1:M or 1..*) relationship.
An author has many books, but a book has one author: 1:M, AUTHOR publishes BOOKS
An invoice is created by one customer, but a customer generates many invoices: 1:M, CUSTOMER generates INVOICES
Visualize Many-to-many (M:N or *..*) relationship.
An employee may learn many job skills, and each job skill may be learned by many employees. “EMPLOYEE learns SKILL” as M:N.
A student can take many classes, and each class can be taken by many students: M:N, STUDENT takes CLASSES
One-to-one (1:1 or 1..1) relationship.
A retail company’s management structure may require that each of its stores be managed by a single employee. “EMPLOYEE manages STORE” is labeled 1:1.
Schema
The schema is the conceptual organization of the entire database as viewed by the database administrator.
Hierarchical model
The hierarchical model depicts a set of one-to-many (1:M) relationships between a parent and its children segments.
Network model
the user perceives the network database as a collection of records in 1:M relationships. However, unlike the hierarchical model, the network model allows a record to have more than one parent.
Subschema
The subschema defines the portion of the database “seen” by the application programs that actually produce the desired information from the data within the database.
Relational model
A relation (sometimes called a table) is a two-dimensional structure composed of intersecting rows and columns. Each row in a relation is called a tuple. Each column represents an attribute.
The Entity relationship model
The graphical representation of entities and their relationships in a database structure. ER models are normally represented in an entity relationship diagram (ERD)
Object-Oriented Data Model (OODM)
data and its relationships are contained in a single structure known as an object, forming the basis for the object-oriented database management system (OODBMS).
Why is an object said to have greater semantic content than an entity?
an object includes information about relationships between the facts within the object, as well as information about its relationships with other objects.
Class vs object
A class is a collection of similar objects with shared structure (attributes) and behavior (methods). An object is an abstraction of an entity
Extensible Markup Language (XML)
a markup language designed to store, transport, and structure data in a way that is both human-readable and machine-readable. allows users to define their own tags and data structures based on their needs. <tag></tag>
table
A logical construct perceived to be a two-dimensional structure composed of intersecting rows (entities) and columns (attributes) that represents an entity set in the relational model.
The 3 Vs
VOLUME: amounts of data being stored
VELOCITY: the speed at which data grows, but also the need to process that data quickly
VARIETY: the fact that data is collected in multiple formats
Internet of Things (IoT)
A web of Internet-connected devices constantly exchanging and collecting data over the Internet. IoT devices can be remotely managed and configured to collect data and interact with other devices on the Internet.
What is Hadoop, and what are its basic components?
A Java-based, open-source framework for distributed file storage and processing. Its basic components are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS uses the write-once, read-many model: once the data is written, it cannot be modified.
data
Raw facts, or facts that have not yet been processed to reveal their meaning to the end user.
field
A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data.
record
A logically connected set of one or more fields that describes a person, place, or thing.
file
A collection of related records.
what is data redundancy?
Exists when the same data is stored unnecessarily in different places, leading to poor data security, inconsistency, data-entry errors, and integrity problems
The difference between data, info, and a database
data are raw facts; information is data processed to reveal meaning. The database system consists of logically related data stored in a single logical data repository; the current generation of DBMS software stores not only the data structures but also the relationships between those structures and the access paths to those structures
What does a business need to manage a database?
Hardware, software, people, procedures or rules governing the design, and the data
What is metadata?
The metadata describes the data characteristics and the set of relationships that links the data found within the database.
Structured data vs. non-structured
Structured data has been formatted to facilitate storage, use, and information generation in a predefined data model. Unstructured data exists in its raw, unorganized state, such as free-form text, video, and audio.
Information
the result of processing raw data to reveal its meaning. Data processing can be as simple as organizing data to reveal patterns or as complex as making forecasts or drawing inferences using statistical modeling.
data dictionary
stores metadata (data about data); contains the data definitions as well as their characteristics and relationships.
Data Anomalies
Develop when not all of the required changes to redundant data are made successfully.
What are the characteristics of big data?
Velocity: the speed at which data emanates and changes
Value: The value that can be derived from access and analysis
Veracity: The discrepancies found in data
Volume: The sheer size of data generated every second
Variety: The combination of datatypes in the system
Data warehouse
centralized repository designed for structured data. Like a server that has all the data sets. Data is cleaned, transformed, and organized before being stored. Often used for historical analysis, reporting, and dashboards.
Data lake
a centralized repository designed to store raw, unstructured, semi-structured, and structured data at scale. Supports a variety of formats like text, video, audio, JSON, and CSV.
ETL
a data integration process that moves data from multiple sources into a destination system
Extract: Retrieve data from various sources like databases, APIs, or files.
Transform: Clean, standardize, and format the data (e.g., removing duplicates, converting data types).
Load: Store the transformed data into the destination system, typically a data warehouse.
The Cloud
Contains structured, semi-structured, and unstructured data, but doesn't rely on local maintenance or a server. Uses tools like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud
Multicloud
Uses several cloud storage and computing providers simultaneously
Hybrid Cloud
Uses both public cloud providers and a secure, private cloud
The relational model’s three components
A logical data structure represented by relations
A set of integrity rules to enforce that the data is consistent and remains consistent over time
A set of operations that defines how data is manipulated
What are the characteristics of a relational table?
It is a two-dimensional structure composed of rows and columns.
Each table row (tuple) represents a single entity.
Each table column represents an attribute with a distinct name.
All values in a column must conform to the same data format.
Each column has a specific range of values known as the attribute domain.
The order of the rows and columns is immaterial to the DBMS.
Each table must have an attribute or combination of attributes that uniquely identifies each row.
Determination and functional dependency
The state in which knowing the value of one attribute makes it possible to determine the value of another. For example, revenue - cost = profit: the known values of revenue and cost determine profit. The same idea underlies functional dependency, which means that the value of one or more attributes determines the value of one or more other attributes
Entity Integrity
the condition in which each row (entity instance) in the table has its own known, unique identity.
What are the requirements of the primary key for entity integrity?
all of the values in the primary key must be unique and
no key attribute in the primary key can contain a null
Foreign Key
A foreign key (FK) is the primary key of one table that has been placed into another table to create a common attribute.
Referential Integrity
a concept in relational databases that ensures the consistency and accuracy of data by maintaining valid relationships between tables; it enforces rules about how foreign keys relate to primary keys
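A minimal SQL sketch (table and column names are illustrative, not from the source) showing entity integrity, a foreign key, and referential integrity rules declared together:
-- Parent table: the primary key enforces entity integrity (unique, non-null)
CREATE TABLE customer (
    cus_code  INT PRIMARY KEY,
    cus_lname VARCHAR(50) NOT NULL
);
-- Child table: the foreign key must match an existing customer
CREATE TABLE invoice (
    inv_number INT PRIMARY KEY,
    cus_code   INT NOT NULL REFERENCES customer (cus_code)
        ON UPDATE CASCADE    -- key changes propagate to invoices
        ON DELETE RESTRICT   -- a customer with invoices cannot be deleted
);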
Relational algebra
defines the theoretical way of manipulating table contents using relational operators
SELECT (RESTRICT)
It yields values for all rows found in the table that satisfy a given condition.
PROJECT
yields all values for selected attributes. It is also a unary operator, accepting only one table as input.
UNION
combines all rows from two tables, excluding duplicate rows.
INTERSECT
yields only the rows that appear in both tables.
Product
yields all possible pairs of rows from two tables, also known as the Cartesian product. If one table has 6 rows and the other has 3 rows, the PRODUCT yields a list composed of 6 × 3 = 18 rows.
JOIN
allows information to be intelligently combined from two or more tables.
Equijoin
links tables on the basis of an equality condition that compares specified columns of each table. The outcome of the equijoin does not eliminate duplicate columns.
inner join
only returns matched records from the tables being joined. In an outer join, by contrast, the matched pairs are retained and any unmatched values in the other table are left null.
outer join
an “inner join plus.” The outer join still returns all of the matched records that the inner join returns, plus it returns the unmatched records from one of the tables.
What join is useful for uncovering referential integrity errors?
Outer joins. Such problems arise when foreign key values do not match the primary key values in the related table(s).
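A hedged sketch of this check, reusing the illustrative customer and invoice tables from the sketch above and assuming the foreign key constraint is absent (otherwise orphans could not occur):
SELECT i.inv_number, i.cus_code
FROM invoice i
LEFT JOIN customer c ON i.cus_code = c.cus_code
WHERE c.cus_code IS NULL;  -- unmatched invoices reveal orphaned foreign keys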
DIVIDE
is used to answer questions about one set of data being associated with all values of data in another set. For example: which CUSTOMERs bought all three PRODUCTs?
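SQL has no DIVIDE operator; the classic emulation is a double NOT EXISTS. A sketch assuming hypothetical customer, product, and purchase tables:
SELECT c.cus_code
FROM customer c
WHERE NOT EXISTS (              -- no product exists ...
    SELECT 1 FROM product p
    WHERE NOT EXISTS (          -- ... that this customer failed to buy
        SELECT 1 FROM purchase pu
        WHERE pu.cus_code = c.cus_code
          AND pu.product_id = p.product_id));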
system catalog
a detailed system data dictionary that describes all objects within the database, including data about table names, a table’s creator and creation date, authorized users, and access privileges. the system catalog tables can be queried just like any user/designer-created table.
homonyms
similar-sounding words with different meanings, such as boar and bore; in a database, the use of the same name to label different attributes.
For example, you might use C_NAME to label a customer name attribute in a CUSTOMER table and also use C_NAME to label a consultant name attribute in a CONSULTANT table. Avoid this.
synonyms
indicates the use of different names to describe the same attribute. For example, car and auto.
what is a real test of redundancy?
if you delete an attribute and the original information can still be generated through relational algebra, the inclusion of that attribute would be redundant.
T or F: All redundancies must be deleted
False. Planned redundancies are common in good database design. Sometimes redundancies occur to maintain historical data.
Index
an index is an ordered arrangement of keys and pointers. Each key points to the location of the data identified by the key.
What are the four main NoSQL databases?
Key-value stores
Document Databases
Wide-Column Stores
Graph Databases
Graph databases
Definition: Use nodes, edges, and properties to represent relationships between data (e.g., Neo4j, Amazon Neptune).
Relational Potential:
Graph databases are inherently non-relational because they are optimized for querying relationships directly through graph structures.
Wide-Column Stores
Store data in a column-oriented format where each row can have varying columns (e.g., Apache Cassandra, HBase). Great for extremely large amounts of data where speed is of utmost importance.
Document databases
Store documents, often in JSON or BSON format, and have no fixed schema or table relationships. Document stores prioritize flexibility and hierarchical data, making them better suited for unstructured data. Popular document databases include Apache CouchDB, MongoDB, and Azure Cosmos DB
Key-Value stores
Key-value stores lack relationships, constraints, and structured querying. Ideal for fast lookups and simple use cases. Every element is stored as a key-value pair consisting of an attribute name (“key”) and a value. Popular systems are Redis, DynamoDB, Oracle NoSQL. Useful for shopping carts, user preferences, user profiles.
Natural Keys
Data you already store that meets the requirements of a primary key: no nulls, all values unique
Composite Key
multiple attribute columns that together provide the unique identifier for a row
Surrogate Primary Key
keys that have no real-world meaning; their entire purpose is to create a unique column in a data table, much like library card numbers, credit card numbers, and driver's license numbers
Foreign Keys
Foreign key columns store the primary key values of the rows they are related to. Because they need to store the same values, their data type needs to be the same as the primary key's data type in the related table.
Views
A view in SQL is a virtual table created based on a query. It does not store data but dynamically pulls it from the underlying tables.
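A minimal sketch of a view (the cus_credit_limit column is an assumption for illustration):
CREATE VIEW big_customers AS
SELECT cus_code, cus_lname
FROM customer
WHERE cus_credit_limit > 10000;
-- Queried like a table; rows are pulled from CUSTOMER at query time
SELECT * FROM big_customers;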
Normalization
Normalization is the process of removing redundancy from a database.
1st normal form: Ensures each column contains atomic (indivisible) values. Removes duplicate columns and ensures each row is unique.
2nd: Removes partial dependencies (no non-key column depends on part of a composite key)
3rd: removes transitive dependencies (see the sketch below)
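A sketch of a 3NF decomposition using the Class_Name/Teacher_Name example that appears later in this deck (all names illustrative):
-- Before: enrollment(student_id, class_id, class_name, teacher_name)
--   class_name depends on class_id alone (partial dependency);
--   teacher_name depends on class_id via class_name (transitive dependency)
-- After: each non-key column depends on the key, the whole key, and nothing but the key
CREATE TABLE teacher (teacher_id INT PRIMARY KEY, teacher_name VARCHAR(50));
CREATE TABLE class (
    class_id   INT PRIMARY KEY,
    class_name VARCHAR(50),
    teacher_id INT REFERENCES teacher (teacher_id)
);
CREATE TABLE enrollment (
    student_id INT,
    class_id   INT REFERENCES class (class_id),
    PRIMARY KEY (student_id, class_id)
);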
OLTP systems
Online Transaction Processing. These systems reduce data redundancy by using more table relationships to increase database write speed, storing information quickly. For instance, an online storefront primarily concerned with letting customers put items into their shopping carts and quickly processing their payment information
OLAP systems
OLAP stands for Online Analytical Processing; databases designed with this model are primarily concerned with retrieval of information and are designed to support analysis workloads
What does a column have to achieve to be non-transitively dependent?
Its value is directly determined by the primary key.
It is not indirectly dependent on the primary key through another column.
What database would you use for social networks, recommendation engines, fraud detection, supply chain management
Graph databases: Analyzing and querying complex relationships between entities.
What database would you use for IoT applications, real-time analysis, log management, CMS
Wide-column stores: Handling large-scale, high-throughput applications with semi-structured data.
What database would you use for CMS, CRMs, ecommerce, or mobile applications?
Document databases. Best For: Managing semi-structured or hierarchical data, such as JSON or XML.
What databases would you use for caching, session management, feature flags like A/B testing, or leaderboard systems?
Key-value stores: Best For: Simple, fast lookups of key-value pairs.
Visualize how to model a Primary Key in an ER model
TABLE NAME (KEY_ATTRIBUTE 1, ATTRIBUTE 2, ATTRIBUTE 3, … ATTRIBUTE K), with the primary key attribute conventionally underlined
Visualize how to model attributes of an entity in a ER
If entity is a STUDENT, then attributes could be STU_LNAME, STU_FNAME, STU_ETHNICITY
What two conditions must be met before an entity can be classified as a weak entity?
The entity is existence-dependent; it cannot exist without the entity with which it has a relationship.
The entity has a primary key that is partially or totally derived from the parent entity in the relationship.
Recursive relationships
a recursive relationship is one in which a relationship can exist between occurrences of the same entity set.
What are cardinalities
Expresses the minimum and maximum number of entity occurrences associated with one occurrence of the related entity, written as (x,y). For example, a professor teaches from one to four classes: (1,4).
Composite Attributes
an attribute that can be further subdivided to yield additional attributes. For example, the attribute ADDRESS can be subdivided into street, city, state, and zip code.
What are good modeling practices for multivalued attributes
- Create a new entity composed of the original multivalued attribute’s components
- Within the original entity, create several new attributes, one for each component of the original attribute
Derived attributes
attributes whose values are computed, or derived, from other attributes, such as calculating an age from a date of birth
How do you define relationships between entities?
Using active or passive verbs; the relationship reads in both directions:
A CUSTOMER generates many INVOICES
Each INVOICE is generated by one CUSTOMER
What is connectivity?
Used to describe the relationship classification such as 1:M, M:N, and 1:1
What is a weak relationship?
A weak (non-identifying) relationship exists if the primary key of the related (child) entity does not contain a primary key component of the parent entity. By default, relationships are established by having the primary key of the parent entity appear as a foreign key (FK) on the related entity (also known as the child entity).
Strong relationships
A strong (identifying) relationship exists when the primary key of the related (child) entity contains a primary key component of the parent entity; that is, the foreign key is part of the child's primary key.
What are multivalued attributes?
Multivalued attributes are attributes that can have many values. For instance, a person may have several college degrees, and a household may have several different phones, each with its own number.
What is a composite entity?
This associative entity, also called a composite or bridge entity, is in a 1:M relationship with each of the parent entities and is composed of the primary key attributes of each parent entity.
Steps to developing an ERD model?
Create a detailed narrative of the organization’s description of operations.
Identify the business rules based on the description of operations.
Identify the main entities and relationships from the business rules.
Develop the initial ERD.
Identify the attributes and primary keys that adequately describe the entities.
Revise and review the ERD.
What 3 database requirements are often conflicting and must be addressed in design
Design standards
Processing speed
Information requirements
Single-valued attributes
A single-valued attribute is an attribute that can have only a single value. For example, a person can have only one Social Security number, and a manufactured part can have only one serial number.
Simple attributes
A simple attribute is the opposite of a composite attribute in that it cannot be subdivided. For example, age, sex, and marital status would be classified as simple attributes.
Why is an object an abstraction of an entity?
Abstractions simplify complexity by modeling only what matters for the system. An object is an abstraction of an entity because it simplifies a complex real-world concept into a model that can be represented and manipulated programmatically.
Entity Set
An entity set is a group of entities that belong to the same type and are represented by the same attributes. It is a fundamental concept in database design, particularly in the context of the Entity-Relationship (ER) model.
T or F: To implement many-to-many (M:N) relationships in databases, you cross reference with primary and foreign keys.
False. Relational databases do not support direct many-to-many relationships between tables and require a join table.
What is a join, junction, or cross-reference table
Contains two foreign keys that reference the primary keys of the two related tables.
Represents individual instances of the many-to-many relationship (see the sketch below).
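A sketch of such a junction table, assuming hypothetical orders and product tables whose primary keys already exist:
CREATE TABLE order_line (
    order_id   INT REFERENCES orders (order_id),
    product_id INT REFERENCES product (product_id),
    quantity   INT,
    PRIMARY KEY (order_id, product_id)  -- one row per order/product pair
);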
Dimensional modeling
a data modeling technique used to organize and structure data in a data warehouse; it is made up of two types of data: facts and dimensions.
Facts: things that can usually be measured and aggregated, such as profit or sales. These are usually stored in one table.
Dimensions: stored in multiple tables, these provide additional context for facts, such as month or product category.
Facts Table (dimensional modeling)
Facts are individual pieces of data or information that we want to store and analyze in our data warehouse. These are usually numerical or quantitative values, such as the number of products sold, total sales amount, or number of customer complaints.
Dimensions Table (dimensional modeling)
A dimension in data warehousing is a collection of categories or attributes that describe the facts in your data.
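A minimal star-schema sketch tying the two together (all table and column names illustrative):
-- Dimension tables provide the descriptive context
CREATE TABLE dim_product (product_key INT PRIMARY KEY, category VARCHAR(40));
CREATE TABLE dim_date (date_key INT PRIMARY KEY, month INT, year INT);
-- The fact table stores the measurable values, keyed by its dimensions
CREATE TABLE fact_sales (
    product_key  INT REFERENCES dim_product (product_key),
    date_key     INT REFERENCES dim_date (date_key),
    units_sold   INT,
    sales_amount NUMERIC(12,2)
);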
What schema is suitable for large amounts of data in a data warehouse, but has slower performance?
Snowflake Schema
What schema is suitable for small amounts of data in a data warehouse, but has faster performance?
Star Schema
Which type of data warehouse is faster, more reliable, and stores data locally, but is expensive to set up and maintain and limited in scalability?
On-premises warehouses
Describe Cloud Warehouses
They have great scalability
More cost-effective (they follow an OpEx model rather than requiring up-front CapEx)
More accessible
Describe primary key vs. indexes
While a primary key is a logical constraint that uniquely identifies each row in a table, an index is a database structure that improves the speed of data retrieval for specific queries. It is a performance optimization feature and does not require unique values.
B-tree indexes
a tree data structure with a root and nodes. Each node holds an index value that splits the range of values found in the indexed column: values less than the node's value are stored in the left branch of the tree, and values greater than it are stored in the right branch.
Bitmap indexes
store a series of bits for indexed values. The number of bits used is the same as the number of distinct values in a column. For example, a column that has either a yes or no value would require two bits, one corresponding to the yes, and one corresponding to the no.
Hash index
Hash functions take arbitrary-length data and map it to a fixed-size value; hash values are designed so that different inputs will usually produce different outputs.
Bloom filter indexes
bloom filters are especially useful when querying arbitrary combinations of a large number of attributes. A bloom filter index is probabilistic rather than deterministic: it may return some results that do not actually fit the filter criteria, but it is very space efficient. It is a lossy representation.
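A PostgreSQL sketch of declaring these index types on the illustrative customer table; note PostgreSQL has no user-declared bitmap index (Oracle's CREATE BITMAP INDEX does this), though it builds bitmap scans automatically:
-- B-tree is the default index type
CREATE INDEX idx_customer_lname ON customer USING btree (cus_lname);
-- Hash indexes support only equality comparisons
CREATE INDEX idx_customer_code ON customer USING hash (cus_code);
-- Bloom indexes require the bloom extension (supports int and text columns)
CREATE EXTENSION IF NOT EXISTS bloom;
CREATE INDEX idx_customer_bloom ON customer USING bloom (cus_code, cus_lname);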
Why is it important to know the Nested Loop, Hash Join, Merge Join, and subqueries
Knowing these join strategies lets you enable or disable the nested loop, hash, or merge join to save time and space; when time or space matters, you can try each one.
Use EXPLAIN to measure how much work the join required.
The code below toggles the join methods in PostgreSQL:
SET enable_nestloop = true;
SET enable_hashjoin = false;
SET enable_mergejoin = false;
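A usage sketch, reusing the illustrative customer and invoice tables from earlier: force the merge join, then let EXPLAIN ANALYZE report the plan and its actual cost:
SET enable_nestloop = false;
SET enable_hashjoin = false;
SET enable_mergejoin = true;
EXPLAIN ANALYZE                 -- prints the chosen plan plus actual run time
SELECT c.cus_lname, i.inv_number
FROM customer c
JOIN invoice i ON i.cus_code = c.cus_code;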
Schema vs. model
A “schema” refers to the detailed structure of data within a database, defining how tables, fields, and relationships are organized, while a “model” is a high-level conceptual blueprint that outlines the overall design of a database system, including the entities, attributes, and relationships between them.
Why is ELT sometimes preferred
It lets data engineers interact directly with raw data almost in real time with minimal processing and transportation time.
ELT
Extract, load, transform: data is extracted from a source server or servers and is then transported immediately to the target location and loaded. No transformation occurs between these two steps.
When do you want to use ETL?
you may have a need to transform sensitive information so it’s not sitting unmasked in a data lake or warehouse
Limit data access through filtering
Process data for migration to different servers
Trigger
An extraction tool in PostgreSQL that fires a specified function (the trigger) when certain actions occur or conditions are met. For example, creating a trigger that runs every time a row of the table accounts is about to be updated:
CREATE TRIGGER check_update
BEFORE UPDATE ON accounts
FOR EACH ROW
EXECUTE FUNCTION check_account_update();
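The trigger calls a trigger function that must be created first. A minimal sketch of what check_account_update() might contain (the body and the last_updated column are assumptions, not from the source):
CREATE FUNCTION check_account_update() RETURNS trigger AS $$
BEGIN
    NEW.last_updated := now();  -- e.g., stamp the row before the update lands
    RETURN NEW;                 -- returning NEW lets the UPDATE proceed
END;
$$ LANGUAGE plpgsql;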
Change data capture
An extraction tool that only captures the data that has changed since the previous ETL operation. Debezium is just one example of an open source CDC platform
data transformations
This process involves looking for errors, inconsistencies, or other validation problems and stripping or restructuring data accordingly. Examples include deduplicating redundant or identical records, mapping values between the source and destination databases, performing data validation to ensure records have compatible data types and align between the source and destination schemas, and establishing key relationships across tables.
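A small transformation sketch (the staging and clean tables are hypothetical): deduplicate records and enforce a basic validation rule while moving data out of staging:
INSERT INTO customer_clean (cus_code, cus_lname)
SELECT DISTINCT cus_code, cus_lname   -- deduplicate identical records
FROM customer_staging
WHERE cus_code IS NOT NULL;           -- validation: the key must be present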
T or F: Security and cost require that you transform the data on the source server, filtering out sensitive information and implementing security processes; therefore you should use ETL.
T: With ETL you transform data on the source server before loading it into the destination. This is slower than using in-warehouse data transformation (ELT).
Incremental loading
Part of ETL/ELT, it evaluates the differences between the source and destination data sets to determine what data has changed since the last load. Any modified or added records are imported through streaming, batch, or small batch incremental loads.
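A hedged sketch of a batch incremental load (the etl_watermark table and updated_at column are assumptions):
INSERT INTO warehouse_sales
SELECT *
FROM source_sales
WHERE updated_at > (SELECT max(loaded_through) FROM etl_watermark);
-- Afterward, advance the watermark so the next run picks up only newer rows
UPDATE etl_watermark SET loaded_through = now();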
full load
the entire data set is loaded into the target data warehouse, and existing information is overwritten every time data is loaded.
T or F: You should do a full load to restore historical data
F: Historical data is lost every time you do a full load. Full loads should only be done for disaster recovery or first-time transfers.
partial dependency
When a column depends on only part of a composite primary key, not the whole thing.
For example, Class_Name depending on just Class_ID when the primary key also includes Student_ID.
Transitive dependency
When a column depends on the primary key indirectly, through another non-key column.
For example, Teacher_Name depending on Class_ID through Class_Name; Teacher_Name needs to be placed in its own table.
What are reasons to denormalize
Data warehouses routinely use 2NF structures
When you require higher processing speeds and less space
Atomic attribute
When an attribute can no longer be subdivided.
For example, EMP_NAME is not atomic because it can be subdivided into a last name and a first name.
Systems Development Life Cycle (SDLC)
divided into five phases: planning, analysis, detailed systems design, implementation, and maintenance. The SDLC is an iterative process rather than a sequential process.
Database Life Cycle (DBLC)
Contains six phases: database initial study, database design, implementation and loading, testing and evaluation, operation, and maintenance and evolution.
Conceptual design
a conceptual data model that describes the main data entities, attributes, relationships, and constraints of a given problem domain.
Centralized design
design can be carried out and represented in a fairly simple database; typical of relatively simple, small databases, it can be successfully done by a single database administrator or by a small, informal design team.
Decentralized design
used when the system’s data component has a considerable number of entities and complex relations on which very complex operations are performed.
Top-down design
Identifies the data sets and then defines the data elements for each of those sets. This process involves identifying the different entity types and defining each entity's attributes.
Bottom-up design
identifies the data elements (items) and then groups them together in data sets. In other words, it first defines attributes, and then groups them to form entities.
Logical Design
an enterprise-wide database that is based on a specific data model but independent of physical-level details. logical design for a relational DBMS includes the specifications for the relations (tables), relationships, and constraints (in other words, domain definitions, data validations, and security views).
Physical Design
the process of determining the data storage organization and data access characteristics of the database to ensure its integrity, security, and performance.
Key performance indicators (KPIs)
Quantifiable numeric or scale-based measurements that assess the company’s effectiveness or success in reaching its strategic and operational goals.
Operational Data
Operational data storage is optimized to support transactions that represent daily operations.
decision support data
gives tactical and strategic business meaning to the operational data. Support data differs from operational data in three main areas: time span, granularity, and dimensionality.
Data Mart
small, single-subject data warehouse subset that provides decision support to a small group of people. A data mart could be created from data extracted from a larger data warehouse.
What are the three requirements for a decision support database
Database schema, data extraction and filtering, and database size
data cube
Used in Multidimensional online analytical processing (MOLAP). The location of each data value in the data cube is a function of the x-, y-, and z-axes in a three-dimensional space. The three axes represent the dimensions of the data value.
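In SQL, GROUP BY CUBE aggregates a fact across every combination of the listed dimensions, much like slicing a data cube along its axes (table and column names illustrative):
SELECT region, product, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY CUBE (region, product);  -- totals per region+product, per region, per product, and overall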
Facts
Numeric measurements (values) that represent a specific business aspect or activity. Facts commonly used in business data analysis are units, costs, prices, and revenues.
Dimensions
Qualifying characteristics that provide additional perspectives to a given fact. Dimensions provide descriptive characteristics about the facts through their attributes.
What is drill-down analysis
Drill-down involves going from a summary or higher-level view of the data to a more detailed or specific view.
How does a data analyst usually look at facts?
Through the dimensions' attributes. A data warehouse DBMS that is optimized for decision support first searches the smaller dimension tables before accessing the larger fact tables.
What is this an example of?
Start with Yearly Sales:
2024: $1M
Drill-down to Quarterly Sales:
Q1: $250k, Q2: $300k, Q3: $200k, Q4: $250k
Drill further into Monthly Sales for Q1:
January: $100k, February: $80k, March: $70k
Drill-down analysis
Database Performance Tuning
the goal of database performance is to execute queries as fast as possible. Therefore, database performance must be closely monitored and regularly tuned. Database performance tuning refers to a set of activities and procedures designed to reduce the response time of the database system
What are database statistics, and why are they important?
statistics provide information about database size, number of records, average access time, number of requests serviced, and number of users with access rights. These statistics are then used to determine the best access strategy. Current-generation DBMSs are intelligent enough to determine the best type of index to use under certain circumstances
How are database statistics obtained?
EXPLAIN ANALYZE;
ANALYZE;
VACUUM ANALYZE;
The exact commands differ among Oracle, IBM, and PostgreSQL; the forms above are PostgreSQL's.
the DBMS processes a query in three phases.
Parsing. The DBMS parses the SQL query and chooses the most efficient access/execution plan.
Execution. The DBMS executes the SQL query using the chosen execution plan.
Fetching. The DBMS fetches the data and sends the result set back to the client.
What determines when to use an index?
Data sparsity refers to the number of different values a column could have.
Table size. Small tables don’t necessarily warrant indexes.
Rule-based Optimizer
uses preset rules and points to determine the best approach to execute a query. The rules assign a “fixed cost” to each SQL operation; the costs are then added to yield the cost of the execution plan.
Cost-based optimizer
uses sophisticated algorithms based on statistics about the objects being accessed to determine the best approach to execute a query. In this case, the optimizer process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to determine the total cost of a given execution plan.
Index guidelines
Create indexes for each single attribute used in a WHERE, HAVING, ORDER BY, or GROUP BY clause.
Declare primary and foreign keys
Declare indexes in join columns other than PK or FK (see the sketch below)
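A sketch applying these guidelines to the illustrative invoice table (inv_date is an assumed column):
CREATE INDEX idx_invoice_date ON invoice (inv_date);  -- column used in WHERE/ORDER BY
CREATE INDEX idx_invoice_cus ON invoice (cus_code);   -- join column (the FK to customer)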
What is data sparsity?
refers to how much a column’s values are repeated versus how many unique values it contains.
Query optimization techniques
Use simple columns or literals as operands in a conditional expression
Numeric field comparisons are faster than character, date, and NULL comparisons.
Equality comparisons are generally faster than inequality
When using multiple conditional expressions, write the equality conditions first.
If you use multiple AND conditions, write the condition most likely to be false first.
Try to avoid using NOT whenever possible (see the sketch below)
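A sketch of the first guideline (inv_date is an assumed column): a function wrapped around a column defeats a plain index, while simple operands let the optimizer use one:
-- Slower: the expression on inv_date prevents a plain index from being used
SELECT * FROM invoice WHERE EXTRACT(YEAR FROM inv_date) = 2024;
-- Faster: simple operands can use an index on inv_date
SELECT * FROM invoice
WHERE inv_date >= DATE '2024-01-01' AND inv_date < DATE '2025-01-01';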
Scaling Up
keeping the same number of systems but migrating each system to a larger one: for example, changing from a server with 16 CPU cores and a 1 TB storage system to a server with 64 CPU cores.
Scaling Out
means that when the workload exceeds the capacity of a server, the workload is spread out across a number of servers. This is also referred to as clustering—creating a cluster of low-cost servers to share a workload
Stream Processing
requires analysis of the data stream as it enters the system. In some situations, large volumes of data can enter the system at such a rapid pace that it is not feasible to try to store all of the data. The data must be processed and filtered as it enters the system to determine which data to keep and which data to discard.
Feedback loop processing
Capturing data, processing it into usable information, and then acting on that information forms a feedback loop. Feedback loop processing that provides immediate results requires analyzing large amounts of data within just a few seconds.
Value (Big Data)
also called viability, refers to the degree to which the data can be analyzed to provide meaningful information that can add value to the organization.
Visualization (Big Data)
the ability to graphically present the data in such a way as to make it understandable.
Veracity (Big Data)
Refers to the trustworthiness of the data. One of the keys to data modeling is that only the data that is of interest to the users should be included in the data model; data that is not of value should not be recorded in any data store.
ACID
(atomicity, consistency, isolation, and durability):
four essential properties that ensure data integrity and reliability in databases, especially in transactional systems.
Atomicity
A transaction is fully completed or fully rolled back—no partial updates.
Consistency
The database moves from one valid state to another (no broken constraints).
Isolation
Transactions are independent of each other—one transaction doesn’t affect another until it’s committed.
Durability
Once committed, data is permanently saved (even if there’s a system crash).
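A sketch of atomicity and durability, reusing the accounts table from the trigger example (column names assumed):
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;  -- durable once committed, even across a crash; ROLLBACK would undo both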
What are the four basic categories of NoSQL databases?
Key-value database
Document databases
Column-oriented databases
Graph databases
Graph databases
interdependent queries about relationships that could take hours to run in a relational database are the forte of graph databases. Graph databases can complete these queries in seconds. In fact, you often encounter the phrase “minutes to milliseconds”
Aggregate Aware
Key-value, document, and column family databases are aggregate aware. Aggregate aware means that the data is collected or aggregated around a central topic or entity.
T or F: Hadoop is outdated
Both. It is on the decline, as faster and more efficient cloud-based alternatives address its slow performance, high storage costs, and a design not suited to real-time processing.