Managing and querying DB Flashcards

Question

*Recovery algorithms*

Answer 1

Techniques to ensure database consistency, transaction atomicity, and durability despite failures. Recovery algorithms have two parts:

1. Actions taken during normal transaction processing to ensure that enough information exists to recover from failures.
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency, and durability.

Answer 2

* Volatile storage
  * Does not survive system crashes
  * Examples: main memory, cache memory
* Non-volatile storage
  * Survives system crashes
  * Examples: disk, tape, flash memory
* Stable storage
  * A "mythical" form of storage that survives all failures
  * Approximated by maintaining copies on distinct non-volatile media; copies can be at remote sites (to protect against fire, flood, etc.)

Answer 3

* The log is a sequence of records. The log of each transaction is maintained in stable storage so that, if a failure occurs, the database can be recovered from it.
* For a transaction T\_i, write the following to the log:
  * **Transaction start:**
  * **Before write:** let V1 be the value of X before the write and V2 the value to be written to X
  * **Transaction finishes:** when transaction T\_i finishes its last statement, write a commit log record

We assume that log records are written directly to stable storage.

There are two approaches to modifying the database:

1. Deferred database modification
2. Immediate database modification
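The logging steps above can be sketched in a few lines of Python. This is only an illustration: the tuple layout of the records, the helper names (`log_start`, `log_write`, `log_commit`), and the in-memory list standing in for stable storage are all assumptions, not something the flashcards define.

```python
# Hypothetical record layout (an assumption for illustration):
#   ("start", txn)
#   ("update", txn, item, v1, v2)   # v1 = old value, v2 = new value
#   ("commit", txn)

def log_start(log, txn):
    """Append a start record when the transaction begins."""
    log.append(("start", txn))

def log_write(log, txn, item, v1, v2):
    """Append an update record before item X is written."""
    log.append(("update", txn, item, v1, v2))

def log_commit(log, txn):
    """Append a commit record after the transaction's last statement."""
    log.append(("commit", txn))

# A plain list stands in for the log on stable storage.
log = []
log_start(log, "T1")
log_write(log, "T1", "X", 100, 50)
log_commit(log, "T1")
```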

Answer 4

* Transaction operations do not immediately update the physical database
* Only the transaction log is updated
* The physical database is updated only after the transaction reaches its commit point
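A minimal sketch of what recovery could look like under deferred modification, assuming a hypothetical log of tuples in which update records carry only the new value. Because nothing reaches the database before commit, recovery only has to redo the writes of committed transactions; the function name and the example values are made up for illustration.

```python
def recover_deferred(log, db):
    """Redo-only recovery: apply the writes of committed transactions in log order."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _txn, item, new_value = rec
            db[item] = new_value
    return db

# Hypothetical log: T1 committed, T2 was still running when the crash happened.
log = [
    ("start", "T1"),
    ("update", "T1", "A", 950),
    ("update", "T1", "B", 2050),
    ("commit", "T1"),
    ("start", "T2"),
    ("update", "T2", "C", 600),
]

# The database on disk was never touched by the uncommitted T2.
db = recover_deferred(log, {"A": 1000, "B": 2000, "C": 700})
```

T2's update is simply ignored, since its effects were never written to the physical database.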

Answer 5

* An approach to modifying the database:
  * The database is updated immediately by the transaction's operations during execution, even before the transaction reaches its commit point
  * The update log record must be written before the database item is written
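With immediate modification, uncommitted writes may already be on disk at the time of a crash, so recovery needs both the old and the new value of each item. The sketch below is a simplified illustration under assumed names and an assumed tuple layout for the log; it undoes uncommitted transactions first and then redoes committed ones.

```python
def recover_immediate(log, db):
    """Undo uncommitted writes (backwards), then redo committed writes (forwards)."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Undo pass: restore old values written by transactions that never committed.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _txn, item, old_value, _new_value = rec
            db[item] = old_value

    # Redo pass: reapply new values written by committed transactions.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _txn, item, _old_value, new_value = rec
            db[item] = new_value
    return db

# Hypothetical log records: ("update", txn, item, old_value, new_value)
log = [
    ("start", "T1"),
    ("update", "T1", "A", 1000, 950),
    ("update", "T1", "B", 2000, 2050),
    ("commit", "T1"),
    ("start", "T2"),
    ("update", "T2", "C", 700, 600),
]

# State on disk at the crash: A and C were flushed, B was not.
db = recover_immediate(log, {"A": 950, "B": 2000, "C": 600})
```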

Answer 6

* **Detection of failure**
* **Transfer of control**
* **Time to recover**
* **Hot-spare**

Answer 7

* Entity
  * A thing or object
  * Drawn as a rectangle
* Attributes
  * Entities have attributes
  * Drawn as ovals
* Relationships
  * Connect two entities
  * Drawn as a diamond between entities
  * Binary relationships: many to one, many to "exactly one", one to one
* Key
  * A set of attributes and relationships whose values will be unique for the class
  * Enables users to access specific objects
  * Makes loading and exchanging data possible

Answer 8

* Framework for persistent data
* When adding a new data item, it has to comply with the schema
* Defines constraints on attributes and relationships

Answer 9

* An entity that depends on another entity to be uniquely identified
* Example:
  * Course DAT405 is given three times, so we need to know both the course code and the reading period to identify a particular instance of the course
  * We introduce the concept of a given course, i.e. a course given in a particular reading period. A given course is a weak entity, dependent on the entity course. A given course has a teacher.

Answer 10

* A *table-scan* gets all blocks that contain tuples of the relation, one by one
* An *index-scan* involves using an index to get only the blocks that contain tuples satisfying the predicate
* *Table-scan* and *index-scan* are physical query operators
* In the example of the *lecture* relation, a table scan would get all 20 blocks. An index scan would grab only the blocks that contain tuples satisfying the predicate
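The difference between the two operators can be sketched as follows. The block layout (one tuple per block, 20 blocks), the attribute names, and the in-memory index are hypothetical stand-ins, and the cost of reading the index itself is not counted.

```python
from collections import defaultdict

def table_scan(blocks, predicate):
    """Read every block of the relation and filter its tuples."""
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1
        matches.extend(t for t in block if predicate(t))
    return matches, blocks_read

def build_index(blocks, attribute):
    """Map each attribute value to the numbers of the blocks that hold it."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks):
        for t in block:
            index[t[attribute]].add(block_no)
    return index

def index_scan(blocks, index, attribute, value):
    """Read only the blocks that the index lists for the requested value."""
    block_nos = sorted(index.get(value, ()))
    matches = [t for b in block_nos for t in blocks[b] if t[attribute] == value]
    return matches, len(block_nos)

# Hypothetical lecture relation: 20 blocks, one tuple per block.
blocks = [[{"course": "DAT405" if b in (3, 7) else "OTHER", "slot": b}] for b in range(20)]
index = build_index(blocks, "course")

scan_rows, scan_reads = table_scan(blocks, lambda t: t["course"] == "DAT405")
index_rows, index_reads = index_scan(blocks, index, "course", "DAT405")
```

Both calls return the same two tuples, but the table scan touches all 20 blocks while the index scan touches only 2.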

Answer 11

B. False. One example would be a very small relation that fits into a single disk block. Retrieving the entire relation would require just a single disk block transfer, whereas with an index we might first need to transfer the disk block that contains the index into main memory before transferring the disk block containing the data (2 disk block transfers in total).

Answer 12

A data structure (e.g. a hash map, giving O(1) expected lookup time) that makes it efficient to find those tuples that have a fixed value for an attribute

Answer 13

* **Nested loops join:** The most straightforward algorithm. For each row on the left-hand side, the right-hand side is scanned for a match based on the join condition. Ordinarily, rows on the right-hand side are accessed through an index to reduce the overall execution cost; this scenario is frequently referred to as an index nested loops join.
* **Merge join**
* **Hash join:** Partition the tuples of each relation into sets with the same hash value on the join attributes
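A rough in-memory sketch of two of these algorithms, assuming relations are lists of dictionaries and an equality join on a single attribute. The hash join shown here is the simplified build-and-probe variant rather than the partitioned version described above, and the sample relation contents are hypothetical.

```python
from collections import defaultdict

def nested_loops_join(left, right, key):
    """For each left row, scan the whole right relation for matches."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    """Build a hash table on the right relation, then probe it with each left row."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [(l, r) for l in left for r in buckets.get(l[key], [])]

# Hypothetical sample data.
lectures = [
    {"course": "DAT405", "room": "A"},
    {"course": "DAT405", "room": "B"},
    {"course": "TDA357", "room": "A"},
]
courses = [
    {"course": "DAT405", "teacher": "Alice"},
    {"course": "TDA357", "teacher": "Bob"},
]

nl_result = nested_loops_join(lectures, courses, "course")
hj_result = hash_join(lectures, courses, "course")
```

Both functions produce the same pairs; the hash join avoids rescanning the right relation for every left row.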

Answer 14

* Sorting while scanning a table
  * If there exists a B-tree index on the sort attribute(s), scan the index to find tuples in the required order
  * If the relation can fit into main memory, use a table-scan or index-scan to get all tuples into main memory, then use a main-memory sorting algorithm
  * If the relation is too large to fit into main memory, use a *multiway merge sort* algorithm
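The multiway merge sort idea can be sketched with Python's standard `heapq.merge`. In this illustration the sorted runs are kept as in-memory lists standing in for runs written to disk, and `memory_capacity` is a hypothetical parameter for how many tuples fit in main memory at once.

```python
import heapq

def multiway_merge_sort(values, memory_capacity):
    """Sort in two phases: create sorted runs, then merge all runs at once."""
    # Phase 1: sort chunks that fit in memory; each chunk becomes one sorted run.
    runs = [
        sorted(values[i:i + memory_capacity])
        for i in range(0, len(values), memory_capacity)
    ]
    # Phase 2: k-way merge of the runs, always emitting the smallest head element.
    return list(heapq.merge(*runs))

result = multiway_merge_sort([5, 2, 9, 1, 7, 3, 8], memory_capacity=3)
```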

Answer 15

* Selection
  * Choose rows from a relation
  * State the conditions that rows must satisfy
  * σ\_{condition}(T), e.g. σ\_{seats \>100}(Rooms)
* Projection
  * Choose columns (attributes) from a relation
  * π\_{name,seats}(Rooms)
* Cartesian product
  * Combine each row of one relation with each row of the other
* Join
  * Combine each pair of rows from the two relations if the condition is true
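These operators can be sketched over lists of dictionaries. The room names and the `lectures` relation are made up for illustration, and the product assumes the two relations have no attribute names in common.

```python
def select(relation, condition):
    """Selection: keep the rows that satisfy the condition."""
    return [row for row in relation if condition(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    result = []
    for row in relation:
        projected = {a: row[a] for a in attributes}
        if projected not in result:
            result.append(projected)
    return result

def product(r, s):
    """Cartesian product: combine each row of r with each row of s."""
    return [{**a, **b} for a in r for b in s]

def join(r, s, condition):
    """Join: the product restricted to row pairs that satisfy the condition."""
    return select(product(r, s), condition)

# Hypothetical data.
rooms = [{"name": "HB1", "seats": 150}, {"name": "EA", "seats": 80}]
lectures = [{"course": "DAT405", "room": "HB1"}]

big_rooms = select(rooms, lambda row: row["seats"] > 100)
names = project(rooms, ["name"])
located = join(rooms, lectures, lambda row: row["name"] == row["room"])
```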

Answer 16

* Left and right join arguments play different roles; joins are not symmetric
* Left-deep join trees
  * Fewer trees to be considered
  * Fit well with common join algorithms

Form 1 (left-deep):

* (R ⋈ S) ⋈ T
* (S ⋈ R) ⋈ T
* (R ⋈ T) ⋈ S
* (T ⋈ R) ⋈ S
* (S ⋈ T) ⋈ R
* (T ⋈ S) ⋈ R

Form 2 (right-deep):

* R ⋈ (S ⋈ T)
* R ⋈ (T ⋈ S)
* S ⋈ (R ⋈ T)
* S ⋈ (T ⋈ R)
* T ⋈ (R ⋈ S)
* T ⋈ (S ⋈ R)
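As a small illustration of why restricting the search to left-deep trees keeps the number of candidates manageable, the sketch below enumerates them: there is exactly one left-deep tree per ordering of the relations, so n relations give n! trees. The function name is hypothetical.

```python
from itertools import permutations

def left_deep_orders(relations):
    """Build one left-deep join expression for every ordering of the relations."""
    orders = []
    for perm in permutations(relations):
        expr = perm[0]
        for rel in perm[1:]:
            # The tree built so far is always the left argument of the next join.
            expr = f"({expr} ⋈ {rel})"
        orders.append(expr)
    return orders

orders = left_deep_orders(["R", "S", "T"])
```

For three relations this yields the six expressions of Form 1 (with an extra pair of outer parentheses).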

Answer 17

* Determines the most efficient way of executing a query
* **Logical query plan generation**
  * Which of the **algebraically equivalent forms** of a query leads to the most efficient algorithm?
* **Physical query plan generation**
  * For each **operation** (e.g. each join step), choose an algorithm that implements that operation
  * How should data be passed from one operation to another, e.g. in a pipelined fashion, in main-memory buffers, or via the disk?
* Choices depend on
  * size of relations
  * availability of indexes
  * layout of data on disk
  * approximate frequency of different values for an attribute

Answer 18

* In a logical query plan it doesn't (joins are commutative)
* In a physical query plan the relations on the left and on the right play different roles
  * Scan through each tuple of the relation on the left
  * Find matching tuples on the right

Answer 19

* Represent entities and the relationships between them **directly**
* Match closely our conceptual view of data
* Use three "abstraction mechanisms":
  * **Classification:** Entities which share common characteristics are grouped together into instances of a class
  * **Aggregation:** Regard a collection of values as properties of a single compound object or aggregate
  * **Generalisation:** If two or more classes have characteristics in common, then these commonalities can be abstracted into a general class

Answer 20

A graph data model that consists of triples:

* **Subject:** A reference to a resource (commonly a URI)
* **Predicate:** Can represent relationships or attributes
* **Object:** A reference to a resource or a literal
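A minimal sketch of the triple model, using plain Python tuples. The `ex:` identifiers are invented for illustration, and `match` is a hypothetical helper that treats `None` as a wildcard, loosely mirroring a variable in a triple pattern.

```python
# Each triple is (subject, predicate, object); the "ex:" names are made up.
triples = [
    ("ex:DAT405", "ex:taughtBy", "ex:alice"),  # object is a resource
    ("ex:DAT405", "ex:credits", 7.5),          # object is a literal
    ("ex:alice", "ex:name", "Alice"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

about_course = match(triples, s="ex:DAT405")
teachers = match(triples, p="ex:taughtBy")
```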

Answer 21

A data modelling vocabulary for RDF data.

Can define class hierarchies:

* rdfs:Class
* rdfs:subClassOf

Can define the domain and range of a predicate:

* rdfs:domain
* rdfs:range

Answer 22

* Huge public graph database with 9.5 billion RDF triples
* Public resource that extracts structured content from the information created in various Wikimedia projects
* Accessed with SPARQL queries

Answer 23

* Table with three columns storing subject, predicate and object values
* Table with three columns of integers, and a separate table that maps RDF terms to integers
* Property tables, where many properties of similar subjects are combined into n-ary tables
* Binary tables for each predicate
* Hexastore, where an index is created for every combination of subject, predicate and object (spo, sop, pso, pos, osp, ops)
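The Hexastore idea from the last bullet can be sketched as six nested dictionaries, one per ordering of subject, predicate and object. This is an in-memory illustration only; the class, its method names, and the `ex:` identifiers are assumptions and do not reflect any particular triple store's implementation.

```python
from collections import defaultdict
from itertools import permutations

class Hexastore:
    """Keeps one nested index for each of the six orderings of s, p and o."""

    def __init__(self):
        self.indexes = {
            "".join(order): defaultdict(lambda: defaultdict(set))
            for order in permutations("spo")
        }

    def add(self, s, p, o):
        """Insert the triple into all six indexes."""
        term = {"s": s, "p": p, "o": o}
        for order, index in self.indexes.items():
            first, second, third = (term[k] for k in order)
            index[first][second].add(third)

    def lookup(self, order, first, second):
        """Given the first two terms of an ordering, return the matching third terms."""
        index = self.indexes[order]
        if first not in index:
            return set()
        return set(index[first].get(second, set()))

store = Hexastore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:carol", "ex:knows", "ex:bob")

who_knows_bob = store.lookup("pos", "ex:knows", "ex:bob")
```

The trade-off is storage: every triple is indexed six times so that any pattern with two known terms can be answered with direct lookups.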

Answer 24

* OWL 2 is an ontology language for the Semantic Web
* A language for expressing ontologies
* The Protege ontology editor can be used to create ontologies

Answer 25

* A set of precise descriptive statements about some part of the world
* "World"
  * The domain of interest
  * The subject matter of the ontology
* A _vocabulary_ is a set of terms with fixed meaning
* A _terminology_ provides a vocabulary and states how terms are related

Answer 26

* An ontology describes all terms in the domain of interest
* A database *schema* covers all concepts that are to be modelled
* We **wouldn't** model synonyms like "person" and "human" in a database schema, but we might want to describe these terms and the relationship between them in an ontology.

Answer 27

* **Axioms:** the basic statements that an OWL ontology expresses
* **Entities:** elements used to refer to real-world objects
* **Expressions:** combinations of entities to form complex descriptions from basic ones

Answer 28

* Each key is a unique identifier

Answer 29

* NoSQL database * Aggregate oriented database

Answer 30

* NoSQL database
* Aggregate-oriented database
* The column-family is an aggregate
  * Addressed by row key and column-family name
* Each column-family is a combination of columns that “fit together”
* Good for:
  * Event logging
  * Content management systems, blogging platforms
  * Counters (e.g. visitors to a page)
  * Expiring usage (e.g. revoke access or remove a banner after some seconds)
* Bad for:
  * Transaction-heavy applications
  * Aggregate queries (aggregation must be done on the client)
  * Prototype systems (column-family design may have to be changed)

Answer 31

* A Universal Resource Identifier, in this WIKI, is defined to be an ASCII string used to identify things on the Semantic Web.
* An IRI is like a URI but allows a larger character set to be used in the string

Answer 32

* NoSQL database
* Uses graph structures (nodes and edges) for semantic queries
* Examples
  * Neo4j
  * GraphDB
* **Good for:**
  * Connected data, e.g. social networks
  * Routing and location-based services (e.g. recommend nearby restaurant)
  * Recommendation systems
* **Bad for:**
  * Update-intensive applications
  * Problems requiring global graph operations

Answer 33

* Nodes
  * Nodes are connected to other nodes via **relationships**
  * Nodes can have one or more **properties** (i.e., attributes stored as key/value pairs)
  * Nodes have one or more labels that describe their role in the graph
* Relationships
  * Relationships can have one or more **properties** (i.e., attributes stored as key/value pairs)
  * Relationships are directional
  * Nodes can have multiple, even recursive, relationships
* Labels
  * Labels are used to group nodes into sets
  * A node may have multiple labels
  * Labels are indexed to accelerate finding nodes in the graph
* Properties in Neo4j
  * Properties are named values where the name (or key) is a string (in relationships or nodes)
  * Properties can be indexed and constrained
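The building blocks listed above can be mirrored in plain Python to make the model concrete. This is not the Neo4j API; the class names, the `type` field on relationships, and the sample data are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                      # a node may have multiple labels
    properties: dict = field(default_factory=dict)   # key/value pairs

@dataclass
class Relationship:
    start: Node                                      # relationships are directional:
    type: str                                        # start -[type]-> end
    end: Node
    properties: dict = field(default_factory=dict)   # relationships have properties too

alice = Node({"Person"}, {"name": "Alice"})
bob = Node({"Person", "Teacher"}, {"name": "Bob"})
knows = Relationship(alice, "KNOWS", bob, {"since": 2020})
mentors = Relationship(bob, "MENTORS", bob)          # a recursive relationship

def nodes_with_label(nodes, label):
    """Labels group nodes into sets; return the nodes carrying the given label."""
    return [n for n in nodes if label in n.labels]

teachers = nodes_with_label([alice, bob], "Teacher")
```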

Answer 34

* Nodes
  * Entities, things that can be grouped into labels
  * Node properties represent entity attributes
* Relationships
  * Relationships between entities
  * Relationship properties express the strength, weight, or quality of a relationship, plus any relationship metadata, e.g. timestamp, version number

Answer 35

* Graphs enhance AI by providing **context**
  * E.g. AI explainability: semantic connections in knowledge graphs can help construct explanations

Answer 36

* Classes for entities
  * Subclasses
    * :Woman rdfs:subClassOf :Person
  * Equivalent classes
  * Disjoint classes
* Domain and range restrictions for **relationships**
  * :hasWife rdfs:domain :Man
* Inverse relationships
  * :hasParent owl:inverseOf :hasChild

Answer 37

* Identify stakeholders
* Identify benefits and possible harms for each stakeholder
* Weigh benefits against possible harms

Answer 38

1. Acknowledge that data are people and can do harm
2. Recognize that privacy is more than a binary value
3. Guard against the reidentification of your data
4. Practice ethical data sharing
5. Consider the strengths and limitations of your data; big does not automatically mean better
6. Debate the tough, ethical choices
7. Develop a code of conduct for your organization, research community, or industry
8. Design your data system for auditability
9. Engage with the broader consequences of data and analysis practices
10. Know when to break these rules

Answer 39

* **Active**, the initial state; the transaction stays in this state while it is executing.
* **Partially committed**, after the final statement has been executed.
* **Failed**, after the discovery that normal execution can no longer proceed.
* **Aborted**, after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after it has been aborted:
  * restart the transaction (only if there was no internal logical error)
  * kill the transaction
* **Committed**, after successful completion.
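The states above can be read as a small state machine. The sketch below encodes it in Python; the transition table follows the usual textbook state diagram and is an assumption here, since the card only lists the states, and the function name is hypothetical.

```python
# Allowed transitions between transaction states (assumed from the standard diagram).
TRANSITIONS = {
    "active": {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed": {"aborted"},
    "aborted": set(),      # terminal: the transaction is restarted as a new one, or killed
    "committed": set(),    # terminal
}

def step(state, new_state):
    """Move to new_state if the transition is allowed, otherwise raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

# A successful transaction.
state = "active"
state = step(state, "partially committed")
state = step(state, "committed")
```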

Answer 40

“2\*n + 1” is a rough indication of query cost.

If we assume that:

* each disk block that contains part of the Lectures relation contains one tuple,
* each request to find a matching row in the Courses relation requires one block to be brought in from disk,
* the entire index fits into a single block, which must be input() once, and
* the query plan produced has a join with the Lectures relation on the left,

then the cost will be 2\*n + 1, but ...

For example, if we assume that:

* each disk block that contains part of the Lectures relation contains several tuples,
* each request to find a matching row in the Courses relation requires one block to be brought in from disk,
* for each tuple in the Lectures relation, retrieving the matching tuple from the Courses relation requires one block to be brought in from disk, and/or
* the entire index is stored across many disk blocks,

then the cost will be greater than 2\*n + 1.

Cold database vs. warm database

* When the same query is run again, some of the required disk blocks might already be in main memory (i.e. the database is warm)
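The arithmetic behind these two scenarios can be written out as a small function. This is a simplified cost model built only from the assumptions listed above, with n taken as the number of disk blocks holding the Lectures relation; the function and parameter names are hypothetical.

```python
def join_cost(lecture_blocks, tuples_per_block=1, index_blocks=1, reads_per_lookup=1):
    """Estimate block transfers for an index join with Lectures on the left."""
    scan = lecture_blocks                                   # read every Lectures block once
    lookups = lecture_blocks * tuples_per_block * reads_per_lookup  # one fetch per Lectures tuple
    return scan + lookups + index_blocks                    # plus reading the index itself

n = 20
baseline = join_cost(n)                           # one tuple per block, one index block
several_tuples = join_cost(n, tuples_per_block=5)
large_index = join_cost(n, index_blocks=4)
```

With the first set of assumptions the estimate is n + n + 1 = 2\*n + 1; relaxing either assumption pushes the cost above that figure. A warm database would lower the real cost, because some of these transfers would be served from main memory.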

Answer 41

* Queries two or more endpoints (e.g. Wikidata and DBpedia)
* Can be very slow; performance depends on the services that support federated queries

Answer 42

* Both are graph databases with slightly different data models
  * RDF triples for GraphDB
  * Labeled property graph for Neo4j
* With the Cypher query language (for Neo4j) it is easier to sketch a specific path pattern