Managing and querying DB Flashcards

1
Q

Concurrency control schemes

A

Control the interactions among the concurrent transactions in order to prevent them from destroying the consistency of the database

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

ACID transactions

A
  • Atomic: Either the whole transaction is run, or nothing
  • Consistent: Database constrainst are preserved
  • Isolated: Different transactions may not interact with each other
  • Durable: Effects of a transactions are not lost in case of a system crash
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Transaction concept

A
  • A transaction is a unit of pgrogram execution that accesses and possibly updates various data items.
  • A transaction must see a consistent databse
  • During transaction execution the database may be inconsistent
  • When the transaction is committed, the database must be consistent
  • Two main issues to deal with
    • Failures of various kinds such as hardware failures and system crashes
    • Concurrent exection of multiple transaction
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Implicit schema

A
  • document can be stored as it is without having to define a schema for it
  • without checking that it conforms to a schema.
  • An explicit schema requires the schema to be defined and that documents are compliant before they can be stored successfull
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Charactaristics of NoSQL

A
  • Non-relational
  • Cluster-friendly
  • schema-less
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Impedance mismatch problem

A
  • The programming model of data doesn’t match the database model of
  • This led to the rise of object-oriented databases in the 1990s.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Aggregate-orientation

A
  • Nested hierarchical structure
  • Aggregate orientation fits naturally with clusters
  • Can store a whole aggreagte on a single node
  • For applications where we need to slice and dice data in diferent ways, RDBMS and graph databases are more appropriate
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Which of the NoSQL is most different ot the others ?

A
  • Graph database
  • Other are aggregate oriented (Column-family, Document databases, and key-value)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

polyglot persistence

A
  • Different kinds of data are best dealt with different data storage technologies.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Sharding

A
  • Taking one copy of the data and splitting it across many machines
  • Nothing is shared
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Technical debt

A

Implied cost of additional rework caused by choosing an easy (limited) solution now instead of using a better approach that would take longer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

CAP theorem

A

it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

  1. Consistency: Every read receives the most recent write or an error
  2. Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  3. Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
  • Single-server RDBMS is a “CA” system
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Difference between relation and relationship

A

Relation refers to a table in a relational model based database while relationship refers to how two tables are connected

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Relational DBMS

A
  • Tabular structure
  • Foundation in set theory
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Database implementation topics

A
  • Database sytem architecutre
  • Block buffer
  • Transactions and concurrency
  • Recovery
  • Indexes
  • Query processing and optimisation
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Three-level schema

A
  1. External:
  2. Conceptual: Logical level
  3. Interal: Storage/physical level
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Data independence (Logical and physical)

A
  • Logical data independence
    • Conceptual schema must be able to evolve without having to change external application program
  • Physical data independence
    • Must be able to substitute a different physical sotrage schema for an existing one without having to change external application programs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Data models

A
  • “No data model”
    • Flat files
  • “Classical” data models
    • Hierarchical
    • Network
    • Relational
  • Semantic data models
    • Entity-Relatoinship model
    • Functional dat model
    • SDM
  • Objet oriented
  • NoSQL
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Failure classifcation in DB

A
  • Transaction failure
    • Logical errors: transaction cannot complete due to some internal error condition
    • System errors: the DB system must terminate an active transaction due to an error condition (e.g deadlock)
  • System crash: a power failure or other hardware or software failure causes the system to crash
    • fail-stop assumption: non-volatile storage contens are assumed to not be corrupted by system crash
  • Disk failure: a head crash or similiar disk failre destroy all or part of disk storage
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q
A

D. input() and output()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q
A

C. Main memory

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Locking

A
  • A lock is a mechanism to control concurrent access to a data item.
  • There is potential for deadlock; transactions(s) must be rolled back to release lock(s) and resolve the deadlock
  • Starvation is also possible e.g
    • A transaction may be waiting for an exclusive lock on an item, while a sequence of other transactions request and are granted shared locks on the same item.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

Database system architecture

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q
A

B. False

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Recovery algorithms

A

Techniques to ensure database consistency and transaction atomicity and durability despite failures. Contains two parts

  1. Actions taken during normal transcation processing to ensure enough information eists to recover from failures
  2. Actions taken after a failure to recover the database contents to a state that ensure atomicity, consistency and durability.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Storage structure

A
  • Volatile storage
    • Does not survive system crashes
    • examples: main memory, cache memory
  • Non-volatie storage
    • Survive system crashes
    • examples: disk, tape, flash meomry
  • Stable storage
    • a “mythical” form of storage that survives all failures
    • Approximated by maintining copies on distinct non-volative media; copies can be at remote sites (to protect against fire, food etc..)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Log-based recovery

A
  • The log is a sequence of records. Log of each transaction is maintained in some stable storage so that if any failure occurs, then it can be recovered from there.
  • For a transaction T_i write following to vlog
    • Transaction start: <t_i></t_i>
    • before write: let V1 be value of X before write and V2 the value to be written to X <t_i></t_i>
    • Transaction finishes: When transaction T_i it last statment, write commit log <t_i></t_i>

We assume that log records are written directly to stable storage

There are two approaches to modify the database:

  1. Deferred database modification
  2. immediate database modifcation
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

Deferred database modification

A
  • Transactions operations do not immediately update the physical database
  • Only transaction log is update
  • Physical database is only updated after transcation reaches commit
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Immediate database modification

A
  • approache to modify the database:
  • Database is immediately updated by the transaction operation during the execution of the transcation even before it reaches the commit point
  • Update log record must be written before database item is written.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Remote backup systems

A
  • Detection of failure
      • Transfer of control
  • Time to recover
  • Hot-spare
33
Q

Entity-relationship diagram

A
  • Entity
    • A thing or object
    • Rectangle
  • Attributes
    • Entities have attributes
    • Ovals
  • Relationships
    • Connects two entities
    • Drawn as a diamond between entities
    • Binary relationships: Many to one, many to “exactly one”, one to one
  • Key
    • A set of attributes and relationships whose values will be unique for the class
    • Enable users to accesss specific objects
    • Make loading and exchanging data possible
34
Q

Schema

A
  • Framework for persistent data
  • When addint a new data item, it has to comply to the schema
  • Define constraints on attributes and relationships
35
Q

Weak entities

A
  • An entitiy that depends on another entity to be uniquely identified
  • Example:
    • Course DAT405 is given three times, so we need to know both the course code and
      the reading period to identify a particular instance of the course
    • We introduce the concept of a given course, i.e. a course given in a particular reading
      period. A given course is a weak entity, dependent on the entity course. A given
      course has a teacher.
36
Q

Describe table scan and index scan

A
  • A table-scan gets all blocks that contain tuples of the relation, one-by-one
  • And index-scan involved using an index to get blocks that contain tuples that satisfy the predicate
  • Table-scan and index-scan are physical query operators
  • In the picture below we see the lecture relation. A table scan would get all 20 blocks. Index would only grab the blocks contains the tuples satisfying the prediate
37
Q

Performing an index-scan is always faster than peforming a table-scan

A. True

B. False

A

B. False

One example would be a very small relation that fits into a single disk block. Retrieving the entire relation would require just a single disk block transfer, whereas using an index we might need to first transfer the disk block that contains the index into main memory before we transfer the disk block containing the data (2 disk block transfers in total).

38
Q

index

A

A datastruture (i.e hash map giving O(1) complexity to look up) that makes it efficient to find those tuples that have a fixed value for an attribute

39
Q

Join algorithms

A
  • Nested Loops Join The most straightforward algorithm is Nested Loops Join. For each row on the left-hand side, the right-hand side is scanned for a match based on the join condition. Ordinarily, rows on the right-hand side are accessed through an index to reduce the overall execution cost. This scenario is frequently referred to as an Index Nested Loops Join.
  • Merge join
  • Hash join: Partition tuples of ech relation into sets with same has hvalue on the join attributes
40
Q

sort scan

A
  • Sorting while scanning a table
  • If there exists a B-tree index on the sort attribute(s), scan the index to find tuples in the required order
  • If relation can fit into main memory, use table-scan or index-scan to get all tuples into main memory, then use a main-memory sorting algorithm
  • If relation is too large to fit into main memory, use a multiway merge sort algorithm
41
Q

Relational operators

A
  • Selection
    • Choose rows from a relation
    • State conditions which must satisfy
    • σ_{condition}(T) i.e σ_{seats >100}(Rooms)
  • Projection
    • Choose columns (attributes) from a relation
    • π_{name,seats}(Rooms)
  • Cartesian produc
    • Combine each row of the two relations
  • Join
    • Combine each row of the two relations if the conditions is true
42
Q
A
  • Left and right joing arguments play differnt roles; not symmetric
  • Left deep join trees
    • Fewver trees to be considered
    • Fit well with common join algorithms

Form 1: (Left joins)

( R ⋈ S ) ⋈ T

( S ⋈ R ) ⋈ T

( R ⋈ T ) ⋈ S

( T ⋈ R ) ⋈ S

( S ⋈ T ) ⋈ R

( T ⋈ S ) ⋈ R

Form 2: Right joins

R ⋈ ( S ⋈ T )

R ⋈ ( T ⋈ S )

S ⋈ ( R ⋈ T )

S ⋈ ( T ⋈ R )

T ⋈ ( R ⋈ S )

T ⋈ ( S ⋈ R )

43
Q

Query optimization

A
  • Determines the most efficient way of executing a query
  • Logical query plan generation
    • Which of the algebraically equivalent forms of a query lead to the most efficient algorithms
  • Physical query plan generation
    • For each operation (e..g each joint step) choose an algorithm that implement that operation
    • How should data be passed from operation to another, e.. in a pipelined fashion, in main-memory buffers, or via the disk?
  • Choices depend on
    • size of relations
    • availability of indexes,
    • layout of data on disk
    • approximate ferquency of different values for an attribute,
44
Q

Does switching inner and outer relation make an effect ?

A
  • In a logical query plan it doesnt (they are communitative)
  • In physical query plan the relation on the left and on the right play different roles
    • Scan through each tuple on the relation on the left
    • Find matching tuples on the right
45
Q
A
46
Q

Query costs

A
47
Q

What is the query cost ?

A
48
Q

Why don’t we use indexes on all attributes ?

A
49
Q
A
50
Q

Semantic data modeling

A
  • represent entities and the relationships between them directly
  • match closely our conceptual view of data.

Uses three “abstractions machanisms”

  • Classification: Entities which share common characteristics are gouped together into instances of a class
  • Aggregation: Regard a collection of values as properties of a single compound object or aggregate
  • Generalisation: If two or more classes have characterstics in common, then thes commonalities can be abstracted into a general class
51
Q

Resource description framework (RDF)

A

A graph data model that consists of triples

  • Subject: A reference to a resource (commonly a URI)
  • Predicate: Can represent relationships or attributes
  • Object: A reference to a resource or a literal
52
Q

RDF Schema (RDFS)

A

A data modelling vocabulary for RDF data

Can define class hierarchies

  • rdfs:class
  • rdfs:subclass

Can define the domain and range of a predicate

  • rdfs:domain
  • rdfs:range
53
Q
A
54
Q

DBpedia

A
  • Huge graph database public graph database with 9.5billion RDF triples
  • public resource extract structured content from the information created in various Wikimedia projects
  • Accessed with SPARQL query
55
Q

Storing RDF triples in Relational DB

A

• Table with three columns storing subject, predicate and object values
• Table with three columns of integers, and a separate table that maps RDF
terms to integers
• Property tables where many properties of similar subjects are combined
into n-ary tables
• Binary tables for each predicate
• Hexastore, where an index is created for every combination of subject,
predicate and object (spo, sop, pso, pos, osp, ops)

56
Q

SPARQL query from DBpedia - Show all universities and their respective countries

A
57
Q

OWL 2 Web Ontology Language

A
  • OWL 2, is an ontology language for the Semantic Web
  • Langue for expressing ontologies
  • Protege ontology editor can be used to create ontologie
58
Q

Ontology

A
  • A set of precise descriptive statements about some part of the world
  • “world”
    • The domain of interest
    • The subject matter of the ontology
  • A vocabulary is a set of terms with fixed meaning
  • A terminology provides a vocabulary and sates how terms are related
59
Q

Ontology vs. schema

A
  • An ontology describes all terms in the domain of interest
  • A database schema covers all concepts that are to be modelled
  • We wouldn’t model synoms like “person” and “human” in a database schema, bu we might want to escribe these terms and the relationship between them in an ontology.
60
Q

Basic notions in OWL 2

A
  • Axioms: the basic statements that an OWL ontology expresses
  • Entities: elements used to refer to real world objects
  • Expressions: combinations of entities to form exomplex descriptions from basic ones
61
Q

Key-value databases

A
  • Each key is a unique identifier
62
Q

Document database

A
  • NoSQL database
  • Aggregate oriented database
63
Q

Collumn family

A
  • NoSQL database
  • Aggregreate orientated database
  • The column-family is an aggregate
    • addressed by row key and columnfamily name
  • Each column-family is a combination of columns that “fit together”
  • Good for:
    • Event logging
    • Content management systems, blogging platforms
    • Counters (e.g. visitors to a page)
    • Expiring usage (e.g. revoke access or remove a banner after some seconds)
  • Bad for:
    • Transaction-heavy applications
    • Aggregate queries (aggregation must be done on client)
    • Prototype systems (column-family design may have to be changed)
64
Q

Universal resrouce identifier (URI)

and

Internationalized resource identifer (IRI)

A
  • A Universal Resource Identifier, in this WIKI, is defined to be an ASCII string used to identify things on the Semantic Web.
  • Like URI’s but allows a larger character set to be used in the string
65
Q

Graph databases

A
  • No SQL database
  • graph structures (nodes and edges) for semantic queries)
  • Examples
    • Neo4j
    • GraphDB
  • Good for:
    • Connected data, e.g. social networks
    • Routing and location-based services (e.g. recommend nearby restaurant)
    • Recommendation systems
  • Bad for:
    • Update-intensive applications
    • Problems requiring global graph operations
66
Q

Data complexity of Databases

A
67
Q

Labeled property graph model

A
  • Nodes
    • Nodes are connected to other nodes via relationships
    • Nodes can have one or more properties (i.e., attributes stored as key/value pairs)
    • Nodes have one or more labels that describes its role in the graph
  • Relationship
    • Relationships can have one or more properties (i.e., attributes stored as key/value pairs)
    • Relationships are directional
    • Nodes can have multiple, even recursive relationships
  • Labels
    • Labels are used to group nodes into sets
    • A node may have multiple labels
    • Labels are indexed to accelerate finding nodes in the graph
  • Properties in Neo4j
    • Properties are named values where the name (or key) is a string (in relationships or nodes)
    • Properties can be indexed and constrained
68
Q

Neo4j graph design guidelines

A
  • Nodes
    • Entities, things that can be grouped into labels
    • Node propertiers to represent entity attribute
  • Relationships
    • Relationship between entities
    • Relationship properties to express strength, weight, or quality of relationship, plus any relationship metadata,. i.e timestamp, version number
69
Q

AI and graph databases

A
  • Graph enhance AI by providing context
  • i.e AI explainability - can use semantic connections in knowledge graphs to help construct explanations
70
Q

What are axioms can be written in an ontology

A
  • Classes for entities
    • Subclasses
      • :woman rdfs:subClassof :Person
    • equivalent classes
    • disjoint classes
  • Domain and range restrictions for relationships
    • For relationships
    • :hasWife rdfs:domain :Man
  • Inverse relationships
    • :hasParent owl: inverseOf :hasChild
71
Q

Discussing ethical issues

A
  • Identify stakeholders
  • Identify benefits and possible harms for each stakeholder
  • Weight benefits against possible harm
72
Q

10 rules for responsible big data research

A
  1. Acknowledge that data are people and can do harm
  2. Recognize that privacy is more than a binary value
  3. Guard against the reidentification of your data
  4. Practice ethical data sharing
  5. Consider strengths and limitations of your data; big does not automatically mean better
  6. debate the tought, ethical choices
  7. Develop a code of conduct for your organization, research community, or industry
  8. Design your data system for auditability
  9. Engage with the broader consequences of data and analysis practices
  10. Know when to break these rules
73
Q

Transaction states

A
  • Active, the initial state; the transaction stays in this state while it is executing.
  • Partially committed, after the final statement has been executed.
  • Failed, after the discovery that normal execution can no longer proceed.
  • Aborted, after the transaction has been rolled back and the database
    restored to its state prior to the start of the transaction. Two options after it has been aborted:
    • restart the transaction – only if no internal logical error
    • kill the transaction
  • Committed, after successful completion.
74
Q

Factors that affect query cost

A

“2*n + 1” is a rough indication of query cost

If we assume that:

  • each disk block that contains part of the Lectures relation contains one tuple,
  • each request to find a matching row in the Courses relation requires one block to be
  • brought in from disk, and
  • the entire index fits into a single block which must be input() once, and
  • the query plan produced has a join with the Lectures relation on he left

Then the cost will be 2*n + 1, but …

For example, if we assume that:

  • each disk block that contains part of the Lectures relation contains several tuples, and
  • each request to find a matching row in the Courses relation requires one
    block to be brought in from disk, and
  • for each tuple in the Lectures relation retrieving the matching tuple from the Courses relation requires one block to be brought in from disk and/or
  • the entire index is stored across many disk blocks

Then the cost will be greater than 2*n + 1

Cold database vs. warm database

  • Running same query, some of the required disc blocks might already be in main memory (i.e the database is warm)
75
Q

federated SPARQL query

A
  • Query from two or more endpoint (wikidata and DBpedia)
  • Super slow depends on the service that supporta frederated query
76
Q

Difference between GraphDB and Neo4j

A
  • Both are graph databases with slighlty different data models
    • RDF tripples for graph DB
    • Labeled property graph
  • With cypher query langauge (for neo4j) it is easier to sketch an specific path pattern