Software Architecture Flashcards

Kleppmann: Designing Data-Intensive Applications

1
Q

What are the main driving forces behind using a NoSQL database?

A
  1. Better scalability than relational databases for cases with large datasets or high throughput requirements
  2. A widespread preference for free and open source software over commercial database products
  3. Specialised query operations that the relational model does not support well
  4. Desire for a more expressive data model than a rigid schema imposed by relational databases
2
Q

What is a key problem with the relational data model for applications that are commonly written today, and what is a way to mitigate this problem?

A

Most business applications are written in object-oriented languages, so a translation layer is needed between the inherent representation of an object in application code and its representation in tables, rows and columns (the object-relational impedance mismatch). ORM frameworks reduce the boilerplate needed for this translation, but only partially abstract away the mismatch between the two models.

3
Q

What are some of the inherent advantages to a JSON data model and when are these advantages realised?

A
  1. Nested structures are better-suited to database entries that are largely self-contained, like a profile or a blog post with associated comments in a one-to-many tree structure
  2. Closer relationship between the object model from OOP and the JSON model for representing the data
  3. Better locality of related data, as all related data is contained within the structure, rather than requiring joins
  4. No schema is enforced by the database, so the model is more flexible if the exact fields in an entry cannot be rigidly defined
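
Example (a minimal sketch; the profile fields are hypothetical and not taken from the source): a self-contained record with one-to-many data nested inside it, serialised as a single JSON value.

```python
import json

# Hypothetical profile document: the one-to-many data (positions, education)
# is nested inside the parent record rather than split across tables.
profile = {
    "user_id": 251,
    "name": "Ada Example",
    "positions": [
        {"title": "Engineer", "organisation": "Example Corp"},
        {"title": "Analyst", "organisation": "Sample Ltd"},
    ],
    "education": [
        {"school": "Example University", "start": 2001, "end": 2004},
    ],
}

# One serialised value holds everything needed to render the profile,
# so a single read fetches it without any joins.
serialised = json.dumps(profile)
restored = json.loads(serialised)

titles = [position["title"] for position in restored["positions"]]
print(titles)
```

The nested lists map directly onto object fields in application code, which is the closer fit with the object model mentioned in point 2.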
4
Q

What is the primary idea behind normalisation in databases?

A

Removing duplication of meaningful data that is shared across multiple records: the value is stored once in a table of standardised values, and each record refers to it through an ID (a foreign key).

5
Q

What are the main benefits of normalising a database?

A
  1. Updating information needs to be done in only one place and will be replicated consistently across all records, such as when a country name changes
  2. Ensures consistency for values that exist within a logical set, e.g. countries and cities
  3. Easier and more semantically significant search, e.g. determining nearby users by storing metadata about their profile location
  4. Easier localisation by having human-readable labels translated into multiple different languages, but all associated with the same primary key
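
Example (a minimal sketch using Python's built-in sqlite3 module; the table and column names are assumptions made for illustration): the country name is stored once, so renaming it is a single-row update that every referencing user picks up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(id)
    );
    INSERT INTO countries (id, name) VALUES (1, 'Swaziland');
    INSERT INTO users (name, country_id) VALUES ('Alice', 1), ('Bob', 1);
""")

# The country is renamed: one row changes, and every user sees the new name.
conn.execute("UPDATE countries SET name = 'Eswatini' WHERE id = 1")

rows = conn.execute("""
    SELECT users.name, countries.name
    FROM users JOIN countries ON users.country_id = countries.id
    ORDER BY users.name
""").fetchall()
print(rows)
```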
6
Q

What are the major drawbacks of a document-based model?

A
  1. Many-to-one relationships are awkward: when many records reference one common record, that record must be referenced by ID in each document to avoid duplication, and multiple queries may be required to retrieve the related data
  2. Even though a model may not have originally required many-to-one or many-to-many relationships or joins, it may evolve over time into a more interconnected structure that does, and document databases have limited support for joins
  3. Denormalising data or replicating joins in code can lead to worse maintainability, reliability and performance for the data model
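
Example (a minimal sketch; plain Python dictionaries stand in for a document store, and all names are hypothetical): emulating a join in application code means an extra lookup for every document that references shared data.

```python
# Hypothetical in-memory "collections" standing in for a document store.
organisations = {
    "org-1": {"name": "Example Corp", "city": "Sydney"},
}
profiles = [
    {"name": "Alice", "organisation_id": "org-1"},
    {"name": "Bob", "organisation_id": "org-1"},
]

def profile_with_organisation(profile):
    """Emulate a join in application code: a second lookup per profile."""
    organisation = organisations[profile["organisation_id"]]
    return {"name": profile["name"], "organisation": organisation["name"]}

joined = [profile_with_organisation(p) for p in profiles]
print(joined)
```

Against a real database each lookup would be a separate query, which is where the maintainability and performance costs in point 3 come from.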
7
Q

What are the main advantages of a relational database model over a document database model?

A
  1. Better support for joins
  2. Better support for many-to-one relationships
  3. Better support for many-to-many relationships
8
Q

Why would one denormalise a database?

A

In a non-relational data model (such as a document database), support for joins is typically more limited, so denormalising data (by duplicating common data across records) reduces the need to replicate joins in application code.

9
Q

What is the distinction between schema-on-read and schema-on-write models in a database?

A

In a schema-on-read model, no explicit schema is enforced on data added to the database (as in a JSON document model), and the structure of a record is only known once it is read and interpreted by application code. By contrast, schema-on-write databases (like conventional relational databases) enforce a structure on each record as it’s inserted into the database. This guarantees that any record read from the database will match certain constraints.

10
Q

Under which circumstances does the difference between a schema-on-read and a schema-on-write database become most apparent and what is the distinction on a high level?

A
  1. The distinction arises when updating the structure of records in that model, such as fields in a document or columns in a particular table
  2. In a schema-on-read model, application code handles cases where certain records have the old structure and new records, after the change, have a different structure
  3. In a schema-on-write model, a migration must be done on existing records so that they conform to the updated schema, but no special casing in application code is subsequently required
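
Example (a minimal sketch; the change of splitting a full name into a separate first name is a hypothetical illustration, and sqlite3 stands in for a relational database): the same structural change handled at read time versus with a one-off migration.

```python
import sqlite3

# Schema-on-read: old documents store a full "name", new ones a "first_name".
documents = [
    {"name": "Ada Lovelace"},                        # old structure
    {"first_name": "Grace", "last_name": "Hopper"},  # new structure
]

def first_name(document):
    """Application code copes with both structures when reading."""
    if "first_name" in document:
        return document["first_name"]
    return document["name"].split(" ")[0]

read_side = [first_name(d) for d in documents]

# Schema-on-write: migrate existing rows once, then read without special cases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute(
    "UPDATE users SET first_name = substr(name, 1, instr(name, ' ') - 1)"
)
write_side = [row[0] for row in conn.execute("SELECT first_name FROM users")]
```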
11
Q

Under which situations would one prefer a data model without an explicit schema, as opposed to an explicit schema-on-write model?

A
  1. When there are many different types of objects (in terms of the names and nature of their fields) or they change so frequently that it would be impractical to maintain them as separate tables and perform regular migrations
  2. When the data being stored in the database comes from an external source and there are no guarantees on the structure of data from that source
12
Q

When is the data locality provided by a document database model disadvantageous?

A
  1. As the document is stored in a serialised format, the whole document needs to be read even when only one field, or a relatively small subset of the fields, is needed
  2. When updating a field in a record, the whole record needs to be rewritten. This especially becomes a problem if the record increases in size as a result of writes and limits how writes can be done in-place
13
Q

What is a good compromise between the choices of using a relational database and a document database in an application that doesn’t perfectly fit either model?

A
  1. One option would be to use support in modern relational databases for document column types, enabling a relational structure with document querying aspects
  2. The alternative is to use a document database that has built-in support for joins, or that resolves document references automatically, to simplify application code
14
Q

What are the major differences between a declarative query language like SQL and an imperative query model?

A
  1. Declarative languages are more limited in expressive power, but are more concise than imperative queries to write
  2. Declarative languages make no inherent assumptions about the ordering of data records, while there is no guarantee that an imperative query does not assume records are ordered
  3. Declarative languages make no assumptions about how a query is processed by the database, enabling the database to optimise queries more transparently
  4. Imperative code is harder to parallelise, whereas databases can freely use distributed or parallel implementations of a query language
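
Example (a minimal sketch; the animal records are hypothetical and sqlite3 stands in for the database): the same filter written imperatively in application code and declaratively in SQL.

```python
import sqlite3

animals = [
    {"name": "Great white", "family": "Sharks"},
    {"name": "Lion", "family": "Cats"},
    {"name": "Hammerhead", "family": "Sharks"},
]

# Imperative: the code spells out how to scan the records, step by step.
def get_sharks_imperative(records):
    sharks = []
    for record in records:
        if record["family"] == "Sharks":
            sharks.append(record["name"])
    return sharks

# Declarative: the query states what is wanted; the engine picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (:name, :family)", animals)
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    )
]
imperative = sorted(get_sharks_imperative(animals))
```

The loop fixes an iteration order and an access path, whereas the SQL version leaves the database free to use an index or a parallel scan without the query changing.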
15
Q

What is MapReduce and when would it be useful?

A
  1. MapReduce is a programming model for processing large amounts of data in bulk across many machines; some document databases support a limited form of it for read-only queries over many documents
  2. It is useful in scenarios where data needs to be aggregated en masse
16
Q

What is one implementation of MapReduce in a document-based model and what are its key components?

A
  1. MongoDB implements MapReduce
  2. The key components are two functions
    - Map: Called once for each record that matches the query; emits a key-value pair
    - Reduce: Called once per key to aggregate all values that map emitted for that key
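
Example (a minimal sketch in plain Python, not MongoDB's actual API; the observation documents and the helper map_reduce are hypothetical): map emits key-value pairs, the framework groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

# Hypothetical observation documents.
observations = [
    {"month": "2024-01", "family": "Sharks", "num_animals": 3},
    {"month": "2024-01", "family": "Cats", "num_animals": 5},
    {"month": "2024-01", "family": "Sharks", "num_animals": 4},
    {"month": "2024-02", "family": "Sharks", "num_animals": 1},
]

def map_fn(document):
    """Called once per document; emits zero or more (key, value) pairs."""
    if document["family"] == "Sharks":
        yield document["month"], document["num_animals"]

def reduce_fn(key, values):
    """Called once per key with all values emitted for that key."""
    return sum(values)

def map_reduce(documents, mapper, reducer):
    """Stand-in for the framework: group emitted values by key, then reduce."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

sharks_per_month = map_reduce(observations, map_fn, reduce_fn)
```

Both functions are pure (they only use their inputs), which is what lets a real implementation run them on any machine and in any order.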
17
Q

What is one major drawback of MongoDB’s MapReduce implementation and what did MongoDB introduce to mitigate it?

A
  1. MapReduce involves authoring two carefully-coordinated functions to correctly aggregate data
  2. The aggregation pipeline mitigates this issue by providing a declarative JSON syntax for querying records
18
Q

When would it be advantageous to use a graph-like data model and what is the primary advantage?

A
  1. Graph-like data models are useful when the objects in your data model are highly connected and have a high degree of many-to-many relationships
  2. By employing a standard graph structure with vertices and edges, standard graph algorithms can be used when analysing data in the network
19
Q

What are the key components of a property graph model?

A
  1. A table representing the vertices in the graph, which has a unique identifier and a set of key-value pairs (properties)
  2. A table representing the edges in the graph, which has a unique identifier, a label describing the kind of relationship, properties, and most importantly foreign keys pointing to the head and tail vertices of the edge
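
Example (a minimal sketch of the two tables using sqlite3; the column names, the label column and the sample data are assumptions for illustration, with properties stored as JSON text):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (
        vertex_id INTEGER PRIMARY KEY,
        properties TEXT            -- JSON-encoded key-value pairs
    );
    CREATE TABLE edges (
        edge_id INTEGER PRIMARY KEY,
        tail_vertex INTEGER REFERENCES vertices(vertex_id),  -- edge starts here
        head_vertex INTEGER REFERENCES vertices(vertex_id),  -- edge ends here
        label TEXT,
        properties TEXT
    );
    -- Indexes on both ends allow traversal in either direction.
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
""")

conn.execute("INSERT INTO vertices VALUES (1, ?)", (json.dumps({"name": "Lucy"}),))
conn.execute("INSERT INTO vertices VALUES (2, ?)", (json.dumps({"name": "Sydney"}),))
conn.execute("INSERT INTO edges VALUES (1, 1, 2, 'born_in', ?)", (json.dumps({}),))

# Follow outgoing 'born_in' edges from vertex 1 to the vertices they point at.
rows = conn.execute("""
    SELECT v.properties
    FROM edges AS e JOIN vertices AS v ON v.vertex_id = e.head_vertex
    WHERE e.tail_vertex = 1 AND e.label = 'born_in'
""").fetchall()
birthplaces = [json.loads(row[0])["name"] for row in rows]
```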
20
Q

What are some important aspects of the property graph model and what advantages does each provide?

A
  1. No schema restricts which vertices can be connected to one another, enabling easy evolution of the model
  2. Indexes on the head and tail vertex foreign keys make graph traversal efficient in both directions
  3. Edges can express arbitrary relationship types through their labels, enabling a mix of different kinds of information in a single graph
21
Q

What is a real example of a graph database and a query language for that graph database?

A
  1. Neo4j is one graph database
  2. Cypher is an example of a query language for that database

22
Q

What is a key feature of the Cypher query language that allows it to efficiently query items in a graph database?

A

Cypher’s expressive graph traversal patterns allow queries that follow relationships to be written succinctly, e.g.

MATCH
(person) -[:BORN_IN]-> () -[:Within*0..]-> (au:Location {name:'Australia'})
RETURN person.name

The variable-length pattern (:Within*0..) follows zero or more Within edges, so the query matches people born in any location that ultimately leads to the Australia vertex, however many levels lie in between.

23
Q

What is one way that SQL can express graph-style queries and what is a major disadvantage of it?

A
  1. SQL has recursive common table expressions, which can be used to represent an arbitrary number of joins in a query
  2. Its main disadvantage is that it is rather clumsy to use in comparison with a native graph query language like Cypher
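
Example (a minimal sketch using sqlite3, which supports WITH RECURSIVE; the tables and sample data are hypothetical): finding people born anywhere within Australia, following the containment hierarchy to any depth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT, within_id INTEGER);
    CREATE TABLE people (name TEXT, born_in_id INTEGER);
    INSERT INTO locations VALUES
        (1, 'Australia', NULL),
        (2, 'New South Wales', 1),
        (3, 'Sydney', 2),
        (4, 'France', NULL),
        (5, 'Paris', 4);
    INSERT INTO people VALUES ('Alice', 3), ('Bob', 5), ('Carol', 1);
""")

rows = conn.execute("""
    WITH RECURSIVE in_australia(id) AS (
        -- Base case: the Australia row itself.
        SELECT id FROM locations WHERE name = 'Australia'
        UNION
        -- Recursive step: any location directly within one already found.
        SELECT locations.id
        FROM locations JOIN in_australia ON locations.within_id = in_australia.id
    )
    SELECT people.name
    FROM people JOIN in_australia ON people.born_in_id = in_australia.id
    ORDER BY people.name
""").fetchall()
names = [row[0] for row in rows]
```

The query takes over a dozen lines of SQL to express what a single variable-length path pattern does in a graph query language.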
24
Q

What is a triple-store model?

A

A triple-store model is similar to a property graph model, and each graph entry is stored as the following triple: (subject, predicate, object), where

  1. The subject is a vertex in the graph
  2. The predicate is a graph edge if the object is also a vertex, otherwise it defines a property
  3. The object is either a vertex (if the triple describes a relationship) or a literal value (if it describes a property of the subject)
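
Example (a minimal sketch in plain Python, not a real triple-store API; the data and the helper match are hypothetical, and vertices are distinguished from literals only by convention here):

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("lucy", "name", "Lucy"),          # object is a literal: a property
    ("lucy", "born_in", "sydney"),     # object is a vertex: an edge
    ("sydney", "name", "Sydney"),
    ("sydney", "within", "australia"),
    ("australia", "name", "Australia"),
]

def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Follow the born_in edge from lucy, then read the name property of the target.
birthplace = match(subject="lucy", predicate="born_in")[0][2]
birthplace_name = match(subject=birthplace, predicate="name")[0][2]
```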
25
Q

What is SPARQL and when would one use it?

A

SPARQL is a query language designed for triple-stores that use the Resource Description Framework (RDF) data model. It is a concise choice for querying graph data stored in this form, as it matches patterns of triples in much the same way that Cypher matches patterns of vertices and edges.