Software Architecture Flashcards
Kleppman: Designing Data-Intensive Applications
What are main driving forces behind using a NoSQL database?
- Better scalability than relational databases for cases with large datasets or high throughput requirements
- Open source revolution
- Query operations that relational models don’t do well
- Desire for a more expressive data model than a rigid schema imposed by relational databases
What is a key problem with the relational data model for applications that are commonly written today, and what is a way to mitigate this problem?
Most business applications are written in object-oriented languages, which require a translation layer between the inherit representation of an object and how it is represented in tables, rows and columns. ORM frameworks partially abstract away this mismatch between models.
What are some of the inherent advantages to a JSON data model and when are these advantages realised?
- Nested structures are better-suited to database entries that are largely self-contained, like a profile or a blog post with associated comments in a one-to-many tree structure
- Closer relationship between the object model from OOP and the JSON model for representing the data
- Better locality of related data, as all related data is contained within the structure, rather than requiring joins’
- No schema, so the model is more flexible if the exact fields in an entry cannot be rigidly defined
What is the primary idea behind normalisation in databases?
Removing duplication of meaningful data in databases that will be shared across multiple records in the database by instead using an ID foreign key mapping to a table with standardised values.
What are the main benefits of normalising a database?
- Updating information needs to be done in only one place and will be replicated consistently across all records, such as when a country name changes
- Ensures consistency for values that exist within a logical set, e.g. countries and cities
- Easier and more semantically significant search, e.g. determining nearby users by storing metadata about their profile location
- Easier localisation by having human-readable labels translated into multiple different languages, but all associated with the same primary key
What are the major drawback of a document-based model?
- If there is a many-to-one where many records reference one common record, meaning it would need to be stored as an ID in the document to avoid duplication and multiple queries may be required to retrieve the related data
- Even though a model may not have originally required many-to-one or many-to-many relationships or joins, it may evolve over time into a more interconnected structure that does and document databases have limited support for joins
- Denormalising data or replicating joins in code can lead to worse maintainability, reliability and performance for the data model
What are the main advantages of a relational database model over a document database model?
- Better support for joins
- Better support for many-to-one relationships
- Better support for many-to-many relationships
Why would one denormalise a database?
If using a non-relational data model (such as a document database), support for joins is typically more limited, so denormalising data (by duplicating common data across records), reduces the need for replicating joins in application code.
What is the distinction between schema-on-read and schema-on-write models in a database?
For a schema-on-read model, no explicit schema is enforced on data added to the database (as in a JSON document model), and the structure of a record is only known once it is read and interpreted by application code. By contrast, schema-on-write databases (like conventional relational databases), enforce a structure on each record as it’s inserted in the database. This guarantees that any record read from the database will match certain constraints.
Under which circumstances does the difference between a schema-on-read and a schema-on-write database become most apparent and what is the distinction on a high level?
- The distinction arises when updating the structure of records in that model, such as fields in a document or columns in a particular table
- In a schema-on-read model, application code handles cases where certain records have the old structure and new records, after the change, have a different structure
- On a schema-on-write model, a migration must be done on existing records so that they conform with the updated schema, but no special casing in application code is subsequently required
Under which situations would one prefer a data model without an explicit schema, as opposed to an explicit schema-on-write model?
- When there are many different types of objects (in terms of the names and nature of their fields) or they change so frequently that it would be impractical to maintain them as separate tables and perform regular migrations
- When the data being stored in the database comes from an external source and there are no guarantees on the structure of data from that source
When is the data locality provided by a document database model disadvantageous?
- As the document is stored in a serialised format, the whole document needs to be read even though only one field, or a relatively subset of the fields, are needed
- When updating a field in a record, the whole record needs to be rewritten. This especially becomes a problem if the record increases in size as a result of writes and limits how writes can be done in-place
What is a good compromise between the choices of using a relational database and a document database in an application that doesn’t perfectly fit either model?
- One option would be to use support in modern relational databases for document column types, enabling a relational structure with document querying aspects
- The logical alternative is to use a document database with built-in support for joins or does joins on document references implicitly to simplify application code
What are the major differences between a declarative query language like SQL and an imperative query model?
- Declarative languages are more limited in expressive power, but are more concise than imperative queries to write
- Declarative languages make no inherent assumptions about the ordering of data records, while there is no guarantee that an imperative query does not assume records are ordered
- Declarative languages make no assumptions about how a query is processed by the database, enabling the database to optimise queries more transparently
- Imperative code is harder to parallelise, whereas databases can freely use distributed or parallel implementations of a query language
What is MapReduce and when would it be useful?
- MapReduce is a programming model for read-only queries over a large set of documents distributed over many machines
- It is useful in scenarios where data needs to be aggregated en masse