Data Science at Scale Flashcards

Question 1

Q

How have companies changed in the big data era and what has enabled this?

Answer

A

Companies have become more social, customer-orientated and dynamic. They collect data, learn from it, and improve and adapt in response. This has been enabled by cheaper storage and processing, faster networks, and free open-source tools.

Question 2

Q

What technological shift enabled widespread data analytics?

Answer

A

Cloud-based infrastructure (e.g., AWS, GCP) and Infrastructure as a Service solutions from internet giants like Google, Amazon, and Microsoft.

Question 3

Q

How has big data transformed marketing?

Answer

A

Customer profiling, targeted ads, and personalised communication and recommendations.

Question 4

Q

What are the 3 V’s of big data?

Answer

A

Volume, Velocity and Variety

Question 5

Q

What are the four fundamental functionalities that data-intensive applications are built from?

Answer

A

Database - Store data so it can be retrieved later.
Caching - Store the results of expensive operations to be used again soon.
Indexing - Allow users to efficiently search the data.
Batch Processing - Periodically run specific routines on large amounts of accumulated data.
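
As a rough illustration of the caching idea above, here is a minimal Python sketch using the standard library's `functools.lru_cache`. The function `expensive_square` and the `calls` list are hypothetical names chosen for this example; a real application would cache something genuinely costly, such as a database query.

```python
from functools import lru_cache

calls = []  # records every time the underlying function really runs


@lru_cache(maxsize=128)
def expensive_square(n):
    # Stand-in for an expensive operation (assumption for illustration).
    calls.append(n)
    return n * n


first = expensive_square(12)   # computed and stored in the cache
second = expensive_square(12)  # served from the cache, function body not run
```

After both calls, `calls` holds a single entry, because the second result came from the cache.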

Question 6

Q

What are the three important things that a data-intensive application needs to be?

Answer

A

Reliable, Scalable, Maintainable

Question 7

Q

What does it mean for a system to be reliable?

Answer

A

It performs the function the user expected. It can tolerate the user making mistakes. Its performance is good enough for the required use case, under the expected load and volume. It prevents any unauthorised access.

Question 8

Q

What are faults and failures?

Answer

A

A fault is when a component (hardware/software) of the system works in an unexpected way, and a failure is when the entire system stops providing the service.

Question 9

Q

What are Hardware Faults and what measures can be taken to stop them?

Answer

A

Usually the failure of an HDD, memory module or PSU. In large data centres this is common. We can use hardware measures such as RAID for HDDs, redundant PSUs, and hot-swappable CPUs.

Question 10

Q

What are Software Faults?

Answer

A

When the software behaves incorrectly, for example because of a bug. These faults are harder to anticipate than hardware faults, and can be present on many nodes of a system at once, causing widespread failures.

Question 11

Q

What is Scalability?

Answer

A

A system’s ability to cope with increased load

Question 12

Q

What is Load?

Answer

A

A measure of the amount of use of a system, for example: requests per second, number of players, read/write ratio.

Question 13

Q

What is performance?

Answer

A

How well the system is responding to the load, for example: response time, or time taken to process a dataset. The average and distribution are both important.
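
To see why the average alone can mislead, here is a small Python sketch that reports the mean alongside percentiles. Percentiles are one common way to describe the distribution; the nearest-rank method, the function name `summarise`, and the sample numbers are all assumptions made for this example.

```python
import statistics


def summarise(response_times_ms):
    """Return the mean plus the 50th and 99th percentiles (nearest-rank method)."""
    ordered = sorted(response_times_ms)

    def percentile(p):
        # Nearest rank: the smallest value with at least p% of samples at or below it.
        rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p*n/100
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }


# Hypothetical sample: 95 fast requests and 5 very slow ones.
sample = [10] * 95 + [1000] * 5
summary = summarise(sample)
```

Here the typical (median) request is fast, the mean is pulled up by the slow requests, and the 99th percentile exposes the slow tail.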

Question 14

Q

What is Vertical Scaling?

Answer

A

Upgrading the specs of the existing machine. This does not scale linearly and offers limited fault tolerance.

Question 15

Q

What is Horizontal Scaling?

Answer

A

Increasing the number of machines in the system, which scales better and gives better fault tolerance.

Question 16

Q

What is Maintainability?

Answer

A

The overall cost of keeping a system operational and up to date.

Question 17

Q

What is Operability?

Answer

A

How easy it is for the operations team to keep the system running. It includes good monitoring, automation, and predictable behaviour.

Question 18

Q

What is Simplicity?

Answer

A

How easy it is for new people working on the system to understand it. The aim is to remove accidental complexity without reducing functionality.

Question 19

Q

What is Evolvability?

Answer

A

How easy it is to make changes and update the system, which is closely linked with simplicity.

Question 20

Q

What is a data model in the context of data storage?

Answer

A

A structure that maps real-world entities (e.g., objects in code) to how they are stored (e.g., tables, JSON).

Question 21

Q

What is an ORM (Object-Relational Mapping)?

Answer

A

A system that maps classes in code to a relational database schema.

Question 22

Q

In the relational model, how are one-to-many relationships handled?

Answer

A

By using separate tables with foreign keys pointing to the parent table.
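
A minimal sketch of this pattern, using Python's built-in `sqlite3` module with an in-memory database. The `users` and `positions` tables and their columns are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# Parent table: one row per user.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Child table: many rows can point at the same user through user_id.
conn.execute(
    """CREATE TABLE positions (
           id INTEGER PRIMARY KEY,
           user_id INTEGER NOT NULL REFERENCES users(id),
           title TEXT NOT NULL
       )"""
)

conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO positions (user_id, title) VALUES (?, ?)",
    [(1, "Engineer"), (1, "Manager")],
)

# Join the child rows back to their parent.
rows = conn.execute(
    "SELECT u.name, p.title FROM users u "
    "JOIN positions p ON p.user_id = u.id ORDER BY p.id"
).fetchall()
```

With foreign keys enabled, inserting a position whose `user_id` has no matching user raises `sqlite3.IntegrityError`.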

Question 23

Q

What structured field types can be used within relational DBs for complex fields?

Answer

A

XML or JSON fields within a table.

Question 24

Q

What is a document model in NoSQL?

Answer

A

A system where data is stored in documents (e.g., JSON, XML) representing semi-structured data.

Question 25

Q

Why are document models considered more flexible than relational models?

Answer

A

They don’t require a rigid schema; each document can be different.

Question 26

Q

What’s a key weakness of document models?

Answer

A

Difficulty handling many-to-many relationships efficiently.

Question 27

Q

What are graph models used for?

Answer

A

Representing complex many-to-many relationships using nodes and edges.

Question 28

Q

In graph models, what are nodes and edges?

Answer

A

Nodes are entities or objects; edges are the relationships between them.

Question 29

Q

What is a NoSQL database?

Answer

A

A database that doesn’t use traditional relational models; prioritizes scalability and availability.

Question 30

Q

What is the CAP Theorem?

Answer

A

A distributed system can guarantee at most two of these three properties: Consistency, Availability, and Partition Tolerance.

Question 31

Q

What does consistency mean in CAP?

Answer

A

Every read returns the most recent write or an error.

Question 32

Q

What does availability mean in CAP?

Answer

A

Every request gets a response, even if it’s not the most up-to-date.

Question 33

Q

What does partition tolerance mean in CAP?

Answer

A

The system continues functioning despite network partitions.

Question 34

Q

What are the four types of NoSQL databases?

Answer

A

Document, Key-value, Wide-column, and Graph databases.

Question 35

Q

What is a key-value store?

Answer

A

A NoSQL database that stores data as key-value pairs, like a dictionary.

Question 36

Q

What is a wide-column store?

Answer

A

A database model where data is stored in rows and columns, but columns can vary between rows.

Question 37

Q

How is data organized in wide-column stores?

Answer

A

By column families instead of rows.

Question 38

Q

What is a property graph in a graph database?

Answer

A

A graph where nodes and edges can have associated properties.
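
One simple way to picture a property graph is with plain Python data structures. This is only an illustrative sketch, not how a real graph database stores data; all node ids, labels, and properties below are made up.

```python
# Nodes: id -> properties.
nodes = {
    "alice": {"type": "Person", "age": 30},
    "bob": {"type": "Person", "age": 25},
    "london": {"type": "City", "country": "UK"},
}

# Edges: (from_node, label, to_node, properties).
edges = [
    ("alice", "KNOWS", "bob", {"since": 2019}),
    ("alice", "LIVES_IN", "london", {}),
    ("bob", "LIVES_IN", "london", {}),
]


def neighbours(node_id, label):
    """Ids of nodes reached from node_id by following edges with the given label."""
    return [dst for src, lbl, dst, _props in edges if src == node_id and lbl == label]


def residents(city_id):
    """Ids of nodes that have a LIVES_IN edge pointing at city_id."""
    return [src for src, lbl, dst, _props in edges if lbl == "LIVES_IN" and dst == city_id]
```

For example, `neighbours("alice", "KNOWS")` follows outgoing edges, while `residents("london")` follows edges in the reverse direction.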

Question 39

Q

What is the main advantage of using a graph store over relational or document models?

Answer

A

Better handling of complex and interconnected data.

Question 40

Q

What does schema-on-read mean in document databases?

Answer

A

The structure of data is defined when it’s read by the application, not enforced when written.
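
A small Python sketch of the idea: documents with different shapes are stored as-is, and the reading code decides how to interpret them. The field names (`name`, `first_name`, `last_name`) and the helper `read_first_name` are assumptions made for this example.

```python
import json

# Two stored documents with different shapes; nothing checked them on write.
raw_documents = [
    '{"name": "Ada Lovelace"}',
    '{"first_name": "Alan", "last_name": "Turing"}',
]


def read_first_name(raw):
    """Interpret the document's structure at read time."""
    doc = json.loads(raw)
    if "first_name" in doc:
        return doc["first_name"]
    # Older-style documents keep the full name in a single field.
    return doc["name"].split(" ")[0]


first_names = [read_first_name(raw) for raw in raw_documents]
```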

Question 41

Q

Why is schema-on-read useful?

Answer

A

It’s good for heterogeneous data or when you can’t control the data structure, like tweets or logs.

Question 42

Q

How does schema flexibility in document DBs compare to relational DBs?

Answer

A

Document DBs allow format changes without changing the schema; relational DBs often need downtime and schema updates.

Question 43

Q

What is “locality” in document databases?

Answer

A

It means storing a whole document as a continuous string (like JSON), keeping related data together.

Question 44

Q

What’s a drawback of locality in document databases?

Answer

A

Even if you need only part of the document, the DB loads the whole thing, which can be inefficient.

Question 45

Q

What are the two main jobs of a database?

Answer

A

Store data and retrieve data.

Question 46

Q

What is a transactional workload?

Answer

A

A write-heavy workload (e.g., bank transactions).

Question 47

Q

What is an analytics workload?

Answer

A

A read-heavy workload (e.g., dashboards, reports).

Question 48

Q

What is a storage engine?

Answer

A

The part of the database that handles how data is written to and read from disk.

Question 49

Q

What is a log-structured database?

Answer

A

A database that appends all writes to a log file.

Question 50

Q

Why are appends fast in a log file?

Answer

A

Because they avoid random disk access and just add to the end of the file.

Question 51

Q

What’s the downside of using a plain log for reads?

Answer

A

You have to scan the whole file (O(n) time complexity).

Question 52

Q

What is a hash index?

Answer

A

A map of keys to byte offsets in a log file.
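
The following Python sketch shows the idea under simplifying assumptions: an in-memory `io.BytesIO` buffer stands in for the append-only file on disk, records are `key,value` lines, and keys contain no commas. The class name `HashIndexedLog` is hypothetical.

```python
import io


class HashIndexedLog:
    def __init__(self):
        self._log = io.BytesIO()  # stand-in for an append-only log file
        self._index = {}          # key -> byte offset of that key's latest record

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        offset = self._log.seek(0, io.SEEK_END)  # always append at the end
        self._log.write(record)
        self._index[key] = offset                # point the key at the new record

    def get(self, key):
        offset = self._index.get(key)
        if offset is None:
            return None
        self._log.seek(offset)                   # jump straight to the record
        line = self._log.readline().decode("utf-8").rstrip("\n")
        _key, value = line.split(",", 1)
        return value


db = HashIndexedLog()
db.set("a", "1")
db.set("b", "2")
db.set("a", "3")  # the old record for "a" stays in the log but is no longer indexed
```

A lookup costs one dictionary access plus one seek, instead of a scan of the whole log.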

Question 53

Q

What is the benefit of a hash index?

Answer

A

Fast lookups for keys.

Question 54

Q

What is a limitation of a hash index?

Answer

A

It doesn’t support efficient range queries, and the whole index must fit in memory.

Question 55

Q

Why are log files split into segments?

Answer

A

To prevent a single log file from growing too large.

Question 56

Q

What is compaction?

Answer

A

A process that merges segments and removes duplicates or deleted records.
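
A minimal Python sketch of compaction, assuming each segment is a list of `(key, value)` records in write order and that a deletion is recorded as a special tombstone value. The `TOMBSTONE` marker and the function name `compact` are assumptions for illustration.

```python
TOMBSTONE = None  # assumed marker meaning "this key was deleted"


def compact(segments):
    """Merge segments (oldest first), keeping only the latest value per key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # later records overwrite earlier ones
    # Drop keys whose most recent record is a deletion.
    return [(key, value) for key, value in latest.items() if value is not TOMBSTONE]


old_segment = [("a", 1), ("b", 2), ("a", 3)]
new_segment = [("b", TOMBSTONE), ("c", 4)]
merged = compact([old_segment, new_segment])
```

Five records shrink to two: the stale value of `a` and the deleted key `b` are gone.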

Question 57

Q

What does SSTable stand for?

Answer

A

Sorted String Table.

Question 58

Q

What is the key property of SSTables?

Answer

A

They store keys in sorted order and only once per segment.

Question 59

Q

Why are SSTables efficient for reads?

Answer

A

Because sorted keys allow binary search and a smaller (sparse) index.

Question 60

Q

Do SSTables still need an index?

Answer

A

Yes, but only for some keys (sparse index).
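
A sketch of a sparse index lookup in Python, assuming the SSTable is a sorted list of `(key, value)` pairs held in memory and that every third key is indexed. The data, the `SPARSE_EVERY` constant, and the `lookup` function are made up for this example.

```python
import bisect

# A sorted segment (stand-in for an SSTable file on disk).
sstable = [
    ("apple", 1), ("banana", 2), ("cherry", 3),
    ("date", 4), ("fig", 5), ("grape", 6),
]

SPARSE_EVERY = 3  # index only every third key
sparse_positions = list(range(0, len(sstable), SPARSE_EVERY))
sparse_keys = [sstable[pos][0] for pos in sparse_positions]


def lookup(key):
    # Find the last indexed key that is <= the key we want.
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None  # key sorts before everything in the segment
    start = sparse_positions[i]
    # Scan the short run between this indexed key and the next one.
    for k, v in sstable[start:start + SPARSE_EVERY]:
        if k == key:
            return v
        if k > key:
            break  # sorted order: the key cannot appear later
    return None
```

Only two of the six keys are kept in the index, yet any lookup scans at most three records.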

Question 61

Q

How is data written to an SSTable?

Answer

A

Data is first written to a memtable (e.g., an AVL tree) in memory. When full, the memtable is flushed to disk as a new SSTable file.

Question 62

Q

How does reading from an SSTable work?

Answer

A

First, check the memtable; if not found, search the newest SSTable, then older ones.
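
The write and read paths described in the last two cards can be sketched together in Python. This is a toy model under stated assumptions: a plain `dict` sorted at flush time stands in for a balanced tree such as an AVL tree, SSTables are in-memory sorted lists instead of files, and the class name `TinyLSM` and the `memtable_limit` parameter are hypothetical.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}    # in-memory writes (stand-in for a balanced tree)
        self.sstables = []    # flushed segments, oldest first, each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out in sorted key order as a new SSTable.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # 1. Check the memtable first: it holds the most recent writes.
        if key in self.memtable:
            return self.memtable[key]
        # 2. Then search SSTables from newest to oldest.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))  # binary search on sorted keys
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)  # memtable is full: flushed as a sorted SSTable
db.put("a", 3)  # newer value for "a" sits in the memtable
```

Because newer data is always checked first, `db.get("a")` returns the latest value even though an older one still exists in a flushed segment.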