Data Science at Scale Flashcards

1
Q

How have companies changed in the big data era and what has enabled this?

A

They have become more social, customer-orientated and dynamic by collecting data, learning from it, and improving and adapting in response. This was enabled by cheaper storage and processing, faster networks, and free open-source tools.

2
Q

What technological shift enabled widespread data analytics?

A

Cloud-based infrastructure (e.g., AWS, GCP) and Infrastructure as a Service solutions from internet giants like Google, Amazon, and Microsoft.

3
Q

How has big data transformed marketing?

A

Customer profiling, targeted ads, and personalised communication and recommendations.

4
Q

What are the 3 V’s of big data?

A

Volume, Velocity and Variety

5
Q

What are the four fundamental functionalities that Data-intensive applications are built from?

A

Database - Store data so it can be retrieved later.
Caching - Store the results of expensive operations to be used again soon.
Indexing - Allow users to efficiently search the data.
Batch Processing - Periodically run specific routines on large amounts of accumulated data.

6
Q

What are the three important things that a Data-intensive application needs to be?

A

Reliable, Scalable, Maintainable

7
Q

What does it mean for a system to be reliable?

A

It performs the function the user expected. It can tolerate the user making mistakes. Its performance is good enough for the required use case, under the expected load and volume. It prevents any unauthorised access.

8
Q

What are faults and failures?

A

A fault is when a component (hardware/software) of the system works in an unexpected way, and a failure is when the entire system stops providing the service.

8
Q

What are Hardware Faults and what measures can be taken to stop them?

A

Usually when an HDD, memory module or PSU stops working. In large data centres this is common. We can use hardware measures such as RAID for HDDs, redundant PSUs, and hot-swappable CPUs.

9
Q

What are Software Faults?

A

When the software stops working. These are harder to anticipate, and can be present on many nodes of a system causing widespread failures.

10
Q

What is Scalability?

A

A system’s ability to cope with increased load

11
Q

What is Load?

A

A measure of the amount of use of a system, for example: requests per second, number of players, read/write ratio.

12
Q

What is performance?

A

How well the system is responding to the load, for example: response time, or time taken to process a dataset. The average and distribution are both important.
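
As a rough illustration of why the distribution matters as well as the average, the sketch below compares the mean, the median and a high percentile. The response times are made-up numbers and `percentile` is a hypothetical helper using the nearest-rank method; neither comes from the original card.

```python
import math
import statistics

# Hypothetical response times in milliseconds; one request is very slow.
response_times_ms = [12, 15, 14, 13, 16, 15, 14, 13, 250, 15]


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


mean_ms = statistics.mean(response_times_ms)      # pulled upwards by the outlier
median_ms = statistics.median(response_times_ms)  # what a typical request sees
p95_ms = percentile(response_times_ms, 95)        # what the slowest requests see

print(f"mean={mean_ms} median={median_ms} p95={p95_ms}")
```

The mean sits far above the median here because a single slow request dominates it, which is why percentiles are usually reported alongside the average.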

13
Q

What is Vertical Scaling?

A

Upgrading the specs of the current machine. This does not scale linearly and offers limited fault tolerance.

14
Q

What is Horizontal Scaling?

A

Increasing the number of machines in the system, which scales better and has better fault tolerance.

15
Q

What is Maintainability?

A

The overall cost of keeping a system operational and up to date.

16
Q

What is Operability?

A

How easy it is for the operations team to keep it running. Includes good monitoring, automation, and predictable behaviour.

17
Q

What is Simplicity?

A

How easy it is for new people working on the system to understand it. The aim is to remove accidental complexity without reducing functionality.

18
Q

What is Evolvability?

A

How easy it is to make changes and update the system, which is closely linked with simplicity.

19
Q

What is a data model in the context of data storage?

A

A structure that maps real-world entities (e.g., objects in code) to how they are stored (e.g., tables, JSON).

20
Q

What is an ORM (Object-Relational Mapping)?

A

A system that maps classes in code to a relational database schema.

21
Q

In the relational model, how are one-to-many relationships handled?

A

By using separate tables with foreign keys pointing to the parent table.
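
A minimal sketch of this using Python's built-in `sqlite3` module. The `users` and `positions` tables and their columns are hypothetical names chosen for illustration: one user can hold many positions, and each position row points back to its parent user through a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE positions (
           id INTEGER PRIMARY KEY,
           user_id INTEGER NOT NULL REFERENCES users(id),
           title TEXT NOT NULL
       )"""
)

# One parent row, many child rows that reference it.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO positions (user_id, title) VALUES (?, ?)",
    [(1, "Analyst"), (1, "Engineer")],
)

# A join follows the foreign key to reassemble the one-to-many relationship.
rows = conn.execute(
    "SELECT u.name, p.title FROM users u "
    "JOIN positions p ON p.user_id = u.id ORDER BY p.id"
).fetchall()
print(rows)
```

With enforcement enabled, inserting a position whose `user_id` has no matching user raises `sqlite3.IntegrityError`.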

22
Q

What structured field types can be used within relational DBs for complex fields?

A

XML or JSON fields within a table.

23
Q

What is a document model in NoSQL?

A

A system where data is stored in documents (e.g., JSON, XML) representing semi-structured data.

24
Q

Why are document models considered more flexible than relational models?

A

They don’t require a rigid schema; each document can be different.

25
Q

What’s a key weakness of document models?

A

Difficulty handling many-to-many relationships efficiently.

26
Q

What are graph models used for?

A

Representing complex many-to-many relationships using nodes and edges.

27
Q

In graph models, what are nodes and edges?

A

Nodes are entities or objects; edges are the relationships between them.

28
Q

What is a NoSQL database?

A

A database that doesn’t use the traditional relational model and typically prioritizes scalability and availability.

29
Q

What is the CAP Theorem?

A

In a distributed system, you can guarantee at most two of the three properties: Consistency, Availability, and Partition Tolerance.

30
Q

What does consistency mean in CAP?

A

Every read returns the most recent write or an error.

31
Q

What does availability mean in CAP?

A

Every request gets a response, even if it’s not the most up-to-date.

32
Q

What does partition tolerance mean in CAP?

A

The system continues functioning despite network partitions.

33
Q

What are the four types of NoSQL databases?

A

Document, Key-value, Wide-column, and Graph databases.

34
Q

What is a key-value store?

A

A NoSQL database that stores data as key-value pairs, like a dictionary.

35
Q

What is a wide-column store?

A

A database model where data is stored in rows and columns, but columns can vary between rows.

36
Q

How is data organized in wide-column stores?

A

By column families instead of rows.
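
One way to picture this is as nested maps: row key, then column family, then column. The sketch below uses plain dictionaries with made-up row keys and family names; it models only the logical layout, not how any particular wide-column database stores data on disk.

```python
# row key -> column family -> column -> value (all names are hypothetical)
rows = {
    "user:1": {
        "profile": {"name": "Alice", "email": "alice@example.com"},
        "activity": {"last_login": "2024-01-01"},
    },
    # This row has fewer columns and no "activity" family at all.
    "user:2": {
        "profile": {"name": "Bob"},
    },
}


def read_family(row_key, family):
    """Return the columns of one family for one row (empty if absent)."""
    return rows.get(row_key, {}).get(family, {})


print(read_family("user:1", "profile"))
print(read_family("user:2", "activity"))
```

Columns vary from row to row, and a read touches only the family it asks for.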

37
Q

What is a property graph in a graph database?

A

A graph where nodes and edges can have associated properties.
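
A small sketch of a property graph held in plain Python structures. The node identifiers, labels and edge types are invented for illustration, and real graph databases expose query languages rather than helper functions like these.

```python
# Nodes are entities; each carries its own properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "london": {"label": "City"},
}

# Edges are (source, destination, properties); the edge itself has properties too.
edges = [
    ("alice", "bob", {"type": "KNOWS", "since": 2020}),
    ("alice", "london", {"type": "LIVES_IN"}),
    ("bob", "london", {"type": "LIVES_IN"}),
]


def neighbours(node_id, edge_type):
    """Follow outgoing edges of a given type from one node."""
    return [dst for src, dst, props in edges
            if src == node_id and props["type"] == edge_type]


def residents(city_id):
    """Follow LIVES_IN edges backwards to find who lives in a city."""
    return sorted(src for src, dst, props in edges
                  if dst == city_id and props["type"] == "LIVES_IN")


print(neighbours("alice", "KNOWS"))
print(residents("london"))
```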

38
Q

What is the main advantage of using a graph store over relational or document models?

A

Better handling of complex and interconnected data.

39
Q

What does schema-on-read mean in document databases?

A

The structure of data is defined when it’s read by the application, not enforced when written.
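
The sketch below shows the idea with three stored documents whose shapes differ. The field names (`user`, `username`, `text`, `message`, `lang`) are hypothetical; the point is that the reading code, not the store, decides how each document is interpreted.

```python
import json

# Documents were written without any enforced structure.
raw_documents = [
    '{"user": "alice", "text": "hello", "lang": "en"}',
    '{"user": "bob", "message": "hi there"}',
    '{"username": "carol", "text": "hey"}',
]


def read_post(raw):
    """Apply the application's schema at read time, tolerating older shapes."""
    doc = json.loads(raw)
    return {
        "user": doc.get("user") or doc.get("username"),
        "text": doc.get("text") or doc.get("message"),
        "lang": doc.get("lang", "unknown"),
    }


posts = [read_post(raw) for raw in raw_documents]
for post in posts:
    print(post)
```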

40
Q

Why is schema-on-read useful?

A

It’s good for heterogeneous data or when you can’t control the data structure, like tweets or logs.

41
Q

How does schema flexibility in document DBs compare to relational DBs?

A

Document DBs allow format changes without changing the schema; relational DBs often need downtime and schema updates.

42
Q

What is “locality” in document databases?

A

It means storing a whole document as a continuous string (like JSON), keeping related data together.

43
Q

What’s a drawback of locality in document databases?

A

Even if you need only part of the document, the DB loads the whole thing, which can be inefficient.

44
Q

What are the two main jobs of a database?

A

Store data and retrieve data.

45
Q

What is a transactional workload?

A

A write-heavy workload (e.g., bank transactions).

46
Q

What is an analytics workload?

A

A read-heavy workload (e.g., dashboards, reports).

47
Q

What is a storage engine?

A

The part of the database that handles how data is written to and read from disk.

48
Q

What is a log-structured database?

A

A database that appends all writes to a log file.

49
Q

Why are appends fast in a log file?

A

Because they avoid random disk access and just add to the end of the file.

50
Q

What’s the downside of using a plain log for reads?

A

You have to scan the whole file (O(n) time complexity).

51
Q

What is a hash index?

A

A map of keys to byte offsets in a log file.
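
A minimal sketch tying together the append-only log, the O(n) scan and the hash index. `LogStore` is a hypothetical class; an in-memory `io.BytesIO` buffer stands in for a file on disk, and keys are assumed to contain no commas or newlines.

```python
import io


class LogStore:
    def __init__(self):
        self.log = io.BytesIO()  # stand-in for an append-only file on disk
        self.index = {}          # hash index: key -> byte offset of latest record

    def set(self, key, value):
        # Appending only ever writes at the end of the log.
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        # Indexed read: jump straight to the record's byte offset.
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        line = self.log.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]

    def scan_get(self, key):
        # Read without an index: scan every record, the last match wins.
        result = None
        self.log.seek(0)
        for raw in self.log:
            k, v = raw.decode("utf-8").rstrip("\n").split(",", 1)
            if k == key:
                result = v
        return result


store = LogStore()
store.set("a", "1")
store.set("b", "2")
store.set("a", "3")  # the old record for "a" stays in the log
print(store.get("a"), store.scan_get("a"), store.index)
```

Both reads return the latest value, but only `scan_get` has to touch every record, and the stale record for `a` remains in the log until something cleans it up.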

52
Q

What is the benefit of a hash index?

A

Fast lookups for keys.

53
Q

What is a limitation of a hash index?

A

It doesn’t support range queries and must fit in memory.

54
Q

Why are log files split into segments?

A

To prevent a single log file from growing too large.

55
Q

What is compaction?

A

A process that merges segments and removes duplicates or deleted records.
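
A sketch of the idea, assuming each segment is a list of `(key, value)` records in write order and that deletions are recorded with a tombstone marker. `TOMBSTONE` and `compact` are hypothetical names, not part of any specific database.

```python
TOMBSTONE = None  # hypothetical marker meaning "this key was deleted"


def compact(segments):
    """Merge segments given oldest first, keeping only the latest live value per key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # later records overwrite earlier ones
    # Drop keys whose most recent record is a deletion.
    return [(key, value) for key, value in latest.items() if value is not TOMBSTONE]


old_segment = [("cat", 1), ("dog", 2), ("cat", 3)]
new_segment = [("dog", TOMBSTONE), ("emu", 5), ("cat", 4)]

compacted = compact([old_segment, new_segment])
print(compacted)
```

Six records across two segments shrink to two, since duplicates of `cat` collapse to the newest value and the deleted `dog` disappears.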

56
Q

What does SSTable stand for?

A

Sorted String Table.

57
Q

What is the key property of SSTables?

A

They store keys in sorted order and only once per segment.

58
Q

Why are SSTables efficient for reads?

A

Because the keys are sorted, which allows binary search and a smaller (sparse) index.

59
Q

Do SSTables still need an index?

A

Yes, but only for some keys (sparse index).
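
A sketch of a sparse index lookup. A sorted Python list stands in for the SSTable, and the index records every third key; the fruit data, `STRIDE` and `lookup` are illustrative assumptions. A real index would hold byte offsets into a file rather than list positions.

```python
import bisect

# A sorted list of (key, value) records stands in for one SSTable.
sstable = [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4),
           ("fig", 5), ("grape", 6), ("kiwi", 7)]

STRIDE = 3  # index only every third key
sparse_positions = list(range(0, len(sstable), STRIDE))
sparse_keys = [sstable[pos][0] for pos in sparse_positions]


def lookup(key):
    # Binary search the sparse index for the last indexed key <= the target.
    slot = bisect.bisect_right(sparse_keys, key) - 1
    if slot < 0:
        return None  # the target sorts before the first key in the table
    start = sparse_positions[slot]
    # Scan only the short run of records between two indexed keys.
    for k, v in sstable[start:start + STRIDE]:
        if k == key:
            return v
        if k > key:
            break  # keys are sorted, so the target cannot appear later
    return None


print(sparse_keys)
print(lookup("fig"), lookup("elderberry"))
```

Only three of the seven keys are held in the index, yet any lookup scans at most `STRIDE` records.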

60
Q

How is data written to an SSTable?

A

Data is first written to a memtable (e.g., an AVL tree) in memory. When full, the memtable is flushed to disk as a new SSTable file.

61
Q

How does reading from an SSTable work?

A

First, check the memtable; if not found, search the newest SSTable, then older ones.
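
The write and read paths from the last two cards can be sketched together. `TinyLSM` is a hypothetical toy: a plain dictionary sorted at flush time stands in for the balanced tree, SSTables are in-memory sorted lists rather than files, and lookups inside a table are linear for brevity.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in-memory writes (a real memtable keeps keys sorted)
        self.sstables = []   # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new sorted, immutable table.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # 1. Check the memtable first: it holds the most recent writes.
        if key in self.memtable:
            return self.memtable[key]
        # 2. Then search SSTables from newest to oldest.
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)   # memtable is full, so it is flushed as the first SSTable
db.put("a", 9)   # newer value for "a" lives in the memtable
print(db.get("a"), db.get("b"), db.sstables)
```

Searching newest first is what makes the newer value of `a` win over the older one still sitting in the first SSTable.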