Chapter 7&8 Knowledge Testers Flashcards

1
Q

What is the difference between syntax and data models?

A

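Syntax is physical: how the data is written down as characters or bytes (e.g. CSV, JSON, or XML text).
Data models are logical: the abstract structure that the syntax encodes (e.g. tables or trees).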

2
Q

How can trees be used to model denormalized data?

A

Denormalized data, as written in JSON or XML, is logically a tree: values can nest inside other values, which flat tables cannot accommodate.
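
A minimal sketch of the nesting idea, assuming a made-up order document (the field names `customer`, `items`, `sku`, and `qty` are illustrative only). The helper `depth` is hypothetical and simply measures how deep the parsed tree goes.

```python
import json

# Hypothetical denormalized order: the line items are nested inside the order
# instead of living in a separate table joined by a foreign key.
order_text = (
    '{"id": 1, "customer": {"name": "Ada"}, '
    '"items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}'
)
order = json.loads(order_text)


def depth(node):
    """Return the height of the tree rooted at node (an atomic value has height 0)."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0


tree_depth = depth(order)  # object -> array -> object -> atomic value
```

A flat table row would have height 1 under this measure; the nested order needs three levels.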

3
Q

What syntaxes relate to data models?

A

CSV maps to tables; XML and JSON map to trees.

4
Q

Why do trees modeling XML have labels on the nodes, while trees modeling JSON have labels on edges?

A

A JSON value does not know which key it is associated with, so the key labels the edge coming from the parent object. An XML element carries its own name, so the name labels the node.
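
A small sketch of the same contrast using Python's standard `xml.etree.ElementTree` and `json` modules. The sample documents are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML: the element itself carries its name (a label on the node).
root = ET.fromstring("<person><name>Ada</name></person>")
name_element = root[0]
xml_label = name_element.tag

# JSON: the value is a plain string; only the parent object knows the key
# (a label on the edge from parent to child).
person = json.loads('{"name": "Ada"}')
value = person["name"]
json_edges = list(person.items())  # [("name", "Ada")]
```

Holding only `value`, there is no way to recover `"name"`; holding only `name_element`, the tag is still available.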

5
Q

Name a few data models for XML.

A

Infoset (XML Information Set), PSVI (Post-Schema-Validation Infoset), XDM (XQuery and XPath Data Model)

6
Q

Can you sketch a tree representing an XML or JSON document?

A

For XML, the tree has a document node at the root, then element, attribute, and text nodes; for JSON, the nodes are objects, arrays, and atomic values, and the keys label the edges.

7
Q

What is the difference between an atomic type and a structured type?

A

An atomic type has values that cannot be decomposed further (e.g. string, number, boolean); a structured type combines several values, like a row in a table or a JSON object.

8
Q

What is the lexical space of an atomic type?

A

The set of literal strings that can be written to represent values of the type (e.g. "1", "1.0", and "01" for a decimal).

9
Q

What is the value space of an atomic type?

A

The set of actual values the type can take, independent of how they are written (e.g. the number one).
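
A minimal sketch of lexical space versus value space, using Python's `decimal.Decimal` as a stand-in for a decimal atomic type. The list of lexical forms is an illustrative choice.

```python
from decimal import Decimal

# Four different strings from the lexical space...
lexical_forms = ["1", "1.0", "01.00", "+1"]

# ...that all denote the same element of the value space.
values = {Decimal(s) for s in lexical_forms}
```

The set `values` collapses to a single element because equality is defined on values, not on their written form.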

10
Q

Give examples of type cardinalities and their associated symbols.

A
  • exactly once: no symbol (the default)
  • zero or one: ?
  • zero or more: *
  • one or more: +
11
Q

What is the difference between well-formedness and validity?

A

Well-formed means the document follows the rules of the syntax, so it can be parsed; valid means it additionally conforms to a given schema.
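
A rough sketch of the two checks for JSON. The `SCHEMA` dictionary and the function names are hypothetical and hand-rolled for illustration; a real project would use a schema language such as JSON Schema rather than this toy check.

```python
import json


def is_well_formed(text):
    """True if the text parses as JSON at all."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical schema: required field names mapped to expected Python types.
SCHEMA = {"name": str, "age": int}


def is_valid(text, schema=SCHEMA):
    """True if the text is well-formed AND matches the toy schema exactly."""
    if not is_well_formed(text):
        return False
    doc = json.loads(text)
    if not isinstance(doc, dict) or set(doc) != set(schema):
        return False
    # type(...) is used instead of isinstance so that true/false is not accepted as int.
    return all(type(doc[field]) is expected for field, expected in schema.items())
```

A document can be well-formed without being valid, but never the other way around.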

12
Q

Name further data modeling technologies and formats for tree-like data.

A

Parquet, Avro, Protocol Buffers

13
Q

Can JSON data be represented as a DataFrame?

A

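Yes, if the data is valid against a schema that has no open object types (no undeclared extra fields) and no heterogeneity in field types (each field has a single type across all objects).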
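
A simplified sketch of that condition, assuming the JSON data has already been parsed into a list of Python dictionaries. The function `fits_dataframe` is hypothetical, and it deliberately ignores nulls and nested values, which a real DataFrame engine handles with more nuance.

```python
def fits_dataframe(records):
    """True if every record is an object with the same fields and one type per field."""
    if not records or not all(isinstance(r, dict) for r in records):
        return False
    columns = set(records[0])
    column_types = {field: type(value) for field, value in records[0].items()}
    for record in records:
        # Extra or missing fields would correspond to an open object type.
        if set(record) != columns:
            return False
        # A field holding different types in different records is heterogeneous.
        for field, value in record.items():
            if type(value) is not column_types[field]:
                return False
    return True
```

For example, `[{"a": 1}, {"a": "1"}]` fails because field `a` is a number in one object and a string in the other.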

14
Q

Are DataFrames more efficient than JSON?

A

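Yes, generally both time-wise and space-wise: the schema is known in advance, so field names need not be repeated in every record and the data can be stored compactly.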

15
Q

Explain the map and shuffle patterns in large-scale data processing.

A
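  • Map: each part of the input is processed independently and in parallel, like splitting a pile of Pokémon among people who each count the types in their own share
  • Shuffle: intermediate results are regrouped by key so that all values for one key end up in the same place, like cutting up the dictionary of types and assigning each part to a designated person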
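
A single-process sketch of the two patterns on a word-count style example. The names `map_function`, `run_map`, and `shuffle` are hypothetical, and in a real cluster each split would be processed on a different machine.

```python
from collections import defaultdict


def map_function(line):
    """Emit one (type, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def run_map(splits):
    """Process each split independently; nothing is shared between splits."""
    return [
        [pair for line in split for pair in map_function(line)]
        for split in splits
    ]


def shuffle(map_outputs):
    """Regroup all intermediate pairs by key, across all map outputs."""
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
    return dict(groups)


# Two splits, as if handed to two different people.
splits = [["pikachu eevee", "pikachu"], ["eevee pikachu"]]
grouped = shuffle(run_map(splits))
```

After the shuffle, every count for `pikachu` sits in one list, ready to be handed to a single reduce call.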
16
Q

Describe the physical architecture of MapReduce.

A

Centralized architecture: one JobTracker coordinates the job, and TaskTrackers on the worker nodes run the tasks.

17
Q

What is the difference between a map function and a map task?

A
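  • Function: the user-defined code that performs the mapping on an input key-value pair
  • Task: a unit of work, one per input split, that calls the map function on every pair in that split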
18
Q

What is a map slot in MapReduce?

A

The resources used to run a map task: one slot is typically one CPU core with some allocated memory.

19
Q

What is a reduce function in MapReduce?

A

A function that takes a key together with all intermediate values grouped under that key and returns zero or more output key-value pairs.
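
A minimal sketch of a reduce function for counting, written as plain Python rather than against any MapReduce API. It assumes the shuffle has already grouped the values by key; `reduce_function` and the sample `grouped` data are illustrative.

```python
def reduce_function(key, values):
    """Receive one key with all of its values; yield zero or more output pairs."""
    yield (key, sum(values))


# Hypothetical shuffle output: each key with the list of its intermediate values.
grouped = {"pikachu": [1, 1, 1], "eevee": [1, 1]}

output = [
    pair
    for key in sorted(grouped)
    for pair in reduce_function(key, grouped[key])
]
```

The function is called once per key, never once per pair.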

20
Q

What is a combine function in MapReduce?

A

A function, run on the map side before the shuffle, that takes one or more intermediate key-value pairs sharing a key and returns zero, one, or more intermediate key-value pairs.

21
Q

How does combining improve MapReduce’s performance?

A

It pre-aggregates intermediate values on the map side, so less data is written, shuffled over the network, and processed by reduce.

22
Q

What assumptions are behind reusing the reduce function as a combine function?

A

The reduce function is commutative and associative, and its output pairs have the same types as the intermediate pairs it takes as input.

23
Q

How can a combine function be designed to speed up a MapReduce job?

A

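By emitting partial results that can still be merged correctly. For an average, the combiner outputs a partial sum together with a count (the weight), and reduce divides the total sum by the total count.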
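
A sketch of the average case in plain Python, assuming map emits `(value, 1)` pairs so that every intermediate value is a `(sum, count)` tuple. The functions `combine` and `reduce_average` and the sample readings are hypothetical.

```python
def combine(key, values):
    """Merge (sum, count) pairs on the map side; the output has the same shape."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return (key, (total, count))


def reduce_average(key, values):
    """Merge all (sum, count) pairs for a key and compute the final average."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return (key, total / count)


# Hypothetical map output on two nodes: each reading is emitted as (value, 1).
node_a = [(10, 1), (20, 1), (30, 1)]
node_b = [(40, 1)]

_, partial_a = combine("t", node_a)  # (60, 3)
_, partial_b = combine("t", node_b)  # (40, 1)

with_combiner = reduce_average("t", [partial_a, partial_b])
without_combiner = reduce_average("t", node_a + node_b)

# Averaging the per-node averages ignores the weights and gives a different number.
naive = (sum([10, 20, 30]) / 3 + 40 / 1) / 2
```

Here `with_combiner` and `without_combiner` agree, while `naive` does not, which is why the count has to travel with the sum.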

24
Q

Why does MapReduce perform well on a distributed file system?

A

It brings the query to the data, reducing the need to transfer data.

25
Q

What bottleneck suggests the use of MapReduce?

A

MapReduce fits when the bottleneck is the speed of reading and writing data from the disk.

26
Q

How do MapReduce splits differ from HDFS blocks?

A

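HDFS blocks are physical, fixed-size chunks of bytes that may cut a record in two. MapReduce splits are logical: each one contains only whole records and typically corresponds to about one block, so one map task can run per split in parallel.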
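
A toy model of the difference, assuming newline-terminated records and a tiny block size chosen for illustration. The functions `blocks` and `splits` are hypothetical, and the rule used here (a record belongs to the split of the block in which it starts) is a simplification, not Hadoop's exact boundary handling.

```python
def blocks(data, block_size):
    """Physical view: cut the bytes every block_size bytes, ignoring records."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def splits(data, block_size):
    """Logical view: assign each whole record to the block in which it starts."""
    result = [[] for _ in range(0, len(data), block_size)]
    offset = 0
    for record in data.splitlines(keepends=True):
        result[offset // block_size].append(record)
        offset += len(record)
    return result


data = b"alpha\nbravo\ncharlie\n"
physical = blocks(data, 8)
logical = splits(data, 8)
```

In `physical`, the record `bravo` is cut across the first two blocks; in `logical`, it stays whole in the first split.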

27
Q

What does the Java API of MapReduce look like at a high level?

A

The user defines a mapper class and a reducer class that implement the map and reduce functions.