Chapter 7&8 Knowledge Testers Flashcards by Mabel Wylie

What is the difference between syntax and data models?

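Syntax is the physical level: how data is written down as text or bytes.
A data model is the logical level: the abstract structure (table, tree) that the syntax represents.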

How can trees be used to model denormalized data?

Documents in JSON and XML can be represented logically as trees, which allow the nesting that flat tables cannot accommodate.

What syntaxes relate to data models?

CSV to tables, XML/JSON to trees

Why do trees modeling XML have labels on the nodes, while trees modeling JSON have labels on edges?

A JSON value does not know which key it is associated with, so the key labels the edge; an XML element knows its own name, so the name labels the node.

Name a few data models for XML.

Infoset, PSVI, XDM. (JDM, the JSONiq Data Model, is the counterpart for JSON.)

Can you sketch a tree representing an XML or JSON document?

For XML, it includes a document node, elements, attributes, text; for JSON, keys are on edges.
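
The two kinds of tree can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample documents, the function names `json_tree` and `xml_tree`, and the arrow notation are made up for this card, and array positions are drawn as edge labels by choice. The JSON tree puts each key on the edge leading to a value, while the XML tree puts the name on the element node itself.

```python
import json
import xml.etree.ElementTree as ET


def json_tree(value, edge="(root)", depth=0):
    """Return lines of a tree where each label sits on the edge to a value."""
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [f"{indent}--{edge}--> object"]
        for key, child in value.items():
            lines += json_tree(child, key, depth + 1)
    elif isinstance(value, list):
        lines = [f"{indent}--{edge}--> array"]
        for index, child in enumerate(value):
            lines += json_tree(child, f"[{index}]", depth + 1)
    else:
        lines = [f"{indent}--{edge}--> {json.dumps(value)}"]
    return lines


def xml_tree(element, depth=0):
    """Return lines of a tree where each element node carries its own name."""
    indent = "  " * depth
    lines = [f"{indent}element <{element.tag}>"]
    for name, val in element.attrib.items():
        lines.append(f"{indent}  attribute {name}={val!r}")
    # Simplification: only leading text is shown, not text after child elements.
    if element.text and element.text.strip():
        lines.append(f"{indent}  text {element.text.strip()!r}")
    for child in element:
        lines += xml_tree(child, depth + 1)
    return lines


json_lines = json_tree(json.loads('{"name": "Ada", "tags": ["x", "y"]}'))
xml_root = ET.fromstring('<person id="1"><name>Ada</name></person>')
xml_lines = ["document"] + xml_tree(xml_root, 1)

print("\n".join(json_lines))
print("\n".join(xml_lines))
```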

What is the difference between an atomic type and a structured type?

An atomic type has indivisible values, such as a string, number, or boolean; a structured type combines several values, such as a row in a table or a JSON object.

What is the lexical space of an atomic type?

The set of literal representations of the type's values, for example the strings "4" and "4.0" for a decimal.

What is the value space of an atomic type?

The set of actual values the type can take, independent of how each value is written.
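
A small sketch of the distinction, using Python's `decimal` module as a stand-in for a decimal type (the three sample strings are made up for illustration): several elements of the lexical space map to a single element of the value space.

```python
from decimal import Decimal

# Three different lexical forms (strings) ...
lexical_forms = ["4", "4.0", "04.00"]

# ... that all denote the same value once parsed.
values = {Decimal(form) for form in lexical_forms}

print(len(lexical_forms), "lexical forms,", len(values), "value")
```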

Give examples of type cardinalities and their associated symbols.

exactly one: no symbol (implicit)
zero or one (optional): ?
zero or more: *
one or more: +

What is the difference between well-formedness and validity?

Well-formed means the document follows the rules of the syntax, so it can be parsed; valid means it is well-formed and also conforms to a given schema.
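
The difference can be sketched with JSON and the standard library. The "schema" below is a hypothetical toy (a dict from required field names to Python types), not a real schema language; `is_well_formed`, `is_valid`, and `person_schema` are names invented for this card.

```python
import json


def is_well_formed(text):
    """Well-formed: the text parses as JSON at all."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def is_valid(text, schema):
    """Valid: well-formed AND matching the toy schema (field name -> type).

    Caveat of this toy check: Python's bool is a subclass of int,
    so a JSON true would pass an int field.
    """
    if not is_well_formed(text):
        return False
    doc = json.loads(text)
    return (
        isinstance(doc, dict)
        and doc.keys() == schema.keys()
        and all(isinstance(doc[k], t) for k, t in schema.items())
    )


person_schema = {"name": str, "age": int}

good = '{"name": "Ada", "age": 36}'
wrong_type = '{"name": "Ada", "age": "old"}'  # parses, but breaks the schema
broken = '{"name": '                           # does not even parse

for text in (good, wrong_type, broken):
    print(is_well_formed(text), is_valid(text, person_schema))
```

A document that is not well-formed can never be valid, which is why `is_valid` checks well-formedness first.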

Name further data modeling technologies and formats for tree-like data.

Parquet, Avro, Protocol Buffers

Can JSON data be represented as a DataFrame?

Yes, if the schema has no open object types and no heterogeneity in field types.
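
A rough way to picture the two conditions, assuming the JSON objects are already parsed into Python dicts. The function `fits_dataframe` is hypothetical and only looks at top-level fields; a real DataFrame engine would also handle nesting and nulls.

```python
def fits_dataframe(records):
    """Return True if every record has the same fields with the same types.

    Toy check on top-level fields only: closed objects (no extra or
    missing keys) and no heterogeneity (one type per field).
    """
    if not records:
        return True
    expected = {key: type(value) for key, value in records[0].items()}
    for record in records:
        if record.keys() != expected.keys():
            return False  # extra or missing field: the object type is open
        if any(type(record[k]) is not t for k, t in expected.items()):
            return False  # same field, different type: heterogeneous
    return True


homogeneous = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
mixed_types = [{"a": 1}, {"a": "one"}]
extra_field = [{"a": 1}, {"a": 1, "c": 2}]

print(fits_dataframe(homogeneous), fits_dataframe(mixed_types), fits_dataframe(extra_field))
```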

Are DataFrames more efficient than JSON?

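Yes, in both time and space: the schema is known up front, so the data can be stored in a compact binary form without repeating field names in every record.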

Explain the map and shuffle patterns in large-scale data processing.

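Map: every input record is processed independently, so the work can be spread across many workers (like splitting Pokémon among people so that each counts the types in their own pile).
Shuffle: the intermediate results are regrouped by key, so that everything for one key lands with one designated worker (like cutting up the dictionary and handing each part to a designated person).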
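
The two patterns can be simulated on a single machine. This is a sketch, not how a cluster runs them: `map_phase`, `shuffle_phase`, `emit_type`, and the sample cards are all hypothetical names and data chosen to echo the counting analogy.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record independently; collect all emitted pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle_phase(pairs):
    """Group intermediate values by key so each key ends up in one place."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def emit_type(card):
    # Hypothetical map function: one (type, 1) pair per card.
    return [(card["type"], 1)]


cards = [{"type": "fire"}, {"type": "water"}, {"type": "fire"}]
pairs = map_phase(cards, emit_type)
groups = shuffle_phase(pairs)

print(pairs)
print(groups)
```

Because `map_fn` never looks at more than one record, the loop in `map_phase` could run on any number of workers; only the shuffle needs to move data between them.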

Describe the physical architecture of MapReduce.

A centralized architecture: one JobTracker coordinates the job and assigns tasks to many TaskTrackers, which run them.

What is the difference between a map function and a map task?

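Map function: the user-defined logic that takes one input key-value pair and returns zero or more intermediate key-value pairs.
Map task: the processing of one input split, which calls the map function on every key-value pair in that split.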

What is a map slot in MapReduce?

The resources in which map tasks are executed: one slot is typically one CPU core with some allocated memory, and it runs one map task at a time.

What is a reduce function in MapReduce?

A function that takes one key together with all the intermediate values that share that key and returns zero or more output key-value pairs.
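
As a sketch of that shape, a reduce function receives one key plus all of its intermediate values and emits output pairs. The names `sum_reducer` and `run_reduce` and the sample data are assumptions for illustration; the driver loop stands in for what the framework does after the shuffle.

```python
def sum_reducer(key, values):
    """Reduce function: one key plus all its intermediate values in, output pairs out."""
    yield (key, sum(values))


def run_reduce(groups, reduce_fn):
    """Call the reduce function once per key, as a framework would after the shuffle."""
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical shuffled data: each key with the list of its intermediate values.
shuffled = {"fire": [1, 1], "water": [1]}
result = run_reduce(shuffled, sum_reducer)

print(result)
```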

What is a combine function in MapReduce?

A function that runs on the map side, before the shuffle: it takes one or more intermediate key-value pairs with the same key and returns zero, one, or more intermediate key-value pairs.

How does combining improve MapReduce’s performance?

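It pre-aggregates intermediate key-value pairs on the map side, so less data has to be shuffled over the network and the reduce tasks have less to process.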

What assumptions are behind reusing the reduce function as a combine function?

The reduce function is commutative and associative, and its output key-value types match its input (intermediate) types.

How can a combine function be designed to speed up a MapReduce job?

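For example, when computing an average, the combine function emits a partial result together with its weight (a partial sum and a count) instead of a bare average, so that the reduce function can merge the partial results correctly.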
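
A minimal sketch of the average case, with made-up readings and hypothetical function names `combine` and `reduce_average`. Each intermediate value is a `(sum, count)` pair; the combine step merges pairs on the map side, and the reduce step divides only once at the end. Averaging the per-mapper averages directly would ignore the weights and give a different number.

```python
def combine(key, values):
    """Map side: collapse many (sum, count) pairs for a key into one."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return [(key, (total, count))]


def reduce_average(key, values):
    """Reduce side: merge partial (sum, count) pairs, then divide once."""
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return [(key, total / count)]


# Two hypothetical mappers holding unequal amounts of data for the same key.
mapper_a = [(10, 1), (20, 1), (30, 1)]  # readings 10, 20, 30
mapper_b = [(100, 1)]                   # reading 100

partial_a = combine("temp", mapper_a)[0][1]
partial_b = combine("temp", mapper_b)[0][1]
average = reduce_average("temp", [partial_a, partial_b])[0][1]

# Averaging the two per-mapper averages (20 and 100) ignores the weights.
naive = (20 + 100) / 2

print(partial_a, partial_b, average, naive)
```

Only two pairs cross the network here instead of four, and the gap widens as each mapper sees more records per key.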

Why does MapReduce perform well on a distributed file system?

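It brings the query to the data: map tasks are scheduled, where possible, on the nodes that already store their input, which reduces the need to transfer data over the network.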

What bottleneck suggests the use of MapReduce?

Disk I/O: MapReduce is a good fit when the bottleneck is the speed of reading data from and writing data to disk, because the work is spread over many disks in parallel.

How do MapReduce splits differ from HDFS blocks?

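An HDFS block is a physical unit: a fixed-size chunk of bytes cut without regard to record boundaries. A MapReduce split is a logical unit: a sequence of whole key-value pairs processed by one map task. Splits usually line up with blocks, but a record that straddles a block boundary belongs to a single split, so a small part of it may have to be read from a neighboring block.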
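
The contrast between physical blocks and logical splits can be illustrated on a tiny byte string. This is a simplified model under stated assumptions, not Hadoop's actual input format code: the 12-byte block size, the sample data, and the rule "a line belongs to the split of the block where its first byte falls" are all chosen for illustration.

```python
def blocks(data, block_size):
    """Physical view: cut bytes at fixed offsets, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def splits(data, block_size):
    """Logical view: one list of whole lines per block.

    Simplified rule: a line belongs to the split of the block
    in which its first byte falls.
    """
    result = [[] for _ in range(0, len(data), block_size)]
    offset = 0
    for line in data.splitlines(keepends=True):
        result[offset // block_size].append(line)
        offset += len(line)
    return result


data = b"alpha,1\nbravo,2\ncharlie,3\n"

physical = blocks(data, 12)
logical = splits(data, 12)

print(physical)  # records are cut in the middle
print(logical)   # records are always whole
```

In this run the last block holds only the tail of a record, so its split is empty, while the second split has to read past the end of its own block to finish its record.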

What does the Java API of MapReduce look like at a high level?

The user defines a class extending Mapper (overriding map) and a class extending Reducer (overriding reduce), then registers both on a Job.

(27 cards)