Chapter 7&8 Knowledge Testers Flashcards
What is the difference between syntax and data models?
Syntax - physical: how the data is written down as text
Data Models - logical: the abstract structure the text represents
How can trees be used to model denormalized data?
JSON and XML documents can be viewed logically as trees, which allows nesting that flat tables cannot accommodate.
What syntaxes relate to data models?
CSV maps to tables; XML and JSON map to trees.
Why do trees modeling XML have labels on the nodes, while trees modeling JSON have labels on edges?
JSON info items do not know with which key they are associated, while XML elements know their names.
Name a few data models for XML.
Infoset, PSVI (Post-Schema-Validation Infoset), XDM (XQuery and XPath Data Model)
Can you sketch a tree representing an XML or JSON document?
For XML, it includes a document node, elements, attributes, text; for JSON, keys are on edges.
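A minimal sketch of the JSON case, listing the edges of the tree with their labels. The helper names (edges, kind), the path notation, and the choice to label array edges with positions are assumptions made for illustration only.

```python
import json

def kind(value):
    """Classify a JSON value as an object, an array, or an atomic leaf."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "atomic"

def edges(value, path="$"):
    """Yield (parent_path, edge_label, child_kind) for every edge in the tree."""
    if isinstance(value, dict):
        for key, child in value.items():
            # The key labels the edge; the child itself carries no name.
            yield (path, key, kind(child))
            yield from edges(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Assumption: array members are labeled by their position.
            yield (path, index, kind(child))
            yield from edges(child, f"{path}[{index}]")

doc = json.loads('{"name": "Ada", "tags": ["x", "y"]}')
tree_edges = list(edges(doc))
for edge in tree_edges:
    print(edge)
```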
What is the difference between an atomic type and a structured type?
An atomic type has values that cannot be decomposed further (string, number, boolean); a structured type combines several values, like a row in a table or a JSON object.
What is the lexical space of an atomic type?
The set of literal (textual) representations of the type's values, e.g. "1", "01" and "1.0" for a decimal.
What is the value space of an atomic type?
The set of actual values the type can take, independent of how they are written; several lexical forms can denote the same value.
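As a small illustration, several different strings can stand for one and the same decimal value. The sketch below uses Python's decimal module as a stand-in for a decimal atomic type; the list of sample strings is made up.

```python
from decimal import Decimal

# Four distinct lexical forms (strings)...
lexical_forms = ["1", "01", "1.0", "+1.00"]

# ...that all map to a single element of the value space.
values = {Decimal(text) for text in lexical_forms}

print(len(set(lexical_forms)), "lexical forms,", len(values), "value")
```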
Give examples of type cardinalities and their associated symbols.
- exactly one: no symbol (implicit default)
- zero or one (optional): ?
- zero or more (any amount): *
- one or more: +
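The usual cardinality notation (no symbol, ?, *, +) can be read as a pair of bounds on how often an item may occur. A minimal sketch, where the table CARDINALITY_BOUNDS and the function allows are hypothetical names:

```python
import math

# Lower and upper bound on the number of occurrences for each symbol.
CARDINALITY_BOUNDS = {
    "": (1, 1),          # exactly one (no symbol)
    "?": (0, 1),         # zero or one
    "*": (0, math.inf),  # zero or more
    "+": (1, math.inf),  # one or more
}

def allows(symbol, count):
    """Return True if `count` occurrences satisfy the cardinality `symbol`."""
    low, high = CARDINALITY_BOUNDS[symbol]
    return low <= count <= high

print(allows("?", 2), allows("+", 3))
```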
What is the difference between well-formedness and validity?
Well-formed means the document obeys the syntax rules and can be parsed; valid means it is well-formed and also conforms to a given schema.
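A rough sketch of the two checks for JSON: parsing decides well-formedness, and a second step compares the parsed document with a schema. The schema REQUIRED_FIELDS is a made-up, deliberately tiny stand-in for a real schema language.

```python
import json

def is_well_formed(text):
    """A document is well-formed if the JSON parser accepts it."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical schema: required field names and their expected types.
REQUIRED_FIELDS = {"name": str, "age": int}

def is_valid(text):
    """A document is valid if it is well-formed and matches the schema."""
    if not is_well_formed(text):
        return False
    doc = json.loads(text)
    if not isinstance(doc, dict):
        return False
    # Exact type comparison, so that True is not accepted as an integer.
    return all(type(doc.get(field)) is expected
               for field, expected in REQUIRED_FIELDS.items())

print(is_well_formed('{"name": "Ada", "age": "36"}'))  # parses fine
print(is_valid('{"name": "Ada", "age": "36"}'))        # age has the wrong type
```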
Name further data modeling technologies and formats for tree-like data.
Parquet, Avro, Protocol Buffers
Can JSON data be represented as a DataFrame?
Yes, if the schema has no open object types and no heterogeneity in field types.
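One way to picture this condition is a check that every object has the same fields with the same value types. The function fits_dataframe below is a hypothetical, simplified test: it only looks at top-level fields and ignores nesting and null values.

```python
def fits_dataframe(records):
    """Return True if all records share the same fields and field types."""
    def signature(record):
        # Field name -> type name, e.g. {"a": "int", "b": "str"}.
        return {key: type(value).__name__ for key, value in record.items()}

    if not records:
        return True
    first = signature(records[0])
    return all(signature(record) == first for record in records[1:])

homogeneous = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
mixed_types = [{"a": 1}, {"a": "1"}]      # heterogeneous field type
extra_field = [{"a": 1}, {"a": 1, "c": 2}]  # object type is not closed

print(fits_dataframe(homogeneous), fits_dataframe(mixed_types),
      fits_dataframe(extra_field))
```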
Are DataFrames more efficient than JSON?
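Yes, both in time and in space: with a known schema, the data can be stored in a compact binary layout instead of repeating the keys as text in every object.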
Explain the map and shuffle patterns in large-scale data processing.
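- Map: each input item is processed independently, so the work can be divided (like splitting a pile of Pokémon among people who each count the types in their share)
- Shuffle: intermediate results are regrouped by key so that all values for a key end up in the same place (like cutting up the dictionary and assigning each part to a designated person)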
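The two patterns can be sketched on a single machine with a word count. The names map_function and shuffle and the sample lines are assumptions for illustration; in a real cluster the map calls run in parallel and the shuffle moves data between machines.

```python
from collections import defaultdict

def map_function(line):
    """Map: turn one input line into (word, 1) pairs, independently of other lines."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

lines = ["pikachu eevee", "eevee pikachu eevee"]
intermediate = [pair for line in lines for pair in map_function(line)]
grouped = shuffle(intermediate)
print(grouped)
```

After the shuffle, every key has all of its values in one list, which is exactly the input a reduce step needs.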
Describe the physical architecture of MapReduce.
Centralized (master-worker) architecture: one JobTracker assigns tasks to the TaskTrackers running on the worker nodes.
What is the difference between a map function and a map task?
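- Function: the user-defined code that turns one input key-value pair into zero or more intermediate key-value pairs
- Task: the processing of one input split, which calls the map function once for every key-value pair in that split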
What is a map slot in MapReduce?
The resources used to run a map task; one slot is typically one CPU core with some allocated memory.
What is a reduce function in MapReduce?
A function that takes one or more intermediate key-value pairs sharing the same key and returns zero, one, or more output key-value pairs.
What is a combine function in MapReduce?
A function that takes one or more intermediate key-value pairs with the same key and returns zero, one, or more intermediate key-value pairs; it runs on the map side, before the shuffle.
How does combining improve MapReduce’s performance?
It pre-aggregates the map output locally, so less intermediate data has to be shuffled over the network and the reduce phase has less to process.
What assumptions are behind reusing the reduce function as a combine function?
The reduce function is commutative and associative, and its output key-value pairs have the same types as the intermediate key-value pairs.
How can a combine function be designed to speed up a MapReduce job?
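By emitting partial results that can be merged later. For an average, the combine function outputs a partial sum together with its count (the weight), and the reduce function divides the total sum by the total count.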
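A minimal sketch of this idea for an average, assuming each intermediate value is a (sum, count) pair. The function names and the sample numbers are made up; the count plays the role of the weight.

```python
def combine(key, partials):
    """Merge (sum, count) pairs into one (sum, count) pair; output stays intermediate."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return [(key, (total, count))]

def reduce_function(key, partials):
    """Merge the remaining (sum, count) pairs and emit the final average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return [(key, total / count)]

# Intermediate values produced on two different mappers for the same key.
mapper_a = [(4.0, 1), (6.0, 1)]
mapper_b = [(20.0, 1)]

# Combine locally on each mapper, then reduce over the combined pairs.
combined = combine("t", mapper_a) + combine("t", mapper_b)
result = reduce_function("t", [value for _, value in combined])
print(combined, result)
```

Averaging the two local averages directly would give (5.0 + 20.0) / 2 = 12.5, while carrying the counts along gives the correct 10.0.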
Why does MapReduce perform well on a distributed file system?
It brings the query to the data, reducing the need to transfer data.
What bottleneck suggests the use of MapReduce?
A disk I/O bottleneck, i.e. when reading data from and writing data to disk is the limiting factor.
How do MapReduce splits differ from HDFS blocks?
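HDFS blocks are physical, fixed-size byte ranges, while MapReduce splits are logical and aligned with record boundaries. A split usually corresponds to one block but may read slightly past the block's end to finish its last record; each split is handled by its own map task, which allows for parallelism.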
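The difference can be sketched with a toy file of newline-terminated records. This is a simplified model, not Hadoop's actual reader: the functions blocks and splits, the 8-byte block size, and the rule that a split owns every record starting in its block are assumptions for illustration.

```python
def blocks(data, block_size):
    """Physical view: cut the bytes at fixed offsets, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def next_record_start(data, position):
    """Return the offset just after the first newline at or after `position`."""
    newline = data.find(b"\n", position)
    return len(data) if newline == -1 else newline + 1

def splits(data, block_size):
    """Logical view: one split per block, adjusted to whole records."""
    result = []
    for start in range(0, len(data), block_size):
        end = min(start + block_size, len(data))
        # Skip a record that began in the previous block; that split owns it.
        first = start
        if start > 0 and data[start - 1:start] != b"\n":
            first = next_record_start(data, start)
        # Read past the block end to finish the last record that starts here.
        last = end
        if end < len(data) and data[end - 1:end] != b"\n":
            last = next_record_start(data, end)
        result.append(data[first:last])
    return result

data = b"alpha\nbravo\ncharlie\n"
physical = blocks(data, 8)
logical = splits(data, 8)
print(physical)
print(logical)
```

In this toy run the last split is empty because its only bytes belong to a record that started in the previous block.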
What does the Java API of MapReduce look like at a high level?
The user defines a mapper class and a reducer class (extending Mapper and Reducer and overriding map() and reduce()), then configures a Job that references them.