Big Data Lecture 12 Querying Trees Flashcards

1
Q

Why is it not true that we should always build as many indices as possible?

A

Every index on an attribute takes up space and costs computational resources to build and maintain. So even though an index may make queries on that attribute faster, that alone does not mean we should build it.
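
The trade-off can be sketched in Python with a small in-memory hash index. The records, the field names and the helpers `build_index` and `insert` are hypothetical and only illustrate the idea: the index speeds up lookups on one attribute, but it is an extra structure that takes memory and must be updated on every write.

```python
from collections import defaultdict

# Hypothetical records; a real collection would be far larger.
records = [
    {"id": 1, "city": "Zurich", "year": 2020},
    {"id": 2, "city": "Geneva", "year": 2021},
    {"id": 3, "city": "Zurich", "year": 2021},
]

def build_index(rows, attribute):
    """Build a hash index: attribute value -> positions of matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[attribute]].append(position)
    return index

def insert(rows, indices, row):
    """Append a row; every existing index has to be updated as well."""
    rows.append(row)
    for attribute, index in indices.items():
        index[row[attribute]].append(len(rows) - 1)

# The index is extra state that costs memory and must be kept in sync.
city_index = build_index(records, "city")
insert(records, {"city": city_index}, {"id": 4, "city": "Basel", "year": 2022})

# The lookup avoids a full scan, but only for the indexed attribute.
zurich_ids = [records[i]["id"] for i in city_index["Zurich"]]
print(zurich_ids)
```

With one index per attribute, each insert would pay that update cost once per index, which is why more indices are not automatically better.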

2
Q

What is the goal of JSONiq?

A

To let us query unstructured JSON data in a data-independent way.

3
Q

What does RumbleDB actually do?

A

RumbleDB is an engine that maps the logical expressions of JSONiq onto the physical implementation underneath (HDFS, WCS, you name it!).

4
Q

What kind of language is JSONiq?

A

<ul><li>Declarative (we say what we want, and it just happens),</li><li>functional (you can nest it like a boss),</li><li>set-based (everything is a sequence).</li></ul>

5
Q

What is a data LakeHouse?

A

A data lake combined with a data warehouse: at the top level it can be queried like a warehouse, but at the lower level it is just a data lake.

6
Q

How is data represented in JSONiq?

A

Everything is just a sequence of items, even nothing, written (), and a single item, such as ("lol").

7
Q

What happens when you enter valid JSON into JSONiq?

A

It evaluates to itself.

8
Q

Can JSONiq item sequences be nested?

A

No, nested sequences are flattened automatically, e.g. ((1), 2) is the same sequence as (1, 2).
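
The flattening rule can be modeled in Python. This is only a sketch: Python tuples stand in for JSONiq sequences, and the helper `flatten` is a hypothetical name, not part of any JSONiq engine.

```python
def flatten(sequence):
    """Flatten arbitrarily nested tuples into one flat tuple of items."""
    flat = []
    for item in sequence:
        if isinstance(item, tuple):
            # A nested sequence contributes its items, not itself.
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return tuple(flat)

nested = ((1,), 2)
print(flatten(nested) == (1, 2))        # True
print(flatten(((), (1, (2, 3)), 4)))    # (1, 2, 3, 4)
```

Note that the empty sequence simply disappears when it is nested inside another sequence.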

9
Q

Are all types in JSONiq comparable like in MongoDB?

A

No.

10
Q

Can we compare two arrays in JSONiq?

A

Yes, the predicate will be evaluated as an existential quantifier.
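
What "evaluated as an existential quantifier" means can be sketched in Python, assuming the comparison meant is a general comparison such as `=` and modeling both operands as flat Python tuples. The function name `general_equals` is hypothetical, and how an actual engine unpacks arrays before comparing is not covered here.

```python
def general_equals(left, right):
    """True if at least one item on the left equals one item on the right."""
    return any(a == b for a in left for b in right)

print(general_equals((1, 2, 3), (3, 4)))   # True: 3 occurs on both sides
print(general_equals((1, 2), (3, 4)))      # False: no pair of equal items
print(general_equals((), (1,)))            # False: nothing to pair up
```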

11
Q

What are FLWOR expressions?

A

<ul><li>For,</li><li>Let,</li><li>Where,</li><li>Order by,</li><li>Return.</li></ul>

They are a typical pattern in JSONiq, and under the hood they are nicely optimized.
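
The role of each clause can be sketched with a rough Python analogue. This is not JSONiq syntax; the `people` data and `REFERENCE_YEAR` are made up for illustration, and each line is annotated with the clause it loosely corresponds to.

```python
# Hypothetical input collection.
people = [
    {"name": "Ada", "born": 1990},
    {"name": "Bob", "born": 2001},
    {"name": "Cy", "born": 1985},
]
REFERENCE_YEAR = 2024  # fixed so that the example is deterministic

bindings = []
for person in people:                       # for: iterate over the input
    age = REFERENCE_YEAR - person["born"]   # let: bind a derived value
    if age >= 30:                           # where: keep matching bindings
        bindings.append((age, person["name"]))

bindings.sort(key=lambda pair: pair[0])     # order by: sort on age
result = [name for _, name in bindings]     # return: build the output
print(result)                               # ['Ada', 'Cy']
```

In JSONiq the same logic is a single declarative expression, which is what leaves the engine free to optimize it.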

12
Q

What are tuple streams in JSONiq?

A

A formalism for explaining how the clauses of an expression combine: how they chain together and in what order each one is applied to the output of the previous one.
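
One way to picture this is a minimal Python sketch in which every clause consumes a stream of tuples (variable bindings, modeled here as dicts) and produces a new stream. The function names are hypothetical and this is a simplified model of the formalism, not how any particular engine implements it.

```python
def for_clause(tuples, variable, values):
    """Extend each incoming tuple with one binding per value."""
    for t in tuples:
        for value in values:
            yield {**t, variable: value}

def let_clause(tuples, variable, expression):
    """Add one computed binding to each incoming tuple."""
    for t in tuples:
        yield {**t, variable: expression(t)}

def where_clause(tuples, predicate):
    """Drop the tuples that do not satisfy the predicate."""
    for t in tuples:
        if predicate(t):
            yield t

def return_clause(tuples, expression):
    """Turn each remaining tuple into an output item."""
    for t in tuples:
        yield expression(t)

initial = [{}]  # the stream starts as a single empty tuple
stream = for_clause(initial, "x", [1, 2, 3])
stream = for_clause(stream, "y", [10, 20])
stream = let_clause(stream, "sum", lambda t: t["x"] + t["y"])
stream = where_clause(stream, lambda t: t["sum"] > 12)
result = list(return_clause(stream, lambda t: t["sum"]))
print(result)  # [21, 22, 13, 23]
```

The order of the chained calls is the order of the clauses, which is exactly what the formalism makes explicit.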

13
Q

How are types formally expressed in JSONiq?

A

As an item type followed by an occurrence indicator: + (one or more), * (zero or more), ? (zero or one), or nothing (exactly one).
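
The occurrence indicators can be modeled with a small Python check. This is a sketch only: Python classes stand in for item types, tuples stand in for sequences, and `matches` is a hypothetical helper.

```python
# Allowed sequence lengths per indicator: (minimum, maximum or None for unbounded).
BOUNDS = {
    "": (1, 1),      # exactly one item
    "?": (0, 1),     # zero or one item
    "*": (0, None),  # zero or more items
    "+": (1, None),  # one or more items
}

def matches(sequence, item_type, indicator=""):
    """Check a sequence against an item type plus an occurrence indicator."""
    low, high = BOUNDS[indicator]
    count = len(sequence)
    if count < low or (high is not None and count > high):
        return False
    return all(isinstance(item, item_type) for item in sequence)

print(matches((1, 2), int, "+"))   # True
print(matches((), int, "+"))       # False: at least one item is required
print(matches((), int, "?"))       # True: the empty sequence is allowed
print(matches((1, 2), int))        # False: exactly one item is required
```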

14
Q

How is a JSONiq query parsed in the background?

A

<ol><li>An abstract syntax tree is created,</li><li>it is converted to an expression tree using the visitor pattern,</li><li>the expression tree is optimized for execution,</li><li>it is converted to an iterator tree (volcano iterators).</li></ol>

15
Q

What are volcano iterators?

A

Objects on which we can call open(), hasNext(), next() and close().
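
A minimal Python sketch of this interface follows, with `has_next` standing in for hasNext(). The classes `SequenceIterator` and `FilterIterator` and the helper `drain` are hypothetical and do not reproduce the classes of any actual engine. The point is that a parent iterator pulls items from its child one at a time.

```python
class SequenceIterator:
    """Leaf iterator that produces the items of an in-memory sequence."""

    def __init__(self, items):
        self._items = list(items)
        self._position = None  # None means "not open"

    def open(self):
        self._position = 0

    def has_next(self):
        return self._position is not None and self._position < len(self._items)

    def next(self):
        if not self.has_next():
            raise RuntimeError("no more items")
        item = self._items[self._position]
        self._position += 1
        return item

    def close(self):
        self._position = None


class FilterIterator:
    """Iterator that pulls from a child iterator and keeps matching items."""

    def __init__(self, child, predicate):
        self._child = child
        self._predicate = predicate
        self._buffer = []  # holds at most one looked-ahead item

    def open(self):
        self._child.open()
        self._buffer = []

    def has_next(self):
        # Pull from the child until a matching item is found or it runs dry.
        while not self._buffer and self._child.has_next():
            candidate = self._child.next()
            if self._predicate(candidate):
                self._buffer.append(candidate)
        return bool(self._buffer)

    def next(self):
        if not self.has_next():
            raise RuntimeError("no more items")
        return self._buffer.pop()

    def close(self):
        self._child.close()
        self._buffer = []


def drain(iterator):
    """Run the full open / has_next / next / close protocol."""
    results = []
    iterator.open()
    while iterator.has_next():
        results.append(iterator.next())
    iterator.close()
    return results


evens = drain(FilterIterator(SequenceIterator([1, 2, 3, 4]), lambda n: n % 2 == 0))
print(evens)  # [2, 4]
```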

16
Q

Is the iterator tree composed only of volcano iterators?

A

No, there can also be tuple streams in between.

17
Q

What are 3 possible ways of execution?

A

<ol><li>Materialized execution (each intermediate result is fully materialized before the next step of the algorithm is applied),</li><li>streamed execution (only the components needed for the next step are materialized),</li><li>parallel execution (the execution is split across different machines, and each of them materializes its own part).</li></ol>

18
Q

What are the (dis)advantages of each type of execution?

A

<ul><li>Materialized: memory overhead,</li><li>streamed: function call overhead (time),</li><li>parallel: incompleteness of data, YARN executor initialization overhead.</li></ul>
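
The difference between the first two modes can be sketched in Python, with a list standing in for materialized execution and a generator for streamed execution. Parallel execution is not covered, and the function names and the `log` argument are hypothetical, used only to count how many input items get processed.

```python
from itertools import islice

def squares_materialized(numbers, log):
    """Compute every result up front and hold all of them in memory."""
    result = []
    for n in numbers:
        log.append(n)
        result.append(n * n)
    return result

def squares_streamed(numbers, log):
    """Produce results one at a time, only when the consumer asks."""
    for n in numbers:
        log.append(n)
        yield n * n

materialized_log, streamed_log = [], []

# Both runs only need the first three results.
first_materialized = squares_materialized(range(1000), materialized_log)[:3]
first_streamed = list(islice(squares_streamed(range(1000), streamed_log), 3))

print(first_materialized == first_streamed)       # True
print(len(materialized_log), len(streamed_log))   # 1000 3
```

The streamed version touches only what is needed, at the price of one extra call per item, which is the function call overhead listed above.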

19
Q

Which data types does Spark use to handle sequences?

A

RDDs (for heterogeneous sequences) or, where possible, DataFrames (for homogeneous sequences, which is faster).

20
Q

What is a UDF?

A

User-defined function: anything we pack up and send to Spark for it to handle internally (if not native, it can be slow).

21
Q

Do we have to push processes down manually?

A

No, they are pushed down automatically as much as possible.