Big Data Lecture 11 Document Stores Flashcards

1
Q

Why do we need Document Stores?

A

We need to rebuild for XML/JSON the same stack we already had for tables. Starting from information stored as text, every layer above it has to be built all over again!

2
Q

How can we make trees fit into tables?

A

We can:<br></br><ul><li>push flat trees directly into tables,</li><li>put nested data into separate, linked relational tables,</li><li>fill the values missing from heterogeneous data with NULLs.</li></ul>
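
As a minimal sketch of the first and third points, the snippet below flattens nested documents into dotted column names and fills missing values with `None` (standing in for NULL). The documents, the `flatten` helper, and the dotted naming are illustrative assumptions, not something from the lecture.

```python
# Hypothetical heterogeneous documents: the second one has no address,
# the first one has no age.
docs = [
    {"name": "Ada", "address": {"city": "Zurich", "zip": "8000"}},
    {"name": "Bob", "age": 42},
]


def flatten(doc, prefix=""):
    """Turn nested objects into flat columns named like 'address.city'."""
    row = {}
    for key, value in doc.items():
        column = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, column + "."))
        else:
            row[column] = value
    return row


def to_table(documents):
    """Return (column names, rows); missing values become None (NULL)."""
    rows = [flatten(d) for d in documents]
    columns = sorted({c for r in rows for c in r})
    return columns, [[r.get(c) for c in columns] for r in rows]


columns, table = to_table(docs)
```

The more heterogeneous the documents are, the more `None` cells the table contains, which is one reason trees fit tables poorly.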

3
Q

What is the optimal maximum document size for document stores?

A

<=16 MB (in MongoDB, 16 MB is the hard limit on the size of a single BSON document)

4
Q

Which functions of RDBMS do Document Stores implement?

A

<ul><li>Projection,</li><li>Selection,</li><li>Aggregation,</li><li>but NOT Joins! (To be implemented on the user side.)</li></ul>

5
Q

Is data in Document Stores validated? If so, when?

A

If a schema is added, then data is validated on population.<br></br><br></br>A schema can also be added later, and then the already stored data is validated.

6
Q

What are implementations of Document Stores?

A

MongoDB, Elasticsearch, MarkLogic, ArangoDB…

7
Q

How is data loaded into MongoDB?

A

It is ETLed (extracted, transformed, loaded) into the store; we no longer have a data lake!

8
Q

How is data stored in MongoDB?

A

Using a binary encoding of JSON called BSON (used even if the data is validated).

9
Q

What is the CRUD paradigm?

A

Lower-level APIs offer four operations on data:<br></br><ul><li>Create,</li><li>Read,</li><li>Update,</li><li>Delete.</li></ul>

10
Q

How does selection work in Document Stores?

A

We select the documents that match a given attribute value. We can also access nested elements (not to be confused with matching a nested element exactly, children included).<br></br><br></br>We can use a disjunction of conditions or a range query.<br></br><br></br>We can search for values that are not there, as we would for ‘Null’.<br></br><br></br>We can also check the contents of an array, whether something is in it or not.
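
The kinds of selection listed above can be sketched with a small in-memory matcher. The filter syntax imitates MongoDB-style query documents (`$or`, `$gte`, `$lt`, `$in`, `$exists`), but the `matches` function, the `get_path` helper, and the sample data are assumptions written for illustration; this is not MongoDB's implementation and covers only a small subset of its behavior.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return (found, value)."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def matches(doc, query):
    """Check one document against a small subset of MongoDB-style filters."""
    for field, cond in query.items():
        if field == "$or":  # disjunction of sub-queries
            if not any(matches(doc, sub) for sub in cond):
                return False
            continue
        found, value = get_path(doc, field)
        if isinstance(cond, dict) and cond and all(k.startswith("$") for k in cond):
            for op, arg in cond.items():
                if op == "$exists":  # is the field there at all?
                    ok = found == arg
                elif op == "$gte":  # range query, lower bound
                    ok = found and value >= arg
                elif op == "$lt":  # range query, upper bound
                    ok = found and value < arg
                elif op == "$in":
                    ok = found and value in arg
                else:
                    raise ValueError(f"unsupported operator: {op}")
                if not ok:
                    return False
        elif not found:
            return False
        elif isinstance(value, list) and not isinstance(cond, list):
            if cond not in value:  # array field: does it contain the value?
                return False
        elif value != cond:  # plain (or exact nested) equality
            return False
    return True


people = [
    {"name": "Ada", "age": 36, "address": {"city": "Zurich"}, "tags": ["math", "cs"]},
    {"name": "Bob", "age": 17, "address": {"city": "Bern"}, "tags": []},
    {"name": "Cy", "address": {"city": "Zurich"}},
]


def select(docs, query):
    return [d["name"] for d in docs if matches(d, query)]


nested = select(people, {"address.city": "Zurich"})
either = select(people, {"$or": [{"name": "Bob"}, {"age": {"$gte": 18}}]})
missing = select(people, {"age": {"$exists": False}})
in_array = select(people, {"tags": "cs"})
```

Note the difference between `{"address.city": "Zurich"}`, which only looks at one nested field, and `{"address": {"city": "Zurich"}}`, which in this sketch requires the nested object to equal the given one exactly.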

11
Q

How does projection work in Document Stores?

A

We mark with 1 the columns we want to project, or with 0 the columns we want to project away. We cannot mix the two values: since we do not know what the full set of columns is, we can only list what we want or what we certainly do not want.
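
A minimal sketch of the 1/0 rule, assuming a hypothetical `project` helper that works on top-level fields only and rejects mixed specifications:

```python
def project(doc, spec):
    """Keep the fields marked 1, or drop the fields marked 0 (never both)."""
    if not spec:
        return dict(doc)  # no projection requested: return everything
    flags = set(spec.values())
    if flags == {1}:
        return {k: v for k, v in doc.items() if k in spec}
    if flags == {0}:
        return {k: v for k, v in doc.items() if k not in spec}
    raise ValueError("cannot mix 1 and 0 in one projection")


person = {"name": "Ada", "age": 36, "city": "Zurich"}

only_name = project(person, {"name": 1})
without_age = project(person, {"age": 0})
```

Either form works without knowing which other fields a document has, which is exactly why the two cannot be combined.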

12
Q

How can we aggregate the results of a query in Document Stores?

A

We can use sort, count, skip, limit, distinct, … just as in an RDBMS. We can also use ‘aggregate’, which takes a pipeline of stages as its parameter, much like a Spark query.
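
To make the operations concrete, here is a plain-Python sketch of what count, distinct, sort, skip, limit, and a small match/group/sort pipeline compute. The `orders` data and field names are made up for illustration, and no database is involved.

```python
from collections import defaultdict

# Hypothetical collection of orders.
orders = [
    {"customer": "Ada", "amount": 30},
    {"customer": "Bob", "amount": 10},
    {"customer": "Ada", "amount": 20},
    {"customer": "Cy", "amount": 60},
]

count = len(orders)                                       # count
distinct_customers = sorted({o["customer"] for o in orders})  # distinct
by_amount = sorted(orders, key=lambda o: o["amount"], reverse=True)  # sort
page = by_amount[1:3]                                     # skip 1, limit 2

# A three-stage pipeline: match, then group, then sort.
totals = defaultdict(int)
for order in orders:
    if order["amount"] >= 20:                             # match stage
        totals[order["customer"]] += order["amount"]      # group stage (sum)
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # sort stage
```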

13
Q

How to insert, update and delete in MongoDB?

A

<ul><li>Using insertOne, or insertMany,</li><li>using updateOne, or updateMany,</li><li>using deleteOne, or deleteMany.</li></ul>

<div>For update and delete, the Many variant applies to all matching documents, and the One variant only to the first match in the collection.</div>
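
The difference between the One and Many variants can be sketched with a toy in-memory collection. The `Collection` class below is a hypothetical stand-in written for this card (its method names simply follow Python naming style); it only supports equality filters on top-level fields.

```python
class Collection:
    """Toy in-memory collection; filters are equality on top-level fields."""

    def __init__(self):
        self.docs = []

    @staticmethod
    def _is_match(doc, flt):
        return all(doc.get(k) == v for k, v in flt.items())

    def insert_one(self, doc):
        self.docs.append(dict(doc))

    def insert_many(self, docs):
        for doc in docs:
            self.insert_one(doc)

    def update_one(self, flt, changes):
        for doc in self.docs:
            if self._is_match(doc, flt):
                doc.update(changes)
                return  # stop after the first match

    def update_many(self, flt, changes):
        for doc in self.docs:
            if self._is_match(doc, flt):
                doc.update(changes)

    def delete_one(self, flt):
        for i, doc in enumerate(self.docs):
            if self._is_match(doc, flt):
                del self.docs[i]
                return  # stop after the first match

    def delete_many(self, flt):
        self.docs = [d for d in self.docs if not self._is_match(d, flt)]


coll = Collection()
coll.insert_many([{"n": 1, "c": "red"}, {"n": 2, "c": "red"}, {"n": 3, "c": "blue"}])
coll.update_one({"c": "red"}, {"seen": True})    # touches n=1 only
coll.update_many({"c": "red"}, {"c": "green"})   # touches n=1 and n=2
after_updates = [dict(d) for d in coll.docs]
coll.delete_one({"c": "green"})                  # removes n=1
coll.delete_many({"c": "green"})                 # removes n=2
remaining = coll.docs
```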

14
Q

What is the granularity of MongoDB?

A

A single document. Many clients can work on the same database, but only one client can alter a given document at any one time.

15
Q

How to query documents at a higher level?

A

Using a query language, such as JSONiq or XQuery!

16
Q

How is data replicated and sharded in MongoDB?

A

Data is sharded according to its lexicographic sort order, then each shard is replicated multiple times. There is one primary replica and many secondary ones. Data is first replicated synchronously (up to a certain number of replicas), and then asynchronously.
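
Sharding by sort order can be sketched as a lookup in a sorted list of split points. The `boundaries` values and the `shard_for` function are illustrative assumptions; replication is not modeled here.

```python
from bisect import bisect_right

# Hypothetical split points: shard 0 holds keys < "h",
# shard 1 holds "h" <= key < "p", shard 2 holds keys >= "p".
boundaries = ["h", "p"]


def shard_for(key):
    """Return the number of the shard responsible for this key."""
    return bisect_right(boundaries, key)


placement = {key: shard_for(key) for key in ["alice", "h", "mongo", "zoe"]}
```

Because neighboring keys land on the same shard, a range query only has to contact the shards whose intervals overlap the range.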

17
Q

How to have fast look up of specific data? How fast is it?

A

Build a hash index: hash the key you want to search by, and store in memory pointers to the records associated with each hash.<br></br>Lookup is near-instant, O(1).
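
A minimal sketch of the idea, using a Python `dict` (itself a hash table) as the index and list positions as the "pointers". The records and the choice of `city` as the indexed key are assumptions for illustration.

```python
# Hypothetical records; their positions in the list play the role of pointers.
records = [
    {"_id": 1, "city": "Zurich"},
    {"_id": 2, "city": "Bern"},
    {"_id": 3, "city": "Zurich"},
]

# Build the index once: hashed key -> positions of the matching records.
index = {}
for position, record in enumerate(records):
    index.setdefault(record["city"], []).append(position)


def lookup(city):
    """Expected O(1) probe of the index, then follow the pointers."""
    return [records[i] for i in index.get(city, [])]


zurich_ids = [r["_id"] for r in lookup("Zurich")]
nobody = lookup("Geneva")
```

A query such as "all cities between B and M" gets no help from this index, since hashing scatters consecutive keys.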

18
Q

What are the limitations of hash indices?

A

<ol><li>They take time to build,</li><li>they do not support range queries (consecutive values are not stored together),</li><li>the hash function is not perfect (uneven distribution, collisions).</li></ol>

19
Q

What is a B+tree? How to use it for look up? Time complexity?

A

<ul><li>Values are stored at the leaves,</li><li>use a binary-tree-like lookup on the values at each level to decide where to go (left or right, smaller or bigger),</li><li>but every node holds multiple values, so you can also fall in the middle of an interval specified there (watch out: 3 intervals need 2 delimiters),</li><li>each leaf holds pointers to the records in memory,</li><li>lookup takes O(log N).</li></ul>
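
The lookup described above can be sketched on a tiny hand-built tree. The `Node` class, the two-level tree, and the string "pointers" are assumptions for illustration; insertion, node splitting, and the links between leaves of a real B+tree are left out.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Internal nodes have keys + children; leaves have keys + values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children  # None for a leaf
        self.values = values      # pointers to records, leaves only


# 2 delimiters (10, 20) split the key space into 3 intervals:
# key < 10, 10 <= key < 20, key >= 20.
root = Node(
    keys=[10, 20],
    children=[
        Node(keys=[3, 7], values=["rec3", "rec7"]),
        Node(keys=[10, 15], values=["rec10", "rec15"]),
        Node(keys=[20, 42], values=["rec20", "rec42"]),
    ],
)


def search(node, key):
    """Walk from the root to a leaf, then look the key up in that leaf."""
    while node.children is not None:
        # Pick the interval the key falls into among the delimiters.
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None


found = search(root, 15)
boundary = search(root, 20)
absent = search(root, 8)
```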

20
Q

What is the default index in MongoDB?

A

The default index is on the _id field, but indexes on other fields can be created too! The index lookup is really fast; the rest of the post-processing is done in memory!

21
Q

How can we build range queries on multiple attributes? Can we reduce to one of the values only later?

A

Yes, we can build an index ordered on multiple attributes. The index can then only be used for prefixes of that attribute list (‘prefixes are implied’); anything else has to be done in memory.
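
Why only prefixes work can be seen in a small sketch of an index ordered on (country, city). The data, the `find_prefix` helper, and the use of sorted tuples as the index are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records.
people = [
    {"name": "Ada", "country": "CH", "city": "Zurich"},
    {"name": "Bob", "country": "CH", "city": "Bern"},
    {"name": "Cy", "country": "DE", "city": "Berlin"},
    {"name": "Di", "country": "CH", "city": "Zurich"},
]

# Index entries sorted by (country, city); the last item points to the record.
index = sorted((p["country"], p["city"], i) for i, p in enumerate(people))

# Key prefixes of length 1 and 2, computed once when the index is built.
prefix_keys = {n: [entry[:n] for entry in index] for n in (1, 2)}


def find_prefix(*prefix):
    """Binary search works because entries with the same prefix are adjacent."""
    keys = prefix_keys[len(prefix)]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return [people[entry[2]]["name"] for entry in index[lo:hi]]


swiss = find_prefix("CH")                  # prefix (country)
in_zurich = find_prefix("CH", "Zurich")    # prefix (country, city)

# 'city' alone is not a prefix: matching cities are scattered across the
# index, so we fall back to scanning every record in memory.
zurich_any_country = [p["name"] for p in people if p["city"] == "Zurich"]
```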