Big Data Lecture 11 Document Stores Flashcards

Question 1

Q

Why do we need Document Stores?

Answer

A

We need to rebuild for XML/JSON the same stack we already have for tables. Starting from the information stored as text, everything has to be done again!

Question 2

Q

How can we make trees fit into tables?

Answer

A

We can:<br></br><ul><li>push flat trees directly into tables,</li><li>put nested data into linked relational tables,</li><li>fill the values missing from heterogeneous data with NULLs.</li></ul>
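
A minimal Python sketch of the first and third options, assuming toy documents invented for illustration. Here nested keys are flattened into dotted column names rather than split into linked tables, and `flatten` is a hypothetical helper, not part of any library.

```python
# Toy documents (hypothetical data): heterogeneous, with one level of nesting.
docs = [
    {"name": "Ada", "address": {"city": "Zurich", "zip": "8001"}},
    {"name": "Bob", "age": 42},
]

def flatten(doc, prefix=""):
    """Turn nested keys into dotted column names, e.g. address.city."""
    row = {}
    for key, value in doc.items():
        column = prefix + key
        if isinstance(value, dict):
            row.update(flatten(value, column + "."))
        else:
            row[column] = value
    return row

flat = [flatten(d) for d in docs]
# The table's columns are the union of all keys seen in any document.
columns = sorted({c for row in flat for c in row})
# Missing values become None, the Python stand-in for SQL NULL.
table = [[row.get(c) for c in columns] for row in flat]
```

The more heterogeneous the documents, the more NULLs the table holds, which is one reason to store trees natively instead.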

Question 3

Q

What is the optimal maximum document size for document stores?

Question 4

Q

Which functions of RDBMS do Document Stores implement?

Answer

A

<ul><li>Projection,</li><li>Selection,</li><li>Aggregation,</li><li>but NOT Joins! (To be implemented on the user side.)</li></ul>

Question 5

Q

Is data in Document Stores validated? If so, when?

Answer

A

If a schema is added, data is validated on population.<br></br><br></br>A schema can also be added later, and then the data already stored is validated.

Question 6

Q

What are implementations of Document Stores?

Answer

A

MongoDB, Elasticsearch, MarkLogic, ArangoDB…

Question 7

Q

How is data loaded into MongoDB?

Answer

A

It is ETLed (extracted, transformed, loaded) into the store; we do not have a data lake anymore!

Question 8

Q

How is data stored in MongoDB?

Answer

A

Using a binary encoding of JSON called BSON (used even if the data is validated).
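
To make the binary encoding concrete, here is a minimal sketch of a BSON encoder that covers only strings and 32-bit integers in a flat document. `encode_bson` is a hypothetical helper written for this card, not a library function; real drivers handle many more types.

```python
import struct

def encode_bson(doc):
    """Encode a flat dict of str and int32 values; other BSON types are omitted."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"   # field name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the trailing NUL), then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type 0x10: 32-bit little-endian integer
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError("type not covered by this sketch")
    # The total size counts the 4 size bytes and the final NUL terminator.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

encoded = encode_bson({"hello": "world"})
```

The length prefixes are what let a reader skip over fields without parsing them, unlike textual JSON.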

Question 9

Q

What is the CRUD paradigm?

Answer

A

Lower-level APIs Create, Read, Update and Delete data.

Question 10

Q

How does selection work in Document Stores?

Answer

A

We select the documents that match a certain attribute value. We can also access nested elements through a path (not to be confused with matching a nested element exactly, children included).

We can use a disjunction of conditions or a range query.

We can search for values that are absent by querying for ‘null’.

We can also check the contents of an array, i.e. whether something is in it or not.
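
The selection cases above can be sketched with a toy matcher in Python. It imitates a small subset of MongoDB-style filters (dotted paths, `$or`, `$gte`, `$lt`, `$in`, null, array containment); `get_path`, `matches`, `find` and the sample data are hypothetical, and this is not how MongoDB itself evaluates queries.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'address.city'; return None if absent."""
    for part in path.split("."):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc

def matches(doc, query):
    """Toy matcher for a small subset of MongoDB-style filters."""
    for key, cond in query.items():
        if key == "$or":                       # disjunction of sub-queries
            if not any(matches(doc, q) for q in cond):
                return False
            continue
        value = get_path(doc, key)
        if isinstance(cond, dict):             # operators: range or membership
            for op, arg in cond.items():
                if op == "$gte":
                    ok = value is not None and value >= arg
                elif op == "$lt":
                    ok = value is not None and value < arg
                elif op == "$in":
                    ok = value in arg
                else:
                    raise ValueError("operator not covered by this sketch: " + op)
                if not ok:
                    return False
        elif isinstance(value, list):          # array field: does it contain cond?
            if cond not in value and cond != value:
                return False
        elif value != cond:                    # equality; None also matches a missing field
            return False
    return True

people = [
    {"name": "Ada", "age": 36, "address": {"city": "Zurich"}, "tags": ["math", "code"]},
    {"name": "Bob", "age": 17, "address": {"city": "Basel"}},
    {"name": "Cy", "tags": ["code"]},
]

def find(query):
    return [d["name"] for d in people if matches(d, query)]

nested = find({"address.city": "Zurich"})
in_range = find({"age": {"$gte": 18, "$lt": 65}})
either = find({"$or": [{"age": 17}, {"name": "Cy"}]})
absent = find({"age": None})
in_array = find({"tags": "code"})
```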

Question 11

Q

How does projection work in Document Stores?

Answer

A

We mark with 1 the columns we want to project, or with 0 the columns we want to project away. We cannot mix the two, as we do not know the full set of columns; hence we can only state what we want or what we certainly do not want.
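
A small Python sketch of this rule, assuming a hypothetical `project` helper. It ignores the one exception MongoDB allows, namely excluding `_id` inside an inclusion projection.

```python
def project(doc, spec):
    """Toy projection: spec maps field names to 1 (keep) or 0 (drop)."""
    flags = set(spec.values())
    if flags == {1}:      # inclusion: keep only the listed fields
        return {k: v for k, v in doc.items() if k in spec}
    if flags == {0}:      # exclusion: keep everything except the listed fields
        return {k: v for k, v in doc.items() if k not in spec}
    raise ValueError("a projection must be all 1s or all 0s")

person = {"name": "Ada", "age": 36, "city": "Zurich"}
kept = project(person, {"name": 1})
dropped = project(person, {"age": 0})
```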

Question 12

Q

How can we aggregate query results from Document Stores?

Answer

A

We can use sort, count, skip, limit, distinct, …, all the same as in an RDBMS. We can also use ‘aggregate’, which takes parameters like a Spark query.
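
The idea of chaining stages can be sketched in plain Python. The stage functions `match`, `group_sum` and `sort_by` and the `orders` data are hypothetical stand-ins, not the actual ‘aggregate’ API.

```python
from collections import defaultdict

orders = [
    {"customer": "ada", "amount": 30},
    {"customer": "bob", "amount": 5},
    {"customer": "ada", "amount": 20},
    {"customer": "cy", "amount": 40},
]

def match(docs, predicate):
    """Selection stage: keep the documents that satisfy the predicate."""
    return [d for d in docs if predicate(d)]

def group_sum(docs, key, field):
    """Grouping stage: sum `field` per distinct value of `key`."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

def sort_by(docs, field, descending=False):
    return sorted(docs, key=lambda d: d[field], reverse=descending)

# Each stage consumes the output of the previous one.
result = match(orders, lambda d: d["amount"] >= 10)
result = group_sum(result, "customer", "amount")
result = sort_by(result, "total", descending=True)
top = result[:1]   # limit 1
```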

Question 13

Q

How to insert, update and delete in MongoDB?

Answer

A

<ul><li>Using insertOne, or insertMany,</li><li>using updateOne, or updateMany,</li><li>using deleteOne, or deleteMany.</li></ul>

<div>The Many variants act on all matching documents, while the One variants act on just the first match in the collection (for insertion: many documents versus a single one).</div>
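
The one-versus-many distinction can be sketched with an in-memory stand-in. `ToyCollection` and its simplified signatures (a single field/value filter) are assumptions made for illustration and differ from the real MongoDB driver API.

```python
class ToyCollection:
    """In-memory stand-in for a collection; not MongoDB's implementation."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(dict(doc))

    def insert_many(self, docs):
        for doc in docs:
            self.insert_one(doc)

    def _update(self, field, value, changes, only_first):
        count = 0
        for doc in self.docs:
            if doc.get(field) == value:
                doc.update(changes)
                count += 1
                if only_first:
                    break
        return count

    def update_one(self, field, value, changes):
        return self._update(field, value, changes, only_first=True)

    def update_many(self, field, value, changes):
        return self._update(field, value, changes, only_first=False)

    def delete_one(self, field, value):
        for i, doc in enumerate(self.docs):
            if doc.get(field) == value:
                del self.docs[i]      # stop after the first match
                return 1
        return 0

    def delete_many(self, field, value):
        before = len(self.docs)
        self.docs = [d for d in self.docs if d.get(field) != value]
        return before - len(self.docs)

coll = ToyCollection()
coll.insert_many([{"kind": "a", "n": 1}, {"kind": "a", "n": 2}, {"kind": "b", "n": 3}])
updated_one = coll.update_one("kind", "a", {"seen": True})
updated_many = coll.update_many("kind", "a", {"seen": True})
deleted_one = coll.delete_one("kind", "a")
deleted_many = coll.delete_many("kind", "a")
```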

Question 14

Q

What is the granularity of MongoDB?

Answer

A

One document. Many people can work on the same database, but only one person can alter a given document at a time.

Question 15

Q

How to query documents at a higher level?

Answer

A

Using a query language! Just like JSONiq or XQuery!

Question 16

Q

How is data replicated and sharded in MongoDB?

Answer

A

Data is sharded according to its lexicographic sort order, then each shard is replicated multiple times. There is one primary replica and many secondary ones. Data is first replicated synchronously (up to a certain number of replicas) and then asynchronously.
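
Routing a key to its shard by sort order can be sketched with a binary search over split points. The split points and `shard_for` are hypothetical; replication is left out of this sketch.

```python
import bisect

# Hypothetical split points: keys below "h" go to shard 0,
# keys from "h" up to (not including) "p" go to shard 1, the rest to shard 2.
split_points = ["h", "p"]

def shard_for(key):
    """Return the index of the shard whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)

placement = {k: shard_for(k) for k in ["alice", "h", "mongo", "p", "zed"]}
```

Because consecutive keys land on the same shard, a range query only needs to touch the shards whose ranges overlap it.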

Question 17

Q

How to have fast look up of specific data? How fast is it?

Answer

A

Build a hash index: hash the key you want to search by, and store in memory pointers to the records associated with this hash.<br></br>Instant lookup in O(1).
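
A minimal sketch using a Python dict as the hash table, with list positions standing in for pointers. The records and the `lookup` helper are made up for illustration.

```python
from collections import defaultdict

records = [
    {"_id": 1, "city": "Zurich"},
    {"_id": 2, "city": "Basel"},
    {"_id": 3, "city": "Zurich"},
]

# Building the index takes one pass over the data, O(N).
index = defaultdict(list)
for position, record in enumerate(records):
    index[record["city"]].append(position)

def lookup(city):
    """Expected O(1): hash the key, then follow the stored positions."""
    return [records[p] for p in index.get(city, [])]

zurich_ids = [r["_id"] for r in lookup("Zurich")]
missing = lookup("Bern")
```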

Question 18

Q

What are the limitations of hash indices?

Answer

A

<ol><li>Take time to build,</li><li>do not support range queries (consecutive values are not stored together),</li><li>the hash function is not perfect (uneven distribution, collisions).</li></ol>

Question 19

Q

What is a B+tree? How to use it for look up? Time complexity?

Answer

A

<ul><li>Values are stored at the leaves,</li><li>use a binary-tree-like lookup on the values in each level to decide where to go (go left or right, bigger or smaller),</li><li>but every level holds multiple values, so you can also be in the middle of an interval specified there (watch out, 3 intervals need 2 delimiters),</li><li>each leaf holds a pointer to memory.</li><li>Lookup in O(log N).</li></ul>
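
The lookup described above can be sketched on a hand-built two-level tree. The node layout (dicts with `keys`, `children`, `pointers`) and the data are assumptions for illustration; insertion and node splitting are not shown.

```python
import bisect

# Leaves hold the keys and a "pointer" (here just a label) per key.
leaf_a = {"keys": [1, 4], "pointers": ["rec1", "rec4"]}
leaf_b = {"keys": [10, 15], "pointers": ["rec10", "rec15"]}
leaf_c = {"keys": [20, 32], "pointers": ["rec20", "rec32"]}
# Inner node: 2 delimiters define 3 intervals, hence 3 children.
root = {"keys": [10, 20], "children": [leaf_a, leaf_b, leaf_c]}

def search(node, key):
    """Descend from the root to a leaf, then look for the key in that leaf."""
    while "children" in node:
        # Keys smaller than a delimiter go to its left, others to its right.
        node = node["children"][bisect.bisect_right(node["keys"], key)]
    i = bisect.bisect_left(node["keys"], key)
    if i < len(node["keys"]) and node["keys"][i] == key:
        return node["pointers"][i]
    return None

found = [search(root, k) for k in (4, 10, 15, 32)]
not_found = search(root, 7)
```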

Question 20

Q

What is the default index in MongoDB?

Answer

A

The default index is on the _id field, but others can be created! This is really fast; the rest of the post-processing is done in memory!

Question 21

Q

How can we build range queries on multiple attributes? Can we reduce to one of the values only later?

Answer

A

Yes, we can build an order on multiple values. The index can then only be used for prefixes of the value set (‘prefixes are implied’); otherwise the work has to be done in memory.
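
A sketch of why prefixes come for free, using a sorted list of tuples as a stand-in for an index on (country, city). The data and helper names are hypothetical.

```python
import bisect

rows = [
    {"country": "CH", "city": "Zurich", "name": "Ada"},
    {"country": "CH", "city": "Basel", "name": "Bob"},
    {"country": "DE", "city": "Berlin", "name": "Cy"},
    {"country": "CH", "city": "Zurich", "name": "Dee"},
]

# Index entries (country, city, row position), sorted lexicographically.
index = sorted((r["country"], r["city"], i) for i, r in enumerate(rows))
countries = [e[0] for e in index]   # sorted too, because country is the prefix
pairs = [e[:2] for e in index]

def by_country(country):
    """Prefix lookup: binary search works on the first component alone."""
    lo = bisect.bisect_left(countries, country)
    hi = bisect.bisect_right(countries, country)
    return [rows[e[2]]["name"] for e in index[lo:hi]]

def by_country_city(country, city):
    """Full-key lookup: binary search on both components."""
    lo = bisect.bisect_left(pairs, (country, city))
    hi = bisect.bisect_right(pairs, (country, city))
    return [rows[e[2]]["name"] for e in index[lo:hi]]

def by_city(city):
    """Not a prefix: cities are not contiguous, so every entry is scanned."""
    return [rows[e[2]]["name"] for e in index if e[1] == city]

swiss = by_country("CH")
zurich = by_country_city("CH", "Zurich")
scanned = by_city("Zurich")
```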

(21 cards)