Big Data Lecture 07 Data Models Flashcards
How is JSON/XML stored in memory?
Using a tree:<br></br><ul><li>JSON: nodes are values, objects, or arrays; the edges out of an object are annotated with its keys (the graph is directed downwards).</li><li>XML: nodes are elements, whose value is the text inside; the attributes and their values are attached to the respective node (the graph is undirected).</li></ul>
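A minimal sketch of the JSON tree, assuming Python's standard `json` module as the parser. The `walk` helper and the path notation are hypothetical names used only for illustration.

```python
import json

def walk(node, path="$"):
    """Yield (path, kind) for every node of a parsed JSON tree, top-down."""
    if isinstance(node, dict):
        yield path, "object"
        for key, child in node.items():      # edge annotated with the key
            yield from walk(child, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, "array"
        for i, child in enumerate(node):     # edge annotated with the position
            yield from walk(child, f"{path}[{i}]")
    else:
        yield path, "value"                  # leaf: string, number, boolean, null

doc = json.loads('{"name": "Ada", "tags": ["x", "y"]}')
nodes = list(walk(doc))
```

Here `nodes` lists the root object first, then each child reached by following the edges downwards.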
What is XML information set?
The abstract model of the parse tree as it is held in memory, made of information items:<br></br><ul><li>Document Information Item (its child is the main element, stores version metadata),</li><li>Element Information Item (stores local name, children, attributes, parent),</li><li>Text Information Item (stores the character string and the owner element),</li><li>Attribute Information Item (stores local name, normalized value (without quotes), and owner element).</li></ul>
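A rough illustration of these items, assuming Python's standard `xml.etree.ElementTree` as the parser. It is not a full Infoset implementation (for example, its elements do not store a parent link), and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring('<book year="1999"><title>Dune</title></book>')
title = root[0]

summary = {
    "element": root.tag,                    # local name of the element
    "attributes": dict(root.attrib),        # attribute names and normalized values
    "children": [child.tag for child in root],
    "text_of_title": title.text,            # character data owned by <title>
}
```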
What is validation and how does it relate to well-formedness?
Well-formedness is checked against the syntax of the language (e.g. JSON or XML). Validation is done afterwards, on a well-formed document: it checks our own constraints on the structure and values within the language.
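A small sketch of the two steps, assuming JSON as the language. The constraint in `is_valid_person` (an object with a string `name`) is a hypothetical example, not a real schema language.

```python
import json

def is_well_formed(text):
    """Syntax check only: can the text be parsed as JSON at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_valid_person(text):
    """Hypothetical constraint: an object whose "name" is a string."""
    if not is_well_formed(text):
        return False
    doc = json.loads(text)
    return isinstance(doc, dict) and isinstance(doc.get("name"), str)
```

For instance, `{"name": 1}` is well-formed JSON but not valid under this constraint.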
What is the difference between validation and annotation?
Validation only answers True/False, while annotation goes through the data and adds metadata or changes the value structure to prepare the data. Annotation also includes converting the values to the correct data types for storage.
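The difference can be sketched as two functions over the same schema. The schema format below (field name mapped to a converter and a default) is an assumption made for this example, and a default of `None` is taken to mean "required".

```python
from datetime import date

# Hypothetical schema: field name -> (converter, default value).
SCHEMA = {
    "id": (int, None),
    "born": (date.fromisoformat, None),
    "country": (str, "CH"),
}

def validate(record):
    """Validation: only answers True or False."""
    for field, (convert, default) in SCHEMA.items():
        if field not in record:
            if default is None:          # no default: the field is required
                return False
            continue
        try:
            convert(record[field])
        except (TypeError, ValueError):
            return False
    return True

def annotate(record):
    """Annotation: convert values to their types and fill in defaults."""
    out = {}
    for field, (convert, default) in SCHEMA.items():
        out[field] = convert(record[field]) if field in record else default
    return out
```

With `{"id": "7", "born": "1815-12-10"}`, `validate` returns `True` and `annotate` returns typed values plus the default country.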
Why do we do data validation?
Once data has been validated against a schema, it is no longer heterogeneous: it is homogeneous (w.r.t. the schema), and we can use that to pre-load and query the data faster.
What are cardinality markers?
<ul><li>Required (no marker), exactly once,</li><li>repeated, *, zero or more,</li><li>optional, ?, zero or one,</li><li>no name, +, one or more.</li></ul>
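The four markers map to lower and upper bounds on the number of occurrences. A minimal sketch, with the empty string standing for "no marker" (an assumption of this example):

```python
# marker -> (minimum, maximum); None means unbounded.
BOUNDS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}

def occurrences_ok(count, marker=""):
    """Check whether `count` occurrences satisfy the cardinality marker."""
    low, high = BOUNDS[marker]
    return count >= low and (high is None or count <= high)
```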
What is JSound?
A schema language for validating JSON:<br></br><ul><li>write the wanted type in quotes: “string”, “integer”,</li><li>everything is optional by default, put a “!” in front of the key to make it required,</li><li>use “item” if any value is allowed,</li><li>you can nest [] and {} just like in any JSON document.</li></ul>
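A toy checker for the rules above, not the real JSound implementation. It assumes only three atomic type names, treats a one-element list as "array of that type", and ignores keys that the schema does not mention; all of these are simplifications made for this sketch.

```python
ATOMIC = {"string": str, "integer": int, "boolean": bool}

def matches(value, schema):
    """Check a parsed JSON value against a simplified, JSound-like schema."""
    if schema == "item":                       # anything goes
        return True
    if isinstance(schema, str):                # atomic type name
        if isinstance(value, bool) and schema != "boolean":
            return False                       # bool is a subclass of int in Python
        return isinstance(value, ATOMIC[schema])
    if isinstance(schema, list):               # e.g. ["string"]
        return isinstance(value, list) and all(matches(v, schema[0]) for v in value)
    if not isinstance(value, dict):
        return False
    for key, sub in schema.items():
        name = key.lstrip("!")
        if name not in value:
            if key.startswith("!"):            # "!" marks a required key
                return False
            continue
        if not matches(value[name], sub):
            return False
    return True

schema = {"!name": "string", "age": "integer", "tags": ["string"], "extra": "item"}
```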
When is a data set heterogeneous or homogeneous?
If it follows a schema, it is homogeneous with respect to that schema; otherwise it is heterogeneous.
HBase runs on top of HDFS, so how is it still fast?
<ol><li>It serves data from the MemStore and the cache, which are in memory and therefore fast,</li><li>it short-circuits the DataNodes in HDFS, reading local data directly.</li></ol>
What is the difference between atomic and structured types?
<ul><li>Atomic cannot be reduced, e.g. int,</li><li>structured are nested, e.g. arrays and objects.</li></ul>
What atomic types are there? (7)
<ul><li>Strings,</li><li>numbers,</li><li>booleans,</li><li>dates and times,</li><li>time intervals,</li><li>binary,</li><li>null.</li></ul>
What is the difference between lexical and value space?
Value space is the actual value, lexical value is the encoding in characters.<br></br><br></br>There can be a big number of lexical values connected to one actual value.
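A quick illustration using Python's `decimal` module: several lexical values collapse to a single element of the value space. The sample strings are made up.

```python
from decimal import Decimal

# Four different lexical values ...
lexical_values = ["4", "4.0", "04", "4e0"]

# ... all denote the same actual value, so the set has a single element.
values = {Decimal(s) for s in lexical_values}
```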
What is the difference and relation of subtype and supertype?
Subtype’s value space is a subset of supertype’s value space.
What structured types are there? (2)
Maps (e.g. JSON Object),<br></br>and Lists (e.g. JSON Array).
When are the default values from the schema added?
During the annotation phase.
On which data representation is schema validation done?
It is done on the representation stored in memory (the parsed tree), once the document is already known to be well-formed.
What are the rules for JSON Schema?
{<br></br> "type": "here write the type",<br></br> "required": ["here list the names of the required properties"],<br></br> "properties": {<br></br> "fake_name": { "type": "give type here" }<br></br> },<br></br> "additionalProperties": "here define if there can be more stuff, true by default"<br></br>}<br></br>Using true in place of a schema makes it validate with anything, it will always be okay. Using false in place of a schema accepts nothing, so a property whose schema is false must not exist.
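A toy validator for just the keywords on this card (`type`, `required`, `properties`, `additionalProperties`, and boolean schemas). It is a sketch written for illustration, not the behavior of any real JSON Schema library, and it ignores every other keyword.

```python
TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list, "null": type(None),
}

def conforms(value, schema):
    """Check a parsed JSON value against a small subset of JSON Schema."""
    if schema is True:                  # boolean schema: accepts anything
        return True
    if schema is False:                 # boolean schema: accepts nothing
        return False
    expected = schema.get("type")
    if expected is not None:
        if isinstance(value, bool) and expected != "boolean":
            return False                # bool is a subclass of int in Python
        if not isinstance(value, TYPES[expected]):
            return False
    if isinstance(value, dict):
        if any(name not in value for name in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties", True)
        for name, child in value.items():
            if not conforms(child, props.get(name, extra)):
                return False
    return True

schema = {
    "type": "object",
    "required": ["name"],
    "properties": {"name": {"type": "string"}, "secret": False},
}
```

With this schema, `{"name": "Ada", "age": 36}` conforms because additional properties are allowed by default, while any object containing `secret` does not.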
What is impedance mismatch?
When the lexical values in the file do not correspond well to the values represented in the programming language.
How does XML validation work?
The schema is itself an XML document whose elements come from the XML Schema namespace, declared as xmlns:xs="http://www.w3.org/2001/XMLSchema", and we validate our documents against that schema.<br></br><br></br>We use complexType for elements that have attributes or nested elements.<br></br><br></br>We use sequence for an ordered list of child elements, and we specify the repetitions for each element in the sequence (minOccurs, maxOccurs).
What is schema of schemas?
Usually, a JSON Schema is itself a JSON document. The same holds for XML Schema and JSound. So there is, naturally, also a schema for validating schemas. This is called the <i>schema of schemas!</i>
What are DataFrames?
Collections of JSON objects that are valid against a common schema.
What is a dataset from an abstract perspective?
A list of maps!
What is a table w.r.t. a dataframe?
It is a special type of a dataframe, one that has no nesting.
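Viewing a dataset as a list of maps, the "no nesting" condition can be checked directly. A minimal sketch; `is_table` is a hypothetical helper name.

```python
def is_table(rows):
    """True if every value in every row is atomic (no nested list or map)."""
    return all(
        not isinstance(value, (dict, list))
        for row in rows
        for value in row.values()
    )

flat = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]
nested = [{"id": 1, "tags": ["x", "y"]}]
```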
What is Parquet?
A binary, column-oriented storage format for data that follows a schema. The schema is stored once, so attribute names are not repeated for every object, which saves space.
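The space saving from not repeating attribute names can be illustrated by turning a list of records into columns. This is only a toy model of the idea and not Parquet's actual on-disk encoding; it assumes every record has the same keys.

```python
def to_columns(rows):
    """Store each attribute name once, followed by the values of all rows."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}

rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]
columns = to_columns(rows)
```

In `rows` the names `id` and `name` appear once per record; in `columns` they appear once in total.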
What is P1DT7H2M3S?
Lexical value of a duration (time interval): 1 day, 7 hours, 2 minutes, and 3 seconds.
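P1DT7H2M3S follows the ISO 8601 duration syntax. The sketch below maps such a lexical value to a `datetime.timedelta`. It handles only days, hours, minutes, and whole seconds (no years or months), and it is not a complete or strict parser.

```python
import re
from datetime import timedelta

# Simplified pattern: P[nD][T[nH][nM][nS]]
DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)

def parse_duration(text):
    """Convert a lexical duration such as 'P1DT7H2M3S' into a timedelta."""
    match = DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)
```

Two different lexical values can denote the same duration: `parse_duration("PT24H")` and `parse_duration("P1D")` compare equal.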
Can XML without a validation schema be valid?
No, there is nothing to validate it against.