Beyond relational data Flashcards
What is semi-structured data?
Semi-structured data lies in between fully structured data (like relational databases) and entirely unstructured data (arbitrary data files).
What is fully structured data?
Data that fits a strict schema. This makes highly efficient queries possible, but the data must have a very specific shape/structure.
What is unstructured data?
- Can store basically anything: pictures, music, arbitrary text files
- There is no precise description of the structure of this data, which means that programs working with these files need to know exactly how to extract the data
What is semi-structured data?
In between the two extremes you have semi-structured data.
It tries to pick the best features of both extremes: it has lots of flexibility but no fixed schema.
Describe a semi-structured data model
- You end up with a tree-like structure. It doesn’t need to be a strict tree, however; it can just be a set of paths.
- Like B+ trees, the data is found in the leaves. However, it’s not as good for searching and doesn’t have strong balancing properties
- Each edge has a label and the label defines the relationship between the two nodes
Describe each of the elements that make up a semi-structured, tree-like data model
- Leaf nodes: have associated data
- Inner nodes: have edges going to other nodes
- Root: no incoming edges
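A minimal sketch of such a tree in Python. The `Node` class, the labels (`lecturer`, `name`, `email`) and the values are all made up for illustration; the notes do not prescribe any particular representation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node in a semi-structured data tree (hypothetical illustration)."""
    value: Optional[str] = None                 # only leaf nodes carry data
    edges: list = field(default_factory=list)   # (label, child) pairs

    def add(self, label: str, child: "Node") -> "Node":
        """Attach a child under a labelled edge and return the child."""
        self.edges.append((label, child))
        return child

    def is_leaf(self) -> bool:
        return not self.edges


def leaf_values(node: Node) -> list:
    """Collect the data stored in the leaves, left to right."""
    if node.is_leaf():
        return [node.value]
    values = []
    for _label, child in node.edges:
        values.extend(leaf_values(child))
    return values


root = Node()                                   # root: no incoming edges
lecturer = root.add("lecturer", Node())         # inner node: has outgoing edges
lecturer.add("name", Node("Ada"))               # leaf nodes hold the data
lecturer.add("email", Node("ada@example.org"))
lecturer.add("email", Node("ada@example.com"))  # repeating a label is fine
```

Calling `leaf_values(root)` walks every labelled edge and gathers the data from the leaves.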
What advantage do semi-structured data models have over structured data models?
- Because it’s semi-structured, a node can include some pieces of data and leave out others
- It’s not a requirement that each node has the same kinds of properties as every other node
- We can very easily add attributes, just by traversing to the correct place in the tree and adding a node
What is semi-structured data useful for storing?
- Often used for sharing things between companies over the internet
- useful for storing documents
What are some of the forms for storing semi-structured data?
XML, JSON, key-value stores, graphs
Order these types of databases from fastest to slowest for accessing data: XML, JSON, Key-value, relational database
- Key-value
- relational database
- XML, JSON
What is the structure of an XML document?
- The first line (the XML declaration) says that it’s an XML file: XML version 1.0, encoding UTF-8 and standalone = yes. Standalone = yes means that the file doesn’t depend on an external schema (DTD).
- Inside we have a bunch of lecturers with tags around them
- Opening tags have no slash inside them, closing tags have a slash inside them
- An element is an opening tag, its matching closing tag and everything in between
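A small sketch of such a document, parsed with Python's standard `xml.etree.ElementTree` module. The lecturer names are made up, and the document is passed as bytes so that the encoding in the declaration is honoured.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the declaration comes first, then one root element.
DOC = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<lecturers>
  <lecturer>Ada</lecturer>
  <lecturer>Grace</lecturer>
</lecturers>"""

root = ET.fromstring(DOC)

# Iterating over an element yields its child elements;
# .text is the text between the opening and closing tag.
names = [lecturer.text for lecturer in root]
```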
What is not a problem in XML but is in file systems?
- You can think of the tree as a file system, with each node as a folder.
- Children can, however, have the same name. This would be a problem in a normal file system, but it’s fine in XML because when we query we search all paths that satisfy the condition
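As a rough illustration (names made up), two siblings can share the tag `lecturer`, and a path query simply follows every matching path:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<lecturers>"
    "<lecturer><name>Ada</name></lecturer>"
    "<lecturer><name>Grace</name></lecturer>"
    "</lecturers>"
)

# Both children are called "lecturer"; findall explores each of them.
names = [name.text for name in root.findall("lecturer/name")]
```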
What can XML trees not have?
- In XML, we can’t have nodes with multiple parents because XML documents are always trees
- We can have references in trees though, which say that this node points to this other node; it’s basically how shortcuts are done in a file system.
What is the form for an XML element?
- XML files are made up of a bunch of elements
- An element consists of an opening tag, a closing tag and some arbitrary content in between (text or further elements)
What do you do if you want to leave an element empty?
You just combine the opening and closing tags by writing <keyword/>, which means the same as <keyword></keyword>
- Element names are case sensitive, so the keyword in the opening tag and the keyword in the closing tag must match exactly
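A quick check of both points with Python's standard parser; the element name `module` is only an example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<module/>")
long_form = ET.fromstring("<module></module>")

# Both spellings give an element with no text and no children.
both_empty = (
    short_form.text is None and long_form.text is None
    and len(short_form) == 0 and len(long_form) == 0
)

# Tag names are case sensitive: <Module> is not closed by </module>.
try:
    ET.fromstring("<Module></module>")
    mismatched_case_accepted = True
except ET.ParseError:
    mismatched_case_accepted = False
```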
How are attributes defined in elements in XML documents?
- Write the attributes inside the opening tag, after the element name: attribute name = "value", then another attribute name = "value" (if there’s more than one), then close the opening tag
- Each element can only have one attribute per name: you can have as many attributes as you like but they must all be uniquely named
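A short sketch with made-up attribute names (`staffID`, `office`), showing where attributes go and that a repeated attribute name is rejected by the parser:

```python
import xml.etree.ElementTree as ET

# Attributes sit inside the opening tag, with quoted values.
lecturer = ET.fromstring('<lecturer staffID="123" office="2.14">Ada</lecturer>')
attributes = lecturer.attrib

# Using the same attribute name twice is not well formed.
try:
    ET.fromstring('<lecturer staffID="123" staffID="456">Ada</lecturer>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
```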
When should something be an attribute and when should it just be another element?
- A staff ID can be either, because there’s only one ID. Email addresses shouldn’t be attributes, because there could be multiple email addresses and you’re not allowed to write more than one value for an attribute
What is document order?
- Document order defines how the elements of an XML file are ordered: simply the order in which they appear in the file. Whichever element comes first in the physical file is first
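As an illustration (student names and the module code are invented), walking a parsed document with `iter()` visits the elements in exactly this order:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<students>"
    "<student><name>Bo</name><module>COMP1</module></student>"
    "<student><name>Al</name></student>"
    "</students>"
)

# iter() walks the tree in document order: an element is visited
# where its opening tag appears in the file.
order = [element.tag for element in root.iter()]
```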
What is a DTD?
A document type definition (DTD) or an XML Schema is used to define a schema for your XML files. A DTD must be declared at the start of the document, before the root element.
What are entity references?
Entity references are basically shortcuts: if you want to say that two elements are both members of the same group, you define the group once and point to it from both elements instead of writing it out in each of them.
Why do we use entity references?
We do this because, if you just read the document as a file, writing the group out twice could suggest that there are two different groups instead of two places pointing to the same group.
What is CDATA used for?
For marking text that the XML processor should pass on to the application as-is, without interpreting it as markup
- For example, if you want to use < or > inside your text then you need to tell the XML processor that this isn’t an error. This can be done with a CDATA section, written <![CDATA[ ... ]]>.
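A small sketch of the difference, using an invented `note` element. The same text is accepted inside a CDATA section and rejected without one:

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and > are plain characters.
note = ET.fromstring("<note><![CDATA[if a < b and b > c]]></note>")
text = note.text

# Without CDATA, the bare < is read as the start of a tag and fails.
try:
    ET.fromstring("<note>if a < b and b > c</note>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False
```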
What is a good way of defining the format of an XML file?
- A DTD, which acts as a schema for an XML file
- A DTD provides information about the structure of your XML documents, such as which elements may occur, which sub-elements may occur inside an element and which attributes an element has.
What is the value of standalone in the first line when we have a DTD for the XML file?
We set it to “no” because the document now depends on a schema (the DTD)
What are the meanings of the symbols +, *, ?
+ means 1 or more of an item
* means 0 or more of an item
? means 0 or 1 of an item
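These symbols mean the same as in regular expressions, so one rough way to get a feel for them is to check a sequence of child tags against a regex. This is only an analogy, not real DTD validation (Python's standard library does not validate against DTDs), and the content model `(name, email+, phone?, module*)` is made up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model: (name, email+, phone?, module*)
CONTENT_MODEL = re.compile(r"(name )(email )+(phone )?(module )*")


def children_match(element: ET.Element) -> bool:
    """Check the sequence of child tag names against the content model."""
    child_tags = "".join(child.tag + " " for child in element)
    return CONTENT_MODEL.fullmatch(child_tags) is not None


# One name, two emails, no phone, one module: allowed.
ok = ET.fromstring("<lecturer><name/><email/><email/><module/></lecturer>")
# No email at all: email+ needs at least one.
bad = ET.fromstring("<lecturer><name/><phone/></lecturer>")
```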
How do we define that an element has attributes in a DTD?
- First you define the element without the attributes by writing <!ELEMENT module EMPTY>
- Then you define the attributes by writing <!ATTLIST module code CDATA #IMPLIED> and <!ATTLIST module title CDATA #IMPLIED>
What does #IMPLIED mean in a DTD?
- #IMPLIED means that the attribute is optional
How do we say that a specific attribute can’t be left out in the DTD?
- #REQUIRED if this attribute can’t be left out
- Some value such as “COMPXXX” to give a default value
- #FIXED and some value if it’s a constant
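The four kinds of attribute declaration can be imitated by hand to see how they differ. This is a hand-rolled sketch, not a DTD validator: the `ATTLIST` dictionary, the `apply_attlist` helper and all attribute names and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for <!ATTLIST module ...> declarations.
# Each attribute maps to (kind, value), where kind is one of
# "#REQUIRED", "#IMPLIED", "#FIXED" or "default".
ATTLIST = {
    "code": ("#REQUIRED", None),    # must be written in the document
    "title": ("#IMPLIED", None),    # optional, no default
    "level": ("default", "1"),      # optional, default supplied
    "school": ("#FIXED", "CS"),     # constant value
}


def apply_attlist(element, attlist):
    """Return the effective attributes, or raise ValueError if a rule is broken."""
    result = dict(element.attrib)
    for name, (kind, value) in attlist.items():
        if kind == "#REQUIRED" and name not in result:
            raise ValueError(f"missing required attribute {name!r}")
        if kind == "#FIXED" and result.get(name, value) != value:
            raise ValueError(f"attribute {name!r} must be {value!r}")
        if kind in ("default", "#FIXED"):
            result.setdefault(name, value)
    return result


effective = apply_attlist(ET.fromstring('<module code="COMP1"/>'), ATTLIST)
```

Here `effective` contains the written `code` plus the default `level` and the fixed `school`, while the optional `title` is simply absent.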
What are IDREF/IDREFS?
- IDREF references one element
- IDREFS references a list of elements
What does ID allow you to define?
ID allows you to define a unique key to be associated with an element that you can use to point to this element later using IDREF
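A sketch of how an application might follow such references. The attribute names are assumptions: `id` plays the role of an ID attribute and `lecturers` the role of an IDREFS attribute (a space-separated list of IDs). The standard Python parser does not know these types without a DTD, so the lookup is done by hand.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<dept>"
    '<lecturer id="l1"><name>Ada</name></lecturer>'
    '<lecturer id="l2"><name>Grace</name></lecturer>'
    '<module code="COMP1" lecturers="l1 l2"/>'
    "</dept>"
)

# Index every element that carries an id, so references can be resolved.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

module = root.find("module")
# Follow each reference in the IDREFS-style list to the element it names.
taught_by = [by_id[ref].find("name").text for ref in module.get("lecturers").split()]
```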
What are the two levels of document processing?
- Well formed and valid
What does a non-validating processor ensure?
That an XML document is well formed before passing information on to an application
What features does a well formed document have?
- all elements must be within one root element
- elements must be nested in a tree structure without any overlaps
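Python's `xml.etree.ElementTree` is a non-validating parser, so it can serve as a simple well-formedness check. The helper name `is_well_formed` is made up for this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


nested = is_well_formed("<a><b><c/></b></a>")    # proper tree
overlapping = is_well_formed("<a><b></a></b>")   # tags overlap
two_roots = is_well_formed("<a/><b/>")           # more than one root element
```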
What is XPath?
XPath is basically an (advanced) “file” path in XML
What does XPath do when we have multiple children with the same name?
- You can return all of them
- Or you can return the first, last or ith item, depending on what you want
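A sketch using the limited XPath subset in Python's `xml.etree.ElementTree`, which supports position predicates such as `[1]` and `[last()]`. The module codes are invented.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<student>"
    "<module>COMP1</module><module>COMP2</module><module>COMP3</module>"
    "</student>"
)

all_modules = [m.text for m in root.findall("module")]  # every match
first = root.find("module[1]").text                     # positions start at 1
second = root.find("module[2]").text                    # the ith item, i = 2
last = root.find("module[last()]").text
```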
What does XPath allow us to do?
- write queries that return a set of values or nodes from an XML document
- values are strings, integers, reals, etc.
- nodes are the entire document, an element node or an attribute
What is the format for the most basic path in XPath?
- most basic path looks like this /E1/E2/E3/…En
- This is a slash, then the name of an element, then a slash and so on until you reach the desired element
- Whatever you reach by traversing down the path is what is returned.
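A sketch of the path /students/student/name, assuming a made-up document. Python's `xml.etree.ElementTree` only supports a subset of XPath and evaluates paths relative to an element, so the path is written relative to the `students` root here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(   # root element is <students>
    "<students>"
    "<student><name>Bo</name></student>"
    "<student><name>Al</name></student>"
    "</students>"
)

# Equivalent of /students/student/name, starting from the <students> root.
names = [name.text for name in root.findall("./student/name")]
```

Both `name` elements are reached by the path, so both are returned.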
If there is more than one result when traversing an XML document tree, what is the order of the results returned?
- The results are returned in document order, so they come back in the order they’re written in the XML file
What is a relative path expression in XPath?
- If you don’t start the expression with a / then it is evaluated relative to the current node, so it returns matches below that node
- so if we put student/name it will start at student as opposed to the root node students
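A sketch of the same idea with Python's `xml.etree.ElementTree`, where every path is evaluated relative to the element it is called on. The document is invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<students>"
    "<student><name>Bo</name></student>"
    "<student><name>Al</name></student>"
    "</students>"
)

# Relative to the root <students>: student/name reaches every name.
from_root = [n.text for n in root.findall("student/name")]

# Relative to one particular <student>: only what lies below that node.
second_student = root.findall("student")[1]
from_student = [n.text for n in second_student.findall("name")]
```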
How do we return attributes in XPath queries?
- You write a path like before, but as the last step you write an @ and then the name of an attribute, and it will return the value of that attribute
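Python's `xml.etree.ElementTree` cannot return attributes from a path directly (it only supports `@` inside predicates), so this sketch emulates /students/student/@id by selecting the elements and reading the attribute. The `id` attribute and its values are made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<students>"
    '<student id="s1"><name>Bo</name></student>'
    '<student id="s2"><name>Al</name></student>'
    "</students>"
)

# Emulates /students/student/@id: walk to each student, then read the attribute.
ids = [student.get("id") for student in root.findall("student")]

# A predicate on an attribute is supported: pick the student whose id is "s2".
al = root.find("student[@id='s2']/name").text
```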
What does * mean in an XPath query?
Use the * wildcard to return every element directly below the named element
So /students/student/* will return everything directly below each student, e.g. the program code and module code
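A sketch of the wildcard with Python's `xml.etree.ElementTree`, written relative to the `students` root. The element names `program` and `module` and their values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<students>"
    "<student><program>G400</program><module>COMP1</module></student>"
    "<student><program>G401</program></student>"
    "</students>"
)

# Equivalent of /students/student/*: every child of every student.
children = [(child.tag, child.text) for child in root.findall("student/*")]
```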