Beyond relational data Flashcards
What is semi-structured data?
Semi-structured data lies between fully structured data (as in relational databases) and entirely unstructured data (arbitrary data files)
What is fully structured data?
Data that fits a strict schema. The schema makes highly efficient queries possible, but the data must have a highly specific shape/structure
What is unstructured data?
- Can store basically anything: pictures, music, arbitrary text files
- There is no precise description of the structure of this data, so programs that work with these kinds of files need to know exactly how to extract the data themselves
What is semi-structured data?
In between the two extremes you have semi-structured data
It tries to pick the best features of both extremes: it has lots of flexibility but no fixed schema
Describe a semistructured data model
- You end up with a tree-like structure; it does not need to be a full tree, however, and can be just a path.
- Like B+ trees, data is found in the leaves. However, it is not as good for searching and does not have strong balancing properties
- Each edge has a label and the label defines the relationship between the two nodes
Describe each of the elements that make up a semi-structured data tree like model
- Leaf nodes: have associated data
- Inner nodes: have edges going to other nodes
- Root: no incoming edges
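The model above can be sketched as a small edge-labelled tree. This is only an illustration under assumed names: the `Node` class, the `follow` helper, and the lecturer data are hypothetical and not part of any particular database system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Only leaf nodes are expected to carry a value.
    value: Optional[str] = None
    # Outgoing edges as (label, child) pairs; the label names the relationship.
    edges: list = field(default_factory=list)

    def add(self, label, child):
        self.edges.append((label, child))
        return child

    def is_leaf(self):
        return not self.edges


def follow(node, *labels):
    """Return every node reachable from `node` along the given edge labels."""
    current = [node]
    for label in labels:
        current = [child for n in current for (l, child) in n.edges if l == label]
    return current


# The root has no incoming edges; inner nodes only link to other nodes.
root = Node()
lecturer = root.add("lecturer", Node())
lecturer.add("name", Node("Ada"))
lecturer.add("email", Node("ada@example.org"))
lecturer.add("email", Node("ada@uni.example"))

names = [n.value for n in follow(root, "lecturer", "name")]
emails = [n.value for n in follow(root, "lecturer", "email")]
```

Following the labels `lecturer` then `email` returns both email leaves, because a query collects every path that matches the labels.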
What advantage do semi-structured data models have over structured data models?
- As it is semi-structured, we can include some data for one node and leave it out for others
- It is not a requirement that each node has the same kinds of properties as every other node
- We can very easily add attributes by traversing to the correct place in the tree and adding a node
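A minimal sketch of this flexibility, using Python's standard `xml.etree.ElementTree` module. The element names (`lecturers`, `lecturer`, `name`, `email`, `office`) and the values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two lecturers with different sets of properties: only the first has an email.
doc = ET.fromstring(
    "<lecturers>"
    "<lecturer><name>Ada</name><email>ada@example.org</email></lecturer>"
    "<lecturer><name>Alan</name></lecturer>"
    "</lecturers>"
)

# Add a new property to one lecturer only: walk to the node, attach a child.
alan = doc.findall("lecturer")[1]
office = ET.SubElement(alan, "office")
office.text = "Room 2.01"

# Collect the child tags of each lecturer to show that the shapes differ.
child_tags = [[child.tag for child in lec] for lec in doc.findall("lecturer")]
```

No schema change was needed to add `office`, and the first lecturer is unaffected.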
What is semi-structured data useful for storing?
- Often used for sharing things between companies over the internet
- useful for storing documents
What are some of the forms for storing semi-structured data?
XML, JSON, key-value stores, graphs
Order these types of databases from fastest to slowest for accessing data: XML, JSON, Key-value, relational database
- Key-value
- relational database
- XML, JSON
What is the structure of an XML document?
- The first line declares that it is an XML file, e.g. XML version 1.0, encoding UTF-8 and standalone = yes, where standalone = yes means that the document does not rely on an external schema (DTD).
- Inside we have a number of lecturers, each enclosed in tags
- Opening tags have no slash inside them, closing tags have a slash inside them
- An opening tag, its matching closing tag and everything in between form an element
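As a small illustration of this layout, the sketch below parses a made-up document with Python's standard `xml.etree.ElementTree` module. The lecturer names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The XML declaration comes first, followed by a single root element.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<lecturers>
  <lecturer>
    <name>Ada</name>
  </lecturer>
  <lecturer>
    <name>Alan</name>
  </lecturer>
</lecturers>"""

root = ET.fromstring(xml_bytes)

# The root element's tag, and the text content of every <name> element.
root_tag = root.tag
names = [name.text for name in root.iter("name")]
```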
What is not a problem in XML but is in file systems?
- You can think of the tree as a file system, with each node as a folder.
- Children can, however, have the same name. This would be a problem in a normal file system, but it is fine in XML because a query returns all paths that satisfy the condition
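A brief sketch of a path query over same-named children, using Python's standard `xml.etree.ElementTree` module. The `staff` document and the addresses are hypothetical.

```python
import xml.etree.ElementTree as ET

# One lecturer has two <email> children with the same name.
root = ET.fromstring(
    "<staff>"
    "<lecturer><email>a@example.org</email><email>b@example.org</email></lecturer>"
    "<lecturer><email>c@example.org</email></lecturer>"
    "</staff>"
)

# The path lecturer/email matches every email under every lecturer.
emails = [e.text for e in root.findall("lecturer/email")]
```

The query does not have to pick one of the duplicates; it simply returns all three matches.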
What can XML trees not have?
- In XML, we cannot have nodes with multiple parents because XML documents are always trees
- We can have references in trees though, which say that this node points to this other node. This is basically how shortcuts work in a file system.
What is the form for an XML element?
- XML files are made up of elements
- An element consists of an opening tag, a closing tag and, in between, some arbitrary content (text or further elements)
What do you do if you want to leave an element empty?
- You combine the opening and closing tags into a single tag by writing <keyword/>, which is equivalent to <keyword></keyword>
- Element names are case sensitive, so the keyword in the opening tag and the closing tag must match exactly
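Both points can be checked with a short sketch using Python's standard `xml.etree.ElementTree` module. The `office` element name and the `is_well_formed` helper are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The combined tag and the explicit open/close pair describe the same empty element.
short_form = ET.fromstring("<office/>")
long_form = ET.fromstring("<office></office>")
both_empty = (
    len(short_form) == len(long_form) == 0
    and (short_form.text or "") == (long_form.text or "") == ""
)

# Tag names are case sensitive, so <Office> is not closed by </office>.
mismatched_case_ok = is_well_formed("<Office></office>")
```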
How are attributes defined in elements in XML documents?
- Attributes go inside the opening tag: write the element name, then attribute name = value, then another attribute name = value (if there is more than one), and then close the opening tag
- Each element can have only one attribute per name; you can have as many attributes as you like, but they must all be uniquely named
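A short sketch of both rules, using Python's standard `xml.etree.ElementTree` module. The attribute names `staffID` and `title` and the `is_well_formed` helper are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Attributes are written as name="value" pairs inside the opening tag.
lecturer = ET.fromstring('<lecturer staffID="123" title="Dr"><name>Ada</name></lecturer>')
attributes = dict(lecturer.attrib)

# Repeating an attribute name on the same element is rejected by the parser.
duplicate_ok = is_well_formed('<lecturer staffID="123" staffID="456"/>')
```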
When should something be an attribute and when should it be another element?
- Use an attribute only for values that occur at most once. A staff ID can be either, because there is only one ID, whereas email addresses should not be attributes: there could be multiple email addresses, and an element cannot have more than one attribute with the same name
What is document order?
- Document order defines how the elements of an XML file are ordered: they are ordered as they appear in the file. Whichever element comes first in the physical file is first
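A tiny sketch of document order, using Python's standard `xml.etree.ElementTree` module, whose `iter()` visits elements in the order their opening tags appear. The single-letter element names are placeholders.

```python
import xml.etree.ElementTree as ET

# <b> opens before <c>, and <c> (inside <b>) opens before <d>.
root = ET.fromstring("<a><b><c/></b><d/></a>")

# iter() walks the tree in document order, starting with the root itself.
order = [element.tag for element in root.iter()]
```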
What is a DTD?
A document type definition (DTD) or an XML Schema is used to define a schema for your XML files. A DTD must be declared at the start of the document
What are entity references?
Entity references are basically shortcuts: if you want to say that two elements are both members of the same group, you write the group once and point to it from the other place instead of writing it out under both of them.
Why do we use entity references?
If the group were written out twice, a reader of the file could take it to mean that there are two different groups, rather than two places pointing to the same group
What is CDATA used for?
- For marking a block of text that the XML processor should pass on to the application as literal character data instead of parsing it as markup
- For example, if you want to use characters such as < or & inside your text, you need to tell the XML processor that this is not an error. This can be done with CDATA sections.
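A minimal sketch of the difference a CDATA section makes, using Python's standard `xml.etree.ElementTree` module. The `code` element, its content, and the `is_well_formed` helper are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Inside a CDATA section, < and & are read as plain text.
with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c)]]></code>")

# The alternative is to escape each special character individually.
escaped = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c)</code>")

# Without CDATA or escaping, the same text is not well-formed XML.
raw_ok = is_well_formed("<code>if (a < b && b > c)</code>")
```

Both accepted forms yield the same text for the application; CDATA just avoids escaping every character by hand.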
What is a good way of defining the format of an XML file?
- A DTD, which acts as a schema for the XML file
- A DTD provides information about the structure of your XML documents, such as which elements may occur, which sub-elements may occur inside an element, and which attributes each element has.
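As a rough illustration, the sketch below embeds a small DTD at the start of a made-up document and parses it with Python's standard `xml.etree.ElementTree` module. The element and attribute names are assumptions. Note that this parser only reads the document and does not validate it against the DTD; checking conformance would need a validating parser.

```python
import xml.etree.ElementTree as ET

# The DTD sits at the start of the document, before the root element.
doc = """<!DOCTYPE lecturers [
  <!ELEMENT lecturers (lecturer*)>
  <!ELEMENT lecturer (name, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ATTLIST lecturer staffID CDATA #REQUIRED>
]>
<lecturers>
  <lecturer staffID="123">
    <name>Ada</name>
    <email>ada@example.org</email>
  </lecturer>
</lecturers>"""

# Parsing succeeds; the DTD itself is not enforced by this parser.
root = ET.fromstring(doc)
lecturer = root.find("lecturer")
staff_id = lecturer.get("staffID")
children = [child.tag for child in lecturer]
```

Here the DTD says that `lecturers` contains any number of `lecturer` elements, each with exactly one `name`, any number of `email` elements, and a required `staffID` attribute.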