Comp 1314 Data Management 1 Flashcards
What are the three key features of an OS?
Multi-user: Many users same system at the same time
Multi-processing: Multiple processors at the same time
Multi-tasking: Multiple processes at the same time
What is the philosophy of UNIX?
Set the cultural norms for minimalistic modular software development
What is special about UNIX?
Programs can be stringed together but it is secure as programs do not know about each other
What is Linux based on?
UNIX
What is piping?
Redirecting the input and/or output of a program
What is the symbol for piping?
|
Program1 | Program2
What is the Input/Output redirection symbol?
<: Input
>: Output
»: Append
List some programs that you can use
Head
Tail
sort
wc
uniq
du
xargs
more
cut
find
tar
gzip
nohup
parallel
basename
What are the environmental variables?
A set of variables that every running process has access to
Set using the export command
How do you write a for loop in bash?
for var in directionory;
do
something
done;
How do you do a while loop in bash?
cat file | while read line;
do
something
done;
What does grep do?
Searches for the input provided in the text provided
Can even use regular expressions
What does SED stand for?
Text Stream Editor
What does SED do?
Reads the input provided and modifies it as specified by the command and then writes that to the standard input
sed [options] command [file]
What does scp do?
Securely copies files from a secured server
What does awk do?
For processing structured text files into rows and columns
What are wildcards for?
Allows multiple arguments for commands (accessing multiple files becomes easy) using regular expressions
What are the three permission categories?
u: users
g: group
o: others
How do you change the permissions of a file?
chmod u+=x
chmod g-w*
You can also specify a decimal number which will be converted into binary and then assigned where each bit is a 1
Who can change the permissions of a file?
The owner of a file and the superuser
What is everything in UNIX?
Either a file or a process, this include directories.
How can you move a process to the background?
bg
Adding a & symbol will begin the process in the background
What are the two options with the kill command?
SIGTERM: A gentle request to kill, giving the process time to close
SIGKILL: A hard-request with no clean up time.
How can you prevent all processes terminating when you log off?
With screens
What is a CSV file?
Comma separate values
Easy to manipulate and process but new lines and commas in text can be problematic
What does YAML stand for?
Yet Another Markup Language
Fixed Row and column format
Widely used within config files and passing messages between applications
Made of key-value pairs
What does JSON stand for?
JavaScript Object Notation
What are the benefits of being understandable as a machine?
Searching
Aggregation and Summarisation
Prediction
What is metadata?
Data that is useful for the machine, but not for the human.
What is markup?
Contains a bunch of semantic links to other pages.
What is SGML?
Standard Generalised Markup Language
A superset of all markup language
Separate structure from content
What is XML?
Extensible Markup Language
Designed to carry data, not display data
What must XML have?
A root element
What is the issue with namespaces?
XML files can reference other XML files and people can define two tags with the same name
How is the namespace problem resolved?
Using the special namespace tag to separate the two documents apart.
What is a URI?
A unique full name of the namespaces, by convention this is used as the URL but they are not the same.
What does a schema enable us to do?
Gives meaning to the structured data we work with, lots of people have already done this.
So there are a lot of different schemas
Give an example schema
<Pizza>
<base></base>Pan</Base>
<Cheese>Cheddar</Cheese>
</Pizza>
How can we specify the number of occurrences in a sequence?
minOccurs
maxOccurs
What does DTD stand for?
Document Type Definition
What is XSD?
XML Schema Definition
Every element is either simple or complex
How many parsers does HTML5 contain?
2
One which can parse HTML or XML
Why is XML not used for large scale solutions?
We could easily loose all of the data, database can mitigate this.
What is a DBMS?
Database Management System
A collection of software that manages a database
Why do we need DBMS?
Handles all the data exchange and creates data independence
Applications don’t need to care about the database
What is logical and physical independence?
Logical: Protect from change in the data structure
Physical: Protect from changes in how the data is stored
What does the DBMS manage?
The data model
Store large amounts of data persistently, conveniently and efficiently
Transaction management
Concurrency control
Access Control
Resiliency
What is a data model?
A collection of mathematical concepts for describing data
Every model has a data language
What are the two parts of a data language?
A data definition language
A data manipulation language
What is XPATH?
The Data manipulation language for XML
What properties does a relation have?
Each row represents a k-tuple of R
The ordering of rows is immaterial
All rows are distinct
The order of attributes should not be significant
The significance of each column is conveyed by the name we give it.
What is a k-ary relation scheme?
A relation name and an ordered sequence of k attributes
What is an instance of a relation scheme?
A relation that conforms to the schema
Arities match
Data types of attributes match
What is a key?
A set of attributes where any two different tuples cannot have the same value
What is a superkey?
For every relation R, X is a set of attributes of R, X is a superkey of R, if X -> A, for every attribute A of R
The set of all attributes will always be a superkey
What is a candidate key?
X is a candidate key of R where X is a minimal superkey of R
Where X is a superkey and there is no superkey Y such that Y is a subset of X
Describe the steps of the closure algorithm
X+ = {Ai : F ⊧ X → Ai } is the closure of X with respect to F * F ⊧ X → Y if and only if Y ⊆X+
INPUT: R, attribute set U, F, X is a subset of U
OUTPUT: X+ = {Ai : F⊧X → Ai }
Repeat until: X_n+1 == X_n
R(ABCDEFG)
F = {AB -> CD, C -> EG, D->H}
0. X={A,C}
1. X_0={A,C} only path is to EG
2. X_1={A<C<E<G}
3. X_2 = X_1
4. This imples that X is not a superkey
- X={A,B}
- X_0={A,B}
- X_1={A,B,C,D}
- X_2={A,B,C,D,E,G,H}
- X_3=X_2
- As X_3 = R, this implies X is a superkey
- X is also the only candidate key, this can be determined by apply the algorithm to each subset
Why is bad database design a problem?
It can lead to anomalies
What is a functional dependency?
When two tuples within a relation agree on the values of A1,…,An then they also agree on B
A –> B
So the right side cannot change where the left side is the same.
What is splitting/combining in Functional dependencies?
We can split bigger functional dependencies into smaller ones and vice versa.
What is normalisation for?
Avoiding anomalies
Give examples of anomalies
Redundancy
Update anomalies
Insert anomalies
Deletion anomalies
What is 1st Normal Form?
A relation that contains only atomic values with no repeating groups
What is 2nd Normal Form?
No partial key dependencies
Every non-key attributes is dependant on all attributes of all candidates keys
What is 3rd Normal Form?
All attributes are determined only by the keys
What is transitive dependence?
A -> B -> C but B does not determine A
What is Boyce-Codd Normal Form?
Every determinant is a candidate key
What are the benefits and drawbacks of BCNF?
Advantages:
No redundancy
Efficiency
No duplication
Changes can cascade across relations
Disadvantages:
More tables
More complex
More relationships
Queries become more complex
What is SQL?
Structured Query Language
Converts a data model to a physical databases by specifying a DDL and a DML
What is SQLite?
All of the database contained within a single file.
Simple databases
What types does SQL use?
INTEGER
REAL
TEXT
BLOB
NULL
What joins are possible in SQLite?
LEFT: All of 1 and matches in 2
RIGHT: All of 2 and matches in 1
FULL OUTER: Everything from both
What are views?
Virtual tables which have a name and can run queries. Pre-joined for convenience
What are indexes?
Data structures associated with tables to support queries, logically ordered by the values of the key.
Why are indexes used?
To improve the performance of looking up data
How is an index created?
CREATE INDEX <name> ON <table></name>
What does NO SQL stand for?
Not Only SQL
Less adherence to ACID and schemas
What does NOSQL do?
Storage and retrieval in a non-tabular format
More flexible with various types of data.
Why is NoSQL more flexible?
As it has no fixed schema and no fields are necessary
What is scalability?
Ability of a system to handle increasing amounts of workload or data by adding resources to the system.
What is vertical scalability?
Scaling by adding more CPUs to handle increased workload.
What is horizontal scalability?
As workload increases, we distribute between multiple nodes and so we add more nodes.
What architecture does NoSQL operate as?
Distributed Architecture
What is sharding?
Partitioning data across multiple nodes to distribute the workload
What does sharding require and allow?
Every nodes is responsible for the data on it, and a request can be handled in parallel. The system will require a shard management system to keep track of shards.
What is a shard key?
An attribute used to determine how data is distributed between nodes
Why is data replicated between multiple nodes?
To provide fault tolerance and high availability
The replication factor determines the number of replicas
What is CAP theorem?
Consistency - Every read is the most recent
Availability - Every request receives a response without guaranteeing consistency
Partition Tolerance - The system continues to operate despite network partitions.
What are trade offs with CAP?
You can’t have CAP, so we have CP or AP.
If down then availability is sacrificed - CP
If down then consistency is sacrificed - AP
When is CA possible?
In a single-node system or partition free environment.
What is BASE?
BA: Basically Available
System guarantees availability, will always respond to a request
S: Soft State
The state of the database changes over time even with no new input
E: Eventual Consistency
System does not guarantee immediate consistency across all nodes after a write operation, instead it will ensure that id no new updates are made, all nodes will eventually converge to the same state.
What are the types of NoSQL database?
Key-value
Document
Graph
Column-family
What is key-value style NoSQL?
Storing data as a collection of key-value pairs. Each item has a unique key used to access it. With no schema
What is Document style NoSQL?
Semi-structured format, typically in JSON, BSON or XML.
Each document is self-contained unit of data that can contain key-value pairs. Allows hierarchical data.
What is graph style NoSQL?
Storing data in the form of a graph with nodes and edges. This is good for interconnected data with lots of relationships.
What is a column-family style NoSQL?
Organises data into columns rather than rows. Handles large volumes of data with support for distributed architectures,
Each row can have a different set of columns
What are hybrid NoSQL databases?
Combining features from multiple types to offer flexibility to handle diverse data models.
What factors should be considered when picking a type of NoSQL database?
Current Data
Application requirements
Evaluate NoSQL database types
Consider consistency models
Access scalability and performance
Examine operational conditions
Evaluate ecosystem and integration
Perform a proof of concept.
Explain the basics of using XPATH
// Skip to node
/ To next node
@Attribute
[Filter by something]
text() Only the text stored within the attribute
How do indexes work?
The chosen attribute for the index becomes an index key. The system is implemented as either a binary tree or hash indexes.
DBMS plans the best way to execute a query with it’s indexes.
What is the drawback of indexes?
They slow down INSERT, UPDATE and DELETE queries as they must account for the indexes and update them as well.
How do you create views?
CREATE VIEW Name AS SELECT …
What is ACID in relational database transactions?
Atomicity
Consistency
Isolation
Durability
What is atomicity in a transaction?
Each statement within a transaction is treated separately. Either the entire statement is executed though, or none of it is.
What is consistency in a transaction?
Changes to tables can only happen in predefined, predictable ways.
What is isolation in a transaction?
Isolation of transactions between multiple users to ensure that transactions from multiple users don’t affect others.
What is durability in a transaction?
Ensures that changes to your data model by successful transactions are saved, even in the event of system failure.
Give an example of each type of NoSQL database.
Column-family:
Cassandra
Key-value:
Dynamo DB
Document:
MongoDB
Graph:
Neo4j