Data Management Flashcards
Operating System
Intermediary between software and hardware, managing hardware allocation.
UNIX Philosophy
- Each program does 1 thing well
- Output of every program expected to be input of another
- Try software early, expect wasted effort
- Use tools to help program over unskilled help
Linux Benefits
- Largely virus free as limited user system access + not many viruses are made for Linux
- Kernel separate from rest of OS preventing bugs elsewhere in OS from crashing whole system
Index Node (inode)
Describes a file-system object (file/directory). Stores attributes and location of data (metadata).
Inode number
References an inode. Associated with a file object name.
Separating metadata benefits.
- Allows for fast moving of files
- Can alter file while opened by another applicaiton.
pwd
Gives the current absolute path.
ls
Lists the files at the current location.
cd
Move Directory
Meaning of UNIX files starting with a dot?
They are hidden.
man
man [cmd] - Gives help with command.
mkdir
mkdir [directory name] - Creates a directory(folder)
rmdir
rmdir [directory name] - removes a directory. Must be empty
touch
touch [filename] - creates empty file.
cat
displays the contents of entire file
less
Displays part of file allowing for forwards and backwards movement.
head
UNIX
Top (default) 10 lines of file
tail
Bottom (default) 10 lines of file
program to program piping
program1 | program2 (program 1 output goes to program2 input)
program to file piping
program > file - program output written to file. > > used to append (no space)
file to program piping
program < file - program takes input from file
filter
Program that accepts text and changes it
pipe
Connection between two filters
wc
Prints the number of lines, words and characters
uniq
Removes duplicate adjacent lines
Version Control System
Records changes to files over time so they can be undone and viewed
Local VCS
Database on your computer holds all the changes. Does not allow collaboration
Centralised VCS
Single server stores changes, with users checking out files. Allows for collaboration but single point of failure
Distributed VCS
Changes (repository) stored in server and locally, with changes copied to each other.
VCS add
put file in local repository
VCS commit
commit changes to file to local repository
VCS commit message
Message describing a commit
VCS check in/push
Upload local repository content to remote repository
VCS check out/pull
Download file from remote repository
Conflict
When changes made cannot be merged automatically, resolved by manually applying changes to latest version
reverse integration
Copies new features from a branch to main. keeps new code out of main
Forward Integration
Copies latest changes from main to branch, keeping branch up to date
ps
Views current processes
top
CPU usage of processes
kill | UNIX
kill <PID> PID is process ID
options:
SIGTERM - requests process to stop, time for graceful shutdown
SIGKILL - forces process to stop execution</PID>
bg/fg
UNIX
Moves a process to the background/foregorund
screen
UNIX
Allows for creation of screens to run processes in the background
Environment Variables
Accessible by all processes run in the shell
PATH
Ordered List of directories that store executables to be run
export
Sets environment variables export variableName=’value’. Gives all environment variables if no argument
grep
Searches for lines containg the given input. grep
[pattern] [input]
Special Characters
Regular Expressions (grep)
* - zero or more of previous
? - zero or one of previous
+ - one or more of previous
. - wildcard
[] - range of characters
sed
Takes in text and modifies it. sed [command] [file]. commands e.g. ‘s/Hello/Hi/g’ (replaces first Hello on each line with Hi, g means it affects every instance on a line), ‘/Run/d’ (deletes all lines that contain Run)
awk
Allows for processing of tables. awk [pattern] {action}. By default actions are run on every line. $number used to give column
BEGIN
for awk
action run once at the start
END
for awk
action run once at the end
applying conditions
for awk
(condition){action}, action only run on lines that meet condition
LaTeX benefits over Word
- Easily allows for displaying of complex equations
- Can compile large documents easily
- Placement of figures and tables is easy
- Automatic referencing
- OS independant
Creating LaTeX documents
Typed as a .tex file and the LaTeX engine compiles to .pdf
Math Mode
LaTeX
Open and close with $. Allows for mathematical symbols
Wildcards
UNIX
Allows for operating on multiple files at once
* Any characters
? Any singular character
[] One character out of those given
chmod
UNIX
Changes permssions for files/directories. If using number each number represents 3 digit binary
ls -l permission information
first column shows permission infromation as 10 character string. First character shows directory/file, then split into 3 character chunks for each accessor (owner, group and other). The three characters represent whether the file/directory is readable/writable/executable.
Directory Permissions
UNIX
Executable directories can be opened
TSV
Tab Seperated values, form of structured data
Benefits of machine readable data
- Searching
- aggregation and summation
- Prediction
- Linking - links info from different sources
Relationship Modelling
A way to make human data machine readable. Involves creating a model that shows relationships between elements
Hierarchical
Ralationship Modelling
Entities are connected with each other and attributes in a tree structure
Network
Relationship Modelling
Entities and attributes are connected in a directional graph.
Object Oriented
Relationship Modelling
Entities and attributes are connected in a directional graph. Additionally, Entities are classes, with attributes values coming with a pointer to the attribute
Entity Relational
Relationship Modelling
Entites are now tables. They are linked by a key property
Markup
Addition of metadata to document. Allows for structure and additional meaning to text. Allows for machine to gleam meaning from text
YAML
Uses whitespace for structure to allow for easy reading but harder writing. Written with key-value pairs, i.e. variableName: data
Uses of YAML
- Config Files
- Passing data between application
- Storing simple application states
Issues with YAML
The syntax can be ambiguous, so may get different results with different parsers (code that splits up text). Not widely used
JSON
Stores objects. Subset of YAML. Can be read by most languages. Contains:
* Objects
* Values - “object”: “value”
* Lists - “object”: [value1, value2]
JSON uses
Sending data on the web/between programs. Sometimes used for config data
HTML
A markup language used for documents with hypertext (links). Tags say how to display data i.e. <text> TEXT HERE <\text>
Liquid | structured data
Markup language for Shopify
SGML
Markup
Standard Generalised Markup Language - A standard for defining markup languages. Super set of all markup languages e.g. XML, HTML
SGML issues
Markup
- complex
- no strict structure
- Requires a definition of structure
XML benefits
Markup
- Easier to parse
- Simplifies SGML
- Don’t need to define structure
XML
Markup
eXtensible Markup Language - Hierarchical with only tags, attributes and content. Made to carry data not display data. No defined tags
XML Syntax
Markup
Defines how it is written:
* closing tags for all tags
* case sensitive
* must have root element
* attributes are quoted
Schema
XML
Used as a template to ensure an XML file is written in a certain way
SimpleType
XML element
Only contains text
ComplexType
XML element
Can contain attributes and children
Namespaces
XML
Gives a prefix to tags with the same name, allowing for distinguishing between tags with the same name. xmlns:<localname>=”someurl”. Then all tags in the namespace have <localname>:tag
XML + CSS
Can apply a stylesheet to add presentation to the XML file
XPath
XML path language. The way to query data from an XML file
/path/to/element
XPath
Gives the node at the path given
//
XPath
Gets all nodes with that match
*
XPath
Matches any node
@attribute
XPath
Gets nodes with the given attribute. i.e. //tag[@attribute = ‘value’]
text()
XPath
Gets the text directly inside of the node
node[n]
XPath
Gets the nth occurence of that node
node[last()]
XPath
Gets the last occurence of that node
node1[node2]
XPath
Gets node1s that contain a node2
node1[@attribute=”value”]
XPath
Gets node1s with the given attribute being equal to the given value
..
XPath
Gets the parent of the current node
and/or/not
XPath
used in square brackets, also use brackets
|
XPath
Gets both the nodes returned from the query on the left and right
contains(location, “text”)
XPath
Used in square brackets, true for nodes containing the given text in the given location. location examples: @attribute, text()
string-length(string)
XPath
Gets the length of the given string
starts-with(location, “text”)
True if the location (attribute/text) starts with the text.
ends-with(string, “text”)
True if the string ends with the text.
DBMS
Database Managment System - Collection of software that manages a database
Data Independance
Idea that other applications and users should be insulated from data structure (logical independace) and storage (physical independance).
Logical Independance
Protection from changes to logical structure of data i.e. the schema of it
Physical independance
Protection from changes to physical storage of data i.e. whether it’s stored on a hard drive
Data Model
Defines how data is represented, organised and structured. Relational model is most widely used. Has a data language containing DDL and DML
Data Language
used to modify and retrieve data. Contains data definition language (DDL) and data manipulation langage (DML)
DDL
Data definition language - syntax for describing database templates. Includes creating tables and defining keys e.g. XML schema
DML
Data manipulation language - used for querying data e.g. XPath
Relation
Relational Database Model
A table, formally defined in set theory R ⊆ P(S1 × S2)
k-tuple
relations
ordered sequence of k elements
k-ary relation
relations
unordered subset of cartesion product of k sets(attributes). Contains many k-tuples. An instance of a k-ary relation schema
k-ary relation schema
relations
An ordered sequence of k-attributes. A template for a k-ary relation
relational database schema
relations
set of relation schemas
relational database instance
relations
set of relations, each of which being an instance of a relation schema. Called a database
Intension
relations
schemas. Changes rarely
Extension
relations
instances i.e. relations
Key
A set of attributes which is unique for all tuples. Can be made by combining attributes
Functional Dependancy
A relation r satisfies a functional dependancy A -> B if all tuples in r with the same value for attributes in A have the same value for attributes in B. Allows deduction of value of B for a given value of A
Determinant
Functional Dependancy
The set of attributes on the left hand side of a functional dependancy
Dependant Set
Functional Dependancy
The set of attributes on the right hand side of a functional dependancy
Splitting/Combining Rule
Functional Dependacy
A -> B is equivalent to A -> every element of B
Trivial Dependancy
Functional Dependancy
If B ⊆ A then A -> B
Obvious as e.g. if height and weight known, then height known
Implication
Functional Dependancy
Some functional dependancies S ⊨ (implies) A -> B if every relation instance that satisfies S also satisfies A -> B. A relation that fits the requirements for all of S also follows A -> B
Equivalence
Functional Dependancy
S is equivalent to T if S ⊨ T and T ⊨ S
superkey
If the attribute(s) are the determinant for every attribute. The set of all attributes is always a superkey
Surrogate Key
Database
Uniquely identifies each attribute, created for this purpose
Candidate Key
A superkey that has no other superkey included in it. e.g. if {height, weight} and {height} were superkeys, only {height} would be a candidate key
Closure Algorithm
Functional Dependancy
Determines if a set of attributes is a superkey. Steps:
1. Get a dependant set that can be reached with the current attributes
2. If there are no new attributes in any dependant sets, the original attributes are not a superkey so stop
3. Else, union the current attributes with the dependant set
4. Repeat 1-3 until the current attributes is the set of all attributes
Poor Relation
Functional Dependancy
Poor relation if X -> A and X is not a super key as it can lead to redundant data
Anomalies
Databases
Issues that can happen in a bad database:
* Redundancy - same data in multiple places
* Updates - updates can cause data to be inconsistent
* Inserts - Forced to fill in extra irrelevant attributes
* Deletion - Extra data that wasn’t intended to be deleted deleted
0th Normal Form
Databases
Unormalised, all data stored in one table
1st Normal Form
Databases
- Cannot have multiple values for one attribute
- Cannot have the same attribute in multiple columns
Minimal Set of Functional Dependancies
- Each FD has 1 attribute on the right hand side
- Minimal amount of attributes on the left hand side
- No redundant FDs, i.e. implied by other FDs
Partial Key Dependancy
Normalisation
Where an attribute only depends on part of any candidate key
2nd Normal Form
Databases
- 1st Normal Form
- All attributes not part of any candidate key are dependant on all parts of all candidate keys
1st Normal Form Creation
Databases
Take the illegal data and place it in a new table using the key
2nd Normal Form Creation
Databases
If values (not part of any candidate keys) only depend on (need to be determined) part of the candidate keys then split into tables with the key being the depended on attributes
3rd Normal Form
Databases
- 2nd Normal Form
- All non-key attributes are only determined by the keys, not anything else
3rd Normal Form Creation
Databases
If a non-superkey attribute depends on non-super key attribute(s) place the depended on attribute as the key for a new table with the attributes that depend on it
Boyce-Codd Normal Form
Databases
- 3rd Normal Form
- Every determinant is a candidate key
Boyce-Codd Normal Form Creation
Databases
If a functional dependancy exists with the determinant being non-key, a new table is made with the key being the non-key determinant
Normalisation Benefits
- Less Redundancy, so less storage
- More efficient to query
- No duplication so no inconsistency
Relation
Databases
A collection of tuples. Visually represented as a table
Relationship
Databases
A link between relations
Conceptual Modelling
Databases
Type of modelling that Identifies entity names and relationships, sometimes attributes. Made from requirements directly. No database design.
Logical Modelling
Databases
Type of modelling that identifies attributes and attribute types(e.g. int)
Physical Modelling
Databases
Type of modelling Aiming to represent database structure. Has actual tables and attributes. Implements relationships, i.e. keys, join tables, indexes
SQLite Benifits
- Serverless
- No configuration
SQLite issues
- Not multi-user
- No concurrency
- Just a file
creating a table
SQL
CREATE TABLE table
(
column TYPE (NOT) NULL,
…
PRIMARY KEY (column,…)
)
types
SQL
INTEGER
REAL
TEXT
BLOB - any uninterpretable data, e.g. image
NULL
Renaming a table
SQL
ALTER TABLE oldTable
RENAME TO newTable
Deleting a table
SQL
DROP TABLE table
Retrieving data
SQL
Done with SELECT column,… followed by FROM tables and then any number of optional further constraints.
ORDER BY
SQL
ORDER BY column ASC/DESC
LIMIT
SQL
LIMIT x. Result only shows the frist x results
WHERE
SQL
WHERE condition. Result only shows the rows that match the given condition. At most basic level is column = value. Can use AND, OR and NOT, comparison(>), !=.
LIKE
SQL
Used in a WHERE condition when using % is a string as a wildcard. e.g. WHERE id LIKE ‘%1%’
(NOT) IN
SQL
Used in a WHERE condition when checking if the column values is in a given list. e.g. WHERE id NOT IN (1,3,5)
JOIN
SQL
Allows for selecting data from multiple tables. Done with: JOIN table2 ON table1.foreignkey=table2.primarykey. If primary key composite must use ON for every column in the primary key
LEFT JOIN
SQL
Returns all combined rows and the rows from the first table that do not match a row in the second table. e.g. first table people and second table banks, if a person has a bank not in the second table still display that persons row
using multiple tables in SQL statement
SQL
must reference columns by their table. e.g. table1.column
AS
SQL renaming
SELECT table.column1 AS column1, table2.column2 AS column2. Allows for renaming of columns for a query
INSERT
SQL
INSERT INTO table (column,…)
VALUES (value,…)
Adds an entry to the table. The list of columns can be omitted if the entry has values for all columns. Can use a SELECT query in place of VALUES
View
SQL
Acts as a virtual table. Is a query allowing for data to be seen from tables in a specific way but does not store any data it self
View Commands
SQL
CREATE VIEW view AS SELECT …
DROP VIEW view
UPDATE
SQL
Specifies new values for columns in the table.
UPDATE table
SET column=value…
WHERE conditions
WHERE statement determines which rows the changed will be applied to but is optional
DELETE
SQL
DELETE FROM table
WHERE conditions
WHERE statement determines which rows will be deleted but is optional
functions
SQL
Allows for applying many different opertaions to a database. Replaces a column after SELECT.i.e. SELECT function(column) FROM table
GROUP BY
SQL
Allows for function output for rows with the same given column value to have the function applied to separately. e.g. SELECT student, avg(mark) FROM scores GROUP BY student gives the average mark of each student rather than the total average mark
defining foreign keys
SQL
In CREATE TABLE:
FOREIGN KEY (column) REFERENCES table(column)
This ensures that any values in the current table in the foreign key column is in the other table
Ensuring Referential Integrity
SQL
Add actions when doing CREATE TABLE after a FOREIGN KEY
ON DELETE action - happens when parent record deleted
ON UPDATE action - happens when parent key updated
actions
SQL foreign key referencing
CASCADE - delete/update happens to current table
RESTRICT - prevent delete/update
SET DEFAULT - change value to default
SET NULL - change value to NULL
Index
SQL
Data structure associated with a table to improve query speed. Ordered by value of a key accessed regularly (think of it being stored in a binary tree). Increases table modifaction time
CREATE
SQL Index
CREATE (UNIQUE) INDEX index
ON table(column,…)
UNIQUE ensures only 1 entry of each value will be in the index, is optional
Relational Databases Limitations
- Inflexible - changing requirements are tough and e.g. lists are hard to do
- ACID costs - limits performance and scalability
- JOINs complexity - creates complex queries
- Structure issues - Optional data makes bad tables
Vertical Scaling
Databases
Add more capacity to the server, not great as has a limit and is expensive. Relational databases have to mainly use this
Horizontal Scaling
Databases
- Replication - duplicate data to be stored on many server, expensive and has synchronisation issues
- Partitioning - store parts of the database on many server, prevents joins across tabls in different servers
- Both not good
NoSQL
Databases
Not Only SQL. Can use SQL languages on their databases. Less strict schemas.
Document
Type Of NoSQL
Typically JSON/BSON. Schema free (mostly?). Structured as a tree with nodes being documents
Key-Value
Type Of NoSQL
Contains keys that are linked to values. Often contains partition keys, splits data into partitions, and sort key, gives a single entry
Column
Type Of NoSQL
A row represents an entry but only has columns where there is data for it. Like RDBMS and Key-Value
Graph
Type Of NoSQL
Entries represented as nodes with edges connexting the data together
Sharding
NoSQL
Distributing data across multiple nodes, allows for easy scaling
Hot Partition
NoSQL
When one databse is overloaded with traffic while others are underused
Consistency Levels
Databses
strict/strong - all reads must wait for writes to be consistent
sequential - all writes happen in order
causal - opertations that can change the outcome of each other happen in order
eventual - data will eventually be stored correctly
CAP
databases
consistency - all users see the same data
availabilty - all users can always get a response
partition tolerance - the database works if communication breaks between nodes
CAP theorem
databases
For databases with data split across multiple servers only 2 of the three attributes of CAP can be met
NoSQL Advantages
- fast and simple
- Flexible structure choices
- Easily scale horizontally infinately
NoSQL Issues
- Design has to be done right early
- Can only access data in the way it’s designed
- Functions are very difficult to use
- Changes later on get very expensive
MongoDB
NoSQL
Document-based using JSON/BSON documents. The documents are key value pairs, but MongoDB is not a key-value type database. Can have schema
MongoDB relationships
NoSQL
Can add sections to certain documents referencing another document
MongoDB Indexes
NoSQL
Needed to allow for searching by certain fields and/or sorting by them
DynamoDB
NoSQL
AWS key-value based serverless NoSQL database. Everything stored in 1 table
DynamoDB issues
NoSQL
- Only allows basic lookups
- The data has to be modelled specifically for this purpose, so difficult to leave
DynamoDB secondary indexes
NoSQL
Replace the partition and/or sort key automatically with a secondary index allowing for entries to be searched by some of it’s data.
DTD
XML
Document Type Definition - Defines the structure of an XML document. (kinda like a schema)
XSD
XML
XML Schema Definition - Using XML defines a structure for an XML document.
XML Suite
XML
Syntax - How XML is written
Namespaces - Gives IDs to tags to make them unique
Schema - Defines how an XML document is structured
XPATH - Finds data in an XML document
Schema Benefits
XML
- Ensures valid file
- Can be used as template
- Identifies errors
- Eases Parsing
Normalisation Issues
- More Complex
- More tables/relationships
- Longer Queries
Denormalisation
Improves query speed by adding data to uneeded location but increases operation complexity
Identifying/non-identifying
Databases
When a foreign key is/is not part of the primary key of a child table
Cardinality
Databases
Maximum number of times an entity can be related to another entity
Modality
Databases
Minimum number of times an entity can be rated to another
Turning on foreign key constraints
SQL
PRAGMA foreign_keys = ON;
Ensures that values in a foreign key must be in the table that it is referencing
OS Potential Features
- Multi-user - Allows for multiple people to use a computer concurrently/at different times
- Multi-processing - Utilizing multiple processors
- Multi-tasking - Running multiple processes at the same time
~
UNIX
Used in a file path to go to the home directory e.g. ~/downloads