Data Management Flashcards
Operating System
Intermediary between software and hardware, managing hardware allocation.
UNIX Philosophy
- Each program does 1 thing well
- Output of every program expected to be input of another
- Try software early, and expect to throw away and rebuild the clumsy parts
- Use tools in preference to unskilled help to lighten a programming task
Linux Benefits
- Largely virus-free, as users have limited system access and few viruses are written for Linux
- Kernel is separate from the rest of the OS, preventing bugs elsewhere in the OS from crashing the whole system
Index Node (inode)
Describes a file-system object (file/directory). Stores attributes and location of data (metadata).
Inode number
References an inode. Associated with a file object name.
Benefits of separating metadata
- Allows for fast moving of files
- Can alter a file while it is opened by another application.
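A minimal sketch of the fast-move benefit, assuming a POSIX-style shell with mktemp available and both paths on the same file system. The file names are hypothetical. Renaming only changes which name points at the inode, so the inode number should stay the same.

```shell
# Work in a throwaway directory (hypothetical file names).
workdir=$(mktemp -d)
echo "some data" > "$workdir/old_name.txt"

# ls -i prints the inode number in front of the file name.
inode_before=$(ls -i "$workdir/old_name.txt" | awk '{print $1}')

# Moving within one file system only rewrites the directory entry.
mv "$workdir/old_name.txt" "$workdir/new_name.txt"
inode_after=$(ls -i "$workdir/new_name.txt" | awk '{print $1}')

echo "before: $inode_before  after: $inode_after"
rm -r "$workdir"
```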
pwd
Gives the current absolute path.
ls
Lists the files at the current location.
cd
cd [directory] - Changes the current directory.
Meaning of UNIX files starting with a dot?
They are hidden.
man
man [cmd] - Gives help with command.
mkdir
mkdir [directory name] - Creates a directory (folder).
rmdir
rmdir [directory name] - Removes a directory, which must be empty.
touch
touch [filename] - creates empty file.
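A small sketch of mkdir, touch and rmdir working together, assuming a shell with mktemp available. The directory and file names are hypothetical. It shows that rmdir refuses a directory that still contains a file.

```shell
base=$(mktemp -d)                # throwaway location for the demo
mkdir "$base/notes"              # create a directory
touch "$base/notes/todo.txt"     # create an empty file inside it

# rmdir refuses to remove a directory that is not empty.
if rmdir "$base/notes" 2>/dev/null; then
    first_try="removed"
else
    first_try="refused"
fi

rm "$base/notes/todo.txt"        # empty the directory first
rmdir "$base/notes"              # now rmdir succeeds
rmdir "$base"
echo "first attempt: $first_try"
```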
cat
Displays the entire contents of a file.
less
Displays part of file allowing for forwards and backwards movement.
head
UNIX
Top (default) 10 lines of file
tail
Bottom (default) 10 lines of file
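A short sketch of head and tail, assuming the seq utility is installed (it is used here only as a stand-in for a real 25-line file). The -n option sets how many lines are shown instead of the default 10.

```shell
# seq 1 25 prints the numbers 1 to 25, one per line.
default_head=$(seq 1 25 | head | wc -l | tr -d ' ')   # no option: first 10 lines
first_three=$(seq 1 25 | head -n 3)                    # first 3 lines
last_two=$(seq 1 25 | tail -n 2)                       # last 2 lines

echo "head with no option printed $default_head lines"
echo "$first_three"
echo "$last_two"
```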
program to program piping
program1 | program2 (program 1 output goes to program2 input)
program to file piping
program > file - program output is written to file, overwriting it. >> (no space between the two characters) appends instead
file to program piping
program < file - program takes input from file
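A minimal sketch of the three redirections, assuming a shell with mktemp available. The scratch file is hypothetical. tr is used only to strip the padding that some versions of wc add.

```shell
log=$(mktemp)                                  # hypothetical scratch file
echo "first line"  > "$log"                    # > overwrites the file
echo "second line" >> "$log"                   # >> appends to it
line_count=$(wc -l < "$log" | tr -d ' ')       # < feeds the file to wc as input
echo "lines: $line_count"
rm "$log"
```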
filter
Program that accepts text and changes it
pipe
Connection between two filters
wc
Prints the number of lines, words and bytes (use -m to count characters)
uniq
Removes duplicate adjacent lines
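A sketch of filters joined by pipes, using a hypothetical list of fruit names. Because uniq only removes adjacent duplicates, the usual pattern is to sort first. The sort command is not covered in these cards and is brought in here as an assumption.

```shell
# Hypothetical input: one fruit per line, with repeats that are not adjacent.
fruits=$(printf 'apple\nbanana\napple\ncherry\nbanana\napple\n')

# uniq alone: no two equal lines are adjacent, so nothing is removed.
unsorted_unique=$(printf '%s\n' "$fruits" | uniq | wc -l | tr -d ' ')

# sort makes duplicates adjacent, uniq collapses them, wc -l counts what is left.
sorted_unique=$(printf '%s\n' "$fruits" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted_unique lines; sort | uniq: $sorted_unique lines"
```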
Version Control System
Records changes to files over time so they can be undone and viewed
Local VCS
Database on your computer holds all the changes. Does not allow collaboration
Centralised VCS
Single server stores changes, with users checking out files. Allows for collaboration but single point of failure
Distributed VCS
Each user holds a full copy of the repository (all changes) locally as well as on the server, with changes copied between the copies.
VCS add
put file in local repository
VCS commit
commit changes to file to local repository
VCS commit message
Message describing a commit
VCS check in/push
Upload local repository content to remote repository
VCS check out/pull
Download file from remote repository
Conflict
When changes made cannot be merged automatically, resolved by manually applying changes to latest version
reverse integration
Copies new features from a branch to main once they are ready, keeping unfinished code out of main
Forward Integration
Copies latest changes from main to branch, keeping branch up to date
ps
Views current processes
top
CPU usage of processes
kill | UNIX
kill <PID> - sends a signal to the process with process ID <PID>
Signals:
SIGTERM - requests the process to stop, giving it time for a graceful shutdown (the default)
SIGKILL - forces the process to stop execution immediately
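A minimal sketch of stopping a process with SIGTERM, assuming bash or a similar shell. The sleep command stands in for any long-running program. The exit status convention shown in the comment (128 plus the signal number) is what bash reports; other shells may differ.

```shell
sleep 60 &                 # start a long-running process in the background
pid=$!                     # $! holds the PID of the most recent background job

kill -TERM "$pid"          # polite request to stop (plain kill "$pid" sends the same signal)
status=0
wait "$pid" || status=$?   # collect the exit status of the stopped process

# In bash a process ended by a signal reports 128 + signal number; SIGTERM is 15.
echo "exit status: $status"

# kill -KILL "$pid" would instead force the process to stop with no chance to clean up.
```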
bg/fg
UNIX
Moves a process to the background/foreground
screen
UNIX
Allows for creation of screens to run processes in the background
Environment Variables
Accessible by all processes run in the shell
PATH
Ordered list of directories that the shell searches for executables to run
export
Sets environment variables: export variableName='value'. Lists all exported variables if given no argument
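A short sketch of what export changes, assuming a POSIX-style shell. COURSE_NAME is a hypothetical variable name. A plain shell variable is not passed to child processes, while an exported one is.

```shell
# COURSE_NAME is a hypothetical variable name used only for this demo.
unset COURSE_NAME
COURSE_NAME='data management'                          # plain shell variable
before=$(sh -c 'echo "${COURSE_NAME:-not visible}"')   # child process cannot see it

export COURSE_NAME                                     # mark it as an environment variable
after=$(sh -c 'echo "${COURSE_NAME:-not visible}"')    # child process now sees it

echo "before export: $before"
echo "after export:  $after"
```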
grep
Searches for lines containing the given pattern. grep [pattern] [input]
Special Characters
Regular Expressions (grep)
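* - zero or more of the previous character
? - zero or one of the previous character (needs grep -E)
+ - one or more of the previous character (needs grep -E)
. - any single character
[] - any one character from the given set or range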
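A sketch of each special character against a hypothetical word list. grep -E is used so that ? and + work without backslashes. The ^ and $ anchors (start and end of line) are not covered on this card and are added here so that each pattern has to match the whole word.

```shell
# Hypothetical word list, one word per line.
words=$(printf 'color\ncolour\ncolouur\ncat\ncut\nct\n')

# ? : the u is optional                -> color, colour
optional=$(printf '%s\n' "$words" | grep -E '^colou?r$')
# + : one or more u                    -> colour, colouur
one_or_more=$(printf '%s\n' "$words" | grep -E '^colou+r$')
# * : zero or more u                   -> color, colour, colouur
zero_or_more=$(printf '%s\n' "$words" | grep -E '^colou*r$')
# . : exactly one character, any kind  -> cat, cut
any_char=$(printf '%s\n' "$words" | grep -E '^c.t$')
# []: one character from the set       -> cat
from_set=$(printf '%s\n' "$words" | grep -E '^c[ab]t$')

echo "$optional"
```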
sed
Takes in text and modifies it. sed [command] [file]. Example commands: 's/Hello/Hi/' (replaces the first Hello on each line with Hi; adding g, as in 's/Hello/Hi/g', replaces every instance on the line), '/Run/d' (deletes all lines that contain Run)
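A sketch of the substitute and delete commands on three hypothetical lines of text. The text is piped into sed here rather than read from a file, which sed accepts when no file is given.

```shell
text=$(printf 'Hello Hello world\nRun the tests\nSay Hello\n')

# Without g: only the first Hello on each line is replaced.
first_only=$(printf '%s\n' "$text" | sed 's/Hello/Hi/')
# With g: every Hello on each line is replaced.
every=$(printf '%s\n' "$text" | sed 's/Hello/Hi/g')
# /Run/d deletes each line that contains Run.
no_run=$(printf '%s\n' "$text" | sed '/Run/d')

echo "$first_only"
echo "$every"
echo "$no_run"
```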
awk
Allows for processing of tables. awk 'pattern {action}' [file]. By default actions are run on every line. $N gives column N ($0 is the whole line)
BEGIN
for awk
action run once at the start
END
for awk
action run once at the end
applying conditions
for awk
(condition){action}, action only run on lines that meet condition
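A sketch that combines BEGIN, a condition and END in one awk program. The table of names and scores is hypothetical, and the pass mark of 50 is an assumption made for the example.

```shell
# Hypothetical table: name and score separated by a space.
scores=$(printf 'alice 70\nbob 45\ncarol 85\n')

report=$(printf '%s\n' "$scores" | awk '
    BEGIN     { print "Passed:" }           # runs once, before any input is read
    $2 >= 50  { print $1; total += $2 }     # runs only on lines that meet the condition
    END       { print "Total:", total }     # runs once, after the last line
')
echo "$report"
```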
LaTeX benefits over Word
- Easily allows for displaying of complex equations
- Can compile large documents easily
- Placement of figures and tables is easy
- Automatic referencing
- OS independent
Creating LaTeX documents
Typed as a .tex file and the LaTeX engine compiles to .pdf
Math Mode
LaTeX
Open and close with $. Allows for mathematical symbols
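A minimal sketch of math mode inside a complete document, assuming the standard article class. The quadratic formula is just an illustrative choice of equation.

```latex
\documentclass{article}
\begin{document}
% Everything between a pair of $ signs is typeset in math mode.
The roots of $ax^2 + bx + c = 0$ are
$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.
\end{document}
```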
Wildcards
UNIX
Allows for operating on multiple files at once
* Any characters
? Any single character
[] One character out of those given
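A sketch of the three wildcards on a handful of hypothetical files, assuming a shell with mktemp available. The shell expands each pattern into the matching file names before the command runs.

```shell
dir=$(mktemp -d)
# Hypothetical file names.
touch "$dir/a1.txt" "$dir/a2.txt" "$dir/b1.txt" "$dir/a10.txt" "$dir/notes.md"

cd "$dir"
txt_count=$(ls *.txt | wc -l | tr -d ' ')   # * matches any run of characters
single=$(echo a?.txt)                       # ? matches exactly one character, so not a10.txt
from_set=$(echo [ab]1.txt)                  # [] matches one character out of a or b
cd - > /dev/null
rm -r "$dir"

echo "$txt_count files end in .txt"
echo "a?.txt matched: $single"
echo "[ab]1.txt matched: $from_set"
```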
chmod
UNIX
Changes permissions for files/directories. In numeric mode each digit is a 3-bit binary number (read = 4, write = 2, execute = 1), one digit each for owner, group and other
ls -l permission information
The first column shows permission information as a 10-character string. The first character shows directory/file (d or -), and the rest splits into 3-character chunks for each accessor (owner, group and other). The three characters show whether the file/directory is readable/writable/executable (r, w, x, or - if not).
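A sketch linking numeric chmod modes to the ls -l permission string, assuming a shell with mktemp available. The scratch file is hypothetical. The symbolic form u+x is not covered on these cards and is shown as an extra. cut keeps only the first 10 characters of the ls -l line.

```shell
file=$(mktemp)               # hypothetical scratch file

# 6 = 110 (rw-) for owner, 4 = 100 (r--) for group, 0 = 000 (---) for other.
chmod 640 "$file"
numeric=$(ls -l "$file" | cut -c1-10)

# Symbolic form: add execute permission for the owner (u).
chmod u+x "$file"
symbolic=$(ls -l "$file" | cut -c1-10)

echo "$numeric then $symbolic"
rm "$file"
```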
Directory Permissions
UNIX
A directory with execute permission can be entered (cd) and the files inside it accessed
TSV
Tab Separated Values, a form of structured data
Benefits of machine readable data
- Searching
- aggregation and summation
- Prediction
- Linking - links info from different sources
Relationship Modelling
A way to make human data machine readable. Involves creating a model that shows relationships between elements
Hierarchical
Relationship Modelling
Entities are connected with each other and attributes in a tree structure
Network
Relationship Modelling
Entities and attributes are connected in a directional graph.
Object Oriented
Relationship Modelling
Entities and attributes are connected in a directional graph. Additionally, entities are classes, and each attribute value comes with a pointer to its attribute
Entity Relational
Relationship Modelling
Entities are now tables. They are linked by a key property
Markup
Addition of metadata to a document. Adds structure and additional meaning to text, allowing a machine to glean meaning from it
YAML
Uses whitespace for structure to allow for easy reading but harder writing. Written with key-value pairs, i.e. variableName: data
Uses of YAML
- Config Files
- Passing data between applications
- Storing simple application states
Issues with YAML
The syntax can be ambiguous, so different parsers (code that splits up text) may give different results. Less widely supported than JSON
JSON
Stores objects. Subset of YAML. Can be read by most languages. Contains:
* Objects
* Values - "key": "value"
* Lists - "key": [value1, value2]
JSON uses
Sending data on the web/between programs. Sometimes used for config data
HTML
A markup language used for documents with hypertext (links). Tags say how to display data, e.g. <text> TEXT HERE </text>
Liquid | structured data
Template language created by Shopify
SGML
Markup
Standard Generalised Markup Language - A standard for defining markup languages. Parent of markup languages such as XML and HTML
SGML issues
Markup
- complex
- no strict structure
- Requires a definition of structure
XML benefits
Markup
- Easier to parse
- Simplifies SGML
- Don’t need to define structure
XML
Markup
eXtensible Markup Language - Hierarchical with only tags, attributes and content. Made to carry data not display data. No defined tags
XML Syntax
Markup
Defines how it is written:
* closing tags for all tags
* case sensitive
* must have root element
* attributes are quoted
Schema
XML
Used as a template to ensure an XML file is written in a certain way
SimpleType
XML element
Only contains text
ComplexType
XML element
Can contain attributes and children
Namespaces
XML
Gives tags a prefix so that tags with the same name can be told apart. Declared with xmlns:<localname>="someurl". All tags in the namespace are then written as <localname>:tag
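A sketch of two namespaces keeping same-named tags apart, written to a scratch file from the shell. The prefixes book and film, the example.com URLs and the catalogue content are all hypothetical. grep is used only to count lines and is not a real XML parser.

```shell
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0"?>
<catalogue xmlns:book="http://example.com/book"
           xmlns:film="http://example.com/film">
  <book:title>Example Title</book:title>
  <film:title>Example Title</film:title>
</catalogue>
EOF

# Both elements are called title, but the prefix says which namespace each belongs to.
book_titles=$(grep -c '<book:title>' "$xml_file")
film_titles=$(grep -c '<film:title>' "$xml_file")
echo "book titles: $book_titles, film titles: $film_titles"
rm "$xml_file"
```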