Extra Flashcards
Data Overview
-Data = Units of information
-Data Documents = Collective form in which data exists (Datasets, Databases, Datastores, Data Warehouses)
-Data Sets = Logical groupings of units of data that generally are closely related or share the same data structure
-Data Types = How single units of data are intended to be used
-Batch and Streaming Data = How do we move our data around?
-Relational and Non Relational = How do we access, query and search our data?
-Data Modelling = How do we prepare and design our data?
-Schemas and Schemaless = How do we structure our data for search?
-Data Integrity and Data Corruption = How do we trust our data?
-Normalized and Denormalized = How do we trade quality vs speed?
Schema
Schema = A formal language which describes the structure of data of a database.
A schema can define many different data structures that serve different purposes for a relational database:
-Tables, Fields, Relationships, Views, Indexes, Packages, Procedures, Functions, XML schemas, Queues, Triggers, Sequences, etc.
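As a minimal sketch, a schema (tables, relationships, an index) can be defined with Python's built-in sqlite3 module; the table and column names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two tables with a relationship between them.
conn.execute("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(id)  -- relationship to authors
    )
""")

# An index: a sorted copy of one column for faster reads.
conn.execute("CREATE INDEX idx_books_title ON books(title)")

# The schema itself is queryable metadata in the database.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Because the schema is formal, the database can validate every write against it.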
Schemaless
Schemaless = When the primary “cell” of a database can accept many types.
-This allows developers to forgo upfront data modelling
Common schemaless databases are:
-Key/Value
-Document
-Column (Wide Column)
-Graph
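A toy sketch of the schemaless idea, using plain Python dicts to stand in for a document store: each record may carry different fields, so no upfront data modelling is needed (the field names are made up):

```python
# A document "collection": records can have different fields with no migration.
collection = []

def insert(doc):
    collection.append(dict(doc))

insert({"_id": 1, "name": "Ada", "email": "ada@example.com"})
insert({"_id": 2, "name": "Bob", "tags": ["admin"], "age": 42})  # extra fields, accepted as-is

def find(predicate):
    return [d for d in collection if predicate(d)]

# Query across heterogeneous documents, tolerating missing fields.
admins = find(lambda d: "admin" in d.get("tags", []))
```

The flexibility is paid for at read time: every query must handle fields that may or may not exist.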
Query & Querying
A query is a request for data results (reads) or to perform operations such as inserting, updating or deleting data (writes)
Querying is the act of performing a query
Query language is a scripting or programming language designed as the format to submit requests or actions to a database (SQL, GraphQL, Kusto, Gremlin, etc)
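A short sketch of both kinds of query (writes, then a read) using SQL through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Writes: queries that perform operations on the data.
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
conn.execute("INSERT INTO users (name) VALUES (?)", ("Bob",))
conn.execute("UPDATE users SET name = ? WHERE name = ?", ("Robert", "Bob"))

# Read: a query requesting data results.
names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]
```

The `?` placeholders are parameterized queries, the idiomatic way to pass values without string concatenation.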
Relational Data
-Tables = A logical grouping of rows and columns
-Views = Result set of a stored query, computed in memory when accessed
-Materialized Views = Result set of a stored query persisted on disk
-Indexes = A copy of data sorted by one or multiple columns, improving the speed of reads at the cost of storage
-Constraints = Rules applied to writes, that ensure data integrity
-Triggers = A function that is triggered on specific database events
-Primary Key = One or multiple columns that uniquely identify a row in a table
-Foreign Key = A column which holds the primary key value from another table to establish a relationship
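A sketch of primary keys, foreign keys and constraints in action, again with sqlite3 (SQLite only enforces foreign keys when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this to enforce FKs

conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("""
    CREATE TABLE players (
        id      INTEGER PRIMARY KEY,                   -- uniquely identifies a row
        team_id INTEGER NOT NULL REFERENCES teams(id)  -- foreign key to teams
    )
""")

conn.execute("INSERT INTO teams (id, name) VALUES (1, 'Red')")
conn.execute("INSERT INTO players (id, team_id) VALUES (1, 1)")  # valid reference

# A write that violates the foreign-key constraint is rejected;
# this is how constraints on writes ensure data integrity.
try:
    conn.execute("INSERT INTO players (id, team_id) VALUES (2, 99)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```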
Row-store vs Column-store
-RS-
-Data is organized in rows
-Traditional relational databases are row-stores
-Good for general purpose databases
-Suited for Online Transaction Processing (OLTP)
-Great when needing all possible columns in a row
-Not the best at analytics or massive amounts of data
-CS-
-Data is organized by columns
-Faster at aggregating values for analytics
-Typically NoSQL stores or SQL-like databases
-Great for vast amounts of data
-Suited for Online Analytical Processing (OLAP)
-Great when you only need a few columns
Data Integrity & Data Corruption
Data Integrity ensures data is recorded exactly as intended (data quality)
-Have a well-defined and documented data model
-Logical constraints
-Keep redundant copies and versions of your data
-Hash functions
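The hash-function bullet can be sketched with the standard library: store a content hash alongside the data, then re-hash on read to detect corruption (SHA-256 is an assumed choice here):

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content hash used to detect corruption."""
    return hashlib.sha256(data).hexdigest()

original = b"quarterly sales: 42000"
stored_hash = checksum(original)

# Later: re-hash what we read back and compare against the stored hash.
read_back = b"quarterly sales: 42000"
intact = checksum(read_back) == stored_hash          # True: data unchanged

corrupted = b"quarterly sales: 43000"                # a single flipped digit
detected = checksum(corrupted) != stored_hash        # True: corruption caught
```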
Data Corruption is when data is not in its intended state
-Hardware failure
-Human error
-Malicious actors
Normalized vs Denormalized Data
Normalized is a schema design that stores non-redundant and consistent data
-Data integrity is maintained
-Little to no redundant data
-Many tables
-Optimizes for storage of data
Denormalized combines data so that accessing data (querying) is fast
-Data integrity is not maintained
-Redundant data is common
-Fewer tables
-Data is duplicated, so storage is less optimal
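The trade-off in miniature, using plain dicts (the customer/order fields are made up): normalized stores a name once and joins at read time; denormalized copies the name onto every order so reads are a straight scan:

```python
# Normalized: customer name stored once; orders reference it by key.
customers = {1: {"name": "Ada"}}
orders_norm = [
    {"order_id": 10, "customer_id": 1, "total": 50},
    {"order_id": 11, "customer_id": 1, "total": 75},
]

# Denormalized: the name is copied onto every order. Reads need no join,
# but renaming the customer must touch every copy (redundancy risks
# inconsistency, i.e. data integrity is not maintained).
orders_denorm = [
    {"order_id": 10, "customer_name": "Ada", "total": 50},
    {"order_id": 11, "customer_name": "Ada", "total": 75},
]

# Normalized read needs a join-like lookup per order:
report = [(customers[o["customer_id"]]["name"], o["total"]) for o in orders_norm]

# Denormalized read is a direct scan:
report_fast = [(o["customer_name"], o["total"]) for o in orders_denorm]
```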
Pivot Table
Is a table of statistics that summarizes the data of a more extensive table from a database, spreadsheet, or Business Intelligence (BI) tool
-Are a technique in data processing
-Draw attention to useful information
-Lets you find figures and facts quickly
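A minimal pivot built by hand with the standard library: raw sales rows are summarized with regions as rows and quarters as columns (the data is invented for illustration):

```python
from collections import defaultdict

# Raw rows from the "more extensive table".
sales = [
    {"region": "East", "quarter": "Q1", "amount": 100},
    {"region": "East", "quarter": "Q2", "amount": 150},
    {"region": "West", "quarter": "Q1", "amount": 200},
    {"region": "West", "quarter": "Q2", "amount": 50},
]

# Pivot: rows = region, columns = quarter, cell = summed amount.
pivot = defaultdict(dict)
for row in sales:
    cell = pivot[row["region"]]
    cell[row["quarter"]] = cell.get(row["quarter"], 0) + row["amount"]
```

Spreadsheet pivot tables and `pandas.pivot_table` do the same grouping-and-aggregation, just with more options.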
Strongly Consistent vs Eventually Consistent
Data consistency is when data is kept in two different places, and whether those copies exactly match or not
SC = Every time you request data, you can expect consistent data to be returned within X time (e.g. 1 sec); never returns old data
EC = When you request data you may get back inconsistent data within X time (e.g. 2 secs); you get whatever data is currently in the store, new or old
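A toy model of the two guarantees: a primary copy and a lagging replica. Strong consistency reads the primary (always current); eventual consistency reads the replica, which only converges once replication catches up:

```python
# Primary holds the authoritative copy; the replica is updated later.
primary = {}
replica = {}

def write(key, value):
    primary[key] = value          # replica is not updated yet

def replicate():
    replica.update(primary)       # "eventually" the replica converges

def read_strong(key):
    return primary.get(key)       # never returns old data

def read_eventual(key):
    return replica.get(key)       # may return old (or missing) data

write("balance", 100)
stale = read_eventual("balance")      # replica hasn't caught up: no data yet
fresh = read_strong("balance")        # strong read sees the write immediately
replicate()
converged = read_eventual("balance")  # after replication, both copies match
```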
Synchronous vs Asynchronous
Can refer to mechanism for data transmission or data replication
Synchronous = continuous stream of data that is synchronized by a timer or clock
-Can only access data once the transfer is complete
-Guaranteed consistency of data returned at time of access
-Slower access times
Asynchronous = continuous stream of data separated by start and stop bits (no guarantee of time)
-Can access data anytime but may return older version or empty placeholder
-Faster access times, no guarantee of consistency
Non Relational Data
A non-relational database stores data in a non-tabular form and will be optimized for different kinds of data-structures
Types of non-relational databases:
-Key/Value = Each value has a key, designed to scale, only simple lookups
-Document = Primary entity is a JSON-like data-structure called a document
-Columnar = Has a table-like structure but data is stored around columns instead of rows
-Graph = Data is represented with nodes and edges; used where relationships matter
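The graph type can be sketched with nodes plus edges, where the relationships themselves are first-class data (the node names and relation labels are invented):

```python
# Toy graph store: node properties, plus labeled edges between nodes.
nodes = {
    "ada":  {"kind": "person"},
    "bob":  {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = [
    ("ada", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
    ("ada", "KNOWS",    "bob"),
]

def neighbors(node, relation):
    """Follow edges of one relation type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# Relationship-centric query: who works at the same company as ada?
coworkers = [n for n in nodes
             if n != "ada" and "acme" in neighbors(n, "WORKS_AT")]
```

Real graph databases (e.g. via Gremlin, mentioned above) optimize exactly this kind of edge traversal.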
Data Warehouse
A relational datastore designed for analytic workloads, generally a column-oriented data-store.
Companies will have terabytes and millions of rows of data and need a fast way to produce analytics reports.
-They can return queries very very fast even though they have vast amounts of data
-Are infrequently accessed meaning they aren’t intended for real-time reporting
-They need to consume data from a relational database on a regular basis
Data Mart
A data mart is a subset of a data warehouse.
-It will store under 100GB and has a single business focus
-Allows different departments to have control over their own dataset
-Generally designed to be read-only
-Increase the frequency at which data can be accessed
-The cost to query the data is much lower
Data Lakes
A data lake is a centralized storage repository that holds a vast amount of raw data (big data) in either a semi-structured or unstructured format.
Commonly accessed for data workloads such as: Visualizations, Real-time analytics, Machine Learning, On-premise data
Data Concepts
Data Mining is the extraction of patterns and knowledge from large amounts of data (not the extraction of data itself)
Data Wrangling is the process of transforming and mapping data from one “raw” data form into another format.
A Data Model organizes elements of data and standardizes how they relate to one another.
Data Modelling is a process used to define and analyze data requirements needed to support the business processes.
Data Analytics is concerned with examining, transforming, and arranging data so that you can extract and study useful information.
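Data wrangling, as defined above, can be sketched in a few lines: inconsistent raw records transformed into one uniform format (the field names and records are invented):

```python
# Raw records with inconsistent key casing, whitespace and name casing.
raw = [
    {"Name": " Ada Lovelace ", "dob": "1815-12-10"},
    {"name": "GRACE HOPPER",   "DOB": "1906-12-09"},
]

def wrangle(record):
    # Normalize key casing, trim whitespace, standardize name casing.
    r = {k.lower(): v for k, v in record.items()}
    return {"name": r["name"].strip().title(), "dob": r["dob"]}

clean = [wrangle(r) for r in raw]
```

After wrangling, every record has the same shape, which is what downstream data analytics steps rely on.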