Data Analysis Tools Flashcards
Purpose and Outputs of data integration activities
Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping and transformation.
Data integration ultimately enables analytics tools to produce effective, actionable business intelligence.
Functional Requirements
Define a system or its components
Describe the functions a software must perform
A function is described by its inputs, its behaviour and its outputs
E.g. a calculation, data manipulation, business process or user interaction
Help you capture the intended behaviour of the system
Behaviour may be expressed as functions, services or tasks
Non-functional requirement
Define the quality attributes of a software system
They represent a set of standards used to judge the specific operation of a system
E.g. how fast does the website load
Essential to ensure the usability and effectiveness of the entire software system
Allow you to impose constraints or restrictions on the design of the system across the various agile backlogs
Speed of Data Integration
Faster for data warehouses than transactional relational databases because warehouse access is read-only (analytical queries rather than transactional writes).
Data warehouses can handle large analytical queries, which avoids the problem of running those queries against transactional (OLTP) databases.
Loading data into warehouses (ETL) is usually carried out in batches, and the warehouse may be unavailable while data is being loaded.
Having data integrated into a single source saves time (it doesn't take as long to prepare and analyse the data).
Data Integration: Structure and Rules
- Security policies
- Access Layers
- Should be immutable (the content of the integrated data destination cannot be changed)
- Validation checks should be carried out during ETL
- Validate the source and target table structure and data types
- Validate the column names against a mapping doc
- Verification done using ETL testing (see the sketch below)
- Verify that the data is accurate
- Verify the data is the right data required to be in the data warehouse
- Verify that the data has no duplicates in the data warehouse
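A minimal sketch of such validation and verification checks, assuming hypothetical source and warehouse tables src_orders and dw_orders with an order_id key (names are illustrative only); information_schema is available in most relational databases:

-- Compare row counts between source and target
SELECT (SELECT COUNT(*) FROM src_orders) AS source_rows,
       (SELECT COUNT(*) FROM dw_orders)  AS target_rows;

-- Check target column names and data types against the mapping doc
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'dw_orders';

-- Verify the warehouse holds no duplicates
SELECT order_id, COUNT(*) AS copies
FROM dw_orders
GROUP BY order_id
HAVING COUNT(*) > 1;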
Rationale for using and integrating data from multiple sources
- Consistency
- Identifying missing data and gaps that need to be filled, through ETL
Benefits of integrating data from multiple sources
- Increased collaboration across teams
- Better business intelligence and insights
- Data availability to all stakeholders
To consider: Data in a business context
- Variability: illustrates how things differ, and by how much
- Uncertainty: Good visualisation practices frame uncertainty that arises from variation in data
- Context: Meaningful context helps us frame uncertainty against underlying variation in data
Descriptive Analysis
Looks at past data and tells what happened. Often used when tracking KPIs, revenue and sales
Diagnostic Analysis
Aims to determine why something happened
Predictive Analysis
predicts what is likely to happen in the future
Trends are derived from past data and used to create predictions
Prescriptive Analysis
Combines the information found from the previous three types of data analysis and forms a plan of action for the organisation to address the issue or decision
Data Warehouse
An industry-standard tool that allows the collection and organisation of data originating from various sources is the data warehouse.
Reasons for using data from multiple sources
- allows the analyst to pull together data to resolve issues
- perform data analysis
- obtain a 360 view of a problem
- assist with decision making, among other benefits
Quality of Data Sources - things that could go wrong
Data Truncation - precision is lost / wrong data types
Data Corruption - commas in the wrong place
Data Missing - only a portion of the data set is uploaded
Data Typing - wrong data type uploaded into field
Data Translation - Wrong encoding/symbols
To ensure better quality of data (illustrated below):
- Checksums
- Spot checks (eyeballing)
- Min / Max / Aggregates
- Counts
- Export comparison
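A minimal sketch of the count and min/max/aggregate checks, assuming a hypothetical sales table with an amount column; the results are compared against the same figures taken from the source export:

-- Row count to compare against the source export
SELECT COUNT(*) AS row_count FROM sales;

-- Min / Max / Aggregates to spot truncation, corruption or missing data
SELECT MIN(amount) AS min_amount,
       MAX(amount) AS max_amount,
       SUM(amount) AS total_amount,
       AVG(amount) AS avg_amount
FROM sales;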
Data Integration
the process of combining data from different sources to help data managers and executives analyse it and make smarter business decisions
Data Integration Process involves:
A person or system locating, retrieving, cleaning, and presenting the data
Manual Data Integration
Occurs when a data manager oversees all aspects of the integration, usually by writing custom code, without automation
Best for one-time instances; untenable for complex or recurring integrations because it is a tedious, manual process.
Manual Data Integration: Benefits
Reduced cost: requires little maintenance and typically only integrates a small number of data sources
Greater Freedom: user has total control over the integration
Manual Data Integration: Limitations
Less access: A developer or manager must manually orchestrate each integration
Difficulty scaling: Scaling for larger projects requires manually changing the code of each integration, and that takes time
Greater room for error: A manager/analyst must handle the data at each stage
Middleware data integration
Software that connects applications and transfers data between them and databases. Middleware can act as an interpreter between old and new systems
Mostly it is a communication tool and has limited capabilities for data analytics
Middleware data integration: Benefits
Better data streaming: the software conducts the integration automatically
Easier access between systems: software is coded to facilitate communication between the systems in a network
Middleware data integration: Limitations
Less access: middleware needs to be developed and maintained by a developer with technical knowledge
Limited functionality: can only work with certain systems
Application-based integration
Software applications do all the work: locate, retrieve, clean, integrate data from disparate sources. Easy for data to move from one source to the other
Application-based integration: Benefits
Simplified processes: One application does all the work automatically
Easier information exchange: the application allows systems and departments to transfer information seamlessly
Fewer resources are used: because much of the process is automated, analysts can pursue other projects
Application-based integration: Limitations
Limited access: requires technical knowledge and an analyst to oversee application maintenance/deployment
Inconsistent results: the approach is unstandardised and varies between businesses offering this as a service
Complicated set-up: designing requires developers/analysts etc… with technical knowledge
Difficult data management: accessing different systems can lead to compromised data integrity
Uniform access integration
Accesses data from even more disparate sets and presents it uniformly, while allowing the data to stay in its original location
Uniform access integration: Benefits
Lower storage requirements
Easier data access (works well with multiple data sources/systems)
Simplified view of data: uniform appearance of data for the end user
Uniform access integration: Limitations
Data integrity challenges: lots of sources can lead to compromising data integrity
Strained systems: Data host systems are not usually designed to handle the amount/frequency of data requests
Common storage integration: (data warehousing)
Similar to uniform access, but involves creating and storing a copy of the data in a data warehouse, which offers more versatility
Common storage integration (data warehousing): Benefits
Reduced burden: host system isn’t constantly handling data queries
Increased data version management control: one source = better data integrity
Cleaner data appearance
Enhanced data analytics: maintaining a stored copy allows more sophisticated queries
Common storage integration (data warehousing): Limitations
Increased storage costs: creating a copy = paying for storage
Higher maintenance costs: the stored copy and the ETL processes that feed it must be maintained and kept in sync with the sources
WHEN TO USE IT:
Manual Data Integration
Merge data for basic analysis between a small amount of data sources
WHEN TO USE IT:
Middleware Data Integration
Automate and translate communication between legacy and modernised systems
WHEN TO USE IT:
Application-based Data Integration
Automate and translate communication between systems and allow for more complicated data analysis
WHEN TO USE IT:
Uniform-based Data Integration
Automate and translate communication between systems and present the data uniformly to allow for complicated data analysis
WHEN TO USE IT:
Common storage Data Integration
Present the data uniformly, create and store a copy and perform the most sophisticated data analysis
Common user interface (or Virtual Integration)
Leaves the data in the source systems and defines a set of views that provide a unified view to the consumer across the whole enterprise
pros: nearly zero latency of the data updates
cons: limited possibility of data history and version management; similar limitations to uniform access integration apply
Physical Data Integration: ETL
Extract:
The process of reading data from multiple data sources into the Data Warehouse; the data is COPIED (not moved).
Data validation occurs during this stage; you must have the correct:
- structure
- format
- permissions
Method of extraction - for each data source, define whether the extraction process is manual or tool based.
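A minimal sketch of an extract step written in SQL, assuming a hypothetical source table crm_customers that is reachable from the warehouse and a staging table stg_customers (both names are illustrative):

-- Copy (not move) the source rows into a staging area
INSERT INTO stg_customers (customer_id, customer_name, signup_date)
SELECT customer_id, customer_name, signup_date
FROM crm_customers;      -- the source table is left untouched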
Physical Data Integration: ETL
Transform:
The process of combining the data tables or linking them using relational databases.
The data in the source systems is not changed in any way.
Data engineers must consider the efficiency of the databases as well as ensuring that all necessary data can be accessed.
Before moving the data into a single point of reference we need to (see the sketch after this list):
- remove inconsistencies
- Standardise various data elements
- Make sure of the meanings of the data names in each file
- Deduplication
- Deriving new calculated values
- Data validation
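A sketch of typical transform steps, assuming hypothetical staging tables stg_order_lines and stg_customers; the standardisation rules and the derived value are illustrative only:

-- Standardise elements and derive a new calculated value
SELECT UPPER(TRIM(country_code))  AS country_code,   -- standardise format
       CAST(order_date AS DATE)   AS order_date,     -- consistent data type
       quantity * unit_price      AS line_total      -- derived calculated value
FROM stg_order_lines;

-- Deduplication: keep one row per customer
SELECT DISTINCT customer_id, customer_name, country_code
FROM stg_customers;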
Physical Data Integration: ETL
Load:
The process of writing the data to the target database
Due to the nature of Big Data, it is often necessary to use parallel processing to manage the volume of data being written to the system.
Data Verification is undertaken post-loading to ensure the data is accurate.
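A minimal sketch of the load step followed by a post-load verification, assuming hypothetical staging and warehouse tables stg_orders_clean and dw_orders:

-- Write the transformed rows to the target warehouse table
INSERT INTO dw_orders (order_id, customer_id, order_date, line_total)
SELECT order_id, customer_id, order_date, line_total
FROM stg_orders_clean;

-- Post-load verification: the loaded row count should match the staging count
SELECT (SELECT COUNT(*) FROM stg_orders_clean) AS staged_rows,
       (SELECT COUNT(*) FROM dw_orders)        AS loaded_rows;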
Sequence
It is the order in which we want the computer to execute the instructions we provide as programmers.
Selection
Selecting which path of an algorithm to execute depending on some criteria.
Iteration
Refers to looping or repeating procedures.
SIMPLE QUERY
SELECT FROM WHERE ORDER BY HAVING LIMIT DISTINCT
SELECT column-names = the columns to return
FROM table-name = the table to query
WHERE condition = filters on rows
ORDER BY sort-order = sorts the result
HAVING = filters groups produced by GROUP BY
LIMIT = restricts the number of rows
DISTINCT = brings back unique values
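A worked example putting the clauses together (written order is SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT), assuming a hypothetical orders table and a MySQL-style dialect for LIMIT:

SELECT customer_id,
       COUNT(*)   AS order_count,
       SUM(total) AS total_spent
FROM orders                          -- table to query
WHERE order_date >= '2023-01-01'     -- filter on rows
GROUP BY customer_id                 -- group rows per customer
HAVING SUM(total) > 100              -- filter on the groups
ORDER BY total_spent DESC            -- sort order
LIMIT 10;                            -- restrict number of rows

SELECT DISTINCT customer_id FROM orders;   -- DISTINCT brings back unique values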
Union Rules
UNION = combines the results of multiple SELECT statements (often from multiple tables) into a single query result
- Must match the number of columns, with compatible data types
- Can only have one ORDER BY, at the bottom of the full select statement
- UNION removes exact duplicates, whereas UNION ALL allows duplicates
- Conditions between the UNION'd SELECT statements should match (example below)
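A minimal sketch, assuming hypothetical online_orders and store_orders tables with the same number of columns and compatible data types:

SELECT order_id, order_date, total FROM online_orders
WHERE order_date >= '2023-01-01'
UNION                                 -- removes exact duplicates (UNION ALL would keep them)
SELECT order_id, order_date, total FROM store_orders
WHERE order_date >= '2023-01-01'
ORDER BY order_date;                  -- the single ORDER BY sits at the bottom of the full statement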
SQL Expressions: CASE
Returns different values depending on conditions; often used to group data into categories or classifications
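A minimal sketch, assuming a hypothetical customers table with a total_spent column:

SELECT customer_id,
       CASE
           WHEN total_spent >= 1000 THEN 'gold'
           WHEN total_spent >= 100  THEN 'silver'
           ELSE 'bronze'
       END AS customer_tier           -- groups values into categories
FROM customers;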
SQL Expressions: DATETIME
type of variable that allows storing a date/time value in the database YYYY-MM-DD HH:MI:SS
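A minimal sketch, assuming a hypothetical events table whose created_at column is a DATETIME:

-- Store and then filter on a DATETIME value (YYYY-MM-DD HH:MI:SS)
INSERT INTO events (event_name, created_at)
VALUES ('login', '2023-06-01 14:30:00');

SELECT event_name, created_at
FROM events
WHERE created_at >= '2023-06-01 00:00:00';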
SQL Expressions: Compound
mathematical and related operations within the relational database: arithmetic, comparison, logical, string
Arithmetic
+ - * / (addition, subtraction, multiplication, division); exponentiation ("to the power of") is usually written as POWER(x, y), as ** is not supported in most SQL dialects
Comparison
= Equal to
!= or <> Not equal to
< Less than
<= or !> Less than or equal to (not greater than)
> Greater than
>= or !< Greater than or equal to (not less than)
Logical
ALL / AND Does the value meet ALL criteria
ANY/ OR Does the value meet ANY criteria
BETWEEN Is the value BETWEEN listed values
EXISTS Does a row meeting the criteria Exist
IN Is the value found in the listed literal values
LIKE Compares the value to listed values using wildcard
NOT reverses the meaning of a logical operator
IS NULL checks if value is null
UNIQUE searches for duplicates
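A few of these operators in use, assuming a hypothetical products table (column names and values are illustrative only):

SELECT product_name, price
FROM products
WHERE price BETWEEN 10 AND 50           -- value between listed values
  AND category IN ('books', 'games')    -- value found in listed literals
  AND product_name LIKE 'A%'            -- wildcard comparison
  AND discount IS NOT NULL;             -- NOT reverses IS NULL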
String
CHAR(n) returns the character whose ASCII code is n
CONCAT concatenates items (puts strings together)
FORMAT(x, n) returns the number x formatted to n decimal places
LOWER() returns the argument in lowercase
UPPER() returns the argument in uppercase
REPEAT() repeats the string a certain number of times
TRIM() removes leading and trailing spaces
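A minimal sketch of the string functions, assuming a MySQL-style dialect and a hypothetical customers table (CHAR, FORMAT and REPEAT vary between dialects):

SELECT CONCAT(first_name, ' ', last_name) AS full_name,      -- join strings together
       UPPER(last_name)                   AS surname_upper,  -- uppercase
       LOWER(email)                       AS email_lower,    -- lowercase
       TRIM('  padded  ')                 AS trimmed,        -- strip leading/trailing spaces
       REPEAT('ab', 3)                    AS repeated,       -- 'ababab'
       FORMAT(1234.5678, 2)               AS formatted,      -- '1,234.57'
       CHAR(77)                           AS letter_m        -- character for ASCII code 77
FROM customers;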
Functions
AVG COUNT MAX MIN GROUP BY ROUND CAST CONVERT ISNULL
AVG = average value
COUNT = number of rows
MAX = largest value
MIN = smallest value
GROUP BY = indicates the dimensions to group data by when aggregating
ROUND = specifies the number of decimal places
CAST = changes the data type of an expression (ANSI-standard syntax)
CONVERT = converts the data type of a value (dialect-specific; SQL Server's version adds a style argument)
ISNULL = returns a specified value if the expression is null; if not null, returns the expression
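A minimal sketch of the aggregates and conversions, assuming a hypothetical sales table; ISNULL is SQL Server syntax, so the portable COALESCE is used here instead:

SELECT region,
       COUNT(*)              AS sale_count,     -- number of rows per group
       ROUND(AVG(amount), 2) AS avg_amount,     -- average to 2 decimal places
       MAX(amount)           AS biggest_sale,
       MIN(amount)           AS smallest_sale,
       CAST(SUM(amount) AS DECIMAL(12, 2)) AS total_amount,   -- change the expression's data type
       COALESCE(MAX(discount), 0)          AS max_discount    -- fallback value when NULL
FROM sales
GROUP BY region;              -- the dimension to group the aggregates by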
JOINS
INNER, OUTER, FULL (OUTER), RIGHT, LEFT, UNION, SELECT INTO, SUBQUERIES, EXCEPT, CROSS
INNER = joins only what matches in both tables
OUTER = any join that also keeps non-matching rows (LEFT, RIGHT or FULL)
FULL (OUTER) = combination of a left join and a right join: everything from both tables, matched where possible
RIGHT = includes everything from the right table and anything that matches in the left
LEFT = includes everything from the left table and anything that matches in the right
UNION = two SELECT queries whose rows are combined into one result
SELECT INTO = copies data from one table into a new table
SUBQUERIES = queries within another SQL query, embedded within the WHERE clause as a condition to further restrict the rows
EXCEPT (and outer joins) help to identify missing data between tables
CROSS JOIN = cartesian product (usually not desired; slow performance)
(Examples after this list.)
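A few of these joins sketched out, assuming hypothetical customers and orders tables linked by customer_id (SELECT INTO is SQL Server syntax; MySQL uses CREATE TABLE ... AS SELECT instead):

-- INNER: only customers with matching orders
SELECT c.customer_name, o.order_id
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id;

-- LEFT: every customer, plus matching orders where they exist (NULLs otherwise)
SELECT c.customer_name, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

-- SELECT INTO: copy rows into a new table
SELECT customer_id, customer_name
INTO customers_backup
FROM customers;

-- Subquery embedded in the WHERE clause to further restrict the rows
SELECT customer_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE total > 500);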
Implicit Data Conversion
The server automatically converts the data from one type to another during query processing
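A small illustration, assuming a hypothetical orders table whose order_id column is an integer; the string literal is converted automatically, which can hurt index use and occasionally gives surprising results:

-- order_id is an INT but the literal is a string;
-- the server implicitly converts one side so the comparison can run
SELECT order_id, total
FROM orders
WHERE order_id = '42';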
Data Profiling
Review to determine
Accuracy
Completeness
Validity