Data Analysis Tools Flashcards
Purpose and Outputs of data integration activities
Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping and transformation.
Data integration ultimately enables analytics tools to produce effective, actionable business intelligence.
Functional Requirements
Define a system or its components
Describe the functions a software system must perform
A function is defined by its inputs, its behaviour, and its outputs
e.g. calculation / data manipulation / business process / user interaction
Help you capture the intended behaviour of the system
Behaviour may be expressed as functions, services or tasks
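A functional requirement can be made concrete as a function: inputs, behaviour, output. A minimal Python sketch, assuming a hypothetical order-total calculation requirement:

```python
def calculate_order_total(prices: list[float], tax_rate: float) -> float:
    """Functional requirement: given item prices (inputs), apply tax
    (behaviour) and return the order total (output)."""
    subtotal = sum(prices)
    return round(subtotal * (1 + tax_rate), 2)

# inputs -> behaviour -> output
assert calculate_order_total([10.00, 5.50], 0.20) == 18.60
```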
Non-functional Requirements
Define the quality attributes of a software system
They represent a set of standards used to judge the specific operation of a system
e.g. how fast does the website load?
Essential to ensure the usability and effectiveness of the entire software system
Allow you to impose constraints or restrictions on the design of the system across the various agile backlogs
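Non-functional requirements can be phrased as measurable checks. A minimal sketch, assuming a hypothetical page-loading function and a 2-second load-time budget:

```python
import time

def check_load_time(load_page, budget_seconds: float = 2.0) -> bool:
    """Non-functional requirement: the page must load within the budget."""
    start = time.perf_counter()
    load_page()  # hypothetical callable that loads the page
    elapsed = time.perf_counter() - start
    return elapsed <= budget_seconds
```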
Speed of Data Integration
Faster for data warehouses than for transactional relational databases because warehouse access is read-only.
Data warehouses allow large analytical queries, which avoids the problem of running those queries against transactional (OLTP) databases.
Loading data into warehouses (ETL) is usually carried out in batches, and the warehouse is unavailable while data is being loaded.
Having data integrated into a single source saves time (it doesn't take as long to prepare and analyse the data).
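A minimal sketch of a batch load, using Python's built-in sqlite3 as a stand-in warehouse; the table and column names are illustrative:

```python
import sqlite3

def load_batch(rows: list[tuple]) -> None:
    """Load a batch of (sale_date, region, revenue) rows in one transaction."""
    conn = sqlite3.connect("warehouse.db")
    try:
        with conn:  # one transaction per batch; readers wait until it commits
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sales "
                "(sale_date TEXT, region TEXT, revenue REAL)"
            )
            conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    finally:
        conn.close()

load_batch([("2024-01-01", "EMEA", 1200.0), ("2024-01-01", "APAC", 950.0)])
```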
Data Integration: Structure and Rules
- Security policies
- Access Layers
- Should be immutable (the content of the integrated data destination should not be changeable)
- Validation checks should be carried out during ETL (see the sketch after this list)
- Validate the source and target table structures and data types
- Validate the column names against a mapping doc
- Verification done using ETL testing
- Verify that the data is accurate
- Verify the data is the right data required to be in the data warehouse
- Verify that the data has no duplicates in the data warehouse
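A minimal sketch of the validation checks above, assuming a hypothetical mapping doc expressed as a dict of expected column names and dtypes:

```python
import pandas as pd

# Hypothetical mapping doc: expected column names and dtypes in the target
MAPPING = {"customer_id": "int64", "order_date": "object", "revenue": "float64"}

def validate(df: pd.DataFrame, key: str = "customer_id") -> list[str]:
    """Check column names/types against the mapping doc and flag duplicates."""
    problems = []
    missing = set(MAPPING) - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for col, expected in MAPPING.items():
        if col in df.columns and str(df[col].dtype) != expected:
            problems.append(f"{col}: expected {expected}, got {df[col].dtype}")
    if key in df.columns and df[key].duplicated().any():
        problems.append(f"duplicate values in {key}")
    return problems
```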
Rationale for using and integrating data from multiple sources
- Consistency: resolving missing data and identifying gaps that need to be filled, through ETL
Benefits of integrating data from multiple sources
- Increased collaboration across teams
- Better business intelligence and insights
- Data availability to all stakeholders
To consider: Data in a business context
- Variability: illustrates how things differ, and by how much
- Uncertainty: Good visualisation practices frame uncertainty that arises from variation in data
- Context: Meaningful context helps us frame uncertainty against underlying variation in data
Descriptive Analysis
Looks at past data and tells us what happened. Often used when tracking KPIs, revenue and sales
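A minimal descriptive-analysis sketch with pandas, using made-up sales figures:

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["North", "South", "North", "South"],
    "revenue": [1200, 900, 1350, 875],
})

# Descriptive analysis: summarise what happened, e.g. revenue per month
print(sales.groupby("month")["revenue"].sum())
print(sales["revenue"].describe())  # count, mean, min, max, quartiles
```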
Diagnostic Analysis
Aims to determine why something happened
Predictive Analysis
Predicts what is likely to happen in the future
Trends are derived from past data and used to create predictions
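A minimal predictive sketch: fit a linear trend to past data with numpy and extrapolate one period ahead (the numbers are illustrative):

```python
import numpy as np

months = np.array([1, 2, 3, 4, 5])             # past periods
revenue = np.array([100, 110, 125, 138, 150])  # past observations

slope, intercept = np.polyfit(months, revenue, deg=1)  # fit linear trend
forecast = slope * 6 + intercept                       # predict period 6
print(f"Forecast for month 6: {forecast:.1f}")
```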
Prescriptive Analysis
Combines the information found from the previous three types of data analysis and forms a plan of action for the organisation to address the issue or decision
Data Warehouse
The data warehouse is an industry-standard tool that allows the collection and organisation of data originating from various sources.
Reasons for using data from multiple sources
- allows the analyst to pull together data to resolve issues
- perform data analysis
- obtain a 360° view of a problem
- assist with decision making, among other things
Quality of Data Sources - things that could go wrong
Data Truncation - precision is lost / wrong data types
Data Corruption - commas in the wrong place
Data Missing - only a portion of the data set is uploaded
Data Typing - wrong data type uploaded into field
Data Translation - Wrong encoding/symbols
To ensure better quality of data:
- Checksums
- Spot checks (eyeballing)
- Mins/maxes/aggregates
- Counts
- Export comparison
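A minimal sketch of some of these checks, comparing a source file to its exported copy; the file names are illustrative:

```python
import hashlib
from pathlib import Path

def md5(path: str) -> str:
    """Checksum: identical files must hash identically."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def row_count(path: str) -> int:
    """Count check: the export should have the same number of rows."""
    return sum(1 for _ in open(path))

src, dst = "source.csv", "export.csv"
assert md5(src) == md5(dst), "checksum mismatch: possible corruption"
assert row_count(src) == row_count(dst), "row count mismatch: missing data"
```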
Data Integration
the process of combining data from different sources to help data managers and executives analyse it and make smarter business decisions
Data Integration Process involves:
A person or system locating, retrieving, cleaning, and presenting the data
Manual Data Integration
Occurs when a data manager oversees all aspects of the integration, usually by writing custom code, without automation
Best for one-time instances; untenable for complex or recurring integrations because it is a tedious, manual process
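A minimal sketch of manual integration as hand-written custom code, joining two illustrative CSV extracts on a shared key:

```python
import csv

# Hand-written, one-off integration: join customers.csv and orders.csv
with open("customers.csv") as f:
    customers = {row["customer_id"]: row for row in csv.DictReader(f)}

with open("orders.csv") as f, open("integrated.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["customer_id", "name", "order_total"])
    writer.writeheader()
    for order in csv.DictReader(f):
        customer = customers.get(order["customer_id"], {})
        writer.writerow({
            "customer_id": order["customer_id"],
            "name": customer.get("name", ""),
            "order_total": order["order_total"],
        })
```

Every new source or schema change means editing code like this by hand, which is the scaling limitation noted below.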
Manual Data Integration: Benefits
Reduced cost: requires little maintenance and typically only integrates a small number of data sources
Greater Freedom: user has total control over the integration
Manual Data Integration: Limitations
Less access: A developer or manager must manually orchestrate each integration
Difficulty scaling: Scaling for larger projects requires manually changing the code of each integration, and that takes time
Greater room for error: A manager/analyst must handle the data at each stage
Middleware data integration
Software that connects applications and transfers data between them and databases. Middleware can act as an interpreter between old and new systems
Mostly it is a communication tool and has limited capabilities for data analytics
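A minimal sketch of the interpreter idea: an adapter that translates records from a legacy system's format into the shape a newer system expects (the field names are assumptions):

```python
# Middleware as interpreter: map a legacy record shape onto the new schema
LEGACY_TO_NEW = {"CUST_NO": "customer_id", "CUST_NM": "name", "TEL": "phone"}

def translate(legacy_record: dict) -> dict:
    """Rename legacy fields so the new system can consume the record."""
    return {new: legacy_record.get(old) for old, new in LEGACY_TO_NEW.items()}

new_record = translate({"CUST_NO": "42", "CUST_NM": "Acme Ltd", "TEL": "555-0100"})
# {'customer_id': '42', 'name': 'Acme Ltd', 'phone': '555-0100'}
```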
Middleware data integration: Benefits
Better data streaming: the software conducts the integration automatically
Easier access between systems: software is coded to facilitate communication between the systems in a network