Data Analysis Tools Flashcards

1
Q

Purpose and Outputs of data integration activities

A

Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation.

Data integration ultimately enables analytics tools to produce effective, actionable business intelligence.

2
Q

Functional Requirements

A

Defines a system or its components
Describes the functions a software system must perform
A function is described by its inputs, its behaviour, and its outputs

Calculation / data manipulation / business process / user interaction

Help you capture the intended behaviour of the system

Behaviour may be expressed as functions, services, or tasks

3
Q

Non-functional requirement

A

Defines the quality attributes of a software system

They represent a set of standards used to judge the specific operation of a system

E.g. how fast does the website load?

Essential to ensure the usability and effectiveness of the entire software system

Allows you to impose constraints or restrictions on the design of the system across the various agile backlogs

4
Q

Speed of Data Integration

A

Faster for data warehouses than for relational databases because access to the warehouse is read-only.

Data warehouses can serve large analytical queries, which avoids the problem of running those queries against transactional (OLTP) databases.

Loading data into warehouses (ETL) is usually carried out in batches, and while data is loading the warehouse is unavailable.

Having data integrated into a single source saves time (it doesn't take as long to prepare and analyse the data).

5
Q

Data Integration: Structure and Rules

A
  • Security policies
  • Access layers
  • Should be immutable (the content of the integrated data destination should not be changeable)
  • Validation checks should be carried out during ETL
  • Validate the source and target table structures and data types
  • Validate the column names against a mapping doc
  • Verification is done using ETL testing
  • Verify that the data is accurate
  • Verify the data is the right data required to be in the data warehouse
  • Verify that the data has no duplicates in the data warehouse (see the sketch below)
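
These checks can be expressed directly in SQL. A minimal sketch, assuming hypothetical staging_orders and dw_orders tables (information_schema support varies by RDBMS):

-- Compare column names and data types between source and target
SELECT s.column_name, s.data_type AS source_type, t.data_type AS target_type
FROM information_schema.columns s
JOIN information_schema.columns t ON s.column_name = t.column_name
WHERE s.table_name = 'staging_orders'
  AND t.table_name = 'dw_orders'
  AND s.data_type <> t.data_type;

-- Verify the warehouse has no duplicates
SELECT order_id, COUNT(*) AS copies
FROM dw_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
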
6
Q

Rationale for using and integrating data from multiple sources

A
  • Consistency across sources
  • Spotting missing data and identifying gaps that need to be filled - through ETL

7
Q

Benefits of integrating data from multiple sources

A
  • Increased collaboration across teams
  • Better business intelligence and insights
  • Data availability to all stakeholders
8
Q

To consider: Data in a business context

A
  1. Variability: illustrates how things differ, and by how much
  2. Uncertainty: Good visualisation practices frame uncertainty that arises from variation in data
  3. Context: Meaningful context helps us frame uncertainty against underlying variation in data
9
Q

Descriptive Analysis

A

Looks at past data and tells you what happened. Often used when tracking KPIs, revenue, and sales.

10
Q

Diagnostic Analysis

A

Aims to determine why something happened

11
Q

Predictive Analysis

A

predicts what is likely to happen in the future

Trends are derived from past data and used to create predictions

12
Q

Prescriptive Analysis

A

Combines the information found from the previous 3 types of data analysis and forms a plan of action for the organisation to address the issue or decision.

13
Q

Data Warehouse

A

The data warehouse is an industry-standard tool that allows the collection and organisation of data originating from various sources.

14
Q

Reasons for using data from multiple sources

A
  • allows the analyst to pull together data to resolve issues
  • perform data analysis
  • obtain a 360 view of a problem
  • assist with decision making, among other benefits
15
Q

Quality of Data Sources - things that could go wrong

A

Data Truncation - precision is lost / wrong data types
Data Corruption - commas in the wrong place
Data Missing - only a portion of the data set is uploaded
Data Typing - wrong data type uploaded into a field
Data Translation - wrong encoding/symbols

16
Q

To ensure better quality of data:

A
CheckSum
Spot Checks (Eyeballing)
Mins/Max/Aggregates
Counts
Export Comparison
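
A possible sketch of the count and min/max/aggregate checks, assuming hypothetical source_sales and dw_sales tables; checksum functions are dialect-specific (e.g. CHECKSUM_AGG in SQL Server), so they are omitted here:

-- Row counts should match between source and target
SELECT (SELECT COUNT(*) FROM source_sales) AS source_rows,
       (SELECT COUNT(*) FROM dw_sales)     AS target_rows;

-- Compare min/max/aggregates on a numeric column
SELECT MIN(amount) AS min_amount,
       MAX(amount) AS max_amount,
       SUM(amount) AS total_amount
FROM dw_sales;
-- Run the same query against source_sales and compare the outputs
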
17
Q

Data Integration

A

the process of combining data from different sources to help data managers and executives analyse it and make smarter business decisions

18
Q

Data Integration Process involves:

A

A person or system locating, retrieving, cleaning, and presenting the data

19
Q

Manual Data Integration

A

Occurs when a data manager oversees all aspects of the integration, usually by writing custom code, without automation.
Best for one-time instances; untenable for complex or recurring integrations because it is a tedious, manual process.

20
Q

Manual Data Integration: Benefits

A

Reduced cost: requires little maintenance and typically only integrates a small number of data sources
Greater Freedom: user has total control over the integration

21
Q

Manual Data Integration: Limitations

A

Less access: A developer or manager must manually orchestrate each integration
Difficulty scaling: Scaling for larger projects requires manually changing the code of each integration, and that takes time
Greater room for error: A manager/analyst must handle the data at each stage

22
Q

Middleware data integration

A

Software that connects applications and transfers data between them and databases. Middleware can act as an interpreter between old and new systems
Mostly it is a communication tool and has limited capabilities for data analytics

23
Q

Middleware data integration: Benefits

A

Better data streaming: the software conducts the integration automatically
Easier access between systems: software is coded to facilitate communication between the systems in a network

24
Q

Middleware data integration: Limitations

A

Less access: middleware needs to be developed and maintained by a developer with technical knowledge
Limited functionality: can only work with certain systems

25
Q

Application-based integration

A

Software applications do all the work: they locate, retrieve, clean, and integrate data from disparate sources, making it easy for data to move from one source to another

26
Q

Application-based integration: Benefits

A

Simplified processes: One application does all the work automatically
Easier information exchange: the application allows systems and departments to transfer information seamlessly
Fewer resources are used: because much of the process is automated, analysts can pursue other projects

27
Q

Application-based integration: Limitations

A

Limited access: requires technical knowledge and an analyst to oversee application maintenance/deployment
Inconsistent results: the approach is unstandardised and varies between businesses offering this as a service
Complicated set-up: designing the integration requires developers/analysts with technical knowledge
Difficult data management: accessing different systems can lead to compromised data integrity

28
Q

Uniform access integration

A

Accesses data from even more disparate sets and presents it uniformly, while allowing the data to stay in its original location

29
Q

Uniform access integration: Benefits

A

Lower storage requirements
Easier data access (works well with multiple data sources/systems)
Simplified view of data: uniform appearance of data for the end user

30
Q

Uniform access integration: Limitations

A

Data integrity challenges: lots of sources can lead to compromising data integrity
Strained systems: Data host systems are not usually designed to handle the amount/frequency of data requests

31
Q

Common storage integration: (data warehousing)

A

Similar to uniform access, but involves creating and storing a copy of the data in a data warehouse, which offers more versatility

32
Q

Common storage integration (data warehousing): Benefits

A

Reduced burden: host system isn’t constantly handling data queries
Increased data version management control: one source = better data integrity
Cleaner data appearance
Enhanced data analytics: maintaining a stored copy allows more sophisticated queries

33
Q

Common storage integration (data warehousing): Limitations

A

Increased storage costs: creating a copy = paying for storage
Higher maintenance costs: the stored copy must be kept up to date with its sources

34
Q

WHEN TO USE IT:

Manual Data Integration

A

Merge data for basic analysis between a small amount of data sources

35
Q

WHEN TO USE IT:

Middleware Data Integration

A

Automate and translate communication between legacy and modernised systems

36
Q

WHEN TO USE IT:

Application-based Data Integration

A

Automate and translate communication between systems and allow for more complicated data analysis

37
Q

WHEN TO USE IT:

Uniform-based Data Integration

A

Automate and translate communication between systems and present the data uniformly to allow for complicated data analysis

38
Q

WHEN TO USE IT:

Common storage Data Integration

A

Present the data uniformly, create and store a copy and perform the most sophisticated data analysis

39
Q

Common user interface

A
40
Q

Virtual Integration

A

leaves the data in the source systems and defines a set of views that provide a unified view to the customer across the whole enterprise

pros: nearly zero latency of data updates
cons: limited possibility of data history and version management

41
Q

Physical Data Integration: ETL

Extract:

A

The process of reading multiple data sources into the Data Warehouse; the data is COPIED (not moved).

Data validation occurs during this stage, you must have the correct:

  • structure
  • format
  • permissions

Method of extraction - for each data source, define whether the extraction process is manual or tool based.

42
Q

Physical Data Integration: ETL

Transform:

A

The process of combining the data tables or linking them using relational databases.

The source data itself is not changed in any way.

Data engineers must consider the efficiency of the databases as well as ensuring that all necessary data can be accessed.

Before moving the data into a single point of reference we need to (see the sketch after this list):

  • Remove inconsistencies
  • Standardise various data elements
  • Confirm the meanings of the data names in each file
  • Deduplication
  • Deriving new calculated values
  • Data validation
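
A minimal transform sketch in SQL, assuming a hypothetical staging_sales table; it standardises a text column, derives a calculated value, deduplicates, and applies a basic validation filter:

SELECT DISTINCT                                   -- deduplication
       UPPER(TRIM(country_code)) AS country_code, -- standardise elements
       unit_price * quantity     AS line_total,   -- derived calculated value
       customer_id
FROM staging_sales
WHERE customer_id IS NOT NULL;                    -- basic data validation
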
43
Q

Physical Data Integration: ETL

Load:

A

The process of writing the data to the target database

Due to the volume of Big Data, it is often necessary to use parallel processing to manage the amount of data being written to the system.

Data Verification is undertaken post-loading to ensure the data is accurate.
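
A hedged sketch of the load and post-load verification, assuming hypothetical staging_orders and dw_orders tables (real warehouse loads typically use bulk or parallel loaders rather than plain INSERT):

-- Generic load pattern
INSERT INTO dw_orders (order_id, order_date, amount)
SELECT order_id, order_date, amount
FROM staging_orders;

-- Post-load verification: row counts should match
SELECT (SELECT COUNT(*) FROM staging_orders) AS staged,
       (SELECT COUNT(*) FROM dw_orders)      AS loaded;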

44
Q

Sequence

A

It is the order in which we, as programmers, want the computer to execute the instructions we provide.

45
Q

Selection

A

Selecting which path of an algorithm to execute depending on some criteria.

46
Q

Iteration

A

Refers to looping or repeating procedures.

47
Q

SIMPLE QUERY

SELECT 
FROM
WHERE 
ORDER BY 
HAVING 
LIMIT 
DISTINCT
A
SELECT column names
FROM table-name
WHERE condition / filter on rows
ORDER BY sort order
HAVING filters on grouped rows (used with GROUP BY)
LIMIT restricts the number of rows
DISTINCT brings back unique values
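
Putting the clauses together in their valid syntactic order (HAVING sits between GROUP BY and ORDER BY); the employees table is hypothetical, and LIMIT syntax varies by dialect (e.g. TOP in SQL Server):

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE active = 1                  -- filter rows before grouping
GROUP BY department
HAVING AVG(salary) > 50000        -- filter groups after aggregation
ORDER BY avg_salary DESC
LIMIT 10;                         -- restrict the number of rows

-- DISTINCT brings back unique values
SELECT DISTINCT department FROM employees;
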
48
Q

Union Rules

A

Union = querying multiple tables with a single query

  1. Must match the number of columns, with compatible data types
  2. Can only have one ORDER BY, at the bottom of the full select statement
  3. UNION removes exact duplicates, whereas UNION ALL allows duplicates
  4. Conditions between UNION SELECT statements should match
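
A sketch of rules 1-3, assuming hypothetical current_customers and archived_customers tables with compatible columns:

SELECT customer_id, city FROM current_customers
UNION                          -- removes exact duplicates
SELECT customer_id, city FROM archived_customers
ORDER BY city;                 -- the single ORDER BY, at the bottom

-- UNION ALL in place of UNION would keep duplicate rows
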
49
Q

SQL Expressions: CASE

A

groups data into categories or classifications
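
A minimal sketch, assuming a hypothetical orders table:

SELECT order_id,
       CASE
         WHEN amount >= 1000 THEN 'large'
         WHEN amount >= 100  THEN 'medium'
         ELSE 'small'
       END AS order_size       -- each row is classified into a category
FROM orders;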

50
Q

SQL Expressions: DATETIME

A

type of variable that allows storing a date/time value in the database, in the form YYYY-MM-DD HH:MI:SS
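
For example (the events table is hypothetical; some dialects use TIMESTAMP instead of DATETIME):

CREATE TABLE events (
    event_id    INT,
    occurred_at DATETIME        -- stored as YYYY-MM-DD HH:MI:SS
);

SELECT event_id
FROM events
WHERE occurred_at >= '2023-01-01 00:00:00';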

51
Q

SQL Expressions: Compound

A

mathematical operations within the relational database: arithmetic, comparison, logical, string

52
Q

Arithmetic

A

+ - * / ** (addition, subtraction, multiplication, division, to the power of)
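
For example (exponent syntax is dialect-specific, so the portable POWER() function is shown; some dialects also require a FROM clause):

SELECT 7 + 3,        -- 10
       7 - 3,        -- 4
       7 * 3,        -- 21
       7 / 3,        -- division (integer or decimal, depending on dialect)
       POWER(7, 3);  -- 343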

53
Q

Comparison

A

= Equal to
!= or <> Not equal to
< Less than
<= or !> Less than or equal to (not greater than)
> Greater than
>= or !< Greater than or equal to (not less than)
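
For example, in a WHERE clause (the products table is hypothetical):

SELECT product_name
FROM products
WHERE price <> 0            -- not equal
  AND stock >= 10           -- greater than or equal to
  AND discount < 0.5;       -- less than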

54
Q

Logical

A

ALL / AND Does the value meet ALL criteria
ANY / OR Does the value meet ANY criteria
BETWEEN Is the value BETWEEN the listed values
EXISTS Does a row meeting the criteria exist
IN Is the value found in the listed literal values
LIKE Compares the value to listed values using wildcards
NOT Reverses the meaning of a logical operator
IS NULL Checks whether a value is null
UNIQUE Searches for duplicates
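
A combined sketch, assuming hypothetical customers and orders tables:

SELECT customer_name
FROM customers
WHERE country IN ('UK', 'IE')                          -- value in a list
  AND signup_date BETWEEN '2022-01-01' AND '2022-12-31'
  AND customer_name LIKE 'A%'                          -- wildcard match
  AND phone IS NOT NULL                                -- NOT reverses IS NULL
  AND EXISTS (SELECT 1
              FROM orders o
              WHERE o.customer_id = customers.customer_id);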

55
Q

String

A

CHAR(n) returns the character for character code n
CONCAT concatenates items (puts them together)
FORMAT(n) returns a number formatted to n decimal places
LOWER() returns the argument in lowercase
UPPER() returns the argument in uppercase
REPEAT() repeats the string a certain number of times
TRIM() removes leading and trailing spaces
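
For example, MySQL-style (function names and signatures vary by dialect; the customers table is hypothetical):

SELECT CONCAT(first_name, ' ', last_name) AS full_name,
       UPPER(last_name)                   AS upper_name,
       LOWER(first_name)                  AS lower_name,
       TRIM('  padded  ')                 AS trimmed,    -- 'padded'
       REPEAT('ab', 3)                    AS repeated,   -- 'ababab'
       FORMAT(1234.5678, 2)               AS two_dp      -- '1,234.57'
FROM customers;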

56
Q

Functions

AVG
COUNT
MAX
MIN
GROUP BY
ROUND
CAST
CONVERT 
ISNULL
A
AVG = average value
COUNT = number of rows
MAX = largest value
MIN = smallest value
GROUP BY = indicates the dimensions to group data by for aggregates
ROUND = specifies the number of decimal places
CAST = changes the data type of an expression (ANSI standard)
CONVERT = converts the data type of a value (SQL Server specific, with optional style formatting)
ISNULL = returns a specified value if the expression is null; otherwise returns the expression
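
An aggregate sketch, assuming a hypothetical sales table:

SELECT region,
       COUNT(*)              AS num_sales,
       MIN(amount)           AS smallest,
       MAX(amount)           AS largest,
       ROUND(AVG(amount), 2) AS avg_amount,
       SUM(CAST(quantity AS DECIMAL(10, 2))) AS total_qty
FROM sales
GROUP BY region;
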
57
Q

JOINS

INNER
OUTER
FULL(OUTER)
RIGHT
LEFT 
UNION
SELECT INTO
SUBQUERIES
EXCEPT
CROSS
A

INNER = joins only what matches in both tables
OUTER = combination of a left join and a right join
FULL (OUTER) = everything from both tables, with NULLs where there is no match
RIGHT = includes everything from the right table and anything that matches in the left
LEFT = includes everything from the left table and anything that matches in the right
UNION = two SELECT queries whose rows are combined (unioned)
SELECT INTO = copies data from one table into a new table
SUBQUERIES = queries within another SQL query, embedded within the WHERE clause as a condition to further restrict rows
EXCEPT (and outer joins) help to handle missing data between tables
CROSS JOIN = usually not desired / cartesian product / slow performance
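
A few join patterns, assuming hypothetical customers and orders tables:

-- INNER: only customers with matching orders
SELECT c.customer_name, o.amount
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id;

-- LEFT: every customer, with NULLs where no order matches
SELECT c.customer_name, o.amount
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

-- Subquery embedded in the WHERE clause to further restrict rows
SELECT customer_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 100);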

58
Q

Implicit Data Conversion

A

The server automatically converts the data from one type to another during query processing
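
For example (the orders table is hypothetical; conversion behaviour and its performance cost vary by RDBMS):

SELECT order_id
FROM orders
WHERE order_id = '42';   -- the string '42' is implicitly converted to a number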

59
Q

Data Profiling

A

Review to determine
Accuracy
Completeness
Validity