Data Analysis 1 Flashcards
What are the 4 types of data analytics?
- Descriptive - What happened?
- Diagnostic - Why did it happen?
- Predictive - What will happen next?
- Prescriptive - What should be done about it?
What are the 6 most basic steps in data analysis?
- Understanding the problem and desired result
- Setting a clear metric - what will be measured, and how?
- Gathering data
- Cleaning data
- Analysing and mining data
- Interpreting and presenting results
What is the difference between analysis and analytics?
Analysis can be done without numbers or data, as in business analysis, psychoanalysis, etc.
Analytics, even when used without the prefix “Data”, almost invariably implies the use of data for numerical manipulation and inference.
What is the ETL process?
Extract, Transform, Load. Describes taking data from disparate sources, converting it into a consistent format, and centralising it in a data warehouse.
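A minimal sketch of the three ETL steps, using only the Python standard library. The two source lists, the `sales` table and the function names are made up for illustration, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical raw data standing in for two disparate source systems.
crm_rows = [{"customer": " alice ", "spend": "120.50"}, {"customer": "bob", "spend": "80"}]
shop_rows = [("Carol", 42.0)]

def extract():
    """Extract: pull the raw records from each source."""
    return crm_rows, shop_rows

def transform(crm, shop):
    """Transform: bring both sources into one consistent shape."""
    cleaned = [(r["customer"].strip().title(), float(r["spend"])) for r in crm]
    cleaned += [(name.strip().title(), float(spend)) for name, spend in shop]
    return cleaned

def load(rows, conn):
    """Load: write the unified rows into the central store."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, spend REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(*extract()), conn)
result = conn.execute("SELECT customer, spend FROM sales ORDER BY customer").fetchall()
print(result)
```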
What is a data warehouse?
Data warehouse - your single source of truth for all data that has been extracted, transformed and loaded from any source
What is a data mart?
Data mart - Subsection of the data warehouse, built for a specific business function, purpose, or community of users (e.g. individual stakeholder data). Isolated security and performance.
What is a data lake?
Data lake - A repository that can store structured, semi-structured and unstructured data in its raw format, classified and tagged with metadata
What is a data pipeline?
Encompasses the entire journey of moving data from one system to another, including the ETL process. Often loads into a data lake.
What are the 5 V’s of big data?
- Velocity - data is being generated fast and constantly
- Volume - scale and storage of data
- Variety - diversity (structured and unstructured, people-generated and machine-generated data, etc.)
- Veracity - quality and origin
- Value - ability to turn data into value
What is a data repository?
A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival purposes.
This can include databases, marts, warehouses etc.
What is data wrangling?
Exploration, transformation, validation and publishing of data to prepare it for analysis
What is ‘normalising’ data?
Organising data into separate, related tables to reduce redundancy and inconsistency, and cleaning out unused data
What is ‘denormalising’ data?
Combining data from multiple tables into a single table for faster queries and analysis
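As a rough illustration of the two cards above: the normalised form keeps each customer's details in one place, and the denormalised form copies those details onto every order so that no lookup is needed at query time. The tables and field names here are invented for the example.

```python
# Normalised: customer details are stored once and referenced by id.
customers = {
    1: {"name": "Alice", "city": "Leeds"},
    2: {"name": "Bob", "city": "York"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 25.0},
    {"order_id": 101, "customer_id": 1, "total": 40.0},
    {"order_id": 102, "customer_id": 2, "total": 15.0},
]

# Denormalised: one flat table, with customer details repeated on each order row.
flat_orders = [{**order, **customers[order["customer_id"]]} for order in orders]

for row in flat_orders:
    print(row)
```

The trade-off is visible in the output: Alice's name and city now appear twice, which is faster to query but has to be kept consistent if her details change.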
What is ‘enriching’ data?
Adding to your data to get more value out of it, e.g. using the metadata
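One possible form of enrichment is deriving a new field from data you already hold. The sketch below adds a weekday to each order based on its date; the `orders` records and the `weekday` field are hypothetical.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical order records with an ISO-format date string.
orders = [
    {"order_id": 1, "order_date": "2024-01-06"},
    {"order_id": 2, "order_date": "2024-01-08"},
]

# Enrich each record with a derived field; date.weekday() returns 0 for Monday.
for order in orders:
    order["weekday"] = WEEKDAYS[date.fromisoformat(order["order_date"]).weekday()]

print(orders)
```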
What are descriptive statistics and inferential statistics?
Descriptive statistics focuses on describing the visible characteristics of a dataset, without necessarily making any inferences or drawing conclusions about it. E.g. Mean/Median/Mode.
Inferential statistics takes data from a sample to make inferences about a larger population from which the sample was drawn
What is central tendency and what are 3 measures of it?
Locating the centre of a data sample. E.g. Mean, Median and Mode.
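A quick sketch of the three measures using Python's built-in `statistics` module. The data values are made up.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # made-up sample

mean = statistics.mean(data)      # sum of values divided by count: 30 / 6
median = statistics.median(data)  # middle value; average of 3 and 5 for an even count
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Here the mean is 5, the median is 4 and the mode is 3, which shows that the three measures need not agree on where the "centre" is.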
What is dispersion and what are 3 examples of it?
Measure of the variability of a data set. E.g. Variance, Standard Deviation and Range
Define variance, standard deviation and range.
Variance - The average squared distance of data points from the mean, i.e. how spread out the values are.
Lower variability = more consistent values in the dataset
Higher variability = data points that are more dissimilar, with higher likelihood of extreme values.
What is standard deviation?
Tells you how tightly your data is clustered around the mean. It is the square root of the variance, so it is expressed in the same units as the data.
What is range?
Tells you the distance between smallest and largest values in the data set
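The three dispersion measures can be computed with the `statistics` module, as sketched below on made-up values. Note the assumption being made: `pvariance` and `pstdev` treat the data as the whole population (divide by n), while `variance` and `stdev` treat it as a sample (divide by n - 1).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with a mean of 5

value_range = max(data) - min(data)       # largest minus smallest value

pop_variance = statistics.pvariance(data)  # mean of squared deviations: 32 / 8
pop_std_dev = statistics.pstdev(data)      # square root of the population variance

sample_variance = statistics.variance(data)  # 32 / 7, slightly larger
sample_std_dev = statistics.stdev(data)

print(value_range, pop_variance, pop_std_dev)
print(sample_variance, sample_std_dev)
```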
What is skewness and why does it matter?
A measure of whether the distribution of values is symmetrical around a central value, or skewed to the left or right. Can affect which types of analysis are valid to perform.
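One common way to put a number on skewness is the moment-based (Fisher-Pearson) coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below assumes that definition; the `skewness` helper and both datasets are illustrative, and other definitions exist.

```python
from statistics import fmean, median

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric))
print(skewness(right_skewed))
print(fmean(right_skewed), median(right_skewed))
```

In the right-skewed data the mean is pulled above the median by the extreme value, which is one reason skewed data can make the mean a misleading summary.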
Explain 3 common types of inferential statistics.
Hypothesis testing - e.g. testing the efficacy of a vaccine by comparing outcomes in a vaccinated group against those in a control group.
Confidence intervals - incorporate the uncertainty and sampling error to create a range of values that the actual population value (e.g. the population mean) is likely to fall within.
Regression analysis - incorporates hypothesis tests to determine whether relationships observed in the sample actually exist in the population data as well
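To make the confidence interval idea concrete, here is a minimal sketch of a 95% interval for a population mean. It assumes a normal approximation (z-interval); for a sample this small a t-distribution would usually be more appropriate, so treat the numbers as illustrative. The sample values are made up.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # made-up measurements

n = len(sample)
mean = fmean(sample)
std_error = stdev(sample) / sqrt(n)  # sample standard deviation scaled by sqrt(n)

z = NormalDist().inv_cdf(0.975)      # about 1.96 for a two-sided 95% interval
ci_low = mean - z * std_error
ci_high = mean + z * std_error

print(f"mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```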
What’s the difference between patterns and trends?
Patterns recur regularly, e.g. the time of day when most users are logged into an application.
A trend is the general tendency of a set of data to change over time, e.g. the rise in global temperatures caused by climate change.
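A small sketch of the distinction, using invented login counts: the pattern is what repeats within each day, and the trend is how the daily totals move from one day to the next.

```python
# Hypothetical login counts per day and period of day; all values are made up.
logins = {
    "Mon": {"morning": 10, "midday": 30, "evening": 20},
    "Tue": {"morning": 12, "midday": 34, "evening": 23},
    "Wed": {"morning": 14, "midday": 39, "evening": 25},
}

# Pattern: the busiest period recurs at the same time every day.
busiest_per_day = {day: max(periods, key=periods.get) for day, periods in logins.items()}

# Trend: the daily total changes in a consistent direction over time.
daily_totals = [sum(periods.values()) for periods in logins.values()]
is_rising = all(later > earlier for earlier, later in zip(daily_totals, daily_totals[1:]))

print(busiest_per_day)
print(daily_totals, is_rising)
```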
What is data mining?
The process of extracting knowledge from data