Data Analysis 1 Flashcards

1
Q

What are the 4 types of data analytics?

A
  1. Descriptive - What happened?
  2. Diagnostic - Why did it happen?
  3. Predictive - What will happen next?
  4. Prescriptive - What should be done about it?
2
Q

What are the 6 most basic steps in data analysis?

A
  1. Understanding the problem and desired result
  2. Setting a clear metric - what will be measured, and how?
  3. Gathering data
  4. Cleaning data
  5. Analysing and mining data
  6. Interpreting and presenting results
3
Q

What is the difference between analysis and analytics?

A

Analysis can be done without numbers or data, as in business analysis, psychoanalysis, etc.

Analytics, by contrast, almost invariably implies using data for numerical manipulation and inference, even when used without the prefix “Data”.

4
Q

What is the ETL process?

A

Extract, Transform, Load. Describes taking data from disparate sources, converting it to a consistent format and centralising it in a data warehouse.
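
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two CSV sources, their column names and the `sales` table are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two "disparate sources" with different layouts.
CRM_CSV = "customer,amount_gbp\nalice,10.50\nbob,3.00\n"
SHOP_CSV = "name;pence\nAlice;250\ncarol;999\n"


def extract(text, delimiter):
    """Extract: read raw rows from one source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(crm_rows, shop_rows):
    """Transform: map both sources onto one schema (lowercase name, pounds)."""
    unified = [(r["customer"].lower(), float(r["amount_gbp"])) for r in crm_rows]
    unified += [(r["name"].lower(), int(r["pence"]) / 100) for r in shop_rows]
    return unified


def load(rows, conn):
    """Load: write the unified rows into the central store."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    rows = transform(extract(CRM_CSV, ","), extract(SHOP_CSV, ";"))
    load(rows, conn)
    return conn


conn = run_etl()
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once both sources share one schema, a single query can total spending per customer across them, which is the point of centralising the data.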

5
Q

What is a data warehouse?

A

Data warehouse - your single source of truth for all data that has been extracted, transformed and loaded from any source

6
Q

What is a data mart?

A

Data mart - Subsection of the data warehouse, built for a specific business function, purpose, or community of users (e.g. individual stakeholder data). Isolated security and performance.

7
Q

What is a data lake?

A

Data lake - A repository that can store structured, semi-structured and unstructured data in its raw format, classified and tagged with metadata

8
Q

What is a data pipeline?

A

Encompasses the entire journey of moving data from one system to another, including the ETL process. Typically loads into a data lake.

9
Q

What are the 5 V’s of big data?

A
  1. Velocity - data is being generated fast and constantly
  2. Volume - scale and storage of data
  3. Variety - diversity (structured, unstructured, people-data and machine-data etc.)
  4. Veracity - quality and origin
  5. Value - ability to turn data into value
10
Q

What is a data repository?

A

A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival purposes.

This can include databases, marts, warehouses etc.

11
Q

What is data wrangling?

A

Exploration, transformation, validation and publishing of data to prepare it for analysis

12
Q

What is ‘normalising’ data?

A

Cleaning unused data, reducing redundancy, reducing inconsistency

13
Q

What is ‘denormalising’ data?

A

Combining data from multiple tables into a single table for faster queries and analysis
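
A small sketch of the idea, using Python's built-in `sqlite3` module. The `customers`, `orders` and `orders_wide` tables and their contents are hypothetical, chosen only to show the normalised and denormalised forms side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised: each fact stored once, linked by customer_id.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alice', 'Leeds'), (2, 'Bob', 'York');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

    -- Denormalised: customer details repeated on every order row.
    CREATE TABLE orders_wide AS
    SELECT o.order_id, c.name, c.city, o.total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Reading the wide table needs no join at query time.
wide_rows = conn.execute(
    "SELECT order_id, name, city, total FROM orders_wide ORDER BY order_id"
).fetchall()
for row in wide_rows:
    print(row)
```

The trade-off is visible in the output: Alice's name and city now appear on two rows, so reads are simpler but the redundancy that normalising removed is back.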

14
Q

What is ‘enriching’ data?

A

Adding to your data to get more value out of it, e.g. using the metadata

15
Q

What are descriptive statistics and inferential statistics?

A

Descriptive statistics focus on describing the visible characteristics of a dataset, without necessarily making any inferences or drawing conclusions about it. E.g. Mean/Median/Mode.

Inferential statistics take data from a sample to make inferences about the larger population from which the sample was drawn.

16
Q

What is central tendency and what are 3 measures of it?

A

Locating the centre of a data sample. E.g. Mean, Median and Mode.
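
As a quick illustration, Python's standard `statistics` module computes all three. The `scores` list is made-up example data.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mean = statistics.mean(scores)      # sum / count = 30 / 6 = 5
median = statistics.median(scores)  # middle of the sorted values = (3 + 5) / 2 = 4.0
mode = statistics.mode(scores)      # most frequent value = 3

print(mean, median, mode)
```

The three measures disagree here because the single large value (10) pulls the mean above the median.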

17
Q

What is dispersion and what are 3 examples of it?

A

Measure of the variability of a data set. E.g. Variance, Standard Deviation and Range

18
Q

Define variance, standard deviation and range.

A

Variance - How far data points fall from the centre, measured as the average squared deviation from the mean.

Lower variability = more consistent values in the dataset

Higher variability = data points that are more dissimilar, with higher likelihood of extreme values.

19
Q

What is standard deviation?

A

Tells you how tightly your data is clustered around the mean

20
Q

What is range?

A

Tells you the distance between smallest and largest values in the data set
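
The three dispersion measures from the cards above can be computed with Python's standard `statistics` module. The `data` list is made-up, and the sketch assumes the list is the whole population; for a sample, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be the usual choice.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set, mean = 5

variance = statistics.pvariance(data)  # average squared deviation from the mean = 32 / 8 = 4
std_dev = statistics.pstdev(data)      # square root of the variance = 2.0
value_range = max(data) - min(data)    # largest minus smallest = 9 - 2 = 7

print(variance, std_dev, value_range)
```

Standard deviation is in the same units as the data, which is why it is usually easier to interpret than variance.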

21
Q

What is skewness and why does it matter?

A

A measure of whether the distribution of values is symmetrical around a central value, or skewed to the left or right. Can affect which types of analysis are valid to perform.
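
One common way to put a number on skewness is the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5). The card does not name a formula, so treat this choice, the `skewness` helper and the example lists as assumptions for illustration.

```python
def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]  # one extreme high value

print(skewness(symmetric))
print(skewness(right_skewed))
```

A result near zero suggests a symmetrical distribution; a clearly positive or negative value signals a right or left skew.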

22
Q

Explain 3 common types of inferential statistics.

A

Hypothesis testing - e.g. testing the efficacy of a vaccine by comparing outcomes against a control group.

Confidence intervals - incorporate the uncertainty and sampling error to create a range of values the actual population value is likely to fall within.

Regression analysis - incorporates hypothesis tests to determine whether relationships observed in the sample actually exist in the population data as well.
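
To make the confidence interval idea concrete, here is a minimal sketch for the mean of a sample. It assumes a normal approximation, which is a simplification (for small samples a t-distribution is normally preferred); the `mean_confidence_interval` helper and the `sample` values are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # sampling error of the mean
    # z-score leaving (1 - confidence) / 2 in each tail, about 1.96 for 95%
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]  # made-up measurements
low, high = mean_confidence_interval(sample)
print(low, high)
```

Asking for higher confidence (e.g. `confidence=0.99`) widens the interval, and a larger sample narrows it.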

23
Q

What’s the difference between patterns and trends?

A

Patterns recur regularly, e.g. the time of day when most users are logged into an application.

A trend is the general tendency of a set of data to change over time, e.g. rising global temperatures because of climate change.
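
A rough way to see the two apart is a moving average: averaging over one full cycle of the pattern cancels it out and leaves the trend. The login figures below are invented, and the 7-day window is an assumption that matches the made-up weekly pattern.

```python
# Hypothetical daily logins: a steady upward trend plus a weekend bump
# that repeats every 7 days (the pattern).
weekly_pattern = [0, 0, 0, 0, 0, 5, 5]
logins = [100 + day + weekly_pattern[day % 7] for day in range(28)]


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


trend = moving_average(logins, window=7)

# Day-to-day change in the smoothed series: the weekend bump is gone,
# leaving only the underlying growth of about one login per day.
steps = [later - earlier for earlier, later in zip(trend, trend[1:])]
print(steps[:5])
```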

24
Q

What is data mining?

A

The process of extracting knowledge from data

25
Q

Name and describe commonly used data mining techniques

A