Data Analytics Flashcards

1
Q

What is data analytics - Introduction, exam focus, parts

A

It involves the collection and analysis of data to provide meaning and insight.
The exam will centre on the problems and challenges involved in data collection and analysis, and on the impact new analytical methods will have on a business.
Two main parts
1. Collection - ethics and GDPR, text analysis, image analysis, sentiment analysis, voice analysis
2. Analysis - descriptive, diagnostic and predictive

2
Q

What is data collection - GDPR & ethical issues

A

Data should be collected both ethically and accurately in a timely manner.
The accountant is responsible for data management

GDPR
1. Data should be collected and used for the purpose intended and declared
2. Only collect data needed for the stated purpose
3. Data subjects have the right to view and change inaccurate data
4. Data should not be held for longer than the period intended
5. Data must be held securely

Ethical Issues
1. Privacy - data is private and should not be exploited; sharing data with other businesses could break this principle
2. Analysis from data must be robustly challenged
3. Excessive trust in the black box (algorithms and AI) - how can we make sure the data remains accurate? By testing

3
Q

Modern developments in Data collection - Text analysis, Image analysis, voice analysis, sentiment analysis

A

Text analysis - Machines can read and understand text in context far better than ever before, e.g. emails, social media scraping and online customer feedback.

Image analysis - Images can be rapidly analysed to identify
1. who is in the image?
2. what were they doing?
3. which products were in the image?
4. when was it displayed?
This builds up a profile of a data subject which could be used for marketing purposes

Voice analysis
If a conversation is recorded within a call centre, machines can be used to
1. understand what the conversation is about
2. measure resolution period
3. Truth verification
4. document promised actions

Sentiment analysis
It attempts to determine whether the subject was happy, sad, complaining or satisfied by the end of the call. The results can be summarised and monitored.
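
As a rough illustration of the idea, a very simple sentiment scorer can count positive and negative words in a call transcript. This is only a sketch: the word lists, the `score_sentiment` name and the sample transcripts are invented for illustration, and real systems use far richer language models.

```python
import re

# Hypothetical word lists, chosen only for illustration.
POSITIVE = {"happy", "great", "thanks", "resolved", "satisfied"}
NEGATIVE = {"unhappy", "angry", "complaint", "cancel", "terrible"}


def score_sentiment(transcript: str) -> str:
    """Label a transcript by counting positive and negative words."""
    words = re.findall(r"[a-z']+", transcript.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented call transcripts.
calls = [
    "Thanks, the issue is resolved and I am happy",
    "This is terrible, I want to cancel",
]
labels = [score_sentiment(c) for c in calls]
print(labels)
```

The labels could then be counted per day or per agent so that sentiment is summarised and monitored over time.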

General issues with the new systems
1. Privacy - ad tracking, stalking
2. Damage - can the system identify uncomfortable connections?
3. Depth of info - traditional data analysis tells you what, not why
4. Control systems - permission management, controls for GDPR

Performance measurement systems
If it is possible to collect this data, then how well the business does so should be measured. This will require investment and training.

4
Q

Data Analysis - introduction, outliers

A

Population - before any calculation is made, the population it is based on should meet the following conditions
1. It is free of major errors - misclassification, value and cut-off errors
2. Multiple populations are separated, e.g. the heights of men and women
3. Data is segmented into strata or sub-populations, e.g. home country vs overseas sales, or wholesale vs retail, rather than total sales
4. If the analysis is based on a sample, that sample is representative of the population and big enough to give sufficient confidence in the findings, without bias
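
To see why segmentation matters, the sketch below compares the average of total sales with the average of each stratum. The figures and the `home` / `overseas` labels are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented sales records: (segment, sale value).
sales = [
    ("home", 100), ("home", 110), ("home", 90),
    ("overseas", 400), ("overseas", 420), ("overseas", 380),
]

# Group the sale values by segment (stratum).
by_segment = defaultdict(list)
for segment, value in sales:
    by_segment[segment].append(value)

overall_mean = mean(value for _, value in sales)
segment_means = {seg: mean(vals) for seg, vals in by_segment.items()}

print(overall_mean)
print(segment_means)
```

Here the overall average is 250, a figure that describes neither the home sales (average 100) nor the overseas sales (average 400).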

Outliers
1. Investigate
2. If it is an error, correct it
3. If it belongs to a sub-population, separate it
4. It is unwise to proceed with the analysis and then explain the outlier away as a reservation
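
The card does not say how to spot an outlier in the first place. One common rule of thumb (an assumption here, not part of the card) flags values more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch with invented invoice values and a hypothetical `flag_outliers` helper:

```python
from statistics import quantiles


def flag_outliers(values):
    """Return values more than 1.5 x IQR outside the quartiles (rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


invoices = [120, 130, 125, 128, 122, 127, 1250]  # invented data
suspects = flag_outliers(invoices)
print(suspects)
```

Each flagged value should then be investigated: corrected if it is an error, or separated if it belongs to a sub-population.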

5
Q

Data Analysis - Descriptive analysis, centrality, mode, averages

A

Data can be described by way of centrality or by way of dispersion

Descriptive analysis uses calculations to describe the data.

Centrality - Averages
Problems with averages
1. Distorted by extreme values
2. If drawn from data with multiple populations, the average will be a relatively meaningless statistic
3. Averages are not understood as well as people think, e.g. crossing a river that is on average 60cm deep

Centrality - Mode
Mode - this is the most frequent occurrence or result, i.e. what most people experience
1. It is not distorted by extreme values or outliers
2. Mode is underused as a statistic
3. Mode is not as widely understood, so it must be explained to the audience
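
A short sketch of the contrast between the two measures, using Python's `statistics` module. The delivery times are invented, with one extreme value included on purpose.

```python
from statistics import mean, mode

# Invented delivery times in days; 30 is an extreme value.
delivery_days = [2, 2, 2, 3, 3, 4, 30]

average = mean(delivery_days)      # pulled upwards by the 30
most_common = mode(delivery_days)  # what most customers experience

print(round(average, 2))
print(most_common)
```

The average comes out at roughly 6.6 days, although most customers waited 2 days, which is what the mode reports.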

6
Q

Data Analysis - Descriptive analysis, dispersion

A

Data can also be described by its dispersion, which indicates risk. How variable is the data?

How different are the data items from each other?

A measure of spread is a measure of risk.

The bigger the standard deviation, the bigger the risk.

The easiest measure is the range, from the minimum result to the maximum result, but it doesn’t take into account average dispersion.

Standard deviation
Standard deviation is, loosely, the average deviation between the items in the population and the mean (strictly, the square root of the average squared deviation).
It is a proxy for spread and risk.
Standard deviation is not commonly used and not really understood.
It is also distorted by absolute values: large absolute values produce large standard deviations. This can be corrected using the coefficient of variation (standard deviation divided by the mean).
If the original data has big numbers, the standard deviation will also be a big number without any added risk.
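
The effect of big numbers can be sketched as follows. The two data sets are invented and follow the same pattern at different scales; `pstdev` (the population standard deviation) is used on the assumption that each list is the whole population, and `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


small = [8, 10, 12]            # invented data
large = [8000, 10000, 12000]   # same pattern, 1,000 times bigger

range_small = max(small) - min(small)
range_large = max(large) - min(large)
cv_small = coefficient_of_variation(small)
cv_large = coefficient_of_variation(large)

print(range_small, range_large)
print(pstdev(small), pstdev(large))
print(cv_small, cv_large)
```

The range and standard deviation of the second list are 1,000 times larger, but the coefficient of variation is the same for both, showing that the relative spread (and so the risk) has not changed.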

7
Q

Data Analysis - data visualisation

A

This is a useful way of describing data, using visualisations such as charts and graphs.
It helps to show trends over a period of time.
They are also easy to manipulate

8
Q

Data Analysis - Diagnostic analysis - correlation, regression

A

This attempts to understand why something has happened in the data.

It typically involves a two-step process
1. Correlation - a measure of the strength of the relationship between two variables. A strong correlation supports the idea that a relationship exists: a change in X could be causing a change in Y.
2. Regression - the extent of the change can then be quantified with regression analysis.

X is always assumed to determine Y, so Y is the dependent variable, e.g. marketing determines sales: spending more money on marketing leads to more sales.

A positive correlation shows that more money spent on marketing leads to more sales. There are always exceptions to the rule, e.g. customers buying without seeing the marketing, or seeing the marketing without buying.

A negative correlation shows, for example, that increasing spend on staff reduces waiting time.

The strength of the relationship is measured by the correlation coefficient (r)

The r value ranges from +1 to -1. An r of +1 indicates a perfect positive relationship between X and Y, and an r of -1 indicates a perfect negative relationship.

Most business relationships will have an r value short of 1, even for things that appear to be related.

The lower r gets (the closer to zero), the more of a judgement call is needed before proceeding to regression: is there too much uncertainty, or is it good enough?

To make this judgement call you can use the coefficient of determination (D), where D = r², e.g. r = 0.8 gives D = 0.64 = 64%.
This means that if X is marketing and Y is sales, then 64% of the change in sales can be explained by the change in marketing; the other 36% is due to other causes and is unexplained.
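
As a sketch of how these two figures are obtained, the code below computes the correlation coefficient from paired observations and squares it to get the coefficient of determination. The marketing and sales figures are invented, and `pearson_r` is a hypothetical helper implementing the standard Pearson formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


marketing = [10, 20, 30, 40, 50]   # invented X values
sales = [25, 38, 70, 65, 102]      # invented Y values

r = pearson_r(marketing, sales)
d = r ** 2  # coefficient of determination

print(round(r, 2), round(d, 2))
print(round(0.8 ** 2, 2))  # the card's own example: r = 0.8 gives D = 0.64
```

Because D is the square of r, it is always smaller than r in absolute terms unless the relationship is perfect, which makes it a more cautious guide for the judgement call.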

Problems with correlation
1. The figures are based on historic values
2. A high r value could be spurious: it could have arisen by accident, or both variables could share a common cause
3. Correlation does not mean causation. Even if correlation is high, you still need to use common sense
4. If the calculations are based on a small or biased sample, a different relationship might exist than the one shown by the sample
5. There could be more than one driver of the Y variable, or a given X could affect more than one Y variable (multivariate correlation)

Regression
Regression can tell you by how much a change in marketing spend would increase sales, e.g.
Y = 12,000 + 1.8X
Y = expected sales
12,000 = sales that occur even if marketing spend is zero
1.8 = sales generated for every $1 of marketing
X = marketing spend
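
The regression line above can be applied directly. In this sketch the intercept and slope come from the card, while the `predicted_sales` name and the $5,000 spend are invented for illustration.

```python
INTERCEPT = 12_000  # sales expected with zero marketing spend
SLOPE = 1.8         # sales generated per $1 of marketing


def predicted_sales(marketing_spend: float) -> float:
    """Apply the card's regression line Y = 12,000 + 1.8X."""
    return INTERCEPT + SLOPE * marketing_spend


print(predicted_sales(0))
print(predicted_sales(5_000))
```

With no marketing spend the line gives 12,000 of sales, and with $5,000 of marketing it gives 12,000 + 1.8 x 5,000 = 21,000.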

9
Q

Data Analysis - Predictive analysis

A

This uses data gathered to predict what will happen in the future if the X variable changes.

The data is historic, and the past does not determine the future.

Interpolation is predicting within the original data range. There is data to back it up, so it is less risky.

Extrapolation is predicting outside the original data range, which is much more risky, e.g. spending an amount we haven’t spent previously and expecting an increase in line with previous spend.
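
A quick check of which case applies can be sketched as below. The `prediction_type` helper and the spending history are invented for illustration.

```python
def prediction_type(x: float, observed_xs) -> str:
    """Say whether predicting at x is interpolation or extrapolation."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"


past_marketing_spend = [2_000, 3_500, 5_000, 8_000]  # invented history

within = prediction_type(4_000, past_marketing_spend)
beyond = prediction_type(12_000, past_marketing_spend)
print(within)
print(beyond)
```

A spend of 4,000 sits inside the range already observed, so there is data to back the prediction up; a spend of 12,000 has never been tried, so the prediction rests on the past relationship continuing to hold.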

Be ready to evaluate the results of a data analysis: what does the standard deviation mean, and how reliable is the analysis?

10
Q

Data Analysis - Things to note

A

Outliers - if errors, correct them; if a sub-group, separate them
Recognise any bias in the population
Averages - can be skewed by extreme values; the mode can be a better option
Standard deviation - average deviation from the mean of the population
COV - coefficient of variation, used to solve the issue of big numbers in standard deviation (SD / mean)
Correlation - is there a relationship between X and Y, e.g. marketing and sales? The strength of the relationship is measured by the correlation coefficient r.
Regression - by how much will an increase in marketing increase sales?
