wk 8 - business intelligence and big data Flashcards
Business intelligence (def)
Business Intelligence (BI) is the process of collecting, cleansing, combining, consolidating, analysing, interpreting and communicating internal and available external data, relevant for the decision making process in the organisation
why the interest in business data?
Availability of Data
Amount of digital data growing exponentially
In their 2010 report, Gartner defined a small data warehouse as less than 5 terabytes (TB), a medium data warehouse as 5 TB to 20 TB, and a large data warehouse as greater than 20 TB
What’s feeding massive data increase?
Digital channel data, content posted on social media, data collected by smart, connected devices (cf. the Internet of Things), real-time sensor readings, supply chain technologies such as RFID, etc.
Other data is available through “open data” initiatives (e.g. public services making data freely available), Google Maps, real-time data feeds (e.g. stock exchanges), etc.
Business trends that drive Business intelligence
Implementation of corporate performance management systems
Compliance with new regulatory frameworks
Importance of Customer Relationship Management (CRM) and one-to-one marketing
Trends such as market globalisation, company mergers, etc.
Digital business, digital marketing, social media, etc.
Idea of the data-driven organisation that competes on analytics
Corporate performance management
Also referred to as enterprise performance management (EPM), business performance management (BPM), strategic performance management (SPM), or simply performance management
Refers to a set of management processes, often supported by technology, that involve measuring and monitoring performance in support of better strategic decision making
“Trying to improve something without having a goal, a numerical goal, is like trying to lose weight without having a scale.” (Subir Chowdhury, ‘The Power of Six Sigma’)
Examples: balanced scorecard, six sigma, etc.
Customer relationship management
Customer Relationship Management (CRM) systems: designed to help firms manage customer interactions and maximise the customer lifetime value for the firm
“It is 6-7 times more expensive to gain a new customer than retain an existing customer.” (Bain & Co. study, HBR)
Operational CRM: systems supporting customer-facing processes (e.g. sales lead management, call centre & customer service support, etc.)
Analytical CRM: analysis of customer data to provide insights or models to optimise aspects of our customer relationships (e.g. which customer segments to target with retention campaign, cross-selling opportunities, etc.)
One of the key difficulties: having a “single customer view”
why Data Warehousing?
Relational databases are typically optimised to support everyday operations (e.g. sales transaction processing), not so much analytics activities
Working on same database may slow system down
Data structure not ideally suited
History typically not systematically kept
Different systems for various functional areas of the company (sales, marketing, customer service, production, etc.) data from all of which may be needed for analysis
Different data encodings/formats, inconsistencies, etc. that have to be reconciled first
Data warehousing (def)
How to define a data warehouse
“A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process” (Inmon)
“A copy of transaction data specifically structured for query and analysis” (Kimball)
Subject-oriented”
Organised around the major subjects of an enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and order processing)
“Integrated”
Integrates application-oriented data from different source systems, which must be made consistent to present a unified view to the users
Centralised & cross-functional
Common data sources:
Transaction processing systems, relational databases, ERP, etc.
Legacy systems
External data
Extraction, Transformation and loading (ETL)
Typically, ETL tools are used to set up and configure an automated system that regularly updates the data warehouse
Extract data from source systems (operational, legacy, etc.)
Identifying which records in the source have changed since the last update
Transform data
Substeps: format cleanse aggregate and merge enrich
Load transformed data into the data warehouse
Big data
Increasingly data comes from a multitude of different sources, is often unstructured and unintegrated, and there are ever larger amounts of it – hence the term big data
Storing and (processing) big data : Hadoop
Hadoop is an open-source framework that has become popular for distributed storage and parallel processing of massive amounts of data
Distributed file system that can spread the data over a large cluster of (inexpensive) machines
MapReduce – programming model that allows for large batch-processing jobs to be divided into smaller tasks that can be run in parallel
Originally developed by Google to index the exploding volume of content on the web
Hadoop environment may
complement (or replace) a traditional data warehouse
Online analytical processing (OLAP)
Online Analytical Processing (OLAP) is defined as:
interactive analysis of
large volumes of data
from multiple dimensions
Originally applied to data from data warehouse or data mart but now also OLAP-on-Hadoop engines emerging
OLAP tools allow you to interactively break down summary measures such as total unit sales, sales revenue, costs, etc. according to various available grouping criteria
Originally data pre-loaded internally in the form of an OLAP “cube” to facilitate analysis; this pre-processing step now increasingly made obsolete by in-memory analytics solutions
OALP operations : drilling
Drilling: navigating through a dimension hierarchy to desired level of detail
Drill down: go down the hierarchy or introduce extra dimension (i.e. break down in more detail)
Total sales
Total sales per city
Total sales per city per shop
Drill up or roll up: climb up hierarchy or reduce dimensions (i.e. get measure at more aggregate level)
Drill across: within same dimension select another attribute value
After viewing the results for 2011, change the selection to 2012
OALP operations - slicing and dicing
Slicing: take horizontal or vertical cut of cube, i.e. restrict one dimension
Sales data for product X
Sales data for shop A
Dicing: restrict two or more dimensions
Sales data for products X and Y, in shops A and B, during the summer
limits of OALP analysis
You could e.g. break down a product’s revenue according to customer demographics but what if there are 10,000s of products to manually investigate?
Also, does not yet produce any prediction for the future!
Example for customer attrition:
Basic reports or dashboards may show that overall churn rate (i.e. the percentage of customers lost) was up in last quarter
OLAP analysis may reveal churn rates were higher for this product group or that segment of customers
Predictive analytics: can we forecast churn rates for the next quarter, or, even better, identify which individual customers have highest probability of churning given the data we have on them?
Predictive analytics and data mining
Data mining/analytics: applying computational techniques to find interesting patterns or derive a predictive
model
To a large extent computer-driven discovery of (possibly unexpected) patterns or model building
as opposed to OLAP’s user-driven verification of hypothesised patterns
Typically involves more complex analyses at a deeper level of detail than simply aggregating data
Could involve building predictive models, i.e. which predict future events based on past observations (predictive analytics)
example applications
Credit scoring
Detecting fraud
Insurance claims prediction
Stock market forecasting
Market basket analysis (e.g. beer and diapers story)
Online recommender systems
Customer lifetime value (CLV) modelling, churn prediction, etc.
Targeted marketing: response modelling
Market segmentation
Types of analytics tasks
Predictive analytics: use past data to learn to predict some variable of interest (target variable) for an individual or instance, from other observable variables (input variables) – three types of tasks:
Classification
Regression or estimation
Forecasting
Descriptive analytics: identify and describe patterns present in the data (no separate target variable!)
Association analysis
Segmentation / clustering
predictive analytics - classification
Classification: use input variables to classify subject into one of two or more predefined target classes (e.g. predict whether individual customer will be good or bad payer, will churn or stay, will respond to campaign or not, etc.)
Example applications:
Use income, residential status, credit history, etc. to classify customer as likely “good” or “bad” payer (credit scoring)
Use recency, frequency, and monetary value of previous purchases, service calls made, etc. to classify customer as likely churner or not
Example models used are decision trees, scorecards, etc., but analysts can choose from range of statistical or machine learning techniques
predictive analytics - regression
Regression or estimation: predict value of a continuous (numeric) target variable (e.g. profit in GBP, loss, etc.)
Example applications:
Predict volume of future spending of customer, e.g. based on recency, frequency, and monetary value of previous purchases
Predict size of loss if customer defaults on loan based on customer characteristics, type and value of collateral, etc.
Again first collect sample from previous time period, build model for that sample, then apply to current customers
Example technique: linear regression (provides estimate for target variable using weighted sum of input variables, e.g.: estimated spending = 102 + 3.2 x age + 0.1 x income
predictive analytics - forecasting
Forecasting: regression over time-series data
Example applications:
Forecast monthly sales figures, stock prices, total energy consumption, etc., taking into account:
Autocorrelation: next period depends on recent periods
Long-term trend – upward or downward?
Seasonality (e.g. we always sell more in the winter holiday season)
Techniques include various types of time series an
descriptive analytics - association analysis
Detect frequently occurring patterns of items in a large transaction database
Association rule example:
If a customer buys bread, then (s)he is also likely to buy milk (based on previously observed transactions by our customers)
Example applications
Market basket analysis (e.g. selecting product coupons for individual customers, cross-selling related products, etc.)
Web analytics (e.g. shortcuts to pages often part of same visit)
Recommender systems (‘you may also like’)
…
descriptive analytics - segmentation / clustering
Identify clusters or segments of homogenous subjects (i.e. having similar values for a series of variables)
Example applications
Market segmentation: e.g. identify groups of customers that are similar in terms of demographics, interests, etc.
Can then think of best way to market products to a certain group, products they might be interested in, etc.
Note: no separate target variable that we are trying to predict! (difference with classification)
types of big data analytics
Based on type of input data, can distinguish between e.g.:
Text mining
Predicting car insurance fraud based on accident report
Sentiment analysis: is a post, review, etc. positive or negative?
Image processing
e.g., captioning/tagging images based on content they contain
Social network analytics
e.g., a telco making a more accurate prediction of the probability customer X will churn by also considering how many of the customers with whom X regularly calls have recently churned