wk 8 - business intelligence and big data Flashcards
Business intelligence (def)
Business Intelligence (BI) is the process of collecting, cleansing, combining, consolidating, analysing, interpreting and communicating internal and available external data, relevant for the decision making process in the organisation
why the interest in business data?
Availability of Data
Amount of digital data growing exponentially
In their 2010 report, Gartner defined a small data warehouse as less than 5 terabytes (TB), a medium data warehouse as 5 TB to 20 TB, and a large data warehouse as greater than 20 TB
What’s feeding massive data increase?
Digital channel data, content posted on social media, data collected by smart, connected devices (cf. the Internet of Things), real-time sensor readings, supply chain technologies such as RFID, etc.
Other data is available through “open data” initiatives (e.g. public services making data freely available), Google Maps, real-time data feeds (e.g. stock exchanges), etc.
Business trends that drive Business intelligence
Implementation of corporate performance management systems
Compliance with new regulatory frameworks
Importance of Customer Relationship Management (CRM) and one-to-one marketing
Trends such as market globalisation, company mergers, etc.
Digital business, digital marketing, social media, etc.
Idea of the data-driven organisation that competes on analytics
Corporate performance management
Also referred to as enterprise performance management (EPM), business performance management (BPM), strategic performance management (SPM), or simply performance management
Refers to a set of management processes, often supported by technology, that involve measuring and monitoring performance in support of better strategic decision making
“Trying to improve something without having a goal, a numerical goal, is like trying to lose weight without having a scale.” (Subir Chowdhury, ‘The Power of Six Sigma’)
Examples: balanced scorecard, six sigma, etc.
Customer relationship management
Customer Relationship Management (CRM) systems: designed to help firms manage customer interactions and maximise the customer lifetime value for the firm
“It is 6-7 times more expensive to gain a new customer than retain an existing customer.” (Bain & Co. study, HBR)
Operational CRM: systems supporting customer-facing processes (e.g. sales lead management, call centre & customer service support, etc.)
Analytical CRM: analysis of customer data to provide insights or models to optimise aspects of our customer relationships (e.g. which customer segments to target with retention campaign, cross-selling opportunities, etc.)
One of the key difficulties: having a “single customer view”
why Data Warehousing?
Relational databases are typically optimised to support everyday operations (e.g. sales transaction processing), not so much analytics activities
Working on same database may slow system down
Data structure not ideally suited
History typically not systematically kept
Different systems for various functional areas of the company (sales, marketing, customer service, production, etc.) data from all of which may be needed for analysis
Different data encodings/formats, inconsistencies, etc. that have to be reconciled first
Data warehousing (def)
How to define a data warehouse
“A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process” (Inmon)
“A copy of transaction data specifically structured for query and analysis” (Kimball)
Subject-oriented”
Organised around the major subjects of an enterprise (e.g. customers, products, and sales) rather than the major application areas (e.g. customer invoicing, stock control, and order processing)
“Integrated”
Integrates application-oriented data from different source systems, which must be made consistent to present a unified view to the users
Centralised & cross-functional
Common data sources:
Transaction processing systems, relational databases, ERP, etc.
Legacy systems
External data
Extraction, Transformation and loading (ETL)
Typically, ETL tools are used to set up and configure an automated system that regularly updates the data warehouse
Extract data from source systems (operational, legacy, etc.)
Identifying which records in the source have changed since the last update
Transform data
Substeps: format cleanse aggregate and merge enrich
Load transformed data into the data warehouse
Big data
Increasingly data comes from a multitude of different sources, is often unstructured and unintegrated, and there are ever larger amounts of it – hence the term big data
Storing and (processing) big data : Hadoop
Hadoop is an open-source framework that has become popular for distributed storage and parallel processing of massive amounts of data
Distributed file system that can spread the data over a large cluster of (inexpensive) machines
MapReduce – programming model that allows for large batch-processing jobs to be divided into smaller tasks that can be run in parallel
Originally developed by Google to index the exploding volume of content on the web
Hadoop environment may
complement (or replace) a traditional data warehouse
Online analytical processing (OLAP)
Online Analytical Processing (OLAP) is defined as:
interactive analysis of
large volumes of data
from multiple dimensions
Originally applied to data from data warehouse or data mart but now also OLAP-on-Hadoop engines emerging
OLAP tools allow you to interactively break down summary measures such as total unit sales, sales revenue, costs, etc. according to various available grouping criteria
Originally data pre-loaded internally in the form of an OLAP “cube” to facilitate analysis; this pre-processing step now increasingly made obsolete by in-memory analytics solutions
OALP operations : drilling
Drilling: navigating through a dimension hierarchy to desired level of detail
Drill down: go down the hierarchy or introduce extra dimension (i.e. break down in more detail)
Total sales
Total sales per city
Total sales per city per shop
Drill up or roll up: climb up hierarchy or reduce dimensions (i.e. get measure at more aggregate level)
Drill across: within same dimension select another attribute value
After viewing the results for 2011, change the selection to 2012
OALP operations - slicing and dicing
Slicing: take horizontal or vertical cut of cube, i.e. restrict one dimension
Sales data for product X
Sales data for shop A
Dicing: restrict two or more dimensions
Sales data for products X and Y, in shops A and B, during the summer
limits of OALP analysis
You could e.g. break down a product’s revenue according to customer demographics but what if there are 10,000s of products to manually investigate?
Also, does not yet produce any prediction for the future!
Example for customer attrition:
Basic reports or dashboards may show that overall churn rate (i.e. the percentage of customers lost) was up in last quarter
OLAP analysis may reveal churn rates were higher for this product group or that segment of customers
Predictive analytics: can we forecast churn rates for the next quarter, or, even better, identify which individual customers have highest probability of churning given the data we have on them?