Chapter 1 Flashcards

Question

What is data mining not?

Answer 1

"Searching", generating histograms, SQL queries etc. It is not simply statistical analysis.

Answer 2

To follow an integrated methodological process that involves translating the business needs into a problem which has to be analysed, retrieving the database needed to carry out the analysis and applying the data mining technique implemented in a computer algorithm with the final aim of achieving important results useful for taking a strategic decision.

Answer 3

The need to know the business, understand the data or to understand analytical methods. It assists in finding patterns and relationships in data, but doesn't provide information regarding the value of these patterns.

Answer 4

Extracting interesting data and information from archives and databases. Unlike DM, the criteria for extracting information are decided beforehand, so they are exogenous to the extraction itself.

Answer 5

Data retrieval - find all customers who missed one payment
Data mining - find all customers that are LIKELY to miss one payment (classification, scoring)
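
The contrast can be sketched in a few lines of Python. This is only an illustration: the customer records, the `risk_score` function and the 0.8 cut-off are all hypothetical, and the hand-written score stands in for a model that would normally be learned from historical data.

```python
# Hypothetical customer records: (customer_id, missed_payments, credit_utilisation)
customers = [
    ("c1", 1, 0.90),
    ("c2", 0, 0.85),
    ("c3", 0, 0.20),
    ("c4", 2, 0.95),
]

# Data retrieval: the criterion is fixed in advance and answered exactly.
missed_one = [cid for cid, missed, _ in customers if missed == 1]


def risk_score(utilisation: float) -> float:
    """Toy score in [0, 1]; a stand-in for a learned scoring model."""
    return min(1.0, max(0.0, utilisation))


# Data mining (scoring): estimate who is LIKELY to miss a payment,
# among customers who have not missed one yet.
likely_to_miss = [
    cid for cid, missed, utilisation in customers
    if missed == 0 and risk_score(utilisation) >= 0.8
]

print(missed_one)
print(likely_to_miss)
```

The retrieval query returns a fact already present in the data, while the scoring step produces an estimate about records for which the outcome is not yet known.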

Answer 6

The need to standardise methodologies and define best practices. The goal is to help organisations better understand the knowledge discovery process by providing a roadmap to follow while planning and executing the project.

Answer 7

- Knowledge discovery in databases (KDD) process model
- Cross-industry standard process for data mining (CRISP-DM)
- Sample, explore, modify, model and assess (SEMMA) data mining framework

Answer 8

This process model is centred around the overall process of knowledge discovery from data. This includes:
- How the data are stored
- How the data are accessed
- How algorithms can be scaled to enormous datasets efficiently
- How results can be interpreted and visualised

Answer 9

9 Steps:
1 - Developing an understanding of the application domain
2 - Creating a target data set (selecting)
3 - Data cleaning and preprocessing
4 - Data reduction and projection
5 - Choosing the data mining task
6 - Choosing the data mining algorithm
7 - Data mining
8 - Interpreting mined patterns
9 - Consolidating discovered knowledge

Answer 10

CRISP-DM provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit. According to CRISP-DM, a given data mining project has a life cycle consisting of six phases:
1 - Business understanding
2 - Data understanding
3 - Data preparation
4 - Modelling
5 - Evaluation
6 - Deployment

Answer 11

Adaptive. The next phase in the sequence often depends on the outcomes associated with the previous phase. Depending on the behaviour and characteristics of the model, we may have to return to previous phases for further refinement before moving on to the next.

Answer 12

It is iterative and allows room for error. The model is characterised by an easy-to-understand vocabulary and good documentation.

Answer 13

A data mining methodology that focusses on the logical organisation of the model development phase of data mining projects. The acronym refers to the core process of conducting data mining:
- Sample
- Explore
- Modify
- Model
- Assess
It begins with a statistically representative sample of the data.

Answer 14

Big enough to contain significant information yet small enough to manipulate quickly.

Answer 15

It reduces the processing time required to get crucial business information. If general patterns appear in the data as a whole, these will be traceable in the representative sample.

Answer 16

Summary methods

Answer 17

Helps refine the discovery process.

Answer 18

Statistical techniques - Factor analysis - Correspondence analysis - Clustering

Answer 19

Creating, selecting and transforming the variables to focus the model selection process. Manipulate the data to include information, look for outliers, reduce the number of variables etc.

Answer 20

Allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.

Answer 21

Evaluating the usefulness and reliability of the findings from the data mining process, and estimating how well the model performs. You can test the model by splitting the data into training and test sets, or by testing the model against known data. Practical applications of the model help prove its validity.
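
A minimal sketch of the train/test idea, using only the standard library. The data set, the threshold "model" and the helper names (`train_test_split`, `fit_threshold`, `accuracy`) are assumptions made for illustration, not part of SEMMA itself.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_threshold(train):
    """Toy model: a cut-off halfway between the two class means."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def accuracy(threshold, rows):
    """Share of rows where the predicted class matches the known label."""
    correct = sum((x > threshold) == (y == 1) for x, y in rows)
    return correct / len(rows)


# Hypothetical labelled data: the label is 1 when the feature is at least 50.
data = [(x, 1 if x >= 50 else 0) for x in range(0, 100, 5)]

train, test = train_test_split(data)
threshold = fit_threshold(train)       # learned from the training rows only
test_accuracy = accuracy(threshold, test)  # assessed on rows the model never saw
print(f"threshold={threshold:.1f}, test accuracy={test_accuracy:.2f}")
```

The point of the split is that the accuracy is measured on data that played no part in fitting, so it gives a fairer estimate of how the model would perform in practice.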

Answer 22

The return on investment from the mining process is realised.

Answer 23

Quality and quantity of available data.

Answer 24

A single unit of information, where each feature can take a number of different values

Answer 25

As flat (rectangular) files and in other formats using databases and data warehouses.

Answer 26

Records, examples, units, cases, individuals, data points

Answer 27

Entities that are described by one or more features.

Answer 28

Multivariate data refers to data in which an object is described by many features. Univariate data refers to the situation in which a single feature describes an object.

Answer 29

Objects described by the same features

Answer 30

To store data in a simple text file format. They are often generated from data stored in other, more complex formats, such as spreadsheets or databases.

Answer 31

- Databases
- Data warehouses
- Advanced database systems
- Object-oriented and object-relational databases
- Data-specific databases, such as transactional, spatial, temporal, text and multimedia databases
- The World Wide Web

Answer 32

A database that is maintained separately from an organisation's operational databases. Data warehouse systems permit the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.
FORMAL: A data warehouse is a
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
collection of data in support of management's decision-making process.

Answer 33

A data warehouse is organised around a set of subjects of interest to the user, e.g. the customer, supplier or product, rather than by application.

Answer 34

The data warehouse must be able to integrate the multitude of standards used by the different applications from which data is collected, e.g. different ways to represent gender (M/F, 0/1 etc.). The data warehouse must recognise different coding conventions and units of measurement and reconcile them (i.e. transform the data) before storing it.
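
As a rough illustration of this kind of reconciliation, the sketch below maps two source systems onto one convention. The source names (`billing`, `clinic`), their codes and the `harmonise` helper are hypothetical.

```python
# Hypothetical per-source rules: how each system codes gender and
# which factor converts its height unit to centimetres.
SOURCE_RULES = {
    "billing": {"gender": {"M": "male", "F": "female"}, "height_to_cm": 1.0},
    "clinic": {"gender": {"0": "male", "1": "female"}, "height_to_cm": 2.54},  # inches
}


def harmonise(source, record):
    """Translate one source record into the warehouse's single convention."""
    rules = SOURCE_RULES[source]
    return {
        "gender": rules["gender"][str(record["gender"])],
        "height_cm": round(record["height"] * rules["height_to_cm"], 1),
    }


a = harmonise("billing", {"gender": "F", "height": 170})
b = harmonise("clinic", {"gender": 1, "height": 70})
print(a)
print(b)
```

Once every source passes through such a mapping before loading, queries against the warehouse no longer need to know which application a record came from.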

Answer 35

Data are stored to provide information from a historical perspective. Data are not updated in the warehouse; additional data are simply loaded into the warehouse and integrated accordingly.

Answer 36

A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Two operations usually occur: the loading of data and the access of data. You don't actually take the data away from the source; you make a copy of it.

Answer 37

A process of:
- Data cleaning
- Data transformation
- Data integration
- Data loading

Answer 38

Integrating data from multiple heterogeneous sources to support structured and/or ad-hoc queries, analytical reporting and decision making, e.g. a database for each hospital integrated into a data warehouse, enabling analysis across all hospitals.

Answer 39

Two ways:
1 - Based on the creation of a single centralised archive that collects all company information and integrates it with information coming from the outside
2 - Brings together different thematic databases, called data marts, that are not initially connected among themselves, but which can evolve to create a perfectly interconnected structure

Answer 40

A data mart is a subset of a data warehouse focused on a particular line of business, department or subject area.

Answer 41

- Allows system administrators to continually control the quality of the data
- Requires careful programming to allow future expansion to load new data

Answer 42

- It is initially easier to implement
- Significant cleaning and transformation may be required (the problem of integration)

Answer 43

A convenient way to get a quick, summarised look at the data by providing a multidimensional (e.g. 3D) view of it. It allows fast access to the summarised data via pre-computation.

Answer 44

- Each dimension corresponds to an attribute or a set of attributes selected by the user to be included in the schema
- Each cell (value) in the database corresponds to some summarised (aggregated) measure, such as average, count, minimum etc.
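
A small sketch of this structure, assuming a hypothetical sales table with three dimensions (year, region, product) and a summed amount as the measure. Real OLAP servers store cubes far more efficiently; a dictionary is used here only to make the dimension/cell idea concrete.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, region, product, amount)
sales = [
    (2023, "north", "tea", 10),
    (2023, "north", "coffee", 20),
    (2023, "south", "tea", 5),
    (2024, "north", "tea", 15),
    (2024, "south", "coffee", 30),
    (2024, "south", "coffee", 2),
]


def build_cube(facts):
    """Pre-compute the total amount for every (year, region, product) cell."""
    cube = defaultdict(int)
    for year, region, product, amount in facts:
        cube[(year, region, product)] += amount
    return dict(cube)


cube = build_cube(sales)

# Each key is one cell addressed by its dimension values;
# each value is the aggregated measure for that cell.
print(cube[(2024, "south", "coffee")])
```

Because the totals are computed once when the cube is built, later lookups are simple dictionary reads rather than scans over the raw facts.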

Answer 45

- A relational database - A multidimensional data cube

Answer 46

On-Line Analytical Processing (OLAP). It is an approach which can quickly provide answers to analytical queries that are dimensional in nature. OLAP operations make use of background knowledge regarding the domain of the data to allow the presentation of data at different levels of abstraction. This accommodates different user viewpoints and provides a user-friendly environment for interactive data analysis.

Answer 47

The availability of multidimensional data and the precomputation of summarised data.

Answer 48

- Reporting of sales
- Marketing
- Management reporting
- Business process management (BPM)
- Budgeting
- Forecasting
- Financial reporting

Answer 49

Drill-down and roll-up - they allow the user to view the data at different levels.

Answer 50

Breaks the data into sub-ranges to present the user with more detailed information.

Answer 51

Merges the data along one or more dimensions to provide the user with a higher-level summarisation.

Answer 52

Slice and dice

Answer 53

Performs a selection on one dimension of the given cube, resulting in a subcube.

Answer 54

Defines a subcube by performing a selection on two or more dimensions.
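
The four OLAP operations (roll-up, drill-down, slice, dice) can be sketched on a tiny cube held in a dictionary. The cube contents and the helper names `roll_up` and `select` are hypothetical; this is an illustration of the ideas, not how an OLAP engine implements them.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, region, product) -> total sales
cube = {
    (2023, "north", "tea"): 10,
    (2023, "north", "coffee"): 20,
    (2023, "south", "tea"): 5,
    (2024, "north", "tea"): 15,
    (2024, "south", "coffee"): 30,
}
DIMS = ("year", "region", "product")


def roll_up(cells, keep):
    """Aggregate away every dimension not listed in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in positions)] += value
    return dict(out)


def select(cells, **conditions):
    """Keep the cells whose value on each named dimension is in the allowed set."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    }


by_year = roll_up(cube, keep=("year",))                   # roll-up: coarser view
by_year_region = roll_up(cube, keep=("year", "region"))   # drill-down from by_year: finer view
slice_2023 = select(cube, year={2023})                    # slice: selection on one dimension
dice_2023_tea = select(cube, year={2023}, product={"tea"})  # dice: selection on two dimensions

print(by_year)
print(dice_2023_tea)
```

Roll-up and drill-down change the level of detail, whereas slice and dice keep the level of detail and restrict which cells are shown.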

Answer 55

A web data warehouse. Web houses offer users web logs and other such information. The data warehouse has been forced to adapt to new requirements for web-based data.

Answer 56

Data warehouse - collects information about subjects that span an entire organisation; its scope is enterprise-wide. It is a single organisational repository of enterprise-wide data across many or all subject areas, and the authoritative repository of all the fact and dimension data at an atomic level.
Data mart - a departmental subset of a data warehouse. It is a thematic database which focuses on selected subjects, so its scope is department-wide. It was originally oriented towards the marketing field. It is a specific, subject-oriented repository of data designed to answer specific questions for a specific set of users.

Answer 57

An organisation could have multiple data marts serving the needs of marketing, sales, operations, collections etc.

Answer 58

The data warehouse

Answer 59

The data warehouse

Answer 60

The data mart - it only contains the required subject specific data for local analysis.

Answer 61

A data lake is a centralised storage repository that stores, processes, and secures large amounts of raw data in its native format. This includes structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed.

Answer 62

- DATA - DW has structured and processed data; DL has structured, semi-structured, unstructured and raw data
- PROCESSING - DW is schema-on-write; DL is schema-on-read
- STORAGE - DW is expensive for large data volumes; DL is designed for low-cost storage
- AGILITY - DW is less agile with a fixed configuration; DL is highly agile and can be configured and reconfigured as needed

Answer 63

Before we load data into a data warehouse, it needs to be given some shape and structure, i.e. we need to model it.

Answer 64

Data in a data lake is loaded in as raw data, then when the data is to be used it is given shape and structure.
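
The difference between schema-on-write and schema-on-read can be sketched as follows. The event format, the `to_row` schema and the in-memory `warehouse` and `lake` lists are assumptions made for illustration only.

```python
import json

# Hypothetical raw events as they might arrive from source systems.
raw_events = [
    '{"user": "u1", "amount": "12.50"}',
    '{"user": "u2", "amount": "oops"}',
]


def to_row(event_text):
    """Apply the schema: `user` is text, `amount` is a float."""
    event = json.loads(event_text)
    return {"user": str(event["user"]), "amount": float(event["amount"])}


# Schema-on-write (warehouse): shape and validate BEFORE storing.
warehouse, rejected = [], []
for text in raw_events:
    try:
        warehouse.append(to_row(text))
    except (KeyError, ValueError):
        rejected.append(text)  # does not fit the model, so it is not loaded

# Schema-on-read (lake): store the raw text as-is; shape only when it is used.
lake = list(raw_events)


def read_lake(store):
    """Give the raw records shape and structure at query time."""
    rows = []
    for text in store:
        try:
            rows.append(to_row(text))
        except (KeyError, ValueError):
            continue  # the raw record stays in the lake untouched
    return rows


lake_rows = read_lake(lake)
print(len(warehouse), len(rejected), len(lake), len(lake_rows))
```

In the lake, the malformed record is still available if a later use case can interpret it differently; in the warehouse it never made it past loading.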

Answer 65

- Open-source software (licensing and community support are free)
- Designed to be installed on low-cost hardware

Answer 66

Map the business problem to a good data mining modelling technique that is based on some statistical or machine learning algorithm.

Answer 67

Two categories:
- Predictive modelling techniques
- Descriptive modelling techniques

Answer 68

The prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature and the other features. The predicted value need not lie in the future, e.g. estimating a conception date.

Answer 69

The one being predicted

Answer 70

Supervised learning - predictive models are given clear instruction on what they need to learn and how to learn it.

Answer 71

The fact that the target value provides a way for the learner to know how well it has learned the desired task. SL attempts to optimise a function to find the combination of feature values that results in the target output.

Answer 72

CLASSIFICATION - predicting which category an example belongs to
REGRESSION - predicting numeric (continuous) data
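
A minimal sketch of the two supervised tasks. The nearest-neighbour rule and the least-squares line are chosen here only as simple representatives of each task, and the data are made up.

```python
def nearest_neighbour_class(x, examples):
    """Classification: return the label of the closest training example."""
    return min(examples, key=lambda example: abs(example[0] - x))[1]


def fit_line(points):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The target is a category, so the prediction is a category.
labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
predicted_class = nearest_neighbour_class(7.5, labelled)

# The target is a number, so the prediction is a number.
slope, intercept = fit_line([(1, 3), (2, 5), (3, 7)])
predicted_value = slope * 4 + intercept

print(predicted_class, predicted_value)
```

In both cases the known target values in the training data guide the learner, which is what makes the tasks supervised.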

Answer 73

It is easy to convert numbers into categories and categories into numbers.
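
Both directions of the conversion can be sketched briefly. The age cut-offs and category names are arbitrary choices for illustration.

```python
CATEGORIES = ["minor", "adult", "senior"]


def to_category(age):
    """Numbers to categories: discretise a numeric feature into labelled bins."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


def one_hot(value, categories):
    """Categories to numbers: one 0/1 indicator per known category."""
    return [1 if value == category else 0 for category in categories]


ages = [12, 40, 70]
labels = [to_category(age) for age in ages]
encoded = [one_hot(label, CATEGORIES) for label in labels]

print(labels)
print(encoded)
```

Binning lets a classification method work with a numeric target, while indicator encoding lets a numeric method work with categorical inputs.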

Answer 74

A descriptive model is used for tasks that would benefit from the insight gained from summarising data in new and interesting ways.

Answer 75

Predictive models predict a target of interest, whereas in descriptive modelling no single feature is more important than any other. This corresponds to supervised vs unsupervised learning.

Answer 76

There is no target to learn. Unlike supervised learning, unsupervised machine learning models are given unlabeled data and allowed to discover patterns and insights without any explicit guidance or instruction.

Answer 77

- Pattern discovery
  - Market basket analysis
- Clustering
  - Segmentation analysis

Answer 78

Predictive modelling

Answer 79

To identify useful associations within data. PD is often used for market basket analysis on transactional purchase data. The aim is to identify items that are frequently purchased together, so that the learned information can be used to refine marketing tactics.
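
A minimal sketch of the counting step behind market basket analysis: it computes the support of every item pair (the share of baskets containing both items) and keeps the frequent ones. The baskets and the 0.5 support threshold are hypothetical, and full algorithms such as Apriori add pruning and rule generation on top of this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "tea"},
    {"bread", "butter", "tea"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of all baskets that contain the pair.
support = {pair: count / len(baskets) for pair, count in pair_counts.items()}
frequent = {pair: s for pair, s in support.items() if s >= 0.5}

print(frequent)
```

Pairs that survive the threshold are candidates for actions such as shelf placement or joint promotions.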

Answer 80

- Move items close together - Run promotions to "up-sell" associated items

Answer 81

Dividing a dataset into homogeneous groups. This is sometimes used for segmentation analysis.
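
As an illustration, the sketch below groups customers by a single numeric feature using k-means (Lloyd's algorithm), one common clustering method among many. The spend values and the starting centres are made up.

```python
def k_means_1d(values, centres, iterations=10):
    """Lloyd's algorithm on one numeric feature."""
    groups = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest centre.
        groups = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            groups[nearest].append(value)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its previous centre).
        centres = [
            sum(group) / len(group) if group else centre
            for group, centre in zip(groups, centres)
        ]
    return centres, groups


# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centres, groups = k_means_1d(spend, centres=[0.0, 50.0])

print(centres)
print(groups)
```

No target labels are supplied: the groups emerge from the data alone, and it is left to the analyst to decide what each group means.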

Answer 82

It identifies groups of individuals with similar behaviour or demographic information, so that advertising campaigns can be tailored for particular audiences

Answer 83

A machine is capable of identifying clusters, but human intervention is required to interpret them, e.g. we need to understand the differences among groups to create the promotion that best suits each group.

Answer 84

- Classification
- Numeric prediction
- Pattern detection
- Clustering
The task drives the choice of the algorithm.

(112 cards)