Chapter 1 Flashcards

1
Q

What has allowed for the development of data science?

A

Cheaper data storage, faster hardware and advances in algorithms

2
Q

Define data science, data analytics and data mining

A

Data Science: The interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data.

Data Analytics: The process of examining data sets to uncover patterns, trends, and insights, often with the aim of making informed business decisions.

Data Mining: The practice of discovering hidden patterns and relationships in large datasets using statistical and machine learning techniques.

3
Q

What is data mining?

A

Data mining refers to extracting or “mining” knowledge from large amounts of data.

Includes sophisticated algorithms for analysing data that can’t be analysed manually.

It involves selecting, exploring and modelling large quantities of data to discover patterns or relations. These patterns are at first unknown. We want to obtain clear and useful results for the owner of the database (using previous data to make predictions).

4
Q

What are other terms for data mining?

A

Knowledge discovery in databases (KDD)
Knowledge mining from databases
Knowledge discovery
Knowledge extraction
Data/pattern analysis

5
Q

What are some examples of large datasets that may require data mining techniques?

A
  • Supermarket information on transactions and customers
  • NASA Earth Observation System
  • Modern biology information on human genomes etc.
6
Q

What is data?

What are some examples of data?

A

Any facts, numbers, text, images, audio, maps etc. that can be processed by a computer.

Data is information such as facts and numbers used to analyse something or make decisions.

Examples include:
- Transactions in supermarkets
- Stock market figures
- Words or sentences in a book
- Road layouts
- Satellite images

7
Q

What is a defining characteristic of data mining compared to other methods?

A

Data mining is data driven, whereas other methods are often model driven

8
Q

What is the difference in the size of the dataset used in statistics vs data mining?

A

In statistics, we want to find the smallest data size that gives sufficiently confident estimates.

In data mining, we want the data size to be large and we are interested in building a model that is small (not too complex) but still describes the data well.

9
Q

What kind of techniques/disciplines are involved in data mining?

A
  • Database technology
  • Statistics
  • Machine learning
  • High-performance computing
  • Pattern recognition
  • Data visualisation
  • Information retrieval
10
Q

What do data mining, machine learning and deep learning have in common?

A

They have the same goal: to extract insights, patterns and relationships that can be used to make decisions.

But they have different approaches and abilities.

11
Q

Describe data mining

A

Data mining can be considered a superset of many different methods to extract insights from data.

Data mining applies methods from many different areas to identify previously unknown patterns from data.

This can include:
- Statistical algorithms
- Machine learning
- Text analysis
- Time series analysis
- Other areas of analytics

DM also includes the study and practice of data storage and data manipulation.

12
Q

Describe machine learning

A

Just like statistical models, the goal is to understand the structure of the data. ML has developed based on the ability to use computers to probe the data for structure, even if we do not have a theory of what the structure looks like.

The test for an ML model is the validation error on new data, not a theoretical test of a null hypothesis.

ML often uses an iterative approach to learn from data, so the learning can be easily automated. Passes are run through the data until a robust pattern is found.

13
Q

Describe deep learning

A

Combines advances in computing power and special types of neural networks to learn complicated patterns in large amounts of data.

DL techniques are currently state of the art for identifying objects in images and words in sounds.

Future uses: automatic language translation and medical diagnoses

14
Q

What were the steps in the development of data mining?

A

< 1960 - data collection and database creation (simplistic filing systems)

1970s - 1980s - database management systems (hierarchical database systems and databases consisting of named tables were developed to organise data)

1980s - present - advanced database systems (developed further for data retrieval)

1990s - present - web-based database systems (with the internet boom)

1980s - present - data warehousing and data mining (to uncover previously unknown patterns in the data)

15
Q

What does "data rich but information poor" allude to?

A

There has been a dramatic increase in the amount of stored data.

This far exceeds human ability for comprehension without powerful tools.

16
Q

What does target marketing involve?

A

The supermarket finds patterns (clusters) of similar customers and targets these people with certain products. This is more cost effective than simply sending products to all customers.

Transaction data can model customer purchase patterns over time.

Customer similarity:
- Spending habits
- Income
- Interests
- Shopping patterns

17
Q

What does cross-market analysis involve?

A

The supermarket finds patterns (associations) between products and markets them accordingly.

This is often referred to as market basket analysis.
We need to understand the data and the direction of the relationship (eg the beer and diapers example).

18
Q

What are the three steps of customer relationship management?

A

1 - acquiring new customers
2 - increasing the value of the customer
3 - retaining good customers

eg banks

19
Q

What kinds of techniques may identify customers who would pose less risk?

A

Classification or scoring techniques

20
Q

What technique may help the bank identify customer needs and offer linked products, increasing bank revenue?

A

Cross-market analysis

By identifying customers that are most profitable, or who will be most profitable in the future, they aim to retain them.

21
Q

Define financial analysis

A

The identification of financial trends and patterns over time (time series analysis) to ensure organisations can maximise profit

eg the stock market, business profit

22
Q

Define competition

A

The ability to monitor competitors and market directions. The ability to set price strategies in a highly competitive market.

23
Q

Define forecasting

A

The identification of patterns and the prediction of future events by analysing past and present data and trends.

Eg weather forecasting identifies weather patterns.
Classification - rain or sun
Continuous - temperature etc.

24
Q

Define fraud detection

A

A database may contain objects that do not comply with the general behaviour or model of the data. Data mining can involve identifying patterns and hence identifying any outliers (anomaly discovery)

Distance measures can be used: objects that are a substantial distance from any cluster are considered outliers.
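
The distance idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: instead of measuring distance to clusters, it flags any point whose nearest neighbour is farther away than a chosen threshold. The function name `find_outliers`, the threshold and the transaction data are all hypothetical.

```python
import math

def find_outliers(points, threshold):
    """Return points whose nearest neighbour is farther away than `threshold`."""
    outliers = []
    for i, p in enumerate(points):
        # Distance from p to its closest other point
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if nearest > threshold:
            outliers.append(p)
    return outliers

# Hypothetical transactions as (amount, hour of day)
transactions = [(20, 10), (22, 11), (19, 9), (21, 10), (500, 3)]
suspicious = find_outliers(transactions, threshold=50)
```

Here only the unusually large transaction is far from every other point. In practice the features would need to be scaled first, since amount and hour are on very different ranges.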

25
Q

What is data mining not?

A

“Searching”, generating histograms, SQL queries etc.

It is not simply statistical analysis.

26
Q

What does it mean to apply data mining methodology?

A

To follow an integrated methodological process: translate the business needs into a problem to be analysed, retrieve the data needed to carry out the analysis, and apply a data mining technique implemented as a computer algorithm, with the final aim of obtaining results that are useful for making a strategic decision.

27
Q

What can data mining not eliminate?

A

The need to know the business, understand the data or to understand analytical methods.

It assists in finding patterns and relationships in data, but doesn’t provide information regarding the value of these patterns.

28
Q

What is data retrieval?

A

Extracting interesting data and information from archives and databases.

Unlike DM, the criteria for extracting information are decided beforehand, so they are exogenous to the extraction itself.

29
Q

Give an example of the difference in a query for data retrieval vs data mining

A

Data retrieval - find all customers who missed one payment

Data mining - find all customers that are LIKELY to miss one payment (classification, scoring)

30
Q

What led to the creation of many data mining process frameworks?

A

The need to standardise methodologies and define best practices.

The goal is to help organisations better understand the knowledge discovery process by providing a roadmap to follow while planning and executing a project.

31
Q

What are the most popular data mining process frameworks?

A
  • Knowledge discovery in databases (KDD) process model
  • Cross-industry standard process for data mining (CRISP-DM)
  • Sample, explore, modify, model and assess (SEMMA) data mining framework
32
Q

What are the main concepts of the knowledge discovery in databases (KDD) process model?

A

This process model is centred around the overall process of knowledge discovery from data.

This includes
- How the data are stored
- How they are accessed
- How algorithms can be scaled to enormous datasets efficiently
- How results can be interpreted and visualised

33
Q

What is the leading methodology for data mining?

A

CRISP-DM

34
Q

What are the steps of the KDD process model?

A

9 Steps:

1 - Developing an understanding of the application domain
2 - Creating a target data set (selection)
3 - Data cleaning and preprocessing
4 - Data reduction and projection
5 - Choosing the data mining task
6 - Choosing the data mining algorithm
7 - Data mining
8 - Interpreting mined patterns
9 - Consolidating discovered knowledge

35
Q

What are the main steps of CRISP-DM?

A

CRISP-DM provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.

According to CRISP-DM, a given data mining project has a life cycle consisting of six phases:
1 - business understanding
2 - data understanding
3 - data preparation
4 - modelling
5 - evaluation
6 - deployment

36
Q

How is the phase-sequence of CRISP-DM described?

A

Adaptive

The next phase in the sequence often depends on the outcomes associated with the previous phase.

Depending on the behaviour and characteristics of the model, we may have to return to previous phases for further refinement before moving on to the next.

37
Q

What is an advantage of the CRISP-DM process?

A

It is iterative, which allows room for error.

The model is characterised by an easy-to-understand vocabulary and good documentation.

38
Q

What is the SEMMA framework?

A

A data mining methodology that focusses on logical organisation of the model development phase of data mining projects.

The acronym refers to the core process of conducting data mining.
- Sample
- Explore
- Modify
- Model
- Assess

It begins with a statistically representative sample of the data.

39
Q

When sampling your data, how big should the portion of data extracted be?

A

Big enough to contain significant information yet small enough to manipulate quickly.

40
Q

What is the advantage of mining a representative sample instead of the whole volume?

A

It reduces the processing time required to get crucial business information.

If general patterns appear in the data as a whole, these will be traceable in the representative sample.

41
Q

If a niche is so tiny that it is not represented in a sample, but so important that it influences the big picture, how can it be discovered?

A

Summary methods

42
Q

What does exploration allow?

A

Helps refine the discovery process.

43
Q

If visual exploration does not reveal clear trends, what can you use?

A

Statistical techniques
- Factor analysis
- Correspondence analysis
- Clustering

44
Q

What does modifying the data involve?

A

Creating, selecting and transforming the variables to focus the model selection process.

Manipulate the data to include information, look for outliers, reduce the number of variables etc.

45
Q

How do you model the data?

A

By allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.

46
Q

How do you assess your data?

A

By evaluating the usefulness and reliability of the findings from the data mining process and estimating how well the model performs.

You can test the model by splitting the data into training and test sets, or by testing the model against known data.

Practical applications of the model help demonstrate its validity.
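
A hold-out test can be sketched as follows. This is a minimal illustration, not a prescribed method: the helper names `train_test_split` and `accuracy`, the 25% test fraction, the toy data and the stand-in model are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, labelled_records):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for x, y in labelled_records if model(x) == y)
    return correct / len(labelled_records)

# Hypothetical data: the label is True when the value exceeds 50
data = [(x, x > 50) for x in range(100)]
train, test = train_test_split(data)

# A stand-in "model"; a real one would be fitted on `train` only
model = lambda x: x > 50
test_accuracy = accuracy(model, test)
```

The point of the split is that `test` plays the role of new data: the model is never fitted on it, so its accuracy there estimates how well the model generalises.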

47
Q

What occurs during model deployment?

A

The return on investment from the mining process is realised.

48
Q

What can data mining and knowledge discovery depend on?

A

Quality and quantity of available data.

49
Q

What is a value of a feature/attribute/variable?

A

A single unit of information, where each feature can take a number of different values

50
Q

What are datasets composed of?

A

Objects

51
Q

What are objects described by?

A

Features

52
Q

How are datasets stored?

A

As flat (rectangular) files and in other formats using databases and data warehouses.

53
Q

What are objects also known as?

A

Records, examples, units, cases, individuals, data points

54
Q

What do objects represent?

A

Entities that are described by one or more features.

55
Q

What is multivariate vs univariate data?

A

Multivariate data refers to data in which an object is described by many features.

Univariate data refers to the situation in which a single feature describes an object.

56
Q

What are datasets composed of?

A

Objects described by the same features

57
Q

What are flat files used for?

A

To store data in a simple text file format.

They are often generated from data stored in other, more complex formats, such as spreadsheets or databases.

58
Q

Data mining tools used in the knowledge discovery process can be applied to a great variety of other data formats such as:

A
  • Databases
  • Data warehouses
  • Advanced database systems
  • Object-oriented and object-relational databases
  • Data-specific databases, such as transactional, spatial, temporal, text and multimedia databases
  • The World Wide Web
59
Q

What is a data warehouse?

A

A database that is maintained separately from an organisation’s operational databases.

Data warehouse systems permit the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.

FORMAL: A data warehouse is a:
- Subject-oriented
- Integrated
- Time variant
- Non-volatile
collection of data in support of management’s decision making process.

60
Q

What does subject-oriented mean?

A

A data warehouse is organised around a set of subjects of interest to the user.

Eg the customer, supplier or product, rather than by application

61
Q

What does integrated mean?

A

The data warehouse must be able to integrate itself perfectly with the multitude of standards used by different applications from which data is collected.

eg different ways to represent gender (M/F, 0/1 etc)

The data warehouse must recognise different coding conventions and units of measurement before storing the data (eg by transforming the data)
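
As a rough sketch of this kind of integration, the snippet below recodes a gender field from two source systems into one warehouse convention. The source names (`billing`, `crm`), their codes and the target labels are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical coding conventions used by two source applications
SOURCE_CONVENTIONS = {
    "billing": {"M": "male", "F": "female"},
    "crm": {"0": "male", "1": "female"},
}

def integrate(record, source):
    """Return a copy of `record` with gender recoded to the warehouse standard."""
    mapping = SOURCE_CONVENTIONS[source]
    cleaned = dict(record)
    cleaned["gender"] = mapping[str(record["gender"])]
    return cleaned

warehouse = [
    integrate({"id": 1, "gender": "F"}, "billing"),
    integrate({"id": 2, "gender": 1}, "crm"),
]
```

Both records end up with the same representation, so later analysis does not need to know which application each one came from.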

62
Q

What does time-variant mean?

A

Data are stored to provide information from a historical perspective. Data are not updated in the warehouse; additional data are simply loaded into the warehouse and integrated accordingly.

63
Q

What does non-volatile mean?

A

A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment.

Two operations usually occur - the loading of data and the access of data. You don’t actually take the data away, you make a copy of it.

64
Q

How are data warehouses constructed?

A

A process of:
- Data cleaning
- Data transformation
- Data integration
- Data loading

65
Q

How may a data warehouse be constructed?

A

Integrating data from multiple heterogeneous sources to support structured and/or ad-hoc queries, analytical reporting and decision making.

Eg a database for each hospital integrated into a data warehouse, enabling analysis across all hospitals.

66
Q

How many ways are there to construct a data warehouse? What are they?

A

Two ways

1 - based on the creation of a single centralised archive that collects all company information and integrates it with information coming from the outside

2 - brings together different thematic databases, called data marts, that are not initially connected among themselves, but which can evolve to create a perfectly interconnected structure

67
Q

What is a data mart?

A

A data mart is a subset of a data warehouse focused on a particular line of business, department or subject area.

68
Q

What is the advantage and disadvantage of constructing a data warehouse from integrating company information with outside information?

A
  • Allows system administrators to continually control quality of the data
  • Requires careful programming to allow future expansion to load new data
69
Q

What is the advantage and disadvantage of constructing a data warehouse from data marts?

A
  • It is initially easier to implement
  • Significant cleaning and transformation may be required - the problem of integration
70
Q

What does a data cube allow you to do?

A

It gives a quick, multidimensional (eg 3D) view of summarised data, and allows fast access to the summarised data via pre-computation.

71
Q

A data warehouse uses a multidimensional database structure. What do dimensions and cells correspond to?

A
  • Each dimension corresponds to an attribute or a set of attributes selected by the user to be included in the schema
  • Each cell (value) in the database corresponds to some summarised (aggregated) measure, such as average, count, minimum etc.
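
A toy version of this structure, assuming three dimensions (product, region, quarter) and a summed measure, might look like the sketch below. The sales facts and the `build_cube` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, amount)
sales = [
    ("tea", "north", "Q1", 10),
    ("tea", "south", "Q1", 5),
    ("tea", "north", "Q2", 7),
    ("coffee", "north", "Q1", 12),
    ("coffee", "north", "Q1", 3),
]

def build_cube(facts):
    """Pre-compute the total amount for each (product, region, quarter) cell."""
    cube = defaultdict(int)
    for product, region, quarter, amount in facts:
        cube[(product, region, quarter)] += amount
    return dict(cube)

cube = build_cube(sales)
```

Each key is one cell of the cube and each value is the aggregated measure for that cell, so a later query is a dictionary lookup rather than a scan of the raw facts.
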
72
Q

How can a data warehouse be implemented?

A
  • A relational database
  • A multidimensional data cube
73
Q

What does OLAP stand for and what is it?

A

Online Analytical Processing (OLAP)

It is an approach which can quickly provide answers to analytical queries that are dimensional in nature.

OLAP operations make use of background knowledge regarding the domain of the data to allow the presentation of data at different levels of abstraction. This accommodates different user viewpoints and provides a user-friendly environment for interactive data analysis.

74
Q

Why are data warehouse systems well suited for OLAP?

A

The availability of multidimensional data and the precomputation of summarised data.

75
Q

What are typical applications of OLAP in business?

A
  • Reporting of sales
  • Marketing
  • Management reporting
  • Business process management (BPM)
  • Budgeting
  • Forecasting
  • Financial reporting
76
Q

What are two common OLAP operations?

A

Drill-down and roll-up - they allow the user to view the data at different levels.

77
Q

Describe drill-down

A

Breaks the data into sub-ranges to present the user with more detailed information.

78
Q

Describe roll-up

A

Merges the data at one or more dimensions to provide the user with a higher-level summarisation
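
Roll-up and drill-down can be illustrated on a small dictionary-based cube. This is a simplified sketch: the cube contents and the `roll_up` helper are hypothetical, and drill-down is shown simply as re-aggregating with more dimensions kept.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, city, month) -> units sold
cube = {
    ("tea", "Leeds", "Jan"): 4,
    ("tea", "Leeds", "Feb"): 6,
    ("tea", "York", "Jan"): 3,
    ("coffee", "York", "Feb"): 8,
}

def roll_up(cells, keep):
    """Sum cells over every dimension whose index is not in `keep`."""
    summary = defaultdict(int)
    for key, value in cells.items():
        summary[tuple(key[i] for i in keep)] += value
    return dict(summary)

by_product = roll_up(cube, keep=(0,))         # higher-level summary
by_product_city = roll_up(cube, keep=(0, 1))  # drilling down adds detail back
```

Going from `by_product` to `by_product_city` is the drill-down direction, and the reverse is the roll-up direction.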

79
Q

What are OLAP operations for selecting a subset of data from a data cube?

A

Slice and dice

80
Q

Describe the slice operation

A

Performs a selection on one dimension of the given cube, resulting in a subcube

81
Q

Describe the dice operation

A

Defines a subcube by performing a selection on two or more dimensions
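
The two selections can be sketched with one helper on a dictionary-based cube. The cube contents, the dimension names and the `select` function are hypothetical, used only to show the difference between restricting one dimension and restricting several.

```python
# Hypothetical cube cells keyed by (product, region, quarter) -> sales
cube = {
    ("tea", "north", "Q1"): 10,
    ("tea", "south", "Q1"): 5,
    ("tea", "north", "Q2"): 7,
    ("coffee", "north", "Q1"): 15,
}
DIMENSIONS = ("product", "region", "quarter")

def select(cells, **criteria):
    """Keep cells whose value on each named dimension is in the allowed set."""
    wanted = {DIMENSIONS.index(name): allowed for name, allowed in criteria.items()}
    return {
        key: value
        for key, value in cells.items()
        if all(key[i] in allowed for i, allowed in wanted.items())
    }

slice_q1 = select(cube, quarter={"Q1"})                 # slice: one dimension
dice = select(cube, product={"tea"}, quarter={"Q1"})    # dice: two dimensions
```

Both results are themselves smaller cubes with the same key layout, so further OLAP operations can be applied to them.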

82
Q

What is a data web house?

A

A web data warehouse. Web houses offer users web logs and other such information.

The data warehouse has been forced to adapt to new requirements for web-based data.

83
Q

What is the difference between a data mart and a data warehouse?

A

Data warehouse - collects information about subjects that span an entire organisation, so its scope is enterprise-wide. It is a single organisational repository of enterprise-wide data across many or all subject areas. It is the authoritative repository of all the fact and dimension data at an atomic level.

Data mart - a departmental subset of a data warehouse. It is a thematic database which focuses on selected subjects, so its scope is department-wide. It was originally oriented towards the marketing field. It is a specific, subject-oriented repository of data designed to answer specific questions for a specific set of users.

84
Q

How many data marts can an organisation have?

A

An organisation could have multiple data marts serving the needs of marketing, sales, operations, collections etc.

85
Q

What is the authoritative repository?

A

The data warehouse

86
Q

What data structure is used at the enterprise level?

A

The data warehouse

87
Q

What data structure is used at the business division/department level?

A

The data mart - it only contains the required subject specific data for local analysis.

88
Q

What is a data lake?

A

A data lake is a centralised repository that stores, processes, and secures large amounts of raw data in its native format. This includes structured, semi-structured and unstructured data.

A storage repository

The data structure and requirements are not defined until the data is needed

89
Q

What are some differences between the data warehouse and data lake?

A
  • DATA - DW has structured and processed data, DL has structured, semi-structured, unstructured and raw data
  • PROCESSING - DW is schema-on-write, DL is schema-on-read
  • STORAGE - DW is expensive for large data volumes, DL is designed for low-cost storage
  • AGILITY - DW is less agile with a fixed configuration, DL is highly agile, configure and reconfigure as needed
90
Q

What does schema-on-write mean?

A

Before we load data into a data warehouse, it needs to be given some shape and structure ie we need to model it.

91
Q

What does schema-on-read mean?

A

Data in a data lake is loaded in as raw data, then when the data is to be used it is given shape and structure.
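
The contrast with schema-on-write can be sketched as below. The schema, the event format and the `apply_schema` helper are hypothetical; the only point is when the structure gets applied.

```python
import json

# Hypothetical target schema: field name -> type to coerce to
SCHEMA = {"customer_id": int, "amount": float}

def apply_schema(record):
    """Coerce a raw dict to SCHEMA, raising if a field is missing or invalid."""
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}

raw_events = ['{"customer_id": "7", "amount": "19.99", "note": "gift"}']

# Schema-on-write (warehouse): structure is enforced before storing
warehouse_rows = [apply_schema(json.loads(e)) for e in raw_events]

# Schema-on-read (lake): store the raw text untouched, shape it when queried
lake_storage = list(raw_events)
rows_at_read_time = [apply_schema(json.loads(e)) for e in lake_storage]
```

The lake keeps fields the schema ignores (here `note`), so a different schema can be applied later without reloading the data. The warehouse has already discarded them.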

92
Q

How are data lake technologies like Hadoop able to keep relatively low cost of storing data?

A
  • Open-source software (licensing and community support are free)
  • Designed to be installed on low-cost hardware
93
Q

What do you need to be able to do in order to solve a concrete business problem?

A

Map the business problem to a good data mining modelling technique that is based on some statistical or machine learning algorithm.

94
Q

How many categories of data mining modelling techniques are there and what are they?

A

Two categories

  • Predictive modelling techniques
  • Descriptive modelling techniques
95
Q

What are predictive modelling techniques used for?

A

The prediction of one value using other values in the dataset.

The learning algorithm attempts to discover and model the relationship between the target feature and other features.

The predicted value is not necessarily in the future, eg predicting a conception date.

96
Q

What is the target feature?

A

The one being predicted

97
Q

Is training a predictive model supervised or unsupervised learning?

A

Supervised learning - predictive models are given clear instruction on what they need to learn and how to learn it.

98
Q

What does supervised learning refer to?

A

The fact that the target value provides a way for the learner to know how well it has learned the desired task.

SL attempts to optimise a function to find the combination of feature values that results in the target output

99
Q

What are two types of supervised learning?

A

CLASSIFICATION - predicting which category an example belongs to

REGRESSION - predicting numeric (continuous) data

100
Q

Why is the boundary between classification models and numeric prediction models not necessarily firm?

A

It is easy to convert numbers into categories and categories into numbers.
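
Both directions of the conversion can be shown in a few lines. The temperature thresholds, category names and integer codes below are arbitrary choices for illustration.

```python
def to_category(temperature_c):
    """Bin a numeric temperature into a category (thresholds are arbitrary)."""
    if temperature_c < 10:
        return "cold"
    if temperature_c < 20:
        return "mild"
    return "warm"

# Hypothetical integer codes for turning categories back into numbers
CATEGORY_CODES = {"cold": 0, "mild": 1, "warm": 2}

temperatures = [4.5, 12.0, 25.3]
categories = [to_category(t) for t in temperatures]   # numbers -> categories
codes = [CATEGORY_CODES[c] for c in categories]       # categories -> numbers
```

Binning turns a numeric prediction problem into a classification problem, while coding categories as numbers does the reverse. Detail is lost when binning, so the two conversions are not inverses of each other.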

101
Q

What are descriptive modelling techniques used for?

A

A descriptive model is used for tasks that would benefit from the insight gained from summarising data in new and interesting ways.

102
Q

What is a difference between predictive and descriptive modelling?

A

Predictive models predict a target of interest, whereas in descriptive modelling no single feature is more important than any other.

Supervised vs unsupervised learning.

103
Q

What is unsupervised learning?

A

There is no target to learn.

Unlike supervised learning, unsupervised machine learning models are given unlabeled data and allowed to discover patterns and insights without any explicit guidance or instruction.

104
Q

What are different descriptive modelling tasks?

A
  • Pattern discovery
    • Market basket analysis
  • Clustering
    • Segmentation analysis
105
Q

Which type of data mining modelling technique is used in real time to control traffic lights during rush hours etc.?

A

Predictive modelling

106
Q

What is pattern discovery used for?

A

To identify useful associations within data.

PD is often used for market basket analysis on transactional purchase data. We want to identify items that are frequently purchased together, so that the learned information can be used to refine marketing tactics.
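
A very small sketch of finding items bought together: count how often each pair of items appears in the same basket and express that as a fraction of all baskets (often called support). The baskets and the `pair_support` helper are hypothetical, and real market basket analysis would go on to derive association rules from these counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, one set of items per basket
baskets = [
    {"beer", "diapers", "crisps"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"beer", "crisps"},
]

def pair_support(transactions):
    """Fraction of transactions containing each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each pair a single canonical ordering
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(transactions) for pair, n in counts.items()}

support = pair_support(baskets)
```

Pairs with high support are candidates for being placed together or promoted together.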

107
Q

How might retailers utilise knowledge gained from pattern discovery?

A
  • Move items close together
  • Run promotions to “up-sell” associated items
108
Q

What is clustering?

A

Dividing a dataset into homogeneous groups.

This is sometimes used for segmentation analysis.

109
Q

What is segmentation analysis?

A

It identifies groups of individuals with similar behaviour or demographic information, so that advertising campaigns can be tailored for particular audiences

110
Q

Why is there a need for human involvement?

A

A machine is capable of identifying clusters, but human intervention is required to interpret them.

eg need to understand differences among groups to create promotion that best suits each group

111
Q

What are the four learning tasks your project may represent?

A
  • Classification
  • Numeric prediction
  • Pattern detection
  • Clustering

The task drives the choice of the algorithm
