Chapter 1 Flashcards
What has allowed for the development of data science?
Cheaper data storage, faster hardware and advances in algorithms
Define data science, data analytics and data mining
Data Science: The interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data.
Data Analytics: The process of examining data sets to uncover patterns, trends, and insights, often with the aim of making informed business decisions.
Data Mining: The practice of discovering hidden patterns and relationships in large datasets using statistical and machine learning techniques.
What is data mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data.
Includes sophisticated algorithms for analysing data that can’t be analysed manually.
It involves selecting, exploring and modelling large quantities of data to discover patterns or relations. These patterns are at first unknown. We want to obtain clear and useful results for the owner of the database (using previous data to make predictions).
What are other terms for data mining?
Knowledge discovery in databases (KDD)
Knowledge mining from databases
Knowledge discovery
Knowledge extraction
Data/pattern analysis
What are some examples of large datasets that may require data mining techniques?
- Supermarket information on transactions and customers
- NASA Earth Observation System
- Modern biology information on human genomes etc.
What is data?
What are some examples of data?
Any facts, numbers, text, images, audio, maps etc. that can be processed by a computer.
Data is information such as facts and numbers used to analyse something or make decisions.
Examples include:
- Transactions in supermarkets
- Stock market figures
- Words or sentences in a book
- Road layouts
- Satellite images
What is a defining characteristic of data mining compared to other methods?
Data mining is data driven, whereas other methods are often model driven
What is the difference in the size of the dataset used in statistics vs data mining?
In statistics, we want to find the smallest data size that gives sufficiently confident estimates.
In data mining, we want the data size to be large and we are interested in building a model that is small (not too complex) but still describes the data well.
What kind of techniques/disciplines are involved in data mining?
- Database technology
- Statistics
- Machine learning
- High-performance computing
- Pattern recognition
- Data visualisation
- Information retrieval
What do data mining, machine learning and deep learning have in common?
They share the same goal: to extract insights, patterns and relationships that can be used to make decisions.
But they have different approaches and abilities.
Describe data mining
Data mining can be considered a superset of many different methods to extract insights from data.
Data mining applies methods from many different areas to identify previously unknown patterns from data.
This can include:
- Statistical algorithms
- Machine learning
- Text analysis
- Time series analysis
- Other areas of analytics
DM also includes the study and practice of data storage and data manipulation.
Describe machine learning
Just like statistical models, the goal is to understand the structure of the data. ML has developed based on the ability to use computers to probe the data for structure, even if we do not have a theory of what the structure looks like.
The test for an ML model is its validation error on new data, not a theoretical test of a null hypothesis.
ML often uses an iterative approach to learn from data, so the learning can be easily automated. Passes are run through the data until a robust pattern is found.
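A minimal sketch of judging a model by its validation error on new data. The toy dataset, the straight-line model and the names fit_line and mean_squared_error are illustrative assumptions, not part of the course material.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: a straight line plus a small deterministic wiggle
# standing in for noise.
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + 0.5 * ((i * 7) % 5 - 2) for i, x in enumerate(xs)]

# Hold out every fourth point; the model never sees these while fitting.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# The model is judged on the held-out points only.
validation_error = mean_squared_error(
    [y for _, y in valid],
    [slope * x + intercept for x, _ in valid],
)
print(f"slope={slope:.3f} intercept={intercept:.3f} validation MSE={validation_error:.3f}")
```

The error is computed on points that were not used for fitting, which is what stands in for "new data" here.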
Describe deep learning
Combines advances in computing power and special types of neural networks to learn complicated patterns in large amounts of data.
DL techniques are currently state of the art for identifying objects in images and words in sounds.
Future uses: automatic language translation and medical diagnoses
What were the steps in the development of data mining?
< 1960 - data collection and database creation (simplistic filing systems)
1970s - 1980s - database management systems (more structured database systems were developed, including hierarchical systems and databases consisting of named tables, to organise the data)
1980s - present - advanced database systems (developed further for data retrieval)
1990s - present - web-based database systems (with the internet boom)
1980s - present - data warehousing and data mining (to uncover previously unknown patterns in the data)
What does "data rich but information poor" allude to?
There has been a dramatic increase in the amount of stored data.
This far exceeds human ability for comprehension without powerful tools.
What does target marketing involve?
The supermarket finds patterns (clusters) of similar customers and targets these people with certain products. This is more cost effective than simply sending products to all customers.
Transaction data can model customer purchase patterns over time.
Customer similarity:
- Spending habits
- Income
- Interests
- Shopping patterns
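A minimal sketch of finding clusters of similar customers, assuming k-means as the clustering technique (the notes do not name one). The customer data, the two features and the starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (monthly spend, visits per month).
customers = [(200, 2), (220, 3), (210, 2), (900, 10), (950, 12), (880, 11)]

# Starting centroids are fixed here so the result is reproducible.
centroids, clusters = kmeans(customers, [customers[0], customers[3]])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Each resulting cluster is a candidate group for targeted offers. In practice the features would usually be rescaled first so that spend does not dominate the distance.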
What does cross-market analysis involve?
The supermarket finding patterns (associations) between products and marketing accordingly.
eg associations between products
This is often referred to as market basket analysis.
Need to understand the data and the direction of the relationship (eg beer and diapers example).
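A minimal sketch of market basket analysis using support and confidence, two standard association measures. The baskets are made up for illustration, and the function names are assumptions.

```python
# Hypothetical transactions: each basket is the set of products bought together.
baskets = [
    {"beer", "diapers", "crisps"},
    {"diapers", "wipes"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "crisps"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

print("support(beer, diapers):     ", support({"beer", "diapers"}))
print("confidence(beer -> diapers):", confidence({"beer"}, {"diapers"}))
print("confidence(diapers -> beer):", confidence({"diapers"}, {"beer"}))
```

The two confidence values differ, which is why the direction of the relationship matters: in this toy data, beer buyers often buy diapers, but diaper buyers buy beer less often.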
What are the three steps of customer relationship management?
1 - acquiring new customers
2 - increasing the value of the customer
3 - retaining good customers
eg banks
What kinds of techniques may identify customers who would pose less risk?
Classification or scoring techniques
What technique may help the bank identify customer needs and offer linked products, increasing bank revenue?
Cross-market analysis
By identifying customers that are most profitable, or who will be most profitable in the future, they aim to retain them.
Define financial analysis
The identification of financial trends and patterns over time (time series analysis) to ensure organisations can maximise profit
eg the stock market, business profit
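A minimal sketch of exposing a trend over time, assuming a simple moving average as the smoothing technique (one of many time series methods, chosen here only for illustration). The profit figures are invented.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical monthly profit figures.
monthly_profit = [10, 12, 11, 15, 14, 18, 17, 21]

trend = moving_average(monthly_profit, window=3)
print([round(t, 2) for t in trend])
```

The raw series moves up and down month to month, while the smoothed series rises steadily, making the underlying upward trend easier to see.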
Define competition
The ability to monitor competitors and market directions. The ability to set price strategies in a highly competitive market.
Define forecasting
The identification of patterns in order to predict future events by analysing past and present data and trends.
Eg weather forecasting identifies weather patterns.
Classification - rain or sun
Continuous - temperature etc.
Define fraud detection
A database may contain objects that do not comply with the general behaviour or model of the data. Data mining can involve identifying patterns and hence identifying any outliers (anomaly discovery)
Distance measures can be used: objects that are a substantial distance from every cluster are considered outliers.
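A minimal sketch of distance-based anomaly discovery. The cluster centres, the transactions and the threshold are all hypothetical; in practice the centres would come from a clustering step and the threshold from domain knowledge.

```python
import math

# Hypothetical cluster centres: (transaction amount, transactions per day).
centroids = [(20.0, 1.0), (200.0, 5.0)]

transactions = [
    (22.0, 1.0),
    (18.0, 2.0),
    (205.0, 5.0),
    (195.0, 4.0),
    (1500.0, 40.0),
]

def distance_to_nearest_centroid(point):
    """Euclidean distance from point to the closest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

# Assumed cut-off: anything further than this from every cluster is flagged.
threshold = 50.0
outliers = [t for t in transactions if distance_to_nearest_centroid(t) > threshold]
print(outliers)
```

Only the last transaction is far from both clusters, so it is the one flagged for review.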
What is data mining not?
“Searching”, generating histograms, SQL queries etc.
It is not simply statistical analysis.
What does it mean to apply data mining methodology?
To follow an integrated methodological process: translate the business need into a problem to be analysed, retrieve the database needed to carry out the analysis, and apply a data mining technique implemented as a computer algorithm, with the final aim of achieving results that are useful for taking a strategic decision.
What can data mining not eliminate?
The need to know the business, understand the data or to understand analytical methods.
It assists in finding patterns and relationships in data, but doesn’t provide information regarding the value of these patterns.
What is data retrieval?
Extracting interesting data and information from archives and databases.
Unlike DM, the criteria for extracting information are decided beforehand, so they are exogenous to the extraction itself.
Give an example of the difference in a query for data retrieval vs data mining
Data retrieval - find all customers who missed one payment
Data mining - find all customers that are LIKELY to miss one payment (classification, scoring)
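A minimal sketch of the contrast above. The customer fields, the payment history and the single-threshold scoring rule are illustrative assumptions; a real scoring model would use many variables.

```python
# Hypothetical current customers.
customers = [
    {"id": 1, "missed_payments": 1, "utilisation": 0.50},
    {"id": 2, "missed_payments": 0, "utilisation": 0.92},
    {"id": 3, "missed_payments": 0, "utilisation": 0.20},
    {"id": 4, "missed_payments": 1, "utilisation": 0.88},
]

# Data retrieval: the criterion is fixed beforehand and applied directly.
retrieved = [c["id"] for c in customers if c["missed_payments"] == 1]

# Data mining: the criterion is learned from past data.
# Hypothetical history: (credit utilisation, whether a payment was later missed).
history = [(0.95, True), (0.85, True), (0.80, False),
           (0.40, False), (0.30, False), (0.90, True)]

def learn_threshold(history):
    """Pick the utilisation cut-off that misclassifies the fewest past customers."""
    def errors(threshold):
        return sum((u >= threshold) != missed for u, missed in history)
    return min((u for u, _ in history), key=errors)

threshold = learn_threshold(history)
likely_to_miss = [c["id"] for c in customers if c["utilisation"] >= threshold]

print("retrieved:", retrieved)
print("learned threshold:", threshold)
print("likely to miss:", likely_to_miss)
```

The retrieval query returns customers who have already missed a payment, while the learned rule flags customer 2, who has not missed one yet but resembles past customers who did.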
What led to the creation of many data mining process frameworks?
The need to standardise methodologies and define best practices.
The goal is to help organisations better understand the knowledge discovery process by providing a roadmap to follow while planning and executing the project.
What are the most popular data mining process frameworks?
- Knowledge discovery in databases (KDD) process model
- Cross-industry standard process for data mining (CRISP-DM)
- Sample, explore, modify, model and assess (SEMMA) data mining framework
What are the main concepts of knowledge discovery in databases (KDD)?
This process model is centred around the overall process of knowledge discovery from data.
This includes
- How the data are stored
- How it is accessed
- How algorithms can be scaled to enormous datasets efficiently
- How results can be interpreted and visualised
What is the leading methodology for data mining?
CRISP-DM
What are the steps of the KDD process model?
9 Steps:
1 - Developing an understanding of the application domain
2 - Create a target data set (selecting)
3 - Data cleaning and preprocessing
4 - Data reduction and projection
5 - Choosing the data mining task
6 - Choosing the data mining algorithm
7 - Data mining
8 - Interpreting mined patterns
9 - Consolidating discovered knowledge
What are the main steps of CRISP-DM?
CRISP-DM provides a non-proprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
According to CRISP-DM, a given data mining project has a life cycle consisting of six phases:
1 - business understanding
2 - data understanding
3 - data preparation
4 - modelling
5 - evaluation
6 - deployment
How is the phase-sequence of CRISP-DM described?
Adaptive
The next phase in the sequence often depends on the outcomes associated with the previous phase.
Depending on the behaviour and characteristics of the model, we may have to return to previous phases for further refinement before moving on to the next.
What is an advantage of the CRISP-DM process?
It is iterative, so it allows room for error.
The model is characterised by an easy-to-understand vocabulary and good documentation.
What is the SEMMA framework?
A data mining methodology that focusses on logical organisation of the model development phase of data mining projects.
The acronym refers to the core process of conducting data mining.
- Sample
- Explore
- Modify
- Model
- Assess
It begins with a statistically representative sample of the data.
When sampling your data, how big should the portion of data extracted be?
Big enough to contain significant information yet small enough to manipulate quickly.
What is the advantage of mining a representative sample instead of the whole volume?
It reduces the processing time required to get crucial business information.
If general patterns appear in the data as a whole, these will be traceable in the representative sample.
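A minimal sketch of drawing a representative sample, assuming stratified sampling so that each customer segment keeps its share of the data (the notes do not prescribe a sampling method). The records, the segment field and the 10% fraction are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every group so proportions are preserved."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 1000 customers, one in five is "premium".
records = [
    {"id": i, "segment": "regular" if i % 5 else "premium"}
    for i in range(1000)
]

sample = stratified_sample(records, key=lambda r: r["segment"], fraction=0.1)

print(Counter(r["segment"] for r in records))
print(Counter(r["segment"] for r in sample))
```

The sample is a tenth of the size but keeps the same 4:1 split between segments, so a pattern that depends on segment should still be visible in it.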
If a niche is so tiny that it is not represented in a sample, but so important that it influences the big picture, how can it be discovered?
Summary methods
What does exploration allow?
Helps refine the discovery process.
If visual exploration does not reveal clear trends, what can you use?
Statistical techniques
- Factor analysis
- Correspondence analysis
- Clustering
What does modifying the data involve?
Creating, selecting and transforming the variables to focus the model selection process.
Manipulate the data to include information, look for outliers, reduce the number of variables etc.
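A minimal sketch of the modify step: creating a variable, flagging outliers and reducing the number of variables. The rows, the log transform and the "three times the median" outlier rule are illustrative assumptions rather than prescribed SEMMA rules.

```python
import math
import statistics

# Hypothetical customer rows.
rows = [
    {"income": 20000, "spend": 120, "country": "UK"},
    {"income": 35000, "spend": 150, "country": "UK"},
    {"income": 28000, "spend": 130, "country": "UK"},
    {"income": 50000, "spend": 140, "country": "UK"},
    {"income": 42000, "spend": 2000, "country": "UK"},
    {"income": 31000, "spend": 135, "country": "UK"},
]

# Create / transform: add a log-scaled version of income.
for row in rows:
    row["log_income"] = math.log(row["income"])

# Look for outliers: flag spend far above the typical (median) value.
median_spend = statistics.median([r["spend"] for r in rows])
for row in rows:
    row["spend_outlier"] = row["spend"] > 3 * median_spend

# Reduce: drop variables that take a single value and so carry no information.
constant = [name for name in rows[0] if len({r[name] for r in rows}) == 1]
modified = [{k: v for k, v in r.items() if k not in constant} for r in rows]

print("dropped:", constant)
print(modified[4])
```

Here the country column is dropped because every row has the same value, and the row with a spend of 2000 is flagged for a closer look before modelling.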