Chapter 4 - Predictive Analytics I: Data Mining Flashcards

1
Q

What is Data Mining?

A

A term used to describe discovering knowledge from large amounts of data.

2
Q

How are companies dealing with data as it relates to understanding their customers?

A

They are analyzing the vast amount of data that they collect. Data mining supports the management of mission-critical tasks with a high level of accuracy and timeliness.

3
Q

What are some reasons businesses have turned to Data Mining? (7)

A
  1. More intense competition at the global scale
  2. Untapped value hidden in large data sources
  3. Consolidation and integration of database records
  4. Consolidation of databases into a single location
  5. Exponential increase in data processing & storage technologies
  6. Significant reduction in cost of hardware and software
  7. Movement toward demassification of business practices
4
Q

What is Genomic Data?

A

It combines genetics with statistical data analysis and computer science.

5
Q

What are Four Example Uses of Data Mining?

A

Detect and reduce fraudulent activities
Identify customer buying patterns and reclaim profitable customers
Identify trading rules from historical data
Increase profitability using market-basket analysis

6
Q

What are the Seven (1-3) Characteristics and Objectives of Data Mining?

A
  1. Data are cleansed and consolidated into a data warehouse
  2. Data Mining environment is usually a client/server architecture or a Web-based IS architecture.
  3. Sophisticated new tools help to extract information buried in corporate files or archival public records. Data miners also explore the usefulness of soft data.
7
Q

What are the Seven (4-7) Characteristics and Objectives of Data Mining?

A
  4. The miner is often the end user who obtains answers quickly
  5. Striking it rich involves finding unexpected results and requires users to think creatively throughout the process
  6. Data mining tools are readily combined with spreadsheets and other software development tools.
  7. Due to the large amounts of data, it is sometimes necessary to use parallel processing for data mining.
8
Q

What are Six (6) Other Names associated with Data Mining?

A
  1. Knowledge Extraction
  2. Pattern Analysis
  3. Data Archaeology
  4. Information Harvesting
  5. Pattern Searching
  6. Data Dredging
9
Q

What are the Four (4) Major Types of Patterns Data Mining Seeks to Identify?

A

Associations - Find the commonly co-occurring groupings of things.
Predictions - Tell the nature of future occurrences of certain events based on what has happened in the past.
Clusters - Identify natural groupings of things based on their known characteristics.
Sequential Relationships - Discover time-ordered events.

10
Q

What is the Main Difference between Data Mining and Statistics?

A

Statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement.

11
Q

What are the Fourteen (1-5) Industry Focuses where Data Mining can be Applied?

A
  1. Customer Relationship Management (CRM)
  2. Banking
  3. Retailing and Logistics
  4. Manufacturing and Production
  5. Brokerage and Securities Trading
12
Q

What are the Fourteen (6-10) Industry Focuses where Data Mining can be Applied?

A
  6. Insurance
  7. Computer Hardware and Software
  8. Government and Defense
  9. Travel Industry (Airlines, Hotels, Rental Car Companies, etc.)
  10. Healthcare
13
Q

What are the Fourteen (11-14) Industry Focuses where Data Mining can be Applied?

A
  11. Medicine
  12. Entertainment Industry
  13. Homeland Security and Law Enforcement
  14. Sports
14
Q

What does CRISP-DM Stand For?

A

Cross-Industry Standard Process for Data Mining

15
Q

What are the Six (1-3) Steps associated with CRISP-DM?

A
  1. Business Understanding - The key element of any data mining study is to know what the study is for.
  2. Data Understanding - Identify the relevant data from many available databases.
  3. Data Preparation - Take data and prepare it for analysis by data mining methods.
16
Q

What are the Six (4-6) Steps associated with CRISP-DM?

A
  4. Model Building - Modeling techniques are selected and applied to an already prepared dataset to address the specific business need.
  5. Testing and Evaluation - Models are assessed on whether, and to what extent, they meet the business objectives.
  6. Deployment - The knowledge gained is organized and presented so that end users can understand and use it.
17
Q

What does SEMMA stand for? What are the Five (5) Steps to SEMMA?

A

Sample - Generate a representative sample of the data
Explore - Visualization and basic representation of the data
Modify - Select variables; transform variable representations
Model - Use statistical and machine learning models
Assess - Evaluate the accuracy and usefulness of the models

18
Q

What does KDD mean?

A

KDD stands for knowledge discovery in databases: the process of using data mining methods to find useful information and patterns in data.

19
Q

What are the Five (5) Elements to KDD?

A

Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation

20
Q

What are the Main Differences between Data Mining Methods?

A

Classification learns the function between characteristics and their membership through a supervised learning process, whereas Clustering learns the relationship through an unsupervised learning process where only the input variables are presented to the algorithm.

21
Q

What is a Simple Split?

A

It is the process of splitting the data into two mutually exclusive subsets called the training set and the test set.
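
A simple split can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `simple_split`, the 30% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def simple_split(records, test_fraction=0.3, seed=0):
    """Split records into two mutually exclusive subsets: (training, test).

    test_fraction and seed are illustrative defaults, not recommended values.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # randomize order before cutting
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
print(len(train), len(test))
```

Every record lands in exactly one of the two subsets, so the model is never tested on data it was trained on.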

22
Q

What is the K-fold Cross-Validation?

A

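Also called rotation estimation. The complete dataset is randomly split into k mutually exclusive subsets of approximately equal size; the model is then trained and tested k times, each time testing on one subset and training on the remaining k-1.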
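
The fold construction can be sketched as follows, assuming records are addressed by index. The helper name `k_fold_indices` and the seed are hypothetical; each fold serves as the test set once while the other folds form the training set.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # k mutually exclusive folds whose sizes differ by at most one
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx))
```

The overall accuracy estimate is typically the average of the k per-fold accuracies.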

23
Q

What are the Four (4) other Classification Assessment Methodologies?

A

Leave-one-out - k-fold cross-validation with k equal to the number of data points: every data point is used for testing exactly once, so as many models are developed as there are data points.
Bootstrapping - A fixed number of instances is sampled (with replacement) from the original data for training, and the rest is used for testing.
Jackknifing - Similar to leave-one-out; accuracy is calculated by leaving one sample out at each iteration of the estimation process.
Area Under the ROC Curve - A graphical assessment in which the true positive rate is plotted on the y-axis and the false positive rate on the x-axis.
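
As an illustration of the last method, the area under the ROC curve can be computed by sweeping a threshold over the classifier scores and summing trapezoids. This is a minimal sketch: the function name `roc_auc` is hypothetical, labels are assumed to be 1 (positive) or 0 (negative), and both classes are assumed to be present.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule.

    labels: 1 for positive, 0 for negative (both classes must occur).
    scores: higher means the model considers the example more positive.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Examples tied at the same score move the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg  # y-axis and x-axis of the ROC plot
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return area

print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))
```

An area of 1.0 means every positive is scored above every negative; 0.5 corresponds to random ranking.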

24
Q

What are the Seven (7) Classification Techniques?

A
  1. Decision Tree Analysis
  2. Statistical Analysis
  3. Neural Networks
  4. Case-Based Reasoning
  5. Bayesian Classifiers
  6. Genetic Algorithms
  7. Rough Sets
25
Q

Why are Ensemble Models for Predictive Analytics Effective?

A

Combining forecasts can improve accuracy and robustness of information outcomes, while reducing uncertainty and bias associated with individual models.
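
One simple way to combine models is majority voting, sketched below. The three models and their predictions are made-up toy data chosen so that each model errs on a different example; voting is only one of several possible combination schemes.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the predictions of several models for one example by voting."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three models on four examples (illustrative only).
truth   = ["yes", "no",  "yes", "no"]
model_a = ["yes", "no",  "yes", "yes"]  # wrong on the 4th example
model_b = ["yes", "yes", "yes", "no"]   # wrong on the 2nd example
model_c = ["no",  "no",  "yes", "no"]   # wrong on the 1st example

ensemble = [majority_vote(votes) for votes in zip(model_a, model_b, model_c)]
print(ensemble == truth)
```

Because the individual errors fall on different examples, the vote outnumbers each mistake; the benefit shrinks when the models tend to make the same errors.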

26
Q

What is a Decision Tree?

A

It recursively divides a training set until each division consists entirely or primarily of examples of one class. Each non-leaf node contains a split-point, which is a test on one or more attributes and determines how the data are to be split further.

27
Q

What is the Gini Index?

A

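It has been used in economics to measure the diversity of a population. In decision trees, the same concept is used to measure the purity of the classes produced by a split on a particular attribute.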
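
In decision-tree induction the Gini index of a set is commonly computed as 1 minus the sum of squared class proportions, and a candidate split is scored by the size-weighted average over its branches. A minimal sketch, with hypothetical function names and toy labels:

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 0 means perfectly pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of a two-way split (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["a", "a", "b", "b"]))           # evenly mixed two-class set
print(gini_split(["a", "a"], ["b", "b"]))   # split that separates the classes
```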

28
Q

What is Information Gain?

A

It is the splitting mechanism used in ID3, which is perhaps the most widely known decision tree algorithm.

29
Q

What is Entropy?

A

It measures the extent of uncertainty or randomness in a data set.
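
Entropy, and the information gain that is built on it, can be sketched as follows. This assumes the usual base-2 Shannon entropy over class proportions; the function names and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    n = len(parent)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - remainder

parent = ["a", "a", "b", "b"]
print(entropy(parent))
print(information_gain(parent, [["a", "a"], ["b", "b"]]))
```

A split that leaves the class mix unchanged in every subset has a gain of zero, while a split that produces pure subsets recovers the full entropy of the parent.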

30
Q

What are Cluster Analysis Results typically Used For?

A

They are used to classify items, events, or concepts into common groupings. Cluster analysis is commonly used in biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and even MIS development.

31
Q

What is the Apriori algorithm?

A

It is the most commonly used algorithm to discover association rules. Given a set of itemsets, the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets.
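
The frequent-itemset search at the core of Apriori can be sketched as a level-wise loop: count candidates, keep those meeting the minimum support, then build larger candidates only from itemsets that survived. This is a simplified illustration, not a full association-rule miner; the function name, the baskets, and the support threshold of 3 are made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join: merge frequent itemsets into candidates with k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-item subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical market baskets (illustrative only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step relies on the fact that an itemset cannot be frequent unless all of its subsets are, which keeps the number of candidates small.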