Chapter 5: Data Mining for Business Intelligence Flashcards

1
Q

Apriori algorithm

A

The most frequently used algorithm for discovering association rules. Given a set of itemsets (for example, retail transactions), the algorithm identifies subsets that are common to at least a minimum number of the itemsets. Frequent subsets are extended one item at a time: from one-item subsets to two-item subsets, then three-item subsets, and so on, until no further successful extensions are found.
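The level-by-level extension can be sketched as follows. This is a minimal illustration, not a reference implementation; the function name `apriori`, the `min_support` parameter (an absolute transaction count), and the sample baskets are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Extend by one item: join frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-item subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
```

With these baskets, the loop stops after the two-item level because only one pair is frequent, so no three-item candidate can be formed.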

2
Q

Area under the ROC curve

A

The ROC curve is a graphical plot in which the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. The area under the ROC curve is an accuracy measure for a classifier: a value of 1 indicates a perfect classifier, and a value of 0.5 indicates performance no better than random chance.
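One way to compute the area is to sweep a threshold over the classifier's scores, collect the (false positive rate, true positive rate) points, and apply the trapezoidal rule. The sketch below assumes binary 0/1 labels with at least one example of each class; the function name `roc_auc` is chosen for this example.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule (labels are 0 or 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score so lowering the threshold admits one score group at a time.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so ties do not create artificial steps.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])   # positives always ranked first
uninformative = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # scores carry no information
```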

3
Q

Associations (Association rule learning in data mining)

A

It is a research technique used to identify relationships among variables in a database. In the retail industry, association rule mining is called market basket analysis. Link analysis and sequence mining are derivatives of association rule mining.

4
Q

Bootstrapping

A

is when a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the dataset is used for testing. This process can be repeated if needed.
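A minimal sketch of one bootstrap round, assuming the data fits in a list. The names `bootstrap_split`, `n_train`, and `seed` are illustrative choices, not taken from the source.

```python
import random

def bootstrap_split(data, n_train, seed=0):
    """Draw n_train instances with replacement; instances never drawn form the test set."""
    rng = random.Random(seed)
    drawn = [rng.randrange(len(data)) for _ in range(n_train)]
    drawn_set = set(drawn)
    train = [data[i] for i in drawn]  # may contain duplicates
    test = [data[i] for i in range(len(data)) if i not in drawn_set]
    return train, test

records = list(range(20))  # stand-in for 20 data instances
train, test = bootstrap_split(records, n_train=20)
```

Repeating the call with different seeds gives the repeated estimates mentioned above.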

5
Q

Categorical data

A

are labels for classes that are used to divide a variable into specific groups. Categorical data is also called discrete data, meaning that it represents a finite number of values with no continuum between them.

6
Q

Classification

A

or supervised induction, is a very common data mining task. It analyzes historical data and generates a model that can predict future behavior. This model consists of generalizations over the records of a training dataset, which help distinguish predefined classes. The expectation is that the model can be used to predict the classes of other unclassified records and even to predict actual future events.

7
Q

Clustering

A

is the process of partitioning a collection of objects, events, etc. presented in a dataset, into natural groups (sub-classes) where the members share similar characteristics. Commonly used clustering techniques are k-means (in statistics) and self-organizing maps (in machine learning).

8
Q

Confidence

A

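is one of the metrics that association rule mining uses to answer the question: “Are all association rules interesting and useful?” Confidence measures how often the consequent goes together with the antecedent, that is, the fraction of transactions containing the antecedent that also contain the consequent.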
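As a small worked sketch, confidence for a rule "antecedent => consequent" can be computed by counting transactions. The function name and the sample baskets are invented for illustration.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [set(t) for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # rule never applies; a convention chosen for this sketch
    both = sum(1 for t in with_antecedent if consequent <= t)
    return both / len(with_antecedent)

# Hypothetical transactions.
baskets = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop"},
    {"mouse"},
]
conf = confidence(baskets, {"laptop"}, {"antivirus"})  # 2 of the 3 laptop baskets
```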

9
Q

CRISP-DM

A

(Cross Industry Standard Process for Data Mining) is a general process for doing data mining projects.

10
Q

Data mining

A

is the discovery of patterns and significant knowledge in large quantities of data.

11
Q

Decision tree

A

builds classification or regression models in the form of a tree structure. It breaks down data into smaller and smaller subsets. Input variables in a decision tree are called attributes. A tree has branches and nodes; a branch represents the outcome of a test to classify a pattern, and each leaf node holds a class label. The topmost node in the tree is the root node.

12
Q

Discovery-driven data mining

A

is a technique used to find patterns, associations, and other relationships hidden within datasets. It usually uncovers facts that an organization had not previously known.

13
Q

Distance measure

A

is used in cluster analysis methods to calculate the closeness between pairs of items. Well-known distance measures are the Euclidean distance (the distance between two points that can be measured with a ruler) and the Manhattan distance (the rectilinear distance, or taxicab distance, between two points).
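Both measures are short to write down. The sketch below assumes each item is a tuple of numeric coordinates of equal length; the function names are chosen for this example.

```python
import math

def euclidean(a, b):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Rectilinear (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
d_euclidean = euclidean(p, q)  # hypotenuse of a 3-4-5 triangle
d_manhattan = manhattan(p, q)  # 3 blocks east plus 4 blocks north
```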

14
Q

Entropy

A

measures the extent of uncertainty or randomness in a dataset.
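A common formalization is Shannon entropy over the class labels, measured in bits. The sketch below assumes that definition (the card itself does not give a formula); the function name is illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

mixed = entropy(["yes", "yes", "no", "no"])    # evenly split: maximum uncertainty for two classes
pure = entropy(["yes", "yes", "yes", "yes"])   # single class: no uncertainty
```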

15
Q

Gini index

A

can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
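As a sketch, the Gini impurity of a group is often computed as one minus the sum of squared class proportions, and a candidate split is scored by the size-weighted impurity of the groups it produces. That formula is an assumption here, since the card does not state one; the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for more mixed groups."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(groups):
    """Size-weighted Gini impurity of the groups produced by branching on an attribute."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

before = gini(["yes", "yes", "no", "no"])
after = gini_of_split([["yes", "yes"], ["no", "no"]])  # branch separates the classes completely
```

A lower value after the split means the branch produced purer groups.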

16
Q

Hypothesis-driven data mining

A

is a technique that begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.

17
Q

Information gain

A

is the splitting mechanism used in ID3, perhaps the most well-known decision tree algorithm.
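Information gain is usually computed as the entropy of the target before a split minus the size-weighted entropy of the groups after it. The sketch below assumes that definition and represents records as dictionaries; the helper names and the toy weather rows are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on one attribute."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder

# Hypothetical records: "outlook" determines "play" completely, "windy" does not help.
rows = [
    {"outlook": "sunny", "windy": True,  "play": "no"},
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True,  "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
```

A tree builder of this kind would branch on the attribute with the highest gain, here "outlook".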

18
Q

Interval data

A

are variables that can be measured on interval scales. For example: temperature on the Celsius scale.

19
Q

k-fold cross-validation

A

or rotation estimation, is when the original sample is randomly partitioned into k subsets of equal size. The classification model is then trained and tested k times; each time it is trained on all but one fold and tested on the remaining single fold.
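The partitioning and rotation can be sketched with index lists. This is a minimal illustration; the generator name `k_fold_indices` and its parameters are assumptions, and fold sizes differ by at most one when n is not divisible by k.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k rotations over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partitioning
    folds = [indices[i::k] for i in range(k)]  # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
# Each instance appears in exactly one test fold across the 5 rotations.
```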

20
Q

Knowledge discovery in databases (KDD)

A

has been defined as a process of using data mining methods, which involve the use of algorithms to identify patterns in data derived through the KDD process. The KDD process consists of data selection, data preprocessing, data transformation, data mining, and interpretation (evaluation).

21
Q

Link analysis

A

when doing link analysis, the linkage among many objects is discovered automatically.

22
Q

Microsoft Enterprise Consortium

A

is the worldwide source for access to Microsoft’s SQL Server 2008 software suite for academic purposes (teaching and research). The consortium provides a wide range of business intelligence development tools and a number of large datasets.

23
Q

Microsoft’s SQL Server

A

is a suite of business intelligence capabilities that has become very popular for data mining because data and the models are stored in the same relational database environment. This makes model management an easier task.

24
Q

Nominal data

A

contains codes assigned to objects as labels, which are not measurements. For example, the variable marital status can be categorized as single, married, or divorced. Nominal data can be represented by binomial values such as yes/no, true/false, or good/bad, or by multinomial values such as brown/green/blue or single/married/divorced.

25
Q

Numeric data

A

is the numeric value of a variable. Examples of numeric variables are age, number of children, household income (in dollars), travel distance (in miles), and temperature (in degrees Fahrenheit). Numeric data is also called continuous data because it represents scalable measurements.

26
Q

Ordinal data

A

contains codes assigned to objects or events as labels that also represent the rank order among them. For example, a credit score can be low, medium, or high.

27
Q

Prediction

A

is used as a synonym for forecasting in data mining. A prediction can be a classification, where the predicted thing, such as tomorrow’s forecast, is a class label like “rainy” or “sunny.” A prediction can also be a regression, where the predicted thing, such as tomorrow’s temperature, is a real number like “70 °F.”

28
Q

RapidMiner

A

is a free data mining tool. Its graphical user interface, large number of algorithms, and variety of data visualization features make it stand out from other free tools.

29
Q

Ratio data

A

are measurement variables found in the physical sciences and engineering. For example, mass, length, time, plane angle, energy, electric charge.

30
Q

Regression

A

is a data mining method in which what is being predicted is a numeric value, such as a temperature (“70 °F”).

31
Q

SAS Enterprise Miner

A

is a commercial data mining tool that has grown in popularity among commercial tools.

32
Q

SEMMA

A

(Sample, Explore, Modify, Model, and Assess). The SEMMA data mining process starts with the generation of a representative sample of the data, which makes it easier to apply exploratory statistical and visualization techniques. It then selects and transforms (modifies) the most significant predictive variables, models the variables to predict outcomes, and finally conducts assessments to evaluate the accuracy and usefulness of the models.

33
Q

Sequence mining

A

is a derivative of association rule mining (the other being link analysis). In sequence mining, relationships are examined in terms of their order of occurrence to identify associations over time.

34
Q

Simple split

A

is a holdout method for testing model accuracy. The simple split partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set).
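A minimal sketch of the partition follows. The 30% test fraction is an assumption made for this example (the card does not specify a proportion), as are the names `simple_split`, `test_fraction`, and `seed`.

```python
import random

def simple_split(data, test_fraction=0.3, seed=0):
    """Shuffle once, then cut into mutually exclusive training and test (holdout) sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

records = list(range(10))  # stand-in for 10 data instances
train, test = simple_split(records)
```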

35
Q

SPSS PASW Modeler

A

is the most popular data mining software tool according to the May 2009 survey by KDnuggets.com.

36
Q

Support

A

is a metric for association rules that measures how often the antecedents and the consequents appear together in the same transaction.
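In code, support can be sketched as the share of all transactions that contain every item of the rule. The function name and the sample baskets below are invented for illustration.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both the antecedent and the consequent."""
    items = set(antecedent) | set(consequent)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

# Hypothetical transactions.
baskets = [
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "mouse"},
    {"laptop"},
    {"mouse"},
]
rule_support = support(baskets, {"laptop"}, {"antivirus"})  # 2 of 4 baskets
```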

37
Q

Weka

A

is the most popular free and open-source software tool for data mining. It was developed by researchers from the University of Waikato in New Zealand.

38
Q

Reasons for businesses’ growing interest in data mining

A

o Intense competition driven by customers’ constantly changing needs.
o Recognition of the hidden value in large data sources.
o The consolidation and integration of database records.
o The ability to place databases and data repositories in a data warehouse.
o The rapid increase in the speed of processing and storage.
o The decreasing cost of hardware and software for storing and processing data.

39
Q

Prominent characteristics and objectives of data mining

A

o The data to be mined is often buried in large databases, which may contain years of accumulated data. In most cases the data has been cleaned and consolidated into a data warehouse.
o The environment is usually a client-server or Web-based information systems architecture.
o New tools facilitate interaction with the user.
o The end user is the miner in most cases.
o The miner needs creative thinking and the ability to interpret the findings.
o Data mining tools already work with spreadsheets and software development products.
o Parallel processing is often needed to handle the enormous amount of data and the search effort.

40
Q

Data Classification in Data Mining

A
  • Categorical Data
  • Nominal Data
  • Ordinal Data
  • Numerical Data
  • Interval Data
  • Ratio Data
41
Q

Data Mining Techniques

A

o Prediction
o Association
o Clustering

42
Q

Data Mining Applications

A
  • Data mining has grown in popularity because it has been used to address many complicated business problems. Businesses use it both to resolve serious problems currently affecting them and to explore opportunities for gaining competitive advantage. Here are some ways different businesses are using data mining:
    o Customer relationship management: identifying the most profitable customers, for example.
    o Banking: automating the loan application process and predicting the most probable defaulters.
    o Retailing and logistics: predicting sales in order to determine inventory levels.
    o Manufacturing and production: predicting machine failure.
    o Brokerage and securities trading: forecasting the range and direction of stock fluctuations.
    o Insurance: forecasting claim amounts for property and medical coverage costs.
    o Computer hardware and software: predicting disk drive failures.
    o Government and defense: forecasting the cost of moving military personnel and equipment, and predicting adversary moves.
    o Travel industry (airlines, hotels/resorts, rental car companies): predicting sales of different services.
    o Health care: identifying people without insurance and the factors behind it.
    o Medicine: predicting success rates of organ transplants.
43
Q

Data Mining Process

A
  • The Cross-Industry Standard Process for Data Mining (CRISP-DM) proposes six steps in the data mining process:
    o Business understanding: knowing what the study is for.
    o Data understanding: identifying the relevant data.
    o Data preparation: also called data preprocessing; taking the data identified in the previous step and preparing it for analysis by data mining methods (cleaning the data, etc.).
    o Model building: modeling techniques are selected and applied, and the various models built are assessed and compared.
    o Testing and evaluation: the developed models are assessed and evaluated for their accuracy.
    o Deployment: the last step, in which the knowledge gained from the data mining process is presented in a way that the user can understand and benefit from.
  • Another standardized data mining process is SEMMA (Sample, Explore, Modify, Model, and Assess). The SEMMA data mining process starts with the generation of a representative sample of the data, which makes it easier to apply exploratory statistical and visualization techniques. It then selects and transforms (modifies) the most significant predictive variables, models the variables to predict outcomes, and finally conducts assessments to evaluate the accuracy and usefulness of the models.
44
Q

Classification Techniques

A
o Decision tree analysis
o Statistical analysis
o Neural networks
o Case-based reasoning
o Bayesian classifiers
o Genetic algorithms
o Rough sets