Chapter 5: Data Mining for Business Intelligence Flashcards
Apriori algorithm
The most frequently used algorithm for finding association rules. It identifies itemsets that occur in at least a minimum number of transactions, then extends the frequent itemsets one item at a time: one-item subsets grow to two-item subsets, then three-item subsets, and so on, until no further successful extensions are found.
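The level-wise search above can be sketched in a few lines of Python. This is a minimal illustration, not a full Apriori implementation: the classic candidate-pruning step (discarding candidates with an infrequent subset) is replaced by direct counting, and the function and data names are made up for the example.

```python
from itertools import combinations  # noqa: F401  (handy if you extend the sketch)

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) occurring in >= min_support transactions."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Start from frequent one-item subsets...
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # ...and extend the frequent itemsets one item at a time.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c: s for c, s in count(candidates).items() if s >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}, {"bread"}]]
freq = apriori(baskets, min_support=2)
```

Here `{milk, bread}` and `{milk, eggs}` survive with support 2, while `{bread, eggs}` (support 1) is dropped, so no three-item extension can be frequent and the search stops.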
Area under the ROC curve
The ROC curve is a graphical plot in which the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. The area under the ROC curve provides an accuracy measure for a classifier: a value of 1 indicates a perfect classifier, while a value of 0.5 indicates performance no better than random chance.
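One way to compute this area, shown below as a hedged sketch, uses the equivalent rank interpretation of AUC: the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one (ties count as half). The function name and toy labels are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative scores 1.0; one that assigns everyone the same score scores exactly 0.5.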
Associations (Association rule learning in data mining)
It is a research technique used to identify relationships among variables in a database. In the retail industry, association rule mining is called market basket analysis. Link analysis and sequence mining are derivatives of association rule mining.
Bootstrapping
is when a fixed number of instances from the original data is sampled (with replacement) for training, and the rest of the dataset is used for testing. This process can be repeated as needed.
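The sample-with-replacement split can be sketched as below; the function name is invented for the example. Instances never drawn into the training sample (the "out-of-bag" instances) form the test set.

```python
import random

def bootstrap_split(data, seed=None):
    """One bootstrap resample: train = n draws with replacement,
    test = the instances never drawn (out-of-bag)."""
    rng = random.Random(seed)
    n = len(data)
    train_idx = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(train_idx)
    train = [data[i] for i in train_idx]
    test = [data[i] for i in sorted(oob)]
    return train, test

data = list(range(10))
train, test = bootstrap_split(data, seed=0)
```

Repeating the call with different seeds gives the repeated splits the definition mentions, whose test-set error estimates can then be averaged.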
Categorical data
are labels for classes that are used to divide a variable into specific groups. Categorical data is also called discrete data; it represents a finite number of values with no continuum between them.
Classification
or supervised induction. It is a very common data mining task that analyzes historical data and generates a model that can predict future behavior. The model consists of generalizations over the records of a training dataset, which help distinguish predefined classes. The expectation is that the model can be used to predict the classes of other unclassified records, and even to predict actual future events.
Clustering
is the process of partitioning a collection of objects, events, etc., presented in a dataset into natural groups (sub-classes) whose members share similar characteristics. Commonly used clustering techniques are k-means (in statistics) and self-organizing maps (in machine learning).
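The k-means technique mentioned above can be sketched in plain Python: repeatedly assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. This is a minimal illustration (no convergence check, naive random initialization), with invented names and toy data.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch over tuples of coordinates."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the centroid with the smallest squared distance
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:  # recompute centroid as the cluster mean
                new_centroids.append(tuple(sum(xs) / len(xs) for xs in zip(*cl)))
            else:   # keep the old centroid if a cluster emptied out
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, k=2)
```

On these two well-separated groups the algorithm settles on centroids at (0, 0.5) and (10, 10.5) regardless of which points it starts from.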
Confidence
is one of the metrics that association rule mining uses to answer the question: "Are all association rules interesting and useful?" Confidence measures how often the consequent appears in transactions that contain the antecedent.
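Concretely, confidence of a rule A → B is support(A ∪ B) divided by support(A). A small sketch, with an invented function name and toy baskets:

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk"}, {"milk", "bread"}, {"bread"}]]
c = confidence(baskets, {"milk"}, {"bread"})
```

Here milk appears in three baskets and bread accompanies it in two of them, so the rule milk → bread has confidence 2/3.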
CRISP-DM
(Cross Industry Standard Process for Data Mining) is a general process for doing data mining projects.
Data mining
is the discovery of patterns and significant knowledge in large quantities of data.
Decision tree
builds classification or regression models in the form of a tree structure, breaking the data down into smaller and smaller subsets. Input variables in a decision tree are called attributes. A tree has branches and nodes: a branch represents the outcome of a test used to classify a pattern, and each leaf node holds a class label. The topmost node in the tree is the root node.
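A one-level tree (a "decision stump") shows the branch/leaf structure in miniature: one test at the root, one leaf per outcome, each leaf holding the majority class of the training rows that reach it. The function name and toy rows below are illustrative.

```python
from collections import Counter

def stump(rows, attribute, target):
    """One-level decision tree: branch on one attribute; each leaf holds
    the majority class among the rows reaching it."""
    leaves = {}
    for row in rows:
        leaves.setdefault(row[attribute], []).append(row[target])
    return {value: Counter(labels).most_common(1)[0][0]
            for value, labels in leaves.items()}

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain",  "play": "yes"},
]
tree = stump(rows, "outlook", "play")
```

A full decision tree algorithm grows such splits recursively, choosing at each node the attribute that best separates the classes (see the entropy, Gini index, and information gain entries below).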
Discovery-driven data mining
is a technique used to find patterns, associations, and other relationships hidden within datasets. It usually discovers facts that the organization had not previously known.
Distance measure
is used in cluster analysis methods to calculate the closeness between pairs of items. Well-known distance measures are the Euclidean distance (the straight-line distance between two points, as could be measured with a ruler) and the Manhattan distance (the rectilinear, or taxicab, distance between two points).
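Both measures are one-liners over coordinate tuples; the function names here are just for illustration.

```python
def euclidean(p, q):
    """Straight-line distance: sqrt of the sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Rectilinear (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 (the 3-4-5 triangle) while the Manhattan distance is 7 (three blocks east, four blocks north).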
Entropy
measures the extent of uncertainty or randomness in a dataset.
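For a set of class labels with proportions p_i, entropy is H = -Σ p_i log2(p_i). A small sketch (invented function name):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
```

A perfectly mixed 50/50 set has entropy 1 bit (maximum uncertainty for two classes); a pure set has entropy 0.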
Gini index
can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable.
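The Gini index for a set of labels with class proportions p_i is 1 - Σ p_i², so 0 means a perfectly pure node. A sketch with an invented function name:

```python
from collections import Counter

def gini(labels):
    """Gini impurity = 1 - sum(p^2) over the class proportions; 0 = pure node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```

A pure node scores 0; a 50/50 two-class node scores 0.5, the maximum for two classes.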
Hypothesis-driven data mining
it is a technique that begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.
Information gain
is the splitting mechanism used in ID3, the most well-known decision tree algorithm.
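Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes produced by a split; ID3 picks the attribute with the highest gain. A sketch (invented names; entropy repeated here so the block is self-contained):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - weighted average of H(child) over the split."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["y", "y", "n", "n"]
ig = information_gain(parent, [["y", "y"], ["n", "n"]])
```

A split that separates the classes perfectly, as above, recovers the full 1 bit of parent entropy; a split that leaves each child as mixed as the parent gains nothing.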