Test 1 ISYS 4293 Flashcards

1
Q

Business Intelligence and Data Mining

A

Data mining is a collection of knowledge-discovery technologies used to perform Business Intelligence in order to support an organization’s decision-making

2
Q

Cross-Industry Standard Process for Data Mining (CRISP-DM)

A

The standard six-phase process for carrying out data mining projects

3
Q

(1) Business Problem Understanding

A

-Define business requirements and objectives
-Translate objectives into data mining problem definition
-Prepare initial strategy to meet objectives

4
Q

(2) Data Understanding Phase

A

-Collect data
-Assess data quality
-Perform exploratory data analysis (EDA)

5
Q

(3) Data Preparation Phase

A

-Cleanse, prepare, and transform data set
-Prepares for modeling in subsequent phases
-Select cases and variables appropriate for analysis

6
Q

(4) Modeling Phase

A

-Select and apply one or more modeling techniques
-Calibrate model settings to optimize results
-If necessary, additional data preparation may be required

7
Q

(5) Evaluation Phase

A

-Evaluate one or more models for effectiveness
-Determine whether defined objectives are achieved
-Make decision regarding data mining results before deploying to field

8
Q

(6) Deployment Phase

A

-Make use of models created
-Simple deployment: generate report
-Complex deployment: implement additional data mining effort in another department
-In business, customer often carries out deployment based on model

9
Q

How many data mining tasks?

A

6 data mining tasks

10
Q

Data Mining Task: Description

A

-Describes general patterns and trends
-Easy to interpret and explain
-Transparent Models
-Pictures and numbers
-E.g. Scatterplots, Descriptive Stats

11
Q

Data Mining Task: Estimation

A

-Target variable = numerical
-Uses numerical and/or categorical predictor (IV) values to approximate the numerical target variable (DV)
-Ex: Estimate a student’s graduate GPA from their undergraduate GPA
-E.g. Correlation, Linear Regression

12
Q

Data Mining Task: Classification

A

-Target variable (DV) = categorical
-Examples:
Simple vs Complex tasks
Fraudulent card transactions
Income brackets(ex. high, middle, low)

13
Q

Data Mining Task: Prediction

A

-Results lie in the future
-There is a time component in this task
-Ex: What is the probability of Razorbacks winning a game with a particular combination of player profiles?

14
Q

Data Mining Task: Association

A

-Finding attributes of data that go together
-Profiling relationships between two or more attributes
-Understand consequent behaviors based on prior behaviors
-Ex: Supermarkets use affinity analysis to see what items are purchased together

15
Q

Data Mining Task: Clustering

A

-no target variables
-segmentation of data
-Ex: Focused marketing campaigns

16
Q

Data Mining Task: Learning Types

A

Supervised and Unsupervised

17
Q

Supervised

A

-Have a target variable
-Task:
Classification(Categorical Target Variable)
Estimation (Numeric Target Variable)
Description
Prediction

18
Q

Unsupervised

A

-No target variable
-Task:
Association
Clustering

19
Q

Fallacy 1:
-Set of tools can be turned loose on data repositories
-Finds answers to all business problems

A

Reality 1:
-No automatic data mining tools solve problems
-Rather, data mining is a process (CRISP-DM)
-Integrates into overall business objectives

20
Q

Fallacy 2:
-Data mining process is autonomous
-Requires little oversight

A

Reality 2:
-Requires significant intervention during every phase
-After model deployment, new models require updates
-Continuous evaluative measures monitored by analysts

21
Q

Fallacy 3:
-Data mining quickly pays for itself

A

Reality 3:
-Return rates vary
-Depending on startup, personnel, data preparation costs, etc.

22
Q

Fallacy 4:
Data mining software easy to use

A

Reality 4:
-Ease of use varies across projects
-Analysts must combine subject matter knowledge with specific problem domain

23
Q

Fallacy 5:
Data mining identifies causes of business problems

A

Reality 5:
-Knowledge discovery process uncovers patterns of behavior
-Humans interpret results and identify causes

24
Q

Fallacy 6:
-Data mining automatically cleans data in databases

A

Reality 6:
-Data mining often uses data from legacy systems
-Data possibly not examined or used in years
-Organizations starting data mining efforts confronted with huge data preprocessing task

25
Q

Fallacy 7:
-Data mining will always yield positive results

A

Reality 7:
-Positive results are not guaranteed
-Can sometimes provide actionable results and improve decisions, but not always

26
Q

Data preparation

A

60% of effort for data mining process

27
Q

Data Cleaning

A

-Replacing missing values
-Normalization, converting variables to standardized scale
-Testing for Normality
-Dummy Variables
-Outliers

28
Q

Why Preprocess data

A

-Raw data may often be incomplete, noisy
-Data often from legacy databases where values are missing or irrelevant
-Data in form not suitable for data mining; Obsolete fields; Outliers

29
Q

Three Alternate Methods For Replacing Data

A

-Replace Missing Values with User-defined Constant
-Replace Missing Values with Mode or Mean/Median
-Replace Missing Values with Random Values

30
Q

Replace Values with User-Defined Constant

A

-Missing numeric values replaced with 0.0
-Missing categorical values replaced with “Missing”

31
Q

Replace Missing Values with Mode or Mean/Median

A

-Mode for categorical field
-Mean/Median for continuous field

32
Q

Replace Missing Values with Random Values

A

-Values randomly taken from underlying distribution
-Method superior compared to mean substitution
-Measures of location and spread remain closer to original
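
The three replacement methods can be sketched in a few lines of Python. This is only an illustration: the function names and the sample fields are hypothetical, missing entries are assumed to be stored as `None`, and "random values from the underlying distribution" is approximated here by drawing from the observed (non-missing) values.

```python
import random
import statistics

def replace_with_constant(values, constant):
    """Replace each missing entry (None) with a user-defined constant."""
    return [constant if v is None else v for v in values]

def replace_with_mean(values):
    """Replace missing numeric entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def replace_with_mode(values):
    """Replace missing categorical entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]

def replace_with_random(values, rng):
    """Replace missing entries with values drawn at random from the observed ones
    (an assumption standing in for the field's underlying distribution)."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [20, None, 30, 40, None]         # hypothetical numeric field
colors = ["red", None, "blue", "red"]   # hypothetical categorical field

filled_constant = replace_with_constant(ages, 0.0)
filled_mean = replace_with_mean(ages)
filled_mode = replace_with_mode(colors)
filled_random = replace_with_random(ages, random.Random(0))
```

Mean substitution puts every missing entry at the same central value, which shrinks the field's spread; random replacement avoids that, which is why the card calls it superior.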

33
Q

Data Transformation: Normalization

A

-Standardizes the scale of effect each variable has on the results
-Standardizes the mean and variance, or the range, of every variable
(numeric field values should be normalized)

34
Q

Min-Max Normalization

A

-Determines how much greater field value is than minimum value for field
-Scales this difference by field’s range
-X* stands for “min-max normalized X”
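
A minimal sketch of min-max normalization, written as X* = (X - min(X)) / (max(X) - min(X)). The function name and the sample values are hypothetical and only meant to show the arithmetic.

```python
def min_max_normalize(values):
    """Return X* = (X - min) / (max - min) for every value in the field."""
    lo, hi = min(values), max(values)
    field_range = hi - lo
    if field_range == 0:
        raise ValueError("all values are equal, so the range is zero")
    return [(x - lo) / field_range for x in values]

weights = [10, 20, 25, 50]          # hypothetical field values
normalized = min_max_normalize(weights)
```

The minimum maps to 0 and the maximum to 1, so every normalized value lies in [0, 1].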

35
Q

Z-score Standardization

A

-Widely used in statistical analysis
-Takes difference between field value and field value mean
-Scales this difference by field’s standard deviation
-Values typically fall in the range [-3, 3]
-Data values equal to field’s mean have z-score Standardization value = 0
-Data values that lie above the mean have positive z-score Standardization values
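
As a rough illustration, z-score standardization can be computed with Python's `statistics` module. The function name and sample data are hypothetical, and the sample standard deviation (`statistics.stdev`) is assumed.

```python
import statistics

def z_score_standardize(values):
    """Return (x - mean) / standard deviation for every value in the field."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumption)
    return [(x - mean) / sd for x in values]

scores = [2, 4, 6, 8, 10]           # hypothetical field values, mean = 6
z = z_score_standardize(scores)
```

Here the value equal to the mean (6) gets a z-score of 0, values below the mean come out negative, and values above it come out positive.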

36
Q

In Z-score Standardization:
-Data values equal to field’s mean

A

have z-score Standardization value = 0

37
Q

In Z-score Standardization:
-Data values that lie above the mean

A

have positive z-score Standardization values

38
Q

In Z-score Standardization:
-Data Values that lie below mean

A

Have negative z-score Standardization Values

39
Q

Normality

A

Transforming a variable so that its distribution is closer to normal without changing its basic information

40
Q

Data Transformation: Normality

A

Common transformations:
-Natural log = ln(Bank)
-Square root = √Bank
-Inverse square root = 1/√Bank
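
The three transformations can be sketched as follows. The values assigned to `bank` are made up for illustration, and all of them are assumed to be strictly positive, since the log and the inverse square root are undefined at zero.

```python
import math

bank = [1.0, 4.0, 9.0, 100.0]                      # hypothetical positive values

ln_bank = [math.log(x) for x in bank]              # natural log
sqrt_bank = [math.sqrt(x) for x in bank]           # square root
inv_sqrt_bank = [1 / math.sqrt(x) for x in bank]   # inverse square root
```

Each transformation compresses large values more than small ones, which is why they tend to pull in the long tail of right-skewed data.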

41
Q

Right-skewed data

A

mean > median; skewness is positive

42
Q

Left-skewed data

A

mean < median; skewness is negative

43
Q

Symmetric data

A

mean = median = mode; skewness is zero
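
One way to check the sign of the skewness, assuming Pearson's median skewness, 3 × (mean − median) / standard deviation, is the statistic in use (other skewness formulas exist). The function name and the three samples are hypothetical.

```python
import statistics

def pearson_median_skewness(values):
    """Skewness estimated as 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return 3 * (mean - median) / sd

right_skewed = [1, 2, 2, 3, 20]     # long tail to the right: mean > median
left_skewed = [1, 18, 19, 19, 20]   # long tail to the left: mean < median
symmetric = [1, 2, 3, 4, 5]         # mean = median
```

The sign of the result follows directly from whether the mean sits above or below the median.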

44
Q

Outliers

A

-values that lie near extreme limits of data range
-Outliers may represent errors in data entry

45
Q

Z-score Standardization

A

sensitive to outliers

46
Q

Min Max normalization

A

sensitive to variation

47
Q

Interquartile Range (IQR)

A

-Used to identify Outliers
-Robust statistical method and less sensitive to presence of outliers
-measure of variability
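
A sketch of IQR-based outlier detection. The cutoff of 1.5 × IQR beyond the quartiles is a common convention assumed here, not something stated on the card; the function name and the data are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (k = 1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper], iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical field with one extreme value
outliers, iqr = iqr_outliers(data)
```

Because the quartiles barely move when a single extreme value is added, the fences stay stable, which is what makes the method robust.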

48
Q

Hypothesis

A

A statement or claim about a parameter

49
Q

Null Hypothesis

A

represents assumed value

50
Q

Alternative Hypothesis

A

represents alternative claim about the value

51
Q

Statistical Inference

A

Methods for estimating and testing hypotheses about population characteristics based on information contained in a sample

52
Q

A data analyst meets with superiors to discuss
whether to use kNN or Association on the data

A

Modeling Phase(Still discussing which model to use)

53
Q

Chief Analyst meets with CIO, who says that she
would like to investigate and scope out how
analytics can be used in HR hiring projects?

A

Business Understanding Phase(look at investigate and scope out)

54
Q

Estimate the amount of money a randomly chosen family of 4
will spend shopping at a given time and date?

A

estimation or prediction (consider the target variable: categorical or continuous, numerical or nonnumerical)

55
Q

Forecast the stock price of Microsoft for next year?

A

estimation or prediction

56
Q

What does this equation represent?
z_i = (x_i − x̄) / s

A

z-score equation
z_i = z-score
x_i = observed value
x̄ = mean of sample
s = standard deviation

57
Q

Simple linear regression equation

A

𝑦=𝜷_𝟎+𝜷_𝟏 π‘₯+𝜺

58
Q

What is the use of standardizing
variables?

A

– Automatically remove outliers in the variables
– Convert variables to the same scale *
– Helps in computing IQR
– Make interpretation of the results easier

59
Q

When Handling Missing Data, one could,

A

– Replace Missing Values with User-defined Constant
– Replace Missing Values with Mode or Mean/Median
– Replace Missing Values with Random Values
– All of the above *

60
Q

IQR is more robust than Z-score method for
outlier detection, however, it is highly sensitive
to mean and standard deviation

A
  1. True
  2. False * (look at "highly sensitive")
  3. It depends on the context
  4. It depends on the observations
  5. Only 3 and 4
61
Q

Is IQR or z-score more sensitive to outliers?

A

z-score

62
Q

In data mining tasks, one could reduce the
margin of errors by…

A

– Reducing the sample size
– Increasing the sample size *
– Changing the standard deviation
– Keeping the sample size constant
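
To see why a larger sample helps, consider the usual margin of error for a sample mean, E = z × s / √n. The sketch below assumes that formula with z = 1.96 (roughly a 95% confidence level); the function name and numbers are hypothetical.

```python
import math

def margin_of_error(sd, n, z=1.96):
    """Margin of error for a sample mean: z * sd / sqrt(n) (z = 1.96 is an assumption)."""
    return z * sd / math.sqrt(n)

small_sample = margin_of_error(sd=10, n=100)
large_sample = margin_of_error(sd=10, n=400)
```

Quadrupling the sample size halves the margin of error, because n appears under a square root.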

63
Q

Normalization of the data can be done using

A

None of the above (min-max equation): x′ = (x − x_min) / (x_max − x_min)

64
Q

In Forward Regression, you start with all variables of
interest in the model and then at each step, the least
significant variable is dropped, assuming its p-value is
above a pre-set level (α = .05 or .10)

A

False (this describes backward elimination)

65
Q

Before running a k-nearest neighbor model it is required to

A

set the number of neighbors to compare instances to

66
Q

In k-nearest neighbor, distance for categorical
variable can be computed by

A

The "different from" function (0 if the two values match, 1 otherwise)
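
A sketch of how a distance for records with both numeric and categorical fields might be computed. It assumes a "different from" function that returns 0 when two categories match and 1 otherwise, combined with Euclidean distance on numeric fields that are already normalized. All names and records are hypothetical.

```python
import math

def different(a, b):
    """Categorical distance: 0 if the values match, 1 otherwise."""
    return 0 if a == b else 1

def mixed_distance(record_a, record_b, categorical_indexes):
    """Euclidean-style distance over numeric and categorical fields."""
    total = 0.0
    for i, (a, b) in enumerate(zip(record_a, record_b)):
        if i in categorical_indexes:
            total += different(a, b) ** 2
        else:
            total += (a - b) ** 2
    return math.sqrt(total)

# Hypothetical records: (normalized age, gender)
patient_a = (0.2, "F")
patient_b = (0.5, "M")
patient_c = (0.5, "F")

dist_ab = mixed_distance(patient_a, patient_b, {1})
dist_ac = mixed_distance(patient_a, patient_c, {1})
```

Normalizing the numeric fields first matters here: otherwise a field measured in large units would swamp the 0/1 contribution of the categorical field.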

67
Q

Choose appropriate fit statistics for estimation model
selection:

A

– Misclassification
– Gini Coeff
– Average Squared Error *
– Schwarz’s Bayesian Criterion *
– Average Profit/Loss
– Log Likelihood *

68
Q

appropriate fit statistics for rankings model

A

ROC Index *
Gini Coefficient *

69
Q

Choose appropriate fit statistics for decision model
selection:

A

– Misclassification *
– Gini Coeff
– ASE
– MSE
– Average Profit/Loss *
– Log Likelihood
– Kolmogorov-Smirnov statistic *

70
Q

Which is true when modeling a Decision Tree?

A

* Each variable is evaluated at each node to determine the splitting
variable
* The same variable may be used for splitting at different locations in
the Decision Tree
* CART (Phi) / information gain criteria can be used for selecting
candidate splits
* If not pruned, a stopping criterion in creating a Decision Tree is
when the tree reaches the leaf nodes
* All of the above *

71
Q

Categorical data

A
  • Labels or names used to identify an attribute of
    each element
  • Generally qualitative
  • Nominal or ordinal
72
Q

Quantitative data

A
  • indicates how many or how much
  • Either discrete or continuous
73
Q

The sum of differences between xi and xbar =

A

0

74
Q

Variance

A
  • measures how far a set of numbers is spread out from their average value.

Sample variance: s² = Σ(x_i − x̄)² / (n − 1)
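
A short numeric check of two facts: the deviations from the mean always sum to zero, and the variance is built from the squared deviations. The sample variance with divisor n − 1 is assumed, and the data are made up.

```python
import statistics

data = [2, 4, 4, 6, 9]                      # hypothetical sample
mean = statistics.mean(data)

# Deviations from the mean cancel out, so their sum is zero.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
```

Squaring the deviations is what keeps positive and negative differences from cancelling.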

75
Q

Overfitting

A

-when your model memorizes your exact training data but doesn’t figure out the pattern in the data
-Fits the model too much

76
Q

Underfitting

A

-when your model is too simple
-model isn’t complex enough to match the training data

77
Q

continuous variable

A

use the two-sample t test for the difference in means

78
Q

flag variable

A

use the two-sample Z test for the difference in proportions

79
Q

multinomial variable

A

use the test for the homogeneity of proportions

80
Q

goodness of fit equation:

A

Φ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) − P(j|t_R)|
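
The measure can be sketched in Python as follows, reading P_L and P_R as the proportions of records sent to the left and right child nodes and P(j|t_L), P(j|t_R) as the class proportions within each child. The function name and the labels are hypothetical, and both children are assumed to be non-empty.

```python
def goodness_of_split(left_labels, right_labels):
    """Phi(s|t) = 2 * P_L * P_R * sum over classes j of |P(j|t_L) - P(j|t_R)|."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    classes = set(left_labels) | set(right_labels)
    q = sum(
        abs(left_labels.count(c) / n_left - right_labels.count(c) / n_right)
        for c in classes
    )
    return 2 * p_left * p_right * q

# A split that separates the two classes perfectly, and one that separates nothing.
pure_split = goodness_of_split(["yes", "yes"], ["no", "no"])
useless_split = goodness_of_split(["yes", "no"], ["yes", "no"])
```

The value is largest when the children are about the same size and their class mixes differ as much as possible, and it drops to zero when both children have identical class proportions.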