Exam 2 Flashcards
What does the straight line on the lift chart represent?
the expected number of positives in any group of cases we would predict if we used the naive model
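A tiny numpy sketch of the idea (the scores and outcomes are made up): the model's cumulative count of positives is compared against that straight baseline, which rises in simple proportion to how many cases have been examined.

```python
import numpy as np

# Made-up predicted scores and true 0/1 outcomes for ten cases.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])
actual = np.array([1,   1,   0,    1,   0,    1,   0,    0,   0,   0])

# Rank cases from highest to lowest score, then count positives found so far.
order = np.argsort(-scores)
cumulative_model = np.cumsum(actual[order])

# The straight line: with no model, expected positives grow in proportion
# to the share of cases examined.
n, total_pos = len(actual), actual.sum()
cumulative_naive = total_pos * np.arange(1, n + 1) / n

print(cumulative_model)   # 1, 2, 2, 3, 3, 4, ...
print(cumulative_naive)   # 0.4, 0.8, 1.2, ...
```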
What is a validation set used to do?
compare models and pick the best one
Estimating a model that explains the training-set data points perfectly and leaves little error, but that is unlikely to be accurate in prediction, is
overfitting
With most data mining techniques, why do we partition the data?
in order to judge how our model will do when we apply it to new data
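A minimal sketch of that partitioning with scikit-learn (the file and column names and the 70/30 split are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: predictor columns plus a known class label "target".
df = pd.read_csv("customers.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Hold 30% of the records back as a validation set; the model never sees them
# during training, so their error rate mimics performance on new data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```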
What data mining technique groups objects together based upon maximizing the intraclass similarity and minimizing interclass similarity?
clustering
inter- = among
intra- = within
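A small illustrative sketch with scikit-learn's KMeans (the toy two-dimensional data and k = 3 are assumptions); k-means works by minimizing the within-cluster (intra-class) distances, which at the same time keeps the clusters separated from one another:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three loose blobs in two dimensions (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# inertia_ is the total within-cluster (intra-class) sum of squared distances
# that k-means minimizes.
print(km.labels_[:10], km.inertia_)
```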
What are the tools and techniques that are used in the large-scale or big data arena?
data mining
What new mindset is needed to begin data mining using big data?
we will need to be open to finding relationships and patterns we never imagined existed in the data we are about to examine
In “The Big Data Future Has Arrived” by Michael Malone, the statement is made that:
- metadata is more important than the big data itself
- the major challenge to big data analysis will be overcome because the fruits of big data are too valuable
- discovery of this “metadata” may prove to be the undoing of big data analysis
- privacy issues will prevent big data analysis from advancing beyond what we’ve already seen
2
What are the four categories of analytic tools available in data mining?
prediction, classification, clustering, association
What data mining tool allows us to predict a class of objects whose label is unknown to us?
prediction
A forecasting model in statistics is what in data mining?
algorithm
The data mining term “score” is known as __ in statistics
forecast
What stat terminology is referred to as a record in data mining terminology?
observation
What are the 5 steps identified by SAS for the data mining process?
sample, explore, modify, model, and assess
The data mining process step that involves creating, selecting, or transforming data is called
modify
The data mining process step that involves data cleansing is called
explore
In “The Invisible Digital Hand,” the replacement of the visible hand in competition by the digitized hand…
- is usually accompanied by fewer firms in the marketplace
- could result in less price comparison and more impulse buying
- can give rise to anticompetitive behavior
- does not give rise to the “frenemy” relationships
3
Most economic time series are integrated in what order?
one
Can’t use ARIMA on a series with a trend. Must difference it first (the “I”, integration, refers to taking 1st differences). Most economic time series are integrated of order one.
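A short sketch of first differencing with pandas, plus an augmented Dickey-Fuller check from statsmodels (the file and column names are assumptions):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Assumed: a trending economic series indexed by date.
y = pd.read_csv("sales.csv", index_col=0, parse_dates=True)["sales"]

# First differences: the period-to-period change in the series.
dy = y.diff().dropna()

# ADF test: a small p-value suggests stationarity. If the levels fail the test
# but the first differences pass, the series is integrated of order one.
for name, series in [("levels", y), ("first differences", dy)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF p-value = {pvalue:.3f}")
```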
Which of the following models utilizes a transformed series to induce a stationary series?
- ARIMA(1,0,1)
- ARIMA(1,0,0)
- ARIMA(1,1,1)
- ARIMA(0,0,1)
3 - the I has to be a 1 because that is the transformation (differencing)
Which of the following is NOT a characteristic of a time series best represented as an ARIMA(3,0,1)?
- og series is stationary
- autocorrelation function has one dominant spike
- the partial autocorrelation function has one dominant spike
- the partial autocorrelation function has 3 spikes
- none are correct
one dominant spike
Which of the following is not a first step in the ARIMA model selection process?
- examine the ACF of the raw series
- examine the PACF of the raw series
- test the data for stationarity
- estimate an ARIMA (1,1,1) model for reference purposes
- all of the options are correct
4
What is the Q stat based on?
estimated autocorrelation function
What is the Q stat used to test?
whether a series is white noise or not
T/F: the Q stat follows the chi-squared distribution.
T
What tests whether the residual autocorrelations, as a set, are significantly different from zero?
Q stat
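A minimal sketch using statsmodels' acorr_ljungbox; the stand-in residual series here is made-up white noise, whereas in practice you would pass your model's residuals:

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# Stand-in residuals: pure white noise (replace with actual model residuals).
resid = np.random.default_rng(0).normal(size=200)

# The Q statistic is built from the estimated autocorrelations of the series.
# A large p-value means we cannot reject the null that the series is white
# noise (all autocorrelations jointly zero).
lb = acorr_ljungbox(resid, lags=[12, 24])
print(lb)   # recent statsmodels returns a DataFrame with lb_stat (Q) and lb_pvalue
```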
ARIMA models require that data be….
stationary
In what situation does the ARIMA model have a decided advantage over standard regression models?
when we don’t know the predictors of the variable to be forecast
The philosophy of the Box-Jenkins methodology of using ARIMA models assumes what?
that the series we are observing started as white noise and was transformed by the black box process into the observed series
Which of the following does the Box-Jenkins methodology of using ARIMA models attempt to discern?
- that the correct black box could have produced explanatory variables
- that the correct black box could have produced an observed time series
- that the correct black box could have produced such a series from white noise
- that the correct black box could have produced a patterned time series
that the correct black box could have produced such a series from white noise
How could you graphically describe the process of the Box-Jenkins methodology of using ARIMA?
white noise → black box → observed time series
A moving average model is simply one that predicts Yt as a function of the __ in predicting Yt
white noise
How is the equation for the autoregressive model different from the equation for the MA model?
the dep variable depends on its own previous values rather than the white noise series or residuals
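Written out side by side (generic textbook notation, not from the cards, with e_t as the white-noise error term):

```latex
% AR(p): Y_t is explained by its own previous values plus current white noise
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t

% MA(q): Y_t is explained by current and past white-noise errors
Y_t = \mu + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_q e_{t-q}
```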
Why is stationarity important?
because a series needs to be stationary before you identify the correct model
What is one method to help us achieve stationarity?
differencing
An ARIMA(p,d,q) model is one that has had differencing used to make a time series
stationary
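A minimal statsmodels sketch: the d in the order tuple tells the model how many times to difference the series internally (the random-walk series and the (1,1,1) order are assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Stand-in series: a random walk, i.e. integrated of order one.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))

# order=(p, d, q): d=1 makes the model work with first differences, so the
# AR and MA terms are fitted to a stationary (differenced) series.
fit = ARIMA(y, order=(1, 1, 1)).fit()
print(fit.summary())
```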
Using the Ljung-Box stat applied to a sample with 30 degrees of freedom, we cannot reject the null of a white noise process if the sample Q-value is less than __ at the 10% level of significance
40
What is the null hypothesis being tested using the Ljung-Box stat?
the set of autocorrelations is jointly equal to zero
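The cutoff quoted above is just the chi-squared critical value; a one-line scipy check (30 degrees of freedom and the 10% level are taken from the question):

```python
from scipy.stats import chi2

# Upper 10% critical value of a chi-squared distribution with 30 df.
print(chi2.ppf(0.90, df=30))   # about 40.26, so a Q below ~40 fails to reject white noise
```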
What problem arises when applying ARIMA type models to highly seasonal monthly data?
extremely high order AR and MA processes
Besides using sophisticated ARIMA-type models capable of internally handling data seasonality, an alternative is to use which of the following?
- seasonal dummy variables
- trend dummy variables
- deseasonalized data, then reseasonalize to generate forecasts
- Holt's smoothing
- all are correct
deseasonalized data, then reseasonalize to generate forecasts
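A rough sketch of the deseasonalize-then-reseasonalize idea with statsmodels' seasonal_decompose (the monthly file, the multiplicative form, and the naive placeholder forecast are all assumptions):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed: a monthly series with strong seasonality.
y = pd.read_csv("monthly_sales.csv", index_col=0, parse_dates=True)["sales"]

# Estimate multiplicative seasonal indices and strip them out.
decomp = seasonal_decompose(y, model="multiplicative", period=12)
deseasonalized = y / decomp.seasonal

# Forecast the smoother deseasonalized series with any simple model
# (placeholder: carry the last value forward), then multiply the seasonal
# index of the forecast month back in to reseasonalize.
forecast_deseasonalized = deseasonalized.iloc[-1]
seasonal_index = decomp.seasonal.iloc[-12]   # the next calendar month, seen a year earlier
print(forecast_deseasonalized * seasonal_index)
```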
What is the key difference between ARIMA-type models and multiple regression models?
use of explanatory variables (ARIMA doesn't have any)
In the classical time series decomposition algebraic model, Y = T × S × C × I, what is C?
measurement of the very long-term movements of the data that are often thought of as waves
In the classical time series decomposition algebraic model, Y = T × S × C × I, what is I?
measurement of the irregular movement or random variations in the series
If a business cycle always had the same vertical distance from trough to peak, it would be called
constant amplitude
What is the centered moving average?
the series that remains after the seasonality and irregular components have been smoothed out by using moving averages
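A quick pandas sketch of a 2×12 centered moving average on made-up monthly data; because 12 is an even window, the textbook method averages two adjacent 12-month means so the result lines up with an actual month:

```python
import numpy as np
import pandas as pd

# Made-up monthly series: trend plus noise (illustrative only).
rng = np.random.default_rng(0)
y = pd.Series(100 + np.arange(48) + rng.normal(scale=5, size=48),
              index=pd.date_range("2020-01-31", periods=48, freq="M"))

# A 12-month average smooths out the seasonal and irregular components;
# a second 2-term average re-centers the result on month t-6.
ma12 = y.rolling(window=12).mean()
cma = ma12.rolling(window=2).mean().shift(-6)   # centered moving average
print(cma.dropna().head())
```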
A classification model's misclassification rate on the validation data is a better measure of the model's predictive ability on new (unseen) data than its misclassification rate on the training data. Explain whether this statement is accurate and why that is so.
This statement is accurate. A classification model uses its validation data to test the model's accuracy, or its predictive ability. Therefore, the misclassification rate on the validation data is a better indicator of a model's predictive ability than the training data.
The misclassification rate on the validation set is a better measure because we want to see how well the model can function on unseen data.
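A small scikit-learn sketch of the point on made-up data (all names and settings are illustrative): an unpruned tree memorizes the training records, so its training misclassification rate looks far better than the validation rate, and only the latter resembles performance on new data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# A fully grown tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("training misclassification:  ", 1 - tree.score(X_tr, y_tr))   # near 0
print("validation misclassification:", 1 - tree.score(X_va, y_va))   # noticeably higher
```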
The 1st step in data mining procedures according to SAS and IBM SPSS is to "sample" the data. Sampling here refers to dividing the data available for analysis into at least two parts: a training data set and a validation data set. Why do both SAS and IBM SPSS recommend this as a first step? What are the risks of ignoring this procedural requirement?
Both SAS and IBM recommend sampling as the first step since we need the training data set to build the model and validation data set to test the model’s accuracy. The risk in ignoring this step is creating bias. If a data scientist uses the same data to both build and test the model, and that model is overfit, then most likely the results will also be overfit.
We need to see how this model will work on the data we have and in the real world.
How do unstructured and structured data differ? Which is the more prevalent form of data? How would the following be classified: numbers in an Excel spreadsheet, text files, video images, audio files?
Structured data: data that does have a predefined data model
Unstructured data: data that does not have a predefined data model
Unstructured data is the more prevalent form of data because it comes in many different forms which we are exposed to daily
Excel spreadsheet: structured data
A thousand text files: unstructured
A thousand video images: unstructured
A thousand audio files: unstructured
Some data mining algorithms work so well they have the tendency to overfit the training data. What does the term overfit mean, and what does overlooking it cause for the data scientist?
Overfitting: When we put too many attributes (or try to account for too many patterns) in a model, including some unrelated to the target.
If a data scientist overfits their data they will incorrectly explain some variation in the data that is nothing more than a chance variation. In other words, they will have mislabeled the noise in the data as part of the “true signal”
If you overfit, you model the noise in the data; the model will have a great fit on the training data but low accuracy on new data.
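A tiny numpy illustration of modeling the noise (entirely made-up data): a high-degree polynomial chases the random scatter, so it typically fits the training points better than a straight line yet predicts fresh data from the same process worse.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=x.size)        # true signal is a straight line
x_new = np.linspace(0.02, 0.98, 20)                    # fresh data from the same process
y_new = 2 * x_new + rng.normal(scale=0.3, size=x_new.size)

for degree in (1, 12):
    coefs = np.polyfit(x, y, degree)                   # degree 12 can chase the noise
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    new_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")
```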
What are ways to make a forecast less biased?
use different forecast methods
use different forecasters
use different sources of data