Exam 2 Flashcards
What does the straight line on the lift chart represent?
the expected number of positives in any group of cases we would predict if we used the naive model
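A tiny numpy sketch of the idea (the scores and outcomes are made up): the model's cumulative count of positives is compared against that straight baseline, which rises in simple proportion to how many cases have been examined.

```python
import numpy as np

# Made-up predicted scores and true 0/1 outcomes for ten cases.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])
actual = np.array([1,   1,   0,    1,   0,    1,   0,    0,   0,   0])

# Rank cases from highest to lowest score, then count positives found so far.
order = np.argsort(-scores)
cumulative_model = np.cumsum(actual[order])

# The straight line: with no model, expected positives grow in proportion
# to the share of cases examined.
n, total_pos = len(actual), actual.sum()
cumulative_naive = total_pos * np.arange(1, n + 1) / n

print(cumulative_model)   # 1, 2, 2, 3, 3, 4, ...
print(cumulative_naive)   # 0.4, 0.8, 1.2, ...
```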
What is a validation set used to do?
compare models and pick the best one
Estimating a model that explains the training-set data points perfectly and leaves little error, but that is unlikely to be accurate in prediction, is
overfitting
With most data mining techniques, why do we partition the data?
in order to judge how our model will do when we apply it to new data
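A minimal sketch of that partitioning with scikit-learn (the file and column names and the 70/30 split are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: predictor columns plus a known class label "target".
df = pd.read_csv("customers.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Hold 30% of the records back as a validation set; the model never sees them
# during training, so their error rate mimics performance on new data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```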
What data mining technique groups objects together based upon maximizing the intraclass similarity and minimizing interclass similarity?
clustering
inter- = among
intra- = within
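A small illustrative sketch with scikit-learn's KMeans (the toy two-dimensional data and k = 3 are assumptions); k-means works by minimizing the within-cluster (intra-class) distances, which at the same time keeps the clusters separated from one another:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three loose blobs in two dimensions (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# inertia_ is the total within-cluster (intra-class) sum of squared distances
# that k-means minimizes.
print(km.labels_[:10], km.inertia_)
```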
What are the tools and techniques that are used in the large-scale or big data arena?
data mining
What new mindset is needed to begin data mining using big data?
we will need to be open to finding relationships and patterns we never imagined existed in the data we are about to examine
In “The Big Data Future Has Arrived” by Michael Malone, the statement is made that:
- metadata is more important than the big data itself
- the major challenge to big data analysis will be overcome because the fruits of big data are too valuable
- discovery of this “metadata” may prove to be the undoing of big data analysis
- privacy issues will prevent big data analysis from advancing beyond what we’ve already seen
2
What are the four categories of analytic tools available in data mining?
prediction, classification, clustering, association
What data mining tool allows us to predict a class of objects whose label is unknown to us?
prediction
A forecasting model in statistics is what in data mining?
algorithm
The data mining term “score” is known as __ in statistics
forecast
What stat terminology is referred to as a record in data mining terminology?
observation
What are the 5 steps identified by SAS for the data mining process?
sample, explore, modify, model, and assess
The data mining process step that involves creating, selecting, or transforming data is called
modify
The data mining process step that involves data cleansing is called
explore
In “The Invisible Digital Hand,” the replacement of the visible hand in competition by the digitized hand…
- is usually accompanied by fewer firms in the marketplace
- could result in less price comparison and more impulse buying
- can give rise to anticompetitive behavior
- does not give rise to the “frenemy” relationships
3
Most economic time series are integrated in what order?
one
Can’t use ARIMA on a series with a trend. Must difference it first (the “I”, integration, refers to taking 1st differences). Most economic time series are integrated of order one.
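A short sketch of first differencing with pandas, plus an augmented Dickey-Fuller check from statsmodels (the file and column names are assumptions):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Assumed: a trending economic series indexed by date.
y = pd.read_csv("sales.csv", index_col=0, parse_dates=True)["sales"]

# First differences: the period-to-period change in the series.
dy = y.diff().dropna()

# ADF test: a small p-value suggests stationarity. If the levels fail the test
# but the first differences pass, the series is integrated of order one.
for name, series in [("levels", y), ("first differences", dy)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF p-value = {pvalue:.3f}")
```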
Which of the following models utilizes a transformed series to induce a stationary series?
- ARIMA(1,0,1)
- ARIMA(1,0,0)
- ARIMA(1,1,1)
- ARIMA(0,0,1)
3 - the I has to be a 1 because that is the transformation (differencing)
Which of the following is NOT a characteristic of a time series best represented as an ARIMA(3,0,1)?
- og series is stationary
- autocorrelation function has one dominant spike
- the partial autocorrelation function has one dominant spike
- the partial autocorrelation function has 3 spikes
- none are correct
one dominant spike
Which of the following is not a first step in the ARIMA model selection process?
- examine the ACF of the raw series
- examine the PACF of the raw series
- test the data for stationarity
- estimate an ARIMA (1,1,1) model for reference purposes
- all of the options are correct
4
What is the Q stat based on?
estimated autocorrelation function
What is the Q stat used to test?
whether a series is white noise or not
T/F: the Q stat follows the chi-squared distribution.
T
What tests whether the residual autocorrelations, as a set, are significantly different from zero?
Q stat
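A minimal sketch using statsmodels' acorr_ljungbox; the stand-in residual series here is made-up white noise, whereas in practice you would pass your model's residuals:

```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

# Stand-in residuals: pure white noise (replace with actual model residuals).
resid = np.random.default_rng(0).normal(size=200)

# The Q statistic is built from the estimated autocorrelations of the series.
# A large p-value means we cannot reject the null that the series is white
# noise (all autocorrelations jointly zero).
lb = acorr_ljungbox(resid, lags=[12, 24])
print(lb)   # recent statsmodels returns a DataFrame with lb_stat (Q) and lb_pvalue
```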
ARIMA models require that data be….
stationary
In what situation does the ARIMA model have a decided advantage over standard regression models?
when we don’t know the predictors of the variable to be forecast
The philosophy of the Box-Jenkins methodology of using ARIMA models assumes what?
that the series we are observing started as white noise and was transformed by the black box process into the observed series
Which of the following does the Box-Jenkins methodology of using ARIMA models attempt to discern?
- that the correct black box could have produced explanatory variables
- that the correct black box could have produced an observed time series
- that the correct black box could have produced such a series from white noise
- that the correct black box could have produced a patterned time series
that the correct black box could have produced such a series from white noise
How could you graphically describe the process of the Box-Jenkins methodology of using ARIMA?
white noise → black box → observed time series
A moving average model is simply one that predicts Yt as a function of the __ in predicting Yt
white noise
How is the equation for the autoregressive model different from the equation for the MA model?
the dep variable depends on its own previous values rather than the white noise series or residuals
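Written out side by side (generic textbook notation, not from the cards, with e_t as the white-noise error term):

```latex
% AR(p): Y_t is explained by its own previous values plus current white noise
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t

% MA(q): Y_t is explained by current and past white-noise errors
Y_t = \mu + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \dots + \theta_q e_{t-q}
```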
Why is stationarity important?
because a series needs to be stationary before you identify the correct model
What is one method to help us achieve stationarity?
differencing
An ARIMA(p,d,q) model is one that has had differencing used to make a time series
stationary
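A minimal statsmodels sketch: the d in the order tuple tells the model how many times to difference the series internally (the random-walk series and the (1,1,1) order are assumptions):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Stand-in series: a random walk, i.e. integrated of order one.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))

# order=(p, d, q): d=1 makes the model work with first differences, so the
# AR and MA terms are fitted to a stationary (differenced) series.
fit = ARIMA(y, order=(1, 1, 1)).fit()
print(fit.summary())
```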
Using the Ljung-Box stat applied to a sample with 30 degrees of freedom, we cannot reject the null of a white noise process if the sample Q-value is less than __ at the 10% level of significance
40
What is the null hypothesis being tested using the Ljung-Box stat?
the set of autocorrelations is jointly equal to zero
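The cutoff quoted above is just the chi-squared critical value; a one-line scipy check (30 degrees of freedom and the 10% level are taken from the question):

```python
from scipy.stats import chi2

# Upper 10% critical value of a chi-squared distribution with 30 df.
print(chi2.ppf(0.90, df=30))   # about 40.26, so a Q below ~40 fails to reject white noise
```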
What problem arises when applying ARIMA type models to highly seasonal monthly data?
extremely high order AR and MA processes
Besides using sophisticated ARIMA-type models capable of internally handling data seasonality, an alternative is to use which of the following?
- seasonal dummy variables
- trend dummy variables
- deseasonalized data, then reseasonalize to generate forecasts
- Holt's smoothing
- all are correct
deseasonalized data, then reseasonalize to generate forecasts
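A rough sketch of the deseasonalize-then-reseasonalize idea with statsmodels' seasonal_decompose (the monthly file, the multiplicative form, and the naive placeholder forecast are all assumptions):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed: a monthly series with strong seasonality.
y = pd.read_csv("monthly_sales.csv", index_col=0, parse_dates=True)["sales"]

# Estimate multiplicative seasonal indices and strip them out.
decomp = seasonal_decompose(y, model="multiplicative", period=12)
deseasonalized = y / decomp.seasonal

# Forecast the smoother deseasonalized series with any simple model
# (placeholder: carry the last value forward), then multiply the seasonal
# index of the forecast month back in to reseasonalize.
forecast_deseasonalized = deseasonalized.iloc[-1]
seasonal_index = decomp.seasonal.iloc[-12]   # the next calendar month, seen a year earlier
print(forecast_deseasonalized * seasonal_index)
```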
What is the key difference between ARIMA-type models and multiple regression models?
use of explanatory variables (ARIMA doesn't have any)
In the classical time series decomposition algebraic model, Y = T × S × C × I, what is C?
measurement of the very long-term movements of the data that are often thought of as waves
In the classical time series decomposition algebraic model, Y = T × S × C × I, what is I?
measurement of the irregular movement or random variations in the series
If a business cycle always had the same vertical distance from trough to peak, it would be called
constant amplitude
What is the centered moving average?
the series that remains after the seasonality and irregular components have been smoothed out by using moving averages
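A quick pandas sketch of a 2×12 centered moving average on made-up monthly data; because 12 is an even window, the textbook method averages two adjacent 12-month means so the result lines up with an actual month:

```python
import numpy as np
import pandas as pd

# Made-up monthly series: trend plus noise (illustrative only).
rng = np.random.default_rng(0)
y = pd.Series(100 + np.arange(48) + rng.normal(scale=5, size=48),
              index=pd.date_range("2020-01-31", periods=48, freq="M"))

# A 12-month average smooths out the seasonal and irregular components;
# a second 2-term average re-centers the result on month t-6.
ma12 = y.rolling(window=12).mean()
cma = ma12.rolling(window=2).mean().shift(-6)   # centered moving average
print(cma.dropna().head())
```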
A classification model's misclassification rate on the validation data is a better measure of the model's predictive ability on new (unseen) data than its misclassification rate on the training data. Explain whether this statement is accurate and why that is so.
This statement is accurate. A classification model uses its validation data to test the model's accuracy, or its predictive ability. Therefore, the misclassification rate on the validation data is a better indicator of a model's predictive ability than the training data.
The misclassification rate on the validation set is a better measure because we want to see how well the model can function on unseen data.
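A small scikit-learn sketch of the point on made-up data (all names and settings are illustrative): an unpruned tree memorizes the training records, so its training misclassification rate looks far better than the validation rate, and only the latter resembles performance on new data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# A fully grown tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("training misclassification:  ", 1 - tree.score(X_tr, y_tr))   # near 0
print("validation misclassification:", 1 - tree.score(X_va, y_va))   # noticeably higher
```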
The 1st step in data mining procedures according to SAS and IBM SPSS is to "sample" the data. Sampling here refers to dividing the data available for analysis into at least two parts: a training data set and a validation data set. Why do both SAS and IBM SPSS recommend this as a first step? What are the risks of ignoring this procedural requirement?
Both SAS and IBM recommend sampling as the first step since we need the training data set to build the model and validation data set to test the model’s accuracy. The risk in ignoring this step is creating bias. If a data scientist uses the same data to both build and test the model, and that model is overfit, then most likely the results will also be overfit.
We need to see how this model will work on the data we have and in the real world.
How do unstructured and structured data differ? Which is the more prevalent form of data? How would the following be classified: numbers in an Excel spreadsheet, text files, video images, audio files?
Structured data: data that does have a predefined data model
Unstructured data: data that does not have a predefined data model
Unstructured data is the more prevalent form of data because it comes in many different forms which we are exposed to daily
Excel spreadsheet: structured data
A thousand text files: unstructured
A thousand video images: unstructured
A thousand audio files: unstructured
Some data mining algorithms work so well they have the tendency to overfit the training data. What does the term overfit mean, and what does overlooking it cause for the data scientist?
Overfitting: When we put too many attributes (or try to account for too many patterns) in a model, including some unrelated to the target.
If a data scientist overfits their data they will incorrectly explain some variation in the data that is nothing more than a chance variation. In other words, they will have mislabeled the noise in the data as part of the “true signal”
If you overfit, you model the noise in the data; the model will have a great fit on the training data but low accuracy on new data.
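A tiny numpy illustration of modeling the noise (entirely made-up data): a high-degree polynomial chases the random scatter, so it typically fits the training points better than a straight line yet predicts fresh data from the same process worse.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(scale=0.3, size=x.size)        # true signal is a straight line
x_new = np.linspace(0.02, 0.98, 20)                    # fresh data from the same process
y_new = 2 * x_new + rng.normal(scale=0.3, size=x_new.size)

for degree in (1, 12):
    coefs = np.polyfit(x, y, degree)                   # degree 12 can chase the noise
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    new_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")
```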
What are ways to make a forecast less biased?
use different forecast methods
use different forecasters
use different sources of data