Analytic Techniques Flashcards
what technique would you use if you needed to group items or find structure?
a) regression
b) clustering
c) time series
b) clustering
what technique would you use if you needed to discover relationships between actions or items?
a) text analysis
b) regression
c) classification
d) association rules
d) association rules
what technique would you use if you needed to determine the relationship between the input variables and the outcome?
a) text analysis
b) regression
c) time series
b) regression
what technique would you use if you needed to assign labels to objects?
a) classification
b) text analysis
c) regression
a) classification
what technique would you use if you needed to find structure in temporal data in order to make forecasts?
a) classification
b) text analysis
c) time series
c) time series
what technique would you use if you needed to analyse free text?
a) time series
b) clustering
c) classification
d) text analysis
d) text analysis
what technique is clustering?
k-means
what technique is regression?
linear and logistic
what technique is classification?
naive bayes
decision trees
what technique is association rules?
apriori
what technique is time series?
ARMA, ARIMA, PACF & ACF
what technique is text analysis?
regular expressions
bag of words
TF-IDF
which methods are unsupervised learning methods?
k-means
apriori
what is the output of k-means?
the cluster centre
what is the input of k-means?
numerical - Euclidean distance
what is Euclidean distance?
the ordinary straight-line distance between two points: the square root of the sum of the squared differences of their coordinates
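As a quick illustration of the distance card above, here is a minimal sketch in Python. The function name `euclidean_distance` is made up for this example.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Classic 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance((0, 0), (3, 4)))
```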
if a domain does not suggest a suitable value for k then what do you do?
plot WSS (within sum of squares) against k and look for the elbow
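The elbow approach can be sketched as follows. This is a toy version of Lloyd's algorithm written only for illustration; the sample points, the helper names, and the fixed seed are all assumptions, and a real analysis would normally use a library implementation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's algorithm; returns the k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [mean_point(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres

def wss(points, centres):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(sq_dist(p, c) for c in centres) for p in points)

# Hypothetical data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
wss_by_k = {k: wss(points, kmeans(points, k)) for k in range(1, 6)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

Plotting `wss_by_k` against k and looking for the point where the curve flattens gives the candidate value of k.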
in k-means what do you do if it's missing expected splits?
increase k
in k-means what do you do if its clusters have few data points?
decrease k
in k-means what do you do if the centroids are close together?
decrease k
what is the right description for apriori?
a) if y is observed, then x is also observed
b) if x is observed, then y is also observed
b) if x is observed, then y is also observed
what are association rules sometimes referred to as?
a) market analysis
b) market basket analysis
c) task basket analysis
b) market basket analysis
what is a frequent itemset for apriori?
set of items that appear together “often enough”
what is normally the support % for apriori? (confidence)
50%
what is confidence in apriori?
% of transactions that contain x that also contain y
in apriori, what does lift mean?
how many times more often x and y occur together than expected
in apriori, what does leverage mean?
measures the difference between the probability of x and y appearing together and the probability expected if x and y were independent
how do you work out confidence with apriori? for example credit good = 700
job skilled = 544
a) 700/544
b) 544/700
b) 544/700
what is a test set?
hold back some baskets with few random values removed - can the rules fill in the blanks
how do you work out lift if 713 are home owners, 527 of them have good credit, and 700 have good credit overall?
a) 0.527/(0.700*0.713)
b) 527/(0.700*713)
c) 527/713
a) 0.527/(0.700*0.713)
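The home-owner example can be worked through in a few lines of Python. The total of 1000 records is an assumption, implied by the fractions 0.527, 0.700 and 0.713 in the answer; the variable names are illustrative.

```python
total = 1000        # assumed number of records (implied by 0.527, 0.700, 0.713)
home_owners = 713   # records containing x (home owner)
good_credit = 700   # records containing y (good credit)
both = 527          # records containing both x and y

support_x = home_owners / total
support_y = good_credit / total
support_xy = both / total

# confidence(x -> y): share of records with x that also contain y
confidence = support_xy / support_x
# lift: how many times more often x and y occur together than expected under independence
lift = support_xy / (support_x * support_y)
# leverage: difference between the observed and the expected joint probability
leverage = support_xy - support_x * support_y

print(round(confidence, 3), round(lift, 3), round(leverage, 4))
```

A lift close to 1 and a leverage close to 0, as here, suggest that x and y are nearly independent.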
what does regression do?
a) looks at a variable between inputs and the outcome
b) looks at the relationship between a set of variables and the outcome
c) looks at the relationship between a set of outputs
b) looks at the relationship between a set of variables and the outcome
what is linear regression?
used to estimate a continuous value as a linear function of the input variables
in regression, what does OLS stand for?
Ordinary least squares
in regression, what does OLS do?
finds the best fit line
what is the p-value in regression?
a) p-value can be used to look for numeric input values
b) p-value can be used to determine if the coefficient is significantly not different than zero.
c) p-value can be used to determine if the coefficient is significantly different than zero.
c) p-value can be used to determine if the coefficient is significantly different than zero.
what does a large p-value mean?
a) null hypothesis is rejected
b) null hypothesis is not rejected
b) null hypothesis is not rejected
what are residuals in regression?
a) the similarities between the observed and the estimated outcomes
b) the differences between the observed and the estimated outcomes
b) the differences between the observed and the estimated outcomes
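To make the OLS and residual cards concrete, here is a minimal sketch for a single input variable using the closed-form least squares solution. The data values and the name `ols_fit` are made up for illustration.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations.
xs = [1, 2, 3, 4]
ys = [2.9, 5.2, 6.8, 9.1]

intercept, slope = ols_fit(xs, ys)
# Residuals: observed outcome minus estimated outcome.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope, residuals)
```

With an intercept in the model, the residuals of an OLS fit sum to zero, which is a handy sanity check.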
what is logistic regression?
used to estimate the probability that an event will occur (e.g. the probability that a borrower will default)
what can logistic regression also be considered as?
classifier
what is the standard threshold of logistic regression?
0.5 (50%)
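The way logistic regression turns into a classifier through the 0.5 threshold can be sketched like this. The intercept and coefficient are hypothetical values, not fitted from any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(probability, threshold=0.5):
    """Turn an estimated probability into a class label (1 = event occurs)."""
    return 1 if probability >= threshold else 0

# Hypothetical coefficients for a model with a single input x.
intercept, coef = -3.0, 0.05

def default_probability(x):
    return sigmoid(intercept + coef * x)

for x in (20, 100):
    p = default_probability(x)
    print(x, round(p, 3), predict_label(p))
```

Raising or lowering `threshold` trades false positives against false negatives.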
What is the preferred method for binary classification problems?
Logistic regression
Which is not a binary classification problem?
A)true/false
B)approve/deny
C)respond to medical treatment/not respond
D)confidence/lift
D) confidence/lift
what does pseudo-r2 mean?
a) 1 - (deviance/null deviance)
b) r squared
c) square root
a) 1 - (deviance/null deviance)
what is naive Bayes?
determine the most probable class label for each object
what is naive Bayes based on?
Bayes law
what is naive Bayes used for?
a) spam filtering
b) scoring
c) fraud
d) text analysis
a) spam filtering
c) fraud
what is this?
P(C | A)P(A) = P(A | C)P(C) = P(A ^ C).
bayes law
to build the naive Bayes classifier what do you need?
probability of all class labels, plus the conditional probability of each attribute value given each class label
in naive Bayes how to classify something?
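for each class (e.g. good/bad), multiply together the conditional probabilities of the object's attribute values given that class, then multiply by the overall probability of the class; assign the class with the highest result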
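The classification step can be sketched on a tiny categorical dataset. The records, attribute names, and helper functions are hypothetical, and the sketch uses raw counts with no smoothing, so an attribute value never seen in a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (attribute values, class label).
records = [
    ({"housing": "own", "job": "skilled"}, "good"),
    ({"housing": "own", "job": "skilled"}, "good"),
    ({"housing": "rent", "job": "skilled"}, "good"),
    ({"housing": "own", "job": "unskilled"}, "good"),
    ({"housing": "rent", "job": "unskilled"}, "bad"),
    ({"housing": "rent", "job": "skilled"}, "bad"),
]

class_counts = Counter(label for _, label in records)
# value_counts[label][(attribute, value)] = records in that class with that value
value_counts = defaultdict(Counter)
for attrs, label in records:
    for item in attrs.items():
        value_counts[label][item] += 1

def score(attrs, label):
    """P(label) times the product of P(value | label) over the attributes."""
    result = class_counts[label] / len(records)
    for item in attrs.items():
        result *= value_counts[label][item] / class_counts[label]
    return result

def classify(attrs):
    """Return the class label with the highest score."""
    return max(class_counts, key=lambda label: score(attrs, label))

print(classify({"housing": "own", "job": "skilled"}))
print(classify({"housing": "rent", "job": "unskilled"}))
```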
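what is a confusion matrix?
a table of predicted versus actual class counts (true/false positives and negatives), used to work out rates such as TPR/FPR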
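A small sketch of how TPR and FPR come out of the four confusion-matrix counts. The actual and predicted labels below are invented for illustration.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

tpr = tp / (tp + fn)  # true positive rate: share of actual positives caught
fpr = fp / (fp + tn)  # false positive rate: share of actual negatives flagged

print(tp, fn, fp, tn, tpr, fpr)
```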
where are decision trees found?
data mining applications
what are the two types of decision trees?
classification trees
regression trees
what is a classification tree?
segment observations into homogeneous groups
what is a regression tree?
a tree that predicts a continuous value; the average value of the observations at each leaf node is returned
what is a branch of decision tree?
outcome of decision
what is an internal node of decision tree?
test points
what is a leaf node of a decision tree?
end of the last branch
when should you use a decision tree?
when if-then is preferred to a linear model
what is a weak learner (decision trees)?
short decision tree
in decision trees how do you get the most informative attribute?
entropy based methods
what is this for? and what does it mean?
H_credit = -(0.7 log2(0.7) + 0.3 log2(0.3)) = 0.88 (very close to 1)
base entropy (decision tree) high entropy
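The base entropy figure on this card can be reproduced with a short sketch. The function name `entropy` is illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 70% good credit, 30% bad credit, as on the card above.
h_credit = entropy([0.7, 0.3])
print(round(h_credit, 2))

# For comparison: a 50/50 split gives the maximum entropy for two classes.
print(entropy([0.5, 0.5]))
```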
what does conditional entropy do in decision trees?
attribute values give more information about the class membership
what is information gain?
difference between base and conditional entropy
if you have a high information gain what does that mean?
first variable for tree split
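Information gain for one candidate attribute can be sketched as below. The attribute name and the per-value class counts are hypothetical, chosen so that the overall split is 7 good to 3 bad.

```python
import math

def entropy(counts):
    """Shannon entropy in bits, computed from class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical (good, bad) credit counts for each value of a "housing" attribute.
by_housing = {"own": (5, 1), "rent": (2, 2)}

class_totals = [sum(column) for column in zip(*by_housing.values())]  # [7, 3]
n = sum(class_totals)

base = entropy(class_totals)
# Conditional entropy: entropy within each attribute value, weighted by its share.
conditional = sum(sum(counts) / n * entropy(counts) for counts in by_housing.values())
gain = base - conditional

print(round(base, 3), round(conditional, 3), round(gain, 3))
```

Repeating the calculation for every attribute and picking the one with the largest gain gives the first split.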
which classifier for these questions: do I want class probabilities or just class labels
logistic regression
decision tree
which classifier for these questions:
do I want insight into how the variables affect the model?
logistic regression
decision tree
which classifier for these questions:
is the problem high dimensional?
naive bayes
which classifier for these questions:
do I suspect some of the inputs are correlated?
decision trees
logistic regression
which classifier for these questions:
do I suspect some of the inputs are irrelevant?
decision tree
naive bayes
which classifier for these questions:
are there categorical variables with a large number of levels?
naive bayes
decision tree
which classifier for these questions:
are there mixed variable types?
decision tree
logistic regression
which classifier for these questions:
are there non-linear elements or discontinuities in the data?
decision tree
what is time series analysis?
analysis of values recorded at equally spaced points in time
what does time series analysis do?
forecast
what is the difference between univariate time series and multivariate time series?
univariate is one variable; multivariate is more than one
in time series what is the Box-Jenkins method?
a methodology for fitting ARMA/ARIMA models to a time series in order to forecast future values
what does ARMA stand for?
autoregressive moving averages
who invented the ARMA model?
Box and Jenkins
what does the box-jenkins method assume the random component is?
stationary sequence
what does a stationary sequence mean?
a) constant variance
b) autocorrelation does not change
c) constant deviance
d) constant mean
constant variance
autocorrelation does not change
constant mean
to obtain a stationary sequence the data must be?
de-trended
seasonally adjusted
what does the ARIMA model do?
uses differencing to render the data stationary
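Differencing itself is simple to sketch: each value is replaced by its change from the previous value, repeated d times. The function name and sample series are illustrative.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

# Hypothetical series with a quadratic trend.
series = [1, 4, 9, 16, 25]
print(difference(series, d=1))  # linear trend remains
print(difference(series, d=2))  # constant: the trend is gone
```

Each round of differencing shortens the series by one value.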
how do you remove a simple linear trend in time series?
subtract a least-squares-fit straight line
how do you do a seasonal adjustment for time series?
calculate the average for each month and subtract it from the actual values for that month
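Both adjustments can be sketched in plain Python. The function names are made up, and `period` is a hypothetical parameter standing in for the 12 months of monthly data; a period of 3 is used below only to keep the example short.

```python
def detrend(series):
    """Subtract a least-squares-fit straight line from the series."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
             / sum((t - mean_t) ** 2 for t in range(n)))
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

def seasonal_adjust(series, period):
    """Subtract the average of each season (e.g. each month) from its values."""
    season_means = [sum(series[i::period]) / len(series[i::period])
                    for i in range(period)]
    return [y - season_means[i % period] for i, y in enumerate(series)]

print(detrend([1, 3, 5, 7]))
print(seasonal_adjust([10, 20, 30, 12, 22, 32], period=3))
```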
what model uses P,Q in time series?
ARMA
in AR what is Y?
a) Yt is a linear combination of its last p values
b) Yt is a linear combination of its last q values
a) Yt is a linear combination of its last p values
in MA what is Y?
a) Yt is a constant value plus the effects of a dampened white noise process over the last p time values (lags)
b) Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)
b) Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)
What is the d in ARIMA (p,d,q)?
differencing term
what does ARIMA stand for?
autoregressive integrated moving average
what does p mean in time series (ARMA, ARIMA)?
number of autoregressive terms
what does d mean in time series (ARMA, ARIMA)?
the number of differences
what does q mean in time series (ARMA, ARIMA)?
the number of moving average terms
in time series, what does ACF mean?
autocorrelation function
what is ACF?
provides indication of the stationarity of the data
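A minimal sketch of the sample autocorrelation function, using the common estimator that divides each lagged sum by the lag-0 sum of squares. The function name and the alternating test series are illustrative.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((y - mean) ** 2 for y in series)
    return [
        sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)) / c0
        for lag in range(max_lag + 1)
    ]

# Hypothetical series that flips sign every step: strongly negative at lag 1.
values = acf([1, -1, 1, -1, 1, -1], max_lag=2)
print([round(v, 3) for v in values])
```

An ACF that decays slowly toward zero is a typical sign that the series is not stationary.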
in time series, what does PACF mean?
partial autocorrelation function
what is PACF?
autocorrelation calculated after removing the linear dependence of the previous terms
what is text analysis?
processing of text
why is text analysis high-dimensional?
every word is a dimension
what are the three problem solving tasks in text analysis?
parsing
search/retrieval
text-mining
what is parsing in text analysis?
imposing structure
what is search/retrieval in text analysis?
searching for word or phrase
what is a corpus?
a large body of text (a collection of documents)
what is text-mining in text analysis?
understanding the content
what is regex (regular expressions) in text analysis?
used for finding words, strings or patterns in text
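A short sketch of pattern matching with Python's built-in `re` module. The sample text and the deliberately simple patterns are made up; real email matching needs a more careful pattern.

```python
import re

# Hypothetical text to search.
text = "Contact alice@example.com or bob@example.org by 2024-05-01."

# Simplified patterns: an email-like token and an ISO-style date.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)
print(dates)
```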
what is bag of words in text analysis?
represents a document as its words and their counts (term frequency, tf), ignoring word order
what is reverse index in text analysis?
a list of all the documents that contain that feature
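The reverse index (often called an inverted index) can be sketched as a mapping from each term to the ids of the documents that contain it. The documents and the function name are hypothetical, and tokenizing by whitespace is a simplification.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["the cat sat", "the dog sat", "the cat ate the fish"]
index = build_index(docs)
print(sorted(index["cat"]))
print(sorted(index["fish"]))
```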
what is IDF in text analysis?
inverse document frequency
what are the metrics in text analysis that determine the quality of results?
a) recall, relevance, confidence
b) relevance, precision, recall
c) relevance, lift, recall
b) relevance, precision, recall
what does IDF do?
measures the uniqueness of a term in the corpus
what does tf-idf mean?
measure of relevance
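Putting tf and idf together, a minimal sketch might look like this. It uses one common definition of IDF, log(N / number of documents containing the term); other variants exist. The corpus and function names are hypothetical, and `idf` assumes the term appears in at least one document.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ate the fish"]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: how often the term occurs in one document."""
    return Counter(tokens)[term]

def idf(term):
    """Inverse document frequency; assumes the term occurs somewhere in the corpus."""
    containing = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / containing)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

last_doc = tokenized[2]
for term in ("the", "cat", "fish"):
    print(term, round(tf_idf(term, last_doc), 3))
```

A word that appears in every document, such as "the", scores 0 however often it occurs, while a rare word such as "fish" scores highest.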