Analytic Techniques Flashcards by Jodie Collins

what technique would you use if you needed to group items or find structure?

a) regression
b) clustering
c) time series

b) clustering

what technique would you use if you needed to discover relationships between actions or items?

a) text analysis
b) regression
c) classification
d) association rules

d) association rules

what technique would you use if you needed to determine the relationship between the input variables and the outcome?

a) text analysis
b) regression
c) Time series

b) regression

what technique would you use if you needed to assign labels to objects?

a) classification
b) text analysis
c) regression

a) classification

what technique would you use if you needed to find structure in temporal data in order to make forecasts?

a) classification
b) text analysis
c) time series

c) time series

what technique would you use if you needed to analyse free text?

a) time series
b) clustering
c) classification
d) text analysis

d) text analysis

what technique is clustering?

k-means

what technique is regression?

linear and logistic

what technique is classification?

naive bayes

decision trees

what technique is association rules?

apriori

what technique is time series?

ARMA, ARIMA, PACF & ACF

what technique is text analysis?

regular expressions
bag of words
TF-IDF

which methods are unsupervised learning methods?

k-means

apriori

what is the output of k-means?

the cluster centres (centroids), with each data point assigned to its nearest centre

what is the input of k-means?

numerical - Euclidean distance

what is Euclidean distance?

the ordinary straight-line distance between two points: the square root of the sum of the squared differences between their coordinates
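
As a small illustration of the distance measure k-means relies on, here is a minimal sketch. The function name `euclidean_distance` and the sample points are chosen for this example only.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
d = euclidean_distance((0, 0), (3, 4))
print(d)
```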

if a domain does not suggest a suitable value for k then what do you do?

plot wss and look for elbow
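
The elbow idea can be sketched without any plotting library by printing the within-cluster sum of squares (WSS) for several values of k. This is only a sketch under stated assumptions: the toy data set is made up, and the deterministic "farthest-first" starting centroids are a simplification chosen for this example (k-means is usually started from random centroids and run several times).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans_wss(points, k, iterations=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Assumption for this sketch: farthest-first initial centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Made-up data: three tight groups of four points each.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11),
          (20, 0), (21, 0), (20, 1), (21, 1)]

wss = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this data WSS drops sharply up to k = 3 and only slightly afterwards, which is the "elbow" the card refers to.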

in k-means what do you do if it's missing expected splits?

increase k

in k-means what do you do if its clusters have few data points?

decrease k

in k-means what do you do if the centroids are close together?

decrease k

what is the right description for apriori?

a) if y is observed, then x is also observed
b) if x is observed, then y is also observed

b) if x is observed, then y is also observed

what is association rule mining sometimes referred to as?

a) market analysis
b) market basket analysis
c) task basket analysis

b) market basket analysis

what is a frequent itemset for apriori?

set of items that appear together “often enough”

what is normally the support % for apriori? (confidence)

50%

what is confidence in apriori?

% of transactions that contain x that also contain y

in apriori, what does lift mean?

how many times more often x and y occur together than would be expected if they were independent

in apriori, what does leverage mean?

measures the difference between the probability of x and y appearing together and the probability expected if they were independent

how do you work out confidence with apriori? for example, credit good = 700 and credit good together with job skilled = 544

a) 700/544
b) 544/700

b) 544/700

what is a test set?

hold back some baskets with few random values removed - can the rules fill in the blanks

how do you work out the lift if, out of 1,000 people, 713 are home owners, 527 are home owners with good credit, and 700 have good credit overall?

a) 0.527/(0.700*0.713)
b) 527/(0.700*713)
c) 527/713

a) 0.527/(0.700*0.713)
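
The support, confidence, lift and leverage cards above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `rule_measures` is made up for this example, and the total of 1,000 records is an assumption inferred from the fractions 0.527, 0.700 and 0.713 in the answer.

```python
def rule_measures(n_total, n_x, n_y, n_xy):
    """Measures for the rule x -> y, computed from raw counts."""
    support = n_xy / n_total                       # P(x and y)
    confidence = n_xy / n_x                        # P(y | x)
    expected = (n_x / n_total) * (n_y / n_total)   # P(x) * P(y) under independence
    return {
        "support": support,
        "confidence": confidence,
        "lift": support / expected,
        "leverage": support - expected,
    }

# x = home owner (713), y = good credit (700), both = 527.
# Assumption: 1,000 records in total.
m = rule_measures(n_total=1000, n_x=713, n_y=700, n_xy=527)
for name, value in m.items():
    print(name, round(value, 4))
```

A lift just above 1 and a leverage just above 0 both say the same thing here: home ownership and good credit occur together only slightly more often than independence would predict.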

what does regression do?

a) looks at a variable between inputs and the outcome
b) looks at the relationship between a set of variables and the outcome
c) looks at the relationship between a set of outputs

b) looks at the relationship between a set of variables and the outcome

what is linear regression?

used to estimate a continuous outcome as a linear function of the input variables

in regression, what does OLS stand for?

Ordinary least squares

in regression, what does OLS do?

finds the best-fit line by minimising the sum of the squared residuals

what is the p-value in regression?

a) p-value can be used to look for numeric input values
b) p-value can be used to determine if the coefficient is significantly not different than zero.
c) p-value can be used to determine if the coefficient is significantly different than zero.

c) p-value can be used to determine if the coefficient is significantly different than zero.

what does a large p-value mean?

a) null hypothesis is rejected
b) null hypothesis is not rejected

b) null hypothesis is not rejected

what are residuals in regression?

a) the similarities between the observed and the estimated outcomes
b) the differences between the observed and the estimated outcomes

b) the differences between the observed and the estimated outcomes
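
To make the OLS and residual cards concrete, here is a minimal sketch of ordinary least squares for a single input variable, using the closed-form slope and intercept. The data values and the name `ols_fit` are made up for this example.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]

intercept, slope = ols_fit(xs, ys)

# Residual = observed outcome minus estimated outcome.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(intercept, slope)
print(residuals)
```

With an intercept in the model, the residuals of an OLS fit sum to zero (up to floating-point error), which is a quick sanity check on any implementation.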

what is logistic regression?

used to estimate the probability that an event will occur (probability borrower will default)

what can logistic regression also be considered as?

classifier

what is the standard threshold of logistic regression?

0.5 (50%)
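
How the 0.5 threshold turns a probability into a class label can be sketched as follows. The intercept and coefficient are hypothetical values invented for this example, not fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default(income_k, intercept=4.0, coef=-0.1, threshold=0.5):
    """Return (probability, label) for a borrower.

    intercept and coef are hypothetical model parameters for illustration.
    """
    probability = sigmoid(intercept + coef * income_k)
    label = "default" if probability >= threshold else "no default"
    return probability, label

low_income = predict_default(20)    # z = 2, probability about 0.88
high_income = predict_default(60)   # z = -2, probability about 0.12
print(low_income, high_income)
```

Lowering or raising `threshold` trades false positives against false negatives, which is why 0.5 is a default rather than a rule.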

What is the preferred method for binary classification problems?

Logistic regression

Which is not a binary classification problem?

A) true/false
B) approve/deny
C) respond to medical treatment/not respond
D) confidence/lift

D) confidence/lift

what does pseudo-r2 mean?

a) 1 - (deviance / null deviance)
b) r squared
c) square root

a) 1 - (deviance / null deviance)

what is naive Bayes?

determine the most probable class label for each object

what is naive Bayes based on?

Bayes law

what is naive Bayes used for?

a) spam filtering
b) scoring
c) fraud
d) text analysis

a) spam filtering | c) fraud

what is this? | P(C | A)*P(A) = P(A | C)*P(C) = P(A ^ C).

bayes law

to build the naive Bayes classifier what do you need?

the probability of each class label, plus the conditional probability of each attribute value given each class

in naive Bayes how to classify something?

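for each class (good/bad), multiply the conditional probabilities of the observed attribute values given that class together, then multiply by the overall probability of the class; assign the class with the largest result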
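
The classification recipe above can be sketched on a tiny categorical data set. Everything here is illustrative: the eight records are made up, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train(rows):
    """Count class labels and attribute values per class."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    for attrs, label in rows:
        for name, value in attrs.items():
            value_counts[(label, name)][value] += 1
    return class_counts, value_counts

def classify(attrs, class_counts, value_counts):
    """Score each class as P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for name, value in attrs.items():
            score *= value_counts[(label, name)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up training records: (attributes, credit label).
rows = (
    [({"housing": "own", "job": "skilled"}, "good")] * 3
    + [({"housing": "rent", "job": "skilled"}, "good")]
    + [({"housing": "rent", "job": "unskilled"}, "bad")] * 2
    + [({"housing": "own", "job": "unskilled"}, "bad")]
    + [({"housing": "rent", "job": "skilled"}, "bad")]
)

class_counts, value_counts = train(rows)
label, scores = classify({"housing": "own", "job": "skilled"},
                         class_counts, value_counts)
print(label, scores)
```

For the query record, "good" scores 0.5 * 0.75 * 1.0 and "bad" scores 0.5 * 0.25 * 0.25, so "good" is chosen.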

what is a confusion matrix

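a table of predicted against actual class labels (true/false positives and negatives), from which TPR and FPR are calculated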
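
A minimal sketch of how the four cells of a confusion matrix lead to the true positive rate (TPR) and false positive rate (FPR). The label lists are made up for this example.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, tn, fn) treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

actual    = ["good", "good", "good", "bad", "bad", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "bad", "good", "good", "bad", "bad"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="good")
tpr = tp / (tp + fn)   # share of actual positives that were caught
fpr = fp / (fp + tn)   # share of actual negatives wrongly flagged
print(tp, fp, tn, fn, tpr, fpr)
```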

where are decision trees found?

data mining applications

what are the two types of decision trees?

classification trees | regression trees

what is a classification tree?

segment observations into homogeneous groups

what is a regression tree?

a tree that estimates a continuous value; the average value of the observations in each leaf node is returned

what is a branch of decision tree?

outcome of decision

what is an internal node of decision tree?

test points

what is a leaf node of a decision tree?

end of the last branch

when should you use a decision tree?

when if-then is preferred to a linear model

what is a weak learner (decision trees)

short decision tree

in decision trees how do you get the most informative attribute?

entropy based methods

what is this for? and what does it mean? | Hcredit = -(0.7 log2(0.7) + 0.3 log2(0.3)) = 0.88 (very close to 1)

base entropy (decision tree) | high entropy

what does conditional entropy do in decision trees?

measures how much uncertainty about class membership remains once an attribute's value is known

what is information gain?

difference between base and conditional entropy
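
The entropy cards can be reproduced numerically. This sketch recomputes the base entropy of 0.88 from the 0.7/0.3 split quoted above; the attribute used for the conditional entropy (two equally sized groups with class mixes 0.9/0.1 and 0.5/0.5) is hypothetical and exists only to show the subtraction.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Base entropy of the class label (70% good credit, 30% bad credit).
h_credit = entropy([0.7, 0.3])

# Hypothetical attribute with two values, each covering half the records:
# (weight of the value, class probabilities within that value)
groups = [
    (0.5, [0.9, 0.1]),
    (0.5, [0.5, 0.5]),
]

# Conditional entropy: weighted average of the entropy inside each group.
h_conditional = sum(weight * entropy(probs) for weight, probs in groups)

# Information gain: base entropy minus conditional entropy.
gain = h_credit - h_conditional
print(round(h_credit, 2), round(h_conditional, 4), round(gain, 4))
```

The attribute with the largest gain would be the first candidate for a split.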

if you have a high information gain what does that mean?

first variable for tree split

which classifier for these questions: | do I want class probabilities or just class labels?

logistic regression | decision tree

which classifier for these questions: | do I want insight into how the variables affect the model?

logistic regression | decision tree

which classifier for these questions: | is the problem high dimensional?

naive bayes

which classifier for these questions: | do I suspect some of the inputs are correlated?

decision trees | logistic regression

which classifier for these questions: | do I suspect some of the inputs are irrelevant?

decision tree | naive bayes

which classifier for these questions: | are there categorical variables with a large number of levels?

naive bayes | decision tree

which classifier for these questions: | are there mixed variable types?

decision tree | logistic regression

which classifier for these questions: | are there non-linear elements or discontinuities in the data?

decision tree

what is time series analysis?

analysis of an ordered sequence of equally spaced values over time

what does time series analysis do?

forecast

what is the difference between univariate time series and multivariate time series?

univariate follows one variable over time; multivariate follows several

in time series what is the Box-Jenkins method?

an approach for fitting ARMA/ARIMA models to a time series in order to forecast future values

what does ARMA stand for?

autoregressive moving averages

who is the ARMA modelling method named after?

Box and Jenkins

what does the box-jenkins method assume the random component is?

stationary sequence

what does a stationary sequence mean?

a) constant variance
b) autocorrelation does not change
c) constant deviance
d) constant mean

a) constant variance | b) autocorrelation does not change | d) constant mean

to obtain a stationary sequence the data must be?

de-trended | seasonally adjusted

what does the ARIMA model do?

uses differencing to render the data stationary
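
Differencing replaces each value with its change from the previous value, and can be repeated d times. A minimal sketch with made-up series (the function name `difference` is chosen for this example):

```python
def difference(series, d=1):
    """Apply first differencing d times: y'[t] = y[t] - y[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

linear = [3, 5, 7, 9, 11]        # steady upward trend
quadratic = [1, 4, 9, 16, 25]    # trend that itself grows

once = difference(linear)            # the linear trend becomes a constant
twice = difference(quadratic, d=2)   # a quadratic trend needs d = 2
print(once, twice)
```

Each round of differencing shortens the series by one value.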

how do you remove a simple linear trend in time series?

subtracting least-squares-fit straight line
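
Removing a linear trend can be sketched by fitting a least-squares line against the time index and keeping what is left over. The series below is made up, and the time index is assumed to be 0, 1, 2, ... with equal spacing.

```python
def detrend(series):
    """Subtract the least-squares straight line fitted against t = 0, 1, 2, ..."""
    n = len(series)
    t_bar = (n - 1) / 2
    y_bar = sum(series) / n
    slope = (
        sum((t - t_bar) * (y - y_bar) for t, y in enumerate(series))
        / sum((t - t_bar) ** 2 for t in range(n))
    )
    intercept = y_bar - slope * t_bar
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

series = [1, 4, 5, 8, 9, 12]   # upward trend with a small wobble
detrended = detrend(series)
print([round(v, 3) for v in detrended])
```

What remains fluctuates around zero, which is the property the stationarity cards ask for in the mean.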

how do you do a seasonal adjustment for time series?

calculating the average for each month and subtracting them from the actual value
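
The seasonal adjustment described above can be sketched as follows. The two years of monthly values are made up so that the result is easy to check by eye; `period=12` assumes monthly data.

```python
def seasonal_adjust(values, period=12):
    """Subtract from each value the average of all values in the same season."""
    if len(values) < period:
        raise ValueError("need at least one full period of data")
    # values[i::period] picks every value that falls in season i
    # (for monthly data: all Januaries, all Februaries, ...).
    season_means = [
        sum(values[i::period]) / len(values[i::period]) for i in range(period)
    ]
    return [v - season_means[i % period] for i, v in enumerate(values)]

# Made-up data: year 2 repeats the pattern of year 1, shifted up by 2.
year1 = [10, 12, 15, 20, 26, 30, 33, 32, 27, 20, 14, 11]
year2 = [v + 2 for v in year1]

adjusted = seasonal_adjust(year1 + year2)
print(adjusted)
```

After the adjustment the monthly pattern is gone and only the year-to-year shift remains.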

what model uses p, q in time series?

ARMA

in AR what is Y?

a) Yt is a linear combination of its last p values
b) Yt is a linear combination of its last q values

a) Yt is a linear combination of its last p values

in MA what is Y?

a) Yt is a constant value plus the effects of a dampened white noise process over the last p time values (lags)
b) Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)

b) Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)
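
The two definitions above can be written out directly. This is a sketch only: the coefficients and noise values are invented for illustration, and in practice they would be estimated from data.

```python
def ar_predict(history, phis, constant=0.0):
    """AR(p): constant plus a linear combination of the last p values.

    phis[0] multiplies the most recent value, phis[1] the one before, etc.
    """
    p = len(phis)
    recent = history[-p:][::-1] if p else []
    return constant + sum(phi * y for phi, y in zip(phis, recent))

def ma_value(noise, thetas, constant=0.0):
    """MA(q): constant plus current noise plus weighted noise from the last q lags.

    noise[-1] is the current noise term; thetas[0] weights the previous one.
    """
    q = len(thetas)
    past = noise[-q - 1:-1][::-1]
    return constant + noise[-1] + sum(th * e for th, e in zip(thetas, past))

# AR(2) with hypothetical coefficients: 0.5 * 4.0 + 0.25 * 2.0
ar_next = ar_predict([1.0, 2.0, 4.0], phis=[0.5, 0.25])

# MA(2) with hypothetical coefficients: 10 + 1.0 + 0.5 * (-0.4) + 0.5 * 0.2
ma_now = ma_value([0.2, -0.4, 1.0], thetas=[0.5, 0.5], constant=10.0)
print(ar_next, ma_now)
```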

What is the d in ARIMA (p,d,q)?

differencing term

what does ARIMA stand for?

autoregressive integrated moving average

what does p mean in time series (ARMA, ARIMA)?

number of autoregressive terms

what does d mean in time series (ARMA, ARIMA)?

the number of differences

what does q mean in time series (ARMA, ARIMA)?

the number of moving average terms

in time series, what does ACF mean?

autocorrelation function

what is ACF?

provides indication of the stationarity of the data
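
A minimal sketch of the sample autocorrelation at a given lag, using one common definition (lagged covariance divided by the lag-0 sum of squares). The alternating series is made up to give an easily checked result.

```python
def acf(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    y_bar = sum(series) / n
    denominator = sum((y - y_bar) ** 2 for y in series)
    numerator = sum(
        (series[t] - y_bar) * (series[t - lag] - y_bar) for t in range(lag, n)
    )
    return numerator / denominator

series = [1, -1, 1, -1, 1, -1, 1, -1]   # flips sign at every step

lag0 = acf(series, 0)   # always 1: a series is perfectly correlated with itself
lag1 = acf(series, 1)   # strongly negative: neighbours have opposite signs
lag2 = acf(series, 2)   # positive: values two steps apart match
print(lag0, lag1, lag2)
```

An ACF that decays very slowly as the lag grows is a common sign that the series is not stationary.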

in time series, what does PACF mean?

partial autocorrelation function

what is PACF?

autocorrelation calculated after removing the linear dependence of the previous terms

what is text analysis?

processing of text

why is text analysis high-dimensional?

every word is a dimension

what are the three problem solving tasks in text analysis?

parsing | search/retrieval | text-mining

what is parsing in text analysis?

imposing structure

what is search/retrieval in text analysis?

searching for word or phrase

what is a corpus?

a large collection of texts (documents) used for analysis

what is text-mining in text analysis?

understanding the content

what is regex (regular expressions) in text analysis?

used for finding words, strings or patterns in text
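
A short sketch of pattern matching with Python's standard `re` module. The sample text and both patterns are made up for this example, and the e-mail pattern is deliberately simple rather than a full address validator.

```python
import re

text = "Contact sales@example.com or support@example.org before 2024-05-01."

# Simplified e-mail pattern: name characters, "@", domain, ".", suffix.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Dates written as four digits, two digits, two digits.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)
print(dates)
```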

what is bag of words in text analysis?

represents a document as an unordered set of its words with a count for each word, the term frequency (tf)
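
A bag of words can be sketched with a counter over lower-cased tokens. The sentence and the very simple tokenising pattern are assumptions made for this example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring case and word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

bag = bag_of_words("The cat sat on the mat. The mat was flat.")
print(bag.most_common(3))
```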

what is reverse index in text analysis?

a list of all the documents that contain that feature
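
The reverse index (more commonly called an inverted index) can be sketched as a mapping from each word to the set of documents containing it. The three tiny documents and their ids are made up for this example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

documents = {
    "d1": "k-means clustering",
    "d2": "apriori association rules",
    "d3": "clustering rules",
}

index = build_index(documents)
print(sorted(index["clustering"]))
print(sorted(index["rules"]))
```

Looking a word up in the index is much cheaper than scanning every document, which is why search and retrieval systems build one.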

what is IDF in text analysis?

inverse document frequency

what are the metrics in text analysis that determine the quality of results?

a) recall, relevance, confidence
b) relevance, precision, recall
c) relevance, lift, recall

b) relevance, precision, recall
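
Precision and recall can be computed from two sets: what a search returned and what was actually relevant. A minimal sketch with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```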

what does IDF do?

measures the uniqueness of a term in the corpus

what does tf-idf mean?

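term frequency multiplied by inverse document frequency; a measure of how relevant a term is to a document within the corpus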
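
A minimal TF-IDF sketch, assuming one common variant of the formula: raw term count multiplied by log(N / document frequency). Other variants add smoothing or normalisation. The three short documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(documents)
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

documents = ["the cat sat", "the dog sat", "the bird flew"]
weights = tf_idf(documents)
print(weights[0])
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("cat") gets the highest weight there.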

(110 cards)