Final Flashcards

1
Q

What is probability?

A

How likely it is that an event will occur, expressed as a number between 0 and 1

2
Q

What is conditional probability?

A

The probability that A occurs given that B has already occurred, written P(A | B)
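As a small illustration of the definition, the sketch below computes P(A | B) = P(A and B) / P(B) by counting equally likely outcomes. The function name `conditional_probability` and the die example are assumptions made for illustration; they are not part of the original card.

```python
from fractions import Fraction

def conditional_probability(outcomes, event_a, event_b):
    """P(A | B) = P(A and B) / P(B), assuming equally likely outcomes."""
    in_b = [o for o in outcomes if event_b(o)]
    if not in_b:
        raise ValueError("P(B) is zero, so P(A | B) is undefined")
    in_a_and_b = [o for o in in_b if event_a(o)]
    return Fraction(len(in_a_and_b), len(in_b))

# Hypothetical example: one roll of a fair six-sided die.
# A = "the roll is even", B = "the roll is greater than 3".
die = range(1, 7)
p = conditional_probability(die, lambda x: x % 2 == 0, lambda x: x > 3)
print(p)  # 2/3, because 4 and 6 are the even outcomes among 4, 5, 6
```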

3
Q

What is an unsupervised technique?

A

A method that finds patterns, groupings, or relationships among data points without using a labeled target attribute

4
Q

What is support?

A

How frequently an item (or itemset) occurs in the dataset: the fraction of transactions that contain it

5
Q

What is confidence?

A

How often a rule is found to be true: of the transactions that contain A, the fraction that also contain B

6
Q

How do support and confidence thresholds work?

A

- Select minimum acceptable values for support and confidence
- Find association rules with support and confidence above the chosen thresholds
- Items with high support are called frequent
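The thresholding step can be sketched as follows. The basket data, the threshold values, and the helper names `support` and `confidence` are all made up for illustration; this is a brute-force sketch over item pairs, not an efficient rule miner.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical market-basket data (each set is one transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Assumed thresholds, chosen only for this example.
MIN_SUPPORT = Fraction(3, 5)
MIN_CONFIDENCE = Fraction(3, 4)

items = sorted(set().union(*transactions))
# Keep single-item rules "a -> b" that clear both thresholds.
rules = [
    (a, b)
    for a, b in permutations(items, 2)
    if support({a, b}) >= MIN_SUPPORT and confidence({a}, {b}) >= MIN_CONFIDENCE
]
for a, b in rules:
    print(f"{a} -> {b}: support={support({a, b})}, confidence={confidence({a}, {b})}")
```

Raising either threshold shrinks the rule list; lowering them admits rarer or less reliable rules.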
7
Q

Why should you use association rules?

A

- Simple data model
- Understandable and actionable rules

8
Q

What is the Apriori technique?

A

- Reduces the number of calculations needed to find frequent bundles (itemsets)
- If a bundle is frequent, then all of its subsets are frequent
- If a bundle is infrequent, then all of its supersets are infrequent
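The pruning idea can be sketched as a level-wise search: candidates of size k are built only from frequent bundles of size k-1, and any candidate with an infrequent subset is dropped before it is counted. The function name `frequent_itemsets`, the basket data, and the minimum count are assumptions for illustration; this is a minimal sketch, not a tuned implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least min_count transactions."""
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count each surviving candidate against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: a candidate with any infrequent (k-1)-subset cannot be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = frequent_itemsets(baskets, min_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this data, {bread, diapers, beer} is never counted because its subset {bread, beer} is already known to be infrequent.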
9
Q

What is lift?

A

- Confidence / expected confidence
- The ratio of the actual probability that a transaction contains both items A and B to the probability that it would contain both if A and B were independent
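Written as a formula, lift(A, B) = P(A and B) / (P(A) × P(B)). The sketch below computes it from transaction counts; the function name `lift` and the basket data are assumptions for illustration.

```python
from fractions import Fraction

def lift(transactions, a, b):
    """Observed P(a and b) divided by P(a) * P(b), the value expected under independence."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    if count_a == 0 or count_b == 0:
        raise ValueError("lift is undefined when an item never occurs")
    return Fraction(count_ab, n) / (Fraction(count_a, n) * Fraction(count_b, n))

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
print(lift(baskets, "diapers", "beer"))  # 5/4: bought together more often than independence predicts
print(lift(baskets, "bread", "milk"))    # 15/16: slightly less often than independence predicts
```

A lift above 1 suggests the items occur together more often than chance, a lift of 1 suggests independence, and a lift below 1 suggests they occur together less often than chance.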

10
Q

What is a supervised method?

A

A way to describe the relationship between input attributes and a target attribute

11
Q

What is regression?

A

Estimating the relationship between a dependent variable and one or more independent variables

12
Q

What is correlation?

A

The strength of the linear relationship between two variables

13
Q

What are some output types for data mining techniques?

A

- Regression
- Classification
- Ordinal
14
Q

What is a regression analysis?

A

An analysis that predicts a numerical output (a value within a numerical range)

15
Q

What is a classification analysis?

A

An analysis that predicts a factor (categorical) or binary output, such as yes or no

16
Q

What is an ordinal technique?

A

Classification where the output categories have a natural order

17
Q

What technique would you use for grouping things by similarity?

A

clustering

18
Q

What technique is used to determine the relationship between input and output variables?

A

regression

19
Q

What technique would you use to assign labels to data based on characteristics?

A

Classification

20
Q

What technique would you use to determine if there was a relationship between variables in the data?

A

association rules

21
Q

What technique would you use to find structure in a temporal data set?

A

time series

22
Q

What is a parametric model?

A

A model that makes an assumption about the form (shape) of the function underlying the data and then estimates the parameters of that function

23
Q

What is a non-parametric model?

A

A model that does not make an explicit assumption about the form of the function

24
Q

What is model stability?

A

The process of finding a model that gives accurate predictions for the whole population and not just for individual samples

25
Q

What is overfitting?

A

A modeling error where the model fits the training data set too closely, so it performs poorly on new data

26
Q

What is cross validation?

A

Assessing how well a model's results will generalize to an independent data set, by training on one part of the data and testing on another

27
Q

What are posterior probabilities?

A

The statistical probability that a hypothesis is true, calculated in the light of relevant observations
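One common way to calculate a posterior is Bayes' rule, P(H | E) = P(E | H) P(H) / P(E). The sketch below applies it to a screening test; the function name `posterior` and all of the numbers are hypothetical, chosen only to show the calculation.

```python
from fractions import Fraction

def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: probability of hypothesis H after observing evidence E."""
    # Total probability of the evidence, whether or not H is true.
    p_evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / p_evidence

# Hypothetical numbers: 1% of people have a condition, the test flags
# 90% of those who have it and 5% of those who do not.
p = posterior(
    prior=Fraction(1, 100),
    p_evidence_given_h=Fraction(9, 10),
    p_evidence_given_not_h=Fraction(1, 20),
)
print(p)  # 2/13, about 15%: a positive result is far from conclusive here
```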

28
Q

What is sensitivity?

A

The true positive rate: the proportion of positives that are correctly identified

29
Q

What is specificity?

A

The true negative rate: the proportion of negatives that are correctly identified as such
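Both rates can be computed from the four confusion-matrix counts: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below assumes labels coded as 1 (positive) and 0 (negative); the function name and the example labels are made up for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 = positive, 0 = negative.

    Assumes y_true contains at least one positive and one negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.75, specificity=0.67
```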

30
Q

What is discriminant analysis?

A

A technique used to separate groups (classes) from each other

31
Q

What are decision trees?

A

- Allow you to develop classification systems that predict or classify current and future observations based on a set of decision rules
- Divide a large collection of records into successively smaller sets of records by applying binary rules

32
Q

What are the benefits of decision trees?

A

- The input data can be continuous or discrete
- No underlying assumption is needed about the relationship between the independent and dependent variables
- Suited for both classification and regression
- Easy to interpret
33
Q

Why perform cluster analysis?

A

To find patterns (natural groupings) in data

34
Q

What are challenges with cluster analysis?

A

- How do we define similar?
- How do we handle outliers?

35
Q

How do we define similarity?

A

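With a distance measure that satisfies properties such as:

- Symmetry: the distance from a to b equals the distance from b to a
- Triangle inequality: the distance from a to c is at most the distance from a to b plus the distance from b to c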

36
Q

What is Euclidean distance?

A

The straight-line distance between two points; in clustering, typically the distance between a centroid and an individual data point
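For two points p and q with the same number of coordinates, the Euclidean distance is the square root of the sum of squared coordinate differences. A minimal sketch follows; the function name and the sample points are assumptions for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical centroid and data point in two dimensions.
centroid = (0.0, 0.0)
point = (3.0, 4.0)
print(euclidean_distance(centroid, point))  # 5.0
```

The measure is symmetric: swapping the two arguments gives the same distance.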

37
Q

What is hierarchical clustering?

A

Builds a hierarchy of clusters, then determines the final clusters based on some arbitrary maximum distance a cluster object can be from another cluster object

38
Q

What is centroid-based clustering?

A

Each data point is assigned to the cluster whose centroid (center point) is nearest

39
Q

What is confidence?

A

how certain you are that your results are accurate

40
Q

What is lift?

A

How well the model is performing compared with a baseline such as random selection

41
Q

What is inference vs. prediction?

A

- Inference is used when we want to understand the relationships between variables
- Prediction is used when we want to estimate the outcome for new observations

42
Q

CRISP-DM cycle

A

- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
43
Q

Which of the following metrics measures a model’s ability to correctly identify positive values (select all that apply).

A

- Sensitivity
- Recall
- True positive rate
44
Q

What is a rule about association rules?

A

D. A large confidence in an association rule will typically result in a higher lift when support is low

45
Q

Which of the following are true of Parametric Models? Select all that apply.

A

A. Inferences can usually be made from a smaller number of predictors than with non-parametric models

B. They are often simpler than non-parametric models

D. They are usually less prone to overfitting than non-parametric models

46
Q

Describe the Hold-Out approach to Cross Validation.

Why is it performed, and why is it necessary?

A

You randomly select part of the data to use as a test set and keep the rest as a training set. Once you have trained the model on the training set, you validate it on the test set. To cross-validate, you repeat this with different subsets serving as the training and test sets. It is performed to estimate the model's accuracy on data it has not seen, which tells you how well the model will generalize to future observations.
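The split itself can be sketched with the standard library alone. The function name `hold_out_split`, the 20% test fraction, and the fixed seed are assumptions for illustration; model training and scoring are left out because they depend on the technique being evaluated.

```python
import random

def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, test) with about test_fraction held out."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

# Hypothetical data set of 10 records, identified by index.
train, test = hold_out_split(range(10))
print(len(train), len(test))  # 8 2

# Repeating the split with different seeds gives different train/test subsets,
# which is the "repeat with different subsets" step described above.
splits = [hold_out_split(range(10), seed=s) for s in range(3)]
```

Every record lands in exactly one of the two sets, so the model is never scored on data it was trained on.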