Final Flashcards
What is probability?
How likely it is that an event will occur
What is conditional probability?
The probability that A occurs given that B has already occurred
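As a minimal sketch, the definition can be written as P(A | B) = P(A and B) / P(B). The counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical counts, invented purely for illustration.
total = 100          # all observed transactions
count_b = 40         # transactions in which B occurred
count_a_and_b = 10   # transactions in which both A and B occurred

p_b = count_b / total
p_a_and_b = count_a_and_b / total

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)
```

Only the transactions where B occurred matter for the denominator, which is why the result differs from the unconditional P(A and B).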
What is an unsupervised technique?
Finds relationships or groupings among data points without using a labeled target attribute
What is support?
How frequently the item (or set of items) occurs in the dataset
What is confidence?
How often a rule is found to be true
How do support and confidence thresholds work?
“-Select minimum acceptable values for support and confidence
- find association rules with support and confidence above chosen thresholds
- items with high support are called frequent”
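The thresholding steps above can be sketched as follows. The basket data, the helper names `support` and `confidence`, and the threshold values are all assumptions made up for illustration, not part of any particular library.

```python
# Toy transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical minimum acceptable values.
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.7

# Evaluate the rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
keep_rule = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, keep_rule)
```

A rule is kept only when it clears both thresholds; a rule with high confidence but very low support would still be discarded.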
Why should you use association rules?
“-simple data model
-understandable and actionable rules”
What is the Apriori technique?
“-reduces number of calculations
- If a bundle is frequent then all of its subsets are frequent
- if a bundle is infrequent then all of the supersets are infrequent”
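A rough sketch of how the subset/superset property cuts down the number of calculations: candidates of size k are only built from frequent bundles of size k-1, and any candidate with an infrequent subset is dropped before its support is counted. The function name `frequent_itemsets` and the basket data are hypothetical, and this is a simplified illustration rather than a tuned implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return every itemset whose support is at least min_support."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    singles = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in singles if supp(frozenset([i])) >= min_support]
    frequent = list(level)
    k = 2
    while level:
        previous = set(level)
        # Build size-k candidates from unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: skip any candidate that has an infrequent subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Only the surviving candidates need their support counted.
        level = [c for c in candidates if supp(c) >= min_support]
        frequent.extend(level)
        k += 1
    return frequent

# Toy baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
result = frequent_itemsets(baskets, min_support=0.4)
print(result)
```

In this toy data {milk, butter} is infrequent, so the three-item bundle {bread, milk, butter} is pruned without ever counting its support.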
What is lift?
“-confidence / expected confidence
-the ratio of the actual probability that a transaction contains both item A and item B to the probability that would be expected if A and B were independent”
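A small sketch of the ratio, assuming the supports are already known. The helper name `lift` and the numbers are made up for illustration.

```python
def lift(support_a, support_b, support_ab):
    """Observed joint support divided by the support expected under independence."""
    expected_if_independent = support_a * support_b
    return support_ab / expected_if_independent

# Hypothetical supports: A in 80% of transactions, B in 80%, both in 60%.
rule_lift = lift(support_a=0.8, support_b=0.8, support_ab=0.6)

# If A and B were truly independent, the lift would be 1.
independent_lift = lift(support_a=0.5, support_b=0.5, support_ab=0.25)
print(rule_lift, independent_lift)
```

A lift above 1 suggests the items appear together more often than independence would predict; below 1 suggests they appear together less often.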
What is a supervised method?
A way to describe the relationship between input attributes and a target attribute
What is regression?
estimating the relationship between variables
What is correlation?
The strength of the linear relationship between two variables
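One common way to quantify this is the Pearson correlation coefficient, sketched below in plain Python. The function name `pearson_r` and the sample values are assumptions for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: one perfectly increasing and one perfectly decreasing line.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.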
What are some output types for data mining techniques?
“-regression
- classification
- ordinal”
What is a regression analysis?
predicts an output that falls in a numerical range
What is a classification analysis?
factor or binary output like yes or no
What is an ordinal technique?
classification with ordered output categories
What technique would you use for grouping things by similarity?
clustering
What technique is used to determine the relationship between input and output variables?
regression
What technique would you use to assign labels to data based on characteristics?
Classification
What technique would you use to determine if there was a relationship between variables in the data?
association rules
What technique would you use to find structure in a temporal data set?
time series
What is a parametric model?
makes an assumption about the form or shape of the function that describes the data and then estimates the parameters of that function
What is a non-parametric model?
does not make an explicit assumption about the form of the function
What is model stability?
process of finding a model that gives accurate predictions for the whole population and not just individual samples
What is overfitting?
model error where the results fit the training data set too closely
What is cross validation?
assessing how well a model's results will generalize to an independent data set
What are posterior probabilities?
The statistical probability that a hypothesis is true, calculated in the light of relevant observations
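Posterior probabilities are typically computed with Bayes' rule. The sketch below assumes a hypothetical prior and hypothetical likelihoods; the function name `posterior` is made up for illustration.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    # Total probability of seeing the evidence, whether or not H is true.
    p_evidence = (
        p_evidence_given_h * prior
        + p_evidence_given_not_h * (1 - prior)
    )
    return p_evidence_given_h * prior / p_evidence

# Hypothetical numbers: the hypothesis is true 1% of the time, the evidence
# shows up 90% of the time when it is true and 5% of the time when it is not.
p = posterior(prior=0.01, p_evidence_given_h=0.9, p_evidence_given_not_h=0.05)
print(p)
```

With these made-up numbers the posterior stays well below 50% even though the evidence is fairly reliable, because the prior is so small.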
What is sensitivity?
The true positive rate. The proportion of positives that are correctly identified.
What is specificity?
The true negative rate. The proportion of negatives that are correctly identified as such.
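Both rates can be read off a confusion matrix. A minimal sketch, with hypothetical counts and helper names:

```python
def sensitivity(tp, fn):
    """True positive rate: correctly identified positives / all actual positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly identified negatives / all actual negatives."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts.
tp, fn = 40, 10   # actual positives: 40 caught, 10 missed
tn, fp = 45, 5    # actual negatives: 45 cleared, 5 falsely flagged

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(sens, spec)
```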
What is discriminant analysis?
Used to separate groups from each other
What are decision trees?
“Allow you to develop classification systems to predict or classify current and future observations based on a set of decision rules
divide up a large collection of records into successively smaller sets of records by applying binary rules”
What are the benefits of decision trees?
“-the input data can be continuous or discrete
- no underlying assumption about the relationship between independent and dependent variables
- suited for classification and regression
- easy to interpret”
Why perform cluster analysis?
find patterns in data
What are challenges with cluster analysis?
“-how do we define similar?
-how do we handle outliers?”
How do we define similarity?
“-symmetry
-triangle inequality”
What is Euclidean distance?
straight-line distance between two points, for example between a centroid and an individual data point
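A short sketch of the calculation: the square root of the summed squared differences along each coordinate. The helper name `euclidean` and the sample points are assumptions for illustration; Python's standard `math.dist` computes the same quantity.

```python
from math import dist, sqrt

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical centroid and data point.
centroid = (0.0, 0.0)
point = (3.0, 4.0)

d = euclidean(centroid, point)
# The standard library gives the same answer.
d_builtin = dist(centroid, point)
print(d, d_builtin)
```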
What is hierarchical clustering?
determine clusters based on some arbitrary maximum distance a cluster object can be from another cluster object
What is centroid-based clustering?
each data point belongs to the cluster whose centroid (center point) is nearest
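A rough sketch of one round of a k-means-style procedure, which is one common form of centroid-based clustering: assign each point to its nearest centroid, then move each centroid to the mean of its members. The function names, points, and starting centroids are all hypothetical.

```python
from math import dist

def assign_to_nearest(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster simply keeps its old centroid.
            updated.append(old)
            continue
        updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return updated

# Made-up 2-D points and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels = assign_to_nearest(points, centroids)
centroids = update_centroids(points, labels, centroids)
print(labels, centroids)
```

In practice the two steps are repeated until the assignments stop changing.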
What is confidence?
how certain you are that your results are accurate
What is lift?
how well the model is performing compared with what would be expected without the model
What is inference vs prediction?
“-inference is used when we want to understand relationships between variables
-prediction is used when we want to estimate the output for new observations”
CRISP-DM cycle
“-Business Understanding
-Data Understanding
-Data Preparation
-Modeling
-Evaluation
-Deployment”
Which of the following metrics measure a model’s ability to correctly identify positive values? Select all that apply.
“-sensitivity
- recall
- true positive rate”
What is a rule about association rules?
D. A large confidence in an association rule will typically result in a higher lift when support is low
Which of the following are true of Parametric Models? Select all that apply.
“A. Inferences can usually be made from a smaller number of predictors than with non-parametric models
B. They are often simpler than non-parametric models
D. They are usually less prone to overfitting than non-parametric models”
Describe the Hold-Out approach to Cross Validation.
Why is it performed / why is it necessary?
You randomly select part of the data to use as a test set and keep the remaining subset for training. Once you train the model, you validate it with the test set. You cross-validate by repeatedly taking subsets to become training sets and test sets. It is performed to estimate the model's accuracy and shows how well the model will generalize to future observations.
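The random split described above can be sketched as follows. The function name `hold_out_split`, the 25% test fraction, and the seed are assumptions for illustration; model training and scoring are left out.

```python
import random

def hold_out_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = rows[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test

# Hypothetical data set of 8 row identifiers.
rows = list(range(8))
train, test = hold_out_split(rows)

# Repeating the split with different seeds gives different train/test subsets.
train2, test2 = hold_out_split(rows, seed=1)
print(train, test)
```

Because no row appears in both subsets, the test score reflects data the model never saw during training.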