Final Exam Flashcards
The classification tree algorithm:
- estimates how likely a data point is to be a member of one group or another depending on which group the data points nearest to it belong to
- uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome, showing them as separate branches of the tree
- predicts the probability that an instance is a member of a certain class, basing the technique on Bayes' theorem
- utilizes an equation based on ordinary least squares regression that can predict the probabilities of the possible categorical outcomes
2
The naive Bayes classification algorithm:
- estimates how likely a data point is to be a member of one group or another depending on which group the data points nearest to it belong to
- uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome, showing them as separate branches of the tree
- predicts the probability that an instance is a member of a certain class, basing the technique on Bayes' theorem
- utilizes an equation based on ordinary least squares regression that can predict the probabilities of the possible categorical outcomes
3
The KNN classification algorithm:
- estimates how likely a data point is to be a member of one group or another depending on which group the data points nearest to it belong to
- uses a tree-like structure to illustrate the choices available for each possible decision and its estimated outcome, showing them as separate branches of the tree
- predicts the probability that an instance is a member of a certain class, basing the technique on Bayes' theorem
- utilizes an equation based on ordinary least squares regression that can predict the probabilities of the possible categorical outcomes
1
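A minimal sketch of the KNN idea, assuming scikit-learn and toy data invented for illustration:

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two numeric attributes, two classes (0 and 1).
X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]]
y_train = [0, 0, 1, 1]

# k = 3: class membership is decided by the 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.1, 1.0]]))        # nearest neighbors are mostly class 0
print(knn.predict_proba([[1.1, 1.0]]))  # estimated membership probabilities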
Classification algorithms that do not use assumptions about the structure of the data are ___ algorithms.
data-driven
A good use of a classification algorithm would be:
- estimating the net profit for dishwashers for a major manufacturer
- identifying the seasonal sales for wood stoves over the last 3 years
- forecasting sales for a new product
- upselling or cross-selling to customers through an online store when a customer makes a purchase
4
In a CART model, classification rules are extracted from:
the decision tree
The KNN technique is what type of technique?
a classification technique
In setting up the KNN model:
- the user allows XLMiner to select the optimal value of k
- the optimal k is set by the user at 10
- the data is normalized in order to take into account the categorical variables
- it is necessary to set an optimal value for k
1
Below are the 8 actual values of the target variable in the training partition:
(0, 0, 0, 1, 1, 1, 1, 1)
What is the entropy of the target variable?
- -(5/8) log2(5/8) - (3/8) log2(3/8)
- (5/8) log2(5/8) - (3/8) log2(3/8)
- -(3/8) log2(3/8) + (5/8) log2(3/8)
- -(5/8) log2(3/8) + log2(5/8)
1
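A quick check of the correct option (option 1), sketched in Python using the card's own target vector:

from math import log2

values = (0, 0, 0, 1, 1, 1, 1, 1)
p1 = sum(values) / len(values)           # 5/8
p0 = 1 - p1                              # 3/8
entropy = -p1 * log2(p1) - p0 * log2(p0)
print(round(entropy, 4))                 # ~0.9544 bits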
Classification problems are distinguished from estimation problems in that:
- classification problems require the output attribute to be numerical
- classification problems require the output attribute to be categorical
- classification problems do not allow an output attribute
- classification problems are designed to predict future outcomes
2
Which statement is true about the decision tree attribute selection process?
- a categorical attribute may appear in a tree node several times, but a numeric attribute may appear at most once
- a numeric attribute may appear in several tree nodes, but a categorical attribute may appear at most once
- both numeric and categorical attributes may appear in several tree nodes
- numeric and categorical attributes may appear in at most one tree node
2
What is the ensemble enhancement that is a method of creating pseudo-data from the data in an original data set? (partitioning, overfitting, sampling, bagging)
bagging
What is the ensemble enhancement that is an iterative technique that adjusts the weight of any record based upon the last classification? (bootstrapping, boosting, sampling, bagging)
boosting
What is the most often used ensemble enhancement?
bagging
What are the 3 most popular methods for creating ensembles?
- sampling, summarizing, random forest
- bagging, boosting, random forest
- bagging, boosting, clustering
- overfitting, clustering, sampling
2
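A minimal sketch of the three popular ensemble methods, assuming scikit-learn and a synthetic dataset invented for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

X, y = make_classification(n_samples=200, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)      # resampled records
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)    # reweighted records
forest = RandomForestClassifier(n_estimators=50, random_state=0)  # bagging plus sampled attributes

for model in (bagging, boosting, forest):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))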
What is one benefit of using an ensemble model?
- it better establishes the relationship between one dependent variable and multiple independent variables
- it strengthens the relationship between the multiple independent variables
- it reduces the number of errors that result
- it is more efficient at adding and removing predictors
3
What is the most common use of clustering algorithms?
- to minimize variance and bias error
- to segment customers
- to determine how effectively the model can reorder the data set
- to validate the data set
2
In a logit model, p/(1 - p) represents:
the odds of success
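A one-line illustration of the odds, with an assumed probability of .75:

p = 0.75
odds = p / (1 - p)
print(odds)  # 3.0 -> success is three times as likely as failure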
In a naive Bayes model it is necessary:
- that all attributes are categorical
- to partition the data into 3 parts (training, validation, scoring)
- to set cutoff values to less than .75
- to have a continuous target variable
1 (e.g., gender, blood type); the attributes can never be continuous variables
Generally, an ensemble method works better if the individual base models have _____.
(Assume each individual base model has accuracy greater than 50%.)
- less correlation among predictors
- high correlation among predictors
- correlation does not have any impact on ensemble output
- none of the above
1
A dendrogram is used with which analytics algorithm? (text mining, clustering, ensemble models, all of the above)
clustering
What is a bootstrap?
- a procedure that allows data scientists to reduce the dimensions of the training data set
- one of many classification-type algorithms
- a procedure for aggregating many attributes into a few attributes
- a procedure based on repeatedly and systematically sampling with replacement from the data
4
What is clustering?
- an ensemble algorithm for improving the accuracy of classification models
- could be thought of as a set of nested algorithms whose purpose is to choose weak learners
- the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another
- none of the above
3
Which of the following is not a type of clustering?
- k-means
- hierarchical
- agglomerative
- splitting
4
A major part of text mining is to:
- reduce the dimensions of the data
- generalize the use of modifiers
- screen the articles from the data set
- reduce the word count of the text actually used
1
Semantic processing seeks to:
- extract meaning
- group individual terms into bins
- eliminate "extra" or unnecessary terms from an analysis
- uncover undefined words or terms in a set of textual data
1
What is the process of extracting token words from a block of text after performing cleanup procedures?
tokenization
What would normalized text look like?
- all duplicate words are removed
- all stop words are removed
- all spelling errors are corrected
- all text is converted to lower case
4
What would the result be if you were asked to apply stemming to these terms: agreed, agrees, agreeable, agreeing?
all terms would change to agree
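A toy suffix-stripping stemmer for these four terms (illustrative only; real systems use algorithms such as the Porter stemmer, and the suffix list here is an assumption chosen for this example):

def stem(word):
    # Strip the longest matching suffix; a real stemmer has many more rules.
    for suffix in ("able", "ing", "s", "d"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

for term in ("agreed", "agrees", "agreeable", "agreeing"):
    print(term, "->", stem(term))  # every term maps to the stem "agree"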
What type of standard diagnostic is used for text mining algorithms?
lift charts and confusion matrices
A model that goes beyond a bag of words analysis and assigns and defines consumer sentiment to words would be a ___ model
NLP
Which of the following are other procedures that could be used to reduce the text dimensions to prepare for analysis?
- numbers and items that appear to be monetary values are removed
- words of more than 20 letters in length are removed
- headers and page numbers are removed
- duplicates of all words are removed
1,2,3
What is entity extraction?
identifying a group of words as a single item
The words extracted from a block of text after the cleanup procedures have been performed are:
tokens
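A minimal sketch of cleanup plus tokenization in Python (the stop-word list is a small assumed sample):

import re

STOP_WORDS = {"the", "is", "a", "and", "for", "with", "at"}

def tokenize(text):
    text = text.lower()                  # normalize case
    words = re.findall(r"[a-z]+", text)  # strip punctuation and numbers
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("The erupting geyser is a fun experiment for kids!"))
# ['erupting', 'geyser', 'fun', 'experiment', 'kids']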
Latent semantic indexing:
- uses SVD to identify patterns in the relationships between terms and concepts
- reduces the dimensions of the text by treating all versions of the same (or a very similar) concept identically
- collates the most common words and phrases and identifies them as keywords
- identifies a group of words as a single item
1,3
What is a method for clearing away clutter in raw text documents and extracting useful characteristics to serve as attributes?
dimension reduction
What algorithm takes a large number of words and compresses them into a much smaller number of linear combinations?
SVD
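A minimal sketch of SVD compressing a term-document matrix, assuming NumPy and toy counts invented for illustration:

import numpy as np

# Rows = terms, columns = documents (raw term counts).
A = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                         # keep only the 2 strongest linear combinations
A_reduced = U[:, :k] * s[:k]  # each term as a point in a 2-D "concept" space
print(A_reduced.round(2))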
Which of the following best describe target leakage?
- it is difficult to detect and harder to eliminate
- it is the difference between the expected prediction of a model and the correct value that is targeted
- it allows algorithms to make predictions that are too good to be true
- it is the introduction of information about the text mining target that should not legitimately be available to the algorithm
1,3,4
the process of collecting data from websites is
web scraping
What is the goal of text analytics?
to reduce the dimensions of the ___ text to manageable attributes that can be used in data mining algorithms
unstructured
What is the most difficult decision to make when considering the use of text as data?
- to know the attributes of the data
- the format of the data
- to define the problem you are trying to solve
- the data mining algorithm that will be used
3
What looks at unprocessed text as a collection of words without regard to grammar?
bag of words
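A minimal bag-of-words sketch: word counts only, with grammar and word order ignored (the sentence is an assumed example):

from collections import Counter

text = "the cat sat on the mat and the cat slept"
bag = Counter(text.split())
print(bag.most_common(3))  # [('the', 3), ('cat', 2), ...]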
Which of the following describe stemming?
- the stem would be the token remembered and used in place of all other forms of the word
- stemming reduces words to their stem
- a stem word needs to be a real word, not a made-up word
- it reduces the dimensions of the text by treating all versions of the same concept identically
1,2,4
What uses SVD to identify patterns in the relationships between terms and concepts?
latent semantic indexing
“the erupting guyser with diet coke and memtos is a fun experiment for kids working at home”
What is the final step in dimension reduction?
- correct the spelling
- eliminate the stop words
- identify associations
- perform stemming
1
“the erupting guyser with diet coke and memtos is a fun experiment for kids working at home”
What is the 2nd step in dimension reduction?
- correct the spelling
- eliminate the stop words
- identify associations
- perform stemming
4
What two main problems is the random forest methodology used to address?
- settings where the number of attributes is much smaller than the number of records
- assessing and ranking attributes with respect to their ability to predict the classification
- constructing classification rules for a learning problem
- yielding a classification given the attributes for current observations
2,3
What clustering algorithm is an iterative process using some best criterion in multiple passes to see if it can improve the clusters? (k-means, hierarchical method)
k-means
What are the two basic types of clustering algorithms?
k-means and hierarchical
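A minimal sketch of both types, assuming scikit-learn and toy points invented for illustration:

from sklearn.cluster import AgglomerativeClustering, KMeans

X = [[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.9], [9, 1], [8.8, 1.2]]

# k-means: iterative reassignment of points around k centroids.
print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X))
# Hierarchical (agglomerative): repeatedly merge the closest clusters.
print(AgglomerativeClustering(n_clusters=3).fit_predict(X))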
Which of the following would be correct about the random forest algorithm?
- it is limited to a random subset of attributes at each stage of the algorithm
- it is based on applying bagging to a decision tree algorithm that also samples attributes in addition to records
- it produces more accurate predictions than a simple CART
- it is a collection of one CART tree that is independent when constructed
1,2,3
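A minimal random forest sketch, assuming scikit-learn and synthetic data invented for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,     # many independently grown CART-style trees
    max_features="sqrt",  # random subset of attributes at each split
    random_state=1,
)
forest.fit(X, y)
print(forest.feature_importances_.round(2))  # rank attributes by predictive value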
What is the most common form of unsupervised learning?
clustering
What is a decision stump? It is a decision tree with _____ root nodes
1
Which of the following correctly compare the relationship between a traditional CART model and a random forest?
- overfitting can be a problem with both CART and random forest
- random forest produces more accurate predictions than a simple CART
- random forest samples records like CART, but it also samples the attributes
- CART and random forest have the same number of steps
1,2,3
What is the k-means clustering algorithm?
an iterative process using some best criterion in multiple passes to see if it can improve the clusters
If the correlation coefficient is .983, it indicates that:
there appears to be a strong positive linear association between x and y
What collates common words and identifies them as keywords?
latent semantic indexing
What are the 3 steps in dimension reduction?
- eliminate stop words
- perform stemming
- correct spelling errors
What reduces a phrase to its basic identity?
phrase reduction
what is the process of extracting “token” words from a block of text after performing cleanup?
tokenization
What takes a large number of words and compresses them into a small number of linear combinations?
singular value decomposition (SVD)
error due to the difference between the expected prediction of our model and the correct value we're trying to predict
bias
error due to the variability of a model's prediction for a given point
variance
What ensemble technique adjusts the weight of any given record based upon the last classification?
boosting
What is a method of creating pseudo-data from data in an original set?
bagging; bootstrap
What are the 4 types of classification models?
KNN, classification trees (CART, decision tree, regression tree), naive Bayes, logit
The naive Bayes classification technique could best be described by:
- it is comparable in performance to decision trees
- it is fast and accurate
- it uses statistical classifiers
- it uses the same algorithm as the regression decision tree
1,2,3
How do data mining trees look?
like an upside-down tree, with leaves at the bottom and the root on top
A good classification tree will make the best split first, followed by decision rules that are made up with:
- successively larger and larger numbers of training records
- successively smaller and smaller numbers of training records
2
The most important distinction between a logit and ordinary regression is that the dependent variable is:
categorical, not continuous
classification is ___ learning
supervised
What is a mathematical concept that measures the uncertainty associated with random variables?
information entropy
What does the k in KNN refer to?
- the number of nearest neighbors used in determining a category correctly
- the number of characteristics of the unknown used in determining a category correctly
- the number of unknowns in a category
- the number of attributes nearest the unknown
1
What classification technique predicts numeric quantities?
regression tree (the CART variant for numeric targets)
What are the steps in the data mining process?
Sample, Explore, Modify, Model, Assess
What are the 4 characteristics of data mining?
- volume: large sizes of data, because with analytics we can use unstructured data; the size of the data set
- velocity: the rate at which we expect the data to arrive has to be fast, and algorithms must be fast in order to be useful
- variety: unstructured and structured data, not just data from Excel sheets; the types of data available
- value: a lot of the data in the past was not useful; it is the job of the data scientist to determine what is useful data and what is not; the valuation of data
What is the common form of prediction in data mining?
a classification tool
What is the primary goal of data mining?
- to model the noise in the data
- to overfit the model so there is a low misclassification rate
- to have accuracy and fit as characteristics of the model
- to do a good job of representing our known data set
3,4
Which of the following correctly define a data warehouse?
- a firm's central repository of integrated historical data
- a location where data is stored
- the memory of the firm
- collective information on every aspect of what has happened in the past
1,3,4
Which of the following describe a data mart?
- collective information on every aspect of what has happened in the past for a company
- holds information that is specialized and has been grouped or chosen specifically
- a firm's central repository of integrated historical data
- a subset of a data warehouse
2,4
What does data mining refer to?
- the physical tools used to access data and make predictions
- knowledge gained from mass data
- patterns in a mass of data
- tools that are used in the large-scale or big data arena
4
Which of the following are true regarding data mining and business forecasting?
- in data mining you simultaneously search for different patterns in parallel, but in business forecasting you search for set patterns
- for business forecasting, the expectation is that the data will contain some level of variation, whereas in data mining patterns are not pre-specified
- in data mining you are searching for seasonal variability, but in business forecasting you are searching for trend patterns only
1,2,3
Which of the following correctly contrasts data mining with database management?
- queries are well defined in database management but less structured in data mining
- data mining is more forward looking, whereas database management is more past focused
- a query in database management would be "find all customers in Atlanta"; in data mining it would be "group all customers with similar buying habits"
- database management is extracting useful information from large, unstructured databases, whereas data mining is extracting specialized or grouped data
1,2,3
Classification tools distinguish between:
- data concepts and objects
- data objects and classes
- data classes or fields
- data classes or concepts
4
Data mining terms and their forecasting equivalents:
- target = dependent variable
- algorithm = forecasting model
- feature/attribute = explanatory variable
- record = observation
- score = forecast
What are the reasons for sampling or partitioning in data mining?
- it is common practice in database management to set aside a portion of the data
- partitioning and testing for accuracy are standard practice in analytics
- it has its roots in the "holdout" and "holdback" samples used for standard forecasting models
- in most cases, the entire data set is not needed to build a model
2,3,4
What is the process called of transforming text into numbers?
datafication
A bag-of-words analysis looks at what?
unprocessed text as a collection of words without regard to grammar
The frequency of any single word is inversely proportional to its rank in the frequency table. Which law is this?
Zipf's law
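A quick numeric illustration with hypothetical word counts: if frequency is inversely proportional to rank, then rank times frequency should stay roughly constant:

counts = [120, 62, 39, 31, 24]  # assumed top-5 word frequencies
for rank, freq in enumerate(counts, start=1):
    print(rank, freq, rank * freq)  # the products stay in a similar range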
What is the most common mistake in text mining?
target leakage: the introduction of information that should not be available to the algorithm
In the Universal Bank data in this chapter, only 10% of the records represented customers who had taken out a personal loan (the target variable). If we were to score a new customer based upon the attributes we used in the algorithm, we would be accurate in the prediction about 90% of the time if we always scored the individual as "not accepting a loan," because that is indeed what most customers have done in the past. Why not accept being right 90% of the time with this very simple decision rule?
Because with data mining we have access to the information and tools that can help us do better than predicting correctly 90% of the time. In this scenario, we could look at our lift chart, find the customers with the highest probability of accepting a personal loan, and market to them, in order to have a better chance of finding people who will accept the loan.
Data has the characteristic of being nonrivalrous. Explain its importance.
Nonrivalry is the characteristic that one person's use of the good to create value does not diminish the value another can extract from the data.
This characteristic matters because many researchers and data scientists can use the same data: every time the data set is used, it can be used to obtain different results. Every researcher can use a data set for a different purpose and reach different conclusions.
The lift chart and the confusion matrix are both standard diagnostic tools used to evaluate a data mining algorithm. Don't the two measures display the same information? Explain the difference between the two measures.
Both the confusion matrix and the lift chart provide information about model performance, but they display that information in different ways.
Confusion matrix: shows classification performance. There is a confusion matrix for both the validation data and the training data. Most often, the results from the validation data are most relevant, since they show how the model performed on unseen data: the validation confusion matrix shows classification performance on data that was not used to build the model. It gives counts of the correct classifications and the misclassifications.
Lift chart: the standard for accuracy in data mining. These charts help determine how effectively the model can reorder the data set, placing the individuals with the highest probability of success on top and those with the lowest probability of success on the bottom. By looking at the chart, you can determine how well your model is doing compared to a naive model.
In short: the confusion matrix shows what you got right and wrong; the lift chart shows, for each percentage of the data, how much you got right.
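A minimal confusion matrix sketch, assuming scikit-learn and toy labels invented for illustration:

from sklearn.metrics import confusion_matrix

actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# Rows = actual class, columns = predicted class.
print(confusion_matrix(actual, predicted))
# [[5 1]
#  [1 3]]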
Why do we need to sample the data?
Because we need to see how the model works on the current data set and how it will work in the real world.
The risk in ignoring this step is creating bias: if a data scientist uses the same data to both build and test the model, and that model is overfit, then most likely the results will also be overfit.
Structured vs. unstructured data
Structured data: data that has a predefined data model.
Unstructured data: data that does not have a predefined data model.
Unstructured data is the more prevalent form of data because it comes in many different forms to which we are exposed daily.
An Excel spreadsheet: structured
A thousand text files: unstructured
A thousand video images: unstructured
A thousand audio files: unstructured
Some data mining algorithms work so well that they have a tendency to overfit the data. What does this mean, and what difficulties does overlooking it cause for the data scientist?
Overfitting: when we put too many attributes (or try to account for too many patterns) in a model, including some unrelated to the target.
If data scientists overfit their data, they will incorrectly explain some variation in the data that is nothing more than chance variation. In other words, they will have mislabeled the noise in the data as part of the "true signal."
To find if the coefficients are statistically significant:
|t-stat| > 2 indicates statistical significance
What does R² say?
the model explains ___% of the variation in the data
How do you monitor autocorrelation?
if the Durbin-Watson statistic is between 1.5 and 2.5, you have no first-order serial correlation
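A minimal sketch of the check, assuming statsmodels and residuals invented for illustration:

import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)  # independent residuals for this example

dw = durbin_watson(residuals)
print(round(dw, 2))  # between 1.5 and 2.5 -> no first-order serial correlation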