Lecture 7 - BI and Data Mining 2 Flashcards
Overfitting results in?
Decision trees that are more complex than necessary
Training error no longer provides?
A good estimate of how well the tree will perform on previously unseen records
What can the cross-validation technique be used for?
Comparing the performance of different machine learning models on the same data set
What is leave-one-out?
k-fold cross-validation where k = number of samples, so each sample serves as the test set exactly once
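A minimal sketch of how k-fold and leave-one-out splits relate, assuming contiguous folds over sample indices with no shuffling. The function names `k_fold_splits` and `leave_one_out_splits` are hypothetical, chosen for illustration only.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with k equal to the number of samples."""
    return k_fold_splits(n_samples, n_samples)


loo = list(leave_one_out_splits(4))
two_fold = list(k_fold_splits(5, 2))
```

Each model being compared would be trained on `train` and scored on `test` for every split, and the scores averaged.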
What is bootstrapping?
Random sampling with replacement
Describe bootstrap method?
Refers to random sampling with replacement.
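A minimal sketch of random sampling with replacement, assuming the goal is to see how a statistic (here the mean) varies across resamples. The data values, the seed, and the helper names are illustrative assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


def bootstrap_means(data, n_resamples, seed=0):
    """Return the mean of each of n_resamples bootstrap samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_resamples):
        sample = bootstrap_sample(data, rng)
        means.append(sum(sample) / len(sample))
    return means


data = [2.0, 4.0, 4.0, 5.0, 9.0]  # made-up values for illustration
means = bootstrap_means(data, n_resamples=200)
```

Because sampling is with replacement, a resample can contain some records several times and omit others entirely.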
What is a Neural Network?
An artificial neural network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections.
What is the test stage in NN?
Each unit performs a relatively simple job:
receive input from neighbors or external sources and use this to compute an output signal which is propagated to other units
What is the learning stage in NN?
The task of adjusting the weights
What is a feed-forward network?
A network in which the data flow from input to output units is strictly feed-forward.
The data processing can extend over multiple layers of units, but no feedback connections or connections between the units of the same layer are present.
Perceptron?
A single-layer feed-forward network consisting of one or more output neurons, each of which is connected with a weight factor w to all of the inputs x
How does learning in Perceptrons work?
The weights of the neural networks are modified during the learning phase
Convergence Theorem?
If there exists a set of connection weights w* which is able to perform the transformation y=d(x), the perceptron learning rule will converge to some solution in a finite number of steps for any initial choice of the weights
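A minimal sketch of the perceptron learning rule, assuming a threshold activation, a bias term, and the update w <- w + rate * (d - y) * x. The AND gate is used as training data because it is linearly separable, so a solution w* exists; the function names and the learning rate are illustrative assumptions.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Apply the perceptron rule until an epoch passes with no errors."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)  # d - y
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * xi for w, xi in zip(weights, x)]
                bias += learning_rate * delta
        if errors == 0:
            return weights, bias, epoch + 1
    return weights, bias, max_epochs


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, epochs = train_perceptron(and_gate)
```

On data that is not linearly separable (XOR, for example) the loop would simply run until `max_epochs` without reaching an error-free epoch.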
What is Backpropagation?
Multi-layer networks with a linear activation can classify only linearly separable inputs or, in the case of function approximation, represent only linear functions.
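A small sketch of why stacking layers with a linear activation adds no expressive power: two linear layers collapse into a single one. The weight matrices `W1` and `W2` and the input are made-up integers so the comparison is exact; biases are omitted for brevity.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mat_mul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


W1 = [[1, -2], [3, 0], [0, 1]]  # hidden layer: 2 inputs -> 3 units
W2 = [[2, 1, -1]]               # output layer: 3 units -> 1 output
x = [4, 5]

two_layer_output = mat_vec(W2, mat_vec(W1, x))   # W2 (W1 x)
single_layer_output = mat_vec(mat_mul(W2, W1), x)  # (W2 W1) x
```

Both computations give the same result for every input, which is why a non-linear activation is needed before hidden layers become useful.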
What is SVM?
Support Vector Machine
Properties of SVM?
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets
Ability to handle large feature spaces
Overfitting can be controlled by soft margin approach
SVM applications?
Text categorization
Image classification
Bioinformatics
Hand-written character recognition
Machine learning focuses on?
Prediction, based on known properties learned from the training data
Data mining focuses on?
The discovery of previously unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases
Data mining uses many?
Machine learning methods but often with a slightly different goal in mind
Machine learning also employs?
Data mining methods as unsupervised learning or as a preprocessing step to improve learner accuracy
What is Cluster Analysis used for in Data Mining?
Used for automatic identification of natural groupings of things
Employs unsupervised learning
Learns the clusters of things from past data, then assigns new instances
There is not an output/target variable
What is the k-Means clustering algorithm?
K: pre-determined number of clusters
Step 0: Determine the value of K
Steps of k-means?
Step 1: Randomly generate k points as initial cluster centers
Step 2: Assign each point to the nearest cluster center
Step 3: Re-compute the new cluster centers
What is k-means repetition step?
Repeat steps 2 and 3 until some convergence criterion is met
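The assign-and-recompute loop from the k-means cards can be sketched as follows. This is a minimal version under stated assumptions: initial centers are drawn from the data points themselves (one common way to pick random starting centers), the convergence criterion is "assignments stopped changing", and an empty cluster keeps its old center. The sample points are made up.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (assumption): pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest center.
        new_assignments = [nearest_center(p, centers) for p in points]
        # Convergence criterion (assumption): assignments stopped changing.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignments = k_means(points, k=2)
```

The result depends on the starting centers in general, so in practice the algorithm is often run several times with different seeds.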
What is cluster analysis?
Finding groups of objects such that the objects in a group will be similar to one another and different from the objects in the other groups
Applications of Cluster Analysis?
Understanding
- Group related documents for browsing; group genes and proteins that have similar functionality
Summarization
- Reduce the size of large data sets
What is a clustering?
A set of clusters
What are the types of Clusterings?
Partitional Clustering
Hierarchical Clustering
What is Partitional Clustering?
A division of data objects into non-overlapping subsets such that each data object is in exactly one subset
What is Hierarchical Clustering?
A set of nested clusters organized as a hierarchical tree
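A minimal sketch of how nested clusters can arise, assuming bottom-up (agglomerative) merging with single-linkage distance on one-dimensional values. The cards do not name a specific method, so both choices and the sample values are illustrative assumptions.

```python
def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters = distance between their closest members."""
    return min(abs(a - b) for a in cluster_a for b in cluster_b)


def agglomerate(values):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(v,) for v in values]  # start with every value in its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


values = [1.0, 1.2, 5.0, 5.1, 9.0]
merges = agglomerate(values)
```

Each entry in `merges` contains the earlier clusters it was built from, which is the nesting that the hierarchical tree records.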
What is Association Rule learning?
A rule-based machine learning method for discovering interesting relations between variables in large databases
Challenges of Frequent Itemset mining?
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
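The three challenges above show up directly in a level-wise search. The sketch below assumes an Apriori-style approach (not named in the cards): one database scan per itemset size, candidates generated by joining frequent itemsets, and a support count for every candidate. The baskets and the `min_support` threshold are made up for illustration.

```python
from itertools import combinations


def count_support(transactions, candidates):
    """One full scan of the database: count transactions containing each candidate."""
    counts = {c: 0 for c in candidates}
    for transaction in transactions:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def frequent_itemsets(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = count_support(transactions, candidates)  # one scan per level
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent itemsets into larger candidates, then prune any
        # candidate that has an infrequent subset.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = frequent_itemsets(baskets, min_support=3)
```

Even on five baskets the search needs three scans and tests every candidate against every transaction, which is the cost that grows quickly on real databases.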