Week 5 Flashcards
What does it mean that the miner is often the end user?
Data mining is often carried out by knowledgeable people within the different business units.
What is the output of regression in data mining (a type of prediction)?
A number
What does link analysis try to achieve?
Find patterns in how items relate to each other.
What does the robustness of data mining refer to?
Its ability to overcome noisy data to make somewhat accurate predictions.
What does the accuracy of data mining refer to?
Its ability to predict the outcome of a previously unknown data set accurately.
In estimating the accuracy of data mining (or other) classification models, what is the true positive rate?
The number of correctly classified positives divided by the sum of correctly classified positives and false negatives (positives incorrectly classified as negative).
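As a small illustration of the true positive rate, here is a minimal sketch. The function name `true_positive_rate`, the label encoding (1 = positive, 0 = negative), and the sample labels are all made up for this example.

```python
def true_positive_rate(actual, predicted, positive=1):
    """TPR = TP / (TP + FN), computed from two parallel label lists."""
    pairs = list(zip(actual, predicted))
    # True positives: actual positives that were predicted positive.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False negatives: actual positives that were predicted negative.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no actual positives, so the rate is undefined")
    return tp / (tp + fn)


# Hypothetical labels: four actual positives, three of them found.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
print(true_positive_rate(actual, predicted))  # 0.75
```

Note that the one negative wrongly predicted as positive (a false positive) does not enter the calculation.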
When would the iteration of steps 3 and 4 stop in K-means clustering?
Two answers:
- When the recalculation of center points does not lead to a reassignment of data points anymore.
- When a pre-defined number of iterations have been carried out.
Would K-means always show the same results if we keep k and all other parameters the same?
No, because the initial selection of cluster center points is random.
What is the output variable in Association?
There is no output variable.
What is the Euclidean distance?
Ordinary distance between two points that one would measure with a ruler.
What is the Manhattan distance?
The rectilinear distance, or taxicab distance, between two points.
It is the total travel distance if one can only move along grid lines.
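To compare the two distance measures on the same pair of points, here is a short sketch. The function names and the sample points are chosen only for this example.

```python
import math


def euclidean_distance(p, q):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Travel distance when moving only along grid lines."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```

The Manhattan distance is never smaller than the Euclidean distance for the same two points.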
What can cluster analysis be used for?
Cluster analysis can be used for automatic identification of natural groupings of things.
What kind of learning does cluster analysis use?
Unsupervised learning
What kind of data set does supervised learning use?
A labeled data set
How does clustering work?
It works by learning the clusters of things from past data, then assigning new instances.
What are some of the use cases of clustering?
- Identify natural groupings of customers;
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes;
- Provide characterization, definition, labeling of populations;
- Decrease the size and complexity of problems for other data mining methods;
- Identify outliers in a specific domain.
How is the optimal number of clusters determined?
There is no optimal way to calculate the number of clusters, hence heuristics are often used.
What does K-means clustering mean?
K-means clustering means that there is a pre-determined number of clusters.
What are the steps of K-means clustering?
- Determine the value of k;
- Randomly generate k points as initial cluster centers;
- Assign each point to the nearest cluster center;
- Re-compute the new cluster centers;
- Repeat steps 3 and 4 until some convergence criterion is met.
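The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `k_means`, its parameters, and the sample points are hypothetical, and it assumes the initial centers are drawn from the data points themselves (one common way to randomize step 2) and that Euclidean distance is used. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=None):
    """Cluster a list of coordinate tuples; returns (centers, assignments)."""
    rng = random.Random(seed)
    # Step 2 (assumption): use k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    assignments = None
    # Stop after a pre-defined number of iterations at the latest.
    for _ in range(max_iterations):
        # Step 3: assign each point to the nearest cluster center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(point, centers[c]))
            for point in points
        ]
        # Stop when recomputing the centers no longer reassigns any point.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centers, assignments


# Step 1: choose k. The points below are made up: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignments = k_means(points, k=2, seed=0)
print(assignments)
```

Which cluster gets label 0 and which gets label 1 depends on the random initial centers, so only the grouping is meaningful, not the label values. On less cleanly separated data, different seeds can also produce different groupings.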