Lecture 7: Visual Data Mining Principles Flashcards
What is Data Mining?
Automatic algorithmic extraction of valuable information from raw data
Goal: Find interesting patterns/outliers/trends
What are some common Data Mining tasks?
Anomaly detection
‣ Outlier/change/deviation detection
‣ Example: Fraud detection
Association rule learning
‣ Dependency modeling
‣ Correlations between variables
‣ Example: Find products that are bought together
Clustering
‣ Identify similar groups or structures without using known structures
‣ Example: Find groups of customers that behave similarly (Fayyad et al. 1996)
Classification
‣ Generalize known structure to apply to new data
‣ Example: Spam classification
Regression
‣ Find function which models data with least error
‣ Can be used for prediction
‣ Example: Stock market analysis
Summarization
‣ Find more compact representation
‣ Examples: Report generation
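The regression entry above ("find function which models data with least error") can be illustrated with a minimal sketch of an ordinary least-squares line fit. The function name `fit_line` and the toy numbers are assumptions for illustration only; they are not from the lecture.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input variable.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted line to predict a value for a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
model = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(model, predict(model, 4.0))
```

The fitted line is then reused for prediction, which is the "can be used for prediction" point in the list.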
What is clustering?
Automatically group data instances into classes based on mutual similarity
Unsupervised: clusters are found without labeled examples
What are the two types of Hierarchical Clustering?
‣ Agglomerative clustering: start with each data point as its own cluster and repeatedly merge the closest clusters
‣ Divisive clustering: start with one cluster containing all data points and recursively split it
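As a rough illustration of the agglomerative (bottom-up) variant, here is a minimal sketch on 1D points. It assumes single linkage (distance between the two closest members) and stops at a requested number of clusters. Both choices, and the name `agglomerative`, are assumptions for this example rather than something the lecture specifies.

```python
def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage (assumed): distance of the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data with three visible groups.
print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```

A divisive version would run in the opposite direction, starting from one cluster holding all points and splitting it.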
How does K-Means Clustering work?
The K-Means algorithm for clustering basically consists of 5 steps:
‣ Step 1: Choose the number K of clusters
‣ Step 2: Select K random points as initial cluster centers (centroids)
‣ Step 3: Assign each data point to the nearest centroid
‣ Step 4: Compute and place the new centroid of each cluster
‣ Step 5: Repeat steps 3 and 4 until no observations change cluster
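The steps above can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation. It assumes squared Euclidean distance, initial centroids sampled from the data points, and a hypothetical `max_iter` safety limit. Note that the loop repeats both the assignment and the centroid update.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k random data points as initial centroids (assumed init).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(point, centroids[c])))
            for point in points
        ]
        # Step 5: stop once no observation changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignment


# Hypothetical toy data: two well-separated groups (step 1: K = 2).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Because the starting centroids are random, different seeds can give different results on less clearly separated data.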
Describe Decision Trees
Split the dataset into branches
Task: Predict the outcome of unseen observations
Build the tree iteratively, starting at the root
Need an algorithm to decide which attribute to split on
‣ Goal: Find informative features
‣ Entropy = how mixed the class labels in a node are
‣ Maximize information gain (the reduction in entropy achieved by a split)
Prune tree to avoid overfitting
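To make the entropy and information-gain idea concrete, here is a minimal sketch of how a single split could be chosen. It assumes Shannon entropy in bits and categorical attributes stored in dictionaries. The function names and the toy spam data are hypothetical, and tree building and pruning are left out.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure set, 1 for a 50/50 binary mix."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Pick the attribute whose split maximizes information gain."""
    return max(rows[0], key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: "has_link" separates the classes, "is_long" does not.
rows = [
    {"has_link": True, "is_long": True},
    {"has_link": True, "is_long": False},
    {"has_link": False, "is_long": True},
    {"has_link": False, "is_long": False},
]
labels = ["spam", "spam", "ham", "ham"]
print(best_attribute(rows, labels))
```

A full tree builder would apply this choice recursively to each branch until the nodes are pure or a stopping rule is reached.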
Describe Dimensionality Reduction
Curse of dimensionality arises in many real-world problems
Transform high-dim data to space with fewer dimensions (2D/3D)
Powerful technique to look for hidden structure in high-dimensional data
Used in various domains
‣ Document categorization, drug discovery, machine learning model debugging, etc.
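One common way to do such a transformation is principal component analysis (PCA). The lecture does not name a specific method, so PCA is an assumption here. The sketch below projects data onto a single dimension, finding the first principal component by power iteration on the covariance matrix. The function name, the iteration count, and the toy data are hypothetical, and the simple all-ones starting vector would fail if it happened to be orthogonal to the dominant direction.

```python
def first_principal_component(data, iterations=200):
    """Return the direction of greatest variance and the 1D projection."""
    n, dims = len(data), len(data[0])
    # Center the data so the covariance is computed around the mean.
    means = [sum(row[d] for row in data) / n for d in range(dims)]
    centered = [[row[d] - means[d] for d in range(dims)] for row in data]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration: repeated multiplication converges to the
    # eigenvector with the largest eigenvalue (assumed to be unique).
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project every centered point onto that direction.
    projected = [sum(r[d] * v[d] for d in range(dims)) for r in centered]
    return v, projected


# Hypothetical 3D data that really varies along one direction only.
data = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
direction, projected = first_principal_component(data)
print(direction, projected)
```

The new axis is a mix of the original dimensions, which is why the meaning of any single original dimension is hard to read off the projection.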
What are some disadvantages of Dimensionality Reduction?
‣ Hard to preserve semantics of single dimensions
‣ Hard to understand and interpret
‣ Error is not visible, which can inspire false confidence