Lecture 8 – Grouping data Flashcards
Segmenting data
Sometimes data is segmented because of its context (e.g. different sources)
Sometimes we don’t have pre-determined segments, but we still want segmentation
- e.g. identifying customer segments
What is a segmentation model?
Which one of the following tasks is not a segmentation task?
A. Group all the shopping items available on the web.
B. Identification of areas of similar land use in an earth observation database
C. Weather prediction based on last month’s temperature
(C: predicting a numerical value from past temperatures is a regression/forecasting task, not a grouping of items into segments)
What are regression trees?
- A regression tree is a supervised machine learning algorithm that predicts a continuous-valued response variable by learning decision rules from the predictors (or independent variables)
- Two main steps:
- divide the data into subsets of similar values
- estimate the response within each subset
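The two steps above can be sketched on 1-D data: try every candidate splitting point, keep the one that minimizes the within-subset squared error, then predict each subset's mean (a minimal illustration of one split, not a full tree):

```python
# Minimal sketch of one regression-tree split on 1-D data (toy example).

def best_split(xs, ys):
    """Try every threshold and keep the one that minimizes the
    total within-subset squared error of the response."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if best is None or err < best[1]:
            best = (t, err, left, right)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
t, err, left, right = best_split(xs, ys)
print(t)                       # chosen splitting point (3 here)
print(sum(left) / len(left))   # predicted response left of the split
print(sum(right) / len(right)) # predicted response right of the split
```

A real tree repeats this recursively on each subset until a stopping condition is met.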
What is ANOVA?
How is ANOVA used to split regression trees?
ANOVA = analysis of variance
–> type of statistical test used to determine if there is a statistically significant difference between the means of two or more categorical groups by comparing variances.
–> in regression trees, the candidate split whose resulting subsets best separate the response (large between-group variance, small within-group variance) is chosen.
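As a sketch of the idea, the one-way ANOVA F statistic compares between-group to within-group variance; a large value suggests the group means genuinely differ (a pure-Python illustration, not a library routine):

```python
# Sketch: one-way ANOVA F statistic for a list of groups of observations.

def anova_f(groups):
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # between-group sum of squares: how far group means sit from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

well_separated = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8]]
overlapping = [[1.0, 5.0, 3.0], [1.1, 4.9, 3.1]]
print(anova_f(well_separated))  # large F: means clearly differ
print(anova_f(overlapping))     # small F: means do not
```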
What are pros and cons of regression trees?
Pros:
- easy to understand
- Visualizing the tree can reveal crucial information, such as how decision rules are formed, the importance of different predictors and the effect of the splitting points in the predictors.
- Implicitly performs feature selection, as some of the predictors may not be included in the tree.
- Not sensitive to the presence of missing values and outliers.
- No assumptions about the shape and the distribution of the data.
Cons:
- The fit has a high variance, meaning small changes in the data set can lead to an entirely different tree.
- Overfitting is a problem for tree-based models, but we can adjust the stopping conditions and prune the tree.
- Can be inefficient when performing an exhaustive search for the splitting points of continuous numerical predictors.
- Greedy algorithms cannot guarantee the return of the globally optimal regression tree.
When do you use regression trees and when classification trees?
- Regression tasks relate to determining quantitative numerical variables based on input variables
- Classification tasks are about determining a qualitative value (e.g., category or class) based on the input variables
–> Categorical variables (nominal data, ordinal data)
How do you split classification trees?
Most popular split criteria are Gini and Entropy
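A minimal sketch of the two criteria, computed per node from its class labels; both are 0 for a pure node and maximal for an even class mix (illustration only):

```python
from math import log2

def gini(labels):
    """Gini impurity: probability of mislabelling a random element."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
print(gini(pure), entropy(pure))    # both 0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for an even two-class mix
```

A split is then chosen to maximize the impurity reduction from parent node to children.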
Clustering and segmentation
Explain the use of clustering
Use of clustering:
- Text documents, e.g., patents, legal cases, web pages, questions and feedback ==> topic modelling
- Clients, e.g., recommendation systems
- Fault detection, e.g., fraud, network security
- Missing data
- A clustering task may require a number of different algorithms/approaches.
What are elements of a cluster?
- Are similar in some attributes
- Some attributes may weigh more than others ==> not all attributes are equally important (feature selection)
- May be considered to be close to each other ==> needs distance measurements
What are the two clustering approaches you learned in the lecture?
- k-means algorithm
- Hierarchical
Explain the k-means algorithm
- Randomly select centroids for K clusters
- Assign each data point to its nearest centroid to form the cluster population
- Compute the mean of each cluster and use it as the new centroid
- Re-evaluate populations and centroids until stable/convergence
- Does not work with categorical data and is susceptible to outliers
- Have to predefine a value for K
- No guarantee there are actually clusters to find
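The steps above can be sketched for 1-D data (a toy illustration; real implementations handle multiple dimensions and empty clusters more carefully):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # 1) random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2) assign each point to
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)                 #    its nearest centroid
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]   # 3) cluster means become new centroids
        if new == centroids:                      # 4) stop at convergence
            break
        centroids = new
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centroids = kmeans(data, 2)
print(centroids)  # two centroids, near 1.0 and 9.0
```

Note that K = 2 had to be chosen up front, matching the limitation listed above.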
Explain hierarchical clustering
Clusters within clusters!
* Agglomerative (bottom-up) vs Divisive (Top-down)
- Agglomerative:
- Treat each data point as a centroid in a cluster of population 1
- Form new clusters by merging nearby clusters
- Continue until only one cluster
- Various ways to calculate which clusters should be merged, often looking at (min or max) distances of the cluster population to each other
- The results of hierarchical clustering are usually presented in a dendrogram
- Greedy!
- Can be costly, due to having to calculate a lot of distances for each level of the tree.
- But with no randomness, the same tree will be produced each time.
- Can cut the tree at any level so as to get the population of a certain number of clusters.
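The agglomerative procedure can be sketched on 1-D points using single linkage (minimum distance between cluster members), including a cut at three clusters instead of merging all the way to one (illustration only; library implementations are far more efficient):

```python
# Sketch of agglomerative (bottom-up) clustering with single linkage.

def agglomerate(points, stop_at=1):
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []
    while len(clusters) > stop_at:
        # find the pair of clusters with the smallest member-to-member distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # dendrogram merge record
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return clusters, merges

points = [1.0, 1.1, 5.0, 5.2, 9.0]
# "cut the tree" at 3 clusters rather than merging down to one:
clusters, merges = agglomerate(points, stop_at=3)
print(sorted(sorted(c) for c in clusters))
```

The `merges` list records which clusters merged at which distance, which is exactly the information a dendrogram visualizes.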
What is a network made up of?
Nodes and edges
- Nodes (vertices) –> entities in the data
- Edges (arcs) –> relationships between the entities
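A minimal way to represent this is an adjacency list mapping each node to its neighbours (the node names here are made up for illustration):

```python
# Edges as pairs of entities; node names are hypothetical examples.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)  # undirected network:
    adjacency.setdefault(b, set()).add(a)  # store both directions

print(sorted(adjacency))           # the nodes
print(sorted(adjacency["alice"]))  # alice's neighbours
```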