Lecture 8 – Grouping data Flashcards

1
Q

Segmenting data

A

Sometimes the segmentation of data is driven by the context of the data (e.g., its sources)

Sometimes we don’t have pre-determined segments, but we still want segmentation
- e.g., identifying customer segments

2
Q

What is a segmentation model?

A
3
Q

Which one of the following tasks is not a segmentation task?

A. Grouping all the shopping items available on the web
B. Identification of areas of similar land use in an earth observation database
C. Weather prediction based on last month’s temperatures

A

C. Weather prediction is a forecasting task, not a segmentation task; A and B both group similar items into segments.

4
Q

What are regression trees?

A
  • A regression tree is a supervised machine learning algorithm that predicts a continuous-valued response variable by learning decision rules from the predictors (or independent variables)
  • Two main steps:
  1. divide the data into subsets of similar values
  2. estimate the response within each subset
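
The two steps above can be sketched in a few lines of Python. This is a minimal depth-1 "stump" on hypothetical 1-D toy data, not a full tree: it tries candidate thresholds, keeps the one that minimises squared error, and predicts with the subset mean.

```python
# Minimal regression-tree sketch (hypothetical toy data):
# step 1: divide the data into subsets of similar values,
# step 2: estimate the response within each subset by its mean.

def best_split(xs, ys):
    """Try every threshold between distinct x values and keep the one
    that minimises the total squared error of the two subsets."""
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        sse = sum((y - sum(left) / len(left)) ** 2 for y in left) \
            + sum((y - sum(right) / len(right)) ** 2 for y in right)
        if best is None or sse < best[1]:
            best = (t, sse)
    return best[0]

def stump_predict(xs, ys, x_new):
    """Predict with a depth-1 tree: mean of the subset x_new falls in."""
    t = best_split(xs, ys)
    side = [y for x, y in zip(xs, ys) if (x < t) == (x_new < t)]
    return sum(side) / len(side)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_split(xs, ys))        # 10: the threshold separating the two value groups
print(stump_predict(xs, ys, 2))  # 6.0: mean of the low subset
```

A real regression tree applies this split search recursively to each subset and to every predictor, with stopping conditions on depth or subset size.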
5
Q

What is ANOVA?
How does it work to split regression trees?

A

ANOVA = analysis of variance
–> a type of statistical test used to determine whether there is a statistically significant difference between two or more categorical groups, by testing for differences of means using variance.
–> in regression trees, the same idea drives the splits: a split point is chosen so that the within-subset variance of the response drops as much as possible relative to the parent node.
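
The variance-reduction idea can be sketched directly (toy data, hypothetical values): score a candidate split by how much the weighted within-group variance falls below the parent node's variance.

```python
# Sketch: score a candidate split by the reduction in within-group
# variance, the quantity an ANOVA-style comparison of groups rests on.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

parent = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
good = variance_reduction(parent, [5.0, 6.0, 7.0], [20.0, 21.0, 22.0])
bad = variance_reduction(parent, [5.0, 20.0, 7.0], [6.0, 21.0, 22.0])
print(good > bad)  # the clean split reduces variance far more
```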

6
Q

What are pros and cons of regression trees?

A

Pros:
- Easy to understand.
- Visualizing the tree can reveal crucial information, such as how decision rules are formed, the importance of different predictors and the effect of the splitting points in the predictors.
- Implicitly performs feature selection, as some of the predictors may not be included in the tree.
- Not sensitive to the presence of missing values and outliers.
- No assumptions about the shape and the distribution of the data.

Cons:
- The fit has a high variance, meaning small changes in the data set can lead to an entirely different tree.
- Overfitting is a problem for tree-based models, but we can adjust the stopping conditions and prune the tree.
- Can be inefficient when performing an exhaustive search for the splitting points of continuous numerical predictors.
- Greedy algorithms cannot guarantee the return of the globally optimal regression tree.

7
Q

When do you use regression trees and when classification trees?

A
  • Regression tasks are about determining quantitative numerical variables based on the input variables
  • Classification tasks are about determining a qualitative value (e.g., category or class) based on the input variables
    –> categorical variables (nominal data, ordinal data)
8
Q

How do you split classification trees?

A

The most popular split criteria are Gini impurity and entropy.
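
Both criteria measure how "mixed" a node's class counts are; a split is good when it produces purer children. A minimal sketch of the two formulas:

```python
import math

# Sketch of the two split criteria, applied to a node's class counts.

def gini(counts):
    """Gini impurity: 1 - sum p_i^2 (0 for a pure node)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum p_i * log2(p_i) (0 for a pure node, 1 bit for a 50/50 split)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

print(gini([10, 0]))    # 0.0: pure node
print(gini([5, 5]))     # 0.5: maximally mixed two-class node
print(entropy([5, 5]))  # 1.0 bit
```

A classification tree picks the split that most reduces the (size-weighted) impurity of the children, exactly as a regression tree reduces variance.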

9
Q

Clustering and segmentation

A
10
Q

Explain the use of clustering

A

Use of clustering:

  • Text documents, e.g., patents, legal cases, web pages, questions and feedback ==> topic modelling
  • Clients, e.g., recommendation systems
  • Fault detection, e.g., fraud, network security
  • Missing data
  • A clustering task may require a number of different algorithms/approaches.
11
Q

What are elements of a cluster?

A
  • Are similar in some attributes
  • May weigh some attributes more than others ==> not all attributes are equally important (feature selection)
  • May be considered to be close to each other ==> needs distance measurements

12
Q

What are the two clustering approaches you learned in the lecture?

A
  1. k-means algorithm
  2. Hierarchical clustering
13
Q

Explain the k-means algorithm

A
  1. Randomly select centroids for K clusters
  2. Assign each data point to its nearest centroid to form the cluster populations
  3. Find the mean value in each cluster and use that as the new centroid
  4. Re-evaluate populations and centroids until stable/convergence
  • Does not work with categorical data and is susceptible to outliers
  • Have to predefine a value for K
  • No guarantee there are actually clusters to find
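
The four steps can be sketched in pure Python on hypothetical 1-D toy data (a real implementation would handle multi-dimensional points, empty clusters and restarts):

```python
import random

# Minimal k-means sketch following the four steps above (1-D toy data).
def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)              # step 1: random centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assign to nearest
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i] # step 3: new centroid = mean
               for i, c in enumerate(clusters)]
        if new == centroids:                          # step 4: repeat until stable
            break
        centroids = new
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans(points, 2))  # two centroids, one near 1 and one near 10
```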
14
Q

Explain hierarchical clustering

A

Clusters within clusters!
* Agglomerative (bottom-up) vs divisive (top-down)

  • Agglomerative:
  • Treat each data point as a centroid in a cluster of population 1
  • Form new clusters by merging nearby clusters
  • Continue until only one cluster remains
  • Various ways to calculate which clusters should be merged, often looking at (min or max) distances of the cluster populations to each other
  • The results of hierarchical clustering are usually presented in a dendrogram
  • Greedy!
  • Can be costly, due to having to calculate many distances at each level of the tree
  • But with no randomness, the same tree will be produced each time
  • Can cut the tree at any level to obtain a certain number of clusters
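
A minimal sketch of the agglomerative procedure on hypothetical 1-D toy data: start with singleton clusters, greedily merge the closest pair (here using the minimum pairwise distance, i.e. single linkage), and stop early, which is equivalent to cutting the dendrogram at that level.

```python
# Agglomerative clustering sketch: singleton clusters, repeatedly merge
# the closest pair (single linkage), stop when n_clusters remain.
def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]       # each point is its own cluster
    while len(clusters) > n_clusters:
        best = None                        # find the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)     # greedy merge
    return [sorted(c) for c in clusters]

print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.0], 2))
# [[1.0, 1.1], [5.0, 5.2, 9.0]]
```

Note the cost visible in the nested loops: every merge re-examines many pairwise distances, but with no randomness the same tree is produced every time.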
15
Q

What is a network made up of?

A

Nodes and edges

  • Nodes (vertices) –> entities in the data
  • Edges (arcs) –> relationships between the entities
16
Q

Explain the specific terms for network data

A
  • Directed graphs – direction of connections, visualised as arrows, e.g., retweets, power relationships, dispersion of resources
  • Weight – the strength of a connection, e.g., number of instances
  • Degree – how many connections a node has:
    • Can incorporate the weight of the connections
    • Can distinguish between in-degree and out-degree
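
In- and out-degree fall out directly from an edge list. A small sketch with hypothetical retweet data (names invented for illustration):

```python
from collections import Counter

# Directed graph as an edge list: (source, target), e.g. "source
# retweeted target" (hypothetical data).
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "ann")]

out_degree = Counter(src for src, _ in edges)  # connections going out of a node
in_degree = Counter(dst for _, dst in edges)   # connections coming into a node

print(out_degree["ann"], in_degree["cat"])  # 2 2
```

Weighted degree would simply sum edge weights instead of counting edges.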
17
Q

How do you evaluate nodes?

A

Significant nodes
- The closeness of nodes is not a Euclidean distance
- The centrality of a node can be measured in various ways:
  - Degree
  - Betweenness
  - Closeness

Clustering nodes
- You can still identify clusters of nodes
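
Two of the centrality measures above can be sketched on a small toy graph (an adjacency dict; the path graph is hypothetical). Closeness uses shortest-path lengths, which is why it is not a Euclidean distance.

```python
from collections import deque

# Undirected path graph a - b - c - d as an adjacency dict (toy data).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

def degree(g, node):
    """Degree centrality: number of connections."""
    return len(g[node])

def closeness(g, node):
    """Closeness centrality: (n - 1) / sum of shortest-path lengths
    from the node to all other nodes (computed by BFS)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(g) - 1) / sum(d for d in dist.values() if d)

print(degree(graph, "b"))                             # 2
print(closeness(graph, "b") > closeness(graph, "a"))  # True: middle nodes are more central
```

Betweenness (how often a node lies on shortest paths between other nodes) follows the same BFS idea but requires counting paths, so it is left out of this sketch.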