Lecture 8 – Grouping data Flashcards
Segmenting data
Sometimes data is segmented because of its context (e.g. different sources)
Sometimes we don’t have pre-determined segments, but we still want segmentation
- e.g. identifying customer segments
What is a segmentation model?
Which one of the following tasks is not a segmentation task?
A. Group all the shopping items available on the web.
B. Identification of areas of similar land use in an earth observation database
C. Weather prediction based on last month’s temperature
(C: predicting a numerical value from past temperatures is a regression/forecasting task, not a grouping of items into segments)
What are regression trees?
- A regression tree is a supervised machine learning algorithm that predicts a continuous-valued response variable by learning decision rules from the predictors (or independent variables)
- Two main steps:
- divide the data into subsets of similar values
- estimate the response within each subset
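The two steps above can be sketched on 1-D data: try every candidate splitting point, keep the one that minimizes the within-subset squared error, then predict each subset's mean (a minimal illustration of one split, not a full tree):

```python
# Minimal sketch of one regression-tree split on 1-D data (toy example).

def best_split(xs, ys):
    """Try every threshold and keep the one that minimizes the
    total within-subset squared error of the response."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if best is None or err < best[1]:
            best = (t, err, left, right)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
t, err, left, right = best_split(xs, ys)
print(t)                       # chosen splitting point (3 here)
print(sum(left) / len(left))   # predicted response left of the split
print(sum(right) / len(right)) # predicted response right of the split
```

A real tree repeats this recursively on each subset until a stopping condition is met.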
What is ANOVA?
How is ANOVA used to split regression trees?
ANOVA = analysis of variance
–> type of statistical test used to determine if there is a statistically significant difference between the means of two or more categorical groups by comparing variances.
–> in regression trees, the candidate split whose resulting subsets best separate the response (large between-group variance, small within-group variance) is chosen.
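As a sketch of the idea, the one-way ANOVA F statistic compares between-group to within-group variance; a large value suggests the group means genuinely differ (a pure-Python illustration, not a library routine):

```python
# Sketch: one-way ANOVA F statistic for a list of groups of observations.

def anova_f(groups):
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # between-group sum of squares: how far group means sit from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

well_separated = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.8]]
overlapping = [[1.0, 5.0, 3.0], [1.1, 4.9, 3.1]]
print(anova_f(well_separated))  # large F: means clearly differ
print(anova_f(overlapping))     # small F: means do not
```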
What are pros and cons of regression trees?
Pros:
- easy to understand
- Visualizing the tree can reveal crucial information, such as how decision rules are formed, the importance of different predictors and the effect of the splitting points in the predictors.
- Implicitly performs feature selection, as some of the predictors may not be included in the tree.
- Not sensitive to the presence of missing values and outliers.
- No assumptions about the shape and the distribution of the data.
Cons:
- The fit has a high variance, meaning small changes in the data set can lead to an entirely different tree.
- Overfitting is a problem for tree-based models, but we can adjust the stopping conditions and prune the tree.
- Can be inefficient when performing an exhaustive search for the splitting points of continuous numerical predictors.
- Greedy algorithms cannot guarantee the return of the globally optimal regression tree.
When do you use regression trees and when classification trees?
- Regression tasks relate to determining quantitative numerical variables based on input variables
- Classification tasks are about determining a qualitative value (e.g., category or class) based on the input variables
–> Categorical variables (nominal data, ordinal data)
How do you split classification trees?
Most popular split criteria are Gini and Entropy
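A minimal sketch of the two criteria, computed per node from its class labels; both are 0 for a pure node and maximal for an even class mix (illustration only):

```python
from math import log2

def gini(labels):
    """Gini impurity: probability of mislabelling a random element."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

pure = ["a", "a", "a", "a"]
mixed = ["a", "a", "b", "b"]
print(gini(pure), entropy(pure))    # both 0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for an even two-class mix
```

A split is then chosen to maximize the impurity reduction from parent node to children.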
Clustering and segmentation
Explain the use of clustering
Use of clustering:
- Text documents, e.g., patents, legal cases, web pages, questions and feedback ==> topic modelling
- Clients, e.g., recommendation systems
- Fault detection, e.g., fraud, network security
- Missing data
- A clustering task may require a number of different algorithms/approaches.
What are elements of a cluster?
- Are similar in some attributes
- Some attributes may weigh more than others ==> not all attributes are equally important (feature selection)
- May be considered to be close to each other ==> needs distance measurements
What are the two clustering approaches you learned in the lecture?
- k-means algorithm
- Hierarchical
Explain the k-means algorithm
- Randomly select centroids for K clusters
- Assign each data point to its nearest centroid to form the cluster population
- Compute the mean of each cluster and use it as the new centroid
- Re-evaluate populations and centroids until stable/convergence
- Does not work with categorical data and is susceptible to outliers
- Have to predefine a value for K
- No guarantee there are actually clusters to find
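The steps above can be sketched for 1-D data (a toy illustration; real implementations handle multiple dimensions and empty clusters more carefully):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # 1) random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2) assign each point to
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)                 #    its nearest centroid
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]   # 3) cluster means become new centroids
        if new == centroids:                      # 4) stop at convergence
            break
        centroids = new
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centroids = kmeans(data, 2)
print(centroids)  # two centroids, near 1.0 and 9.0
```

Note that K = 2 had to be chosen up front, matching the limitation listed above.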
Explain hierarchical clustering
Clusters within clusters!
* Agglomerative (bottom-up) vs Divisive (Top-down)
- Agglomerative:
- Treat each data point as a centroid in a cluster of population 1
- Form new clusters by merging nearby clusters
- Continue until only one cluster
- Various ways to calculate which clusters should be merged, often looking at (min or max) distances of the cluster population to each other
- The results of hierarchical clustering are usually presented in a dendrogram
- Greedy!
- Can be costly, due to having to calculate a lot of distances for each level of the tree.
- But with no randomness, the same tree will be produced each time.
- Can cut the tree at any level so as to get the population of a certain number of clusters.
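The agglomerative procedure can be sketched on 1-D points using single linkage (minimum distance between cluster members), including a cut at three clusters instead of merging all the way to one (illustration only; library implementations are far more efficient):

```python
# Sketch of agglomerative (bottom-up) clustering with single linkage.

def agglomerate(points, stop_at=1):
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []
    while len(clusters) > stop_at:
        # find the pair of clusters with the smallest member-to-member distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # dendrogram merge record
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return clusters, merges

points = [1.0, 1.1, 5.0, 5.2, 9.0]
# "cut the tree" at 3 clusters rather than merging down to one:
clusters, merges = agglomerate(points, stop_at=3)
print(sorted(sorted(c) for c in clusters))
```

The `merges` list records which clusters merged at which distance, which is exactly the information a dendrogram visualizes.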
What is a network made up of?
Nodes and edges
- Nodes (vertices) –> entities in the data
- Edges (arcs) –> relationships between the entities
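A minimal way to represent this is an adjacency list mapping each node to its neighbours (the node names here are made up for illustration):

```python
# Edges as pairs of entities; node names are hypothetical examples.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)  # undirected network:
    adjacency.setdefault(b, set()).add(a)  # store both directions

print(sorted(adjacency))           # the nodes
print(sorted(adjacency["alice"]))  # alice's neighbours
```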