5 - Clustering, Segmenting, and Cutting Through the Noise Flashcards

1
Q

What is the primary goal of the Fraud Department at Shu Financial?

A

To protect the company from losses and ensure customers’ private information is secure

2
Q

What are the two types of identity fraud?

A
  • Application fraud
  • Account takeover
3
Q

What is a supervised machine learning model?

A

A model trained on a target variable, such as the probability of an application being fraudulent

4
Q

True or False: The Fraud Department is considered a profit center for Shu Financial.

A

False

5
Q

What is the focus of unsupervised machine learning in the context of fraud detection?

A

To find patterns and groupings among fraudulent applications without a predefined target variable

6
Q

What is principal components analysis (PCA)?

A

A dimensionality-reduction method that reduces the number of variables while retaining important signals

7
Q

What does ‘dimensionality reduction’ refer to?

A

The process of reducing the number of columns in a dataset while retaining important information

8
Q

What types of variables does PCA work best with?

A
  • Continuous variables
  • Ordinal variables
9
Q

What is the first preprocessing step before conducting PCA?

A

Eliminating variables that have limited incremental information

10
Q

What is normalization in the context of data standardization?

A

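Rescaling each variable to have a mean of 0 and a standard deviation of 1 (also called z-score standardization), so that variables measured in different units become comparable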
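
A minimal sketch of this rescaling in Python, using only the standard library. The income figures are made up for illustration and do not come from the Shu Financial case.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]

# Hypothetical applicant incomes in dollars (illustrative only).
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
z_scores = standardize(incomes)
```

After this step a variable measured in dollars and one measured in years contribute on the same scale, which matters for both PCA and distance-based clustering.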

11
Q

What is the relationship between principal components?

A

Each component is perpendicular (orthogonal) to all previous components, so the components are uncorrelated with one another
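
One way to check this property numerically is sketched below. It assumes NumPy is available and uses synthetic data, not the case study's. The components are computed with a singular value decomposition of the standardized data, which is one common route to PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly related columns plus one unrelated column.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize, then decompose; the rows of Vt are the component directions.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

scores = Xs @ Vt.T                    # data expressed in component coordinates
explained = S**2 / np.sum(S**2)       # share of variance per component

direction_dots = Vt @ Vt.T            # identity matrix when directions are perpendicular
score_corr = np.corrcoef(scores, rowvar=False)  # off-diagonal entries close to 0
```

The first entry of `explained` is the largest, which matches the idea that the first component captures the most information and each later one captures what remains.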

12
Q

What is cluster profiling?

A

The process of generating unique descriptions for each cluster using the input variables

13
Q

What should be done if the clusters are not well separated?

A

Investigate the features to describe what the clusters represent

14
Q

What is the role of the business unit in the data science process?

A

To provide domain-specific knowledge that helps interpret principal components and clusters

15
Q

Fill in the blank: The second principal component is computed by finding what combination of input variables explains the _______.

A

remaining information not explained by the first principal component

16
Q

What does the data science team plan to do after identifying different groups of fraudulent applications?

A

Build separate models to target the different clusters

17
Q

What is t-SNE?

A

A dimensionality reduction method often used when separating clusters using linear assumptions is difficult

18
Q

What happens if too much critical information is discarded during dimensionality reduction?

A

It can hinder the analysis and lead to less effective models

19
Q

What is the importance of the partnership between the business unit and the data science team?

A

It ensures that the data science products meet business needs, avoiding wasted time and resources

20
Q

What does the first principal component typically represent?

A

The combination of variables that explains the most information in the dataset

21
Q

True or False: Categorical variables are very useful in principal components analysis.

A

False

22
Q

What is the process of profiling clusters based on?

A

Labeling clusters based on a few variables that are the most extreme in that specific cluster
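
A rough sketch of how "most extreme variables" could be found automatically. The approach (ranking variables by how far each cluster mean sits from the overall mean, in standard deviations) is an assumption for illustration, as are the variable names and data.

```python
from statistics import mean, pstdev

def profile_clusters(rows, labels, names, top_n=2):
    """For each cluster, return the top_n variables whose cluster mean is
    farthest from the overall mean, measured in overall standard deviations."""
    columns = list(zip(*rows))
    overall_mean = [mean(c) for c in columns]
    overall_sd = [pstdev(c) for c in columns]
    profiles = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        gaps = []
        for j, name in enumerate(names):
            spread = overall_sd[j] or 1.0  # avoid dividing by zero for constant columns
            cluster_mean = mean(r[j] for r in members)
            gaps.append((name, (cluster_mean - overall_mean[j]) / spread))
        gaps.sort(key=lambda g: abs(g[1]), reverse=True)
        profiles[cluster] = gaps[:top_n]
    return profiles

# Hypothetical variables and values (illustrative only).
names = ["applicant_age", "num_recent_applications", "income_thousands"]
rows = [(22, 9, 40), (24, 8, 42), (23, 10, 41),
        (55, 1, 41), (58, 2, 40), (60, 1, 42)]
labels = [0, 0, 0, 1, 1, 1]
profiles = profile_clusters(rows, labels, names)
```

Here cluster 0 would be described by low age and many recent applications, while income does not distinguish the clusters. Final labels would still need review by the business unit.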

23
Q

Who are the clients in the collaborative profiling process?

A

Steve and experts in the Fraud Department

24
Q

What is the limitation of manually labeling clusters when the number of clusters is very large?

A

A manual process of labeling clusters is not feasible

25
What can the data science team track when exploring different clustering algorithms?
Number of clusters identified, percentage of the population in each cluster, and cluster descriptions
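
The first two tracked quantities can be computed directly from cluster labels. A small sketch, with made-up labels and a hypothetical helper name:

```python
from collections import Counter

def cluster_summary(labels):
    """Count clusters and compute the share of the population in each one."""
    counts = Counter(labels)
    total = len(labels)
    return {
        "n_clusters": len(counts),
        "share": {c: n / total for c, n in sorted(counts.items())},
    }

# Hypothetical cluster assignments for ten applications.
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
summary = cluster_summary(labels)
```

Recording this summary for each algorithm tried makes the runs easier to compare side by side.
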
26
What types of characteristics might be used to profile clusters?
  • Demographics (age, sex, location)
  • Transaction types (gas stations, ATMs)
  • Patterns in online transactions
27
Who may be called in to review the clusters for assessability?
Subject-matter experts
28
What should Steve discuss with operational teams regarding modeling results?
How they plan to use the model output and their capacity limitations
29
What is a potential consequence of identifying too many clusters?
Information may not be useful and could exceed the capacity of the team
30
What might investigators do if they find excessive cluster information?
Begin to ignore the modeling output
31
What can streamline processes for application fraud types?
Developing streamlined processes for five different types of application fraud
32
What is one benefit of data exploration in predictive modeling?
Developing new features that can be used to improve predictive models
33
Fill in the blank: The data science team could examine the details of the first few principal components to see if they represent features that are not already explicitly in the _______.
predictive model
34
What should the modeling align with to be effective?
Operational capacity
35
True or False: A major opportunity loss for the organization can occur if the team exceeds their capacity to handle cluster information.
True
36
What is unsupervised machine learning?
A broad category of machine learning that finds structure in data without a predefined target variable; clustering is one of its common applications.
37
What is the main goal of clustering in the context of fraud analysis?
To group similar fraud cases together and separate dissimilar fraud cases.
38
List some common clustering algorithms.
  • Gaussian mixture models
  • Expectation-maximization models
  • Latent Dirichlet allocation methods
  • Fuzzy clustering algorithms
39
What distinguishes exclusive clusters from fuzzy clusters?
Exclusive clusters assign customers to only one group, while fuzzy clusters allow customers to belong to multiple clusters with varying probabilities.
40
What is hierarchical clustering?
A type of clustering in which smaller clusters are nested within larger clusters, often visualized as the branches of a tree (a dendrogram).
41
What is partitional clustering?
A clustering method that does not involve hierarchy and focuses on dividing data into distinct groups.
42
What preprocessing step is commonly performed before clustering?
Standardizing features so that they are on similar scales.
43
What is K-means clustering?
A partitioning-based clustering method that assigns observations to clusters based on their distance to centroids.
44
How does K-means determine the optimal number of clusters?
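K-means does not choose K on its own. The analyst tests a range of values for K and selects the point at which adding more clusters yields only small gains in explained variance (the "elbow"), since explained variance keeps rising as K increases.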
45
What is a centroid in K-means clustering?
The center point of a cluster, calculated as the mean of all points in that cluster.
46
True or False: K-means clustering can be heavily influenced by the initial starting points.
True.
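
The K-means cards above can be illustrated with a small sketch of the standard assign-then-update loop (often called Lloyd's algorithm), written with the standard library only. The function name, the explicit `init_centroids` argument, and the toy points are assumptions made for illustration.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Plain K-means. Returns (centroids, labels, inertia)."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: total squared distance from each point to its centroid.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Three well-separated toy groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]

# One starting point per group versus two starting points in the same group.
good = kmeans(points, [points[0], points[3], points[6]])
poor = kmeans(points, [points[0], points[1], points[3]])
```

With the second set of starting points the loop settles on a worse solution that splits the first group and merges the other two, which is why K-means is usually run from several different starts.
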
47
What is fuzzy K-means clustering?
A variation of K-means where customers can belong to more than one cluster, with probabilities based on distance from centroids.
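
A sketch of how soft memberships can be derived from distances, using the membership formula from fuzzy c-means with fuzzifier `m`. Whether the source text uses this exact formula is not stated, so treat it as one possible choice. The centroids and the point are made up.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Membership weight of one point in each cluster (weights sum to 1).
    Closer centroids receive larger weights."""
    dists = [math.dist(point, c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

centroids = [(0.0, 0.0), (10.0, 0.0)]
soft = fuzzy_memberships((2.0, 0.0), centroids)

# An exclusive assignment keeps only the cluster with the largest weight.
hard = soft.index(max(soft))
```

The last line shows the contrast with exclusive clustering: the soft weights are collapsed to a single cluster index.
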
48
What is the difference between agglomerative and divisive clustering?
  • Agglomerative: Bottom-up approach merging clusters
  • Divisive: Top-down approach splitting clusters
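
The bottom-up approach can be sketched in a few lines for one-dimensional data. This is a simplified illustration, not an efficient implementation; the `linkage` argument and the sample amounts are assumptions.

```python
def agglomerative(values, n_clusters, linkage=min):
    """Bottom-up clustering of 1-D values.
    linkage=min gives single linkage, linkage=max gives complete linkage."""
    clusters = [[v] for v in values]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical transaction amounts.
amounts = [1.0, 1.5, 2.0, 10.0, 10.5, 30.0]
groups = agglomerative(amounts, 3)
```

A divisive method would run in the opposite direction, starting from one cluster holding all six values and splitting it repeatedly.
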
49
What is the significance of distance measurement in clustering?
It impacts how clusters are formed and merged, affecting the final cluster shapes.
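
A small example of why the choice matters: the same query point has a different nearest neighbor under Euclidean and Manhattan distance. The points are invented for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

query = (0.0, 0.0)
candidates = {"A": (3.0, 3.0), "B": (0.0, 5.0)}

nearest_euclidean = min(candidates, key=lambda n: euclidean(query, candidates[n]))
nearest_manhattan = min(candidates, key=lambda n: manhattan(query, candidates[n]))
```

Point A is closer in a straight line, while point B is closer when distance is summed axis by axis, so a clustering algorithm could group the query differently depending on the metric.
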
50
What are some common types of fraud that may be clustered?
  • Fraudulent applications
  • Account takeovers
  • Card thefts
  • Illegal use of lost cards
  • Skimming
51
What is the purpose of cluster analysis in operational settings?
To interpret data and improve decision-making in fraud detection and prevention.
52
What was one outcome of the cluster analysis performed by Steve's team?
The development of behavioral-based predictive models that target specific features of different clusters.
53
What is the role of principal components analysis in clustering?
To reduce dimensionality and identify the most significant components contributing to variance in the data.
54
What is a key question to ask regarding the clustering algorithm used?
What metric was used for the distance in the clustering process?
55
Fill in the blank: The final number of clusters in Steve's project was set at _______.
six.
56
What is dimensionality reduction?
A technique used to reduce the number of features in a dataset while preserving as much of its variance as possible.
57
What should be considered when choosing a method for dimensionality reduction?
The method used and the rationale behind its selection.
58
What is a key question regarding variance in dimensionality reduction?
How much variance is explained by the different components?
59
What is an important consideration when focusing on components in dimensionality reduction?
Whether reducing focus to just a few components leads to loss of too much information.
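
Both questions about variance can be explored with a cumulative explained-variance check. The sketch below assumes NumPy, synthetic data, and an arbitrary 90% threshold; none of these come from the case study.

```python
import numpy as np

def components_needed(X, threshold=0.90):
    """Return (k, cumulative), where k is the smallest number of principal
    components whose cumulative explained variance reaches the threshold."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(Xs, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data: four columns driven by two underlying signals.
rng = np.random.default_rng(1)
a = rng.normal(size=(300, 1))
b = rng.normal(size=(300, 1))
X = np.hstack([
    a, a + 0.05 * rng.normal(size=(300, 1)),
    b, b + 0.05 * rng.normal(size=(300, 1)),
])

k, cumulative = components_needed(X)
```

If `k` is much smaller than the number of columns, little is lost by keeping only the first `k` components; if the cumulative curve climbs slowly, dropping components discards real information.
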
60
How should components be interpreted in dimensionality reduction?
By analyzing what the first, second, and third components represent.
61
What is a key characteristic of clustering algorithms?
Whether the clustering is exclusive.
62
What must be specified when discussing clustering algorithms?
Which clustering algorithm was used.
63
What metric is important in clustering algorithms?
The distance metric and the linkage criteria between clusters.
64
Why is it important to discuss distance and linkage in clustering?
To understand the criteria used for clustering.