5 - Clustering, Segmenting, and Cutting Through the Noise Flashcards by Kaman Hung

What is the primary goal of the Fraud Department at Shu Financial?

To protect the company from losses and ensure customers’ private information is secure

What are the two types of identity fraud?

Application fraud
Account takeover

What is a supervised machine learning model?

A model trained on labeled examples to predict a target variable, such as whether an application is fraudulent

True or False: The Fraud Department is considered a profit center for Shu Financial.

False

What is the focus of unsupervised machine learning in the context of fraud detection?

To find patterns and groupings among fraudulent applications without a predefined target variable

What is principal components analysis (PCA)?

A dimensionality-reduction method that reduces the number of variables while retaining important signals

What does ‘dimensionality reduction’ refer to?

The process of reducing the number of columns in a dataset while retaining important information

What types of variables does PCA work best with?

Continuous variables
Ordinal variables

What is the first preprocessing step before conducting PCA?

Eliminating variables that have limited incremental information

What is normalization in the context of data standardization?

Rescaling each variable to have a mean of 0 and a standard deviation of 1 (z-score standardization)
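
As a rough illustration of the rescaling described in this card, the sketch below standardizes one column with Python's standard library. The function name `standardize` and the income figures are hypothetical, invented only for this example; the population standard deviation is used here as an assumption, since the card does not say which version applies.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

# Hypothetical applicant incomes, invented for illustration only.
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
z_incomes = standardize(incomes)
```

After this step every variable is on the same scale, so a variable measured in large units (such as income) does not dominate PCA or distance calculations.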

What is the relationship between principal components?

Each component is perpendicular (orthogonal) to all previous components, so the components are uncorrelated with one another

What is cluster profiling?

The process of generating unique descriptions for each cluster using the input variables

What should be done if the clusters are not well separated?

Investigate the features to describe what the clusters represent

What is the role of the business unit in the data science process?

To provide domain-specific knowledge that helps interpret principal components and clusters

Fill in the blank: The second principal component is computed by finding what combination of input variables explains the _______.

remaining information not explained by the first principal component
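
One way to see "the remaining information" concretely is the sketch below, which finds the first component by power iteration on the covariance matrix, subtracts what that component explains (deflation), and repeats for the second. This is only one of several ways to compute PCA and is not necessarily how any particular library does it. The data and all function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec

def deflate(matrix, value, vec):
    """Subtract the variance already explained by one component."""
    d = len(matrix)
    return [[matrix[i][j] - value * vec[i] * vec[j] for j in range(d)]
            for i in range(d)]

# Hypothetical data: the first two columns move together, the third does not.
rows = [
    [2.0, 1.9, 0.5],
    [4.0, 4.2, 0.1],
    [6.0, 5.8, 0.9],
    [8.0, 8.1, 0.3],
    [10.0, 9.9, 0.7],
    [12.0, 12.3, 0.2],
]
cov = covariance_matrix(rows)
var1, pc1 = leading_eigenpair(cov)            # direction explaining the most variance
var2, pc2 = leading_eigenpair(deflate(cov, var1, pc1))  # best direction in what remains
```

In this sketch `pc2` comes out perpendicular to `pc1`, and `var2` is smaller than `var1`, which matches the ordering of components described in these cards.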

What does the data science team plan to do after identifying different groups of fraudulent applications?

Build separate models to target the different clusters

What is t-SNE?

A nonlinear dimensionality-reduction method (t-distributed stochastic neighbor embedding), often used when clusters are hard to separate under linear assumptions

What happens if too much critical information is discarded during dimensionality reduction?

It can hinder the analysis and lead to less effective models

What is the importance of the partnership between the business unit and the data science team?

It ensures that the data science products meet business needs, avoiding wasted time and resources

What does the first principal component typically represent?

The combination of variables that explains the most information in the dataset

True or False: Categorical variables are very useful in principal components analysis.

False

What is the process of profiling clusters based on?

Labeling clusters based on a few variables that are the most extreme in that specific cluster
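
A minimal sketch of this idea, assuming "most extreme" means the cluster mean that lies furthest from the overall mean when measured in overall standard deviations. The variable names, cluster labels, and data are hypothetical, and the sketch assumes no column is constant.

```python
from statistics import mean, pstdev

def profile_clusters(rows, labels, names, top_n=2):
    """For each cluster, return the top_n variables whose cluster mean deviates
    most from the overall mean, in units of the overall standard deviation."""
    overall_mean = [mean(col) for col in zip(*rows)]
    overall_sd = [pstdev(col) for col in zip(*rows)]
    profiles = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        scores = []
        for j, name in enumerate(names):
            cluster_mean = mean(r[j] for r in members)
            scores.append((name, (cluster_mean - overall_mean[j]) / overall_sd[j]))
        # Most extreme variables first, regardless of sign.
        scores.sort(key=lambda item: abs(item[1]), reverse=True)
        profiles[cluster] = [(name, round(z, 2)) for name, z in scores[:top_n]]
    return profiles

# Hypothetical fraudulent applications with cluster labels already assigned.
names = ["applicant_age", "requested_limit", "email_age_hours"]
rows = [
    [22, 20000, 2],
    [24, 18000, 3],
    [23, 22000, 1],
    [55, 5000, 900],
    [60, 6000, 1200],
    [58, 4000, 1000],
]
labels = [0, 0, 0, 1, 1, 1]
profiles = profile_clusters(rows, labels, names)
```

The signed scores give a starting point for a label (for example, a cluster that is unusually young with unusually new email addresses), which the business unit can then confirm or correct.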

Who are the clients in the collaborative profiling process?

Steve and experts in the Fraud Department

What is the limitation of manually labeling clusters when the number of clusters is very large?

A manual process of labeling clusters is not feasible

What can the data science team track when exploring different clustering algorithms?

Number of clusters identified, percentage of the population in each cluster, and cluster descriptions

What types of characteristics might be used to profile clusters?

* Demographics (age, sex, location)
* Transaction types (gas stations, ATMs)
* Patterns in online transactions

Who may be called in to review the clusters for assessability?

Subject-matter experts

What should Steve discuss with operational teams regarding modeling results?

How they plan to use the model output and their capacity limitations

What is a potential consequence of identifying too many clusters?

Information may not be useful and could exceed the capacity of the team

What might investigators do if they find excessive cluster information?

Begin to ignore the modeling output

What can streamline processes for application fraud types?

Developing streamlined processes for five different types of application fraud

What is one benefit of data exploration in predictive modeling?

Developing new features that can be used to improve predictive models

Fill in the blank: The data science team could examine the details of the first few principal components to see if they represent features that are not already explicitly in the _______.

predictive model

What should the modeling align with to be effective?

Operational capacity

True or False: A major opportunity loss for the organization can occur if the team exceeds their capacity to handle cluster information.

True

What is unsupervised machine learning?

A broad category of machine learning that finds patterns in data without a predefined target variable; clustering is one of its most common applications.

What is the main goal of clustering in the context of fraud analysis?

To group similar fraud cases together and separate dissimilar fraud cases.

List some common clustering algorithms.

* Gaussian mixture models
* Expectation-maximization models
* Latent Dirichlet allocation methods
* Fuzzy clustering algorithms

What distinguishes exclusive clusters from fuzzy clusters?

Exclusive clusters assign customers to only one group, while fuzzy clusters allow customers to belong to multiple clusters with varying probabilities.

What is hierarchical clustering?

A type of clustering where smaller clusters belong to larger clusters, often visualized like branches of trees.

What is partitional clustering?

A clustering method that does not involve hierarchy and focuses on dividing data into distinct groups.

What preprocessing step is commonly performed before clustering?

Standardizing features so that they are on similar scales.

What is K-means clustering?

A partitioning-based clustering method that assigns observations to clusters based on their distance to centroids.

How does K-means determine the optimal number of clusters?

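K-means does not choose K on its own. The analyst tests a range of values for K and selects the point beyond which adding more clusters explains little additional variance (often called the elbow method).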

What is a centroid in K-means clustering?

The center point of a cluster, calculated as the mean of all points in that cluster.
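
The sketch below puts the last few cards together: observations are assigned to the nearest centroid, each centroid is recomputed as the mean of its members, and the within-cluster sum of squares is compared across several values of K. It is a bare-bones illustration, not a production implementation; the function names, the `seed` parameter, and the data are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # starting points are drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical two-variable data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
wcss_by_k = {k: within_cluster_ss(points, *kmeans(points, k)) for k in (1, 2, 3)}
```

Because the starting centroids are sampled at random, changing `seed` can change the result on less cleanly separated data, which is why K-means is often run several times from different starting points.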

True or False: K-means clustering can be heavily influenced by the initial starting points.

True.

What is fuzzy K-means clustering?

A variation of K-means where customers can belong to more than one cluster, with probabilities based on distance from centroids.
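
To make the distance-based memberships concrete, the sketch below uses the membership formula from the standard fuzzy c-means algorithm, which is an assumption here since the card does not name a specific variant. The centroids are fixed, hypothetical values; a full algorithm would also update them.

```python
import math

def fuzzy_memberships(point, centroids, fuzziness=2.0):
    """Membership weight of one point in each cluster (weights sum to 1).

    Uses the fuzzy c-means rule: u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)),
    where d_i is the distance to centroid i and m is the fuzziness (m > 1).
    """
    distances = [math.dist(point, c) for c in centroids]
    if 0.0 in distances:
        # The point sits exactly on a centroid: give it full membership there.
        hits = distances.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in distances]
    power = 2.0 / (fuzziness - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances)
            for d_i in distances]

# Hypothetical centroids and customers.
centroids = [(0.0, 0.0), (10.0, 0.0)]
near_first = fuzzy_memberships((2.0, 0.0), centroids)
in_between = fuzzy_memberships((5.0, 0.0), centroids)
```

A customer close to one centroid gets most of its weight there, while a customer halfway between two centroids is split evenly between them.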

What is the difference between agglomerative and divisive clustering?

* Agglomerative: Bottom-up approach merging clusters
* Divisive: Top-down approach splitting clusters
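
The bottom-up direction can be sketched in a few lines: every observation starts as its own cluster, and the two closest clusters are merged until the desired number remains. Single linkage (closest pair of points) is used here as an assumption; the function name and data are hypothetical, and the divisive direction is not shown.

```python
import math

def agglomerative(points, target_clusters):
    """Bottom-up clustering with single linkage."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                gap = min(math.dist(a, b)
                          for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i is unaffected
    return clusters

# Hypothetical two-variable observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, target_clusters=3)
```

Recording the order of the merges, rather than stopping at a fixed count, would give the tree-like hierarchy described for hierarchical clustering.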

What is the significance of distance measurement in clustering?

It impacts how clusters are formed and merged, affecting the final cluster shapes.
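
A small sketch of why the choice matters: the same observation can be closest to different centroids depending on whether Euclidean or Manhattan distance is used. The points and function names are hypothetical.

```python
def euclidean(a, b):
    """Straight-line distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute differences along each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(point, centroids, distance):
    """Index of the centroid closest to point under the given distance."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

point = (0, 0)
centroids = [(3, 3), (0, 5)]
nearest_euclidean = nearest(point, centroids, euclidean)  # centroid 0 (about 4.24 vs 5)
nearest_manhattan = nearest(point, centroids, manhattan)  # centroid 1 (6 vs 5)
```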

What are some common types of fraud that may be clustered?

* Fraudulent applications
* Account takeovers
* Card thefts
* Illegal use of lost cards
* Skimming

What is the purpose of cluster analysis in operational settings?

To interpret data and improve decision-making in fraud detection and prevention.

What was one outcome of the cluster analysis performed by Steve's team?

The development of behavioral-based predictive models that target specific features of different clusters.

What is the role of principal components analysis in clustering?

To reduce dimensionality and identify the most significant components contributing to variance in the data.

What is a key question to ask regarding the clustering algorithm used?

What metric was used for the distance in the clustering process?

Fill in the blank: The final number of clusters in Steve's project was set at _______.

six.

What is dimensionality reduction?

A technique used to reduce the number of features in a dataset while preserving as much of its variance as possible.

What should be considered when choosing a method for dimensionality reduction?

The method used and the rationale behind its selection.

What is a key question regarding variance in dimensionality reduction?

How much variance is explained by the different components?
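
One way to answer this question is to turn each component's variance into a share of the total and accumulate the shares. The sketch below does that for a set of hypothetical component variances; the 90% threshold is an illustrative assumption, not a rule from the source.

```python
def explained_variance(component_variances):
    """Return (share of variance per component, cumulative share)."""
    total = sum(component_variances)
    ratios = [v / total for v in sorted(component_variances, reverse=True)]
    cumulative, running = [], 0.0
    for ratio in ratios:
        running += ratio
        cumulative.append(running)
    return ratios, cumulative

def components_needed(cumulative, threshold=0.9):
    """Smallest number of components whose cumulative share reaches threshold."""
    for count, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return count
    return len(cumulative)

# Hypothetical variances of four principal components.
ratios, cumulative = explained_variance([5.0, 3.0, 1.5, 0.5])
kept = components_needed(cumulative)  # 3: the first two reach only 80%
```

If the first few components capture only a small cumulative share, focusing on them alone risks discarding too much information.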

What is an important consideration when focusing on components in dimensionality reduction?

Whether reducing focus to just a few components leads to loss of too much information.

How should components be interpreted in dimensionality reduction?

By analyzing what the first, second, and third components represent.

What is a key characteristic of clustering algorithms?

Whether the clustering is exclusive (each observation belongs to exactly one cluster) or fuzzy.

What must be specified when discussing clustering algorithms?

Which clustering algorithm was used.

What metric is important in clustering algorithms?

The distance metric and the linkage criteria between clusters.
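
The linkage criterion decides how the distance between two clusters is summarized from the distances between their members. The sketch below compares three common choices on two small, hypothetical clusters, using Euclidean distance as an assumption.

```python
import math

def pairwise_distances(cluster_a, cluster_b):
    """All member-to-member distances between two clusters."""
    return [math.dist(a, b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of members."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of members."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs of members."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters lying along one axis.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (6, 0)]
single = single_linkage(cluster_a, cluster_b)      # 2.0
complete = complete_linkage(cluster_a, cluster_b)  # 6.0
average = average_linkage(cluster_a, cluster_b)    # 4.0
```

Because the three criteria give different distances for the same pair of clusters, they can lead a hierarchical algorithm to merge clusters in a different order.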

Why is it important to discuss distance and linkage in clustering?

To understand the criteria used for clustering.