5 - Clustering, Segmenting, and Cutting Through the Noise Flashcards
What is the primary goal of the Fraud Department at Shu Financial?
To protect the company from losses and ensure customers’ private information is secure
What are the two types of identity fraud?
- Application fraud
- Account takeover
What is a supervised machine learning model?
A model trained on a target variable, such as the probability of an application being fraudulent
True or False: The Fraud Department is considered a profit center for Shu Financial.
False
What is the focus of unsupervised machine learning in the context of fraud detection?
To find patterns and groupings among fraudulent applications without a predefined target variable
What is principal components analysis (PCA)?
A dimensionality-reduction method that reduces the number of variables while retaining important signals
What does ‘dimensionality reduction’ refer to?
The process of reducing the number of columns in a dataset while retaining important information
What types of variables does PCA work best with?
- Continuous variables
- Ordinal variables
What is the first preprocessing step before conducting PCA?
Eliminating variables that have limited incremental information
What is normalization in the context of data standardization?
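Rescaling each variable to have a mean of 0 and a standard deviation of 1 (z-score standardization), so that variables measured on different scales contribute comparably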
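Note: the rescaling described in this card can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used at Shu Financial; the `standardize` helper and the sample columns (income, number of open accounts) are made up for the example.

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns (std of 0) to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Hypothetical application data: income (dollars) and number of open accounts.
X = np.array([[40_000, 2], [85_000, 5], [60_000, 3], [120_000, 9]])
Z = standardize(X)

print(Z.mean(axis=0))  # column means after rescaling
print(Z.std(axis=0))   # column standard deviations after rescaling
```

Without this step, a variable measured in large units (such as income) would dominate the principal components simply because of its scale.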
What is the relationship between principal components?
Each component is perpendicular (orthogonal) to all previous components, so the components are uncorrelated with one another
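Note: the perpendicularity in this card can be checked numerically. The sketch below computes principal components with a singular value decomposition on synthetic data (the data and variable names are assumptions for illustration) and confirms that the component directions are orthogonal and the resulting scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 columns, two of which are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(200, 1)),
    2 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize first, as PCA is sensitive to scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal component directions (loadings).
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Project the data onto the components to get one score column per component.
scores = Z @ Vt.T

# Dot products between directions: identity matrix means mutually perpendicular.
gram = Vt @ Vt.T

# Correlations between score columns: off-diagonal entries should be ~0.
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(gram, 6))
print(np.round(score_corr, 6))
```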
What is cluster profiling?
The process of generating unique descriptions for each cluster using the input variables
What should be done if the clusters are not well separated?
Investigate the features to describe what the clusters represent
What is the role of the business unit in the data science process?
To provide domain-specific knowledge that helps interpret principal components and clusters
Fill in the blank: The second principal component is computed by finding what combination of input variables explains the _______.
most of the remaining information not explained by the first principal component
What does the data science team plan to do after identifying different groups of fraudulent applications?
Build separate models to target the different clusters
What is t-SNE?
t-distributed stochastic neighbor embedding: a nonlinear dimensionality-reduction method, often used when clusters are difficult to separate under linear assumptions such as those of PCA
What happens if too much critical information is discarded during dimensionality reduction?
It can hinder the analysis and lead to less effective models
What is the importance of the partnership between the business unit and the data science team?
It ensures that the data science products meet business needs, avoiding wasted time and resources
What does the first principal component typically represent?
The combination of input variables that explains the most variation (information) in the dataset
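Note: "explains the most information" is usually measured as the share of total variance captured by each component. The sketch below, on synthetic data invented for the example, computes that share from the singular values and shows that the first component captures the largest part, with each later component capturing less.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: three columns driven by one shared signal, plus one noise column.
shared = rng.normal(size=(300, 1))
columns = [shared + 0.2 * rng.normal(size=(300, 1)) for _ in range(3)]
columns.append(rng.normal(size=(300, 1)))
X = np.hstack(columns)

# Standardize so every variable starts with equal variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Singular values come back sorted from largest to smallest.
S = np.linalg.svd(Z, compute_uv=False)

# Squared singular values are proportional to the variance along each component.
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

print(np.round(explained, 3))   # share of variance per component
print(np.round(cumulative, 3))  # running total, useful for choosing how many to keep
```

The running total is one common way to decide how many components to keep before too much information is discarded.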
True or False: Categorical variables are very useful in principal components analysis.
False
How are clusters labeled during profiling?
By describing each cluster with the few input variables whose values are most extreme in that cluster
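Note: one way to find the "most extreme" variables per cluster is to compare each cluster's mean with the overall mean, measured in overall standard deviations. This is a sketch under that assumption, not necessarily the procedure the deck's source uses; the `profile_clusters` helper, the variable names, and the data are all hypothetical.

```python
import numpy as np

def profile_clusters(X, labels, names, top_k=2):
    """Return, per cluster, the top_k variables furthest from the overall mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    overall_std = X.std(axis=0)
    # Avoid division by zero for constant columns.
    overall_std = np.where(overall_std == 0, 1.0, overall_std)

    profiles = {}
    for c in np.unique(labels):
        cluster_mean = X[labels == c].mean(axis=0)
        # How far the cluster mean sits from the overall mean, in std units.
        z = (cluster_mean - overall_mean) / overall_std
        # Indices of the variables with the largest absolute deviation.
        order = np.argsort(-np.abs(z))[:top_k]
        profiles[int(c)] = [
            (names[i], "high" if z[i] > 0 else "low") for i in order
        ]
    return profiles

# Hypothetical variables for four fraudulent applications in two clusters.
names = ["num_recent_applications", "account_age_months", "requested_credit_limit"]
X = np.array([
    [9, 12, 5000],
    [9, 30, 6000],
    [1, 40, 5500],
    [1, 60, 5000],
])
labels = np.array([0, 0, 1, 1])

profiles = profile_clusters(X, labels, names, top_k=2)
for cluster, traits in profiles.items():
    print(cluster, traits)
```

The resulting high/low descriptions are only a starting point; domain experts in the business unit would still decide what each cluster actually represents.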
Who are the clients in the collaborative profiling process?
Steve and experts in the Fraud Department
What is the limitation of manually labeling clusters when the number of clusters is very large?
Manual labeling does not scale: describing every cluster by hand is no longer feasible