Data Mining Flashcards
Data Mining
Data mining is the process of discovering patterns, correlations, trends and useful information from large sets of data using statistical, mathematical, and computational techniques
Applications include efficiency optimization, quality control, predictive maintenance, supply chain management, customer insights, cost reduction …
ML
Supervised learning: “Here is the data set where the right answers are given for each example. Please produce more right answers.”
Classification
Unsupervised learning: “Here is the unlabelled data. Please find peculiarities, similarities or structures (e.g. clusters) in the data yourself.”
Association Rule Learning
Clustering
Reinforcement Learning: Learn to do something by maximising your reward
Classification
Aims to classify objects, events, or observations into predefined labels or classes.
Involves training a model on a dataset with known labels, and then using that model to predict the labels of unseen data
KNN - K nearest neighbours
Aims to predict the correct class for test data by calculating the distance between the test point and all training points, then taking a majority vote among the K nearest neighbours
How to find optimal K value?
There is no predefined statistical method for finding the most favourable value of K
Choose a range of K values and train a KNN model for each
Evaluate the performance of KNN for each K value and choose the one that performs best (see the sketch below)
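A minimal sketch of this procedure, assuming scikit-learn and using its bundled iris dataset purely as stand-in data:

```python
# Try a range of K values and compare cross-validated accuracy for each.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # illustrative dataset, not from the notes

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)  # 5-fold cross-validation
    print(f"K={k}: mean accuracy = {scores.mean():.3f}")
```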
Naïve Bayes
A simple technique for constructing classifiers. These classifiers assign class labels to problem instances based on feature values
calculates the probability of a hypothesis (class label) given the observed data and prior knowledge.
Naïve Bayes treats all word orders the same (a “bag of words” view when applied to text).
Zero-frequency problem: if a feature value never occurs with a class in the training data, its estimated probability is 0, which zeroes out the whole product of probabilities; the usual fix is to add 1 to every count (Laplace smoothing).
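A minimal sketch of a Naïve Bayes text classifier with Laplace smoothing, assuming scikit-learn; the tiny spam/ham corpus is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money", "meeting agenda attached",
        "win free money now", "project meeting notes"]  # invented corpus
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # bag-of-words counts: word order is discarded

clf = MultinomialNB(alpha=1.0)  # alpha=1 adds 1 to every count (Laplace)
clf.fit(X, labels)

# "prize" never appears in a ham document; smoothing keeps its
# estimated probability above zero instead of collapsing the product.
print(clf.predict(vec.transform(["free prize meeting"])))
```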
Main Methods – Measuring Accuracy
There are many other algorithms for classification
Measuring the accuracy of a model:
– Use the test set data (not data that was used to train the model)
– Accuracy is simply the number of data records correctly classified by the model
Performance of Classification
A Confusion Matrix is an N × N matrix used for evaluating the performance of a classification model
N is the total number of classes
Accuracy measures the ability of the classifier to make correct predictions. It’s the ratio of the number of correct predictions to the total number of predictions
Recall measures how many of the actual positive samples were correctly identified by the classifier.
Precision measures how many of the samples predicted as positive are actually positive.
TP: true positive; FP: false positive; TN: true negative; FN: false negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
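A minimal sketch of these formulas in code; the four counts are invented purely for illustration:

```python
# Invented confusion-matrix counts for a binary classifier.
TP, FP, TN, FN = 40, 10, 45, 5

precision = TP / (TP + FP)                  # of predicted positives, how many are right
recall = TP / (TP + FN)                     # of actual positives, how many were found
accuracy = (TP + TN) / (TP + TN + FP + FN)  # overall fraction correct

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.2f}")
```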
Association Rule Learning
It is a rule-based machine learning method for discovering interesting relations between variables in large datasets.
Aims to find patterns from items that are dependent on one another and map the connections between them
Useful in predicting behaviours and relationships between variables in a dataset
Useful for classification and discovering patterns within data.
Lift
Lift > 1 means that if A is purchased, the probability of purchasing B increases (positive association).
Lift = 1 means the occurrence of A and B is independent of each other.
Lift < 1 means A is a substitute for B: purchasing A makes purchasing B less likely.
In data mining and market basket analysis, lift is a measure used to quantify the importance of a rule for predicting the occurrence of an item (or set of items) in a transaction, relative to the occurrence of the item(s) without the rule. It helps to determine the strength of association between items in a dataset.
The lift of an association rule is calculated as the ratio of the observed support of the rule to the expected support of the rule if the items were independent. Mathematically, it can be represented as:
lift(X → Y) = support(X ∪ Y) / (support(X) × support(Y))
where support(X) is the proportion of transactions that contain X, and support(X ∪ Y) is the proportion containing both X and Y.
Interpretation of lift:
- If the lift is greater than 1, it indicates that the presence of X in a transaction increases the likelihood of the presence of Y in the same transaction, suggesting a positive association between X and Y.
- If the lift is less than 1, it suggests that the presence of X in a transaction decreases the likelihood of the presence of Y in the same transaction, indicating a negative association between X and Y.
- If the lift is equal to 1, it means that the occurrence of X and Y is independent, and there is no association between them.
Lift is a useful metric in market basket analysis for identifying meaningful associations between items and can help in making decisions related to product placement, cross-selling, and marketing strategies.
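A minimal sketch of computing lift directly from transactions; the small basket dataset is the same one used in the Apriori example below:

```python
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"milk"}, {"eggs"}
lift = support(X | Y) / (support(X) * support(Y))
print(f"lift(milk -> eggs) = {lift:.2f}")  # 1.11 > 1: positive association
```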
How do we find an optimal rule among many rules?
Association Rule Learning is commonly implemented with three algorithms:
Apriori algorithm
Eclat algorithm
FP-growth algorithm
The Apriori algorithm uses frequent itemsets to generate association rules.
Involves finding sets of items or itemsets. An itemset can be a collection of one or more items that appear together in a transaction or dataset.
An itemset can be either a single item, also known as a 1-itemset, or a set of k items, known as a k-itemset
Let’s consider a simplified example of market basket analysis using the Apriori algorithm. Suppose we have a transaction dataset representing purchases at a grocery store:
Transaction 1: {bread, milk, eggs}
Transaction 2: {bread, butter}
Transaction 3: {milk, eggs}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, eggs}
We want to use the Apriori algorithm to identify frequent item sets and association rules from this dataset.
Step 1: Identify frequent single items:
- Support threshold = 2 (minimum support count)
Frequent single items (with support counts):
- {bread} (4), {milk} (3), {eggs} (3), {butter} (2)
Step 2: Generate candidate item sets of length 2:
- Join each pair of frequent single items
Candidate item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}, {milk, butter}, {eggs, butter}
Step 3: Prune candidate item sets:
- Apriori prunes any candidate containing an infrequent subset; here every single item is frequent, so no candidate is pruned at this stage.
Step 4: Count support for candidate item sets:
- Count support for {bread, milk}: 2
- Count support for {bread, eggs}: 2
- Count support for {bread, butter}: 2
- Count support for {milk, eggs}: 2
- Count support for {milk, butter}: 1
- Count support for {eggs, butter}: 0
Step 5: Filter candidate item sets:
- Keep {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs} because their support counts meet the minimum support threshold; discard {milk, butter} and {eggs, butter}.
Frequent item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}
Step 6: Repeat the process to generate higher-order frequent item sets. The only candidate of length 3 whose 2-item subsets are all frequent is {bread, milk, eggs}, but its support count is 1, so there are no frequent item sets of length 3 and the process stops.
Step 7: Generate association rules from the frequent item sets, for example:
- {bread} => {milk}
- {milk} => {eggs}
- {bread} => {butter}
These association rules indicate that customers who purchase bread are likely to also purchase milk, and similarly for the other combinations.
This example demonstrates how the Apriori algorithm can be used to extract meaningful insights from transaction data by identifying frequent item sets and association rules.
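A minimal sketch of the first two Apriori passes over these transactions in plain Python, with the same support count threshold of 2:

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]
MIN_SUPPORT = 2

# Pass 1: count single items and keep the frequent ones.
item_counts = Counter(item for t in transactions for item in t)
frequent_1 = {item for item, c in item_counts.items() if c >= MIN_SUPPORT}

# Pass 2: candidate pairs are built only from frequent single items
# (the Apriori pruning step), then counted against the transactions.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_1), 2):
        pair_counts[pair] += 1
frequent_2 = {p: c for p, c in pair_counts.items() if c >= MIN_SUPPORT}

print(frequent_1)  # {'bread', 'milk', 'eggs', 'butter'} (order may vary)
print(frequent_2)  # {('bread', 'milk'): 2, ('bread', 'eggs'): 2, ...}
```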
Clustering
Aim:
To group data into classes or “clusters”
Helps in identifying patterns or structures within unlabelled data.
Simplifies complex datasets and makes them easier to analyse and interpret by grouping similar items.
Helps in decision-making by segmenting markets, customers, or products based on similarities.
K-means Clustering
A K-means clustering algorithm tries to group similar items in the form of clusters
The number of groups is represented by K
1. Select K (here we set K = 2).
2. Randomly generate 2 seeds as centroids (since K = 2).
3. Calculate the distance from each point to the 2 centroids, then assign each point to its closest centroid.
4. Calculate the mean of each cluster and use it as the new centroid.
5. Repeat steps 3 and 4 until the cluster assignments no longer change (sketched below).
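A minimal sketch of these steps, assuming scikit-learn's KMeans and a handful of invented 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points, invented for illustration.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# fit() runs the assign/update loop until the clusters stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids
```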
Choosing K
The Elbow Method: calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of K, and choose the K at which the decrease in WSS first starts to level off (the “elbow”)
The Silhouette Method: measure how similar a point is to its own cluster compared to other clusters (both methods are sketched below)
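A minimal sketch of both methods, assuming scikit-learn (its inertia_ attribute is the WSS) and three synthetic blobs of points invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Look for the K where WSS levels off, or the highest silhouette.
    print(f"K={k}: WSS={km.inertia_:.1f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```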
Graph Clustering
A type of unsupervised learning
About partitioning nodes in a graph into clusters based on their common characteristics.
Nodes assigned into the same cluster have more properties in common with each other than nodes in different clusters.
Why:
Essential in identifying inherent structures or patterns within complex datasets, especially for datasets that are best represented as a network or graph, such as social networks, biological networks, or transportation systems.
Good for simplifying complex networks: large networks can be overwhelming and difficult to analyse. Graph clustering simplifies these networks by grouping similar nodes together, making the network easier to understand and analyse.
In social network analysis, graph clustering helps identify communities or groups of individuals with common interests or connections. This can be crucial for marketing, sociology studies, and understanding social dynamics
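A minimal sketch of graph clustering via community detection, assuming networkx and its bundled karate-club social network as illustrative data:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # a small social network of 34 members

# Greedy modularity maximisation groups densely connected nodes together.
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"Cluster {i}: {sorted(community)}")
```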