Data Mining Flashcards
Data Mining
Data mining is the process of discovering patterns, correlations, trends and useful information from large sets of data using statistical, mathematical, and computational techniques
Can be used for efficiency optimization, quality control, predictive maintenance, supply chain management, customer insights, cost reduction, …
ML
Supervised learning: Here is the data set where the right answers are given for each example. Please produce more right answers
Classification
Unsupervised learning: “Here is the unlabelled data. Please find peculiarities, similarities or structures (e.g. clusters) in the data yourself.”
Association Rule Learning
Clustering
Reinforcement Learning: Learn to do something by maximising your reward
Classification
Aims to classify objects, events, or observations into predefined labels or classes.
Involves training a model on a dataset with known labels, and then using that model to predict the labels of unseen data.
KNN - K nearest neighbours
Aims to predict the correct class for the test data by calculating the distance between the test data and all training points
How to find optimal K value?
No predefined statistical methods to find the most favourable value of K
Choose a range of K values and then train the KNN with these K values
Evaluate the performance of KNN with different K values.
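As a rough sketch of this procedure (assuming scikit-learn is available, and using its built-in iris dataset and a 70/30 split purely as stand-ins), the loop below trains a KNN classifier for a range of K values and reports the accuracy of each on held-out data:

```python
# Sketch: evaluating KNN for several K values (assumes scikit-learn is installed;
# the iris dataset and the 70/30 split are arbitrary choices for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train KNN with a range of K values and evaluate each on the held-out data.
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)   # accuracy on data not used for training
    print(f"K={k:2d}  accuracy={acc:.3f}")
```

The K with the best held-out (or cross-validated) accuracy is then chosen.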
Naïve Bayes
A simple technique for constructing classifiers. These classifiers assign class labels to problem instances based on feature values
calculates the probability of a hypothesis (class label) given the observed data and prior knowledge.
Naïve Bayes treats all word orders the same.
Zero-frequency problem: if a feature value never appears with a class in the training data, its estimated probability is zero and it wipes out the whole product of probabilities; this is usually handled with Laplace (add-one) smoothing.
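A minimal sketch of these ideas, assuming scikit-learn and a tiny invented set of labelled sentences: MultinomialNB works on word counts (so word order is ignored), and its alpha parameter applies Laplace (add-one) smoothing to avoid the zero-frequency problem.

```python
# Sketch: Naive Bayes text classification with Laplace smoothing
# (assumes scikit-learn; the tiny corpus below is invented for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches", "project review meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()            # bag of words: word order is discarded
X = vec.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)     # alpha=1.0 -> Laplace (add-one) smoothing
clf.fit(X, labels)

print(clf.predict(vec.transform(["cheap meeting"])))
```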
Main Methods - Measuring Accuracy
There are many other algorithms for classification
Measuring the accuracy of a model – use the test set data (not data that was used to train the model). Accuracy is simply the number of test records correctly classified by the model.
Performance of Classification
A Confusion Matrix is an N x N matrix used for evaluating the performance of a classification model
N is the total number of classes
Accuracy measures the ability of the classifier to make correct predictions. It is the ratio of the number of correct predictions to the total number of predictions
Recall measures how many of the actual positive samples were correctly identified by the classifier.
Precision measures how many of the samples predicted as positive are actually positive.
TP: true positive, FP: false positive, TN: true negative, FN: false negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
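To make the formulas concrete, here is a small sketch in plain Python that computes the three metrics directly from TP, FP, TN and FN (the counts are invented for illustration):

```python
# Sketch: accuracy, precision and recall from confusion-matrix counts
# (the counts below are made up for illustration).
TP, FP, TN, FN = 40, 10, 45, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all predictions
precision = TP / (TP + FP)                   # predicted positives that are truly positive
recall    = TP / (TP + FN)                   # actual positives that were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```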
Association Rule Learning
It is a rule-based machine learning method for discovering interesting relations between variables in large datasets.
Aims to find patterns from items that are dependent on one another and map the connections between them
Useful in predicting behaviours and relationships between variables in a dataset
Useful for classification and discovering patterns within data.
Lift
Lift > 1 means that if A is purchased, the probability that B is also purchased increases (by a factor equal to the lift).
Lift = 1 means the occurrences of A and B are independent of each other.
Lift < 1 means the purchase of A makes the purchase of B less likely, e.g. A is a substitute for B.
In data mining and market basket analysis, lift is a measure used to quantify the importance of a rule for predicting the occurrence of an item (or set of items) in a transaction, relative to the occurrence of the item(s) without the rule. It helps to determine the strength of association between items in a dataset.
The lift of an association rule is calculated as the ratio of the observed support of the rule to the expected support of the rule if the items were independent. Mathematically:
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
where support is measured as the fraction of transactions containing the item set.
Interpretation of lift:
- If the lift is greater than 1, the presence of X in a transaction increases the likelihood that Y is also present in the same transaction, suggesting a positive association between X and Y.
- If the lift is less than 1, the presence of X in a transaction decreases the likelihood that Y is also present in the same transaction, indicating a negative association between X and Y.
- If the lift is equal to 1, the occurrence of X and the occurrence of Y are independent, and there is no association between them.
Lift is a useful metric in market basket analysis for identifying meaningful associations between items and can help in making decisions related to product placement, cross-selling, and marketing strategies.
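The sketch below (plain Python, over a small list of transactions taken from the Apriori example later in these notes) computes support values and the lift of a rule X => Y exactly as in the formula above:

```python
# Sketch: computing the lift of the rule {bread} => {milk}
# over the small example transaction list.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
lift = support(X | Y) / (support(X) * support(Y))
print(f"lift = {lift:.2f}")   # > 1 positive, = 1 independent, < 1 negative association
```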
How do we find an optimal rule among many rules?
Association Rule Learning is commonly implemented with three algorithms:
Apriori algorithm
Eclat algorithm
FP-growth algorithm
The Apriori algorithm uses frequent itemsets to generate association rules.
Involves finding sets of items or itemsets. An itemset can be a collection of one or more items that appear together in a transaction or dataset.
An itemset can be either a single item, also known as a 1-itemset, or a set of k items, known as a k-itemset
Let’s consider a simplified example of market basket analysis using the Apriori algorithm. Suppose we have a transaction dataset representing purchases at a grocery store:
Transaction 1: {bread, milk, eggs}
Transaction 2: {bread, butter}
Transaction 3: {milk, eggs}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, eggs}
We want to use the Apriori algorithm to identify frequent item sets and association rules from this dataset.
Step 1: Identify frequent single items:
- Support threshold = 2 (minimum support count)
Frequent single items:
- {bread} (support count 4), {milk} (3), {eggs} (3), {butter} (2)
Step 2: Generate candidate item sets of length 2 by joining the frequent single items:
Candidate item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}, {milk, butter}, {eggs, butter}
Step 3: Prune candidate item sets that contain an infrequent subset:
- All single items are frequent here, so no candidate is pruned at this stage.
Step 4: Count support for candidate item sets:
- Count support for {bread, milk}: 2
- Count support for {bread, eggs}: 2
- Count support for {bread, butter}: 2
- Count support for {milk, eggs}: 2
- Count support for {milk, butter}: 1
- Count support for {eggs, butter}: 0
Step 5: Filter candidate item sets:
- Keep {bread, milk}, {bread, eggs}, {bread, butter} and {milk, eggs} because their support counts meet the minimum support threshold; discard {milk, butter} and {eggs, butter} because theirs do not.
Frequent item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}
Step 6: Repeat the process to generate higher-order frequent item sets if necessary. The only candidate 3-itemset whose 2-item subsets are all frequent is {bread, milk, eggs}, and its support count is 1, so there are no frequent item sets of length 3 and the process stops.
Step 7: Generate association rules from the frequent item sets, for example:
- {milk} => {bread} (confidence 2/3)
- {eggs} => {bread} (confidence 2/3)
- {butter} => {bread} (confidence 2/2)
These association rules indicate that customers who purchase milk, eggs, or butter are likely to also purchase bread, and similarly for other combinations.
This example demonstrates how the Apriori algorithm can be used to extract meaningful insights from transaction data by identifying frequent item sets and association rules.
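As a companion to the worked example, here is a small self-contained sketch of the Apriori frequent-itemset search in plain Python (not a production implementation; a library such as mlxtend would normally be used). With the same five transactions and a minimum support count of 2 it reproduces the frequent item sets found above.

```python
# Sketch of the Apriori frequent-itemset search for the example transactions.
from itertools import combinations

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]
MIN_SUPPORT = 2                    # minimum support count

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
# Level 1: frequent single items.
frequent = [frozenset({i}) for i in items if count(frozenset({i})) >= MIN_SUPPORT]
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Join step: candidate k-itemsets from unions of frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep only candidates whose (k-1)-subsets are all frequent.
    candidates = [c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    # Count step: keep candidates that meet the minimum support count.
    frequent = [c for c in candidates if count(c) >= MIN_SUPPORT]
    k += 1
```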
Clustering
Aim:
To group data into classes or “clusters”
Helps in identifying patterns or structures within unlabelled data.
Simplifies complex datasets and makes them easier to analyse and interpret by grouping similar items.
Helps in decision-making by segmenting markets, customers, or products based on similarities.
K-means Clustering
A K-means clustering algorithm tries to group similar items in the form of clusters
The number of groups is represented by K
1. Select K (here we set K as 2).
2. Randomly generate 2 seeds as centroids (since K = 2).
3. Calculate the distance between each point and the 2 centroids, and assign each point to its closest centroid.
4. Calculate the mean of each cluster and use these means as the new centroids.
5. Repeat steps 3 and 4 until we reach a clustering solution where the cluster assignments no longer change.
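A minimal from-scratch sketch of these steps, assuming NumPy and using randomly generated 2-D points purely for illustration (a robust implementation would also handle empty clusters):

```python
# Sketch: the K-means loop following the steps above (assumes NumPy;
# the random 2-D points are generated only for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # 100 points in 2-D
K = 2                              # step 1: choose K

centroids = X[rng.choice(len(X), K, replace=False)]   # step 2: random seeds
while True:
    # Step 3: assign each point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its cluster.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 5: stop once the centroids (and hence the assignments) no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)
```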
Choosing K
The Elbow Method: calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of K, and choose the K at which the decrease in WSS first starts to level off (the “elbow” of the curve)
The Silhouette Method: measure how similar a point is to its own cluster compared to other clusters
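Both heuristics are easy to try with scikit-learn; the sketch below (synthetic blob data and a K range of 2 to 8, both arbitrary choices) prints the WSS, which scikit-learn exposes as inertia_, and the silhouette score for each K:

```python
# Sketch: comparing K values with the elbow (WSS) and silhouette methods.
# Assumes scikit-learn; the synthetic blobs and the K range are arbitrary.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss = km.inertia_                        # within-cluster sum of squared errors
    sil = silhouette_score(X, km.labels_)    # mean silhouette over all points
    print(f"K={k}  WSS={wss:8.1f}  silhouette={sil:.3f}")
```

Look for the K where WSS stops dropping sharply, or the K with the highest silhouette score.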
Graph Clustering
A type of unsupervised learning
About partitioning nodes in a graph into clusters based on their common characteristics.
Nodes assigned into the same cluster have more properties in common with each other than nodes in different clusters.
Why:
Essential in identifying inherent structures or patterns within complex datasets, especially for datasets that are best represented as a network or graph, such as social networks, biological networks, or transportation systems.
Good for simplifying complex networks: Large networks can be overwhelming and difficult to analyze. Graph clustering simplifies these networks by grouping similar nodes together, making the network easier to understand and analyze.
In social network analysis, graph clustering helps identify communities or groups of individuals with common interests or connections. This can be crucial for marketing, sociology studies, and understanding social dynamics
Edge Betweenness
The edge betweenness of an edge is the number of shortest paths between pairs of nodes that run through that edge
1. Calculate the betweenness of all existing edges in the network.
2. Remove the edge(s) with the highest betweenness, because the edges with the highest betweenness connect communities.
3. Recalculate the betweenness of all remaining edges.
4. Repeat steps 2 and 3 until a stopping condition is reached; this condition could be a desired number of clusters.
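A sketch of these steps using NetworkX (assuming it is installed, and using its built-in karate-club graph and a target of 2 clusters purely as illustrative choices):

```python
# Sketch: Girvan-Newman-style clustering by repeatedly removing the
# highest-betweenness edge. Assumes NetworkX; the karate-club graph and
# the target of 2 clusters are arbitrary choices for illustration.
import networkx as nx

G = nx.karate_club_graph().copy()
TARGET_CLUSTERS = 2

while nx.number_connected_components(G) < TARGET_CLUSTERS:
    # Steps 1 and 3: (re)calculate the betweenness of all remaining edges.
    betweenness = nx.edge_betweenness_centrality(G)
    # Step 2: remove the edge with the highest betweenness.
    edge = max(betweenness, key=betweenness.get)
    G.remove_edge(*edge)

for i, cluster in enumerate(nx.connected_components(G), start=1):
    print(f"cluster {i}: {sorted(cluster)}")
```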
Edge betweenness is a measure used in network analysis to quantify the importance or centrality of edges within a network. It calculates the number of shortest paths between pairs of nodes in the network that pass through a particular edge. Here’s how it works:
- Shortest Paths: First, calculate the shortest paths between all pairs of nodes in the network using algorithms such as Dijkstra’s algorithm or the Floyd-Warshall algorithm.
- Counting Paths: For each edge e in the network, count the number of shortest paths that traverse e. This involves examining all pairs of nodes and determining whether the shortest paths between them include edge e.
- Edge Betweenness Calculation: The edge betweenness of edge e is the sum, over all pairs of nodes, of the fraction of shortest paths between that pair that pass through e. It can be written as:
  E(e) = Σ_{s ≠ t ∈ V} σ_st(e) / σ_st
  Where:
  - V is the set of nodes in the network.
  - σ_st is the total number of shortest paths between node s and node t.
  - σ_st(e) is the number of shortest paths between node s and node t that pass through edge e.
- Normalization: Optionally, the edge betweenness values can be normalized by dividing each value by the total number of pairs of nodes in the network, |V|(|V|-1)/2, to make the values comparable across networks of different sizes.
- Interpretation: Higher edge betweenness values indicate that the corresponding edge is more central to the network’s connectivity. These edges often serve as bridges between different parts of the network and play a crucial role in information flow, communication, or resource transfer within the network.
Edge betweenness is commonly used in network analysis tasks such as identifying critical edges, detecting community structures, or finding key pathways in transportation or communication networks.