data_mining_flashcards
Front
Back
What is data mining?
The process of discovering patterns and extracting useful knowledge from large datasets.
What are the two main types of learning?
Supervised: Labeled data, prediction tasks (Regression, Classification). Unsupervised: Unlabeled data, pattern discovery (Clustering, Association).
Give examples of data mining applications.
Healthcare, fraud detection, marketing, finance, recommendation systems.
What are common techniques to handle missing data?
Drop rows/columns with missing data. Imputation (mean, median, mode).
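The imputation strategies on this card can be illustrated with a small sketch. The helper name `impute` and the use of `None` to mark a missing value are assumptions made for the example, not part of the deck.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 30, None]          # hypothetical column with two gaps
imputed_mean = impute(ages, "mean")      # gaps filled with 80 / 3
imputed_median = impute(ages, "median")  # gaps filled with 30
```

Dropping is the simpler alternative: keep only the entries where `v is not None`.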
What is normalization vs. standardization?
Normalization: Rescales values to [0, 1]. Standardization: Centers data with mean = 0, std = 1.
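A minimal sketch of both rescalings, assuming a non-constant list of numbers and the population standard deviation (the function names are illustrative):

```python
from statistics import mean, pstdev

def min_max_normalize(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

values = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(values)  # first value is 0.0, last is 1.0
standardized = standardize(values)      # mean 0, standard deviation 1
```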
What is one-hot encoding used for?
To convert categorical features into binary columns.
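As a rough illustration, one-hot encoding can be written in a few lines. The helper `one_hot` and the alphabetical column order are assumptions of this sketch.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot(["red", "green", "red", "blue"])
# categories is ['blue', 'green', 'red']; each row has exactly one 1
```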
What is the purpose of EDA?
To explore, visualize, and summarize data to find patterns and outliers.
What is a boxplot?
A visualization tool to show data spread, quartiles, and detect outliers.
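The quantities a boxplot draws can be computed directly. This sketch assumes the common 1.5 × IQR whisker rule for flagging outliers, which the card does not state, and uses the default quartile method of `statistics.quantiles`.

```python
from statistics import quantiles

def iqr_outliers(xs):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(xs, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical sample with one extreme value
outliers = iqr_outliers(data)
```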
What does a correlation matrix tell us?
The relationships (positive/negative/none) between numerical variables.
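Each cell of a correlation matrix is typically a Pearson coefficient between two columns. A minimal sketch of one such cell, with made-up data and an illustrative function name:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4]
score = [2, 4, 6, 8]    # rises with hours: coefficient close to +1
errors = [8, 6, 4, 2]   # falls with hours: coefficient close to -1
r_pos = pearson(hours, score)
r_neg = pearson(hours, errors)
```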
What is the formula for linear regression?
Y = b0 + b1X + ε.
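The coefficients b0 and b1 are usually estimated by ordinary least squares. A minimal single-feature sketch, assuming noise-free toy data so the fit is exact:

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]              # generated from y = 1 + 2x
b0, b1 = fit_line(xs, ys)
```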
What is logistic regression used for?
To predict binary outcomes using the sigmoid function.
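A sketch of how the sigmoid turns a linear score into a class prediction. The coefficients and the 0.5 threshold below are illustrative assumptions, not fitted values.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    """Estimated probability of the positive class for a single feature x."""
    return sigmoid(b0 + b1 * x)

def predict(x, b0, b1, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

label_high = predict(2.0, b0=-1.0, b1=1.0)  # score 1.0, probability above 0.5
label_low = predict(0.0, b0=-1.0, b1=1.0)   # score -1.0, probability below 0.5
```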
What are common evaluation metrics for classification?
Accuracy, Precision, Recall, F1 Score, ROC-AUC.
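The first four metrics follow from the confusion-matrix counts. A minimal sketch for binary labels coded as 0 and 1 (ROC-AUC needs predicted scores rather than labels, so it is left out here):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # one false negative, one false positive
metrics = classification_metrics(y_true, y_pred)
```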
How does KNN work?
It classifies a point based on the majority class of its k-nearest neighbors.
Name two distance metrics used in KNN.
Euclidean, Manhattan.
What is the impact of increasing k in KNN?
Larger k reduces overfitting but can oversmooth boundaries.
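The KNN cards above can be tied together in a short sketch: a majority vote over the k closest training points, with the distance metric passed in as a parameter. The toy data and function names are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]
label_near_a = knn_predict(train, (1.2, 1.4), k=3)
label_near_b = knn_predict(train, (6.4, 6.4), k=3, distance=manhattan)
```

With k equal to the size of the training set, every query receives the overall majority label, which is the oversmoothing extreme.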
What is entropy in decision trees?
A measure of impurity in a dataset.
What is pruning in decision trees?
Reducing tree size to prevent overfitting.
What criteria are used for splitting in decision trees?
Gini impurity, Entropy, Information Gain.
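These criteria can be computed from class counts alone. A minimal sketch, assuming entropy in bits (log base 2) and information gain defined as parent entropy minus the size-weighted entropy of the children:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]          # evenly mixed node
split = [["yes", "yes"], ["no", "no"]]       # a split that separates the classes
gain = information_gain(parent, split)
```

An evenly mixed two-class node has entropy 1 bit and Gini impurity 0.5; a split into pure children recovers the full 1 bit as gain.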
What is bias in machine learning?
Error from overly simplistic models (underfitting).
What is variance in machine learning?
Error from sensitivity to training data (overfitting).
What is PCA used for?
To reduce dimensionality while retaining as much of the data's variance as possible.
What are eigenvectors and eigenvalues?
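In PCA, the eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the variance explained along each direction.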
How does a random forest work?
It uses bagging (bootstrap samples plus random feature subsets) to train multiple decision trees and combines their outputs by majority vote or averaging.
Why is random forest better than a single decision tree?
Averaging many decorrelated trees lowers variance, which reduces overfitting and usually improves accuracy.
What is the goal of SVM?
To find the optimal hyperplane that maximizes the margin between classes.
What are the assumptions of Naive Bayes?
Features are conditionally independent given the class.
What are the steps in KMeans clustering?
1. Initialize k cluster centers. 2. Assign points to the nearest cluster. 3. Update cluster centers. Repeat steps 2 and 3 until the assignments stop changing.
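The KMeans steps can be sketched in plain Python. To keep the example deterministic, the initial centers are supplied by the caller rather than chosen at random; that choice, the stopping rule, and the names used are assumptions of this sketch.

```python
import math

def kmeans(points, centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]   # keep an empty cluster's center as is
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
final_centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```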
How does GMM differ from KMeans?
GMM makes soft, probabilistic assignments using Gaussian components, while KMeans makes hard assignments to the nearest centroid.
What is the Apriori algorithm used for?
To identify frequent itemsets in market basket analysis.
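What are Support and Confidence?
Support: The fraction of transactions that contain an itemset. Confidence: For a rule X → Y, the fraction of transactions containing X that also contain Y.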
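For association rules, support and confidence can be computed directly from transaction counts. A minimal sketch with a made-up basket dataset; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})       # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```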
What are content-based and collaborative filtering?
Content-based: Uses item features. Collaborative: Uses user interactions.
What is survivorship bias?
Bias caused by focusing only on successful cases.
How is NLP used in data mining?
Text analysis for tasks like sentiment analysis and machine translation.