data_mining_flashcards
Front
Back
What is data mining?
The process of discovering patterns and extracting useful knowledge from large datasets.
What are the two main types of learning?
Supervised: Labeled data, prediction tasks (Regression, Classification). Unsupervised: Unlabeled data, pattern discovery (Clustering, Association).
Give examples of data mining applications.
Healthcare, fraud detection, marketing, finance, recommendation systems.
What are common techniques to handle missing data?
Drop rows or columns that contain missing values, or impute them (mean or median for numerical features, mode for categorical features).
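A minimal sketch of the two approaches on this card, using only the Python standard library. The `ages` column and the `impute` helper are hypothetical, chosen for illustration; `None` stands in for a missing value.

```python
from statistics import mean, median, mode

# Hypothetical toy column; None marks a missing value.
ages = [25, None, 31, 40, None, 31]

# Values that were actually observed.
observed = [v for v in ages if v is not None]

# Option 1: drop the entries with missing data.
dropped = observed

# Option 2: impute, i.e. replace each missing entry with a summary
# statistic computed from the observed values.
def impute(values, fill):
    return [fill if v is None else v for v in values]

mean_imputed = impute(ages, mean(observed))
median_imputed = impute(ages, median(observed))
mode_imputed = impute(ages, mode(observed))
```

Dropping shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.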
What is normalization vs. standardization?
Normalization (min-max scaling): Rescales values to [0, 1]. Standardization (z-score scaling): Rescales data to mean = 0 and std = 1.
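The two rescalings can be sketched as follows. The function names are illustrative, and the population standard deviation (`pstdev`) is an assumption; some libraries use the sample standard deviation instead. A constant column would cause a division by zero in both functions.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(data)
standardized = standardize(data)
```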
What is one-hot encoding used for?
To convert categorical features into binary columns.
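As an illustration, a small hand-rolled encoder might look like this. The `one_hot_encode` helper and the `colors` data are hypothetical; sorting the categories is an arbitrary choice made here to get a stable column order.

```python
def one_hot_encode(values):
    """Return the category order and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
# Each row contains exactly one 1, in the column of that value's category.
```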
What is the purpose of exploratory data analysis (EDA)?
To explore, visualize, and summarize data to find patterns and outliers.
What is a boxplot?
A plot that shows the spread and quartiles of the data and helps detect outliers.
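The numbers behind a boxplot can be computed without plotting. This sketch assumes the common 1.5 × IQR rule for flagging outliers, which the card itself does not specify, and uses hypothetical sample data.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles: the box spans q1..q3 and the line inside it is the median (q2).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]
```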
What does a correlation matrix tell us?
The direction (positive, negative, or none) and strength of the linear relationship between each pair of numerical variables.
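A correlation matrix can be sketched by computing a correlation coefficient for every pair of columns. This example assumes the Pearson coefficient, and the `columns` data is made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical numerical columns.
columns = {
    "hours": [1, 2, 3, 4],
    "score": [2, 4, 6, 8],   # rises with hours: correlation near +1
    "errors": [8, 6, 4, 2],  # falls with hours: correlation near -1
}

names = list(columns)
# matrix[i][j] holds the correlation between names[i] and names[j].
matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
```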
What is the formula for linear regression?
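Y = b0 + b1X + ε, where b0 is the intercept, b1 is the slope, and ε is the error term (simple linear regression with one predictor).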
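One common way to estimate b0 and b1 from data is ordinary least squares. The card does not name an estimation method, so that choice is an assumption here, as are the function name and the toy data.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares estimates of intercept b0 and slope b1."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    # Slope: covariance of x and y divided by the variance of x.
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    # Intercept: the fitted line passes through the point of means.
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical noise-free data generated from Y = 1 + 2X.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = fit_simple_linear_regression(x, y)
```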
What is logistic regression used for?
To predict the probability of a binary outcome by passing a linear combination of the features through the sigmoid function.
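A minimal sketch of the prediction step for a single feature. The coefficients `b0` and `b1` and the 0.5 threshold are hypothetical values picked for illustration; fitting the coefficients is not shown.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    # Not hardened against overflow for very large negative z.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b1):
    """Estimated probability that the outcome is 1 for feature value x."""
    return sigmoid(b0 + b1 * x)

def predict(x, b0, b1, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, not learned from any data.
b0, b1 = -4.0, 2.0
low = predict(1.0, b0, b1)   # linear score is -2, probability below 0.5
high = predict(3.0, b0, b1)  # linear score is +2, probability above 0.5
```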
What are common evaluation metrics for classification?
Accuracy, Precision, Recall, F1 Score, ROC-AUC.
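Four of these metrics follow directly from the confusion-matrix counts, as sketched below with made-up labels. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
# Hypothetical true labels and predicted labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are real
recall = tp / (tp + fn)             # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```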
How does KNN work?
It assigns a point the majority class among its k nearest neighbors in the training data.
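The idea can be sketched in a few lines. This version assumes Euclidean distance and simple majority voting with no special tie handling; the `train` points and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict the label of `query` from its k nearest labeled points."""
    # Sort the training points by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k closest points.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (point, label) pairs in two clusters.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

near_a = knn_classify(train, (2, 2))
near_b = knn_classify(train, (6, 5))
```

Choosing an odd k for two-class problems avoids tied votes.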
Name two distance metrics used in KNN.
Euclidean, Manhattan.
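Both metrics are short to write out. The function names are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

d_euclidean = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7.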