Data Mining 1 Flashcards
Series of tasks, activities, or operations to achieve a goal or an outcome
Process
Combination of hardware and software to facilitate or automate processes
Discrete measurement, fact, or observation representing a real-world process
Data
the mathematical discipline that studies the methods of collecting, analyzing, and interpreting data
Statistics
specific collection of items of interest
Population
subset or subcollection of the population
Sample
two scopes of data
Sample & Population
Logic is built based on business rules
Traditional Rule-Based AI
Logic is built by modelling and training data
Machine Learning
Input data, and sometimes output data, are provided to a machine, which builds its logic from mathematical rules
Machine Learning
Machine learning algorithms in which the training data includes both input and output
Supervised Machine Learning
Inputs are called
feature values
outputs are called
label values
the label predicted by the model is a numeric value
Regression
the model predicts whether a record is an instance of a specific class or category
Binary Classification
the model predicts whether a record is an instance of one of multiple classes or categories
Multiclass Classification
Training data consists only of input without any known output
Unsupervised Machine Learning
the model identifies similarities between observations based on their features and groups them into discrete clusters
Clustering
A model that groups existing customers into clusters based on age, location, gender, social media usage, and purchasing behavior.
Clustering
A model that classifies whether a social media post is positive, negative, or neutral.
Multiclass Classification
A model that predicts whether a customer will cancel their subscription.
Binary Classification
A model that predicts the price of an apartment based on the size, number of rooms, barangay, and construction date.
Regression
Data used to train the model; the algorithm learns patterns from it
Training Data
Used to evaluate the model
Test Data
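A minimal sketch of separating training data from test data, assuming a simple random split. The function name `train_test_split`, the 20% test fraction, and the seed are illustrative choices, not something the cards specify.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records are held out for evaluation.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # hypothetical record IDs
train, test = train_test_split(records, test_fraction=0.2)
print(len(train), len(test))
```

The model is fitted only on `train`; `test` stays unseen until evaluation, so the reported metrics reflect performance on new data.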
Proportion of predictions that the model got right
Accuracy
Proportion of predicted positive cases where the true label is actually positive
Precision
Proportion of actual positive cases that the model identified correctly
Recall
Harmonic mean of Precision and Recall, combining both into a single metric
F1 Score
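The four evaluation metrics can be sketched from the counts of true/false positives and negatives. This is a minimal illustration for binary labels; the function name and the toy labels are made up for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the model gets 6 of 8 predictions right (accuracy 0.75), while precision, recall, and F1 are each 2/3 because there is one false positive and one false negative.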
a lazy learning algorithm that predicts the class of a data point based on the majority class of its k nearest neighbors
k-NN classifier
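A minimal sketch of k-NN classification, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are illustrative. "Lazy" here means there is no training step: all the work happens at prediction time.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort the stored training examples by distance to the query point.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature data with two well-separated classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.5)))
print(knn_predict(points, labels, (6.8, 6.9)))
```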
predicts the probability that a given data point belongs to a particular class, using the logistic function
Logistic Regression
an S-shaped curve that maps any real number to a value between 0 and 1; used in logistic regression
logistic function
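The logistic function is 1 / (1 + e^(-z)). A minimal sketch of how it turns a weighted sum of features into a probability follows; the weights, bias, and function names are hypothetical, and a real model would learn the weights from training data.

```python
import math

def logistic(z):
    """S-shaped logistic (sigmoid) function, mapping any real z into (0, 1)."""
    # Two algebraically equal forms avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

for z in (-4, -2, 0, 2, 4):
    print(z, round(logistic(z), 3))
```

A probability of at least 0.5 (that is, z at or above 0) is commonly mapped to the positive class, though the threshold can be adjusted.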
occurs when one class is significantly more frequent than the other
Class Imbalance
reducing the number of instances in the majority class by removing samples until the classes are balanced.
Undersampling
increasing the number of instances in the minority class by duplicating samples or generating new synthetic examples.
Oversampling
Generates synthetic samples for the minority class by interpolating between existing samples
SMOTE (Synthetic Minority Oversampling Technique)
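A simplified sketch of the interpolation idea behind SMOTE, not a full implementation: pick a minority sample, pick one of its k nearest minority neighbors, and place a new point at a random position on the line segment between them. The function name `smote_sketch` and the toy points are illustrative, and the sketch assumes at least two minority samples.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen sample (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.4, 1.1)]
new_points = smote_sketch(minority, n_synthetic=5)
print(new_points)
```

Because each new point lies between two real minority samples, the synthetic data stays inside the region the minority class already occupies rather than repeating exact duplicates.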
Cons of Oversampling
Oversampling can cause overfitting, especially with random oversampling.
Cons of Undersampling
Important information from the majority class may be lost, potentially causing the model to underfit.
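A minimal sketch of random undersampling and random oversampling for an imbalanced dataset. The function names, the "neg"/"pos" labels, and the 8-to-2 class ratio are made up for the example.

```python
import random
from collections import Counter

def group_by_class(records, labels):
    """Collect records into a dict keyed by class label."""
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(label, []).append(record)
    return groups

def random_undersample(records, labels, seed=0):
    """Drop random samples from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    groups = group_by_class(records, labels)
    target = min(len(items) for items in groups.values())
    return [(r, label) for label, items in groups.items()
            for r in rng.sample(items, target)]

def random_oversample(records, labels, seed=0):
    """Duplicate random samples from smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(records, labels)
    target = max(len(items) for items in groups.values())
    balanced = []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((r, label) for r in items + extra)
    return balanced

records = list(range(10))            # hypothetical record IDs
labels = ["neg"] * 8 + ["pos"] * 2   # imbalanced: 8 majority, 2 minority

under = random_undersample(records, labels)
over = random_oversample(records, labels)
print(Counter(label for _, label in under))
print(Counter(label for _, label in over))
```

Undersampling here discards six of the eight majority records, which shows the information loss noted above; oversampling repeats the same two minority records, which shows why duplicates can encourage overfitting.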