Midterm Flashcards
Classification Accuracy
percentage of data correctly classified
Classification Coverage
percentage of data to which the classification rule applies
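Example (a minimal sketch; the toy data, the rule, and the classifier below are hypothetical and only illustrate the two definitions above):

```python
def accuracy(predicted, actual):
    """Fraction of tuples whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def coverage(rule_applies, data):
    """Fraction of tuples to which the rule's condition applies."""
    covered = sum(1 for row in data if rule_applies(row))
    return covered / len(data)


# Hypothetical toy data: (age, buys_computer)
data = [(25, "yes"), (32, "no"), (41, "yes"), (19, "yes"), (55, "no")]
actual = [label for _, label in data]

# Hypothetical classifier: predict "yes" for anyone under 45
predicted = ["yes" if age < 45 else "no" for age, _ in data]

# Hypothetical rule condition: age < 30
def is_young(row):
    return row[0] < 30

print(accuracy(predicted, actual))  # 0.8 (4 of 5 tuples correct)
print(coverage(is_young, data))     # 0.4 (rule applies to 2 of 5 tuples)
```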
Supervised learning (include example)
training data includes class labels (ex. classification)
Unsupervised learning (include example)
training data is unlabeled (ex. clustering)
Semi-supervised learning
uses both labeled and unlabeled training data
Issues with data mining
Mining methodology, user interaction, efficiency and scalability, diversity of data types, data mining and society
Types of qualitative attributes
Nominal, binary, boolean, ordinal
Types of quantitative/numeric attributes
Interval-scaled, ratio-scaled, discrete, continuous
Nominal attributes
Symbols or names (categorical)
Binary attributes
Only two states, 0 or 1
Boolean attributes
True or false
Ordinal attributes
Values have a meaningful order/rank, but the magnitude of the difference between values is unknown (ex. small, medium, large)
Interval-scaled attributes
measured on a scale of equal-sized units but with no true zero point (ex. temperature in Celsius)
Ratio-scaled attributes
measured on a scale of equal-sized units with a true zero point
Discrete attributes
finite or countably infinite set of values
Continuous attributes
not discrete; real-valued, typically represented as floating-point numbers
Data quality factors
accuracy, completeness, consistency, timeliness, believability, interpretability
Steps of data preprocessing
Cleaning, integration, reduction, transformation/discretization
Data cleaning
Fill in missing values, smooth out noise, identify outliers, correct inconsistencies
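Example (a minimal sketch of one common way to fill in missing values, replacing them with the attribute mean; the income values are hypothetical and `None` is assumed to mark a missing entry):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical attribute with two missing entries
incomes = [30, None, 50, 40, None]
filled = fill_missing_with_mean(incomes)
print(filled)  # [30, 40.0, 50, 40, 40.0]
```

Other fill strategies (a global constant, the median, or the most probable value) follow the same pattern.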
Noise
random error or variance in a measured variable
Noise smoothing techniques
Binning, Regression, Outlier Analysis
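Example (a minimal sketch of binning, assuming equal-frequency bins smoothed by bin means; the price values and the function name are illustrative only):

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries would change only the value that replaces each bin's members.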
Data integration issues
Entity identification problem, attribute correlation (redundancy), tuple duplication, data value conflicts
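Example (a minimal sketch of two of these checks, assuming Pearson's correlation coefficient is used to flag a redundant attribute and exact matching is used to find duplicate tuples; all data and names are hypothetical):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Attribute correlation: the same height stored in two units after integration
height_cm = [150, 160, 170, 180]
height_in = [h / 2.54 for h in height_cm]
r = pearson(height_cm, height_in)
print(round(r, 6))  # 1.0, so one of the two attributes is redundant

# Tuple duplication: drop exact duplicate tuples, keeping first occurrences
rows = [("C01", "Ann"), ("C02", "Bob"), ("C01", "Ann")]
deduped = list(dict.fromkeys(rows))
print(deduped)  # [('C01', 'Ann'), ('C02', 'Bob')]
```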