Data Mining / ML Methods Flashcards
Discuss the general approach to classification
Classification assigns an item to one of a set of known categories based on its features. One common method, k-nearest neighbors (KNN), finds where the item sits in feature space, compares it to the labeled items closest to it, and assigns the group most common among them. Classification is also used for tasks such as object detection, spam filtering, and cancer diagnosis.
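As a minimal sketch of the nearest-neighbor idea on this card, the following classifies a point by majority vote among its k closest labeled points. The function name `knn_classify`, the toy data, and the use of Euclidean distance are assumptions made for illustration.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (features, label) pairs; features is a tuple of numbers.
    """
    # Sort the labeled points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Take the labels of the k closest points and let them vote.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
train = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]

print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.1, 5.0)))
```

The choice of k and of the distance measure both affect the result; an odd k is a common way to avoid tied votes when there are two classes.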
Clustering
The groupings are unknown in advance, and the analyst wants to determine whether objects fall into natural groups. Clustering is a form of unsupervised learning, so the data set is unlabeled.
Bayes Theorem
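Gives the probability of a hypothesis given the observed data (the posterior). It is computed from the probability of observing that data if the hypothesis were true (the likelihood), the prior probability of the hypothesis, and the overall probability of the data: P(H|D) = P(D|H) P(H) / P(D).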
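Bayes' theorem can be written P(H|D) = P(D|H) P(H) / P(D). The sketch below works one example through by hand. The scenario (a screening test) and all three input probabilities are made-up numbers chosen for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(H|D) for a binary hypothesis H and an observation D.

    prior:               P(H)
    likelihood:          P(D | H)
    false_positive_rate: P(D | not H)
    """
    # P(D) by the law of total probability over H and not-H.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(H|D) = P(D|H) * P(H) / P(D).
    return likelihood * prior / evidence

# Assumed numbers: 1% of people have the condition, the test catches 95%
# of true cases, and it wrongly flags 5% of healthy people.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

With these assumed inputs the posterior is about 0.161: even after a positive result the hypothesis is still unlikely, because the prior was so low.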
Naive Bayes
Estimates the conditional probability of an outcome. Naive Bayes is an algorithm that applies Bayes' theorem under the "naive" assumption that the features are independent of one another given the class. A Naive Bayes classifier is an ML model that classifies an object based on its features.
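To make the idea concrete, here is a minimal sketch of a Naive Bayes classifier for categorical features. The function names, the toy spam data, and the use of add-one (Laplace) smoothing are assumptions for illustration; `n_values` is a hypothetical parameter giving the number of distinct values each feature can take.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(label, feature_index)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row, n_values=2):
    """Pick the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Start from the log prior, log P(class).
        score = math.log(count / total)
        for i, value in enumerate(row):
            # "Naive" step: combine per-feature likelihoods as if independent.
            # Add-one smoothing avoids log(0) for values unseen in this class.
            seen = value_counts[(label, i)][value]
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (contains "free", contains "meeting") -> label
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(rows, labels)

print(predict_naive_bayes(model, (1, 0)))
print(predict_naive_bayes(model, (0, 1)))
```

Working in log probabilities keeps the product of many small likelihoods from underflowing.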
PCA (principal component analysis)
An attempt to find out whether the variables themselves group in any meaningful way. It is a dimensionality reduction method for large data sets: it transforms a large set of variables into a smaller set of new variables (the principal components) that still contains most of the information in the original set.
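The transformation described on this card can be sketched with NumPy by taking the eigenvectors of the covariance matrix. This is one common way to compute PCA, not the only one; the function name `pca` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top n_components principal components."""
    # Center each variable so the covariance is measured about the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns of X).
    cov = np.cov(X_centered, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:n_components]]
    # Share of the total variance kept by each retained component.
    explained_ratio = eigenvalues[order[:n_components]] / eigenvalues.sum()
    return X_centered @ components, explained_ratio

# Hypothetical data: three variables that are nearly multiples of one signal.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])

reduced, ratio = pca(X, n_components=1)
print(reduced.shape, ratio)
```

Because the three columns here are strongly correlated, a single component retains nearly all of the variance, which is the situation in which PCA reduces dimensionality with little loss.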
Dimensionality reduction
Reduces the number of variables, and with it the amount of data, while keeping as much information as possible. PCA is one technique for this.
Data reduction
Reducing the volume of data held in storage or in a database. The goal is to optimize storage capacity.
Hierarchical clustering
Algorithm that groups similar objects into clusters and arranges those clusters in a hierarchy, typically by repeatedly merging the two most similar clusters.
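One way to group similar objects into clusters hierarchically is the bottom-up (agglomerative) approach: start with every point as its own cluster and keep merging the closest pair. The sketch below assumes single linkage and Euclidean distance; the function name `agglomerative` and the toy points are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i; j > i, so popping j keeps i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

# Hypothetical toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = agglomerative(points, n_clusters=2)
print(result)
```

Recording the order of the merges, rather than stopping at a fixed count, would give the full hierarchy that is usually drawn as a dendrogram.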
Anomaly detection
Identifies rare items that differ markedly from the rest of the data. Can be used to detect fraud. Can be done in R or Tableau using a local outlier factor (LOF) function.
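The card mentions R or Tableau; as a language-neutral illustration, here is a small pure-Python sketch of the local outlier factor calculation. It is not the R or Tableau function itself. The function name, the toy points, and the choice k=2 are assumptions, and the sketch assumes no duplicate points (duplicates would cause a division by zero).

```python
import math

def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    # For each point: indices of its k nearest neighbors, and its k-distance
    # (the distance to the k-th nearest neighbor).
    neighbors, k_distance = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of i from j: never less than j's k-distance.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: average ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical toy data: four points in a tight square plus one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = local_outlier_factor(points, k=2)
print(scores)
```

Points inside the square score about 1 because they are as dense as their neighbors, while the distant point scores far higher and would be flagged as the rare item.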
Neural networks
Algorithm loosely modeled on the way the human brain operates, built from layers of connected nodes, that recognizes relationships in data sets.
Deep learning
Neural networks with many layers, capable of tasks such as text classification. A recurrent neural network (RNN) is one type of deep learning model; it works best on sequential data.
Decision Trees
Tree-like model of alternative decisions and their consequences. It is a sequence of (typically binary) decisions based on your data that combine to predict an outcome by branching from one decision to the next.
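The branching described on this card can be sketched as a nested structure that is followed from the root to a leaf. The tree below is hand-built rather than learned from data, and its feature names, thresholds, and labels are hypothetical.

```python
def predict(tree, sample):
    """Follow binary decisions from the root until a leaf (a plain label)."""
    node = tree
    while isinstance(node, dict):
        # Each internal node asks one yes/no question about a single feature.
        branch = "yes" if sample[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node

# Hypothetical hand-built tree for a loan decision.
tree = {
    "feature": "income", "threshold": 30000,
    "yes": "deny",                      # income <= 30000
    "no": {                             # income > 30000: ask a second question
        "feature": "debt_ratio", "threshold": 0.4,
        "yes": "approve",               # debt_ratio <= 0.4
        "no": "deny",                   # debt_ratio > 0.4
    },
}

print(predict(tree, {"income": 50000, "debt_ratio": 0.3}))
```

A tree-learning algorithm would choose the features and thresholds automatically from labeled data; only the prediction step is shown here.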
Optimization Analysis
Finding the best value for one or more target variables given certain constraints. It shows what value a variable should take under the given conditions.
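As a small illustration of finding the best value under constraints, the sketch below exhaustively searches whole-number plans for two quantities. The objective, the three constraints, and the function name `best_plan` are all made up for the example; real problems are usually handed to a dedicated solver rather than searched by brute force.

```python
from itertools import product

def best_plan(max_units=14):
    """Search integer (x, y) plans and return (value, x, y) for the best one."""
    best = None
    for x, y in product(range(max_units + 1), repeat=2):
        # Constraints: keep only plans that satisfy every condition.
        feasible = (x + 2 * y <= 14) and (3 * x - y >= 0) and (x - y <= 2)
        if not feasible:
            continue
        value = 3 * x + 5 * y  # target variable to maximize
        if best is None or value > best[0]:
            best = (value, x, y)
    return best

print(best_plan())
```

For these assumed constraints the search returns a value of 38 at x = 6, y = 4, the point where the first and third constraints are both tight.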
Supervised model versus unsupervised
Supervised learning is when an ML algorithm is trained on a labeled data set. Examples are classification and regression.
Unsupervised learning is when an ML algorithm tries to find patterns in unlabeled data. Examples are clustering and anomaly detection. Neural networks can be used in either setting.