Machine Learning Flashcards
Nominal Data (1/4 types of data)
Data whose categories are mutually exclusive but not ordered (e.g. eye color, sex, type of car, zip codes)
Ordinal Data (1/4 types of data)
Categories where order matters but the difference between values does not. E.g. letter grades, movie ratings, pain level, cold-warm-hot of a coffee cup
BNN
Biological Neural Network
ANN
Artificial Neural Network
Typical Neural Network
[input pattern] → Input Layer → Hidden Layers → Output Layer → [Output pattern]
Input pattern is presented to the input layer. Then the output pattern is returned from the output layer. What happens between the input and output layers is a black box.
Sigmoid Activation Function
An S curve from 0 to 1
Hyperbolic Tangent Activation Function
an S curve from -1 to 1
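A minimal sketch of the two activation functions, using only the standard library. The function names and sample inputs are illustrative choices, not taken from any particular framework.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hyperbolic_tangent(x: float) -> float:
    """Hyperbolic tangent: squashes any real input into the open interval (-1, 1)."""
    return math.tanh(x)


# Illustrative inputs: both curves are S-shaped, only the output range differs.
samples = [-5.0, 0.0, 5.0]
sigmoid_values = [sigmoid(x) for x in samples]
tanh_values = [hyperbolic_tangent(x) for x in samples]
```

The sigmoid is centered on 0.5 and the hyperbolic tangent on 0.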
Ways to Normalize Nominal Values
- One-of-n Normalization
- Equilateral Normalization
One-of-n Normalization (aka One-hot encoding )
One way of normalizing Nominal Observations. You have one output neuron for each class.
The other way to normalize Nominal Observations is Equilateral Encoding
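One-of-n encoding can be sketched in a few lines. The class list and the helper name `one_hot` are assumptions made for illustration.

```python
def one_hot(label, classes):
    """Return a vector with one slot per class: 1.0 for the matching class, 0.0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]


# Hypothetical nominal feature with three classes.
classes = ["red", "green", "blue"]
encoded = one_hot("green", classes)
```

Each class gets its own output neuron, so three classes need three values.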
Equilateral Encoding
- How it works
- Neurons needed
A way of normalizing Nominal Observations.
A floating-point vector is created for each class item, at a uniform (equilateral) distance from every other class item. This lets all output neurons play a part in each class item, so an error affects more neurons than in one-of-n encoding (the other way to normalize nominal observations)
Requires one fewer output neuron than One-of-N normalization
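One way to build such codes is to place the n classes on the corners of a regular simplex in n-1 dimensions. The sketch below uses a recursive construction. The function name is an assumption, and the output is left in the range -1 to 1 rather than rescaled to a particular activation range.

```python
import math


def equilateral_encode(n_classes):
    """Return n_classes vectors of length n_classes - 1, all equally far apart."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    codes = [[0.0] * (n_classes - 1) for _ in range(n_classes)]
    # Two classes: the two ends of a line segment.
    codes[0][0] = -1.0
    codes[1][0] = 1.0
    for k in range(2, n_classes):
        # Shrink the existing k points so there is room for one more dimension.
        shrink = math.sqrt(k * k - 1.0) / k
        for i in range(k):
            for j in range(k - 1):
                codes[i][j] *= shrink
        # Push the existing points slightly down the new axis...
        for i in range(k):
            codes[i][k - 1] = -1.0 / k
        # ...and put the new point at the top of that axis (its other entries stay 0).
        codes[k][k - 1] = 1.0
    return codes


codes = equilateral_encode(4)
# Distance between every pair of class codes; these should all match.
distances = [math.dist(a, b) for i, a in enumerate(codes) for b in codes[i + 1:]]
```

Four classes produce four vectors of length three, one fewer output than one-of-n encoding would need.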
Row of a dataset (3)
- An Entity
- An observation
- Instance
Group of input variables
Input Vector
Columns of a dataset (2)
- Features
- Attributes of the Observation
Models vs Algorithms
Model = Algorithm(Data)
Field of machine learning that focuses on making predictions
Predictive Modeling - A target function “f” that best maps input variable “X” to output variable “Y”. There is an irreducible error “e”
Y=f(X) + e
We are trying to learn the shape of “f”. Different machine learning algorithms make different assumptions on the shape of “f”. This is why we must try different ML algorithms
Parametric ML Algorithms
Parametric Functions make assumptions on the shape of “f” in Y=f(X) + e
- Linear ML Algorithms
- Logistic Regression
- Linear Discriminant Analysis
- Perceptron
Advantages: parametric algorithms are simpler, faster, and require less data to train. Disadvantages: they are constrained, have limited complexity, and may fit the true shape of “f” poorly
Non-Parametric ML Algorithms
Do not make assumptions on the shape of the target function.
They are good when you have lots of data and don’t want to worry about choosing all the right features
Examples:
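Decision Trees, k-Nearest Neighbors, Support Vector Machines (with non-linear kernels). Naive Bayes and neural networks with a fixed architecture have a fixed number of parameters, so they are usually classed as parametric.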
(Dis)Advantages of Non-Parametric ML Algorithms
Advantages
- Flexibility - may fit a large number of target functions
- Power - no assumptions
- performance - Higher prediction performance
Disadvantages:
- More data needed
- slower
- overfitting - more likely to overfit
4 common types of Data Modeling problems
- Data Classification
- Regression Analysis
- Clustering
- Time Series
Data Classification
Try and determine the class the data falls into using Supervised Learning. A class is usually a non-numerical data attribute
Regression Analysis
A predictive modeling technique which investigates the relationship between a dependent (target) variable and one or more independent variables. A regression problem is one where the output variable is a real value, such as “dollars” or “weight.”
Clustering
Clustering algorithms take input data and place it into clusters. The programmer usually specifies the number of clusters to be created before training the algorithm. Because there is no expected output, clustering is considered unsupervised training. If the number of clusters changes, the clustering machine learning method will need to be retrained
Temporal Algorithm
Algorithm that accepts input for values that range over time. Algorithms often use a sliding input window and a prediction window.
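The sliding input window and prediction window can be sketched as follows. The function name, window sizes, and series values are illustrative assumptions.

```python
def sliding_windows(series, input_size, prediction_size):
    """Pair each run of input_size values with the prediction_size values that follow it."""
    pairs = []
    last_start = len(series) - input_size - prediction_size
    for start in range(last_start + 1):
        split = start + input_size
        input_window = series[start:split]
        prediction_window = series[split:split + prediction_size]
        pairs.append((input_window, prediction_window))
    return pairs


# Hypothetical time series: each sample uses 3 past values to predict the next 1.
series = [10, 11, 12, 13, 14, 15]
pairs = sliding_windows(series, input_size=3, prediction_size=1)
```

Each pair becomes one training example: the input window is the input vector and the prediction window is the expected output.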
Deterministic Training vs Stochastic Training
Deterministic Training Algorithms always perform the exact same way given the same initial state. No random numbers are used.
Stochastic training uses random numbers to train, so the algorithm trains differently each time
Interval Data (1/4 types of data)
Data where the difference between two values is meaningful but the value of zero is arbitrary. Eg. Temperature (in F or C), year
Ratio Data (1/4 types of data)
It has properties of interval data but a clear concept of zero.
eg. Age, speed, length, width, volume, mass
Supervised Learning - Definition + Types of Problems solved
Your training data has the input and output variables and you are using an algorithm to learn the mapping function f
Y=f(X)
Problems solved: 1) Regression 2) Classification
Ex: Linear Regression ; Random Forest, SVM
Unsupervised Learning - Definition and Types of Problems solved
You have input data X and no corresponding output variables; the goal is to model the underlying structure to learn more about the data. Problems solved: 1. Clustering (grouping of data) 2. Association (rules which describe portions of your data)
Algorithms: k-means for clustering ; Apriori algorithms for association rule learning
Semi-Supervised Learning - Definition and Types of Problems solved
Some data is labeled but most is unlabeled, so a mixture of supervised and unsupervised techniques is used
Types of ML Error (3) - Definition
- Bias Error- Simplifying Assumptions made by algorithm to make it easier to solve
- Variance Error - Sensitivity of the model to changes in training data
- Irreducible Error - Unknown variables influencing the mapping of input to output
Power calculations
Helps determine amount of data required for training given expected accuracy/reliability
Reinforcement Learning
A computer program interacts with a dynamic environment in which it must perform certain tasks, learning through trial and error as it seeks to achieve its goal
Linear and Polynomial regression
Regression models the relationship between numerical variables, iteratively refined using a measure of error in the predictions made by the model. The basic assumption is that the output variable (a numeric value) can be expressed as a combination (weighted sum) of numeric input variables
Neural Networks - 1) Definition 2) Types of Problems
A large number of highly interconnected processing elements work in unison to solve specific problems, usually classification or pattern-matching problems. Each neuron ‘votes’ on the decision outcome, which might trigger other neurons to vote, and the votes are tallied, creating a ranking of the outcomes depending on the support each has received.
Decision Trees - 1) Definition 2) Types of Problems
Tree-like flowcharts that use branching to illustrate every possible outcome of a decision. Most decision trees use binary branching (two options) based on actual values or attributes of the data.
Types of Problems: 1. Classification 2. Regression
Overfitting - 1. Definition 2. Solution
ML model learns both the details and the noise too well at the expense of not generalizing to new data.
If we train too long, the error rate on the training data keeps dropping but the error rate on the test data goes up!
Solution: Resampling methods(k-fold cross validation) and held-back validation (hold data to very end - if you have enough)
Underfitting 1. Definition 2. Solution
Definition: Failing to learn the problem from the training data sufficiently.
Solution: Try different ML algorithms
Advice: You want to be in middle of overfitting and underfitting
Generalization
How well the concepts learned by the model apply to specific examples not seen by the model when it was learning
Goodness of Fit
measures used in statistics to estimate how well the approximation of the function matches the target function
K-fold cross validation
A cross validation technique used to evaluate model on unseen data
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
3a. Take the group as a hold-out (test) data set.
3b. Take the remaining groups as a training data set.
3c. Fit a model on the training set and evaluate it on the test set.
3d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
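The procedure above can be sketched in plain Python. The `fit` and `score` callables, the mean-predicting toy model, and the data are hypothetical stand-ins for illustration; a real project would plug in its own model and metric.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle the dataset
    folds = [indices[i::k] for i in range(k)]   # split into k groups
    for i, test_idx in enumerate(folds):        # each group is the test set once
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Fit on k-1 groups, score on the held-out group, keep only the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
        # The fitted model is discarded here; only its score is retained.
    return sum(scores) / len(scores), scores


# Toy model (assumption): always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mean_absolute_error(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)


xs = list(range(10))
ys = [float(x) for x in xs]
mean_score, scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_absolute_error)
```

The summary here is the mean of the per-fold scores; reporting their spread alongside it is also common.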
Cross Validation
Cross-validation is a RESAMPLING PROCEDURE used to EVALUATE ML models on a limited data sample. It is primarily used in applied machine learning to ESTIMATE the SKILL of a machine learning model on UNSEEN DATA.
Gradient Descent - Definition + Types(2)
An OPTIMIZATION algorithm which can be used with many ML problems. It is used to find the values of parameters (coefficients) of a function (f) that minimize a cost function. Best used when the parameters cannot be calculated analytically (e.g. with linear algebra).
Types: Batch and Stochastic
Gradient Descent Steps
1. Choose random coefficients or set them to zero
2. Compute cost: cost = evaluate(f(coefficient))
3. Find derivative of cost: delta = derivative(cost)
4. Update coefficient: coefficient = coefficient - (learning_rate * delta)
5. Go back to step 2 for a new iteration
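A minimal sketch of these steps for a one-coefficient model y = coefficient * x. The mean squared error cost, the learning rate, the iteration count, and the data are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.01, iterations=1000):
    """Fit y = coefficient * x by repeatedly stepping against the cost gradient."""
    coefficient = 0.0                      # start the coefficient at zero
    history = []
    n = len(xs)
    for _ in range(iterations):
        # Compute the cost (mean squared error) for the current coefficient.
        cost = sum((coefficient * x - y) ** 2 for x, y in zip(xs, ys)) / n
        history.append(cost)
        # Derivative of the cost with respect to the coefficient.
        delta = sum(2 * (coefficient * x - y) * x for x, y in zip(xs, ys)) / n
        # Move the coefficient a small step downhill.
        coefficient = coefficient - learning_rate * delta
    return coefficient, history


# Hypothetical data generated by y = 2x, so the best coefficient is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
coefficient, history = gradient_descent(xs, ys)
```

If the learning rate is too large the updates overshoot and the cost grows instead of shrinking, so it usually needs tuning.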
Batch Gradient Descent
Cost is calculated by looking at entire dataset before updating the coefficients (for each iteration of the algorithm)
Stochastic Gradient Descent
Used when the dataset is too large to process in full for every update.
The derivative of the cost is calculated for one training instance at a time, and the coefficients are updated immediately after each instance
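The difference between the two variants is when the coefficient gets updated. The sketch below contrasts one pass (epoch) of each for a one-coefficient model y = w * x; the squared-error cost, learning rate, and data are illustrative assumptions.

```python
import random


def batch_epoch(w, xs, ys, learning_rate):
    """Batch: average the gradient over the whole dataset, then update once."""
    n = len(xs)
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return w - learning_rate * gradient


def stochastic_epoch(w, xs, ys, learning_rate, rng):
    """Stochastic: visit instances in random order and update after each one."""
    order = list(range(len(xs)))
    rng.shuffle(order)
    for i in order:
        gradient = 2 * (w * xs[i] - ys[i]) * xs[i]
        w = w - learning_rate * gradient
    return w


# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
rng = random.Random(0)

w_batch = 0.0
w_stochastic = 0.0
for _ in range(50):
    w_batch = batch_epoch(w_batch, xs, ys, learning_rate=0.01)
    w_stochastic = stochastic_epoch(w_stochastic, xs, ys, learning_rate=0.01, rng=rng)
```

The stochastic variant makes one update per instance rather than one per pass, which is why it suits datasets that are expensive to scan in full.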
Elements of a Decision
- Data (input, training, feedback)
- prediction
- judgement - determine rewards and penalties for each possible outcome
- action
- Outcome
As ML makes prediction cheap, human prediction will decline in value
Value of Judgement will go up
Define: Feature Scaling / Normalization
The goal of normalization is to transform features to be on a similar scale.
Common Types:
- Scaling to Range - rescale values to a fixed range, usually 0 to 1
- Clipping - cap extreme outliers at a min/max value (i.e. limit values to ±3σ)
- Log Scaling - Compute log of values to compress wide range to a narrow range
- Z-Score - Scaling that represents number of standard deviations away from mean
- BoxCox
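The first four techniques can be sketched with the standard library. The function names and sample values are assumptions; Box-Cox is left out because it needs a fitted power parameter.

```python
import math
import statistics


def scale_to_range(values):
    """Min-max scaling to [0, 1]; assumes the values are not all identical."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def clip(values, lower, upper):
    """Cap every value so it lies between lower and upper."""
    return [min(max(v, lower), upper) for v in values]


def log_scale(values):
    """Natural log of each value; assumes every value is positive."""
    return [math.log(v) for v in values]


def z_score(values):
    """Number of (population) standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical feature with one extreme outlier.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = scale_to_range(raw)
clipped = clip(raw, 0.0, 10.0)
logged = log_scale(raw)
standardized = z_score(raw)
```

With an outlier like 100, min-max scaling squeezes the other values close to 0, which is the situation clipping and log scaling are meant to help with.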
Define: Bucketing / Binning + Types (2)
Transforming numeric features into categorical features, using a set of thresholds, is called bucketing (or binning). It is needed when the numeric value has no linear relationship with the target (e.g. zip code)
Equal Buckets - Buckets are of equal range
Quantile Buckets - Buckets with an equal number of points
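A sketch of both bucketing styles on the same data. The helper names and the sample values are assumptions for illustration.

```python
import bisect


def equal_width_bucket(value, low, high, n_buckets):
    """Bucket index when [low, high] is cut into n_buckets ranges of equal width."""
    width = (high - low) / n_buckets
    index = int((value - low) / width)
    return min(max(index, 0), n_buckets - 1)   # keep edge values in range


def quantile_thresholds(values, n_buckets):
    """Thresholds chosen so each bucket holds roughly the same number of points."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_buckets] for i in range(1, n_buckets)]


def bucket_by_thresholds(value, thresholds):
    """Bucket index = number of thresholds that are <= value."""
    return bisect.bisect_right(thresholds, value)


# Hypothetical skewed feature: most values are small, one is very large.
values = [1, 2, 3, 4, 5, 6, 7, 100]

equal_counts = [0, 0, 0, 0]
for v in values:
    equal_counts[equal_width_bucket(v, low=1, high=100, n_buckets=4)] += 1

thresholds = quantile_thresholds(values, n_buckets=4)
quantile_counts = [0, 0, 0, 0]
for v in values:
    quantile_counts[bucket_by_thresholds(v, thresholds)] += 1
```

On skewed data, equal-range buckets put almost every point in the first bucket, while quantile buckets stay balanced.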
Feature Vocabulary
Numerical index given to items(unique features) in a category
Out of Vocab (OOV)
A catch-all category for rare values of a categorical feature (those with little training data), so that the model does not waste capacity training on each of them separately
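A feature vocabulary with an OOV slot can be sketched as a dictionary lookup. The `<OOV>` token name, its index of 0, the `min_count` cutoff, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(values, min_count):
    """Give each sufficiently frequent value its own index; index 0 is reserved for OOV."""
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    vocabulary = {"<OOV>": 0}
    for value in frequent:
        vocabulary[value] = len(vocabulary)
    return vocabulary


def encode(value, vocabulary):
    """Look up a value's index, falling back to the OOV index for rare or unseen values."""
    return vocabulary.get(value, vocabulary["<OOV>"])


# Hypothetical categorical column: "green" and "teal" appear only once each.
colors = ["red", "blue", "red", "green", "blue", "red", "teal"]
vocabulary = build_vocabulary(colors, min_count=2)
encoded = [encode(c, vocabulary) for c in colors]
```

Values never seen during training, such as a new color at prediction time, land in the same OOV slot as the rare ones.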
Rectangular Data
- Definition
- Another term
A rectangular data object like a spreadsheet or data table
Also called a Data Frame
Logistic Regression 1) Types of Problems 2) Algorithm/Process to estimate coefficient
A LINEAR algorithm for two-class (BINARY) classification problems. It predicts the probability that an instance belongs to the default class, which can be snapped to 0 or 1. Coefficients are estimated using a process called MAXIMUM LIKELIHOOD Estimation
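The prediction step can be sketched as a weighted sum passed through the logistic function. The coefficients below are made-up values for illustration; estimating them by maximum likelihood is not shown, and the 0.5 threshold is an assumption.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Probability of the default class: logistic function of a linear combination."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, coefficients, intercept, threshold=0.5):
    """Snap the probability to 1 (default class) or 0."""
    probability = predict_probability(features, coefficients, intercept)
    return 1 if probability >= threshold else 0


# Hypothetical, already-estimated model with a single input feature.
coefficients = [1.5]
intercept = -1.0
likely = predict_class([2.0], coefficients, intercept)     # z = 2.0
unlikely = predict_class([0.0], coefficients, intercept)   # z = -1.0
```

The model is linear because the decision depends only on the sign of the weighted sum relative to the threshold.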
Linear Discriminant Analysis 1) Types of Problems
A LINEAR algorithm for classifying data in multiple classes.
LDA makes prediction by estimating the probability that a new set of inputs belongs to each class using Bayes Theorem. It uses statistical properties of your data(mean for each class, and variance for dataset) to make predictions.
CART 1) Type of Problems 2) How it’s constructed
Classification and Regression Trees (Decision Trees)
A decision tree is constructed by lining up all values and different split points are tried and tested.
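Trying and testing split points for a single numeric feature might look like the sketch below. Scoring splits with Gini impurity is an assumption here (it is a common choice for classification trees), and the data is made up.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Line up the values, try each midpoint as a threshold, keep the purest split."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        # Weighted average impurity of the two groups produced by this split.
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical feature where class "a" has small values and class "b" large ones.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, cost = best_split(values, labels)
```

A full tree repeats this search over every feature and then recurses into each resulting group.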
Naive Bayes 1) Types of Problems 2) How it is constructed
- Classification Problems only
- Makes a “naive” assumption that the features in the dataset are independent of each other given the class.
- Uses Bayes theorem
Advantages: 1) Little training data needed 2) Training is very fast because there are no coefficient optimization steps
Disadvantages: 1) Expects normal distribution for numerical data
2) Bad estimator of probabilities
3) Assumption of independent uncorrelated features
k-Nearest Neighbors
1) Types of problems
2) how it works
3) Unique factors
1. Classification and Regression
2a. A prediction is made by finding the k instances in the training data closest to the new instance, which means comparing it against every instance in the training dataset
2b. The output is the mode of those neighbors' outputs (classification) or their mean or median (regression)
3. No model is trained
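A minimal sketch of both uses, assuming Euclidean distance (other distance measures are possible) and made-up training data. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def k_nearest_outputs(train_x, train_y, query, k):
    """Outputs of the k training instances closest to the query point."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    return [y for _, y in by_distance[:k]]


def knn_classify(train_x, train_y, query, k):
    """Classification: the most common class (mode) among the k neighbors."""
    neighbors = k_nearest_outputs(train_x, train_y, query, k)
    return Counter(neighbors).most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k):
    """Regression: the mean output of the k neighbors."""
    neighbors = k_nearest_outputs(train_x, train_y, query, k)
    return sum(neighbors) / len(neighbors)


# Hypothetical training data: two points near the origin, two far away.
train_x = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
train_classes = ["small", "small", "large", "large"]
train_values = [1.0, 2.0, 8.0, 9.0]

predicted_class = knn_classify(train_x, train_classes, query=(2.0, 2.0), k=3)
predicted_value = knn_regress(train_x, train_values, query=(2.0, 2.0), k=2)
```

There is no training step: the raw training instances are stored and searched at prediction time.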
k-Nearest Neighbor
- Advantages
- Disadvantages
Advantages
- Lazy Learning / No model needs to be prepared
- Non-parametric
- “Instance-Based Learning” - Raw training instances are used to make predictions
Disadvantages
- Only suitable for low dimensions (few inputs)
- Only suitable for small datasets
- During prediction, the distance on the entire training dataset needs to be computed