Quant 2.6 Flashcards
What is Machine Learning?
- Machine learning is a set of computer-driven approaches aimed at generating structure or predictions from data by finding a pattern and then applying that pattern without human intervention.
What is the objective of ML and how does it work?
- The objective is to extract meaning from large amounts of data.
The way it works: a large amount of data, usually consisting of known examples, is given to the computer, which searches it for patterns or relationships. The algorithm runs repeatedly to find a pattern, establish some meaning for it, and then apply the pattern again as required, all without human intervention!
What are the advantages associated to ML? What are the classes of ML techniques?
- Unlike regression, ML is not based on restrictive assumptions. ML also works easily with data that have a high degree of non-linearity in their relationships, and it can deal with a very large number of variables (high dimensionality).
The three classes are:
Supervised learning
Unsupervised learning
Deep learning
What is Supervised ML?
- Under supervised ML, the objective is to let the machine develop a prediction rule by studying labeled data that we provide, i.e., inputs together with their known outputs. The algorithm analyses the labeled inputs (CC example: date, time of payment, amount, all X variables), compares them to the output Y, and forms a pattern or establishes a relationship.
Once the training data set is exhausted, we can feed in a similar new data set, on which the machine runs the learned prediction rule (created from the training data set), and we can compare how well it performs on the actual data. Basically, it predicts outputs from new inputs (Y variable: fraudulent or not).
Here, the X variables are called features (the independent variables of multiple linear regression) and the dependent variable is called the target (Y).
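A toy sketch of the features/target idea, using a made-up credit-card data set (all values are hypothetical, and the "learned rule" is the simplest one imaginable, not a real ML algorithm):

```python
# Toy supervised-learning sketch (illustrative only; a made-up credit-card
# data set, not a real ML library). Each row of X holds the features
# (payment amount, hour of day); y holds the target (1 = fraud, 0 = not).
X_train = [(5200, 3), (40, 14), (7800, 2), (25, 11)]
y_train = [1, 0, 1, 0]

# "Learn" the simplest possible rule from the labeled data:
# the average payment amount of each class.
fraud_amts = [amt for (amt, hr), y in zip(X_train, y_train) if y == 1]
legit_amts = [amt for (amt, hr), y in zip(X_train, y_train) if y == 0]
fraud_avg = sum(fraud_amts) / len(fraud_amts)   # 6500.0
legit_avg = sum(legit_amts) / len(legit_amts)   # 32.5

def predict(amount):
    # Classify a new payment by whichever class average it is closer to.
    return 1 if abs(amount - fraud_avg) < abs(amount - legit_avg) else 0

print(predict(6000))  # 1 -> flagged as fraudulent
print(predict(30))    # 0 -> looks legitimate
```

The point is only the workflow: features X and target Y go in, a rule comes out, and the rule is then applied to new inputs.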
What are the categories of data sets that can be used in Supervised learning?
- Two broad categories of data:
Regression - the target variable is continuous, so it is modelled as some function of the features.
Classification - observations are sorted into classes based on the features; the target can be binary (yes/no, as in our CC example) or have multiple classes.
What is unsupervised learning?
- It is the process in which the machine does not use labeled data (a key distinction from supervised learning).
Several features are still used, but no target is provided. The algorithm tries to discover structure, i.e., make sense of the data all by itself. It can be used for large, complex data sets that are hard to visualize.
What type of problems are well suited for the unsupervised ML?
- Two types:
Dimension reduction - problems where we need to reduce the number of features, and
Clustering - where we need to sort the observations into groups.
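Clustering can be sketched in a few lines of plain Python. This is a minimal two-group version of the classic k-means idea on one made-up feature (the data, initial centres, and iteration count are all assumptions for illustration):

```python
# Minimal unsupervised clustering sketch: no labels are given, yet the
# algorithm discovers the two groups on its own (toy 1-D k-means, k = 2).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
c1, c2 = points[0], points[3]        # arbitrary initial centre guesses

for _ in range(10):                  # repeat assignment + update steps
    g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
    g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)   # move each centre

print(sorted(g1))  # the small observations cluster together
print(sorted(g2))  # the large observations cluster together
```

No target Y appears anywhere, which is exactly the supervised/unsupervised distinction.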
What is Deep learning?
- refers to highly sophisticated algorithms used for complex tasks like image classification, face recognition, speech recognition and natural language processing.
What is reinforcement learning?
- a situation/process where the computer learns from interacting with itself (or with its environment), improving through trial and error.
What are deep learning and reinforcement learning based upon?
- Neural networks
These algorithms work well when we have non-linearities in our data and when our features interact among themselves. They can be supervised or unsupervised.
When creating a model, how do you divide the data into samples?
- It’s typically divided into 3 non-overlapping samples.
A. Training sample - the one used in supervised learning to let the algorithm study.
B. Validation sample - the sample on which the algorithm can run and tune the prediction rule it created from the training sample.
C. Test sample - the final, held-out sample on which we want the machine to predict outcomes.
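The three-way split can be sketched as follows. The 60/20/20 ratios are an assumption for illustration; the flashcard only requires the three samples to be non-overlapping:

```python
import random

# Sketch of a typical three-way split into non-overlapping samples.
data = list(range(100))          # stand-in for 100 labeled observations
random.seed(42)                  # fixed seed so the split is repeatable
random.shuffle(data)

train      = data[:60]           # A. used to let the algorithm learn the rule
validation = data[60:80]         # B. used to tune the learned rule
test       = data[80:]           # C. held out for the final evaluation

print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before slicing prevents any ordering in the raw data (e.g., by date) from leaking into one sample only.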
What is generalization?
- The degree to which our prediction rule/model retains its explanatory power when predicting on out-of-sample data.
What is Overfitting?
- A situation where our model performs well on the training sample but doesn't generalise well to other samples/data.
How do you explain the type of fit of the Model?
- Use the suit example.
If you go to a tailor for a suit and they have one that fits only one person perfectly and no one else, it's called overfit.
If the suit is so baggy that it can't fit anyone properly, it's called underfit.
If the suit fits anyone of around 5 ft 10 in height, it's called a good fit.
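The suit analogy in code: a hypothetical model that simply memorizes the training sample (the one-person suit) next to one that captures the underlying rule. Both models and the data are made up for illustration:

```python
# An overfit "model": it reproduces the training sample perfectly but has
# learned no pattern, so it fails on any observation it has never seen.
train_sample = {1: 10, 2: 20, 3: 30}   # toy data generated by y = 10x

def overfit_model(x):
    return train_sample.get(x)          # pure memorization (lookup table)

def good_fit_model(x):
    return 10 * x                       # captures the underlying rule

print(overfit_model(2))    # 20   -> perfect in-sample
print(overfit_model(4))    # None -> useless out of sample
print(good_fit_model(4))   # 40   -> generalises to new data
```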
What is the complexity of the model based upon?
- No. of features, terms or branches in the model & whether the model is linear or non-linear
The higher the complexity of the model, the higher is the risk of it being Overfit.
How are out-of-sample errors categorised?
- There are three types into which out-of-sample errors can be categorised.
A. Base error - present in the test sample due to the randomness of the data; it is irreducible.
B. Bias error - the degree to which a model fits the training data; high bias means the model fits the training data poorly (underfitting).
C. Variance error - how much the model's results change in response to new data from the validation and test samples; high variance means the model is tracking noise in the training data (overfitting).
What are learning curves and what is a robust model?
- Learning curves plot the accuracy rate against the training sample size. (It's a graph or plot which shows the type of error present in our model.)
The desired level of accuracy is 1 - base error (the base error is due to the randomness of the data, and there's nothing we can do about it).
A robust model is one whose out-of-sample accuracy increases towards the desired level of accuracy as the training sample size increases.
What are some methods to reduce overfitting of the data in Supervised Machine learning?
- By overfitting, we mean the model doesn’t perform well out of sample.
There are two methods:
A. Reduce complexity - as discussed earlier, the more complex a model, the more it overfits.
Thus reducing complexity directly reduces the overfitting problem (the simplest solution tends to be the correct one).
B. Cross-validation - Based on principle of avoiding sample bias.
K-fold cross validation technique ->
The data set is broken down into two sections: the training-plus-validation (t+v) section and the out-of-sample (test) section.
The t+v data is randomly shuffled and divided into k equal sub-samples; k-1 of them are used for training and the kth is used for validation.
The process is repeated k times, with each sub-sample serving once as the validation sample; this lets the algorithm learn all the variations and reduces sampling bias.
Which in turn increases t+v accuracy rates, and thus makes the model a better fit for OOS data.
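The k-fold procedure above can be sketched in a few lines. Here k = 5 and a 20-observation t+v data set are assumptions for illustration (a real run would shuffle the data first and fit a model in each round):

```python
# Minimal k-fold cross-validation split (k assumed to be 5).
# Each observation serves in the validation sample exactly once
# and in the training sample in the other k-1 rounds.
k = 5
data = list(range(20))                    # stand-in for the t+v data set

folds = [data[i::k] for i in range(k)]    # k equal sub-samples

for i in range(k):
    validation = folds[i]                 # the kth sub-sample this round
    training = [x for j, fold in enumerate(folds) if j != i for x in fold]
    # fit the model on `training` and score it on `validation` here
    assert len(training) == 16 and len(validation) == 4
```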
What are the different types of algorithms under Supervised Learning?
- There are 5:
A. Penalised Regression
B. Support Vector Machine
C. K-Nearest Neighbor
D. Classification & Regression Tree
E. Ensemble learning and Random Forest
Explain Penalised Regression.
- It is a computationally efficient technique used in prediction problems where the target variable is continuous.
Regression coefficients are chosen to minimise sum of squared residuals plus a penalty term that increases with the number of included variables.
Classic example - LASSO, the least absolute shrinkage and selection operator (just remember that LASSO is a penalised regression algorithm; its penalty is lambda times the sum of the absolute values of the regression coefficients).
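The penalised objective can be written as a toy function. Everything here (the `lasso_loss` helper, the three data points, the penalty weight `lam`) is hypothetical, just to show how the penalty term is added to the sum of squared residuals:

```python
# Sketch of a LASSO-style objective for a line y_hat = b0 + b1 * x:
# loss = sum of squared residuals + lam * |b1|
# where lam is the penalty weight chosen by the analyst.
def lasso_loss(b0, b1, xs, ys, lam):
    ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    penalty = lam * abs(b1)        # each included coefficient adds to this
    return ssr + penalty

xs, ys = [1, 2, 3], [2.1, 3.9, 6.2]

# The same fitted slope scores worse once the penalty is switched on,
# which is what pushes unimportant coefficients toward zero.
print(lasso_loss(0.0, 2.0, xs, ys, lam=0.0))    # plain SSR
print(lasso_loss(0.0, 2.0, xs, ys, lam=10.0))   # SSR + penalty
```

With many features, coefficients whose contribution to fit is smaller than their penalty get shrunk to exactly zero, so LASSO also acts as a variable-selection tool.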