Week 7 + 8: Machine Learning Flashcards
What is machine learning?
Learning from data without being explicitly programmed. It is used to discover hidden patterns and trends, and it enables data-driven decisions.
4 categories of machine learning
- Classification
- Regression
- Clustering
- Association analysis
Classification is used to predict a…
category
Regression is used to predict a…
numeric value
Cluster analysis is used to
organise similar items into groups, e.g. customers
Association analysis is used to…
capture associations between items or events
Name 2 supervised machine learning techniques
- Classification
- Regression
Name 2 unsupervised machine learning techniques
- Clustering
- Association analysis
What is supervised machine learning
Where you have input variables and an output variable, and you use an algorithm to learn the mapping function from the input to the output
4 examples of supervised machine learning algorithms
- KNN
- Decision tree
- Linear Regression
- SVM (Support vector machines)
What is unsupervised machine learning? It is where you only have…
input data and not corresponding output variables
What is the goal of unsupervised machine learning?
To model the underlying structure or distribution in the data
2 examples of unsupervised machine learning algorithms
- k-means clustering
- apriori for association analysis
kNN is used to
classify a sample based on its neighbors
What is k in kNN? The value of k determines…
the number of nearest neighbors to consider
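A minimal sketch of the kNN idea, assuming Euclidean distance and a simple majority vote. The function name `knn_classify` and the toy data are hypothetical and only for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training samples. `train` is a list of (features, label) pairs."""
    # Distances are computed to every training sample at prediction time,
    # which is why kNN has no separate training phase.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]

prediction = knn_classify(train, (2.0, 2.0), k=3)
```

Here the three nearest neighbours of (2.0, 2.0) all carry the label "A", so that is the predicted category.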
kNN 4 distance metrics
- Euclidean Distance
- City Block Distance
- Chi square distance
- Cosine distance
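The four metrics above can be sketched in plain Python as follows. The chi-square form shown is one common definition (others omit the 0.5 factor) and assumes non-negative features such as histogram counts; the function names are hypothetical.

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    # Sum of absolute differences (also called Manhattan distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def chi_square(a, b):
    # Assumed form: 0.5 * sum((x - y)^2 / (x + y)).
    # Terms where x + y == 0 are skipped to avoid division by zero.
    return 0.5 * sum((x - y) ** 2 / (x + y)
                     for x, y in zip(a, b) if x + y != 0)

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    # Undefined for zero vectors, which this sketch does not handle.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Note that cosine distance ignores vector length: (1, 2) and (2, 4) point the same way, so their cosine distance is approximately 0 even though their Euclidean distance is not.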
2 pros of kNN
- No separate training phase
- Can generate complex decision boundaries
2 cons of kNN
- Can be susceptible to noise
- Can be slow, since distance is recalculated each time
What is a decision tree? It is a…
hierarchical structure with nodes and directed edges
Name 3 parts of a decision tree
- Root node - node at the top
- Internal nodes - in between
- Leaf nodes - nodes at the bottom
A decision tree classification decision is made by
- traversing the decision tree from the root node
- the answer to each test condition determines which branch to follow
- when a leaf node is reached, the category at the leaf node is the classification
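The traversal described above can be sketched like this. The tree, its feature names, and the labels are hypothetical; internal nodes are assumed to be dicts holding a test condition, and leaves are plain label strings.

```python
# Hypothetical tree: internal nodes test one feature, leaves are labels.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
        "rainy": "stay in",
    },
}

def classify(node, sample):
    # Start at the root and follow one branch per test condition
    # until a leaf (a plain label) is reached.
    while isinstance(node, dict):
        answer = sample[node["feature"]]
        node = node["branches"][answer]
    return node

result = classify(tree, {"outlook": "sunny", "windy": "no"})
```

For the sample shown, the root test sends it down the "sunny" branch, the second test down the "no" branch, and the leaf reached there gives the category.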
Decision tree: what is the depth of a node?
number of edges from the root node to that node
Decision tree: what is the depth of a decision tree?
number of edges in the longest path from the root node to a leaf node
Decision tree: what is the size of a decision tree?
the number of nodes in the tree
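A small sketch of the depth and size definitions, using a hypothetical `Node` class and a hand-built five-node tree.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def tree_depth(node):
    # Number of edges on the longest path from this node down to a leaf.
    # A node with no children is a leaf and contributes 0 edges.
    if not node.children:
        return 0
    return 1 + max(tree_depth(child) for child in node.children)

def tree_size(node):
    # Count this node plus every node below it.
    return 1 + sum(tree_size(child) for child in node.children)

# Root with one internal node (two leaves) and one direct leaf.
root = Node("root", [
    Node("internal", [Node("leaf 1"), Node("leaf 2")]),
    Node("leaf 3"),
])

depth = tree_depth(root)
size = tree_size(root)
```

In this tree the longest root-to-leaf path has 2 edges and there are 5 nodes in total.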
When to stop splitting a node?
- All samples in the node have the same class label
- Max tree depth is reached
- the change in impurity from a further split falls below a threshold
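The stopping conditions above can be sketched as a single check. Gini impurity is assumed here as the impurity measure (the cards do not name one), and `max_depth` and `min_decrease` are hypothetical thresholds.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0.0 when every sample has the same class label.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    # Parent impurity minus the size-weighted impurity of the two children.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def should_stop(parent, left, right, depth, max_depth=3, min_decrease=0.01):
    if len(set(parent)) <= 1:   # all samples share one class label
        return True
    if depth >= max_depth:      # maximum tree depth reached
        return True
    # The candidate split barely changes impurity.
    return impurity_decrease(parent, left, right) < min_decrease
```

A split of `["a", "a", "b", "b"]` into `["a", "a"]` and `["b", "b"]` removes all impurity, so splitting continues; a split into `["a", "b"]` and `["a", "b"]` changes nothing, so the node becomes a leaf.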
2 pros of decision tree
- resulting tree is easy to interpret
- induction is computationally inexpensive
2 cons of decision tree
- greedy approach does not guarantee best solution
- rectilinear decision boundaries
What is linear regression?
a statistical method that allows us to summarise the relationship between two continuous variables
In linear regression what 3 names can the X variable be called?
- predictor
- explanatory
- independent variable
In linear regression what 3 names can the Y variable be called?
- response
- outcome
- dependent variable
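As an illustration, a simple linear regression with one predictor X and one response Y can be fitted by ordinary least squares. This is a minimal sketch; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]   # predictor / explanatory / independent
ys = [3.0, 5.0, 7.0, 9.0]   # response / outcome / dependent
slope, intercept = fit_line(xs, ys)
```

Because the data sit exactly on a line, the fit recovers a slope of 2 and an intercept of 1; with noisy data it would return the line that minimises the squared vertical errors.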
In one dimension a hyperplane is called a
point
In two dimensions a hyperplane is called a
line
In three dimensions a hyperplane is called a
plane
In 4 or more dimensions a hyperplane is called a
hyperplane
The goal of a SVM is to find the…
optimal separating hyperplane which maximizes the margin of the training data
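To make "maximizes the margin" concrete, the sketch below measures how far the closest training sample lies from a given hyperplane w·x + b = 0. It does not train an SVM; it only compares candidate hyperplanes. The function names and data are hypothetical, and labels are assumed to be +1 or -1.

```python
import math

def signed_distance(w, b, x):
    # Signed distance from point x to the hyperplane w.x + b = 0.
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, samples):
    """Distance from the hyperplane to the closest sample, or None if
    the hyperplane fails to separate the two classes."""
    distances = [label * signed_distance(w, b, x) for x, label in samples]
    closest = min(distances)
    return closest if closest > 0 else None

# Hypothetical 2-D data: class -1 on the left, class +1 on the right.
samples = [((1.0, 1.0), -1), ((2.0, 1.0), -1),
           ((4.0, 1.0), +1), ((5.0, 1.0), +1)]

centred = margin((1.0, 0.0), -3.0, samples)    # vertical line x = 3
off_centre = margin((1.0, 0.0), -2.5, samples)  # vertical line x = 2.5
```

Both lines separate the classes, but the line at x = 3 keeps the closest sample further away, so it is the better of the two by the SVM criterion. Some texts define the margin as the full gap between the classes rather than the distance to the closest sample.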
3 pros of SVM
- works well with a clear margin of separation
- effective in high dimensional spaces
- effective in cases where number of dimensions is greater than the number of samples
3 cons of SVM
- doesn’t work well with large data sets
- doesn’t perform very well with noisy data
- doesn’t directly provide probability estimates
feed forward NN indicates there are
no loops in the network
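A minimal sketch of a feed-forward pass, assuming sigmoid activations and made-up weights. The point is only that each layer's output is passed forward to the next layer and never back.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each row of `weights` feeds one neuron in this layer.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    # Data flows strictly from one layer to the next:
    # no output is ever fed back into an earlier layer.
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
network = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 1.0], network)
```

A recurrent (feedback) network would differ by feeding some activations back in as inputs on the next step.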
feedback NN
is also known as a recurrent neural network
4 pros of NN
- can be trained directly on data with thousands of input variables
- once trained predictions are fast
- good for complex problems (image recognition)
- can outperform other models when given high quality labelled data
4 cons of NN
- black box
- training is computationally expensive
- suffers from interference, where new data causes the network to forget old data
- often abused where simpler solutions such as linear regression would be best
What distance calculation does K-means clustering use?
Euclidean distance
What does the K in K-Means represent?
The number of clusters to divide the data into
What type of data does K-means clustering work with?
Continuous
What does K-means clustering try to improve?
The intra-group similarity (items within a cluster are as alike as possible), while keeping the groups as far as possible from each other
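A minimal sketch of k-means on continuous data, using Euclidean distance. For reproducibility it assumes the first k points are the initial centroids; practical implementations usually pick starting centroids randomly or with a smarter seeding scheme. The function name and data are hypothetical.

```python
import math

def k_means(points, k, iterations=10):
    # Assumption: use the first k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two tight groups far apart.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
```

On this data the algorithm settles after two passes, with three points in each cluster and each centroid sitting at the mean of its group.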