Machine Learning Algorithms Flashcards

1
Q

Logistic regression

A
  • Supervised, binary Y/N
  • Credit risk, medical conditions, whether a person will perform an action
  • A sigmoid function is fitted to the data (an ‘S’-shaped curve between the Y/N classes); see the sketch below
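A minimal sketch of fitting a logistic regression with scikit-learn; the synthetic dataset is only a stand-in for something like credit-risk features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary-label data standing in for e.g. credit-risk features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression().fit(X, y)
print(model.predict(X[:5]))        # hard Y/N class predictions
print(model.predict_proba(X[:5]))  # sigmoid outputs: probability of each class
```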
2
Q

Linear regression

A
  • Supervised, numeric response variable
  • Economic/financial forecasts, marketing effectiveness, risk valuation
  • Line of best fit, found by minimising the sum of squared residuals (least squares)
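A minimal least-squares fit with scikit-learn; the synthetic data is only a placeholder for something like an economic forecast:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# toy numeric-response data standing in for e.g. an economic forecast
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

model = LinearRegression().fit(X, y)    # ordinary least-squares fit
print(model.coef_, model.intercept_)    # slope(s) and intercept of the line of best fit
print(model.predict(X[:5]))             # predicted numeric responses
```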
3
Q

Support vector machine

A
  • Supervised, multi-classification
  • Customer classification (low, medium, high), genomic identification
  • Separates classes by choosing certain “support” data points (support vectors) to define the margins, then drawing a hyperplane between the two margins
  • The non-linear version uses a distance function called a “kernel” to map the learning task into a higher-dimensional space, then applies a linear SVM classifier in that space
  • SVMs are not memory efficient because they have to store the support vectors, which can grow in number with the training data
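A minimal sketch of a non-linear (kernel) SVM with scikit-learn; the three synthetic classes simply stand in for low/medium/high customer segments:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# three-class toy data standing in for low/medium/high customer segments
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

clf = SVC(kernel="rbf").fit(X, y)    # non-linear SVM via the RBF kernel
print(clf.support_vectors_.shape)    # stored support vectors (the memory cost noted above)
print(clf.predict(X[:5]))
```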
4
Q

Decision trees

A
  • Supervised, binary/multi-class/numeric response
  • Customer analysis, medical conditions
  • Start at the ‘root node’; each internal node takes an input, tests it, and passes the result on, and a leaf node gives the final decision
  • Decisions can be binary, numeric (i.e. boundary), or multi-class
  • Adv: less need for feature transformations prior to running the model
  • Disadv: very susceptible to overfitting, so the tree must be “pruned”
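A minimal decision-tree fit with scikit-learn; `max_depth` and `ccp_alpha` are shown only as examples of the pruning/regularisation mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# max_depth limits how far the tree can grow; ccp_alpha applies
# cost-complexity pruning - both help control overfitting
tree = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01, random_state=0).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
print(tree.predict(X[:5]))
```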
5
Q

Decision trees: deciding on structure

A
  • How do we decide on the root node? Choose the feature that correlates most with the label
  • We can sometimes get to the leaf nodes without the use of all features (faster training)
  • Nodes are split based on the feature that has the largest information gain (IG) between parent node and its split nodes
  • One metric to quantify IG is to compare entropy before and after splitting
  • Training (i.e. building the tree) works by choosing the splits that maximise IG (i.e. the impurity of the split sets is lower); see the sketch below
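A minimal sketch of the entropy-based information-gain calculation described above, in plain NumPy with a hand-made perfect split as the example:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its split sets."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]           # a perfectly pure split
print(information_gain(parent, left, right))   # 1.0 -> maximal information gain
```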
6
Q

Random forests

A
  • Supervised, binary/multi-class/numeric response
  • A collection (ensemble) of decision trees; single decision trees can be inaccurate, whereas random forests are much more accurate and reduce the overfitting seen in decision trees
  • Randomly select a subset of features from the data, find the feature with the highest correlation to the label and use it as the root node; then repeat, excluding the feature previously used as the root node
  • This is performed numerous times, so the individual trees may produce different predictions
  • We take a ‘survey’/vote of the results, and the final result is the summary of those votes (e.g. mean, mode)
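A minimal random-forest fit with scikit-learn; `max_features="sqrt"` is one way to get the random feature subsetting described above, and the class prediction is a vote across the trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# each of the 100 trees is built on a random sample of rows and a random
# subset of features; predictions are a vote across all trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:5]))
```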
7
Q

k-means clustering

A
  • Unsupervised, multi-classification
  • Use cases: data exploration, customer categorisation
  • “k” groups of classes
  • Start with ‘k’ random points, which become the centroids; each data point is assigned to its nearest centroid by comparing distances
  • Each centroid then moves to the centre (mean) of the data points assigned to it
  • Continue to iterate until we reach ‘equilibrium’, where the total variation between each centroid and its corresponding data points is at its lowest
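A minimal k-means run with scikit-learn on synthetic blob data; the fitted `cluster_centers_` are the centroids after the iterations settle:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroid positions after convergence
print(km.labels_[:10])       # cluster assignment for the first 10 points
```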
8
Q

k-means clustering: finding the optimum ‘k’

A

Plot ‘reduction in variation’ against ‘k’ and find the “elbow point”: the value of k beyond which the variation stops changing considerably (see the sketch below).
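A minimal sketch of the elbow plot, assuming scikit-learn's `inertia_` (total within-cluster variation) as the variation measure:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# total within-cluster variation (inertia) for a range of candidate k values
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("total within-cluster variation")
plt.show()   # the 'elbow' is where the curve stops dropping sharply
```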

9
Q

k-nearest neighbour

A
  • Supervised, multi-classification
  • Uses: recommendation engines, similar articles, objects
  • ‘k’ is the number of neighbours taken into account when classifying a new point (the neighbours vote on its class)
  • ‘k’ is often decided by the business case. Should be:
    1. Large enough to reduce the influence of outliers
    2. Small enough that classes with a small sample size don’t lose influence
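A minimal k-NN classifier with scikit-learn; k = 5 here is an arbitrary example value, chosen in practice by the business-case trade-off above:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

# the 5 nearest neighbours vote on the class of each new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:5]))
```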
10
Q

Latent Dirichlet allocation (LDA)

A
  • Unsupervised, multi-classification
  • Uses: topic discovery, sentiment analysis, automated document tagging
    1. Perform standard text preprocessing (e.g. stopword removal, stemming, tokenisation), then choose ‘k’, the number of topics we want LDA to classify the data into
    2. Count words by topic, then by document
    3. Take the product of the word-topic and topic-document counts for each word in each document
    4. Reallocate each word to the topic with the highest value
    5. Repeat for all words
    6. Use the resulting structure to analyse and classify a new document
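A minimal sketch of topic discovery with scikit-learn's LatentDirichletAllocation; stopword removal via the vectoriser stands in for the fuller preprocessing in step 1, and the four documents are made-up examples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stocks fell as markets reacted to interest rates",
    "the bank raised interest rates again",
]

vec = CountVectorizer(stop_words="english")   # basic preprocessing
X = vec.fit_transform(docs)                   # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # k = 2 topics
doc_topics = lda.fit_transform(X)   # per-document topic proportions
print(doc_topics)
print(lda.components_.shape)        # (topics, words) word-topic weights
```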
11
Q

Factorisation Machines Algorithm (FMA)

A
  • General-purpose supervised learning algorithm that can be used for both classification and regression
  • Extension of the linear model that is designed to capture interactions between features within high dimensional sparse data sets
  • Good choice for tasks dealing with high-dimensional sparse data sets, such as click prediction and item recommendation
  • E.g. a click-prediction system capturing patterns observed when ads from a certain category are placed on pages from a certain page category (see the sketch below)
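A minimal NumPy sketch of the factorisation-machine prediction itself (not any particular library implementation): a global bias, a linear term, and pairwise feature interactions modelled through latent factor vectors. All weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_factors = 4, 2
w0 = 0.1                                      # global bias
w = rng.normal(size=n_features)               # linear weights
V = rng.normal(size=(n_features, n_factors))  # one latent factor vector per feature

def fm_predict(x):
    """FM score: bias + linear term + pairwise interactions <v_i, v_j> x_i x_j."""
    linear = w0 + w @ x
    # pairwise term via the standard O(k*n) identity
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + pairwise

x = np.array([1.0, 0.0, 1.0, 0.0])   # sparse, one-hot style input row
print(fm_predict(x))
```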
12
Q

Entropy

A

A relative measure of disorder in the data source

  • In classification we are trying to reduce the entropy
  • Disorder is present when the split between two or more distinct groups is not pure (e.g. if all 100 observations are 1, then disorder = 0)
  • For a group of 100 observations where 50% are 0 and 50% are 1, the entropy is at its maximum (lots of disorder in the data - a 50% chance of being classified either way)
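The two cases in this card, computed directly; a small NumPy sketch of Shannon entropy in bits:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits): -sum(p * log2(p)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy(np.ones(100)))                      # all 100 observations are 1 -> zero entropy (may print as -0.0)
print(entropy(np.r_[np.zeros(50), np.ones(50)]))  # 50/50 split -> 1.0 (maximum disorder)
```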