Week 1: Data and KNN Flashcards by Toni Toyobo

What is the machine learning approach?

Programming an algorithm to learn automatically from data or from experience, for example to uncover patterns in data or to build autonomous agents

What should be emphasized in machine learning?

Predictive performance
Scalability
Autonomy

Why might you want to use a learning algorithm?

Hard to code solution by hand (vision, speech)
System needs to adapt to a changing environment (spam detection)
Want the system to perform better than human programmers
Privacy/fairness (ranking search results)

How does machine learning perform compared to humans?

It may perform better or worse than humans

Define artificial intelligence

A subfield of CS that refers to computer programs that can solve problems humans are good at
E.g., vision, natural language

Define machine learning

A subfield of AI focused on learning (tuning parameters) from data

Define neural networks

A parametric model used in ML, loosely based on biological neurons

What is deep learning?

Neural networks with multiple layers

What is data science?

An emerging field that applies ML techniques to domain-specific problems

What are some machine learning domains?

Computer vision
Speech recognition
Natural Language Processing
Recommender systems
Games

Types of machine learning

Supervised learning
Semi-supervised learning
Reinforcement learning
Unsupervised learning

What is supervised learning?

The learner is given labeled examples of the correct behavior
Predict unknown values of the data using other known data
Classification (is this A or B?)
Anomaly detection (is this weird?)
Regression (how much/how many?)

What is semi-supervised learning?

Utilizes both labeled and unlabeled data

What is reinforcement learning?

A learning system that interacts with the world and learns to maximize a scalar reward signal

What is unsupervised learning?

No labeled examples; instead, look for interesting patterns in the data
Find human-interpretable and previously unknown patterns that describe the unlabeled data
Clustering (how is the data organized?)
Association rule mining (are these related?)

Why is machine learning so powerful nowadays?

Abundance of data
Computing power

What is the machine learning problem?

1. Decide whether ML should be used on this problem
2. Gather and organize data (pre-processing, cleaning, visualizing)
3. Establish a baseline
4. Choose a model
5. Optimization
6. Hyperparameter search
7. Analyze performance and mistakes
8. Iterate back to step 4 or 2

What is data?

Collection of objects and their attributes

What does an ML training set consist of?

Inputs (vectors)
Labels

Why do we use input vectors in machine learning?

Algorithms need to handle lots of data
A common strategy is mapping data to another space that is easy to manipulate (Representation)
Vectors are a good representation since we can do linear algebra

What distinguishes regression from classification in a training set?

- Regression: the target t is a real number
- Classification: the target t is an element of a discrete set

What are the classification metrics for evaluation?

Accuracy = # correct predictions / # test instances
Error = 1 - accuracy = # incorrect predictions / # test instances
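
The two metrics above can be computed directly from a list of predictions and the matching true labels. This is a minimal sketch; the function names and the sample labels are made up for illustration and do not come from the deck.

```python
def accuracy(predictions, labels):
    """Fraction of test instances that were predicted correctly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one test instance")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


def error_rate(predictions, labels):
    """Fraction of test instances predicted incorrectly (1 - accuracy)."""
    return 1 - accuracy(predictions, labels)


# Hypothetical test set: 5 instances, 4 of them predicted correctly.
y_true = ["cat", "dog", "dog", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog"]
print("accuracy:", accuracy(y_pred, y_true))
print("error:", error_rate(y_pred, y_true))
```

Both functions work for any label type that supports equality comparison, so the same code covers binary and multiclass problems.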

What is similarity?

The simplest method of learning we know
Classifying according to similar objects you’ve seen
aka manohorse

What happens when more data points are added to a nearest-neighbor classifier?

More complicated boundaries are possible

How does nearest neighbors respond to noise?

It is sensitive to noise or mislabeled data (class noise)

What is the solution to noisy data?

Have the k nearest neighbors vote and pick the majority class

What are the steps for k-nearest neighbors?

- Calculate the distance between the new data point and all the data points in the set
- Identify the k points with the shortest distance to the new point; these are the k nearest neighbors
- Among the nearest neighbors, count how many points there are for each class and pick the majority
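
The steps above can be sketched in a few lines of Python. This is an illustrative version only: it assumes Euclidean distance, and the function name `knn_predict` and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Step 1: distance from the query to every training point (Euclidean here).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k points with the shortest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count the labels among the neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up 2-D training set with two classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2.0, 2.0), k=3))
print(knn_predict(points, labels, (6.0, 7.0), k=3))
```

With two classes, an odd k avoids tied votes. In this sketch a tie would be broken in favor of whichever tied class appears first among the sorted neighbors.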

What is k?

k is the number of neighbors that vote. It determines the tradeoff between underfitting and overfitting the data

What happens when there is a small k?

- Good at capturing fine-grained patterns
- May overfit, i.e. be sensitive to local variations in the training data

What happens when there is a large k?

- Makes stable predictions by averaging over lots of examples
- May underfit because the model is too general and oversimplifies underlying patterns in the data

How do you balance k?

- The optimal k depends on the number of data points (n)
- As a rule of thumb, choose k = 3
- Keep k < sqrt(n)

What is a validation set used for?

Tuning hyperparameters
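
One hyperparameter that a validation set can tune is k in k-nearest neighbors. The sketch below tries a few candidate values and keeps the one with the highest validation accuracy. The 1-D data, the deliberately mislabeled point, and all function names are made up for illustration.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    """Fraction of validation points that k-NN classifies correctly."""
    correct = sum(knn_predict(train, x, k) == t for x, t in validation)
    return correct / len(validation)


# Made-up 1-D data: class A sits near 1-4, class B near 6-9.
# The point (2.5, "B") plays the role of a mislabeled (noisy) example.
train = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (2.5, "B"),
         (6, "B"), (7, "B"), (8, "B"), (9, "B")]
validation = [(2.4, "A"), (2.7, "A"), (3.6, "A"), (6.5, "B"), (8.5, "B")]

# Score each candidate k on the validation set and keep the best one.
scores = {k: validation_accuracy(train, validation, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On this toy data k = 1 is misled by the noisy point, k = 9 lets the whole training set vote and so always predicts the larger class, and k = 3 scores best. The test set is not touched during this search, so it can still give an unbiased estimate afterwards.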

What is cross validation?

A technique for estimating the generalization error of a learning algorithm when the dataset is too small for a simple train/test or train/validation split to give an accurate estimate

What is k-fold cross validation?

- The dataset is partitioned into k non-overlapping subsets
- On trial i, the i-th subset is the test set and the rest is the training set
- The test error is estimated by averaging the test score across the k trials
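
The procedure above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names, the `fit_predict` callback signature, the majority-class baseline, and the toy labels are all assumptions.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k non-overlapping, nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(inputs, labels, fit_predict, k):
    """Average test accuracy over k trials; fold i is the test set on trial i."""
    scores = []
    for test_idx in k_fold_indices(len(inputs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(inputs) if i not in held_out]
        train_t = [t for i, t in enumerate(labels) if i not in held_out]
        test_x = [inputs[i] for i in test_idx]
        test_t = [labels[i] for i in test_idx]
        predictions = fit_predict(train_x, train_t, test_x)
        correct = sum(p == t for p, t in zip(predictions, test_t))
        scores.append(correct / len(test_t))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_t, test_x):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = max(set(train_t), key=train_t.count)
    return [most_common] * len(test_x)


# Toy dataset: 10 instances, 7 labeled "A" and 3 labeled "B".
xs = list(range(10))
ts = ["A", "B", "A", "A", "B", "A", "A", "B", "A", "A"]
print(cross_validate(xs, ts, majority_baseline, k=5))
```

Any classifier can be plugged in through `fit_predict`. In practice the data is usually shuffled before it is split into folds, which this sketch leaves out to keep the run deterministic.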

What are the highlights of k-nearest neighbor?

- Simple
- No training
- Easy to justify a classification to a customer
- Can easily do multiclass

What are the limitations of KNN? Large dataset

- KNN is a lazy learning technique
- In the training phase KNN does nothing, so training is fast
- Prediction becomes slow on a large dataset, since the model has to calculate the Euclidean distance from the given point to every point in the dataset

What are the limitations of KNN? Curse of Dimensionality

- The feature space becomes increasingly sparse as the number of dimensions (features) grows
- In high-dimensional spaces, the notion of proximity or similarity becomes less meaningful
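
A small experiment can make the loss of meaningful proximity concrete. The sketch below draws uniformly random points in a unit hypercube and compares the distance from a query to its nearest and farthest point. The setup (uniform data, 200 points, the chosen dimensions, the function name) is an assumption made for illustration.

```python
import math
import random


def nearest_to_farthest_ratio(num_points, dim, rng):
    """Ratio of the nearest to the farthest distance from a random query point."""
    points = [[rng.random() for _ in range(dim)] for _ in range(num_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(p, query) for p in points]
    return min(distances) / max(distances)


rng = random.Random(0)  # fixed seed so the run is repeatable
for dim in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(200, dim, rng)
    print(f"dim={dim:5d}  nearest/farthest distance ratio = {ratio:.3f}")
```

As the dimension grows the ratio tends toward 1, meaning the nearest neighbor is barely closer than the farthest point, so a vote among "nearest" neighbors carries less information.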

What are the limitations of KNN? Imbalanced dataset

- The majority class typically has significantly more samples than the minority class
- The large number of neighbors from the majority class can overpower the neighbors from the minority class
- These neighbors dominate the decision-making process, biasing the predictions toward the majority class

(38 cards)