L3 Flashcards
Machine Learning
Branch of AI and CS that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving accuracy
Supervised ML
- use of labelled datasets to train algorithms which classify data or predict outcomes
- classification or regression
Unsupervised ML
- uses unlabeled data, so there is no supervision through a labelled training set
- the model finds hidden patterns and insights by itself
- clustering or association (rules)
Reinforcement ML
- simulates an agent that perceives and interprets its environment, takes actions and learns through trial and error
- aims to maximise the cumulative reward in an environment where each action yields a reward or penalty
ML Workflow (7 steps)
- gather data
- prepare data
- split into train, validation and test sets (split sketch below)
- train model
- test and validate model
- deploy model
- iteration
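A minimal sketch of the splitting step, assuming scikit-learn's train_test_split and a made-up 70/15/15 ratio (neither is specified in the card):

```python
# Split a dataset into train / validation / test sets (hypothetical 70/15/15 ratio).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)              # toy feature matrix
y = np.random.randint(0, 2, size=100)   # toy binary labels

# First carve off the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 70 / 15 / 15
```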
KNN
what is it
pro and con
practical things
- classifies an object based on the closest training examples in the feature space -> nearest neighbours
- k is the number of training examples closest to the query
- distances between the query point and all other points are computed, the k nearest points are selected, and the most frequent label is voted on (classification) or the values are averaged (regression) (sketch below)
Pro: simple and usable for both regression and classification, achieves high accuracy in a wide range of prediction problems
Con: becomes slow as the size of the data grows and needs high computing power - can be improved with preprocessing (e.g. decision trees for feature selection, PCA for dimensionality reduction)
- being non-parametric and instance-based, it is most useful when little is known about the data distribution (it still needs labeled training examples)
eg) handwriting detection, image/video recognition, stock prediction
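A from-scratch sketch of the KNN procedure described above (compute distances, take the k nearest, vote); the toy data and k=3 are assumptions:

```python
# K-nearest-neighbours classification: distances, k nearest, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the neighbours (for regression: average instead)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (hypothetical data)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> "b"
```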
Decision trees
what is it
how does it work
pros and cons
- supervised learning, can be used for classification and regression but is usually used for binary classification problems
- tree-structured classifier: internal nodes are features of the dataset, branches are decision rules and leaf nodes are the outcomes
- the algorithm starts from the root node, compares feature values against the split rules and jumps to the next node until a leaf is reached (sketch below)
- pro: simple to understand, useful for decision-related problems, helps think through all possible outcomes of a problem, requires less data cleaning, works well as a preprocessing method
- con: many layers make it complex, computational complexity increases with the number of layers, prone to overfitting (mitigated with Random Forests)
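A short sketch of fitting a decision tree, assuming scikit-learn's DecisionTreeClassifier and the iris dataset purely for illustration:

```python
# Fit and inspect a small decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth limits the number of layers, which also limits overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each internal node tests a feature, each branch is a decision rule, each leaf an outcome
print(export_text(tree, feature_names=load_iris().feature_names))
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))  # predict the class of one flower
```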
Random Forest
what is it
differences from decision trees
- an ensemble that averages several decision trees
- each tree is trained on a random sample of the data
- takes the majority vote (classification) or the average (regression) of the outcomes of the individual trees (sketch below)
- less overfitting
- slower due to more computation
- doesn't rely on one set of formulas but on the average of many trees
- much more successful if the individual trees are diverse
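A minimal sketch of a random forest as an ensemble of trees fitted on bootstrap samples with a majority vote; the dataset and hyperparameters are assumptions:

```python
# A random forest is an ensemble of decision trees, each fit on a bootstrap sample.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each trained on a random sample of the data (bootstrap=True);
# the forest takes a majority vote over the trees' predictions.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the majority-vote predictions
```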
Bootstrap Sampling
drawing samples from the data with replacement to estimate a population parameter (sketch below)
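A tiny sketch of bootstrap sampling with NumPy, resampling with replacement to estimate the mean of a made-up sample:

```python
# Bootstrap: resample the data with replacement and recompute the statistic each time.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=50)  # hypothetical observed sample

boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]

print(np.mean(boot_means))                     # bootstrap estimate of the population mean
print(np.percentile(boot_means, [2.5, 97.5]))  # rough 95% confidence interval
```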
Naive Bayes
- uses conditional probability (Bayes' theorem) to calculate the likelihood of a point belonging to a certain class
- naively assumes that the predictors are independent of each other
- used for binary or multiclass classif problems
posterior = (prior x likelihood) / evidence, i.e. P(class | data) = P(class) x P(data | class) / P(data) - be able to explain in detail (worked example below)
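A worked example of posterior = (prior x likelihood) / evidence with hypothetical prior and likelihood values, only to show the arithmetic:

```python
# Bayes' theorem: P(class | x) = P(class) * P(x | class) / P(x)
# Hypothetical two-class problem with made-up prior and likelihood values.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {"spam": 0.8, "ham": 0.1}   # P(word "offer" appears | class)

# Evidence P(x) = sum over classes of prior * likelihood
evidence = sum(prior[c] * likelihood[c] for c in prior)

posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}
print(posterior)  # {'spam': ~0.774, 'ham': ~0.226}
```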
Linear Regression
- model that describes relationship between predictors and outcomes
- simplest linear model
- a key algorithm, commonly used for statistical analysis (fit sketch below)
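A minimal least-squares sketch with NumPy; the toy data and the use of np.polyfit are assumptions, not from the card:

```python
# Ordinary least-squares fit of y = a*x + b on toy data with hypothetical noise.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # true slope 2, intercept 1

# Fit a degree-1 polynomial (a line) to the predictor/outcome pairs
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)            # should be close to 2 and 1
print(slope * 4.0 + intercept)     # prediction for x = 4
```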
Logistic Regression
- adapts linear regression to classification
- models the probability of an event by taking the logistic function of a linear combination of one or more independent variables
- basically puts the linear combination into a function that is bounded between 0 and 1 (sketch below)
- binary, multinomial and ordinal logistic regression
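A minimal sketch of the binary case: the logistic (sigmoid) function maps a linear combination of inputs to a probability between 0 and 1; the weights here are hypothetical, not fitted:

```python
# Logistic regression = sigmoid of a linear combination of the inputs.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # bounded between 0 and 1

# Hypothetical fitted weights and bias for two features
w = np.array([1.5, -2.0])
b = 0.3

x = np.array([0.8, 0.4])          # one observation
p = sigmoid(np.dot(w, x) + b)     # probability of the positive class
print(p, "-> class", int(p >= 0.5))
```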
K-Means Clustering
what it is
steps
elbow approach
- groups items into clusters without predefined classes
- each observation belongs to the cluster with the nearest mean
- tries to keep clusters as small as possible
process
- pick centroids
- each data point joins the cluster of the nearest centroid
- find new centroids of the cluster
- iterate until convergence
Elbow approach: how to choose the best value for K - the sum of squared distances within clusters drops quickly as K grows until its reduction becomes slow -> that elbow is the ideal K with the least variation for the fewest clusters (sketch below)
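A from-scratch sketch of the listed steps (pick centroids, assign points to the nearest centroid, recompute centroids, iterate); k = 2 and the toy blobs are assumptions:

```python
# K-means: assign each point to the nearest centroid, then move centroids to cluster means.
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick centroids
    for _ in range(n_iter):
        # step 2: each observation joins the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute the centroid of each cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):               # step 4: stop at convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])  # two toy blobs
labels, centroids = kmeans(X, k=2)
print(centroids)
```

For the elbow approach one would run this for several values of k and plot the sum of squared distances of the points to their assigned centroids against k.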
PCA
- dimensionality reduction, removes data that is not useful
- takes the attributes and directions with the most variance/relevance and maps all data onto fewer dimensions
- projection-based method that projects onto set of orthogonal axes
- useful for exploratory analysis
- eigenvalues can be used to determine the number of principal components to keep (sketch below)
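A minimal PCA sketch via the eigen-decomposition of the covariance matrix; the correlated toy data and the choice of 2 components are assumptions:

```python
# PCA: project centred data onto the orthogonal directions of largest variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.3, 1.0, 0.2],
                                          [0.1, 0.2, 0.2]])  # correlated toy data

X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# Eigenvalues give the variance along each principal component
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())      # explained variance ratio, used to choose the nr of PCs
X_2d = X_centred @ eigvecs[:, :2]   # map the data onto the 2 top components
print(X_2d.shape)
```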