topic 1 Flashcards
what is AI
- systems that mimic human intelligence, including reasoning, decision-making and problem solving
- minimal human involvement
what is ML
- teaching machines to find patterns in data and use them to make predictions or decisions
- requires human involvement for data prep, model training and optimisation
what is training
given a training set of labelled examples, estimate the prediction function f by minimising the prediction error on the training set
what is prediction
applying f to a previously unseen input x to predict the value y = f(x)
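
A minimal sketch of training followed by prediction, assuming a one-feature linear model f(x) = w*x + b fitted by least squares. The data and the names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate f(x) = w*x + b by minimising the mean squared error on the training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single input feature.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def predict(model, x):
    """Apply the learned function f to an input x."""
    w, b = model
    return w * x + b


# Training: labelled examples (x, y); this toy data follows y = 2x + 1 exactly.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
model = fit_line(train_x, train_y)

# Prediction: x = 10 never appeared in the training set.
y_new = predict(model, 10.0)
print(y_new)
```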
what is the ground truth
refers to the reality you want to model with your machine learning algorithm
- it is the actual correct output associated with a dataset, used as a reference for training and evaluating models
what is data splitting
splitting the data into training and testing sets; this helps ensure that the models built from the data, and the processes that use those models, are accurate
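
A rough sketch of a random train/test split using only the standard library. The function name `train_test_split`, the 25% test fraction and the fixed seed are illustrative assumptions, not values from these notes.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))
train, test = train_test_split(data)
print(len(train), len(test))
```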
what is the loss (error) function
quantifies the difference between the outputs predicted by an ML algorithm and the actual target values
examples of loss (error) functions in regression
- coefficient of determination (R^2) (strictly a goodness-of-fit score rather than a loss: higher is better)
- mean square error
- root mean square deviation
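
The three measures listed above can be computed directly from their standard definitions. This is a small sketch on made-up numbers; the function names are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error/deviation: square root of the MSE, in the units of y."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (higher is better, 1 is perfect)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]   # only the last prediction is wrong, by 2
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```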
what is overfitting
occurs when the machine learning model gives accurate predictions for training data but not for new data
things that cause overfitting
- data size is too small (doesn't represent the overall data accurately)
- training data contains irrelevant information (noisy data)
- model trains for too long on a single sample of data
- model complexity is high (learns noise within training data)
what is underfitting
occurs when the machine learning model has not learned the patterns in the training data well, so it performs poorly even on the training data
reasons for underfitting
- training data not cleaned and contains noise
- model has high bias
- size of training dataset used is not enough
- model too simple
what is cross validation
a technique for evaluating how well a model performs on unseen data
how does cross validation work
- data is divided into multiple folds or subsets
- one fold = validation set, rest = training set
- repeat so that each fold is used as the validation set once
- average the results
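
The steps above can be sketched as a plain k-fold loop. This is an illustrative sketch, not a library API: `k_fold_indices`, `cross_validate`, the mean-predicting "model" and the toy data are all assumptions.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for each of k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)


# Hypothetical model: always predict the mean of the training labels.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse_score(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
avg_mse = cross_validate(xs, ys, k=3, fit=fit_mean, score=mse_score)
print(avg_mse)
```

Contiguous folds keep the sketch short; in practice the data is usually shuffled before the folds are formed.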
what is data leakage
information from the test data is already present in the training set, so the ML model has effectively seen the data it is evaluated on
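
One common way leakage creeps in is preprocessing. In this sketch (toy numbers, illustrative names) a centring statistic is computed once on train and test together, which is leaky, and once on the training set only.

```python
def mean(values):
    return sum(values) / len(values)


train = [1.0, 2.0, 3.0]
test = [100.0]

# Leaky: the mean is computed on train + test together, so information
# about the test point ends up in the training features.
leaky_mean = mean(train + test)
leaky_train = [x - leaky_mean for x in train]

# Leak-free: the mean comes from the training set only and is then
# reused unchanged to transform the test set.
clean_mean = mean(train)
clean_train = [x - clean_mean for x in train]
clean_test = [x - clean_mean for x in test]

print(leaky_mean, clean_mean)
```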
what is feature importance
calculates a score for every input feature of a machine learning model to establish the importance of each feature in the decision-making process
higher score = larger effect on model prediction
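
These notes do not name a scoring method. One widely used option is permutation importance: shuffle one feature column and measure how much the error grows. The sketch below assumes that method; the model and data are hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Score each feature by the increase in MSE after shuffling its column."""
    rng = random.Random(seed)

    def error(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = error(rows)
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(error(shuffled) - baseline)
    return scores


# Hypothetical model that uses feature 0 and ignores feature 1.
def predict(row):
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]
scores = permutation_importance(predict, rows, targets, n_features=2)
print(scores)  # feature 0 gets a positive score, feature 1 gets 0.0
```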
sources that lead to garbage in garbage out
- low quality data
- biased sampling
- incorrect labels
- missing values
- outliers
- data inconsistencies
what is supervised learning and what is it useful for
- learns the input-output relation from labelled examples
useful for fast screening and classification
what is unsupervised learning and what is it useful for
- does not require knowledge of outputs, only inputs
- it finds similarities in complex data
- some methods (e.g. k-means clustering) require the user to know how many classes to expect
- useful to reduce data dimensionality
name types of supervised learning
- classification (predicts a category)
- regression (predicts a value)
name types of unsupervised learning
- clustering (divided by similarity)
- association (identify sequences)
- dimension reduction/generalization (find hidden dependencies)
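
As an illustration of clustering, here is a minimal one-dimensional k-means sketch. The number of clusters is fixed by the initial centres the user supplies; the function name, starting centres and data are assumptions made for the example.

```python
def k_means_1d(points, centres, iterations=10):
    """Group 1-D points around len(centres) cluster centres using no labels."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centres, clusters = k_means_1d(points, centres=[0.0, 10.0])
print(centres, clusters)
```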
what is reinforcement learning
- the model learns by trial and error: train the model, then assess it (reward or penalty); train it with new data, then assess again
what are the pitfalls of machine learning
- nondeterministic = even for the same input, a model can exhibit different behaviours on different runs
- stochastic = models use probability distributions, so they rely on randomness and uncertainty to make predictions and analyse data
- can be biased
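
A small sketch of why stochastic models can differ between runs, and how fixing the random seed makes a run repeatable. `noisy_prediction` is a hypothetical stand-in for a model with a random component.

```python
import random


def noisy_prediction(x, rng):
    """Hypothetical stochastic model: the signal 2*x plus Gaussian noise."""
    return 2 * x + rng.gauss(0, 1)


# Same input and same seed: the two runs agree exactly.
a = noisy_prediction(1.0, random.Random(42))
b = noisy_prediction(1.0, random.Random(42))

# Same input but a different seed: the result generally differs.
c = noisy_prediction(1.0, random.Random(7))

print(a == b, a == c)
```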