theory lecture 5 Flashcards
statistics
more theory-based and top-down. it is model-based and focuses on testing hypotheses.
machine learning
more heuristic and focused on improving performance of a learning agent. it also looks at real-time learning and robotics.
data mining and knowledge discovery
integrates both theory and heuristics. the focus is on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualisation of results.
test data
data held back from training. in supervised learning systems it is used after training to show how well the machine has learned.
regression
a machine learning task where you try to predict a numeric score (a continuous value) rather than a class.
association
a type of unsupervised learning where you look for attributes or items that tend to occur together and see how strongly they are associated with each other.
clustering
a type of unsupervised learning where you group similar examples without being given labels, e.g. separating dogs from cats.
ANN
artificial neural network. a prediction model inspired by the way a brain works with neurons. it is what deep learning is based on. for classification, the data is split into three subsets: ~60% training, ~20% validation, and ~20% testing.
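a minimal sketch of the ~60/20/20 split from this card, in plain Python. the function name `split_60_20_20` and the `seed` parameter are made up for illustration; the lecture does not say how the split is done, so shuffling first is an assumption.

```python
import random

def split_60_20_20(rows, seed=0):
    """Shuffle rows and return (training, validation, testing) lists."""
    shuffled = list(rows)
    # Assumption: rows are shuffled before splitting so each subset is a random sample.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100   # ~60% for training
    n_val = len(shuffled) * 20 // 100     # ~20% for validation
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    testing = shuffled[n_train + n_val:]  # the remaining ~20% for testing
    return training, validation, testing

training, validation, testing = split_60_20_20(range(100))
print(len(training), len(validation), len(testing))  # 60 20 20
```

the model is fitted on the training subset, tuned on the validation subset, and only scored on the testing subset at the end.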
overtraining
when the algorithm is trained too closely on the sample (e.g. for too long), so that it knows everything about the sample, including its noise, but may not recognise anything outside of the sample. also called overfitting.
target variable
the variable we are trying to predict based on the attributes in the columns of a table.
dimensionality of a data set
the number of features/attributes (columns) in the data set.
curse of dimensionality
when you have too many dimensions, the data becomes sparse and it becomes hard to predict a value.
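a rough way to see why more dimensions make the data sparse, as a sketch only. it assumes each attribute is cut into 10 bins and asks what fraction of the resulting cells 1000 data points could fill at best. the function name and the numbers are made up for illustration.

```python
def max_fraction_of_cells_covered(n_points, bins_per_dimension, n_dimensions):
    """Upper bound on the share of grid cells that n_points can occupy."""
    n_cells = bins_per_dimension ** n_dimensions  # cells grow exponentially with dimensions
    return min(1.0, n_points / n_cells)

for d in (1, 2, 3, 6):
    print(d, max_fraction_of_cells_covered(1000, 10, d))
# with 6 dimensions there are 1,000,000 cells, so at most 0.001 of them hold any data
```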
CRISP-DM
Cross-Industry Standard Process for Data Mining. a model used to show the knowledge discovery process flow. the process is highly iterative and experimental. you may have to go back a few steps, e.g. if your model behaves differently in practice.
C&RT
a prediction model. it stands for Classification and Regression Trees.
Random Forest
a prediction model that combines many different decision trees and lets them vote on (or average) the prediction.
Boosted Tree
a prediction model that builds trees one after another, where each new tree focuses on the errors of the trees before it (boosting).
Fusion
a prediction model that combines different algorithms.
1-Away
the accuracy when a prediction that is 1 class away from the true class also counts as correct, e.g. it predicts 4, but it is actually 5.
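a small sketch comparing exact accuracy with 1-Away accuracy. it assumes the classes are ordered numbers (like ratings 1 to 5); the function names and the sample values are made up for illustration.

```python
def exact_accuracy(actual, predicted):
    """Share of predictions that hit the true class exactly."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

def one_away_accuracy(actual, predicted):
    """Share of predictions that are at most 1 class away from the true class."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= 1)
    return hits / len(actual)

actual = [5, 3, 2, 4]
predicted = [4, 3, 4, 4]
print(exact_accuracy(actual, predicted))     # 0.5
print(one_away_accuracy(actual, predicted))  # 0.75
```

the first prediction (4 instead of 5) is wrong for exact accuracy but counts for 1-Away; the third (4 instead of 2) is wrong for both.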
SVM
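Support Vector Machine. a prediction model that uses a line (in more dimensions, a hyperplane) to divide your data, chosen to leave the widest margin between the classes.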
linear regression
a method that computes the weights (w0, w1, w2) from the data to minimise the squared error and so 'fit' a line to the data. on its own it predicts a number. for classification, the line is used as a boundary: a point is put in the class when w0 + w1x + w2y >= 0.
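a minimal sketch of both ideas on this card, in plain Python. `fit_line` uses the standard closed-form least-squares solution for one input variable, and `in_positive_class` just evaluates the card's formula. the function names and the example weights are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (w0, w1) that minimise the squared error of y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1

def in_positive_class(w0, w1, w2, x, y):
    """Apply the card's rule: the point is in the class when w0 + w1*x + w2*y >= 0."""
    return w0 + w1 * x + w2 * y >= 0

w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)  # 1.0 2.0

# Hypothetical weights for a boundary line, chosen by hand for the example.
print(in_positive_class(-1, 1, 1, 2, 2))  # True
print(in_positive_class(-1, 1, 1, 0, 0))  # False
```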
decision trees
a method for classification that splits the data on one attribute at a time. in a plot of two attributes this looks like drawing multiple horizontal and vertical lines.
confusion matrix
the primary source for accuracy estimation in classification problems. it is a table of actual classes against predicted classes, so it shows how often your model confuses one class with another. you fill it with the predictions on your testing data to see how many are correct.
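a sketch of how the four cells of a two-class confusion matrix can be counted from testing data. the function name `confusion_counts`, the choice of "cat" as the positive class, and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

actual = ["cat", "cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "cat", "dog", "cat", "dog", "dog"]
counts = confusion_counts(actual, predicted, positive="cat")
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}

# Accuracy is the correct cells (tp + tn) divided by all examples.
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
```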
precision
given that the predicted class is positive, how often is the prediction right? precision = TP / (TP + FP), where TP = true positives and FP = false positives.
recall
given that the true class is positive, how often do you predict it right? recall = TP / (TP + FN), where TP = true positives and FN = false negatives.
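a short sketch of precision and recall computed from confusion matrix counts. the counts below are made-up numbers, and the functions assume there is at least one predicted positive and at least one actual positive (otherwise the division fails).

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that really is positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really is positive, the share that was predicted positive."""
    return tp / (tp + fn)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # about 0.67
```

a model can score high on one and low on the other, which is why both are usually reported.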
decision tree
a tree of tests on attributes that splits your data up step by step. the attributes higher in the tree are more important.