theory lecture 5 Flashcards
statistics
more theory-based and top-down. it is model-based and focuses on testing hypotheses.
machine learning
more heuristic and focused on improving the performance of a learning agent. it also looks at real-time learning and robotics.
data mining and knowledge discovery
integrates both theory and heuristics. the focus is on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualisation of results.
test data
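data kept apart from the training data in supervised learning systems. after training, it shows how well the machine has learned by measuring its performance on examples it has not seen before.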
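a minimal sketch of scoring a model on test data. the classifier `predict` and the four labelled messages are made up for illustration; they are not from the lecture.

```python
# Hypothetical toy classifier: predicts "spam" when a message contains "win".
def predict(message):
    return "spam" if "win" in message.lower() else "ham"

# Test data: labelled examples that were NOT used during training (made up).
test_data = [
    ("Win a free phone", "spam"),
    ("Meeting at 10", "ham"),
    ("You could win big", "spam"),
    ("Cheap loans today", "spam"),
]

# Accuracy = share of test examples where the prediction matches the label.
correct = sum(1 for message, label in test_data if predict(message) == label)
accuracy = correct / len(test_data)
print(accuracy)  # 0.75: the last message is spam but contains no "win"
```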
regression
a type of supervised learning where you try to predict a continuous numeric value, eg. a score.
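a minimal sketch of one common regression technique, a least-squares straight line. the function name `fit_line` and the hours/scores data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: exam score grows by 2 points per hour studied, starting at 50.
hours = [1, 2, 3, 4]
scores = [52, 54, 56, 58]

slope, intercept = fit_line(hours, scores)
predicted = slope * 5 + intercept  # predicted score for 5 hours of study
print(slope, intercept, predicted)
```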
association
a type of unsupervised learning where you look for items or attribute values that often occur together and measure how strongly they are associated with each other.
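a minimal sketch of two standard association measures, support and confidence. the shopping baskets are made up, and the helper names `support` and `confidence` are assumptions.

```python
# Made-up shopping baskets (each basket is a set of items).
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(items):
    """Share of baskets that contain all of the given items."""
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

both = support({"bread", "butter"})        # 2 of 4 baskets
rule = confidence({"bread"}, {"butter"})   # 2 of the 3 bread baskets
print(both, rule)
```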
clustering
a type of unsupervised learning where you group similar examples together without using labels, eg. so that dogs and cats end up in different groups.
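a minimal sketch of clustering using k-means on one feature, which is one clustering technique among several. the animal weights and the starting centres are made up; note that no dog/cat labels are given to the algorithm.

```python
def k_means_1d(values, centres, rounds=10):
    """Very small k-means for one-dimensional data."""
    groups = []
    for _ in range(rounds):
        # Assign every value to its nearest centre.
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move every centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups

# Made-up body weights in kg; the light animals could be cats, the heavy ones dogs.
weights_kg = [3.5, 4.0, 4.5, 20.0, 25.0, 30.0]
centres, groups = k_means_1d(weights_kg, centres=[3.5, 30.0])
print(centres, groups)
```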
ANN
Artificial Neural Network. a prediction model that is inspired by the way a brain works with neurons. it is what deep learning is based on. for classification, the data is split into three subsets: ~60% training, ~20% validation, and ~20% testing.
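a minimal sketch of the ~60% / ~20% / ~20% split mentioned above. the function name `split_60_20_20` and the fixed seed are assumptions; real projects often use a library helper instead.

```python
import random

def split_60_20_20(rows, seed=0):
    """Shuffle the rows, then cut them into training, validation and test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    n_train = len(rows) * 60 // 100
    n_val = len(rows) * 20 // 100
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # whatever is left, roughly 20%
    return train, validation, test

train, validation, test = split_60_20_20(range(100))
print(len(train), len(validation), len(test))  # 60 20 20
```

shuffling first matters when the rows are sorted, eg. by class, because otherwise one subset could contain only one kind of example.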
overtraining
when you train too long or too closely on the training sample. the algorithm knows everything about the sample, including its noise, but it may not recognise anything outside of the sample. also called overfitting.
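a minimal sketch of the effect: a model that memorises the sample scores perfectly on it but poorly on new examples. the data, the `memoriser` model and the threshold of 5 are all made up for illustration.

```python
# Made-up data: (hours studied, passed the exam?)
train = [(1, False), (2, False), (8, True), (9, True)]
unseen = [(3, False), (7, True), (10, True)]

memory = dict(train)

def memoriser(hours):
    # Knows the training sample by heart; answers False for anything else.
    return memory.get(hours, False)

def threshold_rule(hours):
    # A simpler, more general rule (the threshold 5 is an assumption).
    return hours >= 5

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memoriser, train), accuracy(memoriser, unseen))
print(accuracy(threshold_rule, train), accuracy(threshold_rule, unseen))
```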
target variable
the variable we are trying to predict based on the attributes in the columns of a table.
dimensionality of a data set
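the number of features/attributes in the data set. if a feature itself has several dimensions, it is the sum of the dimensions of all features/attributes.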
curse of dimensionality
when you have so many dimensions that the data becomes sparse: the available examples cover only a tiny part of the space, and it becomes hard to predict a value.
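a minimal sketch of why many dimensions hurt: if every feature is cut into 10 bins, the number of possible cells grows as 10 to the power of the number of features, so a fixed number of rows covers less and less of the space. the function name and the choice of 10 bins are assumptions.

```python
def max_cell_coverage(n_rows, n_dims, bins_per_dim=10):
    """Largest share of grid cells that n_rows examples could possibly occupy."""
    cells = bins_per_dim ** n_dims
    return min(n_rows, cells) / cells

# With 1000 rows, coverage collapses as features are added.
for n_dims in (2, 3, 6, 10):
    print(n_dims, max_cell_coverage(1000, n_dims))
```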
CRISP-DM
Cross-Industry Standard Process for Data Mining. a model used to show the knowledge discovery process flow. the process is highly iterative and experimental. you may have to go back a few steps, eg. if your model behaves differently in practice.
C&RT
a decision tree prediction model that can be used for both classification and regression. it stands for Classification and Regression Trees.
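a minimal sketch of the core step in growing a classification tree: choosing the split on one numeric feature that gives the lowest weighted Gini impurity. the data and the names `gini` and `best_split` are assumptions; a full tree would repeat this step on each resulting subset.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels are the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """rows: list of (value, label). Returns (threshold, weighted Gini) for 'value <= threshold'."""
    best = None
    for threshold in sorted({value for value, _ in rows}):
        left = [label for value, label in rows if value <= threshold]
        right = [label for value, label in rows if value > threshold]
        if not right:
            continue  # a split must leave examples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Made-up data: (hours studied, passed?)
rows = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
threshold, score = best_split(rows)
print(threshold, score)  # 3 0.0: splitting at 3 separates the classes perfectly
```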
Random Forest
a prediction model that combines the predictions of many different decision trees, eg. by majority vote for classification.
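a minimal sketch of only the combining step, a majority vote over trees. the three "trees" are hand-written stand-ins, not trained models; a real Random Forest trains each tree on a random sample of the rows and features.

```python
from collections import Counter

# Three hypothetical, hand-written "trees" (stand-ins for trained decision trees).
def tree_a(animal):
    return "dog" if animal["weight_kg"] > 8 else "cat"

def tree_b(animal):
    return "dog" if animal["barks"] else "cat"

def tree_c(animal):
    return "dog" if animal["height_cm"] > 30 else "cat"

def forest_predict(trees, sample):
    """Let every tree vote and return the most common answer."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Made-up example: a small dog that one tree gets wrong.
small_dog = {"weight_kg": 6, "barks": True, "height_cm": 35}
prediction = forest_predict([tree_a, tree_b, tree_c], small_dog)
print(tree_a(small_dog), prediction)  # cat dog
```

the vote lets the other two trees outweigh the mistake of `tree_a`, which is the reason for combining trees.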