Machine Learning Flashcards
What is machine learning?
- Machine learning is a type of artificial intelligence that allows computer programs to become more accurate at making predictions without being explicitly programmed to do so.
- Machine learning algorithms use historical data as input to predict new output values.
- It helps us predict future outcomes or classify information to make decisions.
What is unsupervised learning? Give an example
- Algorithms that train on unlabeled data.
- The algorithm scans through data sets looking for connections and trends.
- It adds structure to the data in the form of clustering or grouping.
- Example: market segmentation. Users are clustered into groups based on their previous purchases, viewing patterns, etc. The clusters can feed into recommender systems.
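A minimal sketch of clustering unlabeled data, assuming a hand-rolled one-dimensional k-means rather than any particular library. The function name kmeans_1d, the starting centroids and the spending figures are made up for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data (illustrative only)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per user; note that no labels are supplied.
monthly_spend = [12, 15, 14, 11, 95, 102, 99, 13, 101]
centroids, clusters = kmeans_1d(monthly_spend, centroids=[0, 50])
print(centroids)  # [13.0, 99.25]
print(clusters)   # [[12, 15, 14, 11, 13], [95, 102, 99, 101]]
```

The algorithm is never told which users are low or high spenders; the two groups emerge from the data alone.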
What is artificial intelligence? What are some examples of artificial intelligence?
Artificial intelligence is a branch of computer science concerned with building programs that can perform tasks that usually require human intelligence, and that can iteratively improve themselves based on the information they collect.
Two important areas of artificial intelligence are:
Machine Learning - algorithms allow computers to learn a task with minimal instructions and improve with experience
eg - recommendation engines
Deep Learning - a subset of machine learning that processes richer datasets with less preprocessing. Artificial neural networks, which are algorithms inspired by the human brain, learn from large amounts of data. A deep learning algorithm performs a task repeatedly, each time adjusting itself a little to improve the outcome.
eg - image recognition
ML model fit
Underfitting - the model has not learned enough from the training data, resulting in poor generalisation and unreliable predictions. The model is too simple and performs badly even on the training data.
Overfitting - the model fits the training data too closely, including its noise, resulting in poor generalisation. It will underperform when it sees new data. This typically happens when the model is too complex or the training data is inadequate.
Balanced - good generalisation, so the model can infer conclusions with new data.
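A small sketch of the three cases, using made-up data where the true relationship is y = 2x. All three "models" are hand-written for illustration: a constant predictor (too simple), a nearest-point memoriser (too complex for this data) and a least-squares line.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: training targets carry noise of +/-1, test targets are exact.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3, 3, 7, 7, 11, 11]
test_x = [1.5, 2.5, 3.5, 4.5, 5.5]
test_y = [3, 5, 7, 9, 11]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def too_simple(x):
    # Underfitting: always predict the training mean, ignoring x entirely.
    return mean_y


def memoriser(x):
    # Overfitting: return the target of the nearest training point, noise included.
    nearest = min(range(len(train_x)), key=lambda i: abs(x - train_x[i]))
    return train_y[nearest]


# Balanced: a straight line fitted by least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def line(x):
    return slope * x + intercept


for name, model in [("too simple", too_simple), ("memoriser", memoriser), ("line", line)]:
    print(name, round(mse(model, train_x, train_y), 3), round(mse(model, test_x, test_y), 3))
```

On this data the memoriser has zero training error but a larger test error than the line, while the constant predictor has a high error on both sets.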
Testing and training data
Raw data is split into two sets: training and testing. The training set is used to develop the model; the testing set is held back and used to evaluate how well the model performs on data it has not seen. Common train:test ratios are 90:10, 80:20 and 70:30.
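A minimal sketch of an 80:20 split using only the standard library. The helper name train_test_split_simple is hypothetical; scikit-learn offers a ready-made train_test_split function in sklearn.model_selection for real projects.

```python
import random


def train_test_split_simple(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows, then split them into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data records
train, test = train_test_split_simple(rows, test_ratio=0.2)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the raw data is ordered (for example by date or by class), otherwise the test set may not be representative.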
What is scikit-learn?
A powerful library for machine learning in Python. It contains tools for machine learning and statistical modelling, including:
- classification
- regression
- clustering
- dimensionality reduction
What are the two categories of supervised learning?
In supervised learning, algorithms learn from labeled data. Once trained, the algorithm predicts the label for new, unlabeled data by matching it against the patterns it has learned.
Classification
Classification is the process of building a model that divides a dataset into classes based on its features. The program is trained on a training dataset and, based on that training, assigns new data to one of the classes.
The task of a classification algorithm is to find the mapping function from the input (x) to a discrete output (y).
Eg - spam detection
Regression
Regression is the process of finding the relationship between a dependent variable and one or more independent variables. It is used to predict continuous values, such as market trends or house prices.
The task of a regression algorithm is to find the mapping function from the input variable (x) to a continuous output variable (y).
eg - weather forecasting
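A minimal sketch of regression as a mapping from an input x to a continuous output y, assuming a straight-line model fitted by ordinary least squares. The helper fit_line and the floor-area and price figures are made up for illustration. A classifier would instead return a discrete label such as "spam" or "not spam".

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical floor areas (square metres) and prices (thousands).
areas = [50, 60, 70, 80]
prices = [150, 180, 210, 240]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90 + intercept  # continuous prediction for an unseen area
print(slope, intercept, predicted)  # 3.0 0.0 270.0
```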
What is a correlation coefficient?
A value between -1 and +1 indicating the strength and direction of the linear relationship between two variables. -1 = perfect negative relationship, 0 = no linear relationship, +1 = perfect positive relationship.
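df.corr()  # pandas: pairwise correlation matrix of the DataFrame's columns
sns.heatmap(df.corr(), annot=True)  # seaborn (import seaborn as sns): plot the matrix as a heatmap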
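To make the coefficient concrete, here is a sketch that computes the Pearson correlation coefficient by hand with the standard library. The function name pearson_r and the study-hours data are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises in step with hours
score_down = [10, 8, 6, 4, 2]  # falls in step with hours

print(pearson_r(hours, score_up))    # 1.0
print(pearson_r(hours, score_down))  # -1.0
```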
Define logistic regression, decision tree and random forest
Logistic Regression: a supervised learning algorithm used to predict the probability of a binary (yes/no) event occurring.
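A minimal sketch of how logistic regression turns an input into a probability, assuming a single feature. The weight and bias below are hand-picked for illustration; a real model would learn them from labeled training data.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(x, weight, bias):
    """Probability of the 'yes' class for a single feature x."""
    return sigmoid(weight * x + bias)


# Hypothetical parameters, not fitted to any data.
weight, bias = 1.5, -3.0

p = predict_proba(2.0, weight, bias)  # weight * x + bias = 0, so p = 0.5
label = "yes" if p >= 0.5 else "no"   # 0.5 is a common decision threshold
print(p, label)  # 0.5 yes
```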
Decision Trees: a type of supervised machine learning where the data is repeatedly split according to the value of a chosen feature, forming a tree of decisions that ends in a prediction.
Random Forest: an extension of the decision tree. Instead of relying on a single tree, the algorithm trains many trees and combines their results, for example by majority vote.
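A sketch of the voting idea behind a random forest. The three "trees" are hand-written one-split rules with made-up feature names and thresholds. A real random forest learns its trees from random samples of the training data and random subsets of the features.

```python
from collections import Counter


def stump(feature, threshold):
    """Build a one-split decision tree: 'spam' if the feature exceeds the threshold."""
    def predict(sample):
        return "spam" if sample[feature] > threshold else "not spam"
    return predict


# Three hypothetical trees, each splitting on a different feature.
trees = [
    stump("links", 3),
    stump("exclamation_marks", 5),
    stump("capital_ratio", 0.5),
]


def forest_predict(trees, sample):
    """Combine the trees' predictions by majority vote."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


email = {"links": 7, "exclamation_marks": 2, "capital_ratio": 0.8}
prediction = forest_predict(trees, email)
print(prediction)  # spam (two of the three trees vote 'spam')
```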