Machine Learning Basics Flashcards
Taken from various sources, including: https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf
What is Machine Learning?
The art and science of giving computers the ability to learn to make decisions from data (improve at a task based on experience) without being explicitly programmed.
Name three categories of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
What is Unsupervised Learning?
Uncovering hidden patterns in unlabelled data, e.g. grouping customers into distinct categories (clusters) that were previously unknown and are hopefully meaningful.
What is Reinforcement Learning?
Software agents interact with an environment and learn to optimise their behaviour, trying to find the most efficient path to a goal.
Learning is driven by a system of rewards and punishments: successful routes earn a reward, while failed attempts are restarted.
What is Supervised Learning?
The Machine Learning model trains on labelled training data: sets of features (variables), each paired with a known label (target variable). It then predicts the labels (target variable) of unseen test data. Training often takes multiple iterations.
Also called Predictive Data Analytics.
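A minimal sketch of the train-then-predict idea, using a 1-nearest-neighbour rule chosen only because it is short (it is not named in these cards). The data values and the names `training_data` and `predict` are made up for illustration.

```python
import math

# Labelled training data: each row is (features, label). Values are made up.
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label

# Unseen test data: the model only sees the features and must supply the label.
unseen = [(1.2, 1.4), (8.5, 9.2)]
predictions = [predict(x) for x in unseen]
print(predictions)
```

Nearest neighbour simply memorises the training set, so it does not show the iterative training that many other supervised models need.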
What is Exploratory Data Analysis (EDA)?
An initial exploration of the data, using mostly graphical techniques, to gain insight into its nature and structure, e.g. which variables are important and where the outliers are.
Who codified EDA practice?
John Tukey, in the 1970s
Complete the following …“Science does not begin with a tidy question …” (EDA)
“… nor does it end with a tidy answer”
What did Tukey refer to EDA as?
Detective work
Name some EDA techniques
Plot the level of the distribution (measures of central tendency): mode, median, mean
Measure the spread of the distribution: range, standard deviation, percentiles, quartiles
Relationships between variables/features in datasets/observations
Investigate trends for variables over time.
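The numeric side of these techniques can be sketched with Python's standard `statistics` module. The sample values are hypothetical, and the 1.5 * IQR outlier rule is a common rule of thumb rather than something stated in these cards. Plotting is left out.

```python
import statistics

# Hypothetical sample, e.g. daily sales figures.
values = [12, 15, 15, 18, 21, 24, 30, 95]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),
    "quartiles": statistics.quantiles(values, n=4),
}

# Flag outliers with the 1.5 * IQR rule of thumb.
q1, _, q3 = summary["quartiles"]
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print("outliers:", outliers)
```

Note how the single extreme value pulls the mean well above the median, which is exactly the kind of thing EDA is meant to surface.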
Describe Data Wrangling
A process that occurs during the Data Preparation stage.
Take messy, incomplete data or data that is too complex and simplify and/or clean it so that it’s useable for analysis
Remove or impute missing values
Convert categorical to numeric
Standardise/Normalise data
Clean data
Join data together
Generate new fields
Overlaps with Feature Engineering.
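A rough sketch of a few of these wrangling steps on a single column, using only the standard library. The column name and values are hypothetical, and dropping missing entries is just one option (imputing is the other).

```python
import statistics

# Hypothetical raw column with a missing value recorded as None.
raw_heights_cm = [150.0, 160.0, None, 170.0, 180.0]

# Clean: drop missing entries (imputing them is the alternative).
heights = [h for h in raw_heights_cm if h is not None]

# Standardise: rescale to mean 0 and standard deviation 1 (z-scores).
mean = statistics.mean(heights)
stdev = statistics.pstdev(heights)
standardised = [(h - mean) / stdev for h in heights]

# Normalise: rescale to the range [0, 1] (min-max scaling).
low, high = min(heights), max(heights)
normalised = [(h - low) / (high - low) for h in heights]

# Generate a new field from an existing one.
heights_m = [h / 100 for h in heights]
```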
What is Feature Engineering?
Taking whatever information you have about your problem and turning it into a usable numeric format that you can use to build your feature matrix.
How does Machine Learning work?
Use data to form a hypothesis. New data exposes errors in the hypothesis, so the error gap is measured and the hypothesis is adjusted to fit. The aim is to get the error gap as low as possible.
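One way to picture this loop is gradient descent on a one-parameter model. This is a sketch under assumptions: the data, the hypothesis form y = w * x, the squared-error measure and the learning rate are all chosen for illustration.

```python
# Hypothetical data generated by the rule y = 2x, which the model must discover.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean_squared_error(w):
    """The error gap: average squared difference between prediction and truth."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial hypothesis: y = 0 * x
learning_rate = 0.01  # assumed step size
errors = []

for _ in range(200):
    errors.append(mean_squared_error(w))
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient  # adjust the hypothesis to shrink the error

print(f"learned w = {w:.4f}, first error = {errors[0]}, last error = {errors[-1]:.2e}")
```

Each pass measures the error gap and nudges `w` in the direction that reduces it, so `w` moves from 0 towards 2.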
Name some types of Feature Engineering.
Converting Categorical features to numeric - could use one-hot encoding
Encode images to pixel representation
Impute missing data, e.g. fill NaN values with the mean of the column
Build Feature Pipeline to chain together above tasks
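A minimal sketch of two of these steps chained into a tiny pipeline, written with the standard library only. The helper names `impute_mean` and `one_hot` and the sample records are hypothetical.

```python
import math
import statistics

def impute_mean(column):
    """Replace NaN entries with the mean of the observed values."""
    observed = [v for v in column if not math.isnan(v)]
    fill = statistics.mean(observed)
    return [fill if math.isnan(v) else v for v in column]

def one_hot(column):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(column))
    return [[1 if value == c else 0 for c in categories] for value in column]

# Hypothetical raw records, stored as two columns: age and colour.
ages = [20.0, float("nan"), 40.0]
colours = ["red", "blue", "red"]

# A tiny "pipeline": run each step, then join the results into a feature matrix.
feature_matrix = [
    [age] + encoded
    for age, encoded in zip(impute_mean(ages), one_hot(colours))
]
print(feature_matrix)
```

The one-hot columns are ordered alphabetically here (blue, then red). In practice a library pipeline would usually handle this chaining.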
What do Machine Learning Algorithms do?
Algorithms learn a pattern inherent in existing data. These patterns can be used to make predictions about data that has not yet been analysed. This pattern, or model, is much smaller than the training data.
Describe the Machine Learning lifecycle.
Derive pattern/data model using training data and algorithm
Check model using test data
Use formal process to check accuracy of model
Apply model to new data
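The four stages can be sketched with a deliberately tiny model: a single cut-off value on one feature. The data, the threshold model and the use of accuracy as the formal check are assumptions made for illustration.

```python
# Hypothetical labelled data: (hours_studied, passed).
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (2.5, 0), (4.5, 1)]
train, test = data[:6], data[6:]  # hold back the last two rows for testing

def accuracy(threshold, rows):
    """Fraction of rows where 'predict 1 if x >= threshold' matches the label."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)

# Derive the model from training data: pick the cut-off with the best accuracy.
candidates = [x for x, _ in train]
model = max(candidates, key=lambda t: accuracy(t, train))

# Check the model on held-out test data using a formal metric.
test_accuracy = accuracy(model, test)

# Apply the model to new, unlabelled data.
new_hours = [0.5, 7]
predictions = [int(x >= model) for x in new_hours]

print(model, test_accuracy, predictions)
```

The learned model here is one number, which also illustrates how a model can be much smaller than the data it was trained on.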
What is Dimensional Reduction in terms of Feature Engineering?
If the number of dimensions is too high, it takes too long to process the data and produce a model
Some dimensions may not be of use
Can simply throw away dimensions, using intuition
Or employ dimension reduction techniques, e.g. Decision Trees (to rank feature importance) or Principal Component Analysis (PCA)
What is Principal Component Analysis (PCA)?
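PCA is a feature extraction technique that reduces the dimension of a feature space, which helps with the curse of dimensionality. It projects the data onto a small number of new axes (principal components) that capture most of the variance, so there are fewer relationships between features to consider and the model is less likely to overfit.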
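A rough sketch of the idea on 2-D data, reducing two features to one. To stay within the standard library it finds the first principal component by power iteration on the covariance matrix, which is a stand-in for the eigen-decomposition or SVD routines a numerical library would normally provide. The data points are hypothetical.

```python
import math

# Hypothetical 2-D data lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

# Centre the data so each feature has mean 0.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix of the centred features.
cxx = sum(x * x for x, _ in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)

# Power iteration: repeatedly multiplying by the covariance matrix converges
# to its dominant eigenvector, which is the first principal component.
vx, vy = 1.0, 0.0
for _ in range(100):
    vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

# Project each 2-D point onto the component: two features become one.
reduced = [x * vx + y * vy for x, y in centred]

# Share of the total variance kept by the single remaining feature.
explained = (sum(r * r for r in reduced) / (n - 1)) / (cxx + cyy)
print(f"component = ({vx:.3f}, {vy:.3f}), variance explained = {explained:.3f}")
```

Because the two original features are strongly correlated, one component keeps almost all of the variance.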
What is Clustering in terms of Unsupervised learning?
Finding islands of similarity in complex data sets.
Uniting singular points into distinct groups or clusters.
Examining data and assembling data points into clusters based on a measure of distance.
Describe the K-Means Clustering algorithm
Unsupervised method utilising clustering.
- Choose the number of clusters (K) for the algorithm, e.g. with a scree (elbow) plot
- Randomly place K cluster centre points (centroids) as starting positions
- Assign each data point to its nearest centroid
- Update the position of each centroid to the centre (average location) of its assigned data points
- Repeat the assign and update steps until no data point changes cluster.
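The steps above can be sketched in plain Python as follows. This is an illustration under assumptions: the function name `k_means`, the `seed` and `max_iter` parameters, Euclidean distance, and starting from randomly chosen data points are all choices made here, and choosing K is left to the caller.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which cluster gets label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.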