Week 1 - Supervised ML + Linear Regression Flashcards
How do we describe data?
The 4 Vs of Big Data
Velocity - streaming data (sensors etc)
Veracity - uncertainty of data (poor quality)
Variety - different forms
Volume - scale
Structured vs semi-structured vs unstructured
Structured - adheres to a data model (tabular format, e.g. SQL tables), which makes it easier to contextualise and understand
Semi - doesn’t follow the tabular structure but does contain tags and metadata to separate semantic elements and establish hierarchies of records and fields (e.g. XML).
Unstructured - information that is not arranged according to a preset data model or schema (e.g. text and audio)
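A minimal sketch of the three categories, using only the Python standard library. The sample records (Ada, Alan) and variable names are made up for illustration and are not part of the course material.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns that follow a fixed schema (here, CSV text).
structured = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no table, but tags separate records and fields,
# and individual records may carry different fields.
semi_structured = (
    "<people>"
    "<person id='1'><name>Ada</name></person>"
    "<person id='2'><name>Alan</name><age>41</age></person>"
    "</people>"
)
root = ET.fromstring(semi_structured)
names = [p.findtext("name") for p in root.iter("person")]
ages = [p.findtext("age") for p in root.iter("person")]  # missing field gives None

# Unstructured: free text with no preset model; any structure has to be inferred.
unstructured = "Ada is 36 and works with Alan, who is 41."
word_count = len(unstructured.split())
```

Note that every CSV row has the same columns, while the XML records can differ (the first person has no age), and the free text offers nothing to query beyond the raw words.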
What is data integration?
Consolidating data from heterogeneous sources into a single coherent data source
What are the 5 data integration techniques?
Uniform data access
Common data storage
Application based integration
Common user interface
Middleware data integration
What is uniform data access?
A technique that retrieves and uniformly displays data but leaves it in its original source.
Use to automate and translate communications between systems and allow for more complicated analysis
What is common data storage?
An approach that retrieves and uniformly displays data, but also makes a copy of the data and stores it.
Use to create and store a copy of original data and present uniformly for sophisticated data analysis
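The difference between uniform data access and common data storage can be sketched as follows. This is a toy illustration, not a real integration tool: the two sources (`crm`, `shop`), their field names, and both function names are hypothetical.

```python
# Two hypothetical sources that describe the same thing with different field names.
crm = [{"customer": "Ada", "spend": 120}]
shop = [{"name": "Alan", "total": 80}]

def uniform_access():
    """Read the sources on demand and present them in one shape.

    Nothing is stored: the data stays in its original source.
    """
    for record in crm:
        yield {"name": record["customer"], "amount": record["spend"]}
    for record in shop:
        yield {"name": record["name"], "amount": record["total"]}

def common_storage():
    """Produce the same uniform shape, but keep it as a separate stored copy."""
    return list(uniform_access())

warehouse = common_storage()  # copy taken now
crm[0]["spend"] = 150         # the original source changes afterwards

live = list(uniform_access())  # re-reads the sources, so it sees the change
```

After the source changes, `live` reflects the new value (150) while `warehouse` still holds the copied value (120), which is the trade-off between the two techniques.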
What is application based integration?
Software applications locate, retrieve and integrate data by making data from different sources and systems compatible with one another.
Use to automate and translate communications between legacy and modern systems
What is common user interface?
The user manually conducts all phases of the integration, from retrieval to presentation.
Use to merge a small amount of data sources for basic analysis
What is middleware data integration?
Middleware is a type of software that facilitates communication between legacy systems and modern systems
Use to automate and translate communications between legacy and modern systems.
Supervised vs unsupervised vs semi-supervised
Supervised - uses data with labelled outcomes
Unsupervised - uses data without labelled outcomes
Semi-supervised - uses both data with labelled outcomes and without labelled outcomes
Two types of supervised ML?
Regression
Classification
Parameters vs hyperparameters?
Parameters: the values that change as the model learns from the data. (e.g. regression coefficients)
Hyperparameters: parameters that are not learned directly from the data but relate to implementation, i.e. how we train our ML model.
(e.g. in simple linear regression, whether to include the intercept in the model)
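A small sketch of the distinction, assuming an ordinary least-squares fit of a single feature. The function name `fit_simple_linear` and the toy data are illustrative only. The `fit_intercept` flag is the hyperparameter (chosen before training); the returned slope and intercept are the parameters (learned from the data).

```python
def fit_simple_linear(xs, ys, fit_intercept=True):
    """Least-squares fit of y = slope * x + intercept.

    fit_intercept is a hyperparameter: we choose it before training.
    slope and intercept are parameters: they are computed from the data.
    """
    n = len(xs)
    if fit_intercept:
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sxx = sum((x - mean_x) ** 2 for x in xs)
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
    else:
        # Line forced through the origin, so only the slope is learned.
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        intercept = 0.0
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1

with_intercept = fit_simple_linear(xs, ys, fit_intercept=True)
without_intercept = fit_simple_linear(xs, ys, fit_intercept=False)
```

The same data gives different learned parameters depending on the hyperparameter: with the intercept the fit recovers slope 2 and intercept 1, while without it the slope is pulled up to compensate.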
Difference between regression and classification
Regression is when the outcome variable we are trying to predict is numeric.
Classification is when the outcome variable is categorical.
What is the loss function?
A quantitative measure of how close the prediction yp is to the true value y. The update rule determines how to update the model parameters, i.e. how to find the parameters that minimise this loss function.
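One possible illustration of a loss function paired with an update rule, assuming mean squared error as the loss and plain gradient descent as the update rule. The flashcard does not name either, so both are choices made here, as are the learning rate, step count, and toy data.

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: average squared gap between y and yp."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / len(y_true)

def gradient_step(w, b, xs, ys, lr):
    """One gradient-descent update for the model yp = w * x + b under MSE."""
    n = len(xs)
    preds = [w * x + b for x in xs]
    grad_w = (2 / n) * sum((yp - y) * x for x, y, yp in zip(xs, ys, preds))
    grad_b = (2 / n) * sum(yp - y for y, yp in zip(ys, preds))
    # Move each parameter a small step against its gradient.
    return w - lr * grad_w, b - lr * grad_b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 2x + 1

w, b = 0.0, 0.0  # arbitrary starting parameters
losses = []
for _ in range(2000):
    losses.append(mse_loss(ys, [w * x + b for x in xs]))
    w, b = gradient_step(w, b, xs, ys, lr=0.05)
```

The recorded `losses` shrink as the parameters approach w = 2 and b = 1, which is what "finding the parameters that minimise the loss" looks like in practice for this toy case.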
What is the Pearson correlation?
A measure of the strength of the linear relationship between two variables, ranging from -1 (perfect negative) to 1 (perfect positive)
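A short sketch of the Pearson correlation computed from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample lists are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

perfect_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # y = 2x
perfect_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # y = 10 - 2x
nonlinear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared
```

The last case is worth remembering: y depends entirely on x, yet the coefficient is zero, because Pearson only measures the linear part of a relationship.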