Data Integration, Learning from data, Supervised Learning Flashcards
Define Data Integration… Why is it needed?
The process of combining data from heterogeneous sources into a single, coherent data store.
Data sources are usually disparate and siloed. Data integration enables the access and interpretation of data from different sources and types.
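As a minimal sketch of the idea, assume two hypothetical sources: a CSV export of customers from a legacy system and a JSON feed of payments from a newer one. The field names (`customer_id`, `name`, `amount`) and the `integrate` helper are invented for illustration.

```python
import csv
import io
import json

# Hypothetical source 1: a CSV export from a legacy system.
CUSTOMERS_CSV = "customer_id,name\n1,Ada\n2,Grace\n"

# Hypothetical source 2: a JSON feed from a newer billing service.
PAYMENTS_JSON = """[
  {"customer_id": 1, "amount": 9.99},
  {"customer_id": 2, "amount": 4.5},
  {"customer_id": 1, "amount": 9.99}
]"""


def integrate(customers_csv, payments_json):
    """Combine both sources into one store keyed by customer id."""
    store = {}
    for row in csv.DictReader(io.StringIO(customers_csv)):
        cid = int(row["customer_id"])  # CSV values are strings; normalise the key type
        store[cid] = {"name": row["name"], "payments": []}
    for payment in json.loads(payments_json):
        store[payment["customer_id"]]["payments"].append(payment["amount"])
    return store


store = integrate(CUSTOMERS_CSV, PAYMENTS_JSON)
```

After this runs, `store` holds one record per customer with data drawn from both sources, which is the "single, coherent data store" in the definition.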
What are the 5 main ways of integrating data? Describe each…
Common User Interface : Manual data integration by a data manager from retrieval to presentation.
Middleware Data Integration : A piece of Middleware that facilitates integration between systems, usually between legacy and new systems.
Application-Based Integration : A Software Application that locates, retrieves and integrates data into storage. Essentially, conducting the entire process, as opposed to Middleware.
Uniform Data Access : Provides a consistent view of data from a variety of sources, but leaves the data in the source systems rather than copying or manipulating it.
Common Data Storage : Copies data from the source systems into a separate central store, e.g. a Data Warehouse.
For each type of Data Integration process, give a pro and a con…
Common User Interface :
Pro = Total control and handling.
Con = Poor scaling.
Middleware Data Integration :
Pro = Automated integration.
Con = Must be maintained.
Application-Based Integration :
Pro = Automated end-to-end process.
Con = Complex setup.
Uniform Data Access :
Pro = Low storage requirements.
Con = Host systems can struggle with the number of data requests.
Common Data Storage :
Pro = Reduces burden on host system.
Con = Increased storage costs.
What are the 3 categories of learning from data? Define each…
Supervised : Learning that has an Input set and a matching Output set. The goal is to establish the mapping function from inputs to outputs that predicts the target most accurately, whether the target is continuous (regression) or a category (classification).
Unsupervised : Learning in which we only have input, and we are tasked with making sense of it.
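Semi-supervised : Learning in which only some of the inputs have outputs (labels), so the model learns from a mix of labelled and unlabelled data.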
What is the most common type of learning?
Supervised
What are the 2 types of Supervised Learning?
Regression
Classification
Define Regression…
The process of predicting a continuous target or outcome.
Define Classification…
The process of predicting which discrete category (class) each input belongs to.
In Supervised Learning, what are we trying to find?
The mapping function.
What are the inputs of a supervised learning model?
Features, covariates, predictors etc.
What are the outputs of a supervised learning model?
Target, label, response etc.
Give an example of a usage of Supervised Learning. Define the Inputs and Outputs.
Inputs : a set of photos, some of which show dogs. Outputs : a boolean label for each photo (dog or not a dog). The model predicts whether each photo shows a dog.
Define Unsupervised Learning…
The process of making sense of a data set by recognising patterns.
Define a Regression Problem and give an example…
A problem in which we need to predict a continuous target or outcome, e.g. predicting the price of a house from its floor area.
Define what is meant by inputs, outputs and parameter variables of a Mapping Function…
Inputs : The feature values passed into the function.
Parameters : The values that will change as the model learns from the data.
Output : The predicted value
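A minimal sketch of the three roles, assuming the mapping function takes the simple linear form `b0 + b1 * x`; the name `mapping_function` and the numbers are illustrative only.

```python
def mapping_function(x, b0, b1):
    """Return the predicted output for input x, given parameters b0 and b1."""
    return b0 + b1 * x


# Input: x = 3.0. Parameters: b0 = 1.0, b1 = 2.0. Output: the prediction.
prediction = mapping_function(3.0, b0=1.0, b1=2.0)
```

Here `x` is the input, `b0` and `b1` are the parameters that learning would adjust, and `prediction` is the output.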
Regarding a data frame in linear regression, what does a column represent? And what does a row represent?
Each column is a Feature of the input data, e.g. join date, monthly price etc.
Each row is an Observation, e.g. one customer.
Regarding the input and output data frames, what is the difference in their number of rows and columns?
The number of rows is always the same in both, N, where N is the number of observations.
The number of columns in the output Y will be 1. The number of columns in the input X equals the number of features.
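To make the shapes concrete, here is a small sketch using plain Python lists in place of a data frame. The customer values and the `shape` helper are made up for illustration.

```python
# Hypothetical customer data: each row is one observation,
# each column of X is one feature (months since joining, monthly price).
X = [
    [12, 9.99],
    [3, 4.50],
    [24, 14.99],
]

# Y has a single column: the target value for each observation (made-up numbers).
Y = [
    [120.0],
    [15.0],
    [350.0],
]


def shape(table):
    """Return (rows, columns) of a list-of-lists table."""
    return len(table), len(table[0])


x_shape = shape(X)  # N rows, one column per feature
y_shape = shape(Y)  # N rows, 1 column
```

Both tables have the same number of rows; only the column counts differ.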
Define what parameters are in a mapping function.
Values that are adjusted as the model learns from the data, to give us the most accurate regression line.
What makes good parameters?
Parameter values that give us the most accurate model, i.e. predictions closest to the actual values.
Define Hyper-Parameters…
Settings that we select ourselves, typically before or between training runs, e.g. the learning rate. These are not learned from the data.
What are the 2 phases of learning the parameters? Define each…
Training phase : Use past data to find quality parameters. More representative past data generally gives better parameters.
Prediction phase : Run new data through our model, and assess accuracy via a Loss Function.
What is a Loss Function and when is it used?
A Loss Function is used whenever the model's predictions can be compared with known actual values: during training, and when checking predictions on new data.
Its purpose is to output a value that represents how far our model's output values are from the actual values (lower is better). Parameters are then updated depending on the Loss Function's output.
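The cards do not name a specific Loss Function. As an assumption, the sketch below uses mean squared error, a common choice for regression; the data values are made up.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0]
good_predictions = [3.0, 5.0, 7.0]  # matches the actual values exactly
poor_predictions = [4.0, 4.0, 9.0]  # off by 1, 1 and 2

perfect_loss = mean_squared_error(actual, good_predictions)
poor_loss = mean_squared_error(actual, poor_predictions)
```

The closer the predictions are to the actual values, the smaller the loss; a perfect fit gives a loss of zero.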
What is the reason for updating the parameters after the prediction phase? What determines the values to update to?
To improve the prediction accuracy of the model.
The Loss Function's output determines the update: parameters are changed in the direction that reduces the loss.
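The cards do not say how the update is computed. One common approach, assumed here purely for illustration, is a gradient descent step on a mean squared error loss. The helper names, data and learning rate are hypothetical.

```python
def mse(xs, ys, b0, b1):
    """Mean squared error of the line b0 + b1 * x against the targets ys."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(xs, ys, b0, b1, learning_rate):
    """Move b0 and b1 one step against the gradient of the MSE loss."""
    n = len(xs)
    errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = 2 * sum(errors) / n
    grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    return b0 - learning_rate * grad_b0, b1 - learning_rate * grad_b1


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # generated from y = 1 + 2x
b0, b1 = 0.0, 0.0     # arbitrary starting parameters

loss_before = mse(xs, ys, b0, b1)
b0, b1 = gradient_step(xs, ys, b0, b1, learning_rate=0.05)
loss_after = mse(xs, ys, b0, b1)
```

After one step the loss is lower than before. The learning rate is a value we choose ourselves rather than one learned from the data, which makes it a hyper-parameter.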
Give a simple Linear Regression function…
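y = b0 + b1 * x + e, where b0 is the Y-intercept, b1 is the slope and e is the error term. The fitted line used for prediction is ŷ(x) = b0 + b1 * x.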
In order to find the linear regression value, what 2 properties of the linear regression line do we need to know?
Slope of the line.
Y-intercept.
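What is the equation for finding the Y-intercept of the simple regression line?
b0 = y mean - (b1 * x mean)
What is the equation to find the slope of the simple linear regression line?
b1 = r * ( sy / sx ), where r is the Pearson Correlation between x and y, and sx and sy are the standard deviations of x and y.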
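The slope and Y-intercept formulas can be checked with a short sketch. It assumes sample standard deviations (n - 1 in the denominator) and uses the means of x and y for the intercept; the function name and data are illustrative.

```python
from math import sqrt


def fit_simple_regression(xs, ys):
    """Return (b0, b1) using b1 = r * (sy / sx) and b0 = y_mean - b1 * x_mean."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sample standard deviations of x and y.
    sx = sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    # Pearson correlation: sample covariance divided by the product of sx and sy.
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (sx * sy)
    b1 = r * (sy / sx)
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Data generated from y = 1 + 2x, so we expect b0 close to 1 and b1 close to 2.
b0, b1 = fit_simple_regression([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```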
Define the Pearson Correlation…
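A measure of the strength and direction of the linear relationship between 2 variables, ranging from -1 (perfect negative) through 0 (no linear relationship) to 1 (perfect positive).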
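For reference, the standard sample formula, writing the two samples as x_1..x_n and y_1..y_n with means x̄ and ȳ (notation assumed here, not taken from the cards):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```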