Data Integration, Learning from data, Supervised Learning Flashcards by Jason Swift

Define Data Integration… Why is it needed?

The process of combining data from heterogeneous sources into a single, coherent data store.

Data sources are usually disparate and siloed. Data integration enables the access and interpretation of data from different sources and types.
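
As a rough illustration of the definition above, the sketch below combines two differently formatted sources into one list of records. The source contents, the field names and the `integrate` helper are all hypothetical and exist only for this example.

```python
import csv
import io
import json

# Two hypothetical sources holding customer data in different formats.
CRM_CSV = "customer_id,name\n1,Ada\n2,Grace\n"
BILLING_JSON = '[{"id": 1, "monthly_price": 9.99}, {"id": 2, "monthly_price": 19.99}]'


def integrate(crm_csv: str, billing_json: str) -> list:
    """Combine both sources into one list of records keyed on customer id."""
    # Index the billing records by id so they can be matched to CRM rows.
    billing = {row["id"]: row for row in json.loads(billing_json)}
    store = []
    for row in csv.DictReader(io.StringIO(crm_csv)):
        customer_id = int(row["customer_id"])  # CSV values arrive as strings
        store.append({
            "customer_id": customer_id,
            "name": row["name"],
            "monthly_price": billing.get(customer_id, {}).get("monthly_price"),
        })
    return store


for record in integrate(CRM_CSV, BILLING_JSON):
    print(record)
```

Each record in the combined store now carries fields that originally lived in separate, siloed sources.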

What are the 5 main ways of integrating data? Describe each…

Common User Interface : Manual data integration by a data manager from retrieval to presentation.

Middleware Data Integration : A piece of middleware that facilitates integration between systems, usually between legacy and new systems.

Application-Based Integration : A software application that locates, retrieves and integrates data into storage. Unlike middleware, it conducts the entire process itself.

Uniform Data Access : Provides a consistent view of data from a variety of sources, but doesn’t retrieve or manipulate the data.

Common Data Storage : Copies data from the source systems into a single, separate store, e.g. a Data Warehouse.
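
To make the difference between the last two approaches more concrete, here is a minimal sketch that assumes the source systems can be modelled as in-memory dictionaries. `UniformAccessView` and `build_common_store` are hypothetical names invented for this illustration, not part of any real integration product.

```python
# Two hypothetical source systems, represented here as in-memory dictionaries.
LEGACY_SYSTEM = {"alice": {"plan": "basic"}}
NEW_SYSTEM = {"bob": {"plan": "premium"}}


class UniformAccessView:
    """Uniform data access: one consistent interface, data stays in the sources."""

    def __init__(self, *sources):
        self.sources = sources

    def get(self, key):
        # Each lookup is forwarded to the source systems at request time.
        for source in self.sources:
            if key in source:
                return source[key]
        return None


def build_common_store(*sources):
    """Common data storage: copy every record into one separate store."""
    store = {}
    for source in sources:
        store.update(source)
    return store


view = UniformAccessView(LEGACY_SYSTEM, NEW_SYSTEM)
warehouse = build_common_store(LEGACY_SYSTEM, NEW_SYSTEM)

NEW_SYSTEM["carol"] = {"plan": "basic"}   # a record added after the copy was made
print(view.get("carol"))       # the view reads the live source
print(warehouse.get("carol"))  # the copied store has not seen the new record
```

The view needs no storage of its own but queries the sources on every request, whereas the copied store answers requests by itself at the cost of duplicating the data.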

For each type of Data Integration process, give a pro and a con…

Common User Interface :
Pro = Total control over data handling.
Con = Poor scaling.
Middleware Data Integration :
Pro = Automated integration.
Con = Must be maintained.
Application-Based Integration :
Pro = Automated end-to-end process.
Con = Complex setup.
Uniform Data Access :
Pro = Low storage requirements.
Con = Host systems can struggle with the number of data requests.
Common Data Storage :
Pro = Reduces burden on host system.
Con = Increased storage costs.

What are the 3 categories of learning from data? Define each…

Supervised : Learning from data that has an input set and a matching output set. The goal is to establish the mapping function that most accurately predicts the target, whether that target is continuous (regression) or a category (classification).

Unsupervised : Learning in which we only have input, and we are tasked with making sense of it.

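Semi-supervised : Learning from a small amount of labelled data (inputs with outputs) combined with a larger amount of unlabelled data (inputs only).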

What is the most common type of learning?

Supervised

What are the 2 types of Supervised Learning?

Regression
Classification

Define Regression…

The process of predicting a continuous target or outcome from the inputs.

Define Classification…

The process of predicting which discrete category (class) each input belongs to.

In Supervised Learning, what are we trying to find?

The mapping function.

What are the inputs of a supervised learning model?

Features, covariates, predictors etc.

What are the outputs of a supervised learning model?

Target, label, response etc.

Give an example of a usage of Supervised Learning. Define the Inputs and Outputs.

Inputs : A set of photos, some of which show dogs.
Outputs : A boolean label for each photo (dog or not a dog).
The model learns from these pairs to predict whether a new photo shows a dog.
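
A toy version of this example can be sketched as follows. It assumes each photo has already been reduced to two made-up numeric features, and it uses a 1-nearest-neighbour rule as the model purely for illustration; the cards do not prescribe any particular algorithm.

```python
import math

# Hypothetical training data: each photo is reduced to two made-up numeric
# features, and each label records whether the photo shows a dog.
train_inputs = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
train_outputs = [True, True, False, False]


def predict_is_dog(features):
    """Return the label of the closest training example (1-nearest-neighbour)."""
    distances = [math.dist(features, known) for known in train_inputs]
    nearest = distances.index(min(distances))
    return train_outputs[nearest]


print(predict_is_dog((0.85, 0.75)))  # closest to the examples labelled True
print(predict_is_dog((0.15, 0.25)))  # closest to the examples labelled False
```

The inputs are the feature tuples, the outputs are the boolean labels, and the function plays the role of the learned mapping between them.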

Define Unsupervised Learning…

The process of making sense of a data set by recognising patterns.

Define a Regression Problem and give an example…

A problem in which we need to predict a continuous target or outcome, e.g. predicting a house's sale price from its floor area.

Define what is meant by inputs, outputs and parameter variables of a Mapping Function…

Inputs : The feature values passed into the function.
Parameters : The values that will change as the model learns from the data.
Output : The predicted value
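
The three roles can be seen in a small sketch. It assumes a straight-line mapping function with two parameters; the parameter values are hypothetical and were not learned from any data.

```python
def mapping_function(x, parameters):
    """Map an input x to a predicted output using the current parameters."""
    b0, b1 = parameters          # parameters: values adjusted during learning
    return b0 + b1 * x           # output: the predicted value


parameters = (1.0, 2.0)          # hypothetical values, not learned from any data
print(mapping_function(3.0, parameters))  # 1.0 + 2.0 * 3.0 = 7.0
```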

Regarding a data frame in linear regression, what does a column represent? And what does a row represent?

Each column is a Feature of the input data, e.g. join date, monthly price etc.

Each row is an Observation, e.g. one customer.

Regarding the input and output data frames, what is the difference in their number of rows and columns?

Both have the same number of rows, N, where N is the number of observations.

The number of columns in the output Y will be 1. The number of columns in input X will vary depending on features.
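
A small sketch of these shapes, using plain Python lists in place of a data-frame library. The feature names and all values are made up for illustration.

```python
# Hypothetical customer data: one row per observation, one column per feature.
X = [
    # tenure_months, monthly_price
    [12, 9.99],
    [3, 19.99],
    [24, 14.99],
]
Y = [[120.0], [60.0], [360.0]]   # one made-up target value per observation


def shape(table):
    """Return (number of rows, number of columns) for a list-of-lists table."""
    return len(table), len(table[0])


print(shape(X))  # (3, 2): N rows, one column per feature
print(shape(Y))  # (3, 1): N rows, a single target column
```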

Define what parameters are in a mapping function.

Values that change during training, as the model learns from the data, to give us the most accurate regression line.

What makes good parameters?

Good parameters are the values that give us the most accurate model, i.e. predictions that are as close as possible to the actual values.

Define Hyper-Parameters…

Settings that we select ourselves, typically before training and adjusted between training runs. These are not learned from the data.

What are the 2 phases of learning the parameters? Define each…

Training phase : Use past data to find good parameters. In general, the more past data used, the better the parameters.

Prediction phase : Feed new data into the model and assess the accuracy of its predictions via a Loss Function.

What is a Loss Function and when is it used?

A Loss Function is used in the Prediction Phase of Machine Learning.

Its purpose is to output a value that represents how close our model's output value is to the actual value. Parameters are then updated depending on the Loss Function's output.
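
The cards do not name a specific Loss Function. As one common choice for regression, the sketch below assumes mean squared error, which returns 0 for perfect predictions and grows as predictions move away from the actual values.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and actual values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


actual = [3.0, 5.0, 7.0]                             # made-up actual values
print(mean_squared_error([3.0, 5.0, 7.0], actual))   # perfect predictions
print(mean_squared_error([2.0, 5.0, 9.0], actual))   # (1 + 0 + 4) / 3
```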

What is the reason for updating the parameters after the prediction phase? What determines the values to update to?

To improve the prediction accuracy of the model.

The Loss Function's output guides the update: the parameters are adjusted towards values that reduce the loss.
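
One way such an update can work is gradient descent, sketched below for a straight-line model with a mean squared error loss. Both the algorithm and the loss are assumptions made for this illustration, since the cards name neither, and the data and learning rate are made up.

```python
def loss(b0, b1, xs, ys):
    """Mean squared error of the line b0 + b1 * x on the data."""
    return sum(((b0 + b1 * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(b0, b1, xs, ys, learning_rate):
    """One gradient-descent update of (b0, b1) for the mean squared error loss."""
    n = len(xs)
    errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = 2 * sum(errors) / n
    grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    # Move each parameter a small step in the direction that lowers the loss.
    return b0 - learning_rate * grad_b0, b1 - learning_rate * grad_b1


xs, ys = [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]   # made-up data lying on y = 1 + 2x
b0, b1 = 0.0, 0.0
learning_rate = 0.05   # a hyper-parameter: chosen by us, not learned
for _ in range(2000):
    b0, b1 = gradient_step(b0, b1, xs, ys, learning_rate)
print(round(b0, 3), round(b1, 3))   # close to 1.0 and 2.0
```

Here the learning rate is an example of a hyper-parameter, while `b0` and `b1` are the parameters that the repeated updates change.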

Give a simple Linear Regression function…

y_b(x) = b0 + (b1 * x) + e, where b0 is the Y-intercept, b1 is the slope and e is the error term.

In order to find the linear regression value, what 2 properties of the linear regression line do we need to know?

Slope of the line. Y-intercept.

What is the equation for finding the Y-intercept of the simple regression line?

b0 = y mean - (b1 * x mean)

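What is the equation to find the slope of the simple linear regression line?

b1 = r * ( sy / sx ), where r is the Pearson Correlation between x and y, and sy and sx are the standard deviations of y and x.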

Define the Pearson Correlation...

A measure of the strength and direction of the linear relationship between 2 samples. It ranges from -1 to 1.
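
The slope, intercept and correlation cards above can be combined into a short worked sketch. It assumes the intercept is computed from the sample means and that sy and sx are sample standard deviations; the helper names and the data are hypothetical.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equally long samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # Sum of the products of each pair's deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


def fit_simple_regression(xs, ys):
    """Return (b0, b1) using b1 = r * (sy / sx) and b0 = mean(y) - b1 * mean(x)."""
    r = pearson_r(xs, ys)
    b1 = r * (statistics.stdev(ys) / statistics.stdev(xs))
    b0 = statistics.mean(ys) - b1 * statistics.mean(xs)
    return b0, b1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on y = 1 + 2x
b0, b1 = fit_simple_regression(xs, ys)
print(round(pearson_r(xs, ys), 6), round(b0, 6), round(b1, 6))
```

Because the made-up points lie exactly on a rising line, the correlation comes out at about 1, and the fitted intercept and slope recover the line's 1 and 2 up to rounding.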

(28 cards)