Learning from data Flashcards
Define data integration and the goal of data integration
Data integration is the practice of combining data from heterogeneous sources into a single coherent
data store.
Its goal is to provide users with consistent access to, and delivery of, data across a spectrum of
subjects and data structure types.
Define a common user interface (manual data integration)
A hands-on approach where data
managers manually handle every step of the integration, from retrieval to presentation.
Define middleware data integration
Uses middleware software to bridge and facilitate communication between different systems, especially between legacy and newer systems
Define application-based data integration
Software applications locate, retrieve and integrate data by making
data from different sources and systems compatible with one another
Define uniform data access
It provides a consistent view of data from diverse sources without moving or
altering it, keeping the data in its original location.
Define common data storage (data warehousing)
It retrieves and presents data uniformly while creating and storing a duplicate copy, often in a central repository.
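The contrast between uniform data access and common data storage can be sketched in a few lines. This is a toy illustration rather than a real integration tool: the in-memory lists `crm_rows` and `billing_rows` and both function names are hypothetical stand-ins for real source systems.

```python
import copy

# Hypothetical source systems, modelled as in-memory lists of records.
crm_rows = [{"customer": "Ada", "city": "Exeter"}]
billing_rows = [{"customer": "Ada", "balance": 120.0}]


def uniform_view():
    """Uniform data access: read from the sources on every call, storing nothing."""
    balances = {row["customer"]: row["balance"] for row in billing_rows}
    return [
        {**row, "balance": balances.get(row["customer"])}
        for row in crm_rows
    ]


def build_warehouse():
    """Common data storage: take a duplicate snapshot held in a central store."""
    return copy.deepcopy(uniform_view())


warehouse = build_warehouse()

# A later change in one of the source systems.
billing_rows[0]["balance"] = 75.0

live_balance = uniform_view()[0]["balance"]   # reflects the source immediately
stored_balance = warehouse[0]["balance"]      # still the value at copy time
```

In this sketch every call to `uniform_view` goes back to the source systems, while the warehouse copy can be queried freely but takes extra storage and goes stale until it is refreshed.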
What is a pro and a con for a common user interface
Reduced cost, requires little maintenance, integrates a small number of data sources, user has total control.
Data must be handled manually at each stage, scaling to larger projects requires changing code, manual orchestration.
What is a pro and a con for middleware data integration
Middleware software conducts the integration automatically, and the same way each time.
Middleware needs to be deployed and maintained.
What is a pro and a con for application-based integration
Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.
Requires specialist technical knowledge and
maintenance, complicated setup.
What is a pro and a con for uniform data access
Lower storage requirements, provides a simplified view of the data to the end user, easier data access
Can compromise data integrity, data host systems are often not designed to handle the amount and frequency of data requests.
What is a pro and a con for common data storage (data warehousing)
Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity
Need to find a place to store a copy of the data, increases storage cost, requires technical experts to set up the integration and to oversee and maintain the data warehouse.
What is the difference between supervised and unsupervised learning? Also what is Semi-Supervised learning?
Supervised learning algorithms use data with labelled outcomes while unsupervised learning algorithms use data without labelled outcomes.
Semi-supervised learning algorithms use both data with labelled outcomes and without labelled outcomes.
What is the task of supervised learning?
The task is to learn a mapping function from possible inputs to outputs.
What is the task of unsupervised learning
In unsupervised learning, our task is to try to “make sense of” data, as opposed to learning a mapping. This is because we have inputs but no associated responses.
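A minimal sketch of the two tasks on the same one-dimensional inputs. The threshold classifier and the largest-gap split are deliberately simple techniques chosen for illustration, and all names and data here are hypothetical.

```python
from statistics import mean


def fit_threshold(xs, labels):
    """Supervised: use labelled examples to learn a mapping from x to a class.

    Assumes two classes, 0 and 1, with class 1 tending to have larger x.
    """
    mean0 = mean(x for x, label in zip(xs, labels) if label == 0)
    mean1 = mean(x for x, label in zip(xs, labels) if label == 1)
    return (mean0 + mean1) / 2


def classify(threshold, x):
    """Apply the learned mapping to an input."""
    return 1 if x > threshold else 0


def split_at_largest_gap(xs):
    """Unsupervised: no labels, so look for structure in the inputs alone."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


xs = [1.0, 1.5, 8.0, 9.0]
labels = [0, 0, 1, 1]

threshold = fit_threshold(xs, labels)   # learned from inputs and labels
prediction = classify(threshold, 2.0)   # mapping applied to a new input
groups = split_at_largest_gap(xs)       # structure found without labels
```

The supervised half needs `labels` to learn anything, whereas the unsupervised half only ever sees `xs`.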
Strictly define a hyperparameter
A hyperparameter is a parameter that is not learned directly from the data but is set before training and relates to
how the model is implemented or trained.
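To make the distinction concrete, here is a small sketch in which the slope `w` is a parameter computed from the data, while `penalty` is a hyperparameter chosen by the user. The ridge-style penalty is just one common example of a hyperparameter, picked here for illustration; the function name and data are hypothetical.

```python
def fit_ridge_slope(xs, ys, penalty):
    """Fit y = w * x by minimising squared error plus penalty * w**2.

    w is a parameter: it is computed from the data.
    penalty is a hyperparameter: it is fixed by the user before fitting.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_no_penalty = fit_ridge_slope(xs, ys, penalty=0.0)
w_with_penalty = fit_ridge_slope(xs, ys, penalty=14.0)
```

The same data gives a different learned `w` depending on the `penalty` chosen up front, which is why hyperparameters are usually tuned separately from training.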
Define the training and prediction phase in an ML model
In the training phase, an ML model learns the parameters that define the relationship between the features and the outcome variable. Generally, the more training data, the better.
In the prediction phase we get new observations, feed these values into our
trained model, and we have a prediction.
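The two phases can be sketched with a one-feature linear model. This is an illustrative example, assuming a least-squares fit; the function names and the hours/scores data are hypothetical.

```python
def train(xs, ys):
    """Training phase: learn intercept and slope from observed (x, y) pairs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept, slope


def predict(params, x_new):
    """Prediction phase: apply the learned parameters to a new observation."""
    intercept, slope = params
    return intercept + slope * x_new


# Hypothetical training data: hours studied (feature) and exam score (outcome).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

params = train(hours, scores)            # training phase
predicted_score = predict(params, 5.0)   # prediction phase on a new observation
```

Note that `predict` never sees the training data, only the parameters that `train` produced.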
What can we use to measure the quality of our predictions
Most models will define a loss function which is some quantitative measure
of how close our prediction is to the actual value. In addition, there will also be an update rule that determines how to adjust the model parameters to reduce the loss.
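As a rough sketch of how a loss function and an update rule work together, the example below fits a single weight `w` in the model y = w * x. It assumes mean squared error as the loss and plain gradient descent as the update rule, which is only one possible pairing; the names, data and learning rate are hypothetical.

```python
def mse_loss(w, xs, ys):
    """Loss: mean squared difference between predictions w * x and actual y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def update(w, xs, ys, learning_rate):
    """Update rule: move w a small step against the gradient of the loss."""
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - learning_rate * gradient


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the ideal w is 2

w = 0.0
initial_loss = mse_loss(w, xs, ys)
for _ in range(100):
    w = update(w, xs, ys, learning_rate=0.05)
final_loss = mse_loss(w, xs, ys)
```

Each pass through the loop uses the loss's gradient to nudge `w`, so the loss shrinks as `w` approaches 2.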
Define the difference between regression and classification
Classification deals with assigning observations to one of a set of predetermined classes. Regression deals with predicting a continuous outcome, e.g. loss, revenue, number of years, or anything that answers the question "how much?"
Look over calculating linear regression coefficients by hand
https://ele.exeter.ac.uk/pluginfile.php/4546128/mod_resource/content/0/LfD-L2.pdf
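As a reminder of what the by-hand calculation typically involves, here is the standard least-squares result for a simple linear regression with fitted line ŷ = β̂₀ + β̂₁x, followed by a small worked example. The notation and the four data points are assumptions for illustration and may differ from the linked slides.

```latex
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

% Worked example with x = (1, 2, 3, 4) and y = (2, 3, 5, 6):
\[
\bar{x} = 2.5, \quad \bar{y} = 4, \quad
\sum (x_i - \bar{x})(y_i - \bar{y}) = 3 + 0.5 + 0.5 + 3 = 7, \quad
\sum (x_i - \bar{x})^2 = 2.25 + 0.25 + 0.25 + 2.25 = 5
\]
\[
\hat{\beta}_1 = \frac{7}{5} = 1.4,
\qquad
\hat{\beta}_0 = 4 - 1.4 \times 2.5 = 0.5
\]
```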
Define the mean squared error function and how it works
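The Mean Squared Error (MSE) function is the sum of squared errors divided by the number of values. Each error is the difference between an actual value and its prediction; squaring makes every error positive and penalises large errors more heavily.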
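A minimal sketch of the calculation, using made-up values (the function name and data are illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [
        (actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)
    ]
    return sum(squared_errors) / len(squared_errors)


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
```

The prediction that misses by 2 contributes four times as much as the one that misses by 1.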
What is the difference between explained and unexplained variation in a regression model?
The explained variation measures how much of the total variation is captured by the regression model, i.e., how much of the variation in y can be explained by the independent variable(s) x.
The unexplained variation measures the variability in the dependent variable that is not captured by the regression model. It is also known as the error or residual variation.
Total variation is the explained variation plus the unexplained variation.
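The decomposition can be checked numerically. The sketch below uses illustrative data; the function name is hypothetical, and the identity total = explained + unexplained is assumed to apply because `y_pred` comes from a least-squares line with an intercept.

```python
def variation_components(y, y_pred):
    """Return (total, explained, unexplained) variation for fitted values y_pred."""
    y_mean = sum(y) / len(y)
    total = sum((yi - y_mean) ** 2 for yi in y)
    explained = sum((pi - y_mean) ** 2 for pi in y_pred)
    unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    return total, explained, unexplained


# Illustrative data; y_pred holds the fitted values of the least-squares line
# y = 0.5 + 1.4x evaluated at x = 1, 2, 3, 4.
y = [2.0, 3.0, 5.0, 6.0]
y_pred = [1.9, 3.3, 4.7, 6.1]

total, explained, unexplained = variation_components(y, y_pred)
share_explained = explained / total  # often reported as R squared
```

Here the total variation of about 10 splits into roughly 9.8 explained and 0.2 unexplained, so the line captures about 98% of the variation in y.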
What is the primary objective for a prediction model?
The primary objective is to make the best prediction.
What is the main focus for prediction models?
The focus is on performance metrics, which measure the quality of the model’s predictions.
Performance metrics usually involve some measure of closeness between ypred and y.
Without focusing on interpretability, we risk ending up with a black-box model.
How can we determine the accuracy of a prediction model?
The closer the predicted values are to the observed
values, the more accurate the prediction is.
The further the predicted values are from the observed
values, the less accurate the prediction is.
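One way to put a number on "closeness" is the mean absolute error, used here purely as an example metric. The sketch compares two hypothetical sets of predictions against the same observed values.

```python
def mean_absolute_error(y, y_pred):
    """Average distance between observed and predicted values."""
    return sum(abs(yi - pi) for yi, pi in zip(y, y_pred)) / len(y)


y_observed = [10.0, 20.0, 30.0]
model_a_pred = [11.0, 19.0, 31.0]   # hypothetical model A
model_b_pred = [15.0, 14.0, 38.0]   # hypothetical model B

error_a = mean_absolute_error(y_observed, model_a_pred)
error_b = mean_absolute_error(y_observed, model_b_pred)

# The model whose predictions sit closer to the observations is more accurate.
more_accurate = "A" if error_a < error_b else "B"
```

Model A's predictions are each off by 1, while model B's are off by 5, 6 and 8, so A has the smaller error and is the more accurate of the two.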