Data Integration Flashcards
Data integration
Taking multiple data sets and bringing them together for a joint analysis. Main approaches:
Conceptual integration
Statistical integration
Model-based integration
Source-matched
The same source is used to collect different samples or types of information. For example, urine and blood samples are taken from the same people, and the resulting data sets are then integrated.
Split sample study
Each sample is split in half, and a different measurement is made on each half.
Problem: each sample needs to be large enough to split.
Conceptual integration
Multiple omics data sets are analysed separately, and the resulting conclusions are then matched without any further analysis of the data as a whole.
Statistical integration
Statistical associations are sought between elements of the different data sets. Types:
Correlation-based integration
Concatenation-based integration
Multivariate-based integration
Pathway-based integration
Model-based integration
A computational model built on prior biological knowledge is used to generate data that is not yet available.
Repeated study
The entire experiment is redone: not the same samples, and not at the same time.
Problem: the experiment is not reproducible.
Batch effects are introduced and cannot be corrected.
BUT the measurements are independent, which the analysis can take advantage of.
Replicate matched study
All samples are taken twice.
Differs critically from a repeated study: the samples for both omics are produced/obtained at the same time, so the introduction of batch effects is avoided.
NO statistical independence.
Use when it is not possible to split the sample!
Concatenation-based integration
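The data sets are placed one after another in a single matrix, and the analysis is performed on the combined matrix.
Problems: different omics have different distributions and background noise. They can also have different numbers of variables, which can give one data set too much weight (this can be normalized). Entities from the same data set tend to cluster together.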
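The normalization mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes NumPy is available, and the function name `concatenate_blocks`, the choice of autoscaling plus a 1/sqrt(number of variables) block weight, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def concatenate_blocks(blocks):
    """Autoscale each block, down-weight it by its size, then join column-wise.

    Each block is a samples x variables array; all blocks share the same samples.
    """
    scaled = []
    for block in blocks:
        block = np.asarray(block, dtype=float)
        centered = block - block.mean(axis=0)
        std = block.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
        z = centered / std  # every variable now has unit variance
        # Divide by sqrt(number of variables) so each block has the same total variance.
        scaled.append(z / np.sqrt(block.shape[1]))
    return np.hstack(scaled)

# Simulated data (hypothetical): 6 samples, 50 transcripts and 5 metabolites
rng = np.random.default_rng(0)
transcripts = rng.normal(loc=100.0, scale=20.0, size=(6, 50))
metabolites = rng.normal(loc=1.0, scale=0.1, size=(6, 5))

combined = concatenate_blocks([transcripts, metabolites])
print(combined.shape)
```

Without the block weight, the 50 transcript columns would contribute ten times more variance than the 5 metabolite columns.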
Correlation-based integration
Similarity measures are computed between elements of the 2 data sets.
Metabolites can have a positive/negative correlation under one set of conditions and none under another. These relations can cancel out, so the correlation is hidden when the 2 sets of conditions are put together.
Correlation can also be used to compare things through time. The profiles must first be aligned because the changes do not occur simultaneously: dynamic time warping.
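Dynamic time warping can be illustrated with the classic dynamic-programming recurrence. This is a minimal sketch in plain Python: the function name `dtw_distance`, the absolute-difference cost, and the two toy time profiles are assumptions chosen for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: repeat a point of b, repeat a point of a, advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical profiles: the metabolite peaks two time points after the gene
gene = [0, 1, 3, 1, 0, 0, 0]
metabolite = [0, 0, 0, 1, 3, 1, 0]

pointwise = sum(abs(x - y) for x, y in zip(gene, metabolite))
print(pointwise)                          # 8: the shift makes the profiles look different
print(dtw_distance(gene, metabolite))     # 0.0: after warping they match exactly
```

The point-by-point comparison penalizes the time shift, while the warped alignment recognizes that both profiles have the same shape.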
Multivariate-based integration
PCA: principal component analysis (score plot with an ellipse). Each direction captures the most remaining variance; copes well with collinearity.
Partial least squares (PLS) is related to PCA: a mathematical model links the 2 matrices.
Principal components (PCA) and latent variables (PLS) represent the covariance.
PLS-DA is a regression model used to predict classes.
O-PLS allows the model to be rotated so that directions are orthogonal or parallel to the class separation: what is not important for separating the classes lies on the vertical axis, and what is important on the horizontal axis. This makes the classes easier to distinguish.
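As an illustration of the PCA part only (not PLS or O-PLS), principal components can be computed from a singular value decomposition of the mean-centred data. This is a minimal sketch assuming NumPy is available; the function name `pca` and the simulated collinear data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the mean-centred samples x variables matrix X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates (score plot)
    loadings = Vt[:n_components].T                    # direction of each component
    explained = s**2 / np.sum(s**2)                   # fraction of variance per component
    return scores, loadings, explained[:n_components]

# Simulated data (hypothetical): three strongly collinear variables plus a little noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(20, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.01 * rng.normal(size=(20, 3))

scores, loadings, explained = pca(X)
print(scores.shape)
print(explained)
```

Because the three variables are collinear, the first component captures almost all of the variance, which is the sense in which the method copes well with collinearity.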
Pathway-based integration
Pathways contain genes/proteins and metabolites.
Easy to interpret.