Bioinformatics for Stratified Medicine Flashcards
What are examples of large datasets?
- MIMIC-III: de-identified health data from ~40K critical care patients, including demographics, vital signs, laboratory tests, medications and notes.
- The Cancer Genome Atlas (TCGA)
- National initiatives: 100,000 Genomes Project / Genomics England; President Obama’s initiative to create a 1 million person research cohort (Precision Medicine Initiative), which includes a baseline health exam, clinical data derived from electronic health records (EHRs), healthcare claims and laboratory data
- Biological databases: Hundreds of thousands of species to explore. Millions of written articles in scientific journals
Detailed genetic information: gene names, phenotype of mutants, location of genes/mutations on chromosomes, linkage (distances between genes)
High Throughput lab technologies:
PCR
Rapid inexpensive DNA sequencing (Illumina HiSeq)
Microarrays (Affymetrix)
Genome-wide SNP chips / SNP arrays (Illumina)
Must store data such that:
- Minimum data quality is checked
- Well annotated according to standards
- Made available to wide public to foster research
What is a database?
A collection of data that are
- Structured
- Searchable (index)
- Updated periodically (release)
- Cross-referenced (hyperlinks)
Databases are often categorised as primary or secondary. How do these differ?
- Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.
- Secondary databases comprise data derived from the results of analysing primary data.
What are essential aspects of primary and secondary databases?
Primary database-
Synonyms: Archival database
Source of data: Direct submission of experimentally derived data from researchers
Secondary database-
Synonyms: Curated database, knowledgebase
Source of data: Results of analysis, literature research and interpretation, often of data in primary databases
What are the challenges of databases?
Heterogeneous data sources (need for data fusion);
Complexity of the data (high-dimensionality);
Noisy, uncertain data, dirty data, the discrepancy between data-information-knowledge (various definitions)
Big data sets (when is data big? when manual handling of the data is impossible)
What is machine learning?
Development of algorithms which can learn from data
What is the difference between supervised and unsupervised machine learning?
Supervised/Prediction- Guided
Unsupervised/Discovery- Unguided
What issue can arise with supervised learning?
Incorrect conclusions may be drawn if an appropriate dataset and a specific question are not chosen.
What is a type 1 error?
false positive
What is a type 2 error?
false negative
What does sensitivity (also called the true positive rate, the recall, or the probability of detection in some fields) measure?
Measures the proportion of positives that are correctly identified as such (i.e. the percentage of sick people who are correctly identified as having the condition).
What does specificity (the true negative rate) measure?
The proportion of negatives that are correctly identified as such (i.e., the percentage of healthy people who are correctly identified as not having the condition).
What are the positive and negative predictive values (PPV and NPV respectively)?
The proportions of positive and negative results in statistics and diagnostic tests that are true positive and true negative results, respectively.
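The four measures above can be sketched from the counts of a confusion matrix. This is a minimal illustration, not a definitive implementation: the function name `diagnostic_metrics` and the patient counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all sick people
        "specificity": tn / (tn + fp),  # true negatives among all healthy people
        "ppv": tp / (tp + fp),          # true positives among all positive results
        "npv": tn / (tn + fn),          # true negatives among all negative results
    }


# Hypothetical test results for 100 sick and 900 healthy people.
# fp counts type 1 errors (false positives); fn counts type 2 errors (false negatives).
metrics = diagnostic_metrics(tp=90, fp=45, tn=855, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the test has high sensitivity and specificity, yet only two thirds of positive results are true positives, because healthy people greatly outnumber sick people.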
What is a receiver operating characteristic (ROC) curve?
A plot of the trade-off between sensitivity (TPR) and specificity (1 - FPR).
X axis: one minus the specificity (the false positive rate, i.e. the probability that a healthy person is flagged as positive).
Y axis: the sensitivity (the true positive rate, i.e. the probability that a sick person is flagged as positive).
Every point along the curve corresponds to exactly one cut-off.
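To see how each cut-off yields one point on the curve, the sketch below sweeps the cut-off over a small set of classifier scores. It assumes that a higher score means "more likely positive" and that a case is called positive when its score is at or above the cut-off. The function name `roc_points`, the scores and the labels are all hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per cut-off, plus the origin (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    # Use each distinct score as a cut-off, from strictest to most lenient.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores for six patients; label 1 = sick, 0 = healthy.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
```

Lowering the cut-off never decreases the sensitivity, but it also never decreases the false positive rate, which is the trade-off the curve displays.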
Which machine learning method requires no a priori hypothesis about the real number of clusters (groups) present and no additional information besides the data itself?
Discovery/ Unsupervised Learning
What is an example of a discovery problem?
Patient similarity problem: how can the physician categorise patients to allocate the best treatment?
How is similarity among genes expressed?
As a mathematical distance
Euclidean distance: length of a line segment between the two points. Linear associations
Manhattan distance: distance between two points measured along axes at right angles.
Correlation distance: measures both linear and non-linear associations
Genes close in the “expression space” have similar expression profiles
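The three distances can be sketched for two expression profiles as follows. This is an illustration under assumptions: the correlation distance is taken to be one minus the Pearson correlation, which is one common definition (the cards do not say which correlation coefficient is meant), and the two gene profiles are made up.

```python
import math


def euclidean(a, b):
    """Length of the straight line segment between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Distance measured along axes at right angles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_distance(a, b):
    """One minus the Pearson correlation (assumed definition)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (norm_a * norm_b)


# Hypothetical expression of two genes across four samples.
# gene_b has the same shape as gene_a at twice the magnitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print("Euclidean:  ", euclidean(gene_a, gene_b))
print("Manhattan:  ", manhattan(gene_a, gene_b))
print("Correlation:", correlation_distance(gene_a, gene_b))
```

The two profiles are far apart by Euclidean and Manhattan distance but essentially identical by correlation distance, so the choice of distance decides which genes count as "close" in the expression space.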
What are two unsupervised learning techniques / techniques for reducing the dimensionality of such datasets?
Principal Component Analysis (PCA)
Hierarchical clustering
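As a rough sketch of hierarchical clustering, the code below repeatedly merges the two closest clusters (the agglomerative, bottom-up variant). Several choices here are assumptions rather than anything stated in the cards: Euclidean distance, average linkage, the function name `hierarchical_clustering`, the made-up profiles, and stopping at three clusters purely so the result is easy to print.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(points, n_clusters):
    """Agglomerative clustering with average linkage; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical two-dimensional profiles for five patients.
profiles = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.5]]
groups = hierarchical_clustering(profiles, n_clusters=3)
print(groups)
```

Running the merges all the way down to a single cluster and recording their order would give the full tree, which can then be cut at any level without fixing the number of groups in advance.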
What is Principal component analysis?
An exploratory technique to simplify a dataset
It is a linear transformation that chooses a new coordinate system for the data set such that
- the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
- the second greatest variance on the second axis and so on
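The idea of the first principal component can be sketched without any libraries. The code centres the data, builds the covariance matrix, and uses power iteration to find the direction of greatest variance. Power iteration is a swapped-in technique chosen for brevity here, not necessarily how PCA is computed in practice. The function name `first_principal_component` and the data are hypothetical.

```python
import math


def first_principal_component(data, iterations=200):
    """Return the unit vector of greatest variance and the centred data."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centred


def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


# Hypothetical measurements of two strongly related variables.
data = [[2.0, 1.9], [4.0, 4.2], [6.0, 5.8], [8.0, 8.1]]
pc1, centred = first_principal_component(data)

# Projecting onto the first axis reduces two dimensions to one.
scores = [sum(x * c for x, c in zip(row, pc1)) for row in centred]
print("first principal component:", pc1)
print("variance along it:", variance(scores))
```

Because the two variables rise together, the first axis points roughly along the diagonal, and the variance of the one-dimensional scores exceeds the variance along either original axis.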
Principal component analysis can be used to…
Reduce number of dimensions in data
Find patterns in high-dimensional data
Visualise data of high dimensionality