Week 6.11.12 Visualisation Flashcards

Question 1

Q

6.1 Visualisation

Topics: tables, ideograms, genome browsers, Galaxy, CLC Workbench, integrative maps.

Reading: EPG Ch 4

Learning outcomes

Answer

A

Following this lecture (and attending the workshop and doing the associated private study) you should be able to:
Recognise why genome visualisation is important and appreciate the particular challenges that it poses. Demonstrate an awareness of the history of visualisation.
Identify visualisation methods that are available for human genome visualisation and how they may be used.
Apply web-based genome visualisation tools to analyse the human genome.
Describe some of the technical solutions for genome visualisation.

Question 2

Q

What is the point of having base pair reading visualisation?

Answer

A

Human chromosome 3 is approximately 198,022,480 base pairs in length.
This is 0.0005% of it (roughly 990 base pairs):

[IMAGE]

On its own, this kind of visualisation is of little use because there are no annotations.

Question 3

Q

Genome variation data

Much less data than a whole genome, but still meaningless in its raw form.

Answer

A

File from 23andMe

This lists the individual SNP locations and tells you which chromosome each SNP is on, which is slightly more readable. But these files are still long, and although they are more to the point, looking at one will not tell most people much.
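To make the point concrete, here is a minimal Python sketch of reading such a file. It assumes the commonly seen raw-data layout of '#' comment lines followed by tab-separated rsid, chromosome, position and genotype columns; the rows below are made up for illustration and are not real SNP records.

```python
from collections import Counter

# Made-up rows in the assumed layout: '#' comments, then
# rsid <TAB> chromosome <TAB> position <TAB> genotype.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t82154\tAA\n"
    "rs0000002\t1\t752566\tAG\n"
    "rs0000003\t3\t1234567\tCC\n"
    "rs0000004\tX\t2700157\t--\n"
)


def parse_snps(text):
    """Return one dict per SNP row, skipping comments and blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        rows.append(
            {
                "rsid": rsid,
                "chromosome": chromosome,
                "position": int(position),
                "genotype": genotype,
            }
        )
    return rows


snps = parse_snps(SAMPLE)
# Even a simple count per chromosome is easier to read than the raw rows.
per_chromosome = Counter(row["chromosome"] for row in snps)
print(per_chromosome)
```

Parsing is the easy part; the rows only become meaningful once each SNP is linked to an annotation.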

Question 4

Q

So what would be useful?
This depends on who you ask;

What does a clinician need?
What does a normal person need?
What does a scientist need? Do all researchers want the same thing?

Answer

A

A clinician wants high-level information of direct relevance to the patient’s health – something that they can use in their diagnosis.
A normal person needs actionable information that is easy to interpret, with explanations and lifestyle advice – they don’t need details of the science.
i.e. they just want to know what to look out for in their diet and what changes to make to their lifestyle.

Scientists need detailed information that can reveal new biological insights. They are typically dealing with genomes from more than one individual. *

* Even among researchers, different people have different aims: a high-level view for population-wide studies; a sequence-level view if looking at SNPs, etc.

Question 5

Q

Consumer setting (urine-based tests)

What works for consumers? Definitive indication of one state – no interpretation required.

What works for clinicians?

Answer

A

For consumers: definitive indication of one state – no interpretation required.

For clinicians: Multiple concentration values – disease relevance determined by the clinician.

The amount of data being visualised here is tiny compared to what’s available in the human genome – for genomic data we must move to computer-based reports.

Question 6

Q

How are genotype results presented by direct-to-consumer genotyping services such as 23andMe?

Until recently what level of information was given?

Answer

A

Raw SNP data is processed to provide context prior to being shown in tabular form.
Algorithms are used to calculate disease risk based on the status of known SNPs in the individual.

Until recently they gave a lot of information about the results from the genome; this changed when the FDA imposed regulation on them.
People want to know about what diseases and drug responses they can infer from their genomes.

Question 7

Q

A history of data visualisation

2,600 BC:
… nothing for ~800 years …
1669
1822
1829
1977
1987
1994
2001
2013

Answer

A

2,600 BC: World’s first known data table
10th Century: Position of the planets over time (unknown).
… nothing for ~800 years …
1669: Median remaining lifetime as a function of age (Christiaan Huygens graph of data from John Graunt)
1822: Price of wheat (bar) compared to weekly wage (red line) over several hundred years by William Playfair.
1829: Crime rate indicated by shaded regions (Adriano Balbi and André Michel Guerry).
1977: PRIM-9 - early interactive data visualisation (John Tukey).
1987: “Brushing Scatterplot” - An interactive multi-part graph for desktop computers (Richard Becker and William Cleveland).
1994: Chromoscope E. coli genome viewer (Zhang et al.). Desktop-based.
2001: UCSC Human Genome Browser (Kent et al.). Web-based.
2013: RCircos (Zhang et al.)

Question 8

Q

What do all visualisations have in common?

What does the viewer need?

Answer

A

All these visualisations aim to show large amounts of data to the viewer with minimal cognitive load (i.e. as easily as possible).

In each case, the viewer needs to be taught how to understand the visualisation, e.g. what the elements and colours represent.

Question 9

Q

Why bother with graphics?

Answer

A

Graphics can reveal trends hidden by summary statistics

All of these plots share the same summary statistics.

Anscombe’s quartet (1973): each of these four datasets has (almost) exactly the same:

·Number of points

·Mean average x and y

·Variance

·Correlation coefficient

·Straight line of best fit

We still rely on graphs to visualise things to make sense of them.
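The claim about Anscombe’s quartet can be checked directly. The sketch below uses the published quartet values and computes each listed statistic with plain Python; the function and variable names are my own.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Compute the summary statistics listed on the card for one dataset."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares line of best fit
    return {
        "n": n,
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),  # sample variance
        "var_y": syy / (n - 1),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows agree to about two decimal places, yet the scatter plots look completely different, which is exactly why the graphic is needed.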

Question 10

Q

What does this graph tell us?

Answer

A

This graph tells us that there has been an increase in the levels of this type of protein; we can also see that the increase is sigmoidally shaped – no fancy maths is needed.
We can also see that there was already some protein in the sample before we even started.

Question 11

Q

6.2 Visualisation

It was not that long ago that people started using colour to represent quantitative information, for example in 1829: crime rate indicated by shaded regions (Adriano Balbi and André Michel Guerry).

Using colour

Examples of this;

Simple univariate shading (only one variable is shown here: the cancer incidence), like that French map. The colour itself is only used for aesthetic reasons (BBC brand in this case).

Question 12

Q

Colour maps can increase the resolution of shading by mapping values to a wider range of colours. But this is still univariate, and can be confusing.

Question 13

Q

RGB colour mapping

Different primary colours can be mixed together to make a combination of colours – this is the basic principle of how this works.

We can cram in more information by mixing red, green and blue in different proportions.

What can colour be used to indicate in transcriptomic data from microarrays?

Answer

A

Mixing varying amounts of red and green is very common for visualising differential gene expression. Last week we looked at transcriptomic data and microarray spots: the amount of red indicated the abundance of a transcript in one sample and the amount of green its abundance in another, so mixing them together gave an idea of how strongly each transcript was expressed in one sample relative to the other.
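The red/green mixing idea can be sketched in a few lines of Python. This is an illustration only: the function name, the linear scaling and the `max_signal` parameter are assumptions, not how any particular microarray package does it.

```python
def two_channel_colour(sample_a, sample_b, max_signal):
    """Map two transcript abundances to an (R, G, B) tuple of 0-255 ints.

    sample_a drives the red channel, sample_b the green channel.
    Signals are clamped to the range 0..max_signal (assumed scaling).
    """
    def scale(signal):
        clamped = max(0, min(signal, max_signal))
        return round(255 * clamped / max_signal)

    return (scale(sample_a), scale(sample_b), 0)


print(two_channel_colour(100, 0, 100))    # only in sample A: red
print(two_channel_colour(0, 100, 100))    # only in sample B: green
print(two_channel_colour(100, 100, 100))  # equal in both: yellow
```

A spot that is equally abundant in both samples comes out yellow, and an absent transcript comes out black.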

Question 14

Q

We can show three variables if we add blue – in a 3D style

What does HSV mean?

Answer

A

HSV colour mapping

RGB colouring can be difficult to interpret.

HSV (hue, saturation and value) can potentially show three variables in a more intuitive way.
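As a rough sketch, Python’s standard `colorsys` module can turn three variables into a colour via HSV. Which variable goes on which channel is an arbitrary choice here, and each input is assumed to be already scaled to the range 0 to 1.

```python
import colorsys


def hsv_colour(var_hue, var_saturation, var_value):
    """Encode three variables (each pre-scaled to 0..1) as an RGB colour.

    var_hue picks the colour, var_saturation how vivid it is,
    and var_value how bright it is.
    """
    rgb = colorsys.hsv_to_rgb(var_hue, var_saturation, var_value)
    return tuple(round(255 * channel) for channel in rgb)


print(hsv_colour(0.0, 1.0, 1.0))  # fully saturated, full brightness
print(hsv_colour(0.0, 0.0, 1.0))  # no saturation: white
print(hsv_colour(0.5, 1.0, 0.0))  # zero value: black
```

One thing to watch is that hue is circular, so 0 and 1 both give red; in practice the hue variable is often mapped onto only part of the circle.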

Question 15

Q

Transparency

Transparency can be useful when we want to indicate the confidence in a particular piece of data.

Consider this protein coverage data from GPMDB:

Question 16

Q

Static versus interactive visualisations

The move to interactive visualisations was a major breakthrough, because instead of having to fit everything into a single figure of a fixed size, we can massively increase the amount of information we can convey by using;

Answer

A

· Different views of the same data (e.g. rotation)

· Collapsible elements, e.g. drop-downs and accordions

· Zooming in and out

Today we can produce highly customised visualisations on the fly, by allowing users to select which elements of the data to display.

Question 17

Q

Ensembl Genome Browser

Answer

A

Getting your own “track” into Ensembl (or other genome browser)

Two main options:

Upload a file containing the annotations, either from a local file or via a URL. Annotations need to be in an accepted format, e.g. GFF3.
Host your data on a BioDAS server.
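For the file-upload option, a GFF3 annotation is one tab-separated line of nine columns (seqid, source, type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates. The Python sketch below writes a single made-up feature; the source name "my_lab", the ID and the coordinates are hypothetical, and it does not escape special characters in attribute values.

```python
def gff3_line(seqid, source, feature_type, start, end, strand,
              attributes, score=".", phase="."):
    """Format one feature as a nine-column, tab-separated GFF3 line."""
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates are 1-based with start <= end")
    # Attributes are written as key=value pairs joined by semicolons.
    attribute_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, feature_type, str(start), str(end),
               str(score), strand, str(phase), attribute_text]
    return "\t".join(columns)


# A made-up gene on chromosome 3, preceded by the version header line.
feature = gff3_line("3", "my_lab", "gene", 1000, 2000, "+",
                    {"ID": "gene0001", "Name": "example"})
track = "\n".join(["##gff-version 3", feature])
print(track)
```

A file like this is the kind of thing that would be uploaded, or served from a URL, as a custom track.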

Question 18

Q

Technical challenges for web-based genome browsers…

Many different annotation formats, e.g. GFF3, BED, BAM, BigWig, GTF, WIG, VCF.

Large amounts of data to be manipulated, which is particularly challenging within a web browser – the sequence of chromosome 3 alone (without any annotations) is a 192 MB file. Just downloading this would take about three minutes at 10 Mbps.

The nature of annotations means that there are many objects to draw and keep track of within the browser.
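The download-time figure above is simple arithmetic: megabytes times 8 gives megabits, divided by the link speed in megabits per second. A small sketch, ignoring protocol overhead and compression (both assumptions):

```python
def download_seconds(size_megabytes, link_megabits_per_second):
    """Idealised transfer time: 8 bits per byte, no overhead."""
    return size_megabytes * 8 / link_megabits_per_second


seconds = download_seconds(192, 10)
print(f"{seconds:.1f} s, i.e. about {seconds / 60:.1f} minutes")
```

That comes to roughly two and a half minutes for the bare sequence, in line with the rounded figure quoted above, and that is before any annotation tracks are fetched.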

Answer

A

And solutions

Developments in web technologies make ever more sophisticated web-based applications possible. Genoverse uses the very latest HTML5 technology to provide a slick web-based genome browser.

Question 19

Q

What is BioDAS?

The biological distributed annotation system (DAS) is a protocol for exchanging genome annotations.

A genome browser can pull in annotations for the currently displayed region of the genome from DAS servers when needed, instead of storing everything locally.
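To illustrate the "only fetch what is on screen" idea, here is a Python sketch that builds a request URL for one region. It assumes the DAS 1.x style `features?segment=chromosome:start,end` request; the server address and data source name are hypothetical, and no network request is actually made.

```python
def das_features_url(server, source, chromosome, start, end):
    """Build a DAS-style 'features' request for one genomic region.

    Assumes the DAS 1.x URL pattern; server and source are placeholders.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    base = server.rstrip("/")
    return f"{base}/das/{source}/features?segment={chromosome}:{start},{end}"


# Hypothetical server and source: the browser would issue one such
# request each time the user pans or zooms to a new region.
url = das_features_url("http://das.example.org", "hypothetical_source", "3", 1000, 2000)
print(url)
```

The browser only ever asks for the region currently in view, so the full annotation set never has to be downloaded.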

Question 20

Q

Beyond the genome

Everything we looked at so far has been about visualising the genome sequence itself.

Many more aspects to consider;

Answer

A

·Dynamic data (e.g. gene, protein expression)

·Spatial data (e.g. gene, protein expression)

·Interaction data (between genes and other factors)

Question 21

Q

Visualising interactions

Network diagrams are now commonplace for visualising interactions between different objects, or states. Consider this trivial example;

Answer

A

States or objects are represented by NODES (circles here).

Interactions are represented by EDGES (arrows)
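In code, a network like this is often stored as a list of edges or as an adjacency list. A minimal Python sketch follows, using made-up node names (geneA to geneD) rather than real interaction data:

```python
# Each edge is (source node, target node); the names are made up.
EDGES = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneB", "geneC"),
    ("geneC", "geneD"),
]


def build_graph(edges):
    """Return an adjacency list: node -> list of nodes it points to."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])  # make sure sink nodes appear too
    return graph


graph = build_graph(EDGES)
for node in sorted(graph):
    print(f"{node} -> {graph[node]} (out-degree {len(graph[node])})")
```

Drawing tools take this kind of node and edge list as input and then work out the layout.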

Applying similar ideas to gene interaction data;

Question 22

Q

Cytoscape is a free software package for exactly this type of analysis;

Question 23

Q

Answer

A

Kitano notation – a more mechanistic view

Kitano notation and similar schemes represent complex biological systems like an electronic circuit diagram, where different physical entities and processes have designated shapes.

Question 24

Q