Data analysis Flashcards

1
Q

What does data analysis consist of?

A

Turning raw image data into useful sequence data

2
Q

What has happened following the development of bioinformatic tools?

A

The importance of data analysis has increased

3
Q

What is bioinformatics?

A

Broad, interdisciplinary field that integrates principles from computer science, mathematics and statistics

In order to manage, mine, visualize and analyze biological data

4
Q

Which field has bioinformatics co-evolved with?

A

Genomics

5
Q

What is the role of a bioinformatician?

A

Develop analytical methods

Construct and curate computational tools and databases

Data mining, interpretation and analysis

6
Q

What are the aims of a bioinformatician?

A

Identify differentially expressed genes

Identify epigenetic changes

Analyse pathways

7
Q

Examples of ways genes can be differentially expressed

A

Somatic mutations

Copy number alterations

8
Q

What type of bioinformatician is the best bioinformatician?

A

Relevant background knowledge regarding the biological component of the data

Can differentiate between technical and biologically relevant artifacts

9
Q

How can we increase genomic understanding of disease?

A

Pairing information obtained through genomic technologies with clinical data

This entails integrating omic and expression data

10
Q

What are basic genome browsers?

A

Curated databases

Allow annotation of human and other species DNA

These are references or working drafts

11
Q

What are the two main hosts of genome browsers?

A

UCSC

ENSEMBL

12
Q

Where is UCSC based?

A

University of California, Santa Cruz

13
Q

Where is ENSEMBL based?

A

Europe - UK and Germany

14
Q

What do basic genome browsers bring together?

A

Genomic annotation from multiple species as well as many other data like transcription factor binding sites

15
Q

What can we measure using microarrays?

A

DNA

Gene expression

16
Q

What aspects of DNA can we measure using microarrays?

A

SNPs

Copy number variations

Methylation

Chromosome conformation

17
Q

What aspects of gene expression can be measured using microarrays?

A

mRNA

miRNA

lncRNA

18
Q

What type of variation can be detected through SNPs, somatic mutations and CNVs?

A

Genetic variation

19
Q

What type of variation can be detected through DNA methylation and chromatin analysis?

A

Epigenetic variation

20
Q

What type of information can be detected through RNA expression and gene structure?

A

Expression variation

21
Q

Describe the 6 steps of array processing

A
  1. Experimental design
  2. Image analysis
  3. Normalisation to clean the data
  4. More low level analysis (fold change, ANOVA and data filtering)
  5. Data mining
  6. Validation
22
Q

Why is it important to normalise the data?

A

Cleaning the data allows us to compare data across arrays without altering the interpretation of changes in gene expression
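
As a rough illustration of what normalisation does, the sketch below rescales each array so that all arrays share the same median intensity. Median scaling is only one simple option chosen here for illustration (it is an assumption, not necessarily the method used in the course), and the function name and intensity values are made up.

```python
from statistics import median


def median_normalise(arrays):
    """Scale each array so its median intensity equals the median of all array medians."""
    medians = [median(a) for a in arrays]
    target = median(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Two hypothetical arrays with the same expression pattern,
# but the second was scanned twice as bright (technical variation).
array_a = [100.0, 200.0, 400.0]
array_b = [200.0, 400.0, 800.0]

normalised = median_normalise([array_a, array_b])
print(normalised)
```

After scaling, the two arrays carry the same values, so the brightness difference between batches no longer looks like a change in gene expression.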

23
Q

Why does the data need to be normalised?

A

The intensity of fluorescent markers might be different from one batch to another

Technical variation can hide real data

Unavoidable systematic bias

24
Q

What is the main reason we normalise data?

A

Because the experimental goal is to identify biological variation and expression changes between samples

25
Q

What is the most appropriate test used to analyse data?

A

Pairwise analysis using t-tests or ANOVA is the most appropriate

26
Q

What is the goal of data analysis?

A

Determining the fold up or down cutoffs to figure out what is truly significant

27
Q

What is a common theme to measure differences in gene expression in arrays and NGS?

A

Ranking genes according to the evidence of difference in gene expression

Score the differences using fold changes, t-statistics or a combination
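
The ranking idea can be sketched in a few lines. The example below scores each gene with a log2 fold change and a Welch t-statistic, then ranks genes by the absolute t-statistic. The gene names and expression values are invented, and using Welch's version of the t-statistic is an assumption made for this sketch.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Log2 ratio of mean treated expression over mean control expression."""
    return math.log2(mean(treated) / mean(control))


def welch_t(control, treated):
    """Difference in means divided by its standard error (unequal variances allowed)."""
    se = math.sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Hypothetical normalised expression values: (control replicates, treated replicates)
expression = {
    "geneA": ([10.0, 12.0, 11.0], [40.0, 44.0, 48.0]),
    "geneB": ([20.0, 22.0, 21.0], [21.0, 20.0, 22.0]),
    "geneC": ([30.0, 33.0, 27.0], [15.0, 14.0, 16.0]),
}

scores = {g: (log2_fold_change(c, t), welch_t(c, t)) for g, (c, t) in expression.items()}

# Rank genes by strength of evidence (largest absolute t-statistic first).
ranked = sorted(scores, key=lambda g: abs(scores[g][1]), reverse=True)
for gene in ranked:
    lfc, t = scores[gene]
    print(f"{gene}: log2FC={lfc:+.2f}, t={t:+.2f}")
```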

28
Q

What are ways to interpret the changes in signal intensity?

A

Heat maps

Volcano plots

29
Q

What does the p-value inform?

A

How statistically significant the observed change is

The smaller the p-value, the stronger the evidence that the change is real rather than due to chance

30
Q

What does the t-value inform?

A

The magnitude and direction of the change, relative to the variability in the data

31
Q

What is Moore’s Law?

A

An observation that computing power doubles roughly every two years

Technologies which keep up the pace with Moore’s law are cutting edge

32
Q

Does genomics keep up with Moore’s law?

A

No

Sequencing capacity has grown faster than computing power

We have the sequence, but don’t know how to interpret it

33
Q

Which aspects of the genome can be investigated through sequencing genomic DNA?

A

Genomic footprinting

Epigenetic profiling

Whole genome sequencing

34
Q

Which aspects of the genome can be investigated through sequencing cDNA?

A

Transcriptome

RNA footprinting

Transcriptome expression profiling

35
Q

What sample can be used to analyse germline diseases?

A

Blood exome

36
Q

When is transcriptomics commonly used?

A

Looking at individual tissue types with certain diseases to infer the function of mutated genes

Since post-transcriptional changes in RNA are not present in DNA

37
Q

What are methylation and histone modification analyses used for?

A

To integrate and interpret all the information together

To look for the clinical application of data

38
Q

What are the 3 steps of NGS data analysis?

A

Primary analysis

Secondary analysis

Tertiary analysis

39
Q

What is the use of primary analysis of NGS data?

A

Determine the run/sample quality through looking at the quality values of the visual information (colours)

40
Q

What is the use of secondary analysis of NGS data?

A

Determine the sample/information quality through aligning the sequences

41
Q

What is the use of tertiary analysis of NGS data?

A

Data interpretation

42
Q

What is the output format of data obtained from sequencing?

A

Top line = read identifier, including the position on the flow cell the read comes from

Second line = sequence itself

Third line = separator line starting with '+'

Bottom line = quality scores, one character per base
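
A minimal sketch of reading this four-line layout, assuming the standard FASTQ convention in which the first line starts with '@' and the third line starts with '+'. The record contents and the `FastqRecord` and `parse_fastq` names are made up for illustration.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(text):
    """Split FASTQ text into records of four lines each."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        records.append(FastqRecord(header[1:], sequence, quality))
    return records


# Hypothetical single-read example (identifier fields are invented).
example = """@read1 flowcell:lane:tile:x:y
GATTACA
+
IIIIHH#
"""

records = parse_fastq(example)
print(records[0])
```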

43
Q

What does the quality score mean?

A

How confident the software is that this is the specific base

44
Q

What are intensities turned into in NGS?

A

Count number

Number of sequences that line up with the reference

45
Q

What are intensities turned into in microarrays?

A

Numerical values

46
Q

How can read number be used to quantify gene expression?

A

Count number

Counting the number of reads that map to each gene using programs
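
To make the counting step concrete, here is a toy sketch that counts reads by the gene interval their start position falls in. Real counting programs work from alignment files and gene annotations; the coordinates, gene names and `count_reads_per_gene` function below are invented for illustration.

```python
def count_reads_per_gene(genes, read_positions):
    """Count reads whose start position falls inside each gene's [start, end) interval."""
    counts = {name: 0 for name in genes}
    for position in read_positions:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
    return counts


# Hypothetical gene coordinates and aligned read start positions on one chromosome.
genes = {"geneA": (100, 200), "geneB": (500, 1500)}
read_starts = [120, 150, 199, 200, 600, 900, 1499, 2000]

counts = count_reads_per_gene(genes, read_starts)
print(counts)
```

Reads at positions 200 and 2000 fall outside both intervals and are not counted.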

47
Q

What is a read?

A

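A sequence of bases produced by the sequencer from a single DNA fragment

Reads are stored in FASTQ, a text-based format for storing both a biological sequence and its corresponding quality scores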

48
Q

What is the ASCII character?

A

A single character that encodes the quality score of the corresponding base in the sequence
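
The sketch below decodes quality characters into Phred scores and error probabilities. It assumes the common Phred+33 encoding (score = ASCII code minus 33); other offsets exist in older data, so treat the default offset as an assumption.

```python
def phred_scores(quality, offset=33):
    """Convert each ASCII quality character into a Phred score (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("II#5")
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, and a score of 40 to 1 in 10,000.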

49
Q

What are the steps of analysing NGS data?

A
  1. Base calling
  2. Variant calling
  3. Annotation
  4. Filtering
  5. Reporting
50
Q

What is base calling?

A

Aligning the sequence to the reference to compare

51
Q

Two steps of base calling

A

QC alignment

Alignment

52
Q

What do we need to check when aligning the sequence to the reference?

A

Percentage of reads properly or uniquely mapped

Among the mapped reads, determine the percentage of reads in the exon, intron and intergenic regions

5’ or 3’ bias

53
Q

What is IGV?

A

Integrative Genomics Viewer

Software that allows us to visualise the reads on the genome, highlighting potential variants

54
Q

What are counts?

A

Expression levels in IGV

55
Q

How does IGV determine whether the variants in the genome are SNPs or mutations?

A

The proportion of reads that carry the alteration

If it is a heterozygous SNP = roughly 50/50
If it is a mutation = less frequent

56
Q

What can alter the number of reads?

A

The gene expression levels

The length of the gene

57
Q

Why is the read number not always accurate?

A

Longer genes have naturally more reads, so the read number does not always reflect the expression rate

58
Q

How can read number be normalised?

A

RPKM

Reads per kilobase per million mapped reads

59
Q

Equation for RPKM

A

Counts of mapped reads / (total mapped reads (millions) × exon length of transcript (kb))
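
A worked sketch of the RPKM calculation, with the read count divided by both the library size in millions and the exon length in kilobases. The read counts and gene lengths are invented, and the `rpkm` function name is just for illustration.

```python
def rpkm(gene_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads."""
    millions = total_mapped_reads / 1_000_000
    kilobases = exon_length_bp / 1_000
    return gene_reads / (millions * kilobases)


# Hypothetical library of 20 million mapped reads.
short_gene = rpkm(gene_reads=500, total_mapped_reads=20_000_000, exon_length_bp=1_000)
long_gene = rpkm(gene_reads=2_000, total_mapped_reads=20_000_000, exon_length_bp=4_000)

print(short_gene, long_gene)
```

The longer gene has four times as many reads, but both genes get the same RPKM, which is the point of correcting for gene length.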

60
Q

What information can RNA sequencing provide?

A

Fold changes in protein expression

If the gene is expressed at higher levels than the protein = splice variant

61
Q

What are variant calling methods?

A

Rather than looking at the level of expression, we look at somatic mutations

62
Q

What are the 3 ways in which variants can be detected?

A

Allele counting

Probabilistic methods - use a Bayesian model to statistically quantify the number of allelic variants

Heuristic approach - based on thresholds
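
A threshold-based (heuristic) call can be sketched as a few simple rules on read counts. The cutoffs below are illustrative assumptions, not the defaults of any particular tool, and `call_variant` is a hypothetical name.

```python
def call_variant(ref_reads, alt_reads, min_depth=10, min_alt_reads=3, min_alt_fraction=0.2):
    """Call a site using fixed thresholds on depth, supporting reads and allele fraction."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no call"  # too few reads to decide either way
    alt_fraction = alt_reads / depth
    if alt_reads >= min_alt_reads and alt_fraction >= min_alt_fraction:
        return "variant"
    return "reference"


print(call_variant(5, 2))    # low coverage
print(call_variant(48, 52))  # about half the reads support the alternative allele
print(call_variant(98, 2))   # only a couple of supporting reads
```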

63
Q

What method does VarScan2 use?

A

Looks at the number of reads to determine whether there is a variant or not

Heuristic approach

64
Q

Describe what the VarScan2 procedure will determine about a cancer genome if tumour allele frequency matches normal

A

If tumour and normal match the reference = reference

If tumour and normal do not match the reference = germline

65
Q

Describe what the VarScan2 procedure will determine about a cancer genome if the tumour allele frequency does not match normal

A

Calculate the significance of allele frequency difference by Fisher’s exact test

If the difference is significant

  • if normal matches reference = somatic
  • if normal is heterozygous = LOH
  • if normal and tumour are both variants and different = unknown

If the difference is not significant
- combine the tumour and normal read counts for each allele, recalculate p-value and call germline
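
The decision procedure from the two cards above can be sketched as follows. This is a simplified illustration, not VarScan2's actual code: the genotype thresholds and the 0.05 significance cutoff are assumptions, every site is assumed to have at least one read in each sample, and the final step of recombining read counts before calling germline is reduced to returning "germline".

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p = sum(q for q in (prob(x) for x in range(low, high + 1)) if q <= observed * (1 + 1e-9))
    return min(1.0, p)


def genotype(ref, alt, het_min=0.1, hom_min=0.75):
    """Crude genotype from allele fraction; thresholds are illustrative assumptions."""
    fraction = alt / (ref + alt)
    if fraction < het_min:
        return "ref"
    return "het" if fraction < hom_min else "hom"


def classify_site(normal_ref, normal_alt, tumour_ref, tumour_alt, alpha=0.05):
    normal = genotype(normal_ref, normal_alt)
    tumour = genotype(tumour_ref, tumour_alt)
    if normal == tumour:
        return "reference" if normal == "ref" else "germline"
    p = fisher_exact_two_sided(normal_ref, normal_alt, tumour_ref, tumour_alt)
    if p < alpha:
        if normal == "ref":
            return "somatic"
        if normal == "het":
            return "LOH"
        return "unknown"
    return "germline"  # difference not significant


print(classify_site(50, 0, 30, 20))
print(classify_site(25, 25, 2, 48))
```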

66
Q

What are the classes of structural variation?

A

Deletion

Novel sequence insertion

Mobile-element insertion

Interspersed duplication

Tandem duplication

Inversion

Translocation

67
Q

3 software tools that annotate variants

A

SeattleSeq

Oncotator

ANNOVAR

68
Q

What does SeattleSeq annotate?

A

SNVs

Small indels

Both common and novel

69
Q

What does Oncotator annotate?

A

Human genomic point mutations and indels with relevant data to cancer researchers

70
Q

What does Annovar annotate?

A

Genetic variants

Detected from diverse genomes

71
Q

What does variant annotation software do?

A

Annotates the SNPs and informs as to how likely they are to be deleterious