Data analysis Flashcards

1
Q

What does data analysis consist of?

A

Turning raw image data into useful sequence data

2
Q

What has happened following the development of bioinformatic tools?

A

The importance of data analysis has increased

3
Q

What is bioinformatics?

A

A broad, interdisciplinary field that integrates principles from computer science, mathematics and statistics

Its purpose is to manage, mine, visualise and analyse biological data

4
Q

Which field has bioinformatics co-evolved with?

A

Genomics

5
Q

What is the role of a bioinformatician?

A

Develop analytical methods

Construct and curate computational tools and databases

Data mining, interpretation and analysis

6
Q

What are the aims of a bioinformatician?

A

Identify differentially expressed genes

Identify epigenetic changes

Analyse pathways

7
Q

Examples of ways genes can be differentially expressed

A

Somatic mutations

Copy number alterations

8
Q

What type of bioinformatician is the best bioinformatician?

A

One with relevant background knowledge of the biological component of the data

One who can differentiate between technical artifacts and biologically relevant signals

9
Q

How can we increase genomic understanding of disease?

A

Pairing information obtained through genomic technologies with clinical data

This entails integrating omic and expression data

10
Q

What are basic genome browsers?

A

Curated databases

Allow annotation of DNA from humans and other species

These are references or working drafts

11
Q

What are the two main hosts of genome browsers?

A

UCSC

ENSEMBL

12
Q

Where is UCSC based?

A

University of California, Santa Cruz

13
Q

Where is ENSEMBL based?

A

Europe - UK and Germany

14
Q

What do basic genome browsers bring together?

A

Genomic annotation from multiple species, as well as other data such as transcription factor binding sites

15
Q

What can we measure using microarrays?

A

DNA

Gene expression

16
Q

What aspects of DNA can we measure using microarrays?

A

SNPs

Copy number variations

Methylation

Chromosome conformation

17
Q

What aspects of gene expression can be measured using microarrays?

A

mRNA

miRNA

lncRNA

18
Q

What type of variation can be detected through SNPs, somatic mutations and CNVs?

A

Genetic variation

19
Q

What type of variation can be detected through DNA methylation and chromatin analysis?

A

Epigenetic variation

20
Q

What type of information can be detected through RNA expression and gene structure?

A

Expression variation

21
Q

Describe the 6 steps of array processing

A
  1. Experimental design
  2. Image analysis
  3. Normalisation to clean the data
  4. More low level analysis (fold change, ANOVA and data filtering)
  5. Data mining
  6. Validation
22
Q

Why is it important to normalise the data?

A

Cleaning the data allows us to compare data across arrays without altering the interpretation of changes in gene expression

23
Q

Why does the data need to be normalised?

A

The intensity of fluorescent markers might be different from one batch to another

Technical variation can hide real data

Unavoidable systematic bias
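
A minimal sketch of one simple normalisation, assuming the batch effect is a constant brightness factor per array: take log2 intensities and subtract each array's median. The function name and the example intensities are hypothetical, and real pipelines typically use more robust methods.

```python
import math
from statistics import median


def median_center(arrays):
    """Return log2 intensities shifted so that every array has median 0.

    arrays: dict mapping an array name to a list of raw intensities (> 0).
    """
    normalised = {}
    for name, intensities in arrays.items():
        logs = [math.log2(value) for value in intensities]
        centre = median(logs)
        normalised[name] = [value - centre for value in logs]
    return normalised


# Hypothetical data: array_b was scanned twice as bright as array_a,
# but the underlying biology is identical.
raw = {
    "array_a": [100, 200, 400],
    "array_b": [200, 400, 800],
}
norm = median_center(raw)
for name, values in norm.items():
    print(name, [round(v, 3) for v in values])
```

After centring, the two arrays give the same values, so the remaining differences between arrays are more likely to be biological than technical.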

24
Q

What is the main reason we normalise data?

A

Because the experimental goal is to identify biological variation and expression changes between samples

25
What is the most appropriate test used to analyse data?
Pairwise analysis using t-tests or ANOVA is the most appropriate
26
What is the goal of data analysis?
Determining the fold-change cutoffs (up or down) that identify what is truly significant
27
What is a common theme when measuring differences in gene expression in arrays and NGS?
Ranking genes according to the evidence for a difference in gene expression; scoring the differences using fold changes, t-statistics or a combination
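As a rough illustration of ranking genes by evidence of differential expression, the sketch below scores each gene with a log2 fold change and a Welch t-statistic, then ranks by the absolute t-statistic. The gene names and expression values are made up for the example, and the choice of Welch's t is an assumption rather than something the cards specify.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """log2 of the ratio of mean treated expression to mean control expression."""
    return math.log2(mean(treated) / mean(control))


def welch_t(control, treated):
    """Welch t-statistic: difference in means divided by its standard error."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical expression values: (control replicates, treated replicates).
expression = {
    "GENE_A": ([10.0, 11.0, 9.0], [40.0, 42.0, 38.0]),
    "GENE_B": ([20.0, 22.0, 18.0], [21.0, 19.0, 20.0]),
    "GENE_C": ([30.0, 29.0, 31.0], [15.0, 16.0, 14.0]),
}

scores = {
    gene: (log2_fold_change(control, treated), welch_t(control, treated))
    for gene, (control, treated) in expression.items()
}

# Rank genes from strongest to weakest evidence of a difference.
ranked = sorted(scores, key=lambda gene: abs(scores[gene][1]), reverse=True)
for gene in ranked:
    fold, t_stat = scores[gene]
    print(f"{gene}: log2FC={fold:.2f}, t={t_stat:.2f}")
```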
28
What are ways to interpret the changes in signal intensity?
Heat maps; volcano plots
29
What does the p-value inform?
How likely a difference at least this large would be if there were no real difference; the smaller the p-value, the stronger the evidence that the change is real
30
What does the t-value inform?
The size of the difference relative to the variability in the data; its sign gives the direction of the change
31
What is Moore's Law?
An observation that computing power doubles roughly every two years; technologies that keep pace with Moore's law are considered cutting edge
32
Does genomics keep up with Moore's law?
No; sequencing output has grown faster than computing power. We have the sequence, but do not yet know how to interpret it
33
Which aspects of the genome can be investigated through sequencing genomic DNA?
Genomic footprinting; epigenetic profiling; whole genome sequencing
34
Which aspects of the genome can be investigated through sequencing cDNA?
Transcriptome; RNA footprinting; transcriptome expression profiling
35
What sample can be used to analyse germline diseases?
Blood exome
36
When is transcriptomics commonly used?
When looking at individual tissue types in certain diseases to infer the function of mutated genes, since post-transcriptional changes in RNA are not present in the DNA
37
What are methylation and histone modification analyses used for?
To integrate and interpret all the information together; to look for clinical applications of the data
38
What are the 3 steps of NGS data analysis?
Primary analysis; secondary analysis; tertiary analysis
39
What is the use of primary analysis of NGS data?
Determine the run/sample quality by looking at the quality values of the visual information (colours)
40
What is the use of secondary analysis of NGS data?
Determine the sample/information quality by aligning the sequences
41
What is the use of tertiary analysis of NGS data?
Data interpretation
42
What is the output format of data obtained from sequencing?
Top line = read identifier, including the position on the flow cell the read comes from
Second line = the sequence itself
Third line = a "+" separator (optionally followed by the identifier again)
Bottom line = quality scores
43
What does the quality score mean?
How confident the software is that this is the specific base
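A minimal sketch of how quality scores can be read back from a sequencing record, assuming the common four-line FASTQ layout with Phred+33 encoding (the quality character's ASCII code minus 33). Under the Phred convention, a score Q corresponds to an error probability of 10^(-Q/10). The record and the function names are hypothetical.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, quality scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


def error_probability(phred_score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred_score / 10)


# Hypothetical record: one quality character per base.
record = ["@read_1", "GATTACA", "+", "IIII5+!"]
name, seq, quals = parse_fastq_record(record)
for base, score in zip(seq, quals):
    print(base, score, error_probability(score))
```

A score of 20 means roughly a 1 in 100 chance that the base is wrong, and a score of 40 means roughly 1 in 10,000.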
44
What are intensities turned into in NGS?
Count numbers: the number of sequences that line up with the reference
45
What are intensities turned into in microarrays?
Numerical values
46
How can read number be used to quantify gene expression?
By count number: programs count the number of reads that map to each gene
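A toy sketch of read counting, assuming each mapped read is reduced to a chromosome and a single position and each gene to one interval. The gene coordinates and reads are hypothetical; dedicated counting programs also handle strand, exons, overlapping genes and multi-mapping reads.

```python
def count_reads(read_positions, genes):
    """Count how many mapped reads fall inside each gene interval.

    read_positions: iterable of (chromosome, position) tuples.
    genes: dict mapping gene name -> (chromosome, start, end), inclusive.
    """
    counts = {gene: 0 for gene in genes}
    for chromosome, position in read_positions:
        for gene, (gene_chromosome, start, end) in genes.items():
            if chromosome == gene_chromosome and start <= position <= end:
                counts[gene] += 1
    return counts


# Hypothetical annotation and mapped reads.
genes = {
    "GENE_A": ("chr1", 100, 500),
    "GENE_B": ("chr1", 800, 1000),
}
reads = [
    ("chr1", 150),
    ("chr1", 499),
    ("chr1", 900),
    ("chr1", 600),  # intergenic, not counted
    ("chr2", 150),  # different chromosome, not counted
]
counts = count_reads(reads, genes)
print(counts)
```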
47
What is a read?
A single sequence of bases produced by the sequencer, stored in a text-based format together with its corresponding quality scores
48
What is the ASCII character?
A character that encodes the quality score of the sequence letter at the same position
49
What are the steps of analysing NGS data?
  1. Base calling
  2. Variant calling
  3. Annotation
  4. Filtering
  5. Reporting
50
What is base calling?
Aligning the sequence to the reference to compare
51
Two steps of base calling
QC alignment; alignment
52
What do we need to check when aligning the sequence to the reference?
Percentage of reads properly or uniquely mapped; among the mapped reads, the percentage of reads in exon, intron and intergenic regions; 5' or 3' bias
53
What is IGV?
Integrative Genomics Viewer: software that allows us to visualise the reads on the genome, highlighting potential variants
54
What are counts?
Expression levels in IGV
55
How does IGV determine whether the variants in the genome are SNPs or mutations?
By the proportion of reads that carry the alteration: if it is a SNP = about 50/50; if it is a mutation = less frequent
56
What can alter the number of reads?
The gene expression levels; the length of the gene
57
Why is the read number not always accurate?
Longer genes naturally produce more reads, so the read number does not always reflect the expression level
58
How can read number be normalised?
RPKM: reads per kilobase per million mapped reads
59
Equation for RPKM
RPKM = counts of mapped fragments / (total mapped fragments in millions × exon length of transcript in kb)
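A short worked example of the RPKM calculation, written as a sketch with hypothetical gene lengths and read counts. It shows that two genes with different raw read numbers can have the same normalised expression once length is accounted for.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    total_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * total_millions)


# Hypothetical library with 2 million mapped reads.
total = 2_000_000
short_gene = rpkm(read_count=100, gene_length_bp=1_000, total_mapped_reads=total)
long_gene = rpkm(read_count=400, gene_length_bp=4_000, total_mapped_reads=total)
print(short_gene, long_gene)
```

The long gene has four times as many reads but is also four times as long, so both genes end up with the same RPKM.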
60
What information can RNA sequencing provide?
Fold changes in gene (RNA) expression; if the gene is expressed at higher levels than the protein = splice variant
61
What are variant calling methods?
Rather than looking at the level of expression, we look at somatic mutations
62
What are the 3 ways in which variants can be detected?
Allele counting; probabilistic methods, which use a Bayesian model to statistically quantify the number of allelic variants; heuristic approach, based on thresholds
63
What method does VarScan2 use?
A heuristic approach: it looks at the number of reads to determine whether there is a variant or not
64
Describe what the VarScan2 procedure will determine about a cancer genome if tumour allele frequency matches normal
If tumour and normal match the reference = reference; if tumour and normal do not match the reference = germline
65
Describe what the VarScan2 procedure will determine about a cancer genome if the tumour allele frequency does not match normal
Calculate the significance of the allele frequency difference using Fisher's exact test
If the difference is significant:
  - if normal matches reference = somatic
  - if normal is heterozygous = LOH
  - if normal and tumour are both variants and different = unknown
If the difference is not significant:
  - combine the tumour and normal read counts for each allele, recalculate the p-value and call germline
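The decision procedure on cards 64 and 65 can be sketched roughly as follows. This is an illustration under stated assumptions, not VarScan2's actual code: the genotype labels ("ref", "het", "hom"), the function names and the 0.05 significance threshold are all hypothetical, "allele frequency matches" is simplified to "the two genotype calls are equal", and the recalculated p-value for the germline case is omitted.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


def classify_site(normal_gt, tumour_gt, normal_counts, tumour_counts, alpha=0.05):
    """Classify one site from genotype calls and (ref, variant) read counts.

    alpha is an assumed significance threshold, chosen only for illustration.
    """
    if normal_gt == tumour_gt:
        return "reference" if normal_gt == "ref" else "germline"
    p_value = fisher_exact_two_sided(*normal_counts, *tumour_counts)
    if p_value < alpha:
        if normal_gt == "ref":
            return "somatic"
        if normal_gt == "het":
            return "LOH"
        return "unknown"
    # Not significant: treated as germline (p-value recalculation omitted here).
    return "germline"


# Hypothetical read counts as (reference reads, variant reads).
somatic_call = classify_site("ref", "het", (30, 0), (15, 15))
loh_call = classify_site("het", "hom", (15, 15), (0, 30))
weak_call = classify_site("ref", "het", (30, 0), (9, 1))
print(somatic_call, loh_call, weak_call)
```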
66
What are the classes of structural variation?
Deletion; novel sequence insertion; mobile-element insertion; interspersed duplication; tandem duplication; inversion; translocation
67
3 software tools that annotate variants
SeattleSeq; Oncotator; ANNOVAR
68
What does SeattleSeq annotate?
SNVs and small indels, both common and novel
69
What does Oncotator annotate?
Human genomic point mutations and indels, with data relevant to cancer researchers
70
What does ANNOVAR annotate?
Genetic variants detected from diverse genomes
71
What does variant annotation software do?
It annotates the SNPs and indicates how likely they are to be deleterious