Data analysis Flashcards

1
Q

What is data analysis?

A

taking image data and turning it into sequence data

2
Q

What are the roles of bioinformatics?

A
  • Analytic method development
  • Construction and curation of computational tools and databases
  • Data mining, interpretation and analysis
3
Q

What does bioinformatics encompass?

A
  • Identifying differentially expressed genes
  • Somatic mutations
  • Copy number alterations
  • Epigenetic changes
  • Genomic understanding
  • Multifactorial analyses, including clinical measures such as blood pressure (BP), pulse oximetry, etc.

But also
• Pathway analysis
• Genome analysis
• Literature searches

4
Q

name 2 genome browsers

A

UCSC: UCSC Genome Browser
EMBL-EBI: Ensembl genome browser

5
Q

what are genome browsers?

A

curated databases of all the annotated genomes that we have (like PubMed for genes)

6
Q

Array processing steps

A
  1. Experimental Design
  2. Image Analysis – scan to intensity measures (raw data)
    a. You can get many different types of image data files
  3. Normalization – “clean” data
  4. More “low level” analysis: fold change, ANOVA, data filtering
  5. Data mining: how to interpret > 6000 measures
  6. Validation: repeat with another technology
7
Q

what do you have to consider in your experimental design in an array experiment?

A

Sample size
Biological replicates (e.g. different mice) and technical replicates (repeating the measurement on the same sample)
etc.

8
Q

what are the different ways you can data mine in an array experiment?

A

a. Databases
b. Software
c. Techniques-clustering, pattern recognition etc.
d. Comparing to prior studies, across platforms?

9
Q

How do we do array image analysis?

A

Software uses algorithms to turn the scanned spot intensities into gene abundance (expression) estimates, which can then be plotted, e.g. as volcano plots

10
Q

Why do we need to normalise data in array experiments?

A

“Normalizing” data allows comparisons ACROSS different arrays

○ Intensity of fluorescent markers might be different from one batch to the other due to differences in experiments, machines, etc.
○ Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION: technical variation can hide real biological variation
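
One simple way to picture this is median centring on a log2 scale. The sketch below is only an illustration: the function name `median_center` and the intensities are made up, and real pipelines often use other methods (for example quantile normalization).

```python
import math
from statistics import median

def median_center(arrays):
    """Log2-transform each array's intensities and subtract that array's median."""
    normalized = {}
    for name, intensities in arrays.items():
        logged = [math.log2(x) for x in intensities]
        centre = median(logged)
        normalized[name] = [value - centre for value in logged]
    return normalized

# Hypothetical raw intensities for the same three probes on two chips.
# chip_b was scanned "brighter": every value is doubled (technical variation).
raw = {
    "chip_a": [100.0, 200.0, 400.0],
    "chip_b": [200.0, 400.0, 800.0],
}
normalized = median_center(raw)
for name, values in normalized.items():
    print(name, [round(v, 3) for v in values])
```

After centring, both chips show the same relative pattern across probes, so the batch-wide brightness difference no longer looks like a change in expression.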

11
Q

How are most “low level” analyses on array experiments done?

A

There is no standard way

pairwise (usually)
lists of up- and down-regulated genes are made, with cutoffs set by fold change, t-statistic (p-value), or a combination
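
A pairwise comparison with combined cutoffs might look like the sketch below. Everything in it is an assumption for illustration: the gene names, the intensities, the cutoff values, and the use of Welch's t-statistic (a p-value would additionally need the t-distribution, e.g. from a statistics library).

```python
import math
from statistics import mean, variance

def welch_t(control, treated):
    """Welch's t-statistic for treated versus control replicates."""
    se = math.sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se

def classify(control, treated, min_fold=3.0, min_abs_t=4.0):
    """Label a gene 'up', 'down' or 'unchanged' using both cutoffs together."""
    fold = mean(treated) / mean(control)
    if abs(welch_t(control, treated)) < min_abs_t:
        return "unchanged"  # change is not consistent across replicates
    if fold >= min_fold:
        return "up"
    if fold <= 1.0 / min_fold:
        return "down"
    return "unchanged"

# Hypothetical normalized intensities: gene -> (control replicates, treated replicates)
data = {
    "gene_up": ([100, 110, 90], [400, 420, 380]),
    "gene_down": ([400, 420, 380], [100, 110, 90]),
    "gene_noisy": ([100, 300, 50], [400, 50, 900]),
    "gene_flat": ([100, 110, 90], [105, 95, 100]),
}
calls = {gene: classify(c, t) for gene, (c, t) in data.items()}
print(calls)
```

Here `gene_noisy` has a 3-fold mean change but fails the t-statistic cutoff, which is why a fold cutoff alone can mislead.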

12
Q

what is the usual fold cutoff for significance in array experiments of down or up regulation?

A

3-fold

13
Q

what should not be forgotten during array experiment “low level” analysis?

A

Multiple testing correction: with thousands of genes tested at once, some will pass a p-value cutoff by chance (and some things may have a very large fold change but low significance)
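
The card does not say which correction is meant. As one common choice, here is a minimal sketch of the Benjamini-Hochberg procedure; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.20]  # hypothetical per-gene p-values
adj = benjamini_hochberg(p)
print([round(value, 4) for value in adj])
```

With only four tests the adjustment is mild; with thousands of genes on an array it removes many of the hits that arise by chance.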

14
Q

What are the 3 stages of NGS data analysis?

A
  1. Primary analysis (run/sample quality): Raw data, images, signals –> basecalling –> bases/colours, quality values
  2. Secondary analysis (sample quality/info): +/- reference –> alignment and assembly
  3. Tertiary analysis (science): comparison –> statistical analysis and database searches
15
Q

what is basecalling?

A

converts the raw signal (images/intensities) into a base call (A, C, G or T) for each position of a read, along with a quality value

16
Q

Moore’s law

A

computing power (strictly, the number of transistors on a chip) doubles roughly every two years

- NGS is getting cheaper faster than that, while computing power is only following Moore’s law

17
Q

what is a read?

A

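one sequenced fragment: a single stretch of bases produced by the sequencer (the number of reads is how many such pieces of genomic data you have)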

18
Q

what do FASTQ files tell you?

A

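identifier (line starting with @)
sequence
separator line starting with + (optionally repeating the identifier; it does not indicate the strand)
quality score for each base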
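
To make the four-line layout concrete, here is a minimal sketch that parses one record and decodes its quality string. The read name and sequence are made up, and Phred+33 encoding is assumed.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into identifier, sequence and Phred scores."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumed Phred+33 encoding: score = ASCII code of the character minus 33.
    phred = [ord(char) - 33 for char in quality]
    return {"id": header[1:], "sequence": sequence, "phred": phred}

# Hypothetical record
record = parse_fastq_record([
    "@read_1",
    "GATTACA",
    "+",
    "IIII##!",
])
print(record["id"], record["sequence"], record["phred"])
# read_1 GATTACA [40, 40, 40, 40, 2, 2, 0]
```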

19
Q

Outline NGS data analysis (not 3 steps)

A
  1. Base calling
  2. QC
  3. Alignment
  4. Variant calling
  5. Annotation
  6. Filtering
  7. Reporting
20
Q

what does NGS reporting tell you?

A

if the mutation is deleterious or not

21
Q

what trend do you see in a FastQC report?

A

base quality decreases along the read, because the molecules within each cluster fall out of synchronisation as sequencing cycles accumulate

22
Q

what is done during aligning?

A

We are checking the percentage of reads properly or uniquely mapped
checking for 5’ or 3’ bias
Among the mapped reads, the percentage of reads in exon, intron, and intergenic regions.

23
Q

what do we use to align our genome?

A

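Integrative Genomics Viewer (IGV) is used to view the aligned reads; the alignment itself is done by a mapper such as BWA or Bowtie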

24
Q

What does IGV do?

A

Visualises reads in the genome: highlights potential variants (SNPs, hetero/homozygous)
Visualises number of reads in a region: if it only spans exons = mRNA, splice variants, expression levels

25
Q

what is the number of reads roughly proportional to?

A

the length of the gene and the total number of reads in the library

26
Q

If Gene A: 200 and Gene B: 300 reads, is expression of Gene A < Expression of Gene B?

A

Gene B could be a larger transcript than A and thus make more reads
Depends on analysis

27
Q

How do we get over the problem of the length of the gene interfering with read analysis?

A

we calculate RPKM (reads per kilobase of transcript per million mapped reads), which normalises read counts for gene length and library size, to find out expression
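
As a worked example, the sketch below applies RPKM to the 200-read and 300-read genes from the earlier card. The transcript lengths and the library size are assumptions chosen only to show the effect.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)

# Hypothetical library of 10 million mapped reads
total = 10_000_000
gene_a = rpkm(200, 1_000, total)  # 200 reads over an assumed 1 kb transcript
gene_b = rpkm(300, 3_000, total)  # 300 reads over an assumed 3 kb transcript
print(gene_a, gene_b)
# 20.0 10.0
```

Under these assumed lengths, Gene B has more reads but the lower RPKM, so raw read counts alone do not rank expression.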

28
Q

name some variant calling method categories

A

allele counting
probabilistic methods: Bayesian model
Heuristic approach

29
Q

How do the variant calling methods differ?

A

Allele counting just counts how many of each allele e.g. A or G you have
Probabilistic quantifies statistical uncertainty
Heuristic is basically a fancy allele counting with a threshold for read depth, base quality, variant allele frequency, stat significance
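
The heuristic idea can be sketched as allele counting plus a few filters. The function name, the pileup data and the threshold values below are illustrative assumptions, not the defaults of VarScan2 or any other tool, and no statistical test is included.

```python
from collections import Counter

def call_variant(pileup, reference_base, min_depth=8, min_base_quality=20, min_vaf=0.2):
    """Call a variant at one position from (base, phred_quality) observations."""
    # Keep only bases that pass the base-quality threshold.
    usable = [base for base, quality in pileup if quality >= min_base_quality]
    depth = len(usable)
    if depth < min_depth:
        return None  # not enough reads covering this position
    counts = Counter(usable)  # plain allele counting
    alt_counts = {base: n for base, n in counts.items() if base != reference_base}
    if not alt_counts:
        return None  # every usable read matches the reference
    alt, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    vaf = alt_n / depth  # variant allele frequency
    if vaf < min_vaf:
        return None
    return {"alt": alt, "depth": depth, "vaf": vaf}

# Hypothetical position with reference base A:
# 6 good A reads, 4 good G reads and 3 low-quality G reads.
pileup = [("A", 35)] * 6 + [("G", 30)] * 4 + [("G", 5)] * 3
call = call_variant(pileup, "A")
print(call)
```

A probabilistic caller would replace the fixed thresholds with a model that reports how likely each genotype is given the same counts and qualities.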

30
Q

what is the most common variant calling method?

A

VarScan2

31
Q

What kinds of variants can we look for in variant calling?

A
  • Somatic mutations
  • Structural variants:
    - deletions
    - insertions (novel or mobile-element)
    - tandem or interspersed duplication
    - inversion
    - translocation
32
Q

name 3 programmes for variant annotations

A

SeattleSeq
Oncotator
ANNOVAR

33
Q

What does SeattleSeq do?

A

Annotates SNVs and small indels, both known and novel (everything from functions, protein positions to HapMap frequencies)

34
Q

What does Oncotator do?

A

annotates human genomic point mutations and indels with data relevant to cancer researchers

35
Q

What does ANNOVAR do?

A

annotates genetic variants detected from diverse genomes, including the human genome
gives word cloud

36
Q

what is GATK?

A

The Genome Analysis Toolkit (GATK) is a way of standardising the pipeline for genomic analysis, made by the Broad Institute.

37
Q

What is COSMIC?

A

The Catalogue Of Somatic Mutations In Cancer: tells you which mutations are associated with which cancers, and gives information about those mutations

38
Q

What is IntOGen?

A

Identifies the ‘driver’ genes in your tumour sample

and makes word clouds

39
Q

What is DAVID and what does the same thing?

A

It’s a functional annotation database (Database for Annotation, Visualization and Integrated Discovery):
Can put gene expression data, top hits, mutation data
Tells you what your top genes do and what functions are associated with your condition
GSEA can do the same thing

40
Q

What do Cufflinks and Cuffdiff do?

A

Cufflinks: assembles transcripts and estimates their abundances (FPKM)
Cuffdiff: tests for differential expression and regulation in RNA-Seq samples

41
Q

FPKM equation

A

FPKM = count of mapped fragments / (total mapped fragments (in millions) × exon length of transcript (in kb))

42
Q

outline pipeline for RNA seq

A
  1. Raw data (FASTQ)
  2. Read mapping: unspliced mapping (BWA, Bowtie) or spliced mapping (TopHat, MapSplice)
    - SAM/BAM files QC via RNA-SeQC
  3. Expression quantification: summarize read counts or FPKM/RPKM (Cufflinks)
  4. Differential expression testing: DESeq, edgeR, etc. or Cuffdiff
  5. Functional interpretation: functional enrichment, infer networks, integrate with other data
  6. Biological insights and hypothesis