Data analysis Flashcards

Question 1

Q

What is data analysis?

Answer

A

taking image data and turning it into sequence data

Question 2

Q

What are the roles of bioinformatics?

Answer

A

Analytic method development
Construction and curation of computational tools and databases
Data mining, interpretation and analysis

Question 3

Q

What does bioinformatics encompass?

Answer

A

Identifying differentially expressed genes
Somatic mutations
Copy number alterations
Epigenetic changes
Genomic understanding
Multifactorial analyses: including BP, pulse ox, etc.

But also
• Pathway analysis
• Genome analysis
• Literature searches

Question 4

Q

name 2 genome browsers

Answer

A

UCSC: UCSC genome browser
EMBL: Ensembl genome browser

Question 5

Q

what are genome browsers?

Answer

A

curated databases of all the annotated genomes that we have (like PubMed for genes)

Question 6

Q

Array processing steps

Answer

A

Experimental Design
Image Analysis – scan to intensity measures (raw data)
a. You can get many different types of image data files
Normalization – “clean” data
More “low level” analysis: fold change, ANOVA, data filtering
Data mining: how to interpret > 6000 measures
Validation: repeat with another technology

Question 7

Q

what do you have to consider in your experimental design in an array experiment?

Answer

A

Sample size
Biological (different mouse) and technical (repeating sequencing) replicates
etc

Question 8

Q

what are the different ways you can data mine in an array experiment?

Answer

A

a. Databases
b. Software
c. Techniques-clustering, pattern recognition etc.
d. Comparing to prior studies, across platforms?

Question 9

Q

How do we do array image analysis?

Answer

A

software uses algorithms to turn the scanned image into gene abundance estimates (expression), which can then be plotted, e.g. as volcano plots

Question 10

Q

Why do we need to normalise data in array experiments

Answer

A

“Normalizing” data allows comparisons ACROSS different arrays

○ Intensity of fluorescent markers might be different from one batch to the other due to differences in experiments, machines, etc.
○ Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION: technical variation can hide real biological variation
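
One simple way to picture this is median centring, sketched below. This is only an illustration: the notes do not say which normalization method is used, and the array names and intensities are made up.

```python
import math
from statistics import median


def median_normalise(arrays):
    """Median-centre log2 intensities so every array has a median of 0.

    `arrays` maps a (hypothetical) array name to its list of raw intensities.
    """
    normalised = {}
    for name, intensities in arrays.items():
        log_values = [math.log2(v) for v in intensities]
        centre = median(log_values)
        # Subtracting the array's own median removes a global brightness shift.
        normalised[name] = [v - centre for v in log_values]
    return normalised


# Made-up data: chip_2 was scanned twice as brightly as chip_1.
raw = {
    "chip_1": [100.0, 200.0, 400.0],
    "chip_2": [200.0, 400.0, 800.0],
}
norm = median_normalise(raw)
```

After centring, the two chips carry the same values, so the technical difference in brightness no longer looks like a change in expression.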

Question 11

Q

How are most “low level” analysis on array experiments done?

Answer

A

There is no standard way

usually pairwise comparisons
lists of up- and down-regulated genes are made using chosen cutoffs (fold increase, t-statistic [p-value], or a combination)
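
A minimal sketch of the fold-change part of this, assuming one control and one treated value per gene. The gene names, intensities and the cutoff passed in are hypothetical and chosen only for illustration; a real analysis would usually combine this with a t-statistic or p-value.

```python
import math


def classify_by_fold_change(control, treated, log2_cutoff):
    """Split genes into up- and down-regulated lists by log2 fold change."""
    up, down = [], []
    for gene in control:
        log2_fc = math.log2(treated[gene] / control[gene])
        if log2_fc >= log2_cutoff:
            up.append(gene)
        elif log2_fc <= -log2_cutoff:
            down.append(gene)
    return up, down


# Made-up expression values for three hypothetical genes.
control = {"gene_a": 100.0, "gene_b": 100.0, "gene_c": 100.0}
treated = {"gene_a": 400.0, "gene_b": 25.0, "gene_c": 110.0}

# A cutoff of 1.0 on the log2 scale is used here purely as an example.
up, down = classify_by_fold_change(control, treated, log2_cutoff=1.0)
```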

Question 12

Q

what is the usual fold cutoff for significance in array experiments of down or up regulation?

Question 13

Q

what should not be forgotten during array experiment “low level” analysis ?

Answer

A

multiple test correction (some things may have very large change in fold but low significance)
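
As an illustration, the Benjamini-Hochberg procedure is one widely used correction. The notes do not name a specific method, so treat this choice and the p-values below as assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
```

Each adjusted value is at least as large as the raw p-value, so fewer genes pass a given significance threshold once thousands of tests are taken into account.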

Question 14

Q

What are the 3 stages of NGS data analysis?

Answer

A

Primary analysis (run/sample quality): Raw data, images, signals –> basecalling –> bases/colours, quality values
Secondary analysis (sample quality/info) +/- reference –> alignment and assembly
Tertiary analysis (science): comparison –> statistical analysis and database searches

Question 15

Q

what is basecalling?

Answer

A

converts the raw signal into bases: it tells you which base (A, C, G or T) is present at each position of a read, with a quality value

Question 16

Q

Moore’s law

Answer

A

transistor counts, and with them computing power, double roughly every two years

- The cost of NGS is falling faster than Moore’s law, while computing power only keeps pace with Moore’s law

Question 17

Q

what is a read?

Answer

A

a read is the sequence obtained from one sequenced DNA fragment; the number of reads is how many of these bits of genomic data you have

Question 18

Q

what do FASTQ files tell you?

Answer

A

identifier
sequence
+ separator line (optionally repeats the identifier; it does not indicate the strand)
quality score
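
The four lines of a record can be read as in the sketch below. It assumes Phred+33 quality encoding, which the notes do not specify, and the read itself is made up.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into a dict.

    Assumes Phred+33 quality encoding (an assumption, not stated in the notes).
    """
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {
        "identifier": header[1:],
        "sequence": sequence,
        # Each quality character encodes one Phred score, offset by 33.
        "qualities": [ord(ch) - 33 for ch in quality],
    }


# A made-up record: identifier, sequence, separator, quality string.
record = parse_fastq_record(["@read_1", "ACGT", "+", "IIH#"])
```

Here the last base has a much lower score than the others, which is the per-base information that QC tools summarise.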

Question 19

Q

Outline NGS data analysis (not 3 steps)

Answer

A

Base calling
QC
Alignment
Variant calling
Annotation
Filtering
Reporting

Question 20

Q

what does NGS reporting tell you?

Answer

A

if the mutation is deleterious or not

Question 21

Q

what trend do you see in a FastQC report?

Answer

A

base quality decreases along the read because the molecules within each cluster fall out of synchronisation (phasing) as sequencing progresses

Question 22

Q

what is done during aligning?

Answer

A

We are checking the percentage of reads properly or uniquely mapped
checking for 5’ or 3’ bias
Among the mapped reads, the percentage of reads in exon, intron, and intergenic regions.

Question 23

Q

what do we use to align our genome?

Answer

A

Integrative Genomics Viewer (IGV), which displays the aligned reads against the reference genome

Question 24

Q

What does IGV do?

Answer

A

Visualises reads in the genome: highlights potential variants (SNPs, hetero/homozygous)
Visualises number of reads in a region: if it only spans exons = mRNA, splice variants, expression levels

Question 25

Q

what is the number of reads roughly proportional to?

Answer

A

the length of the gene and the total number of reads in the library

Question 26

Q

If Gene A: 200 and Gene B: 300 reads, is expression of Gene A < Expression of Gene B?

Answer

A

Gene B could be a larger transcript than A and thus make more reads
Depends on analysis

Question 27

Q

How do we get over the problem of the length of the gene interfering with read analysis?

Answer

A

we calculate RPKM (reads per kilobase of transcript per million mapped reads) to estimate expression
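
A worked sketch using the read counts from the Gene A / Gene B card. The gene lengths and library size are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical library of 10 million mapped reads.
total = 10_000_000

# Gene A: 200 reads over a hypothetical 1 kb transcript.
gene_a = rpkm(200, 1_000, total)
# Gene B: 300 reads over a hypothetical 3 kb transcript.
gene_b = rpkm(300, 3_000, total)
```

With these assumed lengths, Gene A has the higher RPKM even though Gene B has more raw reads, which is why raw counts alone cannot rank expression.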

Question 28

Q

name some variant calling method categories

Answer

A

allele counting
probabilistic methods: Bayesian model
Heuristic approach

Question 29

Q

How do the variant calling methods differ?

Answer

A

Allele counting just counts how many of each allele e.g. A or G you have
Probabilistic quantifies statistical uncertainty
Heuristic is basically a fancy allele counting with a threshold for read depth, base quality, variant allele frequency, stat significance
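
A toy sketch of allele counting with heuristic thresholds at a single position. The function name, thresholds and pileup string are hypothetical; real callers also use base quality and statistical tests.

```python
from collections import Counter


def call_variant(pileup_bases, ref_base, min_depth, min_vaf):
    """Count alleles at one position and apply simple heuristic thresholds.

    Returns (alt_base, variant_allele_frequency) if a variant passes,
    otherwise None.
    """
    counts = Counter(pileup_bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # not enough reads to trust a call
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None  # every read matches the reference
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    vaf = alt_n / depth
    return (alt_base, vaf) if vaf >= min_vaf else None


# Ten made-up reads covering one position: five A (reference) and five G.
result = call_variant("AAAAAGGGGG", ref_base="A", min_depth=8, min_vaf=0.2)
# The same kind of site with only four reads fails the depth threshold.
shallow = call_variant("AAAG", ref_base="A", min_depth=8, min_vaf=0.2)
```

A probabilistic caller would instead report how likely each genotype is given the same counts, rather than a pass/fail decision.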

Question 30

Q

what is the most common variant calling method?

Question 31

Q

What kinds of variants can we look for in variant calling?

Answer

A

somatic mutations
structural ones: 
deletions
insertions (novel or mobile-element)
tandem or interspersed duplication
inversion
translocation

Question 32

Q

name 3 programmes for variant annotations

Answer

A

SeattleSeq
Oncotator
Annovar

Question 33

Q

What does SeattleSeq do?

Answer

A

Annotates SNVs and small indels, both known and novel (everything from functions, protein positions to HapMap frequencies)

Question 34

Q

What does Oncotator do?

Answer

A

annotates human genomic point mutations and indels with data relevant to cancer researchers

Question 35

Q

What does Annovar do?

Answer

A

annotates genetic variants detected from diverse genomes including human genome
gives word cloud

Question 36

Q

what is GATK?

Answer

A

The Genome Analysis Toolkit (GATK) is a way of standardising the pipeline for genomic analysis, made by the Broad Institute.

Question 37

Q

What is COSMIC?

Answer

A

Tells you which cancer is associated with what mutations and information about the mutations

Question 38

Q

What is IntOGen?

Answer

A

Identifies the ‘driver’ genes in your tumour sample

and makes word clouds

Question 39

Q

What is DAVID and what does the same thing?

Answer

A

It’s a molecular signature database:
You can input gene expression data, top hits or mutation data
Tells you what your top genes do and what functions are associated with your condition
GSEA can do the same thing

Question 40

Q

What do cufflinks and cuffdiff do?

Answer

A

cufflinks: assembles transcripts and estimates their abundances (FPKM)
cuffdiff: tests for differential expression and regulation in RNA-Seq samples

Question 41

Q

FPKM equation

Answer

A

FPKM = count of mapped fragments for the transcript / (total mapped fragments in millions × exon length of the transcript in kb)

Question 42

Q

outline pipeline for RNA seq

Answer

A

Raw data (FASTQ)
Read mapping: unspliced mapping (BWA, Bowtie) or spliced mapping (TopHat, MapSplice)
- SAM/BAM files QC via RNA-SeQC
Expression quantification: summarize read counts or FPKM/RPKM (Cufflinks)
Differential expression testing: DESeq, edgeR, etc. or Cuffdiff
Functional interpretation: functional enrichment, infer networks, integrate with other data
Biological insights and hypothesis