Data analysis Flashcards
What is data analysis
taking image data and turning it into sequence data
What are the roles of bioinformatics?
- Analytic method development
- Construction and curation of computational tools and databases
- Data mining, interpretation and analysis
What does bioinformatics encompass
- Identifying differentially expressed genes
- Somatic mutations
- Copy number alterations
- Epigenetic changes
- Genomic understanding
- Multifactorial analyses: including BP, pulse ox, etc.
But also
• Pathway analysis
• Genome analysis
• Literature searches
name 2 genome browsers
UCSC: UCSC genome browser
EMBL-EBI: Ensembl genome browser
what are genome browsers?
curated databases of all the annotated genomes that we have (like PubMed for genes)
Array processing steps
- Experimental Design
- Image Analysis – scan to intensity measures (raw data)
a. You can get many different types of image data files
- Normalization – “clean” data
- More “low level” analysis -fold change, ANOVA, data filtering
- Data mining-how to interpret > 6000 measures
- Validation: repeat with another technology
what do you have to consider in your experimental design in an array experiment?
Sample size
Biological (different mouse) and technical (repeating sequencing) replicates
etc
what are the different ways you can data mine in an array experiment?
a. Databases
b. Software
c. Techniques-clustering, pattern recognition etc.
d. Comparing to prior studies, across platforms?
How do we do array image analysis?
Software uses algorithms to turn spot intensities into gene abundance estimates (expression), which can then be plotted, e.g. as volcano plots
Why do we need to normalise data in array experiments
“Normalizing” data allows comparisons ACROSS different arrays
○ Intensity of fluorescent markers might be different from one batch to the other due to differences in experiments, machines, etc.
○ Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION: technical variation can hide real biological variation
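A minimal sketch of one simple normalisation scheme, median centring of log2 intensities. The card does not name a method, so this choice is an assumption (quantile normalisation is another common option), and the chip names and intensity values are hypothetical.

```python
import math
from statistics import median

def median_center(arrays):
    """Shift each array's log2 intensities so that every array has median 0."""
    normalised = {}
    for name, intensities in arrays.items():
        logs = [math.log2(x) for x in intensities]
        m = median(logs)
        normalised[name] = [v - m for v in logs]
    return normalised

# Hypothetical raw intensities: chip_b was scanned twice as bright as chip_a,
# which is technical variation, not a change in gene expression.
raw = {
    "chip_a": [100.0, 200.0, 400.0],
    "chip_b": [200.0, 400.0, 800.0],
}
norm = median_center(raw)
print(norm)
```

After centring, the two chips give the same values per gene, so the brightness difference between batches no longer looks like a change in expression.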
How are most “low level” analysis on array experiments done?
There is no standard way
pairwise (usually)
Lists of up- and down-regulated genes are made using chosen cutoffs (by fold change, t-statistic [p-value], or a combination)
what is the usual fold cutoff for significance in array experiments of down or up regulation?
3-fold
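A minimal sketch of applying a fold cutoff in a pairwise comparison, using the 3-fold value from this card as the default. The gene names and expression values are hypothetical, and the function name is illustrative only.

```python
import math

def classify_by_fold_change(control, treated, fold_cutoff=3.0):
    """Label each gene 'up', 'down' or 'unchanged' from mean expression values."""
    threshold = math.log2(fold_cutoff)
    calls = {}
    for gene in control:
        # Working in log2 makes up- and down-regulation symmetric around 0.
        log2_fc = math.log2(treated[gene] / control[gene])
        if log2_fc >= threshold:
            calls[gene] = "up"
        elif log2_fc <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical mean expression values per condition.
control = {"geneA": 100.0, "geneB": 90.0, "geneC": 50.0}
treated = {"geneA": 400.0, "geneB": 20.0, "geneC": 60.0}
calls = classify_by_fold_change(control, treated)
print(calls)
```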
what should not be forgotten during array experiment “low level” analysis ?
multiple test correction (some things may have very large change in fold but low significance)
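A minimal sketch of one multiple test correction. The card does not say which correction is meant, so the Benjamini-Hochberg procedure used here is an assumption, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per gene tested.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```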
What are the 3 stages of NGS data analysis?
- Primary analysis (run/sample quality): Raw data, images, signals –> basecalling –> bases/colours, quality values
- Secondary analysis (sample quality/info) +/- reference –> alignment and assembly
- Tertiary analysis (science): comparison –> statistical analysis and database searches
what is basecalling?
Converting the raw signal at each position into a base (A, C, G or T) with an associated quality value
Moore’s law
computing power doubles roughly every two years
- NGS is getting cheaper faster than Moore’s law predicts, but computing power is only following Moore’s law
what is a read?
A single sequenced fragment output by the sequencer; the number of reads is how many such pieces of genomic data you have
what do FASTQ files tell you?
identifier
sequence
+ separator line (may repeat the identifier; it does not indicate the strand)
quality score
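A minimal sketch of reading the four lines of each FASTQ record. It assumes Phred+33 quality encoding and exactly four lines per record; the read name and sequence are made up.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality scores) for each 4-line record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        # Assumed Phred+33 encoding: score = ASCII code of the character - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical single-record FASTQ content.
example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(example))
print(records)
```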
Outline NGS data analysis (not 3 steps)
- Base calling
- QC
- Alignment
- Variant calling
- Annotation
- Filtering
- Reporting
what does NGS reporting tell you?
if the mutation is deleterious or not
what trend do you see in a FastQC report?
base quality decreases along read because of lack of synchronisation as the clusters grow
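A minimal sketch of how a per-position quality trend can be computed from the quality strings of several reads. It assumes Phred+33 encoding; the quality strings are hypothetical and this is not how FastQC itself is implemented.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average the quality score at each read position across all reads."""
    length = max(len(q) for q in quality_strings)
    totals = [0] * length
    counts = [0] * length
    for q in quality_strings:
        for pos, ch in enumerate(q):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

# Hypothetical quality strings from two reads of length 4.
qualities = ["II5+", "IG5!"]
means = mean_quality_per_position(qualities)
print(means)
```

In this toy input the mean score falls from the first position to the last, which is the shape of the trend described on the card.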
what is done during aligning?
We are checking the percentage of reads properly or uniquely mapped
checking for 5’ or 3’ bias
Among the mapped reads, the percentage of reads in exon, intron, and intergenic regions.
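A minimal sketch of the mapped-read percentages, computed from SAM text. The FLAG bits used (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are from the SAM format; treating a read with MAPQ at or above a threshold as "uniquely mapped" is an assumption that depends on the aligner, and the records are hypothetical.

```python
def mapping_summary(sam_text, min_mapq=30):
    """Return percentages of mapped and confidently mapped primary reads."""
    total = mapped = confident = 0
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x100 or flag & 0x800:
            continue  # count each read once: ignore secondary/supplementary
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if mapq >= min_mapq:
            confident += 1
    if total == 0:
        return {"pct_mapped": 0.0, "pct_confident": 0.0}
    return {
        "pct_mapped": 100.0 * mapped / total,
        "pct_confident": 100.0 * confident / total,
    }

# Hypothetical SAM content: one header line and four reads.
sam_text = "\n".join([
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr2\t50\t42\t4M\t*\t0\t0\tACGT\tIIII",
])
summary = mapping_summary(sam_text)
print(summary)
```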
what do we use to view our aligned reads?
Integrative Genomics Viewer (IGV); the alignment itself is done by an aligner such as BWA or Bowtie
What does IGV do?
Visualises reads in the genome: highlights potential variants (SNPs, hetero/homozygous)
Visualises number of reads in a region: if it only spans exons = mRNA, splice variants, expression levels
what is the number of reads roughly proportional to?
the length of the gene and the total nr. of reads in the library
If Gene A: 200 and Gene B: 300 reads, is expression of Gene A < Expression of Gene B?
Gene B could be a larger transcript than A and thus make more reads
Depends on analysis
How do we get over the problem of the length of the gene interfering with read analysis?
we make RPKM (reads per kilobase of transcript per Million mapped reads) to find out expression
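A minimal sketch of the RPKM calculation, applied to the Gene A (200 reads) and Gene B (300 reads) example from the earlier card. The gene lengths and library size are hypothetical values chosen for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)

# Hypothetical lengths: Gene A is 1 kb, Gene B is 3 kb; library of 10 million reads.
library_size = 10_000_000
rpkm_a = rpkm(200, 1_000, library_size)
rpkm_b = rpkm(300, 3_000, library_size)
print(rpkm_a, rpkm_b)
```

With these assumed lengths, Gene A has fewer reads but the higher RPKM, which is why raw read counts alone cannot rank expression.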
name some variant calling method categories
allele counting
Probabilistic methods: Bayesian model
Heuristic approach
How do the variant calling methods differ?
Allele counting just counts how many of each allele e.g. A or G you have
Probabilistic quantifies statistical uncertainty
Heuristic is basically a fancy allele counting with a threshold for read depth, base quality, variant allele frequency, stat significance
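A minimal sketch of the heuristic idea: count alleles at one position, then apply thresholds. Only read depth and variant allele frequency are checked here; base quality and statistical significance are left out for brevity. The threshold values and function name are hypothetical and are not taken from any particular caller.

```python
from collections import Counter

def call_variant(pileup_bases, ref_base, min_depth=8, min_vaf=0.2):
    """Return (alt_base, variant allele frequency), or None if no call is made."""
    depth = len(pileup_bases)
    counts = Counter(pileup_bases)  # plain allele counting
    alt_counts = {b: c for b, c in counts.items() if b != ref_base}
    if depth < min_depth or not alt_counts:
        return None  # too few reads, or every read matches the reference
    alt_base, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
    vaf = alt_count / depth
    if vaf < min_vaf:
        return None  # alternate allele too rare to call
    return alt_base, vaf

# Hypothetical bases observed at one position across overlapping reads.
het_call = call_variant("AAAAAGGGGG", ref_base="A")
low_vaf = call_variant("AAAAAAAAAG", ref_base="A")
low_depth = call_variant("AG", ref_base="A")
print(het_call, low_vaf, low_depth)
```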
what is the most common variant calling method?
VarScan2
What kinds of variants can we look for in variant calling?
Somatic mutations
Structural variants: deletions, insertions (novel or mobile-element), tandem or interspersed duplications, inversions, translocations
name 3 programmes for variant annotations
SeattleSeq
Oncotator
Annovar
What does SeattleSeq do?
Annotates SNVs and small indels, both known and novel (everything from functions, protein positions to HapMap frequencies)
What does Oncotator do?
annotates human genomic point mutations and indels with data relevant to cancer researchers
What does Annovar do?
annotates genetic variants detected from diverse genomes including human genome
gives word cloud
what is GATK?
The Genome Analysis Toolkit (GATK) is a toolkit made by the Broad Institute for standardising the genomic analysis pipeline.
What is COSMIC?
The Catalogue Of Somatic Mutations In Cancer: tells you which cancers are associated with which mutations, plus information about the mutations
What is IntOGen?
Identifies the ‘driver’ genes in your tumour sample
and makes word clouds
What is DAVID and what does the same thing?
It’s a functional annotation database (Database for Annotation, Visualization and Integrated Discovery):
Can put gene expression data, top hits, mutation data
Tells you what your top genes do and what functions are associated with your condition
GSEA can do the same thing
What do cufflinks and cuffdiff do?
cufflinks: assembles transcripts and estimates their abundances (FPKM)
cuffdiff: tests for differential expression and regulation in RNA-Seq samples
FPKM equation
FPKM = counts of mapped fragments / (total mapped fragments (millions) × exon length of transcript (kb))
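The FPKM equation on this card can be written symbolically as follows. The symbols are introduced here for illustration: F is the count of fragments mapped to the transcript, N is the total number of mapped fragments, and L is the exon length of the transcript in base pairs. The numbers in the worked line are hypothetical.

```latex
\[
\mathrm{FPKM}
  = \frac{F}{\left(N / 10^{6}\right)\left(L / 10^{3}\right)}
  = \frac{10^{9}\,F}{N\,L}
\]
% Worked example with hypothetical values:
% F = 500, N = 2 \times 10^{7}, L = 2500 bp
\[
\mathrm{FPKM} = \frac{500}{20 \times 2.5} = 10
\]
```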
outline pipeline for RNA seq
- Raw data (FASTQ)
- Reads mapping: unspliced mapping (BWA, Bowtie) or spliced mapping (TopHat, MapSplice)
- SAM/BAM files QC via RNA-SeQC
- Expression quantification: summarize read counts or FPKM/RPKM (cufflinks)
- Differential expression testing: DESeq, edgeR, etc. or Cuffdiff
- Functional interpretation: functional enrichment, infer networks, integrate with other data
- Biological insights and hypothesis