Bioinformatics Flashcards
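What is a tool in bioinformatics?
A program that performs one specific task, for example an aligner or featureCounts. SAM, BAM, and FASTQ are file formats that such tools read and write, not tools themselves.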
How is the FASTQ format structured?
Each sequence represented with four lines:
* Line 1 always begins with a ‘@’ character and is followed by a
sequence identifier and various information from the sequencing.
(This information is optional and varies between datasets.)
* Line 2 is the raw sequence letters.
* Line 3 begins with a ‘+’ character and is optionally followed by the
same sequence identifier (and any description) again.
* Line 4 encodes the quality values for the sequence in Line 2, and
must contain the same number of symbols as letters in the
sequence.
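The four-line layout above can be sketched as a small parser. This is a minimal illustration, not a replacement for an established FASTQ library. The names `FastqRecord`, `read_fastq`, and `phred_scores` are made up for this card, and the quality offset of 33 (Phred+33) is an assumption that should be checked against the dataset.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # Line 1 without the leading '@'
    sequence: str    # Line 2
    quality: str     # Line 4


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines, checking the rules above."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if quality is None:
            raise ValueError("truncated record: expected four lines")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("line 4 must be as long as line 2")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality symbols to integers (assumes Phred+33 encoding)."""
    return [ord(symbol) - offset for symbol in quality]


# Hypothetical one-record example.
example = io.StringIO("@read1 sample=demo\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
print(records[0].identifier, records[0].sequence, phred_scores(records[0].quality))
```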
What problems can arise in transcriptome alignment?
When exons are spliced together in the mRNA, a read that spans an exon-exon junction contains a sequence that is not contiguous in the genome. Such a read cannot be aligned end to end to the chromosome, so the alignment ends up with gaps.
How can gaps in the alignment of sequences be avoided?
The aligner starts aligning a sequence. Where the alignment breaks down, it stops and splits the alignment in two, so that it can go and find the other part somewhere else in the genome.
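A toy sketch of the split idea, using exact string matching only. Real splice-aware aligners are far more sophisticated; the function `split_align`, the `min_anchor` parameter, and the example sequences are all hypothetical.

```python
def split_align(read, genome, min_anchor=4):
    """Return a list of (read_start, genome_start, length) segments.

    Toy approach: try the whole read first; if that fails, try every
    split point (longest first part first) and look for the second
    part further downstream in the genome. Returns None if nothing fits.
    """
    whole = genome.find(read)
    if whole != -1:
        return [(0, whole, len(read))]
    for cut in range(len(read) - min_anchor, min_anchor - 1, -1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The second part must start after the end of the first part.
        right_pos = genome.find(right, left_pos + cut)
        if right_pos != -1:
            return [(0, left_pos, cut), (cut, right_pos, len(right))]
    return None


# Hypothetical gene: two exons separated by an intron.
exon1 = "ACGTACGGTC"
intron = "GTAAG" + "T" * 8 + "CAG"
exon2 = "TTGACCATGA"
genome = exon1 + intron + exon2

# A read from the spliced mRNA that spans the exon-exon junction.
junction_read = exon1[-6:] + exon2[:6]
segments = split_align(junction_read, genome)
print(segments)
```

The read does not occur anywhere in the genome as one piece, but its two halves do, on either side of the intron.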
Why do you index a genome?
Indexing a genome is similar to indexing a book. If you want to know on which page a certain word appears or a chapter begins, it is much faster to look it up in a pre-built index than to go through every page of the book until you find it. The same goes for alignments. Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory.
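The book analogy can be sketched with a simple k-mer lookup table. This is only an illustration: real aligners use more compact structures such as suffix arrays or FM-indexes. The function names and the tiny genome are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_read(read, genome, index, k):
    """Use the first k letters of the read as a seed, then verify each hit."""
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if genome[pos:pos + len(read)] == read]


toy_genome = "ACGTACGTTAGC"
toy_index = build_kmer_index(toy_genome, k=4)

# Only the positions listed under the seed "ACGT" are checked,
# instead of scanning every position in the genome.
hits = find_read("ACGTTAG", toy_genome, toy_index, k=4)
print(toy_index["ACGT"], hits)
```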
How is sequence alignment to a reference done?
- Align to a reference genome (DNA or RNA) or transcriptome (RNA)
- Find out where your sequences match the reference.
- Analyse the genomic regions where sequences accumulate
What is featureCounts?
featureCounts:
* A software program developed for counting reads to genomic features
such as genes, exons, promoters and genomic bins
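The idea of counting reads to features can be sketched as an interval-overlap count. This is not the featureCounts implementation. The function `count_reads`, the coordinates, and the rule of leaving reads that overlap several features uncounted are simplifying assumptions made for this card.

```python
def count_reads(features, reads):
    """Count reads per feature.

    features: list of (name, start, end), 1-based inclusive coordinates.
    reads:    list of (start, end) on the same chromosome.
    A read is counted only if it overlaps exactly one feature
    (an assumption for this sketch).
    """
    counts = {name: 0 for name, _, _ in features}
    unassigned = 0
    for read_start, read_end in reads:
        overlapping = [
            name
            for name, feat_start, feat_end in features
            if read_start <= feat_end and feat_start <= read_end
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        else:
            unassigned += 1  # no feature, or ambiguous between features
    return counts, unassigned


# Hypothetical annotation and aligned reads.
toy_features = [("geneA", 100, 200), ("geneB", 180, 300)]
toy_reads = [(110, 160), (120, 170), (150, 190), (250, 299), (400, 450)]
gene_counts, not_counted = count_reads(toy_features, toy_reads)
print(gene_counts, not_counted)
```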
What is important to remember in experimental design?
- A typical experiment has two conditions: Control and Experimental (for example samples from healthy and diseased individuals)
- Technical replicate: Same biological sample in different runs
- Biological replicate: Sample from a different biological source (for example a different patient) in different runs
- Due to the good technical reproducibility in RNA-Seq, biological replicates are more important than technical replicates
- At least three biological replicates are recommended for proper statistical testing
What does it mean to borrow variance?
- Use the Negative Binomial Distribution (NBD)
- Problem: In the NBD the variance cannot be directly estimated from the mean
- Trick: Use the variance of other features in the dataset with similar expression level to estimate the variance for each feature
- Borrow variance from similar features
- Solution to the overdispersion problem for datasets with few replicates
- Leads to more robust estimates of significance. Fewer false positives.
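A rough sketch of the borrowing trick, assuming the common negative binomial parameterisation variance = mean + dispersion × mean². It is not the method of any particular package. The function names, the window of neighbouring genes, the 50/50 weighting, and the counts are hypothetical choices for illustration.

```python
from statistics import mean, median, variance


def raw_dispersion(counts):
    """Per-gene moment estimate from: variance = mean + dispersion * mean**2."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    return max((variance(counts) - mu) / mu ** 2, 0.0)


def borrowed_dispersions(count_table, neighbours=1, weight=0.5):
    """Pull each gene's estimate toward genes with similar expression.

    count_table: dict of gene -> list of replicate counts.
    neighbours:  how many genes on each side (ranked by mean) to borrow from.
    weight:      share of the gene's own estimate kept (arbitrary here).
    """
    genes = sorted(count_table, key=lambda g: mean(count_table[g]))
    raw = {g: raw_dispersion(count_table[g]) for g in genes}
    shrunk = {}
    for i, gene in enumerate(genes):
        window = genes[max(0, i - neighbours):i + neighbours + 1]
        local_trend = median(raw[g] for g in window)
        shrunk[gene] = weight * raw[gene] + (1 - weight) * local_trend
    return raw, shrunk


# Hypothetical counts, three replicates per gene.
toy_counts = {
    "geneA": [100, 100, 100],  # no spread at all by chance
    "geneB": [60, 160, 110],   # large spread by chance
    "geneC": [80, 160, 120],
}
raw, shrunk = borrowed_dispersions(toy_counts)
for gene in toy_counts:
    print(gene, round(raw[gene], 3), round(shrunk[gene], 3))
```

With only three replicates the per-gene estimates jump around; after borrowing, geneA is no longer treated as having zero dispersion and geneB's extreme value is pulled down.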
What is PCA?
Multi-dimensional experimental design
Find the directions with most variation in the data (PC – Principal Components)
Transform data to plane (axes) defined by Principal Components (PC1 and PC2)
Map of your data and the relations between samples and variables
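A small pure-Python sketch of the steps above: centre the data, find the direction of most variation (PC1), then the next one (PC2), and project the samples onto them. It uses power iteration with deflation, which is one simple way to get the leading components and an assumption of this card; numerical libraries normally do this differently. All names and numbers are hypothetical.

```python
import math


def center(rows):
    """Subtract the column mean from every value (rows = samples)."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200):
    """Power iteration: repeated multiplication converges to the top direction."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue


def pca(rows, n_components=2):
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        v, lam = leading_component(cov)
        components.append(v)
        # Deflation: remove the variation already explained by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores


# Hypothetical data: five samples, two variables that rise together.
samples = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
components, scores = pca(samples)
pc1, pc2 = components
print(pc1, pc2)
```

Here PC1 points roughly along the diagonal, because both variables increase together, and `scores` holds each sample's coordinates on the PC1 and PC2 axes.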
What is a GSEA test?
Single sample GSEA (ssGSEA): GSEA
performed on each individual sample in a
dataset
* Rank genes in each sample according to
expression value (highest first)
* Positive score: Geneset genes enriched at the
top of ranked list
* Negative score: Geneset genes enriched at the
bottom of ranked list
* A positive score indicates that the sample has
the property the geneset represents
* Here: Enriched for Non-Canonical Wnt-pathway
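The ranking and the sign of the score can be illustrated with a simplified, unweighted running-sum score for one sample. Actual ssGSEA implementations use a weighted statistic and differ in detail, so treat this only as a sketch. The function name, gene names, and expression values are hypothetical.

```python
def enrichment_score(expression, gene_set):
    """Simplified per-sample enrichment score.

    expression: dict of gene -> expression value for ONE sample.
    gene_set:   set of gene names.
    Walk down the ranked list (highest expression first); step up at
    geneset genes, step down at other genes, and return the largest
    deviation from zero (with its sign).
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hits = sum(hits)
    n_misses = len(ranked) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("need genes both inside and outside the geneset")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical sample with six genes.
sample = {"g1": 6.0, "g2": 5.0, "g3": 4.0, "g4": 3.0, "g5": 2.0, "g6": 1.0}
top_score = enrichment_score(sample, {"g1", "g2"})     # geneset at the top
bottom_score = enrichment_score(sample, {"g5", "g6"})  # geneset at the bottom
print(top_score, bottom_score)
```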
What is the ChIP-seq concept?
- Crosslink cells using formaldehyde ("freeze" the protein-DNA interactions)
- Fragment DNA (200-500 bp fragment length)
- Use an antibody against the TF (transcription factor) of interest to "fish" for DNA fragments bound by the TF
- Wash off the proteins and isolate the DNA fragments
- Sequence the DNA fragments.
- Single-end sequencing
- 75-150 bp is the current standard
ChIP-seq profiles
- Library of tags with constant lengths
- Typical number of tags in current studies: 5-50 million
- Sequenced tags are aligned to the reference genome
- There will be enrichment of tags (peaks) at genomic positions where the DNA was bound by the TF of interest.
- Can generate 100 to 50 000 peaks depending on factor and experiment
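The notion of a peak as a pile-up of aligned tags can be sketched by binning tag positions and keeping the crowded bins. Real peak callers such as MACS model the background and use control samples; this toy version does neither. The function `call_peaks`, the bin size, the threshold, and the positions are hypothetical.

```python
from collections import Counter


def call_peaks(tag_starts, bin_size=100, min_tags=5):
    """Return (bin_start, bin_end, tag_count) for bins with many tags.

    tag_starts: aligned start positions of tags on one chromosome.
    """
    tags_per_bin = Counter(pos // bin_size for pos in tag_starts)
    return sorted(
        (b * bin_size, (b + 1) * bin_size, n)
        for b, n in tags_per_bin.items()
        if n >= min_tags
    )


# Hypothetical tag positions: a cluster near 1000 plus scattered background.
toy_tags = [1010, 1020, 1035, 1050, 1077, 1090, 5200, 7420, 7450]
peaks = call_peaks(toy_tags)
print(peaks)
```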