SU Genome assembly & functional genomics Flashcards
Genome Assembly
involves reconstructing the original genome sequence from short, overlapping fragments from sequencing technologies.
- challenging because of high complexity and repetitive nature of genomes, especially in large and complex organisms.
The assembly process involves the creation of contigs: sets of overlapping DNA segments or sequences that together provide a contiguous representation of a genomic region.
The quality of an assembly is often measured using metrics such as N50: with contigs sorted from longest to shortest, N50 is the length of the shortest contig in the smallest set of longest contigs whose combined length is at least 50% of the total assembly length.
Shotgun Genome Assembly
Definition: Shotgun genome assembly reconstructs a genome by randomly fragmenting it, sequencing the fragments, and reassembling them using computational tools.
Key Steps:
1. DNA Fragmentation: Genomic DNA is randomly broken into small pieces using physical or enzymatic methods.
2. Sequencing: Each fragment is sequenced to produce millions of reads, either short reads (e.g., Illumina) or long reads (e.g., PacBio).
3. Read Alignment (Overlap Detection): Computational tools identify overlapping regions between reads, which suggests that the reads come from the same part of the genome.
4. Contig Construction: Overlapping reads are merged to form contigs, which are contiguous DNA sequences.
5. Scaffolding: Contigs are linked into scaffolds, which represent larger genomic structures, often containing gaps.
6. Gap Filling & Error Correction: Additional reads are used to fill gaps and improve assembly accuracy through error correction algorithms.
7. Quality Assessment: Metrics such as N50 and coverage are used to evaluate assembly quality.
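The first four steps above can be sketched with a toy greedy overlap assembler. This is a minimal illustration, not how production assemblers work: the genome string, the reads, and all function names (`sample_reads`, `overlap`, `drop_contained`, `greedy_assemble`) are invented for the example, reads are assumed to be error-free and from one strand, and scaffolding, gap filling, and quality assessment (steps 5 to 7) are left out.

```python
import random

def sample_reads(genome, read_len, n_reads, seed=0):
    """Steps 1-2 (toy): draw error-free reads from random start positions."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

def overlap(a, b, min_len):
    """Step 3: length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0  # no overlap of at least min_len

def drop_contained(seqs):
    """Remove duplicates and sequences fully contained in another sequence."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != o and s in o for o in unique)]

def greedy_assemble(reads, min_len=3):
    """Step 4: repeatedly merge the pair of sequences with the longest overlap."""
    contigs = drop_contained(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])

# Invented 16-base "genome" in which every 3-mer occurs only once.
genome = "ATGGCGTACCTGAATC"

# Four overlapping reads that tile the genome, given in shuffled order.
tiled_reads = ["TACCTGAA", "ATGGCGTA", "CCTGAATC", "GCGTACCT"]
print(greedy_assemble(tiled_reads))  # ['ATGGCGTACCTGAATC']

# Randomly sampled reads may leave gaps, so several contigs can remain.
random_reads = sample_reads(genome, read_len=8, n_reads=30, seed=1)
print(greedy_assemble(random_reads))
```

The toy genome was chosen so that no 3-mer repeats, which means every overlap of at least 3 bases is a true overlap. Real genomes contain repeats, so the greedy choice can merge reads that do not belong together.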
There are two main types of assembly: de novo assembly, which constructs genomes from scratch without a reference, and reference-based assembly, which uses a known genome as a guide. This technique is widely applied in whole genome sequencing (WGS), metagenomics, and transcriptomics.
Three Main Problems in Genome Assembly
There are three main problems that complicate genome assembly:
1. Sequencing Limitations: Sequencing technologies typically produce relatively short reads compared to the full length of chromosomes, making it difficult to reconstruct the entire genome sequence accurately. -> shotgun assembly
2. Technical Errors: Sequencing errors and biases, especially in high or low GC-content regions, can result in low coverage and misassemblies.
3. Genomic Architecture: The presence of repetitive sequences and structural variations (e.g., duplications, translocations) can lead to ambiguities in the assembly and make it challenging to infer the correct genomic structure.
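The repeat problem can be made concrete with a small sketch. It assumes an idealized sequencer that returns every error-free substring of length k as a read; the two genome strings and the helper `kmer_spectrum` are invented for illustration. Both genomes contain the same repeat three times but with the segments between the copies swapped.

```python
from collections import Counter

def kmer_spectrum(seq, k):
    """Multiset of all length-k substrings (idealized reads) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

repeat = "GGGG"  # toy repeat, 4 bases long
genome_1 = "AC" + repeat + "TA" + repeat + "CT" + repeat + "AT"
genome_2 = "AC" + repeat + "CT" + repeat + "TA" + repeat + "AT"

# Reads of length 5 cannot span a full repeat copy plus a base on each side,
# so both genomes produce exactly the same set of reads.
same_short = kmer_spectrum(genome_1, 5) == kmer_spectrum(genome_2, 5)

# Reads of length 7 can span a repeat copy and tell the two genomes apart.
same_long = kmer_spectrum(genome_1, 7) == kmer_spectrum(genome_2, 7)

print(same_short, same_long)  # True False
```

With the short reads no assembler can decide between the two arrangements from the reads alone, which is why reads or read pairs longer than the repeats help.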
Terminology
Contig: contiguous sequence
Scaffold: multiple ordered contigs, perhaps with gaps (NNNs)
Coverage: sampling depth of the genome, i.e. the average number of reads covering each base, typically >30×
N50: typical contig/scaffold/fragment length, describes how good the assembly is -> Length-weighted median length
Calculated by sorting all fragments from longest to shortest and recording the cumulative length; N50 is the length of the fragment at which the cumulative length first reaches 50% of the total assembly length.
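The N50 calculation and the coverage term can be sketched as two small functions. The function names and the example numbers are made up for illustration, and `mean_coverage` uses the common estimate (number of reads × read length ÷ genome size), which is an addition here rather than something stated on the cards.

```python
def n50(lengths):
    """Length of the fragment at which the cumulative length, taken from
    longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no fragment lengths given")

def mean_coverage(n_reads, read_len, genome_size):
    """Average number of reads covering each base (assumed estimate)."""
    return n_reads * read_len / genome_size

# Invented contig lengths: total 300, so the 50% mark is 150.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 70  (80 + 70 = 150 reaches the 50% mark)

# Invented run: 1,000,000 reads of 150 bases on a 5,000,000-base genome.
print(mean_coverage(1_000_000, 150, 5_000_000))  # 30.0
```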
Functional genomics
focuses on understanding how genes lead to specific phenotypes. It involves studying gene expression (transcriptomics) and the regulation of gene expression (epigenetics) in order to link genetic sequences to functional outcomes.
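Epigenetics: modifications that regulate the activity of the genome without changing the DNA sequence itself (e.g., DNA methylation, histone modifications)
Chromatin: the mixture of DNA and proteins in the cell nucleus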
Major techniques include:
ChIP-seq: Chromatin immunoprecipitation (an antibody binds the protein of interest) followed by sequencing, used to identify the DNA binding sites of proteins (DNA-protein interactions).
ATAC-seq: Assay for Transposase-Accessible Chromatin, uses a transposase to map open chromatin regions, which are likely involved in gene regulation.
RNA-seq: Sequencing of cDNA derived from RNA transcripts, used to quantify gene expression levels across different conditions or cell types.
RNA Sequencing
Single-Cell RNA Sequencing (scRNA-seq): a technique that enables the profiling of gene expression at the resolution of individual cells.
Bulk RNA-seq provides an average expression level across all cells in a sample, whereas scRNA-seq allows researchers to examine the heterogeneity between cells, revealing distinct cell types, states, and functions within complex tissues.
Spatial RNA-seq: measures expression at or near single-cell resolution without removing the cells from the tissue, so the spatial location of each measurement is preserved.
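The difference between bulk and single-cell measurements can be illustrated with a toy count table. The cell names, gene names, and counts below are invented, and the helpers `bulk_profile` and `dominant_gene` are hypothetical names used only for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Invented per-cell expression counts for two genes.
cells = {
    "cell_1": {"GeneA": 10, "GeneB": 0},
    "cell_2": {"GeneA": 10, "GeneB": 0},
    "cell_3": {"GeneA": 0, "GeneB": 10},
    "cell_4": {"GeneA": 0, "GeneB": 10},
}

def bulk_profile(cells):
    """Bulk view: one average expression value per gene over all cells."""
    genes = sorted({g for profile in cells.values() for g in profile})
    return {g: mean(p.get(g, 0) for p in cells.values()) for g in genes}

def dominant_gene(profile):
    """Gene with the highest count in one cell."""
    return max(profile, key=profile.get)

# Single-cell view: group cells by their most highly expressed gene.
groups = defaultdict(list)
for name, profile in cells.items():
    groups[dominant_gene(profile)].append(name)

print(bulk_profile(cells))  # both genes average 5: the sample looks uniform
print(dict(groups))         # two distinct groups of cells become visible
```

In the averaged profile both genes appear moderately expressed everywhere, while the per-cell view shows two populations that each express only one of them.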
Barcoding
is a technique used to label DNA or RNA fragments with unique short sequences, so that multiple samples can be sequenced together and each read can still be assigned to its sample of origin.
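Assigning reads back to samples by barcode (demultiplexing) can be sketched as follows. This is a simplification under stated assumptions: the barcode is taken to be the first bases of each read and must match exactly, and the barcodes, sample names, reads, and the function `demultiplex` are all invented for the example.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by sample using the barcode at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads with an unknown barcode are kept apart, not guessed.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)

# Hypothetical 4-base barcodes and pooled reads.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
pooled_reads = ["ACGTGGATCC", "TTAGCCATGG", "ACGTTTTAAA", "GGGGACGTAC"]

print(demultiplex(pooled_reads, barcodes, barcode_len=4))
# {'sample_A': ['GGATCC', 'TTTAAA'], 'sample_B': ['CCATGG'], 'unassigned': ['ACGTAC']}
```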
Disadvantage of scRNA-seq
can damage cells, which alters the apparent abundance of cell types and can skew results regarding the diversity and quantity of cell types in a sample. -> use cell nuclei instead (single-nucleus sequencing).