Bioinformatics Flashcards

1
Q

How has biology become deluged with data?

A
  • Nucleotides:
      • cDNAs (including ESTs)
      • Whole genomes
      • DNA polymorphisms
      • Copy number variations
      • DNA mutations
      • Microarray results
      • Epigenetics
  • Proteins:
      • Amino acid sequences
      • Protein structures
      • Proteomics
      • Protein interaction maps
  • Cells:
      • Cell signalling
      • Microscopy of living cells

2
Q

What is Bioinformatics?

A
  • Bioinformatics provides the tools for accessing and managing large amounts of biological data, as well as the algorithms to assess relationships among members of the data sets.
3
Q

What is GenBank?

A

GenBank is Huge, Growing, and Freely Available

GenBank is the NIH’s (National Institutes of Health, USA) annotated collection of all publicly available sequences.

As of August 2020:

  • 218,642,238 sequences in traditional GenBank records
  • 1,901,329,611 sequences in WGS (whole-genome shotgun) records
4
Q

What is the Goal of Bioinformatics?

A

A goal of bioinformatics is to provide new insights into biological questions by enabling a more global view of a research question.

  • More genes
  • More proteins
  • More sequences
  • More genomes
5
Q

What is PAX6?

A
  • The transcription factor PAX6 is the focus of the bioinformatics practicals.
  • PAX6 is conserved across a wide range of species (from flies to humans).
  • Many interesting PAX6 mutations affect eye development.
  • Crystal structures are known for the two DNA-binding domains of PAX6: the homeobox domain and the PAX domain.

6
Q

Transcription factors such as PAX6 recruit RNA Polymerase II via mediators

A
  • The promoter is the DNA sequence where the general transcription factors and the polymerase assemble
  • The cis-regulatory sequences are binding sites for transcription regulators, whose presence on the DNA affects the rate of transcription initiation.
  • These sequences can be located adjacent to the promoter, far upstream of it, or even within introns or entirely downstream of the gene.
  • The broken stretches of DNA signify that the length of DNA between the cis-regulatory sequences and the start of transcription varies, sometimes reaching tens of thousands of nucleotide pairs in length.
  • The TATA box is a DNA recognition sequence for the general transcription factor TFIID.
  • As shown in the lower panel, DNA looping allows transcription regulators bound at any of these positions to interact with the proteins that assemble at the promoter.
  • Many transcription regulators act through Mediator, while some interact with the general transcription factors and RNA polymerase directly.
  • Transcription regulators also act by recruiting proteins that alter the chromatin structure of the promoter.
  • Whereas Mediator and the general transcription factors are the same for all RNA polymerase II-transcribed genes, the transcription regulators and the locations of their binding sites relative to the promoter differ for each gene.
7
Q

PAX6 homologs

A
  • Drosophila: ey (eyeless)
  • Mouse: Sey (Small eye)
  • Human: PAX6

All three genes (ey, Sey, PAX6) are homologs. They are descended from a common ancestral gene.

8
Q

Conservation of PAX6

A
9
Q

In humans, some PAX6 mutations are associated with aniridia.

A
10
Q

PAX6 is a “toolkit” protein for development

A
  • The amino acid sequence of PAX6 is conserved across species.
  • Biochemical activity is conserved across species (interaction with DNA and other proteins involved in transcription). Mouse PAX6 will work in flies.
  • PAX6 regulates about 300 genes.
11
Q

Domain Structure of PAX6

A
  • A domain is a sequence of amino acids which can fold by itself (without the rest of the protein).
  • PAX6 contains a pax domain and a homeobox domain.
  • Both bind specific DNA sequences.
12
Q

Structure of the Homeobox Domain of PAX6

A
13
Q

The homeobox domain is a common domain for binding DNA.

A

Transcription Factors can “read” the DNA to bind in the correct place

  • On the left, a single contact is shown between a transcription regulator and DNA; such contacts allow the protein to “read” the DNA sequence. On the right, the complete set of contacts between a transcription regulator (a member of the homeodomain family) and its cis-regulatory sequence is shown.
  • The DNA-binding portion of the protein is 60 amino acids long. Although the interactions in the major groove are the most important, the protein is also seen to contact both the minor groove and phosphates in the sugar-phosphate DNA backbone.
14
Q

The Paired Box Domain (PAX) binds DNA

A

The paired box domain (PAX) is common to many DNA-binding proteins.

15
Q

How can CRISPR repair Pax6 and treat aniridia in mice?

A
  • An optimized Cas9 ribonucleoprotein complex and a single-stranded oligodeoxynucleotide containing the 3xFLAG sequence were microinjected into one-cell mouse zygotes to generate transgenic animals in one step.
  • (A) Slit lamp analysis of the transgenic mouse eyes. As expected, images of Fey showed small eyes and corneal opacity, not significantly different from the Sey (Pax6 small eye) mouse model (two-tailed test, p > 0.999).
  • Slit lamp images of Fax (FLAG-tagged Pax6) mouse eyes showed a normal iris and clear cornea, which were not significantly different from WT eyes.
16
Q

Overview of Making mRNA

A
  • The final level of a protein in the cell depends on the efficiency of each step and on the rates of degradation of the RNA and protein molecules.
  • (A) In eukaryotic cells, the mRNA molecule resulting from transcription contains both coding (exon) and noncoding (intron) sequences.
  • Before it can be translated into protein, the two ends of the RNA are modified, the introns are removed by an enzymatically catalyzed RNA splicing reaction, and the resulting mRNA is transported from the nucleus to the cytoplasm.
  • For convenience, the steps in this figure are depicted as occurring one at a time; in reality, many occur concurrently. For example, the RNA cap is added and splicing begins before transcription has been completed. Because of the coupling between transcription and RNA processing, intact primary transcripts—the full-length RNAs that would, in theory, be produced if no processing had occurred—are found only rarely.
  • (B) In prokaryotes, the production of mRNA is much simpler. The 5’ end of an mRNA molecule is produced by the initiation of transcription, and the 3’ end is produced by the termination of transcription.
  • Since prokaryotic cells lack a nucleus, transcription and translation take place in a common compartment, and the translation of a bacterial mRNA often begins before its synthesis has been completed.
17
Q

The Sequence of the Sense Strand Matches the mRNA

A

Note: When an mRNA/cDNA appears on a database, the sequence of the (coding) sense DNA strand is given as ATGC (not AUGC).

The nucleotide sequence of transcribed RNA is normally identical to that of the sense strand, except that U replaces T, and is complementary to that of the template strand.

The nucleotide at the extreme 5′ end of a primary RNA transcript carries a 5′ triphosphate group that may later undergo modification; the 3′ end has a free hydroxyl group.
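
The strand relationships above can be checked with a minimal Python sketch. The 9-nucleotide sequence and the function names are made up for illustration; they are not taken from PAX6 or any database record.

```python
# Toy illustration: mRNA matches the sense strand (U for T)
# and is complementary to the template strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA strand written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe_from_template(template_5to3: str) -> str:
    """Return the mRNA (5'->3') copied from a template strand written 5'->3'."""
    sense = reverse_complement(template_5to3)  # sense strand, 5'->3'
    return sense.replace("T", "U")             # RNA uses U in place of T


sense_strand = "ATGGCATTC"                     # hypothetical sense (coding) strand
template_strand = reverse_complement(sense_strand)
mrna = transcribe_from_template(template_strand)

print("sense   :", sense_strand)
print("template:", template_strand)
print("mRNA    :", mrna)
```

Replacing every U in the mRNA with T gives back the sense strand, which is the form in which the sequence appears in a database.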

18
Q

Describe Polyadenylation

A

The pre-mRNA is capped and polyadenylated before export from the nucleus. Polyadenylation confers stability.

(A, B) As RNA polymerase II advances to transcribe a gene, it carries at its rear two multiprotein complexes required for polyadenylation: CPSF (cleavage and polyadenylation specificity factor) and CstF (cleavage stimulation factor). These cooperate to identify a polyadenylation signal downstream of the termination codon in the RNA transcript and to cut the transcript.

The polyadenylation signal comprises an AAUAAA sequence or close variant and some poorly understood downstream signals. (C) Cleavage occurs normally about 15–30 nucleotides downstream of the AAUAAA element, and (D) AMP residues are subsequently added by poly(A) polymerase to form a poly(A) tail.
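
As a rough sketch of how the signal could be located in a transcript, the Python below scans for AAUAAA and reports the window 15–30 nucleotides downstream where cleavage would be expected. It assumes the distance is measured from the end of the hexamer, ignores close variants and the downstream signals, and uses a made-up test sequence; `find_polya_signals` is a hypothetical helper name.

```python
import re


def find_polya_signals(rna: str, min_offset: int = 15, max_offset: int = 30):
    """Return (signal_start, window_start, window_end) for each AAUAAA hit.

    Positions are 0-based; the window is half-open and clipped to the
    transcript length. Offsets are counted from the end of the hexamer
    (an assumption made for this sketch).
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for match in re.finditer(r"(?=AAUAAA)", rna):  # lookahead finds overlapping hits
        signal_end = match.start() + 6
        window_start = signal_end + min_offset
        if window_start >= len(rna):
            continue  # transcript too short for cleavage in the expected window
        window_end = min(signal_end + max_offset, len(rna))
        hits.append((match.start(), window_start, window_end))
    return hits


toy_transcript = "GGC" + "AAUAAA" + "C" * 40   # hypothetical 3' end of a transcript
print(find_polya_signals(toy_transcript))
```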

19
Q

Describe splicing

A

Splice sites can only be “predicted” from the raw DNA sequence—need to validate with actual mRNAs.

Only the GU at the start of the intron and AG at the end of the intron are absolutely required. Others are preferred.

The three blocks of nucleotide sequences shown are required to remove an intron sequence.

  • Here A, G, U, and C are the standard RNA nucleotides; R stands for purines (A or G); and Y stands for pyrimidines (C or U). The A highlighted in red forms the branch point of the lariat produced by splicing.
  • Only the GU at the start of the intron and the AG at its end are invariant nucleotides in the splicing consensus sequences. Several different nucleotides can occupy the remaining positions, although the indicated nucleotides are preferred.
  • The distances along the RNA between the three splicing consensus sequences are highly variable; however, the distance between the branch point and 3′ splice junction is typically much shorter than that between the 5′ splice junction and the branch point.
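
The invariant GU...AG rule can be illustrated with a short Python sketch that removes annotated introns from a toy pre-mRNA and rejects any intron lacking the two dinucleotides. It checks only those two positions, not the branch point or the preferred flanking nucleotides; the sequences and function names are invented for the example.

```python
def has_invariant_splice_sites(intron: str) -> bool:
    """True if the intron starts with GU and ends with AG."""
    s = intron.upper().replace("T", "U")
    return len(s) >= 4 and s.startswith("GU") and s.endswith("AG")


def splice(pre_mrna: str, introns):
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    pre = pre_mrna.upper().replace("T", "U")
    pieces, position = [], 0
    for start, end in sorted(introns):
        if not has_invariant_splice_sites(pre[start:end]):
            raise ValueError(f"intron {start}-{end} lacks GU...AG boundaries")
        pieces.append(pre[position:start])  # keep the exon before this intron
        position = end                      # skip over the intron
    pieces.append(pre[position:])           # keep the final exon
    return "".join(pieces)


exon1, intron, exon2 = "AUGGCC", "GUAAGUCCUAACUUUCAG", "UUCUAA"  # toy sequences
pre_mrna = exon1 + intron + exon2
mature = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))])
print(mature)
```

A check like this only confirms that an annotated intron is plausible; as the card notes, predicted splice sites still need to be validated against actual mRNAs.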
20
Q

Alternative splicing

A

Most human mRNA transcripts are heavily spliced, often in a tissue-specific manner. Splicing can change both translated and untranslated sequences.

  • α-Tropomyosin is a coiled-coil protein (see Figure 3–9) that carries out several tasks, most notably the regulation of contraction in muscle cells.
  • The primary transcript can be spliced in different ways, as indicated in the figure, to produce distinct mRNAs, which then give rise to variant proteins.
  • Some of the splicing patterns are specific for certain types of cells. For example, the α-tropomyosin made in striated muscle is different from that made from the same gene in smooth muscle.
  • The arrowheads in the top part of the figure mark the sites where cleavage and poly-A addition form the 3′ ends of the mature mRNAs.
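
A toy Python sketch of the idea that one gene can yield several mRNAs: each splicing pattern is written as an ordered list of exon names, and joining those exons gives the mature mRNA. The exon sequences, exon names, and the two patterns are hypothetical and do not reflect the real exon structure of α-tropomyosin.

```python
# Hypothetical exons of a single gene (not real alpha-tropomyosin sequence).
EXONS = {
    "e1": "AUGGCC",
    "e2a": "GAGAAA",   # alternative exon used in one cell type
    "e2b": "CUGCCC",   # alternative exon used in another cell type
    "e3": "UUCUAA",
}

# Hypothetical splicing patterns: the same gene, different exon choices.
SPLICING_PATTERNS = {
    "cell_type_1": ["e1", "e2a", "e3"],
    "cell_type_2": ["e1", "e2b", "e3"],
}


def build_mrna(exon_order, exons):
    """Join the chosen exons, in order, into one mature mRNA."""
    return "".join(exons[name] for name in exon_order)


mrnas = {cell: build_mrna(order, EXONS) for cell, order in SPLICING_PATTERNS.items()}
for cell, sequence in mrnas.items():
    print(cell, sequence)
```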
21
Q

A Subset of the mRNA Transcripts Associated with PAX6

A
22
Q

Untranslated Regions (UTRs)

A

Mature mRNAs include both 5’ and 3’ untranslated regions (UTR) encoded by exons. Most eukaryotic proteins begin with methionine as the first amino acid. Some proteins are subsequently processed.

(A) The human insulin gene comprises three exons and two introns. The coding sequence (the part that will be used to make polypeptide) is shown in deep blue. It is confined to the 3′ sequence of exon 2 and the 5′ sequence of exon 3.

(B) Exon 1 and the 5′ part of exon 2 specify the 5′ untranslated region (5′ UTR), and the 3′ end of exon 3 specifies the 3′ UTR. The UTRs are transcribed and so are present at the ends of the mRNA.

(C) A primary translation product, preproinsulin, has 110 residues and is cleaved to give

(D) a 24-residue N-terminal leader sequence (that is required for the protein to cross the cell membrane but is thereafter discarded) plus an 86-residue proinsulin precursor.

(E) Proinsulin is cleaved to give a central segment (the connecting peptide) that may maintain the conformation of the A and B chains of insulin before the formation of their interconnecting covalent disulfide bridges.
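
The layout of a mature mRNA (5′ UTR, coding sequence, 3′ UTR) can be sketched in Python. This simplified example assumes translation starts at the first AUG and ends at the first in-frame stop codon, which is not true of every transcript; the sequence is made up and `split_utrs` is a hypothetical helper name. The last line checks the residue arithmetic for preproinsulin given above.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_utrs(mrna: str):
    """Split an mRNA into (5' UTR, coding sequence, 3' UTR).

    Simplifying assumption: the first AUG is the start codon.
    """
    s = mrna.upper().replace("T", "U")
    start = s.find("AUG")
    if start == -1:
        raise ValueError("no AUG start codon found")
    for i in range(start, len(s) - 2, 3):      # walk codon by codon, in frame
        if s[i:i + 3] in STOP_CODONS:
            return s[:start], s[start:i + 3], s[i + 3:]
    raise ValueError("no in-frame stop codon found")


toy_mrna = "GGCACC" + "AUGGCCUUCUAA" + "GCAUUAAA"   # hypothetical transcript
five_utr, cds, three_utr = split_utrs(toy_mrna)
print(five_utr, cds, three_utr)

# Residue count from the insulin example: leader + proinsulin = preproinsulin.
assert 24 + 86 == 110
```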

23
Q

3’UTR Sequences Sometimes Regulate Stability

A

AU-rich sequences in 3′ UTRs bind proteins that enhance the rate of poly-A shortening.

  • A critical threshold of poly-A tail length induces rapid 3′-to-5′ degradation, which may be triggered by the loss of the poly-A-binding proteins. As shown in Figure 7–70, a deadenylase associates with both the 3′ poly-A tail and the 5′ cap, and this connection may be involved in signaling decapping after poly-A shortening.
  • Although 5′-to-3′ and 3′-to-5′ degradation are shown here on separate RNA molecules, these two processes can occur together on the same molecule.
24
Q

Pax6 mRNAs

A
  • Translated exons are in dark green
  • Untranslated exons are in light green
  • Introns are thin and horizontal
  • Purple is derived from Genomic Sequence (as opposed to cDNA)
25
Q

mRNA Transcripts in the Nucleotide Databases

A
  • To some extent, transcripts can be predicted from genomic sequences, but it is essential to validate their existence by actually finding them.
  • mRNAs are generally not sequenced directly; they must first be converted to cDNAs.
26
Q

Why “T” instead of “U”?

A

The sequence of a cDNA in the database matches that of the mature mRNA (except that you will see “T” instead of “U”).

If a gene is alternatively spliced in different tissues, the cDNAs for the same gene will differ between tissues. It is probably best to think of a family of cDNAs coming from a single gene.

27
Q

Reference Sequences (Ref Seq)

A
  • The NCBI examines all of the cDNA and EST sequences for a given gene, and manually curates the best sequence, called the reference sequence.
  • There are 51 Ref Seq for PAX6, since there are 51 well-documented splice variants.
  • Ref Seq are continuously updated, so you will see both a unique accession number and a version number.
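
Ref Seq identifiers are written as an accession followed by a dot and a version number, so the two parts can be separated with a small Python sketch. The identifier below is a placeholder in the mRNA (NM_) style, not a real PAX6 record, and `parse_refseq_id` is a hypothetical helper name.

```python
def parse_refseq_id(identifier: str):
    """Split 'ACCESSION.VERSION' into (accession, version as an int)."""
    accession, _, version = identifier.strip().partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in ACCESSION.VERSION form: {identifier!r}")
    return accession, int(version)


placeholder_id = "NM_000000.3"   # placeholder, not a real record
accession, version = parse_refseq_id(placeholder_id)
print(accession, version)
```

The accession stays the same when a record is revised; only the version number goes up.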
28
Q

Ref Seq as the Gold Standard

A

A Ref Seq (or multiple Ref Seq for each gene) lets us define what is a variant.

The variant could be

  • a disease-associated mutation
  • a desirable mutation we introduced
  • a mutation that occurred during cloning
  • an SNP in a population

In addition to the sequence, other information is provided—coding region, UTRs, splice variants. . .

29
Q

Ref Seq Are Especially Useful in Identifying SNPs

A

When you align small fragments of sequence data to a Ref Seq, you can easily spot small variations such as SNPs (single nucleotide polymorphisms).
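
A minimal Python sketch of the idea: a short read is placed at a known position on a reference and compared base by base, and each mismatch is reported as a candidate single-nucleotide difference. It assumes the alignment position is already known and that there are no gaps; real aligners handle both. The sequences and the function name are made up for illustration.

```python
def find_mismatches(reference: str, read: str, offset: int):
    """Compare an ungapped read placed at `offset` (0-based) on the reference.

    Returns a list of (reference_position, reference_base, read_base).
    """
    ref, read = reference.upper(), read.upper()
    if offset < 0 or offset + len(read) > len(ref):
        raise ValueError("read does not fit on the reference at this offset")
    return [
        (offset + i, ref[offset + i], base)
        for i, base in enumerate(read)
        if base != ref[offset + i]
    ]


toy_reference = "ATGGCATTCGGA"   # hypothetical reference sequence
toy_read = "GCGTTC"              # hypothetical read with one difference
variants = find_mismatches(toy_reference, toy_read, offset=3)
print(variants)
```

A mismatch found this way is only a candidate: it could be a SNP, a disease-associated mutation, or a sequencing or cloning error.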

30
Q

Aligning to a Ref Seq Is Less Useful for Identifying Large Genomic Changes

A