Bioinformatics Flashcards

1
Q

Flat file

A

Term for data stored in a plain, ordinary file on the hard disk. Example: RefSeq.

2
Q

Bioinformatics

A

Application of information technology to the storage, management and analysis of biological information (Facilitated by the use of computers)

3
Q

Nanopore seq

A

A molecule is measured as it passes through the pore. Proteins in the pore pull it through at 800 nucleotides per minute. Read lengths reach up to 300 000 bases, which makes phasing/haplotyping possible, e.g. when you have heterozygous sites at two positions in the genome.

4
Q

Examples of location descriptors

A

Location        Description

476             Points to a single base in the presented sequence

340..565        Points to a continuous range of bases bounded by and
                including the starting and ending bases

<345..500       The exact lower boundary point of a feature is unknown

(102.110)       Indicates that the exact location is unknown but that it
                is one of the bases between bases 102 and 110

(23.45)..600    Specifies that the starting point is one of the bases
                between bases 23 and 45, inclusive, and that the end
                point is base 600

123^124         Points to a site between bases 123 and 124

145^177         Points to a site anywhere between bases 145 and 177

J00193:hladr    Points to a feature whose location is described in
                another entry: the feature labeled ‘hladr’ in the
                entry (in this database) with primary accession ‘J00193’

5
Q

Sequencing file format tips

a) When saving a sequence for use in an email message or pasting into a web page…

b) When retrieving from a database or exchanging between programs…

c) When using a sequence again with the same program…

A

a) …use an unannotated text format such as FASTA

b) …use an annotated text format such as GenBank

c) …use that program’s annotated binary format (or annotated text if binary is not available), for example:
ASN.1 (NCBI)
GBFF (Sanger)
XML

6
Q

Phred

A

*base calling
*vector trimming
*end of sequence read trimming
*assigns quality values (qv) of bases in the sequence
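
Phred quality values are conventionally defined as Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. A minimal sketch of the conversion in both directions (the function names are illustrative, not part of Phred itself):

```python
import math

def phred_quality(error_probability):
    """Convert a base-call error probability into a Phred quality value."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Convert a Phred quality value back into an error probability."""
    return 10 ** (-quality / 10)

print(phred_quality(0.001))   # quality for 1 expected error in 1000 calls
print(error_probability(20))  # error probability behind a quality of 20
```

Each step of 10 in quality corresponds to a tenfold drop in the error probability.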

7
Q

Phrap

A

*Phrap uses Phred’s base calling scores to determine the consensus sequences.
*Examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence

8
Q

Consed

A

graphical interface extension that controls both Phred and Phrap

9
Q

Poor data at seq end

A

This is due to the difficulty of resolving larger fragments of around 1 kb (it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp)

10
Q

Cis- and trans-splicing for ORF

A

Cis-splicing - removes an intron and joins exons from the same site (the same transcript).
Trans-splicing - splices and joins exons from different sites (different transcripts); can even join the sense and antisense strands.

11
Q

Swissprot

A

SWISS-PROT is an annotated protein sequence database. Continuously updated (daily).

Format follows as closely as possible that of EMBL’s
Curated protein sequence database

Three differences from other protein sequence databases:
- Strives to provide a high level of annotations
- Minimal level of redundancy
- High level of integration with other databases

Behind a paywall.

12
Q

TREMBL

A

Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT.

TREMBL - two parts
1. SP-TREMBL
Will eventually be incorporated into Swissprot
Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.

2. REM-TREMBL (remaining)
Will NOT be incorporated into Swissprot
Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
13
Q

Protein searching
3 levels

A

1. Swissprot - Little noise, annotated entries
2. Swissprot + TREMBL - More noise, all probable entries
3. Translated EMBL - blast or tfasta - Most noisy, all possible entries

14
Q

PDB

A

3D structures of proteins. AI is able to use the amino acid (AA) sequence to predict the structural model.
>10 000 structures of proteins

Also contains structures of DNA, carbohydrates and protein-DNA complexes

Structures determined principally by X-ray crystallography but other methods are electron microscopy and NMR.

Each entry identified by unique 4-letter code

15
Q

4 most used databanks in bioinformatics

A

Gene Ontology - defines the terms (a shared vocabulary for describing genes and gene products)
Pfam - protein families, identifies functional parts in proteins
SMART - visual presentation of protein families
KEGG - pathway database, shows which enzymes work together in a biosynthesis pathway

16
Q

Problem with flat files:

A

Wasted storage space
Wasted processing time
Data control problems
Problems caused by changes to data structures
Access to data difficult
Data out of date
Constraints are system based
Limited querying, e.g. finding all single-exon GPCRs (<1000 bp)

17
Q

Relational databases

A

A set of tables and links. A language to query the database. A program to manage the data.

Have existed for about 50 years. Mainstream in bioinformatics.

The underlying mathematical theory is simple, very well known and proven. The relational model is very mature: there is strong knowledge of how to make a relational back-end fast and reliable, and of how to exploit different technologies.
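
As a rough illustration of "tables, links and a query language", the sketch below uses Python's built-in sqlite3 module to answer a question like "all single-exon GPCRs shorter than 1000 bp". The table layout, column names and gene records are invented for the example and do not come from any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two tables linked through gene_id (hypothetical schema).
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, "
             "family TEXT, length_bp INTEGER)")
conn.execute("CREATE TABLE exon (exon_id INTEGER PRIMARY KEY, "
             "gene_id INTEGER REFERENCES gene(gene_id), "
             "exon_start INTEGER, exon_end INTEGER)")

# Made-up example records.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (1, "geneA", "GPCR", 950),
    (2, "geneB", "GPCR", 2400),
    (3, "geneC", "kinase", 800),
])
conn.executemany("INSERT INTO exon VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 950),
    (2, 2, 1, 1200),
    (3, 2, 1500, 2400),
    (4, 3, 1, 800),
])

# The query language does the work: join the tables, filter, count exons.
query = """
    SELECT g.name
    FROM gene AS g JOIN exon AS e ON e.gene_id = g.gene_id
    WHERE g.family = 'GPCR' AND g.length_bp < 1000
    GROUP BY g.gene_id
    HAVING COUNT(e.exon_id) = 1
"""
single_exon_gpcrs = [row[0] for row in conn.execute(query)]
print(single_exon_gpcrs)
```

The same question asked of a flat file would need a custom parsing script for each new query.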

18
Q

Pros with databases

A

+Redundancy can be reduced
+Inconsistency can be avoided
+Conflicting requirements can be balanced
+Standards can be enforced
+Data can be shared
+Data independence
+Integrity can be maintained
+Security restrictions can be applied

19
Q

Cons with databases

A

-Size
-Complexity
-Cost
-Additional hardware costs
-Higher impact of failure
-Recovery more difficult

20
Q

Identity

A

Extent to which two (nucleotide or amino acid) sequences are invariant

21
Q

Homology

A

Similarity attributed to descent from common ancestor

22
Q

Orthologous

A

Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function

23
Q

Paralogous

A

Homologous sequences within a single species that arose by gene duplication.

24
Q

Empirical finding

A

If two biological sequences are sufficiently similar, almost invariably they have similar biological functions and will be descended from a common ancestor.

25
Q

Scoring matrix

A

A tool to quantify how well a certain model is represented in the alignment of two sequences, and any result obtained by its application is meaningful exclusively in the context of that model. All subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.

26
Q

Nucleic acid scoring matrices (examples)

A

Are not used that much
Identity matrix
BLAST matrix
Transition/Transversion matrix

27
Q

Transition

A

Mutation that conserves the ring number of the nucleotide (purine <-> purine or pyrimidine <-> pyrimidine)

28
Q

Transversion

A

Mutation that does not conserve the ring number of the nucleotide (purine <-> pyrimidine)
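
A small sketch of how the two mutation classes can be told apart in code, using the fact that A and G are two-ring purines while C and T are one-ring pyrimidines. The function name is illustrative.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases

def classify_substitution(ref, alt):
    """Classify a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "identical"
    pair = {ref, alt}
    # Ring number is conserved when both bases belong to the same class.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```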

29
Q

Genetic Code matrix

A

Used to define the evolutionary distance between two aa by the minimal number of nucleotide changes required.

The probability that an observed aa pair is related by chance rather than inheritance should depend on the number of point mutations needed to transform one codon into the other.

From the matrix it has been seen that the genetic code appears to have evolved to minimize the effects of point mutations. Mutations often give aa with similar properties.

30
Q

Hydrophobic aliphatic amino acids

A

Side chains consist of nonpolar methyl or methylene groups.

Are usually located in the interior of the protein because of their hydrophobicity.

All except alanine are bifurcated.

For Val and Ile the bifurcation is close to main chain and can therefore restrict the conformation of the polypeptide by steric hindrance.

31
Q

Hydrophobic-aromatic aa side chains

A

Only phenylalanine is totally non-polar.

Tyrosine’s phenolic side chain has a hydroxyl substituent and tryptophan has a nitrogen atom in its indole ring system. These residues are almost always found largely buried in the hydrophobic interior of proteins, as they are predominantly non-polar in nature.

But, polar atoms of tyrosine and tryptophan allow hydrogen bonding interaction with other residues or even solvent molecules.

32
Q

Neutral-polar side chains

A

Small aliphatic side chains with polar groups that cannot ionize readily.

Serine and threonine possess hydroxyl groups in their side chains and as these polar groups are close to the main chain they can form hydrogen bonds with it. This can influence the local conformation of the polypeptide.

Residues such as serine and asparagine are known to adopt conformations which most other amino acids cannot.

The amino acids asparagine and glutamine possess amide groups in their side chains which are usually hydrogen-bonded whenever they occur in the interior of a protein.

The Ser <-> Thr substitution is the most common in nature.

33
Q

Acidic amino acids

A

Aspartate and glutamate have carboxyl side chains and are therefore negatively charged at physiological pH.

Strong polar nature of the residues means they are often found on the surface of globular proteins - able to interact with solvent molecules.

Residues can also partake in electrostatic interactions with positively charged basic aa.

Aspartate and glutamate can also take on catalytic roles in the active site of enzymes and are well known for their metal ion binding abilities.

34
Q

Basic amino acids

A

Histidine has the lowest pKa (around 6) - neutral at around physiological pH.

Occurs often in enzyme active sites as it can function as a very efficient general acid-base catalyst.

Also acts as a metal ion ligand in many cases. Lysine and arginine are more strongly basic and positively charged at physiological pH.

Generally solvated, but occasionally occur inside proteins, where they are involved in electrostatic interactions with negatively charged groups.

Lys and Arg are important for anion-binding proteins because they are able to interact electrostatically with the ligand.

35
Q

Conformationally important aa residues

A

Glycine and proline - unique, appear to influence conformation of the polypeptide.

Gly lacks a side chain and is very flexible in conformation. Occurs abundantly in certain fibrous proteins because of its flexibility and since small size allows adjacent polypeptide chains to pack together closely.

Proline on the other hand is the most rigid aa because the side chain is covalently linked with main chain nitrogen.

36
Q

Hydrophobicity matrix

A

Used when you want to predict which part of a protein passes through a membrane.

An attempt to quantify some physical or chemical attribute of the residues and assign weights based on similarities of the residues in this chosen property

37
Q

Dayhoff PAM

A

A family of matrices that scores aa pairs on the basis of the expected frequency of substitutions of one aa for the other during protein evolution.

38
Q

PAM - stands for…

A

Percent accepted mutation (also expanded as point accepted mutation): one accepted point mutation on the path between two sequences per 100 residues

39
Q

7 steps of constructing a scoring matrix

A
  1. Find accepted mutations
  2. Frequencies of occurrence
  3. Relative mutabilities
  4. Mutation probability matrix
  5. The evolutionary distance
  6. Relatedness odds
  7. Log-odds matrix
40
Q

Properties of aa going into the makeup of PAM matrices

A

Size
Shape
Local concentrations of electric charge
van der Waals surface
Ability to form salt bridges
Hydrophobic interactions
Hydrogen bonds

41
Q

What two aspects can cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences?

A

*Chance that a certain residue may have mutated, then reverted, hiding the effect of the mutation

*Specific residues may have mutated more than once → number of mutations likely to be larger than the number of differences between the two sequences.

42
Q

PAM matrix; twilight zone

A

When the PAM distance value between two distantly related proteins nears the value 250 it becomes difficult to tell whether the two proteins are homologous, or if they are two randomly taken proteins that can be aligned by chance.

43
Q

Low PAM

A

Closely related sequences. High scores for identity and low scores for substitutions, closer to the identity matrix.

44
Q

High PAM

A

Distant sequences. At PAM200 all information is degenerate except for cysteines.

45
Q

PAM error sources

A

*Many sequences depart from average composition.
*Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements observed!)
*Errors in 1PAM are magnified in the extrapolation to 250PAM.
*Distantly related sequences usually have islands (blocks) of conserved residues → Replacement is not equally probable over entire sequence.

46
Q

BLOSUM

A

Blocks substitution matrix. Scores aa pairs based on frequency of aa substitutions in aligned sequence motifs called blocks that are found in protein families. Comes to the same conclusion as PAM.

47
Q

BLOSUM method

A

A. Observed pairs
B. Expected pairs
C. Summary (A/B)

High BLOSUM: Closely related sequences
Low BLOSUM: Distant sequences
BLOSUM45 <-> PAM250
BLOSUM62 <-> PAM160. BLOSUM62 is the most popular matrix.
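
The observed/expected summary step can be sketched as a log-odds score. BLOSUM entries are commonly reported in half-bit units, i.e. 2 * log2(observed / expected) rounded to the nearest integer. The frequencies below are invented to show the arithmetic and are not taken from any published matrix.

```python
import math

def log_odds_score(observed_freq, expected_freq):
    """Half-bit log-odds score for one amino acid pair."""
    return round(2 * math.log2(observed_freq / expected_freq))

# Pair seen four times more often than chance predicts: positive score.
print(log_odds_score(0.02, 0.005))
# Pair seen four times less often than chance predicts: negative score.
print(log_odds_score(0.001, 0.004))
```

A score of zero means the pair occurs in blocks exactly as often as expected by chance.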

48
Q

High BLOSUM

A

High BLOSUM: Closely related sequences

49
Q

Low BLOSUM

A

Distant sequences

50
Q

Which is the best matrix to use?

A

No single matrix is the complete answer for all sequence comparisons. It is probably best to complement the BLOSUM62 matrix with comparisons using 250PAMs and Overington structurally derived matrices.

51
Q

Dotplot

A

Graphical representation using two orthogonal axes and “dots” for regions of similarity. In a bioinformatics context two sequences are used on the axes and dots are plotted when a given threshold is met in a given window.

Dot plotting is the best way to see all of the structures in common between two sequences or to visualize all of the repeated or inverted structures in one sequence.
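
A minimal sketch of the window/threshold idea: a dot is placed at (i, j) when the windows starting at position i of one sequence and position j of the other share at least `threshold` identical characters. The function names and the text rendering are illustrative choices, not a standard tool.

```python
def dotplot(seq_x, seq_y, window=3, threshold=3):
    """Return the set of (i, j) window start positions that pass the threshold."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            window_x = seq_x[i:i + window]
            window_y = seq_y[j:j + window]
            matches = sum(a == b for a, b in zip(window_x, window_y))
            if matches >= threshold:
                dots.add((i, j))
    return dots

def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x along the columns, seq_y along the rows."""
    rows = []
    for j in range(len(seq_y)):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(len(seq_x))))
    return "\n".join(rows)

dots = dotplot("ACGTACGT", "ACGTACGT", window=4, threshold=4)
print(render("ACGTACGT", "ACGTACGT", dots))
```

Comparing a sequence with itself, as here, shows the main diagonal plus off-diagonal lines for the repeated ACGT unit.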

52
Q

Causes of noise in dotplots

A

Nucleic acids: 1 of 4 bases will match at random. Removing self alignments will reduce noise.

Stringency: Window size is considered, percentage of bases matching in the window is set as threshold.

53
Q

Pairwise sequence alignment

A

Can be global or local. Local alignment looks at the portions that align optimally, while global alignment looks at everything (and we are allowed to insert gaps to make it fit).

Works for basically every pair of sequences. However, it cannot align more than two sequences at once, and it does not scale with the size and number of sequences.

Global: Sequences are completely aligned
Local: Only the best sub-regions are aligned. BLAST uses this

54
Q

Algorithm

A

Method or a process followed to solve a problem. A recipe. An algorithm takes the input to a problem (function) and transforms it to the output. A mapping of input to output. A problem can have many algorithms.

55
Q

Multiple sequence alignment

A

A process of aligning multiple sequences of nucleic acids or proteins to identify similarities and differences among them.

The sequences being aligned can be DNA, RNA, or proteins, and they may come from different organisms.

The goal of multiple sequence alignment is to identify conserved regions among the sequences, which can provide insight into their evolutionary relationships and functional significance.

With more than 2 sequences the score matrix gains a dimension per sequence (a 3D matrix for three sequences), so the alignment uses much more computational power.

56
Q

Aryabhata-Euclid’s algorithm

A

How to find gcd(a,b), the greatest common divisor of a and b. Based on a single observation: if a = b*q + r, then any divisor of a and b is also a divisor of r, and any divisor of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r).

Use the division algorithm repeatedly to reduce the problem to one you can solve.
Example: gcd(55,35)
55 = 35*1 + 20, so gcd(55,35) = gcd(35,20)
35 = 20*1 + 15, so gcd(35,20) = gcd(20,15)
20 = 15*1 + 5, so gcd(20,15) = gcd(15,5)
15 = 5*3 + 0, done: gcd(55,35) = 5
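
The repeated division step translates almost directly into code. A minimal sketch (Python's standard library already offers math.gcd, so this is only for illustration):

```python
def gcd(a, b):
    """Greatest common divisor by repeated division with remainder."""
    while b != 0:
        # gcd(a, b) = gcd(b, r) where r is the remainder of a divided by b
        a, b = b, a % b
    return a

print(gcd(55, 35))
```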

57
Q

Bubble sort algorithm

A

One of the simplest sorting algorithms. It proceeds by walking down the list, comparing adjacent elements and swapping them if they are in the wrong order. The process is repeated until the list is sorted.
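
A short sketch of the procedure. It works on a copy of the input and stops as soon as a full pass makes no swaps.

```python
def bubble_sort(items):
    """Return a sorted copy of items using bubble sort."""
    items = list(items)
    n = len(items)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if items[i] > items[i + 1]:
                # Adjacent elements in the wrong order: swap them.
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        n -= 1  # the largest remaining element has reached its final place
    return items

print(bubble_sort([5, 1, 4, 2, 8]))
```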

58
Q

Properties of an algorithm (5)

A

1. It must be correct: Compute the correct function
2. It must be composed of a series of concrete steps: Steps executable by the machine in question
3. There can be no ambiguity as to which step will be performed next
4. It must be composed of a finite number of steps
5. It must terminate

59
Q

The best alignment

A

The best alignment is the one with the maximum total score

60
Q

Point of Dynamic programming

A

Reduce the problem: The solution to a large problem is to simplify… if we first know the solution to a smaller problem that is a subset of the larger problem.

Turn a big problem into small ones: ask what the optimal next character is instead of what the optimal whole sequence is, then combine the partial solutions at the end.

61
Q

Needleman-Wunsch

A

Global alignment of two sequences: fill the score matrix from top to bottom and left to right, one row at a time.
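
A compact sketch of the matrix fill and traceback. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example; real searches would normally use a substitution matrix and tuned gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: a aligned against gaps
        score[i][0] = i * gap
    for j in range(1, cols):          # first row: b aligned against gaps
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill row by row, left to right.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))
```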

62
Q

Sensitivity vs specificity

A

Sensitivity: ability to find true positives

Specificity: ability to minimize false positives

There is always a trade-off: you cannot have both 100% sensitivity and 100% specificity
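
In terms of counts, sensitivity is usually computed as TP / (TP + FN) and specificity as TN / (TN + FP). A minimal sketch with made-up numbers:

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real hits that were found."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of non-hits that were correctly rejected."""
    return true_negatives / (true_negatives + false_positives)

# Invented counts from a hypothetical database search.
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=80, false_positives=20))
```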

63
Q

Local alignment - Smith-Waterman

A

Alignment between parts of the two sequences.

With a global alignment we will have many matches in the high similarity section and a lot of mismatches and gaps outside this region. Therefore it makes sense to find the best local alignment instead.
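
A sketch of the local variant: the fill is the same as for a global alignment, except that cell scores are never allowed to drop below zero and the traceback starts at the best-scoring cell and stops at the first zero. The scoring values (match +2, mismatch -1, gap -1) are assumptions for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                      # restart the alignment here
                              score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Only the shared ACG core is reported; the dissimilar flanks are ignored instead of being forced into mismatches and gaps.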

64
Q

Multiple seq alignment method

A

Most practical and widely used: Hierarchical extensions of pairwise alignment methods. Works by principle that multiple alignments are achieved by successive application of pairwise methods.

65
Q

ClustalW

A

General purpose multiple alignment program for DNA or proteins. Improves the sensitivity of progressive sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

66
Q

BWT

A

BWT is very compact, about ½ byte/base. Can fit onto a standard computer with 2 GB of memory. Linear-time search algorithm. Indexing of the database is the basis of the technique. The algorithm then searches the index, which is much faster than doing millions of pairwise alignments.

67
Q

BWT method

A

First, to the left, the input sequence with a dummy character (dollar sign) appended. The dummy character does not occur in the sequence and is used to keep track of the rotation:
acaacg$

Secondly, alphabetical sorting; the dummy character will always be first when doing this. Two operations: rotation and sorting. Make all possible rotations: for “acaacg$”, if “ac” is moved to the end we get “aacg$ac”; moving 1, 2, 3, ... characters to the end gives all possible rotations. These rotations are then sorted. After sorting, the $ ends up at different positions in the rows. Sorting gives the outcome interesting properties: we are sorting characters depending on their context.
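
The rotate-then-sort procedure can be sketched naively as below. This builds every rotation in memory, which is fine for a flashcard-sized string but not how genome-scale indexes are built.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    text = text + terminator          # dummy character marks the end
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()                  # "$" sorts before the letters
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("acaacg"))
```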

https://www.youtube.com/watch?v=gqM3j2IRQH4

68
Q

T-ranking

A

T-ranking is a method of ranking the occurrences of a character within a string T. Each character is given a rank equal to the number of times that same character has already occurred earlier in T, so the first a is a0, the second a is a1, and so on.

The T-ranking of a character in the BWT can be used to efficiently locate the character in the original string, which can be useful in various string search and pattern matching tasks.

69
Q

LF-mapping

A

The i-th occurrence of character c in L (last column) and the i-th occurrence of character c in F (first column) correspond to the same occurrence in T.
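
One use of this property is reversing the transform: starting in the row that begins with the dummy character, each step reads a character from L and jumps to the row where that same occurrence sits in F. A minimal sketch, assuming "$" is the dummy character and sorts before all other characters:

```python
def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text (with terminator) from its BWT."""
    # Rank of each character in L: how many times it occurred earlier in L.
    counts = {}
    ranks = []
    for ch in last_column:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    # Row in F where the block of each character starts (F is L sorted).
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # Walk backwards through the text using the LF-mapping.
    row = 0                      # row 0 starts with the terminator
    text = terminator
    while last_column[row] != terminator:
        ch = last_column[row]
        text = ch + text
        row = first_row[ch] + ranks[row]
    return text

print(inverse_bwt("gc$aaac"))
```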

70
Q

SAM file

A

SAM files are a text file format that contains the alignment information of sequences mapped against reference sequences. These files can also contain unmapped sequences. Since SAM is a text format, the files are more readable by humans.

71
Q

BAM file

A

BAM files contain the same information as SAM files, except they are in binary file format which is not readable by humans.

On the other hand, BAM files are smaller and more efficient for software to work with than SAM files, saving time and reducing costs of computation and storage.

Alignment data is almost always stored in BAM files and most software that analyzes aligned reads expects to ingest data in BAM format.

72
Q

Info in BAM and SAM file

A

The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information.

The alignment section contains the information for each sequence about where/how it aligns to the reference genome.

Each alignment has:
*query name, QNAME (SAM)/read_name (BAM). It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.

*bitwise set of information describing the alignment, FLAG. Provides the following information:
-are there multiple fragments?
-are all fragments properly aligned?
-is this fragment unmapped?
-is the next fragment unmapped?
-is this query the reverse strand?
-is the next fragment the reverse strand?
-is this the 1st fragment?
-is this the last fragment?
-is this a secondary alignment?
-did this read fail quality controls?
-is this read a PCR or optical duplicate?
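
Because FLAG is a bitwise set, each question above corresponds to one bit of the integer. The sketch below decodes a FLAG value; the bit values follow the SAM format specification, while the labels simply paraphrase the list above.

```python
# (bit, label) pairs; bit values as defined in the SAM format specification.
FLAG_BITS = [
    (0x1, "multiple fragments"),
    (0x2, "all fragments properly aligned"),
    (0x4, "this fragment unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "this query on reverse strand"),
    (0x20, "next fragment on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]

def decode_flag(flag):
    """List the properties whose bits are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]

print(decode_flag(99))
```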

73
Q

CIGAR

A

The sequence being aligned to a reference may have additional bases that are not in the reference or may be missing bases that are in the reference.

The CIGAR string is a sequence of base lengths and the associated operation. They are used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
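
A small sketch of reading a CIGAR string as (length, operation) pairs and deriving how many reference bases and how many read bases it covers. The operation letters are those of the SAM specification; the helper names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

def reference_span(cigar):
    """Reference bases covered: match/mismatch, deletion and skip operations."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")

def query_length(cigar):
    """Read bases consumed: match/mismatch, insertion and soft-clip operations."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")

print(parse_cigar("5S10M2D3M"))
print(reference_span("5S10M2D3M"), query_length("5S10M2D3M"))
```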

https://genome.sph.umich.edu/wiki/SAM

74
Q

FastA

A

In FastA, a hit is a sequence in the database that is similar to the query. Similar means that a short stretch of sequence is shared; there are different definitions of the stretch.

75
Q

FastA issues

A

*For proteins, similar sequences do not have to share identical residues.
*For nucleic acids due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not.

76
Q

BLAST

A

BLAST searches a large target set of sequences for hits to a query sequence and returns the alignments and scores from those hits. This process is done fast. BLAST programs are designed for fast database searching with minimal sacrifice of sensitivity to distantly related sequences.

77
Q

BLAST method

A

Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T

Key concept “Neighbourhood”: Seems similar to FASTA, but we are searching for words which score above T rather than words that match exactly

Calculate neighbourhood (T) for substrings of query (size W)

78
Q

Blast parameters

A

Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.
High T = Everything has to be very similar, very specific but not very sensitive.
Low T = more sensitive but less specific. Typically start with high T and lower it as you move forward.

Choosing a value for w
small w: many matches to expand
big w: many words to be generated
w=4 is a good compromise

Lowering the segment extension cutoff (S) returns longer extensions for each hit.

Changing the minimum E-value changes the threshold for reporting a hit.

79
Q

BLAST critical parameters

A

The proper value of T depends on both the values in the scoring matrix and balance between speed and sensitivity

Higher values of T progressively remove more word hits and reduce the search space.

Word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed.

The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST. For proteins, w=3 is the most common.

80
Q

Mathematical basis of BLAST

A

Doing a BLAST search is doing an experiment. A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T. This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability.

The background distribution of scores must be turned into p-values. For example: given the background distribution, what is the chance of seeing a score of 200? The higher the score, the lower the p-value.

81
Q

Erdős-Rényi

A

The Erdős-Rényi model, also known as the random graph model, is a statistical model for generating random graphs with a given number of nodes and edges. It is based on the idea of randomly connecting nodes with a certain probability, resulting in a graph that exhibits certain probabilistic properties.

Example (the Erdős-Rényi result on the longest run in a series of coin tosses):
p is the probability of “head” when tossing a coin, p = 0.5

For n throws, the expected length R of the longest run of heads is:
R = log_(1/p)(n), i.e. the logarithm of n in base 1/p

Want to model aa seq alignment as coin tosses
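
The longest-run formula is easy to evaluate directly. A minimal sketch (the function name is illustrative):

```python
import math

def expected_longest_run(n, p=0.5):
    """Approximate expected longest run of heads in n tosses: log base 1/p of n."""
    return math.log(n, 1 / p)

# For a fair coin the expected longest run grows by one each time n doubles.
for n in (16, 1024, 1_000_000):
    print(n, round(expected_longest_run(n), 2))
```

In the alignment analogy, a "head" is a matching position, so longer sequences are expected to contain longer chance matches.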

82
Q

Karlin-Altschul Statistics

A

A set of mathematical formulas that are used to evaluate the statistical significance of sequence alignments obtained through the use of heuristics.

Are used to calculate the probability that an alignment occurred by chance, allowing researchers to determine the likelihood that the alignment is biologically meaningful.

Widely used in bioinformatics to assess the reliability of sequence alignments and to help identify significant matches in large databases.
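
The central formula gives the expected number of chance alignments with score at least S as E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths; the matching P-value is 1 - exp(-E). K and lambda depend on the scoring system and background composition, so the values used below are placeholders chosen only to show the arithmetic.

```python
import math

def expect_value(score, query_len, db_len, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

# Placeholder parameters (lam, k) and lengths, for illustration only.
e = expect_value(score=60, query_len=250, db_len=1_000_000, lam=0.3, k=0.1)
print(e, p_value(e))
```

For small E the two numbers are almost identical, which is why E and P are often used interchangeably for strong hits.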

83
Q

P-values

A

Probability that alignment is no better than random.
P=100E-100 perfect match
P>10E-1 match probably insignificant

84
Q

E-values

A

Expected number of sequences that give the same Z-value or better if the database is probed with a random sequence.

E = P multiplied by the size of the database probed

85
Q

Z-value

A

a measure of the statistical significance of a particular match between a query sequence and a database of sequences.

The z-score is calculated based on the alignment score and the distribution of scores for a large number of random alignments.

A higher z-score indicates a more statistically significant match, and a z-score threshold can be used to determine which matches are considered significant and should be reported.

Z-scores are commonly used in bioinformatics to evaluate the statistical significance of sequence alignments obtained through database searches.
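
A sketch of the calculation: the observed score is compared with the mean and standard deviation of scores from random alignments (typically obtained by aligning shuffled versions of the sequences). The numbers are invented.

```python
import statistics

def z_score(score, random_scores):
    """Number of standard deviations the score lies above the random mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    return (score - mean) / spread

# Made-up scores from alignments against shuffled sequences.
background = [8, 10, 12]
print(z_score(30, background))
```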

86
Q

Blast pros vs fasta

A

BLAST’s major advantage is its speed. 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank.

When both programs use their default settings, BLAST is usually more sensitive than FastA for detecting protein sequence similarity, since it doesn’t require a perfect sequence match in the first stage of the search.

87
Q

Blast weaknesses vs fasta

A

The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity.
For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value.

FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether.

In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.

88
Q

PSI-blast

A

Position Specific Iterated Blast. The best algorithm to find distantly related sequences.

89
Q

Score assignment PSI-blast

A

For each position in the derived pattern, every amino acid is assigned a score.
(1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive scores near zero.
(3) Position-specific scores can also be assigned to potential insertions and deletions.

90
Q

PSI-BLAST pitfalls

A

Avoid sequences that are too close: overfitting! Aim for a compromise between the PSSM and overfitting.
Do not use a PSSM where you suspect overfitting; use a normal scoring matrix instead, where you do not need to be position specific.

Can include false homologs! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. If you look for a family about which not much is known, there is a risk that you put too much emphasis on database content which you perhaps should not.

The E-value reflects the significance of the match to the previous training set not to the original sequence!

Choose carefully your query sequence.

Try reverse experiment to certify.

91
Q

PHI BLAST

A

Pattern-Hit Initiated Blast.

Searches the database for hits that both contain a given conserved pattern and are homologous to the query. Like doing a FASTA search inside a BLAST search.

92
Q

BLAT

A

BLAST-Like Alignment Tool. Aligns the input sequence to the Human Genome. Connected to several databases.

93
Q

BLAT compared to existing tools

A

-more accurate
-500 times faster in mRNA/DNA alignment
-50 times faster in protein/protein alignment

94
Q

Phylogenetic trees

A

Phylogenetic trees are about visualising evolutionary relationships with the purpose to illustrate how a group of objects are related to one another.

95
Q

Clade

A

Set of species that includes all of the species derived from a single common ancestor

96
Q

Morphological species

A

Smallest group that is consistently and persistently distinct. Species recognized initially on appearance; individuals of one species look different from the individuals from another. For plant species.

97
Q

Biological species

A

A set of interbreeding or potentially interbreeding individuals that are separated from other species by reproductive barriers. Different species are unable to interbreed.

98
Q

Phylogenetic species

A

the boundary between reticulate (among interbreeding individuals) and divergent relationships (between lineages with no gene exchange). If a stable gene pool can be maintained.

99
Q

Phylogenomics species

A

Ability to transmit (and maintain) a (stable) gene pool. Addresses the Anopheles genome topology variations

100
Q

Use of phylogenetic methods

A

-solve crimes

-test product purity

-determine if endangered species have been smuggled or mislabeled

-Epidemiologists use phylogenetic methods to understand the development of pandemics, patterns of disease transmission and development of antimicrobial resistance or pathogenicity.

-Conservation biologists may use the techniques to determine which populations are in greatest need of protection, and other questions of population structure.

-Pharmaceutical researchers may use the methods to determine which species are most closely related to other medicinal species, thus perhaps sharing the medicinal qualities

101
Q

Which seq to use for phylogenetic research?

A

To infer relationships that span the diversity of known life, it is necessary to look at genes conserved through the billions of years of evolutionary divergence.

The gene must display an appropriate level of sequence conservation for the divergences of interest.

If there is too much change, then the sequences become randomized, and there is a limit to the depth of the divergences that can be accurately inferred.

If there is too little change (if the gene is too conserved), then there may be little or no change between the evolutionary branchings of interest, and it will not be possible to infer close (genus or species level) relationships.

Examples of genes in this category are those that encode the ribosomal RNAs (rRNAs). Most prokaryotes have three rRNAs, called the 5S, 16S and 23S rRNA.

102
Q

Molecular clock

A

Rate of evolution = rate of mutation. The rate of evolution for any macromolecule is approximately constant over time (Neutral Theory of evolution).

One amino acid substitution per 14.5 My
1.3 × 10^-9 substitutions/nucleotide site/year

Proteins evolve at highly different rates, depending on the type of gene. The lowest rates are found in genes related to protein turnover (quite conserved), while pseudogenes evolve fastest (a pseudogene typically has a premature stop, so no full protein is translated and there is no selective pressure to keep the sequence).
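
If the clock is assumed to be strictly constant, a divergence time can be estimated from an observed distance. The sketch below uses the nucleotide rate quoted on this card; the function name and the example distance of 0.026 substitutions per site are made up for illustration.

```python
def divergence_time_years(distance_per_site, rate_per_site_per_year):
    """Time since two lineages split, assuming a strict molecular clock.

    The observed distance accumulates along both lineages, hence the factor 2.
    """
    return distance_per_site / (2.0 * rate_per_site_per_year)

RATE = 1.3e-9  # substitutions per nucleotide site per year (value from the card)
print(divergence_time_years(0.026, RATE))
```

For this toy distance the estimate comes out at roughly 10 million years. The result is only as good as the constant-rate assumption.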

103
Q

Distance matrix methods to determine phylogeny pros

A

-Easy to perform
-Quick calculation
-Fit for sequences having high similarity scores

104
Q

Distance matrix methods to determine phylogeny cons

A

-Sequences not considered as such
-All sites equally treated (do not take differences in substitution rates into account)
-Not applicable to distantly divergent sequences
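
The first two drawbacks are easy to see in a small sketch: each pair of aligned sequences is reduced to a single number, and every site counts the same. This is a plain p-distance (fraction of differing sites) with no substitution model; the function names and toy sequences are assumptions for illustration.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, gap-free sites at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(sequences):
    """Return {(name_x, name_y): p-distance} for every pair of named sequences."""
    names = list(sequences)
    return {(x, y): p_distance(sequences[x], sequences[y])
            for x in names for y in names}

toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACGA"}  # made-up sequences
matrix = distance_matrix(toy)
print(matrix[("A", "B")], matrix[("A", "C")], matrix[("B", "C")])
```

A tree-building method such as neighbour joining would then work on this matrix only, never on the sequences themselves.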

105
Q

Maximum-likelihood (ML) methods

A

Able to keep mutations as status quo.

The bases of all sequences at each site are considered separately, and the log-likelihood of observing these bases is computed for a given topology using a particular probability model.

The log-likelihoods are added over all sites, and the sum is maximized to estimate the branch lengths of the tree.

Procedure repeated for all possible topologies, topology showing highest likelihood is chosen as final tree.
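
A minimal sketch of the per-site log-likelihood sum and its maximisation, reduced to the smallest possible case: two sequences joined by one branch, so there is only one topology. It assumes the Jukes-Cantor model as the probability model and a simple grid search; the function names and sequences are hypothetical.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, t):
    """Summed log-likelihood of two aligned sequences separated by branch
    length t (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l

def ml_branch_length(seq_a, seq_b, step=0.001, t_max=2.0):
    """Grid search for the branch length with the highest log-likelihood."""
    candidates = [step * i for i in range(1, int(t_max / step) + 1)]
    return max(candidates, key=lambda t: jc69_log_likelihood(seq_a, seq_b, t))

seq_a = "ACGTACGTAC"  # toy sequences differing at 1 of 10 sites
seq_b = "ACGTACGTAA"
print(ml_branch_length(seq_a, seq_b))
```

For real data the same sum is computed over a whole tree for each candidate topology, which is where the computational cost comes from.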

106
Q

Drawback ML methods

A

Needs a long computation time to construct a tree.
The number of possible trees becomes enormous, so an exhaustive search does not work for most problems.
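
How quickly the number of candidate trees grows can be shown with the standard count of unrooted, fully bifurcating trees for n taxa, (2n-5)!!. The function name below is made up for this sketch.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating topologies: 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give about two million topologies, which is why practical programs rely on heuristic searches rather than scoring every tree.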

107
Q

Parsimony criterion

A

Consists of determining the minimum amount of changes (substitutions) required to transform a sequence to its nearest neighbour

108
Q

Maximum parsimony

A

Searches for minimum amount of genetic events to infer the most parsimonious tree from a set of sequences.

The best tree is the one that requires the least number of substitutions.
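
Counting the substitutions a given tree requires can be sketched with Fitch's algorithm, a common way to score a fixed topology. The sketch assumes a rooted binary tree written as nested tuples of leaf names and a gap-free alignment; all names and sequences are made up for illustration.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):          # a leaf: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                         # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}  # toy data
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The tree that groups A with B needs fewer changes than the alternative, so it is the more parsimonious of the two.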

109
Q

Drawbacks maximum parsimony

A

-If the evolutionary clock is not constant, the procedure can generate misleading results;
-within practical computational limits, this often leads to the generation of tens or more “equally most parsimonious trees”, which makes it difficult to justify the choice of a particular tree;
-long computation time to construct a tree.

110
Q

Tree rooting

A

In an unrooted tree the direction of evolution is unknown

The root is the hypothesized ancestor of the sequences in the tree

The root can either be placed on a branch or at a node

You should start by viewing an unrooted tree

Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)

Sometimes two trees may look very different but, in fact, differ only in the position of the root

This normally involves assumptions… BEWARE!

111
Q

Bootstrapping - one of the most common exam questions

A

Bootstrapping is a statistical method that is used to assess the reliability of a phylogenetic tree, which is a tree showing the evolutionary relationships among a group of organisms.

The basic idea behind bootstrapping is to create a large number of trees based on different samples of the data used to construct the original tree.

To do this, you take a random block of the alignment (including gaps and such) and copy it a number of times, then add a second block and copy it a number of times as well, and continue until this new “alignment” has the same length as the original alignment.
This process is done N times, and the tree-building method is run on each of these resampled alignments. From the N trees generated this way you make a consensus tree. You should choose N to be at least 10x the length of the alignment.
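
The resampling step can be sketched in its most common formulation: alignment columns are drawn at random with replacement until the pseudo-alignment has the original length. Tree building and the consensus step are left out; the function names and toy alignment are assumptions for illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]

def bootstrap_replicates(alignment, n, seed=0):
    """Return n pseudo-alignments; each would be fed to the tree-building method."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]

alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]  # made-up aligned sequences
replicates = bootstrap_replicates(alignment, 100, seed=1)
print(replicates[0])
```

Because whole columns are sampled, some sites appear several times in a replicate and others not at all, while the rows always stay aligned.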

112
Q

Bootstrap value

A

A bootstrap value is a measure of how often a particular branch appears in the bootstrap sample. For example, if a particular branch appears in 90% of the trees in the bootstrap sample, its bootstrap value would be 90.

There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a ‘good’ bootstrap value (> 70%, > 80%, > 85%?)
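
Computing the value itself is a simple count. The sketch assumes each bootstrap tree has already been reduced to the set of groups (clades) it contains, each written as a frozenset of taxon names; this representation and the function name are hypothetical.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given group of taxa."""
    clade = frozenset(clade)
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)

# Toy data: 9 replicates group A with B, 1 replicate groups A with C.
replicate_trees = [{frozenset("AB"), frozenset("CD")} for _ in range(9)]
replicate_trees.append({frozenset("AC"), frozenset("BD")})
print(bootstrap_support("AB", replicate_trees))
```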

113
Q

Jack-knifing

A

Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy
Jack-knifing is not as widely available or widely used as bootstrapping
Tends to produce broadly similar results

114
Q

Half-jackknife

A

This technique resamples half of the sequence sites considered and eliminates the rest. The final sample has half the initial number of sites, without duplication. Half-jackknife is almost never done. It is horizontal (whereas bootstrapping is vertical), so you take out some of the sequence instead of taking parts of the alignment out.
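
A minimal sketch of the site-resampling part described in the first two sentences: half of the columns are kept, chosen at random without replacement, and the rest are dropped. The function name and toy alignment are made up for illustration.

```python
import random

def half_jackknife(alignment, rng):
    """Keep a random half of the alignment columns, each at most once."""
    length = len(alignment[0])
    kept = sorted(rng.sample(range(length), length // 2))
    return ["".join(seq[c] for c in kept) for seq in alignment]

sample = half_jackknife(["ABCDEFGH", "abcdefgh"], random.Random(0))
print(sample)
```

Unlike a bootstrap replicate, the result is shorter than the original alignment and contains no repeated columns.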

115
Q

Describe the different levels of protein structure

A

0: Zeroth: amino acid composition (proteomics, %cysteine, %glycine). Cysteine: cysteine bridges. Glycine: spacers, to make functional domains in the proteins.

1: Primary: this is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself.

2: Secondary: local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.

3: Tertiary
Packing of secondary structure elements into a compact spatial unit
Fold or domain – this is the level to which structure prediction is currently possible

4: Quaternary structure
Assembly of homo- or heteromeric protein chains
Hard to predict

116
Q

Ramachandran / Phi-Psi plot

A

Shows the phi and psi backbone angles, each ranging from -180 to +180 degrees. Looking at known structures enables us to estimate which angles occur.

Nature has a highly expressive alphabet for primary sequences, but due to the nature of the peptide bond, certain angles are observed preferentially.
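
Phi and psi are dihedral (torsion) angles defined by four consecutive backbone atoms: C(i-1), N(i), CA(i), C(i) for phi and N(i), CA(i), C(i), N(i+1) for psi. A minimal sketch of computing such an angle from four 3D coordinates follows; the helper names and test points are made up, and the sign convention should be checked against a trusted tool before real use.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle around the p1-p2 bond, in the range -180 to +180 degrees."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

cis = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
trans = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
print(cis, trans)
```

Plotting the (phi, psi) pair of every residue of a structure gives the Ramachandran plot.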

117
Q

CHOU-FASMAN

A

Secondary structure prediction.
The method uses a set of empirical rules that consider the amino acid sequence of a protein and the physical and chemical properties of individual amino acids. The rules are used to predict the likelihood that a particular amino acid will be part of an alpha helix, beta sheet or loop region.

A widely used method in protein structure prediction, but not as accurate as some more recent methods. It is still useful for understanding basic principles of protein structure and for identifying potentially important parts of a protein.

The method consists of assigning a set of prediction values to each residue, based on statistical analysis of 15 proteins, and applying a simple algorithm to those numbers.

118
Q

Sander-Schneider - SUPER COMMON ON EXAM

A

A plot, x-axis is length of alignment and y-axis is % identical residues

Naturally occurring sequences with >20% sequence identity over 80 or more residues always adopt the same basic structure

The line of the plot is important because it tells us that if the alignment is sufficiently long and we have 30% identical residues –> the structures are the same.
Remarkably low percentage needed to say that the structure is the same.

119
Q

Domain

A

Compact folding unit of protein structure, usually associated with a function. Is usually a “fold” in the case of monomeric soluble proteins. Comprises normally only one protein chain. Domains can be shared between different proteins.

120
Q

GPCR

A

Membrane bound receptors

They have a very large number of different domains, both to bind their ligands and to activate G proteins.

Pharmaceutically the most important class

121
Q

X-ray crystallography

A

X-ray crystallography is an experimental technique that exploits the fact that X-rays are diffracted by crystals.

X-rays have the proper wavelength (in the Ångström range, ~10^-8 cm) to be scattered by the electron cloud of an atom of comparable size.

uses protein crystals

122
Q

NMR

A

NMR uses protein in solution
– Can look at the dynamic properties of the protein structure
– Can look at the interactions between the protein and ligands,
substrates or other proteins
– Can look at protein folding
– Sample is not damaged in any way
– The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
– High solubility is a requirement

123
Q

Modelling steps

A

a) Finding a structural homologue
b) Extract “template” sequences and align with query
c) Input for model building
d) Methods
e) Model evaluation (How good is the prediction, how much can the algorithm rely/extract on the provided templates)

124
Q

CASP (Critical Assessment of Structure Prediction)

A

CASP is a biennial experiment that aims to evaluate and compare the accuracy of different methods for predicting the 3D structure of proteins from their amino acid sequences.

During the experiment, participating groups submit predictions for a set of proteins whose structures are not yet known (referred to as “targets”). The structures of these proteins are later determined experimentally and the predictions are evaluated for their accuracy. The results of the CASP experiment provide a benchmark for the current state of the art in protein structure prediction and help researchers identify areas for improvement in their methods.

125
Q

Comparative genomics - keywords

A

genome structure
gene-organisation
known promoter regions
known critical amino acid residues.

126
Q

Ehrlich’s chemoreceptor idea

A

All cells have “sidechains” or molecules hanging outside of them that recognize specific extracellular chemicals

127
Q

Langley’s “receptive substance” idea

A

Cells have receptive substances on them that can be affected by agonist molecules or blocked by antagonist molecules

128
Q

Fischer’s lock and key mechanism

A

Enzymes have an active site (LOCK) where the substrate (KEY) binds. The enzyme’s action on the substrate makes the key ill-fitting, and the product leaves the active site

129
Q

screening library

A

a large collection of compounds with different chemical properties or shapes, generated either by combinatorial chemistry or some other process or by collecting samples with interesting biological properties.

130
Q

screening

A

the automated examination and testing of libraries of synthetic and/or organic compounds and extracts to identify potential drug leads, based on the compound’s binding affinity for a target molecule.

131
Q

IC50

A

Concentration at which 50% of the enzyme activity is inhibited. Activity can be saturated. You need to be sure that you have a single compound binding a single target, and not multiple compounds or multiple targets. Used to double check that everything done up to this point is correct.
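
One simple way to read an IC50 off a dose-response series is to interpolate on a log-concentration scale between the two measurements that bracket 50% inhibition. This is only a sketch; curve fitting is the usual approach in practice. The function name and the data points are hypothetical.

```python
import math

def estimate_ic50(concentrations, inhibition_percent):
    """Log-linear interpolation of the concentration giving 50% inhibition."""
    points = sorted(zip(concentrations, inhibition_percent))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi > i_lo:
            fraction = (50.0 - i_lo) / (i_hi - i_lo)
            log_ic50 = math.log10(c_lo) + fraction * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    raise ValueError("50% inhibition is not bracketed by the data")

conc_nm = [1, 10, 100, 1000]   # made-up concentrations in nM
inhibition = [10, 30, 70, 95]  # made-up % inhibition at each concentration
ic50 = estimate_ic50(conc_nm, inhibition)
print(round(ic50, 1))
```

For these toy numbers the estimate falls between 10 and 100 nM, at about 32 nM.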

132
Q

lead compound

A

a potential drug candidate emerging from a screening process of a large library of compounds

133
Q

LEAD - properties

A

-Specifically affects a biological process. Mechanism of activity (reversible/irreversible, kinetics) established
-It is effective at a low concentration: usually nanomolar activity
-It is not toxic to live cells
-It has been shown to have some in vivo activity
-It is chemically feasible. Specificity of key compound(s) from each lead series against a selected number of receptors/enzymes
-Preliminary PK in vivo (rodent) to establish benchmark for in vitro SAR
-In vitro PK data is a good predictor for in vivo activity
-It is of course new and original.

134
Q

Lipinski rule of 5

A

Poor absorption or permeation is more likely when:
1. There are > 5 H-bond donors (expressed as the sum of OHs and NHs);
2. The MWT is > 500;
3. The LogP is > 5 (or MLogP is > 4.15);
4. There are more than 10 H-bond acceptors (expressed as the sum of Ns and Os)
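
The rule is easy to apply as a checklist. The sketch below uses its usual statement (more than 5 donors, molecular weight above 500, LogP above 5, more than 10 acceptors) and assumes the four descriptors have already been calculated elsewhere; the function name and the two example compounds are made up.

```python
def lipinski_violations(h_bond_donors, molecular_weight, log_p, h_bond_acceptors):
    """Return the list of rule-of-5 criteria that a compound violates."""
    checks = [
        ("more than 5 H-bond donors", h_bond_donors > 5),
        ("molecular weight above 500", molecular_weight > 500),
        ("LogP above 5", log_p > 5),
        ("more than 10 H-bond acceptors", h_bond_acceptors > 10),
    ]
    return [label for label, violated in checks if violated]

print(lipinski_violations(2, 350.0, 3.1, 5))    # hypothetical drug-like compound
print(lipinski_violations(6, 650.0, 5.5, 12))   # hypothetical problematic compound
```

An empty list means no criterion flags the compound; the more entries returned, the more likely poor absorption or permeation becomes.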

135
Q

Challenge trials

A

New concept. A trial where the participants are deliberately infected to fast-track data acquisition and get the vaccine faster. Of course not for all diseases. Could work if you are young and healthy. Much smaller trial than normal, since all participants get infected.

With a 50/50 CT of patients with and without receiving the drug. If different companies do the same trial they should have shared control arms; is it necessary to let that many people be without the drug?

136
Q

PGMv2

A

Personal genomics manifesto

137
Q

Motifs

A

Clusters of conserved residues. Carry out particular function/form particular structure important for conserved protein

138
Q

Hydropathy index (Kyte-Doolittle)

A

For amino acids, a number representing the hydrophobic or hydrophilic properties of its side-chain. The larger the number is, the more hydrophobic the amino acid. The most hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most hydrophilic ones are arginine (-4.5) and lysine (-3.9). This is very important in protein structure; hydrophobic amino acids tend to be internal in the protein 3D structure, while hydrophilic amino acids are more commonly found towards the protein surface.

For a Kyte-Doolittle plot, a window size of 19 with peaks >1.8 indicates a possible transmembrane region, whereas a window size of 9 indicates possible surface regions of globular proteins.
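
The plot is a sliding-window average of the per-residue hydropathy values. A minimal sketch using the published Kyte-Doolittle scale follows; the function name and the toy sequence are made up, and the 1.8 cutoff is the one quoted on this card.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

toy_sequence = "L" * 19 + "K" * 19  # made-up hydrophobic stretch, then a charged one
profile = hydropathy_profile(toy_sequence, window=19)
candidate_tm_windows = [i for i, value in enumerate(profile) if value > 1.8]
print(candidate_tm_windows)
```

Window start positions whose average exceeds the cutoff mark candidate transmembrane segments; rerunning with `window=9` gives the finer profile used for surface regions.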