Bioinformatics Flashcards

Question 1

Q

Flat file

Answer

A

A term for data stored in a plain, ordinary file on the hard disk. Example: RefSeq.

Question 2

Q

Bioinformatics

Answer

A

Application of information technology to the storage, management and analysis of biological information (Facilitated by the use of computers)

Question 3

Q

Nanopore seq

Answer

A

A molecule is measured as it passes through the pore. Proteins in the pore pull it through at about 800 nucleotides per minute. Read lengths of up to 300 000 make phasing/haplotyping possible, for example when an individual is heterozygous at two sites in the genome.

Question 4

Q

Examples of location descriptors

Answer

A

Location Description

476 Points to a single base in the presented sequence

340..565 Points to a continuous range of bases bounded by and
including the starting and ending bases

<345..500 The exact lower boundary point of a feature is unknown.

(102.110) Indicates that the exact location is unknown but that it
is one of the bases between bases 102 and 110.

(23.45)..600 Specifies that the starting point is one of the bases
between bases 23 and 45, inclusive, and the end base 600

123^124 Points to a site between bases 123 and 124

145^177 Points to a site anywhere between bases 145 and 177

J00193:hladr Points to a feature whose location is described in
another entry: the feature labeled ‘hladr’ in the
entry (in this database) with primary accession ‘J00193’

Question 5

Q

Sequencing file format tips

a) When saving a sequence for use in an email message or pasting into a web page…

b) When retrieving from a database or exchanging between programs…

c) When using the sequence again with the same program…

Answer

A

a) …use an unannotated text format such as FASTA

b) …use an annotated text format such as Genbank

c) …use that program’s annotated binary format (or annotated text if binary not available)
ASN.1 (NCBI)
Gbff (sanger)
XML

Question 6

Q

Phred

Answer

A

*base calling
*vector trimming
*end of sequence read trimming
*assigns quality values (qv) of bases in the sequence

Question 7

Q

Phrap

Answer

A

*Phrap uses Phred’s base calling scores to determine the consensus sequences.
*Examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence

Question 8

Q

Consed

Answer

A

Graphical interface extension that controls both Phred and Phrap

Question 9

Q

Poor data at seq end

Answer

A

This is due to the difficulty of resolving larger fragments (~1 kb): it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp.

Question 10

Q

Cis- and transsplicing for ORF

Answer

A

Cis-splicing - removes an intron and joins exons from the same transcript.
Trans-splicing - joins exons from different transcripts; this can even happen between the sense and the antisense strand.

Question 11

Q

Swissprot

Answer

A

SWISS-PROT is an annotated protein sequence database. Continuously updated (daily).

Format follows as closely as possible that of EMBL’s
Curated protein sequence database

Three differences from other protein sequence databases:
- Strives to provide a high level of annotations
- Minimal level of redundancy
- High level of integration with other databases

Behind a paywall..

Question 12

Q

TREMBL

Answer

A

Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT.

TREMBL - two parts
1. SP-TREMBL
Will eventually be incorporated into Swissprot
Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.

2. REM-TREMBL (remaining)
Will NOT be incorporated into Swissprot
Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins

Question 13

Q

Protein searching
3 levels

Answer

A

1. Swissprot - Little noise, annotated entries
2. Swissprot + TREMBL - More noise, all probable entries
3. Translated EMBL - blast or tfasta - Most noisy, all possible entries

Question 14

Q

PDB

Answer

A

3D structures of proteins. AI methods can use the amino acid (AA) sequence to predict the structure model.
>10 000 structures of proteins

Also contains structures of DNA, carbohydrates and protein-DNA complexes

Structures determined principally by X-ray crystallography but other methods are electron microscopy and NMR.

Each entry identified by unique 4-letter code

Question 15

Q

4 most used databanks in bioinformatics

Answer

A

Gene Ontology - defines the terms
Pfam - protein families, identifies functional parts in proteins
SMART - visual presentation of protein families
KEGG - pathway database: which enzymes work together in a biosynthesis pathway

Question 16

Q

Problem with flat files:

Answer

A

Wasted storage space
Wasted processing time
Data control problems
Problems caused by changes to data structures
Access to data difficult
Data out of date
Constraints are system based
Limited querying, e.g. all single-exon GPCRs (<1000 bp)

Question 17

Q

Relational databases

Answer

A

A set of tables and links. A language to query the database. A program to manage the data.

Have existed for 50 years. Mainstream in bioinformatics.

Rest on a well-known, proven and simple underlying mathematical theory. The relational model is very mature: there is strong knowledge of how to make a relational back end fast and reliable and how to exploit different technologies.
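
The three ingredients above (tables with links, a query language, a managing program) can be sketched with Python's built-in sqlite3 module. The schema, table names and rows below are hypothetical and chosen only to show the kind of query ("all single-exon GPCRs under 1000 bp") that is hard to run against flat files.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE gene ("
    "id INTEGER PRIMARY KEY, name TEXT, exon_count INTEGER, "
    "length_bp INTEGER, family_id INTEGER REFERENCES family(id))"
)
conn.executemany("INSERT INTO family VALUES (?, ?)", [(1, "GPCR"), (2, "kinase")])
conn.executemany(
    "INSERT INTO gene (name, exon_count, length_bp, family_id) VALUES (?, ?, ?, ?)",
    [
        ("geneA", 1, 950, 1),   # single-exon GPCR, short
        ("geneB", 3, 2100, 1),  # multi-exon GPCR
        ("geneC", 1, 800, 2),   # single exon, but not a GPCR
        ("geneD", 1, 1200, 1),  # single-exon GPCR, too long
    ],
)


def single_exon_gpcrs(connection, max_length_bp=1000):
    """Join the two tables and filter on family, exon count and length."""
    rows = connection.execute(
        "SELECT gene.name FROM gene "
        "JOIN family ON gene.family_id = family.id "
        "WHERE family.name = 'GPCR' AND gene.exon_count = 1 "
        "AND gene.length_bp < ? ORDER BY gene.name",
        (max_length_bp,),
    )
    return [name for (name,) in rows]


print(single_exon_gpcrs(conn))
```

With the toy rows above, only geneA satisfies all three conditions.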

Question 18

Q

Pros with databases

Answer

A

+Redundancy can be reduced
+Inconsistency can be avoided
+Conflicting requirements can be balanced
+Standards can be enforced
+Data can be shared
+Data independence
+Integrity can be maintained
+Security restrictions can be applied

Question 19

Q

Cons with databases

Answer

A

-Size
-Complexity
-Cost
-Additional hardware costs
-Higher impact of failure
-Recovery more difficult

Question 20

Q

Identity

Answer

A

Extent to which two (nucleotide or amino acid) sequences are invariant

Question 21

Q

Homology

Answer

A

Similarity attributed to descent from common ancestor

Question 22

Q

Orthologous

Answer

A

Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function

Question 23

Q

Paralogous

Answer

A

Homologous sequences within a single species that arose by gene duplication.

Question 24

Q

Empirical finding

Answer

A

If two biological sequences are sufficiently similar, almost invariably they have similar biological functions and will be descended from a common ancestor.

Question 25

Q

Scoring matrix

Answer

A

A tool to quantify how well a certain model is represented in the alignment of two sequences, and any result obtained by its application is meaningful exclusively in the context of that model. All subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.

Question 26

Q

Nucleic acid scoring matrices (examples)

Answer

A

Not used that much
Identity matrix
BLAST matrix
Transition/Transversion matrix

Question 27

Q

Transition

Answer

A

Mutation that conserves the ring number of the nucleotide (purine <-> purine or pyrimidine <-> pyrimidine)

Question 28

Q

Transversion

Answer

A

Mutation that does not conserve the ring number of the nucleotide (purine <-> pyrimidine)
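
A minimal sketch of the transition/transversion distinction, assuming DNA bases only. A and G are the two-ring purines, C and T the one-ring pyrimidines; the function name is illustrative.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases


def classify_substitution(ref, alt):
    """Return 'identity', 'transition' or 'transversion' for two DNA bases."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "identity"
    pair = {ref, alt}
    # The ring number is conserved when both bases belong to the same class.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```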

Question 29

Q

Genetic Code matrix

Answer

A

Used to define the evolutionary distance between two aa by the minimal number of nucleotide changes required.

The probability that an observed aa pair is related by chance rather than inheritance should depend on the number of point mutations needed to transform one codon into the other.

From the matrix it has been seen that the genetic code appears to have evolved to minimize the effects of point mutations. Mutations often give aa with similar properties.

Question 30

Q

Hydrophobic aliphatic amino acids

Answer

A

Side chains consist of nonpolar methyl or methylene groups.

Usually located in the interior of the protein because of their hydrophobicity.

All except alanine are bifurcated.

For Val and Ile the bifurcation is close to main chain and can therefore restrict the conformation of the polypeptide by steric hindrance.

Question 31

Q

Hydrophobic-aromatic aa side chains

Answer

A

Only phenylalanine is totally non-polar.

Tyrosine’s phenolic side chain has a hydroxyl substituent and tryptophan has a nitrogen atom in its indole ring system. These residues are almost always found largely buried in the hydrophobic interior of proteins, as they are predominantly non-polar in nature.

But, polar atoms of tyrosine and tryptophan allow hydrogen bonding interaction with other residues or even solvent molecules.

Question 32

Q

Neutral-polar side chains

Answer

A

Small aliphatic side chains with polar groups that cannot ionize readily.

Serine and threonine possess hydroxyl groups in their side chains and as these polar groups are close to the main chain they can form hydrogen bonds with it. This can influence the local conformation of the polypeptide.

Residues such as serine and asparagine are known to adopt conformations which most other amino acids cannot.

The amino acids asparagine and glutamine possess amide groups in their side chains which are usually hydrogen-bonded whenever they occur in the interior of a protein.

The Ser <-> Thr substitution is the most common in nature.

Question 33

Q

Acidic amino acids

Answer

A

Aspartate and glutamate have carboxyl side chains and are therefore negatively charged at physiological pH.

Strong polar nature of the residues means they are often found on the surface of globular proteins - able to interact with solvent molecules.

Residues can also partake in electrostatic interactions with positively charged basic aa.

Aspartate and glutamate can also take on catalytic roles in the active site of enzymes, well known for their metal ion binding abilities.

Question 34

Q

Basic amino acids

Answer

A

Histidine has the lowest pKa (around 6) - neutral at around physiological pH.

Occurs often in enzyme active sites as it can function as a very efficient general acid-base catalyst.

Also acts as a metal ion ligand in many cases. Lysine and arginine are more strongly basic and positively charged at physiological pH.

Generally solvated, but occasionally occur inside proteins, where they are involved in electrostatic interactions with negatively charged groups.

Lys and Arg are important for anion-binding proteins because they can interact electrostatically with the ligand.

Question 35

Q

Conformationally important aa residues

Answer

A

Glycine and proline - unique, appear to influence conformation of the polypeptide.

Gly lacks a side chain and is very flexible in conformation. Occurs abundantly in certain fibrous proteins because of its flexibility and since small size allows adjacent polypeptide chains to pack together closely.

Proline on the other hand is the most rigid aa because the side chain is covalently linked with main chain nitrogen.

Question 36

Q

Hydrophobicity matrix

Answer

A

Used when you want to predict which part of a protein passes through a membrane.

An attempt to quantify some physical or chemical attribute of the residues and assign weights based on similarities of the residues in this chosen property

Question 37

Q

Dayhoff PAM

Answer

A

A family of matrices that scores aa pairs on the basis of the expected frequency of substitutions of one aa for the other during protein evolution.

Question 38

Q

PAM - stands for…

Answer

A

Percent accepted mutation, one accepted point mutation on the path between two sequences per 100 residues

Question 39

Q

7 steps of constructing a scoring matrix

Answer

A

Find accepted mutations
Frequencies of occurrence
Relative mutabilities
Mutation probability matrix
The evolutionary distance
Relatedness odds
Log-odds matrix
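
The last steps of the list (mutation probability matrix, evolutionary distance, relatedness odds, log-odds matrix) can be sketched on a toy two-letter alphabet. The matrix values and frequencies below are invented for illustration and are not Dayhoff's data. The sketch assumes the usual conventions: a PAM-n matrix is the PAM-1 mutation probability matrix multiplied by itself n times, and a score is 10 * log10 of the relatedness odds.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(m1, n):
    """Extrapolate a PAM-1 mutation probability matrix to PAM-n."""
    result = m1
    for _ in range(n - 1):
        result = mat_mul(result, m1)
    return result


def log_odds(m, freqs):
    """Score = 10 * log10(M[a][b] / f[b]), rounded to the nearest integer.

    M[a][b] is the probability that residue a is replaced by residue b;
    f[b] is the background frequency of b (the chance expectation).
    """
    size = len(m)
    return [[round(10 * math.log10(m[a][b] / freqs[b])) for b in range(size)]
            for a in range(size)]


# Toy values (assumptions): two residues, 1% chance of change per PAM unit.
m1 = [[0.99, 0.01], [0.01, 0.99]]
freqs = [0.5, 0.5]
print(log_odds(m1, freqs))               # low PAM: close to an identity matrix
print(log_odds(pam_n(m1, 100), freqs))   # higher PAM: scores flatten out
```

At low PAM the diagonal is strongly favoured and substitutions are heavily penalised; as n grows, every row of the probability matrix drifts towards the background frequencies and the scores move towards zero.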

Question 40

Q

Properties of aa going into the makeup of PAM matrices..

Answer

A

Size
Shape
Local concentrations of electric charge
van der Waals surface
Ability to form salt bridges
Hydrophobic interactions
Hydrogen bonds

Question 41

Q

What two aspects can cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences?

Answer

A

*Chance that a certain residue may have mutated, then reverted, hiding the effect of the mutation

*Specific residues may have mutated more than once → number of mutations likely to be larger than the number of differences between the two sequences.

Question 42

Q

PAM matrix; twilight zone

Answer

A

When the PAM distance value between two distantly related proteins nears the value 250 it becomes difficult to tell whether the two proteins are homologous, or if they are two randomly taken proteins that can be aligned by chance.

Question 43

Q

Low PAM

Answer

A

Closely related sequences. High scores for identity and low scores for substitutions, closer to the identity matrix.

Question 44

Q

High PAM

Answer

A

Distant sequences. At PAM200 all information is degenerate except for cysteines.

Question 45

Q

PAM error sources

Answer

A

*Many sequences depart from average composition.
*Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements observed!)
*Errors in 1PAM are magnified in the extrapolation to 250PAM.
*Distantly related sequences usually have islands (blocks) of conserved residues → Replacement is not equally probable over entire sequence.

Question 46

Q

BLOSUM

Answer

A

Blocks substitution matrix. Scores aa pairs based on frequency of aa substitutions in aligned sequence motifs called blocks that are found in protein families. Comes to the same conclusion as PAM.

Question 47

Q

BLOSUM method

Answer

A

A. Observed pairs
B. Expected pairs
C. Summary (A/B)

High BLOSUM: Closely related sequences
Low BLOSUM: Distant sequences
BLOSUM45 <-> PAM250
BLOSUM62 <-> PAM160. BLOSUM62 is the most popular matrix.
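
The observed/expected/summary steps can be sketched on one tiny ungapped block. This is a simplified illustration: the block is invented, and the clustering of similar sequences at an identity threshold (the "62" in BLOSUM62) is left out. Scores are assumed to be in half-bit units, 2 * log2(observed / expected), as in the published BLOSUM matrices.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(block):
    """Compute log-odds scores from a list of equal-length aligned sequences."""
    # A. Observed pairs: count every residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the observed pairs.
    p = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), q in observed.items():
        # B. Expected pairs if residues were paired at random.
        expected = p[a] * p[b] * (1 if a == b else 2)
        # C. Summary: log-odds of observed over expected, in half bits.
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


toy_block = ["ASA", "ASA", "ASA", "ASS"]  # hypothetical block of 4 sequences
print(blosum_scores(toy_block))
```

For this toy block the identical pairs get positive scores and the A/S pair, which is seen less often than chance predicts, gets a negative one.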

Question 48

Q

High BLOSUM

Answer

A

Closely related sequences

Question 49

Q

Low BLOSUM

Answer

A

Distant sequences

Question 50

Q

Which is the best matrix to use?

Answer

A

No single matrix is the complete answer for all sequence comparisons. It is probably best to complement the BLOSUM62 matrix with comparisons using 250PAMs and Overington structurally derived matrices.

Question 51

Q

Dotplot

Answer

A

Graphical representation using two orthogonal axes and “dots” for regions of similarity. In a bioinformatics context two sequences are placed on the axes and a dot is plotted wherever a given threshold is met within a given window.

Dot plotting is the best way to see all of the structures in common between two sequences or to visualize all of the repeated or inverted structures in one sequence.

Question 52

Q

Causes of noise in dotplots

Answer

A

Nucleic acids: 1 of 4 bases will match at random. Removing self alignments will reduce noise.

Stringency: Window size is considered, percentage of bases matching in the window is set as threshold.
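
A minimal sketch of a windowed dot plot, showing how window size and match threshold (stringency) control noise. The function name and default values are illustrative.

```python
def dotplot(seq_x, seq_y, window=3, min_matches=3):
    """Return (i, j) positions where the two windows share enough matches."""
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


# Window of 1: every single-base match is a dot, so random matches show up.
print(len(dotplot("ACGTACGT", "ACGTACGT", window=1, min_matches=1)))
# Window of 3 with all 3 matching: only the diagonals of real similarity remain.
print(dotplot("ACGT", "ACGT", window=3, min_matches=3))
```

Raising the window size or the required number of matches removes isolated random dots at the cost of missing short or weak similarities.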

Question 53

Q

Pairwise sequence alignment

Answer

A

Can be global or local. Local alignment looks at the portion that aligns optimally, while global alignment looks at everything (and we are allowed to insert gaps to make it fit).

Works for basically every sequence. However, it cannot align more than two sequences at once, and it does not scale with the size and number of sequences.

Global: Sequences are completely aligned
Local: Only the best sub-regions are aligned. BLAST uses this

Question 54

Q

Algorithm

Answer

A

Method or a process followed to solve a problem. A recipe. An algorithm takes the input to a problem (function) and transforms it to the output. A mapping of input to output. A problem can have many algorithms.

Question 55

Q

Multiple sequence alignment

Answer

A

A process of aligning multiple sequences of nucleic acids or proteins to identify similarities and differences among them.

The sequences being aligned can be DNA, RNA, or proteins, and they may come from different organisms.

The goal of multiple sequence alignment is to identify conserved regions among the sequences, which can provide insight into their evolutionary relationships and functional significance.

Used when we have more than 2 sequences. The alignment matrix gains a dimension for each extra sequence (a 3D matrix for three sequences), which takes much more computational power.

Question 56

Q

Aryabhata-Euclid’s algorithm

Answer

A

How to find gcd(a,b) - the greatest common divisor of a and b. Based on a single observation. if a = b q + r, then any divisor of a and b is also a divisor of r and any divisor of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r)

Use the division algorithm repeatedly to reduce the problem to one you can solve.
Example: gcd(55,35)
55 = 35*1 + 20 so gcd(55,35) = gcd(35,20)
35 = 20*1 + 15 so gcd(35,20) = gcd(20,15)
20 = 15*1 + 5 so gcd(20,15) = gcd(15,5)
15 = 5*3 + 0 done: gcd(55,35) = 5
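
The repeated division can be written as a short loop. This is a minimal sketch for non-negative integers; each pass replaces (a, b) by (b, r), where r is the remainder of a divided by b.

```python
def gcd(a, b):
    """Greatest common divisor by repeated division with remainder."""
    while b != 0:
        # gcd(a, b) = gcd(b, r) where a = b*q + r
        a, b = b, a % b
    return a


print(gcd(55, 35))  # 5
```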

Question 57

Q

Bubble sort algorithm

Answer

A

One of the simplest sorting algorithms. It proceeds by walking down the list, comparing adjacent elements and swapping them if they are in the wrong order. The process is repeated until the list is sorted.
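
A minimal sketch of bubble sort as described above. It works on a copy of the input and stops as soon as a full pass makes no swap.

```python
def bubble_sort(items):
    """Return a sorted copy of items using adjacent swaps."""
    result = list(items)
    n = len(result)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if result[i] > result[i + 1]:
                # Adjacent elements in the wrong order: swap them.
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        # After each pass the largest remaining element is in its final place.
        n -= 1
    return result


print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```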

Question 58

Q

Properties of an algorithm (5)

Answer

A

1. It must be correct: Compute the correct function
2. It must be composed of a series of concrete steps: Steps executable by the machine in question
3. There can be no ambiguity as to which step will be performed next
4. It must be composed of a finite number of steps
5. It must terminate

Question 59

Q

The best alignment

Answer

A

The best alignment is the one with the maximum total score

Question 60

Q

Point of Dynamic programming

Answer

A

Reduce the problem: The solution to a large problem is to simplify… if we first know the solution to a smaller problem that is a subset of the larger problem.

Turn a big problem into small problems: ask what the optimal next character is instead of what the optimal whole alignment is, then combine the partial answers at the end.

Question 61

Q

Needleman-Wunsch

Answer

A

Compare two sequences, filling the score matrix from top to bottom left to right. One line at a time.
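
A minimal sketch of the matrix fill, returning only the optimal global score (no traceback). The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the score matrix row by row and return the global alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Top to bottom, left to right, one line at a time.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # 2: three matches, one gap
```

Each cell depends only on its upper, left and upper-left neighbours, which is why the matrix can be filled in a single pass.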

Question 62

Q

Sensitivity vs specificity

Answer

A

Sensitivity: ability to find true positives

Specificity: ability to minimize false positives

There is always a trade-off, you cannot have both 100% sensitivity and specificity
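
As a worked example, the two measures are usually computed from the counts of true/false positives and negatives. The counts below are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that were found."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real negatives that were correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 100 true homologs and 100 unrelated sequences.
print(sensitivity(true_pos=90, false_neg=10))  # 0.9
print(specificity(true_neg=80, false_pos=20))  # 0.8
```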

Question 63

Q

Local alignment - Smith-Waterman

Answer

A

Alignment between parts of the two sequences.

With a global alignment we will have many matches in the high similarity section and a lot of mismatches and gaps outside this region. Therefore it makes sense to find the best local alignment instead.
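
A minimal sketch of the local variant, returning only the best local score. Compared with the global fill it has two changes: cells never drop below zero, and the result is the best cell anywhere in the matrix. The scoring values are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the score of the best local alignment between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The 0 lets an alignment restart instead of carrying a penalty.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


# Only the shared "ACG" core aligns; the flanking regions are ignored.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))  # 6
```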

Question 64

Q

Multiple seq alignment method

Answer

A

Most practical and widely used: Hierarchical extensions of pairwise alignment methods. Works by principle that multiple alignments are achieved by successive application of pairwise methods.

Answer 65

A

General purpose multiple alignment program for DNA or proteins. Improves the sensitivity of progressive sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Answer 66

A

BWT is very compact, about ½ byte/base. Fits on a standard computer with 2 GB of memory. Linear-time search algorithm. Indexing the database is the basis of the technique: the algorithm then searches the index, which is much faster than doing millions of pairwise alignments.

Answer 67

A

First, on the left, the input sequence with a dummy character (dollar sign) appended. The dummy character does not occur in the sequence and is used to keep track of the rotation.
acaacg$

Second, alphabetical sorting; the dummy character always sorts first. There are two operations: rotation and sorting. Make all possible rotations of “acaacg$”: if “ac” is moved to the end we get “aacg$ac”, and moving any number of leading characters to the end gives all possible rotations. These rotations are then sorted. After sorting, the $ sits at a different position in each row. Sorting gives the outcome interesting properties: we are sorting characters according to their context.

https://www.youtube.com/watch?v=gqM3j2IRQH4
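
The rotate-then-sort procedure can be sketched in a few lines. This naive version builds every rotation explicitly, which is fine for a flashcard example but not how genome-scale indexes are built. It assumes "$" as the dummy character, which sorts before the letters in ASCII.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    if end in text:
        raise ValueError("dummy character must not occur in the sequence")
    s = text + end
    # All rotations: move the first i characters to the end.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("acaacg"))  # gc$aaac
```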

Answer 68

A

T-ranking is a method of ranking the occurrences of each character within a string T. Each character is given a rank equal to the number of times that same character has already occurred earlier in T.

The T-ranking of a character in the BWT can be used to efficiently locate the character in the original string, which can be useful in various string search and pattern matching tasks.

Answer 69

A

The i-th occurrence of character c in L (last column) and the i-th occurrence of character c in F (first column) correspond to the same occurrence in T.
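
This property is enough to reverse the transform. The sketch below recovers the original string from the last column alone by repeatedly jumping from a character in L to the same occurrence in F; the function and variable names are illustrative, and "$" is assumed to be the dummy character.

```python
def inverse_bwt(last, end="$"):
    """Rebuild the original text from the BWT last column using the LF mapping."""
    # Rank of each character in L: how many times it occurred earlier in L.
    seen = {}
    ranks = []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # F is L sorted, so each character occupies one contiguous block of rows.
    first_start = {}
    total = 0
    for ch in sorted(seen):
        first_start[ch] = total
        total += seen[ch]
    # Row 0 starts with the dummy character; walk backwards through the text.
    row = 0
    chars = []
    while last[row] != end:
        ch = last[row]
        chars.append(ch)
        # i-th occurrence of ch in L -> i-th occurrence of ch in F.
        row = first_start[ch] + ranks[row]
    return "".join(reversed(chars))


print(inverse_bwt("gc$aaac"))  # acaacg
```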

Answer 70

A

SAM files are a type of text file format that contains the alignment information of various sequences that are mapped against reference sequences. These files can also contain unmapped sequences. Since SAM files are a text file format, they are more readable by humans

Answer 71

A

BAM files contain the same information as SAM files, except they are in binary file format which is not readable by humans.

On the other hand, BAM files are smaller and more efficient for software to work with than SAM files, saving time and reducing costs of computation and storage.

Alignment data is almost always stored in BAM files and most software that analyzes aligned reads expects to ingest data in BAM format.

Answer 72

A

The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information.

The alignment section contains the information for each sequence about where/how it aligns to the reference genome.

Each alignment has:
*query name, QNAME (SAM)/read_name (BAM). It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.

*bitwise set of information describing the alignment, FLAG. Provides the following information:
-are there multiple fragments?
-are all fragments properly aligned?
-is this fragment unmapped?
-is the next fragment unmapped?
-is this query the reverse strand?
-is the next fragment the reverse strand?
-is this the 1st fragment?
-is this the last fragment?
-is this a secondary alignment?
-did this read fail quality controls?
-is this read a PCR or optical duplicate?
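
Each question in the list corresponds to one bit of the FLAG integer. A minimal sketch of decoding it, using the bit values from the SAM specification (the short labels are my own wording):

```python
SAM_FLAG_BITS = [
    (0x1, "multiple fragments"),
    (0x2, "all fragments properly aligned"),
    (0x4, "unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "reverse strand"),
    (0x20, "next fragment reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


print(decode_flag(99))
```

A FLAG of 99 is 1 + 2 + 32 + 64, so it decodes to a properly aligned first fragment of a pair whose mate is on the reverse strand.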

Answer 73

A

The sequence being aligned to a reference may have additional bases that are not in the reference or may be missing bases that are in the reference.

The CIGAR string is a sequence of base lengths and the associated operation. They are used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.

https://genome.sph.umich.edu/wiki/SAM
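
A minimal sketch of reading a CIGAR string into (length, operation) pairs and deriving how many reference and read bases the alignment covers. It assumes the standard operation letters of the SAM specification: M, =, X, D and N consume the reference; M, =, X, I and S consume the read.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("3M1I3M1D5M"))
print(reference_span("3M1I3M1D5M"), query_length("3M1I3M1D5M"))  # 12 12
```

In the example, the insertion adds one read base that is absent from the reference and the deletion skips one reference base that is absent from the read, so both totals come out at 12.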

Answer 74

A

In FASTA, a hit is something in the database that is similar to the query. Similar: a short stretch of sequence is shared. There are different definitions of the stretch.

Answer 75

A

*For proteins, similar seq does not have to share identical residues.
*For nucleic acids due to codon “wobble”, DNA sequences may look like XXyXXyXXy where X’s are conserved and y’s are not.

Answer 76

A

BLAST searches a large target set of sequences for hits to a query sequence and returns the alignments and scores from those hits. This process is fast. BLAST programs are designed for fast database searching with minimal sacrifice of sensitivity to distantly related sequences.

Answer 77

A

Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T

Key concept “Neighbourhood”: Seems similar to FASTA, but we are searching for words that score above T rather than words that match exactly

Calculate the neighbourhood (T) for substrings of the query (size W)
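
A minimal sketch of building the neighbourhood of one query word: every word of the same length whose score against the query word is at least T. To keep it short it uses a three-letter toy alphabet and a tiny hand-written score table; a real search would use the full 20-letter alphabet and a matrix such as BLOSUM62.

```python
from itertools import product

# Toy substitution scores (assumption: a small excerpt-style table for A, S, W).
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("W", "W"): 11,
    ("A", "S"): 1, ("A", "W"): -3, ("S", "W"): -3,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return TOY_SCORES[(a, b)] if (a, b) in TOY_SCORES else TOY_SCORES[(b, a)]


def neighborhood(word, threshold, alphabet="ASW"):
    """Return (candidate, score) for all words scoring >= threshold vs word."""
    hits = []
    for letters in product(alphabet, repeat=len(word)):
        score = sum(pair_score(x, y) for x, y in zip(word, letters))
        if score >= threshold:
            hits.append(("".join(letters), score))
    # Best-scoring words first, ties in alphabetical order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(neighborhood("AW", threshold=12))  # [('AW', 15), ('SW', 12)]
print(neighborhood("AW", threshold=15))  # [('AW', 15)]
```

Lowering the threshold admits the non-identical word "SW" into the neighbourhood, which is how BLAST can seed hits that an exact word match would miss.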

Answer 78

A

Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set.
High T = Everything has to be very similar, very specific but not very sensitive.
Low T = more sensitive but less specific. Typically start with high T and lower it as you move forward.

Choosing a value for w
small w: many matches to expand
big w: many words to be generated
w=4 is a good compromise

Lowering the segment extension cutoff (S) returns longer extensions for each hit.

Changing the minimum E-value changes the threshold for reporting a hit.

Answer 79

A

The proper value of T depends on both the values in the scoring matrix and balance between speed and sensitivity

Higher values of T progressively remove more word hits and reduce the search space.

Word size (W) of 1 will produce more hits than a word size of 10. In general, if T is scaled uniformly with W, smaller word sizes increase sensitivity and decrease speed.

The interplay between W, T and the scoring matrix is critical, and choosing them wisely is the most effective way of controlling the speed and sensitivity of BLAST. For proteins, w=3 is the most common.

Answer 80

A

Doing BLAST is doing an experiment. A key to the utility of BLAST is the ability to calculate expected probabilities of occurrence of Maximum Segment Pairs (MSPs) given w and T. This allows BLAST to rank matching sequences in order of “significance” and to cut off listings at a user-specified probability.

The background distribution of scores must be turned into p-values. For example: given the background distribution, what is the chance of seeing a score of 200? The higher the score, the lower the p-value.

Answer 81

A

The Erdös-Renyi model, also known as the random graph model, is a statistical model for generating random graphs with a given number of nodes and edges. It is based on the idea of randomly connecting nodes with a certain probability, resulting in a graph that exhibits certain probabilistic properties.

Example;
p is probability of “head” when tossing a coin. p=0.5

For n throws, expected length R of the longest run of heads is:
R = log_{1/p}(n), i.e. the logarithm of n to the base 1/p

Want to model aa seq alignment as coin tosses
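
As a worked example of the coin-toss formula, the sketch below computes the expected longest run for n tosses and measures the actual longest run in a given toss string. The function names are illustrative.

```python
import math


def expected_longest_run(n, p=0.5):
    """R = log base 1/p of n: expected longest run of heads in n tosses."""
    return math.log(n) / math.log(1 / p)


def longest_run(tosses, symbol="H"):
    """Length of the longest uninterrupted run of symbol in a toss string."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


print(expected_longest_run(1024))   # about 10 for a fair coin
print(longest_run("HTHHHTTHH"))     # 3
```

In the alignment analogy a "head" is a matching position, so the formula suggests how long a run of matches can be expected by chance alone when comparing sequences of a given size.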

Answer 82

A

A set of mathematical formulas that are used to evaluate the statistical significance of sequence alignments obtained through the use of heuristics.

Are used to calculate the probability that an alignment occurred by chance, allowing researchers to determine the likelihood that the alignment is biologically meaningful.

Widely used in bioinformatics to assess the reliability of sequence alignments and to help identify significant matches in large databases.

Answer 83

A

Probability that alignment is no better than random .
P=100E-100 perfect match
P>10E-1 match probably insignificant

Answer 84

A

Expected number of sequences that give the same Z-value or better if the database is probed with a random sequence.

E = P multiplied by the size of the database probed

Answer 85

A

A measure of the statistical significance of a particular match between a query sequence and a database of sequences.

The z-score is calculated based on the alignment score and the distribution of scores for a large number of random alignments.

A higher z-score indicates a more statistically significant match, and a z-score threshold can be used to determine which matches are considered significant and should be reported.

Z-scores are commonly used in bioinformatics to evaluate the statistical significance of sequence alignments obtained through database searches.
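
A minimal sketch of the calculation, assuming the usual definition: the alignment score minus the mean of the random-alignment scores, divided by their standard deviation. The random scores below are invented; in practice they would come from aligning shuffled sequences.

```python
import statistics


def z_score(score, random_scores):
    """How many standard deviations the score lies above the random mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)  # sample standard deviation
    return (score - mean) / spread


background = [10, 12, 14, 16, 18]  # hypothetical scores of random alignments
print(round(z_score(30, background), 2))  # 5.06
```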

Answer 86

A

BLAST’s major advantage is its speed. 2-3 minutes for BLAST versus several hours for a sensitive FastA search of the whole of GenBank.

When both programs use their default settings, BLAST is usually more sensitive than FastA for detecting protein sequence similarity, since it doesn’t require a perfect sequence match in the first stage of the search.

Answer 87

A

The long word size it uses in the initial stage of DNA sequence similarity searches was chosen for speed, and not sensitivity.
For a thorough DNA similarity search, FastA is the program of choice, especially when run with a lowered KTup value.

FastA is also better suited to the specialised task of detecting genomic DNA regions using a cDNA query sequence, because it allows the use of a gap extension penalty of 0. BLAST, which only creates ungapped alignments, will usually detect only the longest exon, or fail altogether.

In general, a BLAST search using the default parameters should be the first step in a database similarity search strategy. In many cases, this is all that may be required to yield all the information needed, in a very short time.

Answer 88

A

Position Specific Iterated Blast. The best algorithm to find distantly related sequences.

Answer 89

A

For each position in the derived pattern, every amino acid is assigned a score.
(1) Highly conserved residue at a position: that residue is assigned a high positive score, and others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive scores near zero.
(3) Position-specific scores can also be assigned to potential insertions and deletions.
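
Points (1) and (2) can be sketched as a log-odds calculation per alignment column. This is a simplified illustration, not the exact procedure used by any particular program: it uses a four-letter toy alphabet, a uniform background, a pseudocount of 1, and ignores insertions and deletions.

```python
import math


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log2 odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        denominator = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            frequency = (column.count(residue) + pseudocount) / denominator
            scores[residue] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


toy_alignment = ["AC", "AD", "AE", "AA"]  # column 1 conserved, column 2 not
pssm = build_pssm(toy_alignment, alphabet="ACDE")
print(pssm[0])  # conserved column: A clearly positive, others negative
print(pssm[1])  # weakly conserved column: all scores at zero
```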

Answer 90

A

Avoid sequences that are too close: overfitting! Aim for a compromise between the PSSM and overfitting.
Where you suspect overfitting, do not use a PSSM; use a normal scoring matrix instead, where you do not need to be position specific.

Can include false homologs! Therefore check the matches carefully: include or exclude sequences based on biological knowledge. If you look for a family about which not much is known, there is a risk of putting too much emphasis on database entries that perhaps do not deserve it.

The E-value reflects the significance of the match to the previous training set not to the original sequence!

Choose carefully your query sequence.

Try reverse experiment to certify.

Answer 91

A

Pattern-Hit Initiated Blast.

Look into the database: everything reported as a hit has to contain a certain conserved pattern and be homologous. Like doing a FASTA search inside a BLAST search.

Answer 92

A

BLAST-Like Alignment Tool. Aligns the input sequence to the Human Genome. Connected to several databases.

Answer 93

A

-more accurate
-500 times faster in mRNA/DNA alignment
-50 times faster in protein/protein alignment

Answer 94

A

Phylogenetic trees are about visualising evolutionary relationships with the purpose to illustrate how a group of objects are related to one another.

Answer 95

A

Set of species that include all of the species derived from a single common ancestor

Answer 96

A

Smallest group that is consistently and persistently distinct. Species recognized initially on appearance; individuals of one species look different from the individuals from another. For plant species.

Answer 97

A

A set of interbreeding or potentially interbreeding individuals that are separated from other species by reproductive barriers. Different species are unable to interbreed.

Answer 98

A

The boundary between reticulate (among interbreeding individuals) and divergent relationships (between lineages with no gene exchange). If a stable gene pool can be maintained.

Answer 99

A

Ability to transmit (and maintain) a (stable) gene pool. Addresses the Anopheles genome topology variations

Answer 100

A

-solve crimes

-test product purity

-determine if endangered species have been smuggled or mislabeled

-Epidemiologists use phylogenetic methods to understand the development of pandemics, patterns of disease transmission and development of antimicrobial resistance or pathogenicity.

-Conservation biologists may use the techniques to determine which populations are in greatest need of protection, and other questions of population structure.

-Pharmaceutical researchers may use the methods to determine which species are most closely related to other medicinal species, thus perhaps sharing the medicinal qualities

Answer 101

A

To infer relationships that span the diversity of known life, it is necessary to look at genes conserved through the billions of years of evolutionary divergence.

The gene must display an appropriate level of sequence conservation for the divergences of interest.

If there is too much change, then the sequences become randomized, and there is a limit to the depth of the divergences that can be accurately inferred.

If there is too little change (if the gene is too conserved), then there may be little or no change between the evolutionary branchings of interest, and it will not be possible to infer close (genus or species level) relationships.

An example of genes in this category are those that define the ribosomal RNAs (rRNAs). Most prokaryotes have three rRNAs, called the 5S, 16S and 23S rRNA.

Answer 102

A

Rate of evolution = rate of mutation. The rate of evolution for any given macromolecule is approximately constant over time (neutral theory of evolution).

One amino acid substitution every 14.5 My
1.3 × 10^-9 substitutions/nucleotide site/year

Proteins evolve at very different rates, depending on the type of gene. The slowest are those related to protein turnover (quite conserved), while pseudogenes (typically a gene with a premature stop, so no full protein is translated and there is no pressure to keep it) evolve the fastest.
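
As a rough worked example of the clock idea, a divergence time can be estimated from a pairwise distance as t = d / (2r), assuming the nucleotide rate quoted on this card applies equally to both lineages. The function name and the example distance below are illustrative only.

```python
def divergence_time_years(distance_per_site, rate_per_site_per_year=1.3e-9):
    """Estimate the time since two lineages split under a strict molecular clock.

    Both lineages accumulate substitutions independently, hence the factor 2.
    """
    return distance_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical example: two sequences differ by 0.026 substitutions per site.
t = divergence_time_years(0.026)
print(f"{t / 1e6:.1f} million years")
```

The estimate is only as good as the assumption of a constant rate; saturated or rate-variable sequences would need a corrected distance first.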

Answer 103

A

-Easy to perform
-Quick calculation
-Fit for sequences having high similarity scores

Answer 104

A

-Sequences not considered as such
-All sites equally treated (do not take differences in substitution rates into account)
-Not applicable to distantly divergent sequences
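
Assuming these two cards refer to distance-matrix methods, the sketch below shows why "sequences are not considered as such": each pair of aligned sequences is reduced to a single number. It uses the simple p-distance (proportion of differing sites); the toy alignment and function names are made up for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair of aligned sequences."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAAC-A"}
for pair, dist in distance_matrix(toy).items():
    print(pair, round(dist, 3))
```

Every site contributes equally here, which is the second disadvantage listed above.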

Answer 105

A

Able to keep mutations as status quo.

The bases of all sequences at each site considered separately and the log-likelihood of having these bases are computed for a given topology by using a particular probability model.

Log-likelihood is added for all sites, sum of log-likelihood maximized to estimate branch length of the tree.

Procedure repeated for all possible topologies, topology showing highest likelihood is chosen as final tree.
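
A minimal sketch of the per-site log-likelihood idea, reduced to the simplest possible case: two sequences joined by one branch, so there is only one topology and only the branch length is estimated. The Jukes-Cantor model is assumed here as the "particular probability model"; real programs do this over whole trees and richer models.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Summed log-likelihood of two aligned sequences separated by branch length d
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site shows the same base at both ends
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base at one end of the branch
        total += math.log(0.25 * (p_same if a == b else p_diff))
    return total


def ml_branch_length(seq_a, seq_b, step=0.001, max_d=3.0):
    """Grid search for the branch length that maximizes the summed log-likelihood."""
    candidates = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))


# Toy data: 20 sites, 2 of which differ.
seq1 = "ACGT" * 5
seq2 = "ACGT" * 4 + "ACCA"
print(ml_branch_length(seq1, seq2))
```

For this two-sequence case the grid search should land close to the closed-form Jukes-Cantor distance, -3/4 · ln(1 - 4p/3), where p is the proportion of differing sites.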

Answer 106

A

Needs a long computation time to construct a tree.
The number of possible trees becomes enormous, so evaluating every topology does not work for most problems.
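
To give a feel for how quickly the number of trees grows, the sketch below uses the standard counting formulas for strictly bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n taxa. The function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies, (2n - 5)!!, for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa):
    """Number of rooted bifurcating topologies, (2n - 3)!!, for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 10, 20):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

With only 10 taxa there are already about two million unrooted topologies, which is why exhaustive searches are replaced by heuristics in practice.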

Answer 107

A

Consists of determining the minimum number of changes (substitutions) required to transform a sequence into its nearest neighbour

Answer 108

A

Searches for the minimum number of genetic events to infer the most parsimonious tree from a set of sequences.

The best tree is the one that requires the least number of substitutions.
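
One common way to count the substitutions a given tree requires is Fitch's small-parsimony algorithm; the card does not name an algorithm, so this is an assumed choice. The sketch scores rooted binary trees written as nested tuples, on a made-up four-taxon alignment.

```python
def fitch(tree, site):
    """Return (possible_states, changes) for one alignment column.

    `tree` is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D"));
    `site` maps each leaf name to its base in this column.
    """
    if isinstance(tree, str):  # leaf: its state is known, no change needed
        return {site[tree]}, 0
    left_set, left_cost = fitch(tree[0], site)
    right_set, right_cost = fitch(tree[1], site)
    common = left_set & right_set
    if common:  # children can share a state: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of substitutions the tree requires over all columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


aln = {"A": "AAC", "B": "AAC", "C": "TGC", "D": "TGA"}
candidates = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D")), (("A", "D"), ("B", "C"))]
best = min(candidates, key=lambda t: parsimony_score(t, aln))
print(best, parsimony_score(best, aln))
```

The tree grouping A with B and C with D needs the fewest substitutions for this toy data, so it would be chosen as the most parsimonious.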

Answer 109

A

-If the evolutionary clock is not constant, the procedure generates results which can be misleading;
-within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees", which makes it difficult to justify the choice of a particular tree;
-long computation time to construct a tree.

Answer 110

A

In an unrooted tree the direction of evolution is unknown

The root is the hypothesized ancestor of the sequences in the tree

The root can either be placed on a branch or at a node

You should start by viewing an unrooted tree

Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot)

Sometimes two trees may look very different but, in fact, differ only in the position of the root

This normally involves assumptions… BEWARE!

Answer 111

A

Bootstrapping is a statistical method that is used to assess the reliability of a phylogenetic tree, which is a tree showing the evolutionary relationships among a group of organisms.

The basic idea behind bootstrapping is to create a large number of trees based on different samples of the data used to construct the original tree.

To do this, columns of the alignment (including gaps and such) are drawn at random with replacement, so some columns are copied several times and others are left out; this continues until the new "alignment" has the same length as the original alignment.
This process is done N times, and the tree-building method is run on each of the resampled alignments. From the N trees generated this way you make a consensus tree. You should choose N to be at least 10x the length of the alignment.
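
A minimal sketch of the resampling step, assuming the usual form of the bootstrap in which single alignment columns are drawn with replacement. The alignment is a plain dict of equal-length strings, and the function names and seed handling are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw columns with replacement until the replicate is as long as the original."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so columns stay intact.
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n, seed=0):
    """Return n pseudo-alignments; a tree would then be built from each one."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


aln = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "AC-TAC"}
for replicate in bootstrap_replicates(aln, 3):
    print(replicate)
```

Tree building and the consensus step are left out here; any tree method can be applied to each replicate.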

Answer 112

A

A bootstrap value is a measure of how often a particular branch appears in the bootstrap sample. For example, if a particular branch appears in 90% of the trees in the bootstrap sample, its bootstrap value would be 90.

There is no simple mapping between bootstrap values and confidence intervals, and there is no agreement about what constitutes a 'good' bootstrap value (> 70%, > 80%, > 85%?).
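
The 90% example can be reproduced with a small sketch. It assumes each bootstrap tree has already been reduced to its set of clades, each clade being a frozenset of taxon names; this representation is an illustrative choice, not a standard file format.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


# Ten hypothetical bootstrap trees: nine group A with B, one groups A with C.
trees = [{frozenset("AB"), frozenset("CD")} for _ in range(9)]
trees.append({frozenset("AC"), frozenset("BD")})

print(bootstrap_support("AB", trees))
```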

Answer 113

A

Jack-knifing is very similar to bootstrapping and differs only in the character resampling strategy
Jack-knifing is not as widely available or widely used as bootstrapping
Tends to produce broadly similar results

Answer 114

A

This technique resamples half of the sequence sites considered and eliminates the rest. The final sample has half the initial number of sites, without duplication. Half-jackknife is almost never done; this is horizontal (whereas bootstrapping is vertical), so you take out some of the sequences instead of taking parts of the alignment out.
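
A minimal sketch of the site-based version described in the first two sentences: half of the alignment columns are kept, each at most once. The function name and seed are illustrative.

```python
import random


def half_jackknife(alignment, seed=0):
    """Keep a random half of the columns, without duplication, in their original order."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    kept = sorted(rng.sample(range(length), length // 2))
    return {name: "".join(seq[c] for c in kept) for name, seq in alignment.items()}


aln = {"s1": "ACGTACGT", "s2": "ACGTTCGA"}
print(half_jackknife(aln))
```

Unlike the bootstrap, no column can appear twice, because sampling is done without replacement.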

Answer 115

A

0: Zeroth. Amino acid composition (proteomics, %cysteine, %glycine). Cysteine: cysteine bridges. Glycine: spacers, to make functional domains in the proteins.

1: Primary. This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself.

2: Secondary. Local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.

3: Tertiary
Packing of secondary structure elements into a compact spatial unit
Fold or domain – this is the level to which structure prediction is currently possible

4: Quaternary structure
Assembly of homo- or heterodimeric protein chains
Hard to predict
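
The "zeroth" level can be computed directly from a sequence. A minimal sketch, using a made-up ten-residue peptide in one-letter code:

```python
from collections import Counter


def aa_composition(sequence):
    """Return {residue: percentage} for a protein sequence in one-letter code."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


comp = aa_composition("MGCCGAGKCG")  # hypothetical peptide
print(f"%Cys = {comp.get('C', 0):.1f}, %Gly = {comp.get('G', 0):.1f}")
```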

Answer 116

A

Shows the psi and phi angles, which go from -180 to +180. Looking at known structures enables us to estimate the angles.

Nature has a highly expressive alphabet for primary sequences, but due to the nature of the peptide bond, certain angles are observed preferentially.
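
Phi and psi are dihedral angles, each defined by four consecutive backbone atoms: C(i-1), N, CA, C for phi and N, CA, C, N(i+1) for psi. The sketch below computes a signed dihedral from four 3D points with the standard atan2 formula; the test coordinates are artificial, not taken from a real structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees (-180 to +180) defined by four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Artificial points: the last atom is rotated a quarter turn around the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying this to every residue of a known structure and plotting phi against psi gives the plot described on this card.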

Answer 117

A

Secondary structure prediction
The method uses a set of empirical rules that consider the amino acid sequence of a protein and the physical and chemical properties of individual amino acids. The rules are used to predict the likelihood that a particular amino acid will be part of an alpha helix, a beta sheet or a loop region.

It is a widely used method in protein structure prediction, but it is not as accurate as some recent methods. It is still useful for understanding basic principles of protein structure and identifying potentially important parts of a protein.

The method consists of assigning a set of prediction values to a residue, based on statistical analysis of 15 proteins, and applying a simple algorithm to those numbers.

Answer 118

A

A plot, x-axis is length of alignment and y-axis is % identical residues

Naturally occurring sequences with >20% sequence identity over 80 or more residues always adopt the same basic structure

The line of the plot is important because it tells us that if the alignment is sufficiently long and we have 30% identical residues –> the structures are the same.
Remarkably low percentage needed to say that the structure is the same.

Answer 119

A

Compact folding unit of protein structure, usually associated with a function. Is usually a “fold” in the case of monomeric soluble proteins. Comprises normally only one protein chain. Domains can be shared between different proteins.

Answer 120

A

Membrane bound receptors

A very large number of different domains both to bind their ligand and to activate G proteins.

Pharmaceutically the most important class

Answer 121

A

X-ray crystallography is an experimental technique that exploits the fact that X-rays are diffracted by crystals.

X-rays have the proper wavelength (in the Ångström range, ~10^-8 cm) to be scattered by the electron cloud of an atom of comparable size.

uses protein crystals

Answer 122

A

NMR uses protein in solution
– Can look at the dynamic properties of the protein structure
– Can look at the interactions between the protein and ligands, substrates or other proteins
– Can look at protein folding
– Sample is not damaged in any way
– The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
– High solubility is a requirement

Answer 123

A

a) Finding a structural homologue
b) Extract “template” sequences and align with query
c) Input for model building
d) Methods
e) Model evaluation (How good is the prediction? How much can the algorithm rely on, and extract from, the provided templates?)

Answer 124

A

CASP is a biennial experiment that aims to evaluate and compare the accuracy of different methods for predicting the 3D structure of proteins from their amino acid sequences.

During the experiment, participating groups submit predictions for a set of proteins whose structures are not yet known (referred to as “targets”). The structures of these proteins are later determined experimentally and the predictions are evaluated for their accuracy. The results of the CASP experiment provide a benchmark for the current state of the art in protein structure prediction and help researchers identify areas for improvement in their methods.

Answer 125

A

genome structure
gene-organisation
known promoter regions
known critical amino acid residues.

Answer 126

A

All cells have “sidechains” or molecules hanging outside of them that recognize specific extracellular chemicals

Answer 127

A

Cells have receptive substances on them that can be affected by agonist molecules or blocked by antagonist molecules

Answer 128

A

Enzymes have an active site (LOCK) where the substrate (KEY) binds. The enzyme's action on the substrate makes the key ill-fitting, and the product leaves the active site.

Answer 129

A

a large collection of compounds with different chemical properties or shapes, generated either by combinatorial chemistry or some other process or by collecting samples with interesting biological properties.

Answer 130

A

the automated examination and testing of libraries of synthetic and/or organic compounds and extracts to identify potential drug leads, based on the compound’s binding affinity for a target molecule.

Answer 131

A

Concentration at which 50% of the enzyme activity is inhibited. Activity can be saturated. You need to be sure that you have a single compound binding a single target, and not multiple compounds or multiple targets. Used to double-check that everything done up to this point is correct.
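
As an illustration of how such a value might be read off measured data, the sketch below interpolates on a log-concentration scale between the two measurements that bracket 50% inhibition. The Hill-type curve used to generate the toy data is an assumed model, and all concentrations are hypothetical.

```python
import math


def fraction_inhibited(conc, ic50, hill=1.0):
    """Assumed Hill-type dose-response: 0 without inhibitor, approaching 1 at saturation."""
    return conc**hill / (ic50**hill + conc**hill)


def estimate_ic50(concs, inhibitions):
    """Log-linear interpolation to the concentration giving 50% inhibition.

    Expects ascending concentrations whose inhibition values cross 0.5.
    """
    points = list(zip(concs, inhibitions))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 0.5 <= y_hi:
            if y_hi == y_lo:
                return c_lo
            frac = (0.5 - y_lo) / (y_hi - y_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10**log_c
    raise ValueError("data do not bracket 50% inhibition")


# Toy data in nM, generated from the assumed model with a true value of 50 nM.
concs = [1, 10, 100, 1000]
data = [fraction_inhibited(c, ic50=50) for c in concs]
print(round(estimate_ic50(concs, data), 1))
```

With only four concentrations the interpolated value is approximate; a denser dilution series or a proper curve fit would tighten it.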

Answer 132

A

a potential drug candidate emerging from a screening process of a large library of compounds

Answer 133

A

-It specifically affects a biological process. Mechanism of activity (reversible/irreversible, kinetics) established
-It is effective at a low concentration: usually nanomolar activity
-It is not toxic to live cells
-It has been shown to have some in vivo activity
-It is chemically feasible. Specificity of key compound(s) from each lead series against selected number of receptors/enzymes
-Preliminary PK in vivo (rodent) to establish benchmark for in vitro SAR
-In vitro PK data good predictor for in vivo activity
-It is of course new and original.

Answer 134

A

Poor absorption or permeation is more likely when:
1. There are > 5 H-bond donors (expressed as the sum of OHs and NHs);
2. The MWT is > 500;
3. The LogP is > 5 (or MLogP is > 4.15);
4. There are more than 10 H-bond acceptors (expressed as the sum of Ns and Os)
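
The rule is easy to express as a check. In its usual form, poor absorption is flagged when a compound has more than 5 H-bond donors, a molecular weight above 500, a LogP above 5, or more than 10 H-bond acceptors. The sketch assumes the four descriptors have already been computed elsewhere; both example compounds are hypothetical.

```python
def lipinski_violations(h_donors, h_acceptors, mol_weight, log_p):
    """Return the list of rule-of-five criteria that a compound violates."""
    violations = []
    if h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if mol_weight > 500:
        violations.append("molecular weight above 500")
    if log_p > 5:
        violations.append("LogP above 5")
    if h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


# Hypothetical descriptor values for two compounds.
print(lipinski_violations(h_donors=1, h_acceptors=4, mol_weight=180.2, log_p=1.2))
print(lipinski_violations(h_donors=6, h_acceptors=12, mol_weight=650.0, log_p=5.5))
```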

Answer 135

A

New concept. A trial where the participants are deliberately infected, to fast-track data acquisition and get the vaccine sooner. Of course not for all diseases. Could work if you are young and healthy. A much smaller trial than normal is needed when all participants get infected.

With a 50/50 CT of patients receiving and not receiving the drug. If different companies do the same trial they should have shared control arms: is it necessary to let that many people go without the drug?

Answer 136

A

Personal genomics manifesto

Answer 137

A

Clusters of conserved residues. They carry out a particular function or form a particular structure that is important for the conserved protein.

Answer 138

A

For amino acids, a number representing the hydrophobic or hydrophilic properties of its side-chain. The larger the number is, the more hydrophobic the amino acid. The most hydrophobic amino acids are isoleucine (4.5) and valine (4.2). The most hydrophilic ones are arginine (-4.5) and lysine (-3.9). This is very important in protein structure; hydrophobic amino acids tend to be internal in the protein 3D structure, while hydrophilic amino acids are more commonly found towards the protein surface.

For a Kyte-Doolittle plot, a window size of 19 with peaks >1.8 indicates possible transmembrane regions, whereas a window size of 9 indicates possible surface regions of globular proteins.
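
A minimal sketch of the sliding-window calculation behind such a plot, using the Kyte-Doolittle hydropathy scale and the window size and threshold quoted on this card. The test sequence is artificial, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values, one-letter amino acid code.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def possible_tm_windows(sequence, window=19, threshold=1.8):
    """Start positions (0-based) of windows whose mean exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > threshold]


# Artificial sequence: a charged stretch followed by a hydrophobic stretch.
seq = "K" * 19 + "L" * 19
print(possible_tm_windows(seq))
```

Using `window=9` and looking for strongly negative means instead would target the surface regions mentioned above.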
