Bioinformatics Flashcards

0
Q

Given a gene, how would you find information about it?

A
Find the sequence (EMBL, DDBJ)
Literature databases
Genomics databases (MIM)
Gene expression databases (NCBI GEO)
Interaction databases (IntAct, BIND)
Metabolic pathway databases (ENZYME, KEGG, Reactome)
Mutation/polymorphism databases (dbSNP)
1
Q

What is a database?

A

Data collection that is structured, searchable, updatable, cross-linked and publicly available.

2
Q

Why does BLAST work?

A

Similar sequences tend to have similar function

Similar sequences tend to be evolutionarily related

3
Q

How can you be sure your BLAST match is significant?

A

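E-value: the number of matches this good expected by chance (for small values, roughly equal to the probability of a chance match)

E = m × n × 2^(-S)
m - number of nucleotides your sequence was compared against
n - number of nucleotides in your sequence
2^(-S) - 2 to the power of minus the match score S, so E gets smaller as the sequences get more similar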
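
A small sketch of the E-value arithmetic, assuming the formula is read as E = m × n × 2^(-S) with S a bit score. The function name and the example lengths are made up for illustration; real BLAST also applies length corrections that are ignored here.

```python
def blast_evalue(m: int, n: int, bit_score: float) -> float:
    """Expected number of chance hits: E = m * n * 2**(-S).

    m: residues in the database searched (assumed value in the demo)
    n: residues in the query (assumed value in the demo)
    bit_score: normalised alignment score S
    """
    return m * n * 2.0 ** (-bit_score)


if __name__ == "__main__":
    db_len, query_len = 1_000_000, 100  # hypothetical sizes
    for s in (20, 40, 60):
        # A higher score gives a smaller E, i.e. a more significant match
        print(s, blast_evalue(db_len, query_len, s))
```

Each extra bit of score halves E, which is why high-scoring hits quickly become very unlikely to be chance matches.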

4
Q

Which BLAST program is used for amino acid (aa) sequences?

Which is used for nucleotide sequences?

A

BLASTp

BLASTn

5
Q

What is BLASTx?

A

BLASTs a nucleotide sequence, translated in all 6 frames, against an aa sequence database
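
To make "all 6 frames" concrete, here is a minimal sketch of six-frame translation using the standard genetic code. The function names are hypothetical, and this is not how BLAST itself is implemented; it only shows the three forward and three reverse-complement frames.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```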

6
Q

What does tBLASTn do?

A

BLASTs an aa sequence against a nucleotide database that has been translated in all 6 frames

7
Q

What does tBLASTx do?

A

Translates your nucleotide query in all six frames into aa and compares it against database nucleotide sequences that are also translated in all six frames

Good for distantly related sequences.

8
Q

What is MegaBLAST?

A

Quicker than BLASTn but less sensitive

Use it by default, unless you are looking for distantly related sequences (use tBLASTx for those)

9
Q

What is PSI-BLAST?

A

A very sensitive BLAST variant (Position-Specific Iterated BLAST) that takes into account that some regions are more conserved than others. It takes a long time to run.

10
Q

What is special about multiple sequence alignments?

A

They can reveal subtle conservation of genome features, because conserved areas evolve/change more slowly. Alignments of more than 3 sequences can show evolutionary relationships.
E.g. demographic and ecological histories of populations: gene flow, size changes, natural selection, migrations.

11
Q

Local vs global alignments

A

Global - end to end alignments

Local - specific regions of sequence

12
Q

Common mismatch scoring schemes

A
Nucleotide mismatch 
Aa mismatch (BLOSUM, PAM)
13
Q

How are most multiple alignments done?

A

Multiple alignments are built up from pairwise alignments. Mismatch scores are used to find the best-scoring alignment, using a technique called dynamic programming.
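
A minimal sketch of the dynamic programming idea for a single pairwise global alignment (Needleman-Wunsch style). It covers only the pairwise step, not the assembly into a multiple alignment, and the match/mismatch/gap values are illustrative assumptions rather than a standard scoring scheme.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best end-to-end alignment score of a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row/column: aligning a prefix against nothing costs only gaps
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: align two residues, or gap in either sequence
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

For protein sequences the match/mismatch term would typically be looked up in a substitution matrix instead of being a fixed pair of values.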

14
Q

Pairwise alignment methods

A

ClustalW - global alignment, sequences up to about 20 kb long
MUSCLE - global and local, up to about 100 kb long
MAUVE - global, up to about 10 Mb long

15
Q

Uses of sequence databases in bioinformatics

A
Retrieve known gene sequence
Find information on a gene
Compare sequence to others in DB
Submit sequence to be stored with rest
Find how many genes an organism has
16
Q

Why is it harder to do gene prediction in humans than in bacteria?

A

Bacteria have specific and well-understood promoter sequences (easy to identify). Each protein-coding sequence is one contiguous ORF.

Human promoters are complex and less well understood (harder to identify). Protein-coding sequence is divided into exons and spliced variably.

17
Q

Why would you want to know the GC content of a sequence?

A

Higher GC content is generally associated with longer protein-coding regions.
It affects the melting temperature for PCR.
Different organisms have different GC content.
Useful in mapping exon-rich regions.
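
A short sketch of how GC content and a rough primer melting temperature could be computed. The Tm estimate uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a crude approximation for short oligos; the function names are made up for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCGCGTATAGC"  # hypothetical primer
    print(f"GC content: {gc_content(primer):.2f}")
    print(f"Approximate Tm: {wallace_tm(primer)} C")
```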

18
Q

Which genes are more homologous, this pair or that pair?

A

You can't quantify homology. It is a conceptual framework for defining the evolutionary relationship between two genes. You can quantify similarity. If the genes come from different species you can look at orthology.

19
Q

Why is bioinformatics needed?

A
Small and large scale analysis
New lab techniques
Single -> whole genome
Collection/storage of data
Manipulation of data
20
Q

Examples of sequence databases

A

EMBL
DDBJ
GenBank

21
Q

What do genomics databases contain

A

Info about gene chromosomal location
Nomenclature
Links to sequence databases

E.g. MIM

22
Q

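What is an isoform?

A

An alternative form of a sequence, e.g. a different transcript or protein produced from the same gene by alternative splicing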

23
Q

Examples of gene expression databases

A

NCBI GEO

24
Q

How do you remove vector sequence from a DNA sample sequence?

A

Run it against a vector sequence database, e.g. UniVec

25
Q

How to choose the most likely translation result

A
Usually the longest ORF
Starting with Met
Ending in a stop codon
No stops within the sequence
Confirm with promoter prediction
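
The criteria above can be sketched as a simple longest-ORF search. This is a minimal illustration, assuming ATG as the only start codon and scanning only the three forward frames; the reverse strand could be handled by running the same function on the reverse complement. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna: str) -> str:
    """Longest ATG...stop stretch with no in-frame stop, forward strand only."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the current candidate start codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATGACC"))
```

An ORF that runs off the end of the sequence without a stop codon is ignored here, which matches the "ending in a stop codon" criterion.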
26
Q

Examples of gene prediction software

A

GeneMark

GENSCAN

27
Q

Translation and promoter prediction software

A

NCBI ORF Finder

Promoter 2.0 Prediction Server

28
Q

Protein sequence databases

A

UniProt
GenPept
RefSeq

29
Q

Database of 3D structures

A

Protein Data Bank (PDB)

30
Q

Protein domain/family databases are integrated into what site?

A

InterPro

31
Q

What is a motif?

A

A sequence of aa encoding a certain molecular function

Short = motif
Long = functional domain
32
Q

Short linear motifs

A

Unrelated proteins sharing a functional feature are likely to contain similar motifs
Etc

33
Q

Classification of motifs

A

Modification
Ligand
Targeting
Cleavage

34
Q

What is a regular expression?

A

A pattern that specifies which aa are allowed at each position of a motif

Used by PROSITE
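
As an illustration, a PROSITE-style pattern can be turned into a Python regular expression and used to scan a protein. This is a simplified sketch: the converter (a made-up helper) handles only the basic syntax (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`), and the example protein is invented. The pattern `N-{P}-[ST]-{P}` is the commonly cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE pattern to Python regex syntax."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # (2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")     # sequence ends
    return regex


def find_motif(pattern: str, protein: str) -> list:
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MNGSANPTK"))  # hypothetical protein
```

Because each position either matches or does not, a pattern like this cannot express that one residue is merely preferred over another.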

35
Q

BioEdit analysis for cloning

A
Nucleotide composition
Six frame translation
Determine ORF
Length of insert/DNA 
RE mapping
36
Q

Transition vs transversion

A

A transition is purine to purine or pyrimidine to pyrimidine (e.g. A to G, T to C)

A transversion is the opposite: purine to pyrimidine or pyrimidine to purine

(Twice as many transversions are possible, but transitions occur about twice as often)
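
A minimal sketch that classifies substitutions and counts them between two aligned sequences. The function names are made up for illustration; gaps and ambiguous bases are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for two different bases."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_counts(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a != b and a in "ACGT" and b in "ACGT":
            counts[classify_substitution(a, b)] += 1
    return counts


if __name__ == "__main__":
    print(ts_tv_counts("AACGT", "GACTA"))
```

Enumerating all 12 possible base changes with `classify_substitution` gives 4 transitions and 8 transversions, which is the "twice as many transversions possible" part of the card.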

37
Q

Types of sequence formats

A

FASTA
GenBank
NEXUS
PHYLIP

38
Q

Types of sequence viewers

A

SeaView
AliView
Mesquite
MEGA

39
Q

What is an open reading frame?

A

A string of in-frame codons that specifies an amino acid sequence
Starts with ATG (Met) or, less commonly, a Val codon
Ends with a stop codon

40
Q

Gene prediction software

A

GeneMark
GENSCAN
Microbial Gene Prediction System
Glimmer

41
Q

What are promoters?

A

DNA sequence involved in regulating transcription

42
Q

Types of promoters

A
  • core
  • proximal
  • distal
43
Q

Functions of promoters

A
  • integrate info about cell conditions and alter rate of transcription in response
  • different components responsible for different parts of expression pattern
44
Q

Tasks of bioinformatics

A
  • identify promoter regions
  • find TFBS and TFBS modules in a sequence
  • discover novel TFBS motifs
  • construct TFBS and their motifs
  • analysis of expression data
45
Q

How to represent TFBS motifs

A
  • consensus sequence
  • position weight matrix
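
Both representations can be derived from a set of aligned binding sites. The sketch below is a simplified illustration: the example sites are invented, the pseudocount of 1 and the uniform background of 0.25 per base are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def position_counts(sites: list) -> list:
    """Per-column base counts for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def consensus(sites: list) -> str:
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


def position_weight_matrix(sites: list, pseudocount: float = 1.0) -> list:
    """Log2-odds of each base per position vs. a uniform 0.25 background."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for counts in position_counts(sites):
        matrix.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return matrix


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # made-up sites
    print(consensus(sites))
    for row in position_weight_matrix(sites):
        print({b: round(v, 2) for b, v in row.items()})
```

The consensus keeps only the top base per column, while the matrix retains how strongly each base is preferred at each position.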

46
Q

Databases of TFBS motifs

A

Transfac

Jaspar

47
Q

What is phylogenetic footprinting?

A

Use of comparative genomics to infer functional genomic regions from conservation

48
Q

What does phylogenetic footprinting require?

A
  • comparison of correctly identified orthologous promoter regions
  • conserved function across species
  • species sufficiently diverged to reduce passive conservation
49
Q

POSSUM workflow

A
  • set of co-expressed genes
  • automated sequence retrieval from Ensembl
  • phylogenetic footprinting
  • detection of TFBS
  • statistical significance of binding sites
50
Q

What are methods of miRNA identification based on?

A
  • targets tend to be located in the 3' UTR
  • some miRNAs are complementary to the target RNA

51
Q

What is a motif?

A

A sequence of amino acids encoding a particular molecular function

52
Q

What is PROSITE?

A

A library of regular expressions describing functional sites in proteins, such as enzyme active sites

53
Q

Advantages of regular expressions

A
  • memorable to humans
  • computationally fast
  • standardized in scripting languages
  • can describe a motif very well
54
Q

Disadvantages of regular expressions

A
  • over predict
  • motif may vary in other lineages
  • do not capture weaker preferences
  • easy to make poor representation
55
Q

Example methods of identifying protein functional domains

A

Matrix/profile
Hidden Markov model
Sequence clustering