Introduction Flashcards
Week 1 Lecture 1 - Introduction
How often does DNA data double?
Every 18-24 months
How often does structure data double?
Every 6 years
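To get a feel for the difference between these two doubling times, the growth over a fixed period can be worked out as 2^(years / doubling time). The sketch below is only an illustration; the function name and the 6-year comparison period are assumptions chosen for the example.

```python
def growth_factor(years, doubling_time_years):
    """Factor by which a quantity grows if it doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)

# Compare growth over a 6-year period (an arbitrary period chosen for illustration).
dna_fast = growth_factor(6, 1.5)    # DNA data, doubling every 18 months
dna_slow = growth_factor(6, 2.0)    # DNA data, doubling every 24 months
structures = growth_factor(6, 6.0)  # structure data, doubling every 6 years

print(dna_fast, dna_slow, structures)
```

Under these assumptions, DNA data grows 8 to 16 times in the period over which structure data doubles once.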
What is the rough composition of the human genome?
3.2 Gbp, 2-5% coding, >50% repeated sequences, ~35% of genes have alternative splicing
What is bioinformatics?
The application of computers to biological problems:
- Aiding the biologist in creating, storing, and analysing biological data (mainly sequences/structures)
- Presenting it in a way biologists can use
- Applying the analysis to make predictions
What is fragment assembly?
Searching sequence fragments for overlapping regions to join them in a continuous sequence
How do we conduct fragment assembly?
- Enforce a minimum overlap size to reduce the probability of a chance match
- Fuzzy matches account for errors in sequencing
- Apply a confidence score
- More than 50% of the genome is repeated sequence, so repeats can produce ambiguous overlaps
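A minimal sketch of the overlap search for two fragments, assuming a simple suffix/prefix comparison. The function names, the default minimum overlap of 4 and the mismatch limit of 1 are all hypothetical choices for illustration, and real assemblers use far more efficient methods. The mismatch count stands in for a confidence score.

```python
def best_overlap(left, right, min_overlap=4, max_mismatches=1):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, mismatches), or None if no overlap of at
    least `min_overlap` is found within the mismatch limit.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        suffix = left[-size:]
        prefix = right[:size]
        # Fuzzy match: count differing positions instead of requiring identity.
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return size, mismatches
    return None


def join(left, right, overlap_length):
    """Join two fragments, keeping the overlapping region only once."""
    return left + right[overlap_length:]


result = best_overlap("ACGTTGCA", "TGCAGGTC")
if result is not None:
    size, mismatches = result
    print(join("ACGTTGCA", "TGCAGGTC", size), size, mismatches)
```

Raising min_overlap lowers the chance of a spurious join, while raising max_mismatches tolerates more sequencing errors at the cost of more false matches.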
What is a moving window?
Take a window of an odd number of residues (typically 7-21) and calculate the average of some property over it. Slide the window along the sequence one residue at a time, recalculating the average at each position.
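A minimal sketch of a moving window, assuming the property has already been converted to one number per residue. The function name and the input scores are hypothetical; positions too close to either end to hold a full window are simply skipped here, which is one of several possible conventions.

```python
def window_averages(values, window=7):
    """Average `values` over a sliding window centred on each position.

    `window` must be odd so that the window has a central residue.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    averages = []
    # Only centre the window where it fits entirely inside the sequence.
    for centre in range(half, len(values) - half):
        segment = values[centre - half:centre + half + 1]
        averages.append(sum(segment) / window)
    return averages


# Made-up per-residue scores, purely for illustration.
scores = [1, 2, 3, 4, 5]
print(window_averages(scores, window=3))
```

A larger window gives a smoother profile but blurs short features.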
What information can we predict from a protein sequence?
- Membrane regions
- Secondary structure
- Accessibility
- Flexibility
- Antigenicity
What is an algorithm?
A complete and precise set of steps that solves a problem and, given the same data, produces an identical result to a defined level of accuracy
What does computer programming enable us to do?
- Automation of tasks
- Manipulation of data
- Advanced analysis of data
- Tools to make predictions
What are machine learning methods?
A general class of computer software which learns from examples and is then able to make predictions
How do MLMs work?
- Train a learning method with real examples of data
- The method learns features of real examples
- Apply the trained system to make predictions
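The train-then-predict cycle can be sketched with a very simple learner. The example below is a nearest-centroid classifier, chosen only because it is short; the function names, class labels and feature values are made up for illustration.

```python
def train(examples):
    """Learn one centroid (mean feature vector) per label.

    `examples` is a list of (feature_vector, label) pairs.
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Toy training data: two hypothetical classes with two features each.
training = [
    ([1.0, 1.0], "class_a"),
    ([1.0, 2.0], "class_a"),
    ([5.0, 5.0], "class_b"),
    ([6.0, 5.0], "class_b"),
]
model = train(training)
print(predict(model, [0.0, 0.0]), predict(model, [6.0, 6.0]))
```

Here the "learned features" are just the class averages; more capable methods learn richer descriptions of the training examples but follow the same train-then-predict pattern.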
What are examples of MLMs?
- Neural networks
- Decision trees
- Naive Bayesian classifiers
- Support vector machines
What is a database?
A structured collection of data with some tool enabling it to be queried.
What is a databank?
A collection of data (normally in a simple text file) without an associated query tool. It allows you to use whatever software you like to analyse the data.
What are the types of databanks?
- Primary
- Secondary
- Composite
- Gateways
What is a primary databank? Give examples.
Raw data deposition and curation. e.g. GenBank, PDB, UniProtKB
What is a secondary databank? Give examples.
Derived data, patterns, annotations. e.g. Prosite, Pfam, CATH
What is a composite databank? Give examples.
Non-redundant sets of data derived from primary databases. e.g. OWL, NRDB
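One way to picture building a non-redundant set is to merge entries that carry the same sequence. The sketch below assumes redundancy means exact sequence identity, ignoring case; that is an assumption for illustration, not necessarily the rule any particular composite databank applies. The function name, identifiers and sequences are hypothetical.

```python
def non_redundant(records):
    """Group record identifiers by identical sequence.

    `records` is an iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to the list of
    identifiers that share it.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # treat upper and lower case as the same sequence
        merged.setdefault(key, []).append(identifier)
    return merged


# Hypothetical entries, as if drawn from two source databanks.
entries = [
    ("sourceA:001", "MKTAYIAK"),
    ("sourceB:xyz", "mktayiak"),
    ("sourceA:002", "MSLLTEVE"),
]
for sequence, identifiers in non_redundant(entries).items():
    print(sequence, identifiers)
```

Keeping every identifier alongside the merged sequence preserves the link back to each source entry.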
What is a gateway? Give examples.
Gateways give access to data. e.g. NCBI, Expasy, EBI
What is the Gene Ontology? Give examples.
A controlled vocabulary for describing the attributes of genes and gene products. e.g. molecular function, biological process, cellular component.
In what ways can you search databases?
- Text searches
- Sequence similarity searches
- Structure similarity searches
What are the different sequence alignment methods?
- Automatic pairwise
- Consensus
- Profile
- Structure prediction
What does annotation include?
- Authors
- References
- Methods
- Cross-links to other databases
- Feature tables
What are some problems with databanks?
The data might be unreliable.
- Multiple names of the same gene
- Multiple proteins with the same name
- Spelling errors
- Changes in annotations
What does bioinformatics enable us to do?
- Create data
- Make predictions
- Provide tools to store and search data
- Create 3D models
- Transfer of annotations