Bioinformatics - investigating gene regulation Flashcards

Question

What did the patterns show within the blood differentiation project?

Answer 1

That gene expression defines the cell types. The cells progressively:
- Lose stem cell character
- Gain and then lose endothelial and vascular character
- Gain blood-cell-like characteristics
- Become macrophages (immune system cells)

Answer 2

The enhancers and other regulatory elements that control cell-type-specific gene expression. Analyse the DNase I hypersensitive sites (DHS) to find the ones that are specific to particular cell types.

Answer 3

TEAD transcription factor
- We discovered that TEAD TF sequence binding patterns are found in DHS that are specific to early blood cell precursors
- This was the first indication that TEAD is involved in blood cell development
- CACATTCC: sequence common in blood cell development
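Searching DHS sequences for a binding pattern can be sketched in a few lines. This is a minimal illustration only: it assumes the CACATTCC sequence on the card is the pattern of interest, it looks for exact matches on both strands, and the function names and the toy sequence are hypothetical. Real motif searches usually score position weight matrices rather than exact strings.

```python
# Hypothetical helpers: scan a DNA sequence for an exact motif on both strands.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif="CACATTCC"):
    """Return sorted (position, strand) hits of motif in seq, 0-based."""
    seq = seq.upper()
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # avoid reporting palindromic motifs twice
        patterns.append(("-", rc))
    hits = []
    for strand, pattern in patterns:
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)
    return sorted(hits)


# Toy sequence made up for illustration, standing in for one DHS.
dhs_sequence = "AAACACATTCCTTTGGAATGTGAA"
print(find_motif(dhs_sequence))
```

Counting such hits in DHS specific to one cell type, and comparing with DHS from other cell types, is the kind of comparison the card describes.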

Answer 4

A major feat of computational engineering. It allows you to view a whole range of data:
- Genes and transcripts
- Epigenetic marks (often from ENCODE cell lines)
- Data on genetic variants in human populations
- Data on evolutionary conservation (of genome regions in related mammals)

Each horizontal data plot is called a 'track'. You can upload your own data 'tracks' and use the browser to visualize your own epigenetic data.

UCSC: University of California at Santa Cruz. European equivalent: Ensembl (ensembl.org).

Answer 5

'Next generation' DNA sequencing produces short(ish) sequences that are first mapped to the genome. Many next generation sequencing problems concern counting the number of sequences mapping to genome regions:
- ATAC-seq or DNase-seq: high numbers of sequences mapping identify open chromatin
- ChIP-seq: high numbers of sequences mapping identify protein binding sites
- RNA-seq: the number of sequences mapping to each gene increases with increasing gene expression level
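The counting step can be sketched as follows. This is a simplified illustration that assumes reads are already mapped and reduced to start coordinates on a single chromosome; the function name, region names, and coordinates are all hypothetical.

```python
from bisect import bisect_left


def count_reads(read_starts, regions):
    """Count mapped reads whose start falls in each half-open region.

    read_starts: integer start coordinates on one chromosome.
    regions: dict of name -> (start, end), with end exclusive.
    """
    ordered = sorted(read_starts)
    counts = {}
    for name, (start, end) in regions.items():
        # Reads below `end` minus reads below `start` = reads in [start, end).
        counts[name] = bisect_left(ordered, end) - bisect_left(ordered, start)
    return counts


# Made-up read positions and regions, for illustration only.
reads = [5, 12, 14, 15, 40, 41, 90]
regions = {"gene_a": (10, 20), "gene_b": (40, 50), "gene_c": (60, 80)}
print(count_reads(reads, regions))
```

The same count means different things depending on the assay: open chromatin for ATAC-seq or DNase-seq, protein binding for ChIP-seq, expression level for RNA-seq.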

Answer 6

Over the entire genome, even if your sequences were just randomly distributed (i.e. all chromatin equally open), chance would mean that some regions have more sequences mapping than others. If something happens more than you expect 'by chance', this can be evidence of some real and possibly interesting effect, for example open chromatin or protein binding.

It's important to understand what 'by chance' means. With a coin, 'by chance' means we assume the probability of heads is 0.5, so we expect 100 tosses to give close to 50 heads: with some small random variation, perhaps not exactly 50, but not 95. This is a statistical model, and is often called the NULL (no interesting effect) model.
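The coin example can be simulated directly. The sketch below repeats the 100-toss experiment many times under the null model (probability of heads 0.5); the function name, number of repeats, and seed are arbitrary choices made for illustration.

```python
import random


def simulate_heads(n_tosses=100, n_experiments=1000, p_heads=0.5, seed=0):
    """Return the number of heads in each simulated experiment."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.random() < p_heads for _ in range(n_tosses))
        for _ in range(n_experiments)
    ]


counts = simulate_heads()
mean_heads = sum(counts) / len(counts)
print("mean heads:", mean_heads)
print("smallest and largest count:", min(counts), max(counts))
```

The counts scatter around 50, and a result as extreme as 95 heads essentially never appears, which is what makes such a result evidence against the null model.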

Answer 7

We need to calculate probabilities and therefore require probability distributions. For example:
- The t and normal distributions
- The binomial distribution
- The Poisson distribution

Answer 8

The probability of a success or failure outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes. The binomial distribution, therefore, represents the probability of r successes in n trials, given a success probability p for each trial.
- p: the probability of success for each trial
- r: the number of successes in n trials
- n: the number of trials

Criteria:
- The number of observations or trials is fixed
- Each observation/trial is independent, i.e. it has no effect on the probability of the next trial
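As a worked illustration, the binomial probability of exactly r successes in n trials is C(n, r) · p^r · (1 − p)^(n − r). A minimal sketch using only the standard library (the function name is a choice made here, not from the card):

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Probability of exactly 50 heads in 100 tosses of a fair coin.
print(binomial_pmf(50, 100, 0.5))

# The probabilities over all possible outcomes add up to 1.
print(sum(binomial_pmf(r, 100, 0.5) for r in range(101)))
```

Even the single most likely outcome (50 heads) has a probability of only about 0.08, which is why tests usually ask about a range of outcomes ("r or more") rather than one exact value.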

Answer 9

This is a probability distribution that can be used to show how many times an event is likely to occur within a specified period of time (or region), given the average number of times the event occurs in that period or region.
- P: the Poisson probability that exactly r successes occur in a Poisson experiment, when the mean number of successes is E
- E: the mean number of successes that occur in a specified region
- r: the actual number of successes that occur in a specified region
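Written out, the Poisson probability is P = E^r · e^(−E) / r!. A minimal sketch with the standard library, where the function name and the example mean of 3 are illustrative assumptions:

```python
from math import exp, factorial


def poisson_pmf(r, E):
    """Probability of exactly r events when the mean number of events is E."""
    return E**r * exp(-E) / factorial(r)


# With a mean of 3 events per region, how likely are 0, 3 or 10 events?
for r in (0, 3, 10):
    print(r, poisson_pmf(r, 3.0))
```

This direct form is fine for small r; for very large r the term E**r can overflow a float, and a log-space version would be safer.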

Answer 10

These are closely related, but have different parameters:
- Binomial: number of trials = n, probability of success in a trial = p
- Poisson: just the expected number of events = E

If you set E=np, then the Poisson approximates the binomial. The approximation gets better as n gets bigger and p gets smaller.

We often use the Poisson because:
- It's easier to calculate
- Parameterization in terms of E is often convenient
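The quality of the approximation can be checked numerically. The sketch below compares the two distributions with E = np for a small n with large p and for a large n with small p; the function names and the two parameter pairs are illustrative choices, both with E = 5.

```python
from math import comb, exp, lgamma, log


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(r, E):
    """Poisson probability, computed in log space to avoid overflow."""
    return exp(r * log(E) - E - lgamma(r + 1))


def max_abs_difference(n, p):
    """Largest gap between binomial(n, p) and Poisson(E = n * p)."""
    E = n * p
    return max(
        abs(binomial_pmf(r, n, p) - poisson_pmf(r, E)) for r in range(n + 1)
    )


print("n=10,   p=0.5:  ", max_abs_difference(10, 0.5))
print("n=1000, p=0.005:", max_abs_difference(1000, 0.005))
```

The second gap is far smaller than the first, in line with the statement that the approximation improves as n grows and p shrinks.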

Answer 11

Sliding windows are genomic intervals that literally "slide" across the genome, almost always by some constant distance.
- These windows are mapped to files containing signal or annotations of interest, such as SNPs, motif binding site calls, DNaseI tags, conservation scores, etc.
- Sliding windows can overlap or be disjoint
- Overlapping windows are often used to "smooth" signal, to remove or reduce the impact of signal noise
- Example: TFs would bind in the same 'windows'
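A minimal sketch of sliding windows, assuming 0-based half-open coordinates on a single chromosome. The function names, the toy length of 10, and the positions are hypothetical; real tools work on whole genomes and annotation files.

```python
def sliding_windows(length, size, step):
    """Yield (start, end) windows of `size`, moving by `step` each time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    start = 0
    while start + size <= length:
        yield (start, start + size)
        start += step


def window_counts(positions, windows):
    """Count how many positions fall inside each half-open window."""
    return [sum(start <= pos < end for pos in positions) for start, end in windows]


overlapping = list(sliding_windows(10, size=4, step=2))  # step < size
disjoint = list(sliding_windows(10, size=4, step=4))     # step == size
print(overlapping)
print(disjoint)

# Made-up feature positions (e.g. tag starts) counted per overlapping window.
print(window_counts([1, 3, 5, 9], overlapping))
```

With a step smaller than the window size, each position contributes to more than one window, which is the smoothing effect mentioned on the card.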

Answer 12

Do two ChIP-seq experiments identify sets of binding windows that overlap more than you would expect by chance? We need to determine whether they are just overlapping by chance.

Example: transcription factors. These calculations allow us to determine whether, in genome-scale data, there is evidence that transcription factors bind 'together' to function rather than independently.
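One way to sketch this calculation is with a Poisson null model: if binding were independent, each window from the second experiment would land on a window from the first with probability (first-set windows / total windows). The function names and all numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import exp, factorial


def poisson_tail(k, E):
    """P(X >= k) for a Poisson distribution with mean E."""
    return 1.0 - sum(E**i * exp(-E) / factorial(i) for i in range(k))


def overlap_p_value(n_windows, n_first, n_second, n_overlap):
    """Expected overlap under independence, and P(overlap >= observed)."""
    expected = n_second * n_first / n_windows
    return expected, poisson_tail(n_overlap, expected)


# Hypothetical genome of 100,000 windows; 1,000 and 500 bound windows;
# 10 windows bound in both experiments.
expected, p_value = overlap_p_value(100_000, 1_000, 500, 10)
print("expected overlap by chance:", expected)
print("p-value:", p_value)
```

A small p-value says the observed overlap would be unusual if the two factors bound independently, which is the evidence for binding 'together' that the card refers to.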

Answer 13

In practice it is more complex for a range of reasons, including:
- Not all genome regions are equally mappable by short reads (particularly repetitive regions)
- The 'background' rate E may vary between chromosomes and chromosome regions
- Some technologies, e.g. ChIP-seq, require modifications and different considerations for reads on positive and negative DNA strands
- 'Paired-end' sequencing introduces further complications
- It may be necessary to account for sequencing artefacts caused by PCR, such as high rates of occurrence of duplicate sequences
- Some experiments use 'control' samples, e.g. ChIP with non-specific antibody

Answer 14

Solving this with the binomial instead of the Poisson gives p=0.11 (cf. 0.12 from the Poisson).

We have made an approximation in both the binomial and the Poisson: our analysis assumes that two blue windows could overlap the same red window. In reality they can't, but the approximation is good so long as the number of red and blue windows is much smaller than the number of windows on the genome, which is true for TF binding data.

To avoid that approximation we can solve this exactly with the hypergeometric distribution, which gives p=0.09.
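The three calculations can be put side by side. The inputs behind the card's p-values are not given here, so the sketch below uses hypothetical numbers (1,000 windows, 50 red, 40 blue, 5 overlapping) and is not expected to reproduce 0.12, 0.11 and 0.09; the function names are also choices made for illustration.

```python
from math import comb, exp, factorial


def poisson_tail(k, E):
    """P(X >= k) with mean E."""
    return 1.0 - sum(E**i * exp(-E) / factorial(i) for i in range(k))


def binomial_tail(k, n, p):
    """P(X >= k) for n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeometric_tail(k, total, marked, drawn):
    """P(X >= k) when `drawn` windows are picked without replacement
    from `total` windows, of which `marked` are red."""
    ways = sum(
        comb(marked, i) * comb(total - marked, drawn - i)
        for i in range(k, min(marked, drawn) + 1)
    )
    return ways / comb(total, drawn)


# Hypothetical inputs, not the card's worked example.
total, red, blue, overlap = 1000, 50, 40, 5
print("Poisson:       ", poisson_tail(overlap, blue * red / total))
print("binomial:      ", binomial_tail(overlap, blue, red / total))
print("hypergeometric:", hypergeometric_tail(overlap, total, red, blue))
```

The hypergeometric version draws blue windows without replacement, so two blue windows can never land on the same red window; the other two are approximations that stay close when red and blue windows are a small fraction of the total.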

(38 cards)