About
MotifMogul is a software tool for
predicting transcription factor binding sites using experimentally
verified Position Weight Matrices (PWMs). The included sets of
PWMs are from the
TRANSFAC Public 7.0 database and from the
JASPAR database (PHYLOFACTS).
The software performs two
different functions:
- It combines 3 existing matrix scanning software tools
within a single interface: Clover, MotifLocator and MotifScanner.
- It implements a method to reduce false positive predictions,
whereby 3 stringent threshold values have been pre-calculated for each
included matrix. Each stringent threshold value is derived by scanning
the matrix against random sequences and choosing the score values that
correspond to the topmost 0.1%, 0.5% and 0.01% quantiles of the
resulting score distribution. Only the MotifLocator algorithm
was used to do this. This approach was used in:
MotifMogul is part of a larger
project that includes transcription factor binding site prediction
algorithms other than matrix scanners. Work on this project
is ongoing.
Clover
is a program for identifying functional sites
in DNA sequences. If you give it a set of DNA sequences
that share a common function, it will compare them to
a library of sequence motifs (e.g. transcription factor
binding patterns), and identify which if any of the
motifs are statistically overrepresented in the sequence set.
Version: Mar 3 2004
Web link:
Clover Homepage
Reference:
- Martin C Frith, Yutao Fu, Liqun Yu, Jiang-Fan Chen, Ulla Hansen,
Zhiping Weng (2004). Detection of functional DNA motifs via
statistical over-representation. Nucleic Acids Research 32(4):1372-81.
MotifLocator
is an algorithm to find pre-defined motifs in DNA sequences
using an adapted position-weight matrix scoring scheme. Individual sites
are scored by the motif model and a higher-order background model. The
score is then computed as the normalized ratio of the motif score and
the background score.
Version: 3.1
Web link:
MotifLocator Help Page
References:
- Thijs G., Lescot M., Marchal K., Rombauts S., De Moor B., Rouze P., Moreau Y., 2001.
A higher order background model improves the detection of regulatory elements by Gibbs Sampling,
Bioinformatics, 17(12),1113-1122.
- Thijs G., Marchal K., Lescot M., Rombauts S., De Moor B., Rouze P., Moreau Y., 2002.
A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes,
Journal of Computational Biology (special issue Recomb'2001), 9(2), 447-464.
- Thijs G., Moreau Y., De Smet F., Mathys J., Lescot M., Rombauts S., Rouze P., De Moor B.,
Marchal K., 2002. INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling,
Bioinformatics, 18(2), 331-332.
MotifScanner
is a program that can be used to screen DNA sequences
with precompiled motif models. The algorithm is based on a probabilistic
sequence model in which motifs are assumed to be hidden in a noisy
background sequence. To model the background we use higher-order Markov models.
Version: 3.0
Web link:
MotifScanner Homepage
References: See MotifLocator above.
[Back to top of page]
Algorithm Parameters
[Back to top of page]
Algorithm Background Models
All 3 algorithms use sets of DNA sequences to build a
statistical background model. PWM predictions are compared
against this background to select those that are likely to be
significant.
MotifLocator and MotifScanner
Clover
- Mouse: The first 200 genes on chromosome 17 from
upstream5000.fa from UCSC_mm4 (Oct2003).
Only sequences that are the full 5 kb long are used, i.e. sequences
containing Ns were skipped. The 200 sequences give a total background
set of 1M basepairs (200 x 5,000 bp).
- Human: 200 upstream sequences are sampled randomly from
the file
upstream5000.fa from UCSC_hg17 (May2004).
Only sequences that are the full 5 kb long are used, i.e. sequences
containing Ns were skipped. The 200 sequences give a total background
set of 1M basepairs (200 x 5,000 bp).
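The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the MotifMogul background sets: the function name `select_background` and its parameters are hypothetical, and the input is assumed to be a list of upstream sequences already read from the FASTA file in file order.

```python
import random

def select_background(sequences, length=5000, count=200, randomize=False, seed=None):
    """Pick background sequences (hypothetical helper, not MotifMogul code).

    Keeps only sequences that are exactly `length` bases long and contain
    no N, then returns either the first `count` of them (as described for
    mouse) or a random sample of `count` (as described for human).
    """
    eligible = [s for s in sequences
                if len(s) == length and "N" not in s.upper()]
    if len(eligible) < count:
        raise ValueError("not enough full-length, N-free sequences")
    if randomize:
        return random.Random(seed).sample(eligible, count)
    return eligible[:count]
```

With the defaults, 200 sequences of 5,000 bp each give the 1M basepair background set mentioned above.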
[Back to top of page]
The TRANSFAC matrices used in the scans are from
the
Public 7.0 release. Matrices were divided into
sets based on the species they are derived from:
187 mouse matrices
and 209 human matrices.
The set of stringent matrices used in the
ATF3 paper is a hand-curated list of 78 matrices
derived from the TRANSFAC Professional release 8.3, with
a focus on mouse-specific matrices. Due to the
TRANSFAC licence agreement, only those matrices within
the Public release can be used in analyses via a web
service. Therefore, the stringent list of 78 matrices
was analyzed to identify those matrices present in the
Public 7.0 list. This resulted in:
28 mouse matrices
and 27 human matrices.
Full list of included TRANSFAC Public 7.0 matrices: TRANSFACmatrices.txt.
[Back to top of page]
The JASPAR matrices used in the scans are from
the open-source
JASPAR database. The set incorporated into the
website consists of the 174 matrices from the PHYLOFACTS set, which is derived from the paper:
Xie et al., Systematic discovery of regulatory motifs in human promoters and 3' UTRs by
comparison of several mammals, Nature 434, 338-345 (2005) and its supplementary material.
The following paragraph is the description of the matrices from the
JASPAR help page:
In short, the authors used the following strategy. Promoters (defined as the 4-kb region
around the TSS) of human genes from the RefSeq database were aligned against the genomes
of mouse, rat and dog. Every consensus sequence of length between 6 and 26, defined over
an alphabet of 4 unique (A,C,G,T) and 7 degenerate (R, Y, K, M, S, W, N) nucleotides, was
scanned over the alignments. A motif is regarded as conserved when it appears in the alignment
both for the human and for the other three mammalian species. The conservation rate p is
defined as the number of times a motif is conserved divided by the number of times it occurs
in man only. This conservation rate is compared to the expected conservation rate p0, estimated
from random motifs, which gives the motif conservation score MCS. Only motifs with an MCS>6
were retained, resulting in a list of 174 highly conserved motifs (see supplementary Table S2
of Xie et al.). The count matrices for these 174 motifs were extracted from the downloaded
alignments. They were further annotated according to their resemblance with TRANSFAC and
JASPAR CORE motifs. For TRANSFAC, the annotation of Xie et al. was used. For comparing to
the JASPAR CORE matrices, the Pearson Correlation Coefficient (PCC) was used to define
matrix similarity. All PHYLOFACTS matrices were scanned against the JASPAR CORE matrices,
and matrices were regarded as being similar when PCC>0.8. When multiple hits were found,
only the one with the highest PCC was retained.
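The PCC-based similarity test in the quoted description could look roughly like the sketch below. It is not the JASPAR implementation: the description does not say how matrices of different widths were aligned, so this example assumes two matrices of equal width compared position by position, and the names `matrix_pcc` and `is_similar` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / math.sqrt(var_x * var_y)

def matrix_pcc(m1, m2):
    """PCC between two matrices given as lists of [A, C, G, T] rows.

    Assumption: both matrices have the same number of positions and are
    compared without any offset or reverse-complement search.
    """
    if len(m1) != len(m2):
        raise ValueError("this sketch only handles matrices of equal width")
    flat1 = [value for row in m1 for value in row]
    flat2 = [value for row in m2 for value in row]
    return pearson(flat1, flat2)

def is_similar(m1, m2, cutoff=0.8):
    """Apply the PCC > 0.8 rule from the description above."""
    return matrix_pcc(m1, m2) > cutoff
```

When several matrices pass the cutoff, the description keeps only the hit with the highest PCC, which would be a `max` over the `matrix_pcc` values.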
Full list of included JASPAR PHYLOFACTS matrices and their mapping
to TRANSFAC Public 7.0: JASPARmatrices.txt.
[Back to top of page]
During initial, exploratory analyses using MotifLocator, a known
binding site was occasionally missed, even though a matrix for that
site was available and should have hit. On examining these individual cases,
it appeared that the matrix did not report the hit because it was too
stringent at a single position, e.g. the matrix
expected an A at 100%, whereas a C was observed.
In order to reduce this particular stringency, a low level of noise was added to the
matrices used by MotifLocator and MotifScanner. Specifically, if the expectation of a
nucleotide is 0 then 0.01 is added to that position and the other expectations are adjusted
accordingly. For example,
Before:  A     C     G     T
         1.00  0.00  0.00  0.00
After:   A     C     G     T
         0.97  0.01  0.01  0.01
In essence this is akin to the methodology of adding a pseudocount to the
nucleotide expectations. Since Clover implements this concept, the
matrices used by Clover were NOT modified prior to scanning.
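A minimal Python sketch of this adjustment for a single matrix position is shown below. The text only says that the other expectations are "adjusted accordingly", so the proportional rescaling of the non-zero entries is an assumption chosen because it reproduces the example above; the function name `add_noise` is hypothetical.

```python
def add_noise(column, noise=0.01):
    """Replace zero expectations in one [A, C, G, T] column with `noise`.

    Assumption: the remaining non-zero expectations are scaled down
    proportionally so that the column still sums to 1.
    """
    n_zero = sum(1 for p in column if p == 0)
    if n_zero == 0:
        return list(column)
    nonzero_total = sum(p for p in column if p != 0)
    if nonzero_total == 0:
        raise ValueError("column has no non-zero expectation")
    scale = (1.0 - noise * n_zero) / nonzero_total
    return [noise if p == 0 else p * scale for p in column]

# The example from the text: A at 100% becomes 0.97, the rest 0.01 each.
adjusted = add_noise([1.00, 0.00, 0.00, 0.00])
```

Applying the function to every position of a matrix gives the noise-adjusted matrix; columns without zeros are returned unchanged.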
[Back to top of page]
To reduce the number of potential false positive predictions made
by the matrix scanning algorithms, we implemented a method to select
only the most stringent predictions. The composition of matrices
can vary greatly, from those that are very tightly defined (e.g. a position
that is always 100% an A) to those that are very loose (e.g. any of the
four nucleotides, and so an N), so a single global threshold value is
not always appropriate. Therefore we designed a scanning procedure that
produced a score distribution for each individual matrix against random sequence
and selected threshold values that gave us the topmost, significant quantiles.
It is described in more detail below.
To assess the statistical significance of MotifLocator scores, we evaluated
scores on randomized sequence. We obtained the random distributions for each
binding site matrix model individually, as the distributions and thresholds
were sensitive to the choice of matrix (see below). The random sequences were
constructed as follows. 200 kb of sequence was sampled from upstream regions
of ~100 immune-related genes. This sequence was randomly shuffled and
then split into four equal segments of 50 kb each, to be accommodated by
the scanning algorithm. Scanning was then performed on both the + and - strands.
The randomization procedure was repeated twice, for a total of 4x(2x50)x(1+2)=1200
random scores per matrix. An example distribution is given below. This distribution
allows us to convert any MotifLocator score on true sequence to a p-value
reflecting the probability that the score was obtained by chance. These distributions
can also be used to set thresholds for filtering scanning results, for example
thresholds corresponding to defined quantiles of the random distribution. Quantile values
vary considerably between matrices. For example, the 0.1% topmost quantile ranges
from 0.711 for TRANSFAC PAX1_B to 1.087 for TRANSFAC GATA_01.
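The steps above (shuffle, split into segments, then read thresholds and p-values off the random score distribution) can be sketched in Python as follows. This is an illustration only: the scanning itself is done by MotifLocator and is not reproduced here, the function names are hypothetical, and the exact quantile estimator used for MotifMogul is not stated, so the simple "n-th highest score" rule below is an assumption.

```python
import random
from bisect import bisect_left

def shuffled_segments(sequence, n_segments=4, rng=None):
    """Shuffle a sequence and split it into equal segments."""
    rng = rng or random.Random()
    letters = list(sequence)
    rng.shuffle(letters)
    size = len(letters) // n_segments
    return ["".join(letters[i * size:(i + 1) * size])
            for i in range(n_segments)]

def quantile_threshold(random_scores, top_fraction):
    """Score that marks the topmost `top_fraction` of the random scores.

    Assumption: the threshold is the n-th highest random score, where
    n = round(number of scores * top_fraction), with a minimum of 1.
    """
    ordered = sorted(random_scores)
    n_top = max(1, int(round(len(ordered) * top_fraction)))
    return ordered[-n_top]

def empirical_p_value(score, sorted_random_scores):
    """Fraction of random scores that are at least as high as `score`."""
    n = len(sorted_random_scores)
    n_at_least = n - bisect_left(sorted_random_scores, score)
    return n_at_least / n
```

Given the random scores for one matrix, `quantile_threshold(scores, 0.001)` would give its 0.1% threshold, and `empirical_p_value` converts a score observed on real sequence into a p-value against the same distribution.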
[Back to top of page]
FASTA files: FASTA is probably the simplest of formats for
unaligned sequences. FASTA files are easily created in a text editor.
Each sequence is preceded by a line starting with >. The first word on
this line is the name of the sequence. The rest of the line is a description
of the sequence (free format). The remaining lines contain the sequence itself.
You can put as many letters on a sequence line as you want. Examples are
shown below:
>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG
>sequenceTwo The second example sequence.
ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
AGTCATCGATCTGCTAGCATCCTCGACTACTGG
FASTA files are conventionally named with a .fa extension.
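For readers who want to check a file before uploading it, the format rules above translate into a short Python reader. This is a minimal sketch, not the parser used by the web server; `parse_fasta` and `EXAMPLE` are names introduced here for illustration.

```python
EXAMPLE = """>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG
>sequenceTwo The second example sequence.
ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
AGTCATCGATCTGCTAGCATCCTCGACTACTGG
"""

def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence lines are concatenated, whatever their length.
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records

records = parse_fasta(EXAMPLE)
```

Each record keeps the first word of the header as the name and the rest of the line as the free-format description, as described above.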
[Back to top of page]
If you have any problems using the web server, please read the
FAQ first to see whether there
is an answer to your problem there.
Otherwise if you have any
comments or questions regarding
MotifMogul you may email:
[Back to top of page]
People
- Martin Frith (University of Queensland, Australia):
For allowing the usage of Clover.
- Reinhard Engels (Broad Institute, MIT, USA):
Lead developer of the Argo genome browser.
- Stephen Ramsey (ISB): Made the perl backbone code
more robust and reliable. Wrote and debugged the first
incarnation of the webserver.
- Vesteinn Thorsson (ISB): Implemented the stringent
scanning process.
- Martin Korb (ISB): Code and valuable advice
in getting Argo to JavaWebStart nicely.
- Bin Li (ISB): For hand-curating the
set of mouse-specific matrices.
- Ryan Pelan (ISB): Created the great looking
'skin' for the website.
- Andrew Peabody (ISB): Managed all the necessary
server issues.
- And Hamid Bolouri (ISB) for kicking off the whole project!!
- Alistair Rust (ISB): Creator and primary
developer of MotifMogul.
[Back to top of page]