

MotifMogul is a software tool for predicting transcription factor binding sites using experimentally verified Position Weight Matrices (PWMs). The included sets of PWMs are from the TRANSFAC Public 7.0 database and from the JASPAR database (PHYLOFACTS).

The software performs two different functions:

  1. It combines 3 existing matrix scanning software tools within a single interface: Clover, MotifLocator and MotifScanner.

  2. It implements a method to reduce false positive predictions, whereby 3 stringent threshold values have been pre-calculated for each included matrix. Each threshold is derived by scanning the matrix against random sequences and choosing the score values that correspond to the topmost 0.1%, 0.5% and 0.01% quantiles of the resulting score distribution. Only the MotifLocator algorithm was used to do this. This approach was used in:

MotifMogul is part of a larger project that includes transcription factor binding site prediction algorithms other than matrix scanners. Work on this project is ongoing.

Algorithm Descriptions

Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.

Version: Mar 3 2004

Web link: Clover Homepage

  • Martin C Frith, Yutao Fu, Liqun Yu, Jiang-Fan Chen, Ulla Hansen, Zhiping Weng (2004). Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Research 32(4):1372-81.

MotifLocator is an algorithm to find pre-defined motifs in DNA sequences using an adapted position-weight matrix scoring scheme. Individual sites are scored by the motif model and a higher-order background model. The score is then computed as the normalized ratio of the motif score and the background score.

Version: 3.1

Web link: MotifLocator Help Page

  • Thijs G., Lescot M., Marchal K., Rombauts S., De Moor B., Rouze P., Moreau Y., 2001. A higher order background model improves the detection of regulatory elements by Gibbs Sampling, Bioinformatics, 17(12),1113-1122.
  • Thijs G., Marchal K., Lescot M., Rombauts S., De Moor B., Rouze P., Moreau Y., 2002. A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes, Journal of Computational Biology (special issue Recomb'2001), 9(2), 447-464.
  • Thijs G., Moreau Y., De Smet F., Mathys J., Lescot M., Rombauts S., Rouze P., De Moor B., Marchal K., 2002. INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling, Bioinformatics, 18(2), 331-332.

MotifScanner is a program that can be used to screen DNA sequences with precompiled motif models. The algorithm is based on a probabilistic sequence model in which motifs are assumed to be hidden in a noisy background sequence. The background is modelled with higher-order Markov models.

Version: 3.0

Web link: MotifScanner Homepage

References: See MotifLocator above.


Algorithm Parameters


Taken from Clover Homepage


See also MotifLocator Help Page


See also MotifScanner Homepage


Algorithm Background Models

All 3 algorithms use sets of DNA sequences to build a statistical background model. PWM predictions are compared against this background, and only those that are likely to be significant relative to it are selected.

MotifLocator and MotifScanner Clover
  • Mouse: First 200 genes on chromosome 17 from upstream5000.fa from UCSC_mm4 (Oct2003). Only sequences that are fully 5K long are used i.e. sequences with Ns were skipped over. A total of 200 sequences gives a total background set of 1M basepairs.
  • Human: 200 upstream sequences are sliced randomly from the file upstream5000.fa from UCSC_hg17 (May2004). Only sequences that are fully 5K long are used i.e. sequences with Ns were skipped over. A total of 200 sequences gives a total background set of 1M basepairs.
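The selection of background sequences described above can be sketched as follows. This is a minimal illustration, not the code used by MotifMogul: the function name `select_background`, its parameters and the `(name, sequence)` record layout are assumptions made for the example. It keeps only sequences that are exactly 5 kb long and contain no N, then takes either the first 200 (as described for mouse) or a random 200 (as described for human).

```python
import random


def select_background(records, n=200, length=5000, shuffle=False, seed=0):
    """Pick up to n background sequences (hypothetical helper).

    records: list of (name, sequence) tuples, e.g. read from upstream5000.fa.
    Only sequences that are exactly `length` bases long and free of N are kept.
    With shuffle=False the first n usable sequences are returned;
    with shuffle=True a random sample of n usable sequences is returned.
    """
    usable = [
        (name, seq)
        for name, seq in records
        if len(seq) == length and "N" not in seq.upper()
    ]
    if shuffle:
        rng = random.Random(seed)  # fixed seed so the sample is reproducible
        return rng.sample(usable, min(n, len(usable)))
    return usable[:n]
```

With 200 sequences of 5,000 bases each, the resulting background set contains 200 x 5,000 = 1,000,000 bases, which matches the 1M basepairs quoted above.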

TRANSFAC Public 7.0 Matrices

The TRANSFAC matrices used in the scans are from the Public 7.0 release. Matrices were divided into sets based on the species they are derived from: 187 mouse matrices and 209 human matrices.

The set of stringent matrices used in the ATF3 paper is a hand-curated list of 78 matrices derived from the TRANSFAC Professional release 8.3, with a focus on mouse specific matrices. Due to the TRANSFAC licence agreement, only those matrices within the Public release can be used in analyses via a web service. Therefore, the stringent list of 78 matrices was analyzed to identify those matrices present in the Public 7.0 list. This resulted in: 28 mouse matrices and 27 human matrices.

Full list of included TRANSFAC Public 7.0 matrices: TRANSFACmatrices.txt.



The JASPAR matrices used in the scans are from the open-source JASPAR database. The set incorporated into the website consists of the 174 matrices of the PHYLOFACTS set, which is derived from the paper:

  • Xie et al., Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals, Nature 434, 338-345 (2005) and supplementary material
The following paragraph is the description of the matrices from the JASPAR help page:

    In short, the authors used the following strategy. Promoters (defined as the 4-kb region around the TSS) of human genes from the RefSeq database were aligned against the genomes of mouse, rat and dog. Every consensus sequence of length between 6 and 26, defined over an alphabet of 4 unique (A,C,G,T) and 7 degenerate (R, Y, K, M, S, W, N) nucleotides, was scanned over the alignments. A motif is regarded as conserved when it appears in the alignment both for the human and for the other three mammalian species. The conservation rate p is defined as the number of times a motif is conserved divided by the number of times it occurs in man only. This conservation rate is compared to the expected conservation rate p0, estimated from random motifs, which gives the motif conservation score MCS. Only motifs with an MCS>6 were retained, resulting in a list of 174 highly conserved motifs (see supplementary Table S2 of Xie et al.). The count matrices for these 174 motifs were extracted from the downloaded alignments. They were further annotated according to their resemblance with TRANSFAC and JASPAR CORE motifs. For TRANSFAC, the annotation of Xie et al. was used. For comparing to the JASPAR CORE matrices, the Pearson Correlation Coefficient (PCC) was used to define matrix similarity. All PHYLOFACTS matrices were scanned against the JASPAR CORE matrices, and matrices were regarded as being similar when PCC>0.8. When multiple hits were found, only the one with the highest PCC was retained.
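The similarity criterion quoted above (PCC>0.8) can be illustrated with a small sketch. It assumes two matrices of identical width that are compared position by position; the alignment step needed to compare matrices of different widths is not described here and is not shown. The function names `pearson` and `is_similar` are hypothetical.

```python
import math


def pearson(matrix_a, matrix_b):
    """Pearson correlation between two equal-width matrices.

    Each matrix is a list of columns, each column a list of 4 values
    in A, C, G, T order. Assumption: the matrices are already aligned.
    """
    x = [value for column in matrix_a for value in column]
    y = [value for column in matrix_b for value in column]
    if len(x) != len(y):
        raise ValueError("matrices must have the same dimensions")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a completely flat matrix carries no signal to correlate
    return covariance / (spread_x * spread_y)


def is_similar(matrix_a, matrix_b, cutoff=0.8):
    """Apply the PCC > 0.8 rule from the quoted description."""
    return pearson(matrix_a, matrix_b) > cutoff
```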

    Full list of included JASPAR PHYLOFACTS matrices and their mapping to TRANSFAC Public 7.0: JASPARmatrices.txt.


    Preprocessing of MotifLocator and MotifScanner Matrices

    During initial, exploratory analyses using MotifLocator, a known binding site would occasionally be missed even though there was a labelled matrix that should have hit. On examining these individual cases, it appeared that the matrix did not report the hit because it was too stringent at a single position, e.g. the matrix expected an A at 100%, whereas a C was observed.

    To reduce this particular stringency, a low level of noise was added to the matrices used by MotifLocator and MotifScanner. Specifically, if the expectation of a nucleotide is 0, then 0.01 is added at that position and the other expectations are adjusted accordingly. For example,

    Before:  A     C     G     T
           1.00  0.00  0.00  0.00
    After:   A     C     G     T
           0.97  0.01  0.01  0.01

    In essence this is akin to the methodology of adding a pseudocount to the nucleotide expectations. Since Clover implements this concept, the matrices used by Clover were NOT modified prior to scanning.
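The adjustment above can be sketched in a few lines. This is an illustration under one assumption: "adjusted accordingly" is read here as rescaling the non-zero expectations so that the column still sums to 1, which reproduces the Before/After example. The function name `add_noise` is hypothetical.

```python
def add_noise(column, eps=0.01):
    """Replace zero expectations in one matrix column with eps.

    column: expectations for A, C, G, T that sum to 1.
    Non-zero entries are scaled down so the column still sums to 1
    (an assumed interpretation of "adjusted accordingly").
    """
    zeros = sum(1 for p in column if p == 0)
    if zeros == 0:
        return list(column)  # nothing to adjust
    remaining = 1.0 - eps * zeros          # probability mass left for non-zero entries
    total = sum(column)                    # current mass of the non-zero entries
    return [eps if p == 0 else p * remaining / total for p in column]


# The column from the example: A=1.00, C=G=T=0.00
adjusted = add_noise([1.00, 0.00, 0.00, 0.00])
# adjusted is approximately [0.97, 0.01, 0.01, 0.01]
```

Columns without any zero entry are returned unchanged, so the noise only affects positions that would otherwise rule out a nucleotide completely.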


    Stringent MotifLocator Scans

    To reduce the number of potential false positive predictions made by the matrix scanning algorithms, we implemented a method to select only the most stringent predictions. The composition of matrices can vary greatly, from those that are very tightly defined (e.g. a position that is always 100% an A) to those that are very loose (e.g. a position that can be any of the four nucleotides, and so an N), so a single global threshold value is not always appropriate. Therefore we designed a scanning procedure that produced a score distribution for each individual matrix against random sequence, and selected threshold values that gave us the topmost, significant quantiles. It is described in more detail below.

    To assess the statistical significance of MotifLocator scores, we evaluated scores on randomized sequence. We obtained the random distributions for each binding site matrix model individually, as the distributions and thresholds were sensitive to the choice of matrix (see below). The random sequences were constructed as follows. 200 kb of sequence was sampled from upstream regions of ~100 immune-related genes. This sequence was randomly shuffled and then split into four equal segments of 50 kb each, to be accommodated by the scanning algorithm. Scanning was then performed on both the + and - strands. The randomization procedure was repeated twice, for a total of 4x(2x50)x(1+2)=1200 random scores per matrix. An example distribution is given below. This distribution allows us to convert any MotifLocator score on true sequence to a p-value reflecting the probability that the score was obtained by chance. These distributions can also be used to set thresholds for filtering scanning results, for example thresholds corresponding to defined quantiles of the random distribution. Quantile values vary considerably between matrices. For example, the 0.1% topmost quantile ranges from 0.711 for TRANSFAC PAX1_B to 1.087 for TRANSFAC GATA_01.
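As an illustration of how a per-matrix random score distribution can be turned into thresholds and p-values, here is a minimal sketch. It is not the MotifMogul implementation: the function names are hypothetical, `random_scores` stands for the scores one matrix obtained on shuffled sequence, and the +1 correction in the p-value is a common convention assumed here rather than something stated above.

```python
import math


def quantile_threshold(random_scores, top_fraction):
    """Score reached or exceeded by only the top `top_fraction` of random scores.

    top_fraction=0.001 corresponds to the topmost 0.1% quantile.
    """
    ordered = sorted(random_scores, reverse=True)
    # round() guards against floating-point fuzz such as 1.0000000000000002
    k = max(1, math.ceil(round(top_fraction * len(ordered), 9)))
    return ordered[k - 1]


def empirical_p_value(score, random_scores):
    """Estimated probability of a random score being at least `score`.

    The +1 in numerator and denominator (an assumed convention) keeps
    the estimate from ever being exactly zero.
    """
    hits = sum(1 for s in random_scores if s >= score)
    return (hits + 1) / (len(random_scores) + 1)


def passes_filter(score, random_scores, top_fraction=0.001):
    """Keep a prediction only if it reaches the matrix-specific threshold."""
    return score >= quantile_threshold(random_scores, top_fraction)
```

Because the thresholds are computed from each matrix's own distribution, a loosely defined matrix and a tightly defined one end up with different cut-off scores for the same quantile.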


    Sequence Requirements

    FASTA files: FASTA is probably the simplest of formats for unaligned sequences. FASTA files are easily created in a text editor. Each sequence is preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence (free format). The remaining lines contain the sequence itself. You can put as many letters on a sequence line as you want. Examples are shown below:

    >sequenceOne The first example sequence.
    >sequenceTwo The second example sequence.

    FASTA files are conventionally named with a .fa extension.
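The format rules above are simple enough to check with a short parser. The sketch below is illustrative only: `parse_fasta` is a hypothetical helper, and the nucleotide letters in the sample text are made up for the example (only the two header lines come from the examples above).

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)  # first word = name, rest = description
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span any number of lines
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


# Sequence letters below are invented purely for illustration.
sample = """>sequenceOne The first example sequence.
ACGTACGTAC
GTTGCA
>sequenceTwo The second example sequence.
TTGACA
"""
records = parse_fasta(sample)
```

Each record keeps the name and the free-format description separately, and sequence lines are joined regardless of how many letters each line holds.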



    If you have any problems using the web server, please read the FAQ first to see whether your problem is answered there.

    Otherwise if you have any comments or questions regarding MotifMogul you may email:



    Other incorporated software tools


    • Colin Frith (University of Queensland, Australia): For allowing the usage of Clover.
    • Reinhard Engels (Broad Institute, MIT, USA): Lead developer of the Argo genome browser.
    • Stephen Ramsey (ISB): Made the perl backbone code more robust and reliable. Wrote and debugged the first incarnation of the webserver.
    • Vesteinn Thorsson (ISB): Implemented the stringent scanning process.
    • Martin Korb (ISB): Code and valuable advice in getting Argo to JavaWebStart nicely.
    • Bin Li (ISB): For hand-curating the set of mouse specific matrices.
    • Ryan Pelan (ISB): Created the great looking 'skin' for the website.
    • Andrew Peabody (ISB): Managed all the necessary server issues.
    • And Hamid Bolouri (ISB) for kicking off the whole project!!
    • Alistair Rust (ISB): Creator and primary developer of MotifMogul.

    Change Log
