
Mogul is a software tool that integrates multiple transcription factor binding site prediction algorithms into a common framework. Mogul provides access to a number of the most widely used prediction algorithms, organising them by the type of prediction that they make:
Scanning Algorithms: These algorithms analyze intergenic sequences using matrices derived from experimentally verified transcription factor binding sites.
Ab-initio Algorithms: These algorithms scan sequences without any prior knowledge of known sites. They typically identify over- and under-represented motifs.
Co-regulated/Co-expressed: This class incorporates algorithms that analyze multiple sequences, whose genes share similar expression profiles. The algorithms use the hypothesis that the analyzed set of intergenic sequences share common binding sites.
Comparative: These algorithms seek to identify conserved blocks within intergenic regions across pairs of sequences.
Phylogenetic: Using the increasingly large number of sequenced genomes, these algorithms incorporate phylogenetic information to make their predictions.
(A complete list of the 22 incorporated algorithms is detailed in the next section.)
The different classes of algorithms are integrated in a highly configurable web interface, shown above.
Mogul is configured to perform analyses on a number of organisms: yeast, sea urchin, mouse and human. Organisms are distinguished by the species-specific matrices and background statistical data used, both derived from DNA sequence for that organism. Known binding site data is imported from a number of sources including TRANSFAC (Professional Version 7.3), JASPAR and personal communications with lab biologists.
Due to licensing agreements with the authors and companies behind the included algorithms and databases, Mogul is NOT publicly accessible. The majority of the algorithms were downloaded under the specific terms that they be used internally within an organisation. Hence, as described below, Mogul is ONLY available to researchers within the ISB.
In the near future, the Perl source code that wraps the algorithms and generates the common-format output results will be made open source. Users may then install Mogul locally and independently approach the authors of the included algorithm executables to fully configure the system.
| Algorithm | Description |
|---|---|
| Ahab | Ahab is an algorithm designed to detect enhancers and binding sites for transcription factors in the genomes of multicellular organisms. Several Ahab predictions have been successfully tested experimentally.<br>Rajewsky N, Vergassola M, Gaul U and Siggia E D, Computational detection of genomic cis regulatory modules, applied to body patterning in the early Drosophila embryo, BMC Bioinformatics, 3:30, 2002 |
| Clover | Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which, if any, of the motifs are statistically overrepresented in the sequence set.<br>Frith M C, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z, Detection of functional DNA motifs via statistical over-representation, Nucleic Acids Research, 32(4):1372-81, 2004 |
| fuzznuc | fuzznuc uses PROSITE-style patterns to search nucleotide sequences. Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence. fuzznuc selects the searching algorithm to use depending on the complexity of the search pattern specified. fuzznuc is one program in the EMBOSS suite of biological software tools distributed by the HGMP, UK. |
| MotifScanner | MotifScanner is a program that can be used to screen DNA sequences with precompiled motif models. The algorithm is based on a probabilistic sequence model in which motifs are assumed to be hidden in a noisy background sequence. The background is modelled with higher-order Markov models.<br>Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouzé P and Moreau Y, A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes, Proceedings Recomb'2001, 305-312, 2001 |
| Algorithm | Description |
|---|---|
| DREAM | DREAM uses a higher-order Markov model to analyse DNA sequences for statistically under- and over-represented patterns of binding site motifs. Predicted patterns are further refined by considering the spatial relationship of pattern clusters and reporting putative motif modules. |
| PARS | The PARS search algorithm utilises a number of complementary approaches to identify potential cis-regulatory binding sites. In combination, we are hopeful that this analysis will prove to be a powerful tool for analysing cis-regulatory regions and elucidating the topology of genetic regulatory networks. |
| R'MES | R'MES is a set of programs to detect words that appear in a given DNA sequence with an unexpected frequency. Two classes of Markov chain models are used for the sequence: either stationary Markov chains of order m (m >= 0) or Markov chains of order m with 3-periodic transition probabilities. This last class of models is particularly suited to coding DNA sequences because the reading frame (phase) is taken into account.<br>A word W appears with an unexpected frequency in a sequence if the number of occurrences N(W) of W is significantly different from an estimator of the expected count under the considered model. The significance of the difference between these two counts is assessed using a Gaussian approximation or a compound Poisson approximation (for long words) of the distribution of N(W). In each case, R'MES provides a statistic indicating whether the word is under- or over-represented. R'MES also provides a statistic that tests whether the number of clumps of W occurs with an unexpected frequency in the DNA sequence, using a Poisson approximation.<br>Schbath S, An efficient statistic to detect over- and under-represented words in DNA sequences, J. Comp. Biol., 4, 189-192, 1997 |
| | The problem of characterizing and detecting over- or under-represented words in sequences arises ubiquitously in diverse applications and has been studied rather extensively in Computational Molecular Biology. In most approaches to the detection of unusual frequencies of words in sequences, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. We take instead the global approach of annotating a suffix trie or automaton of a sequence with some such values and scores, with the objective of using it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to warrant further and more accurate scrutiny.<br>Ref: Apostolico A, Bock M E, Lonardi S, Xu X, Efficient Detection of Unusual Words, Journal of Computational Biology, vol. 7, no. 1/2, 2000 |
| YMF | YMF is a program that detects statistically over-represented words (motifs) in DNA sequences. The user may specify the characteristics of the motifs to be detected. A motif here is a short string of nucleotides, degenerate symbols, and spacers.<br>Sinha S and Tompa M, Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation, Nucleic Acids Research, vol. 30, no. 24, 5549-5560, 2002<br>Sinha S and Tompa M, A Statistical Method for Finding Transcription Factor Binding Sites, Eighth International Conference on Intelligent Systems for Molecular Biology, San Diego, CA, 344-354, 2000 |
| Algorithm | Description |
|---|---|
| AlignACE | AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences.<br>Hughes JD, Estep PW, Tavazoie S, and Church GM, Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, Journal of Molecular Biology, Mar 10;296(5):1205-14, 2000<br>Roth FR, Hughes JD, Estep PE and Church GM, Finding DNA Regulatory Motifs within Unaligned Non-Coding Sequences Clustered by Whole-Genome mRNA Quantitation, Nature Biotechnology, Oct;16(10):939-45, 1998 |
| BioProspector | BioProspector, a C program using a Gibbs sampling strategy, examines the upstream region of genes in the same gene expression pattern group and looks for regulatory sequence motifs. BioProspector uses zero to third-order Markov background models whose parameters are either given by the user or estimated from a specified sequence file. The significance of each motif found is judged based on a motif score distribution estimated by a Monte Carlo method.<br>Liu X, Brutlag DL and Liu JS, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pac. Symp. Biocomput., 127-38, 2001 |
| MDscan | To pinpoint the interaction sites down to the base pair level, we introduce a novel computational method, Motif Discovery scan (MDscan), that examines the ChIP-array selected sequences and searches for DNA sequence motifs representing the protein-DNA interaction sites. MDscan combines the advantages of two widely adopted motif search strategies, word enumeration and position-specific weight matrix updating, and incorporates the ChIP-array ranking information to accelerate the search and enhance its success rate.<br>Liu XS, Brutlag DL and Liu JS, An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments, Nat. Biotechnol., 20(8):835-9, 2002 |
| MEME | MEME may be used to discover motifs (highly conserved regions) in groups of related DNA or protein sequences.<br>Bailey TL and Elkan C, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36, AAAI Press, 1994 |
| MotifSampler | The MotifSampler tries to find over-represented motifs in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif.<br>Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y, A higher order background model improves the detection of regulatory elements by Gibbs Sampling, Bioinformatics, 17(12), 1113-1122, 2001<br>Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P and Moreau Y, A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes, Journal of Computational Biology (special issue Recomb'2001), 9(2), 447-464, 2002 |
| Motsa | Motsa is part of the Java-based "Netmotsa" motif sampling library, implementing basic and network-based motif finding.<br>To be presented at ISMB 2004. |
| Consensus | Consensus is a pattern recognition program that can be used to identify consensus patterns in a set of unaligned DNA, RNA and protein sequences. The algorithm is based on a matrix representation of a consensus pattern and the statistical significance of the corresponding alignment. |
| Algorithm | Description |
|---|---|
| Bayesaligner | Bayesaligner uses phylogenetic footprinting to make predictions of putative transcription factor binding sites.<br>Zhu J, Liu JS and Lawrence CE, Bayesian adaptive sequence alignment algorithms, Bioinformatics, 14:25-39, 1998 |
| DBA / Promoterwise | Dna Block Aligner (DBA) aligns two sequences under the assumption that the sequences share a number of colinear blocks of conservation separated by potentially large and varied lengths of DNA in the two sequences. This is a sensible approach for syntenic regions of non-coding DNA between, say, mouse and human: for example, the upstream regions of a gene from mouse and human, or the conserved intron of a human-chicken gene. The conserved blocks may be regions important for regulation of the gene.<br>Promoterwise compares two DNA sequences allowing for inversions and translocations, which makes it well suited to promoters. |
| Seqcomp | Seqcomp is a simple C program that takes two FASTA-format sequence files and performs the basic analysis that "dot-plot" DNA comparison programs do. |
| Algorithm | Description |
|---|---|
| FootPrinter | Phylogenetic footprinting is a method that identifies putative regulatory elements in DNA sequences. It identifies regions of DNA that are unusually well conserved across a set of orthologous sequences.<br>Blanchette M and Tompa M, FootPrinter: a program designed for phylogenetic footprinting, Nucleic Acids Research, vol. 31, no. 13, 3840-3842, 2003 |
| MAVID | MAVID is a multiple alignment program that is suitable for alignments of large numbers of DNA sequences. The sequences can be small mitochondrial genomes or large genomic regions up to megabases long. The MAVID server integrates MAVID with various phylogenetic tree construction programs and visualization tools to allow biomedical researchers who have a collection of related genomic sequences to rapidly identify conserved regions for further analysis.<br>Bray N and Pachter L, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research, 14:693-699, 2004 |
| AlignACE-Phylo | See description in Co-regulated table. |
| BioProspector-Phylo | See description in Co-regulated table. |
| Mdscan-Phylo | See description in Co-regulated table. |
| MEME-Phylo | See description in Co-regulated table. |
| MotifSampler-Phylo | See description in Co-regulated table. |
| Motsa-Phylo | See description in Co-regulated table. |
| Wconsensus-Phylo | See description in Co-regulated table. |
Mogul expects sequence to be input from the user in FASTA format.
FASTA files: FASTA is probably the simplest of formats for unaligned sequences. FASTA files are easily created in a text editor. Each sequence is preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence (free format). The remaining lines contain the sequence itself. You can put as many letters on a sequence line as you want. Examples are shown below:
    >sequenceOne The first example sequence.
    GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
    GATGGTATTTTAGATAGATAGAGAGAG
    >sequenceTwo The second example sequence.
    ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
    AGTCATCGATCTGCTAGCATCCTCGACTACTGG
FASTA files are conventionally named with a .fa extension.
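The FASTA layout described above can be read with a few lines of code. The following is a minimal Python sketch for illustration only; it is not Mogul's own code (Mogul's wrappers are written in Perl), and the function name `parse_fasta` is hypothetical.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping of sequence name to sequence for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The first word after '>' is the name; the rest is a free-format description.
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("header line has no sequence name")
            name = fields[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            # A sequence may span any number of lines; join them together.
            records[name] += line
    return records


example = """\
>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG
>sequenceTwo The second example sequence.
ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
AGTCATCGATCTGCTAGCATCCTCGACTACTGG
"""

records = parse_fasta(example.splitlines())
for seq_name, sequence in records.items():
    print(seq_name, len(sequence))
```

The sketch keeps only the name and the joined sequence; the description is discarded, since it is free-format text.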
Reference sequence: This is the sequence on which motifs are predicted. It must be in FASTA format and is assumed to have a .fa extension. The pipeline software reports the results of the analyses run on this reference sequence under the name of the file, not the name taken from the line beginning with > in the file itself. For example, results for a reference file GAL1.fa will be reported with the prefix GAL1.*
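The naming rule can be pictured as follows. This is a hedged Python sketch of the behaviour described, assuming the prefix is simply the file name without its .fa extension; `result_prefix` is a hypothetical helper, not part of Mogul.

```python
from pathlib import Path


def result_prefix(reference_path: str) -> str:
    """Derive the result prefix from the reference file name (assumed rule)."""
    path = Path(reference_path)
    if path.suffix != ".fa":
        raise ValueError(f"expected a .fa file, got {path.name!r}")
    # The prefix comes from the file name, not from the '>' header line inside the file.
    return path.stem


print(result_prefix("GAL1.fa"))
```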
Co-regulated sequences: This is a FASTA file that contains a list of sequences that are thought to be co-regulated along with the reference sequence. This file MUST also contain the reference sequence. For example:
    >referenceSequence
    GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
    GATGGTATTTTAGATAGATAGAGAGAG
    >sequenceOne The first co-regulated sequence.
    ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
    AGTCATCGATCTGCTAGCATCCTCGACTACTGG
    >sequenceTwo The second co-regulated sequence.
    GATCGATGCTAGGCTTAGGATATCGGATCGACGAGCTACGACG
    ATCGTACGTTTACGATCGACTACGACGACTAGC
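Before submitting a multi-sequence file, it can be worth checking that the reference sequence really is included. The sketch below is one possible check in Python, under the assumption that "contains the reference sequence" means an identical sequence string is present (the names on the > lines are not compared). The helpers `read_fasta` and `contains_reference` are hypothetical and not part of Mogul.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Minimal FASTA reader: maps each sequence name to its upper-cased sequence."""
    sequences: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            sequences[name] = ""
        elif line and name is not None:
            sequences[name] += line.upper()
    return sequences


def contains_reference(reference_text: str, group_text: str) -> bool:
    """True if the single reference sequence also appears in the multi-sequence file."""
    reference = read_fasta(reference_text)
    if len(reference) != 1:
        raise ValueError("the reference file should hold exactly one sequence")
    (reference_sequence,) = reference.values()
    # Assumption: membership is decided by sequence content, not by header name.
    return reference_sequence in read_fasta(group_text).values()


reference_file = ">GAL1\nGATGG\nATGG\n"
group_file = ">referenceSequence\nGATGGATGG\n>sequenceOne A co-regulated sequence.\nATGGA\n"
print(contains_reference(reference_file, group_file))
```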
Comparative sequences: Again this is a FASTA file that contains the sequence of an intergenic region from the same gene as the reference sequence, but from a different species.
Evolutionary sequences: Similar to the co-regulated sequence file, this is a multiple-sequence FASTA file. The difference is that the sequences contained in this file all come from the same intergenic region as the reference sequence, but from different species. This file MUST also contain the reference sequence. For example:
    >referenceSequence
    GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
    GATGGTATTTTAGATAGATAGAGAGAG
    >sequenceOne The intergenic sequence from species one.
    ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
    AGTCATCGATCTGCTAGCATCCTCGACTACTGG
    >sequenceTwo The intergenic sequence from species two.
    GATCGATGCTAGGCTTAGGATATCGGATCGACGAGCTACGACG
    ATCGTACGTTTACGATCGACTACGACGACTAGC
File Format
Each motif should be listed on a separate line in this file, in the format:
Motif_name tab Actual_Motif
You can include comments provided those lines start with #.
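A motif file in this layout could be read as sketched below. This is an illustrative Python example, not Mogul's parser; the function name `parse_motif_file` and the motif names in the sample are hypothetical, and skipping blank lines is an assumption.

```python
from typing import List, Tuple


def parse_motif_file(text: str) -> List[Tuple[str, str]]:
    """Return (motif name, motif pattern) pairs from tab-separated lines."""
    motifs: List[Tuple[str, str]] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines (assumed allowed) and comment lines are skipped
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"line {number}: expected 'Motif_name<TAB>Actual_Motif'")
        motifs.append((parts[0].strip(), parts[1].strip()))
    return motifs


sample = "# example motif file\nmotifA\tACGTAC\nmotifB\t[CG](5)TG{A}N(1,5)C\n"
motifs = parse_motif_file(sample)
for motif_name, pattern in motifs:
    print(motif_name, pattern)
```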
Defining a motif
The standard IUPAC one-letter codes for the nucleotides are used. The symbol `n' is used for a position where any nucleotide is accepted.
Ambiguities are indicated by listing the acceptable nucleotides for a given position between square brackets `[ ]'. For example: [ACG] stands for A or C or G. Ambiguities are also indicated by listing, between a pair of curly brackets `{ }', the nucleotides that are not accepted at a given position. For example: {AG} stands for any nucleotide except A and G.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.
For example, [CG](5)TG{A}N(1,5)C
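To make the pattern rules concrete, the following Python sketch translates a motif written in this syntax into a regular expression. It is an illustration under stated assumptions rather than the matcher Mogul uses: the names `pattern_to_regex`, `IUPAC` and `TOKEN` are hypothetical, ambiguity codes are expanded with the standard IUPAC nucleotide table, and hyphen separators are not accepted.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# One token: [set], {excluded set}, a single code, or a repetition (n) / (m,n).
TOKEN = re.compile(r"\[([A-Za-z]+)\]|\{([A-Za-z]+)\}|([A-Za-z])|\((\d+)(?:,(\d+))?\)")


def expand(codes: str) -> set:
    """Union of the bases represented by a run of IUPAC codes."""
    bases = set()
    for code in codes.upper():
        if code not in IUPAC:
            raise ValueError(f"unknown nucleotide code {code!r}")
        bases.update(IUPAC[code])
    return bases


def char_class(bases: set) -> str:
    ordered = "".join(sorted(bases))
    return ordered if len(ordered) == 1 else f"[{ordered}]"


def pattern_to_regex(pattern: str) -> str:
    """Translate a motif pattern into an equivalent regular expression."""
    parts = []
    position = 0
    while position < len(pattern):
        match = TOKEN.match(pattern, position)
        if match is None:
            raise ValueError(f"unexpected character {pattern[position]!r} at {position}")
        allowed, excluded, single, low, high = match.groups()
        if allowed:
            parts.append(char_class(expand(allowed)))
        elif excluded:
            remaining = set("ACGT") - expand(excluded)
            if not remaining:
                raise ValueError("exclusion leaves no acceptable nucleotide")
            parts.append(char_class(remaining))
        elif single:
            parts.append(char_class(expand(single)))
        else:
            if not parts:
                raise ValueError("repetition with nothing to repeat")
            parts.append(f"{{{low}}}" if high is None else f"{{{low},{high}}}")
        position = match.end()
    return "".join(parts)


regex = pattern_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)  # [CG]{5}TG[CGT][ACGT]{1,5}C
print(bool(re.fullmatch(regex, "CGCGCTGTAC")))  # True
```

The sketch assumes sequences are upper-cased before matching; lower-case input would need `re.IGNORECASE` or an explicit `.upper()`.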
Matrices should be defined in the following format.
>
Name=Motif_Name
Consensus=ACGT
Width=4
10space1space0space1
1space10space0space1
0space1space9space2
1space2space2space7
<
>
Name=Next_Motif
etc.
Each row or line in a matrix corresponds to a base-pair position in the motif. The columns are in the order A, C, G and T respectively. The frequency counts must be integers, not floating-point values.
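As an illustration of the matrix layout, the Python sketch below reads blocks delimited by > and < into a list of dictionaries and derives a consensus from the counts. It is not Mogul's reader; `parse_matrices` and `consensus` are hypothetical names, and the checks (four integer columns per row, Width equal to the number of rows) are assumptions drawn from the description above.

```python
from typing import Any, Dict, List


def parse_matrices(text: str) -> List[Dict[str, Any]]:
    """Parse '>' ... '<' delimited count matrices (columns in the order A C G T)."""
    matrices: List[Dict[str, Any]] = []
    current = None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line == ">":
            current = {"rows": []}
        elif current is None:
            raise ValueError(f"line {number}: content outside a '>' ... '<' block")
        elif line == "<":
            if "Width" in current and int(current["Width"]) != len(current["rows"]):
                raise ValueError(f"line {number}: Width does not match the number of rows")
            matrices.append(current)
            current = None
        elif "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
        else:
            # int() rejects floating-point text such as '1.5', enforcing integer counts.
            counts = [int(field) for field in line.split()]
            if len(counts) != 4:
                raise ValueError(f"line {number}: expected four counts (A C G T)")
            current["rows"].append(counts)
    if current is not None:
        raise ValueError("last matrix is missing its closing '<'")
    return matrices


def consensus(rows: List[List[int]]) -> str:
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join("ACGT"[row.index(max(row))] for row in rows)


sample = """\
>
Name=Motif_Name
Consensus=ACGT
Width=4
10 1 0 1
1 10 0 1
0 1 9 2
1 2 2 7
<
"""
matrices = parse_matrices(sample)
print(matrices[0]["Name"], consensus(matrices[0]["rows"]))
```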
Mogul reports results back in multiple ways:
GFF text file. This file can be viewed using the Apollo Genome Browser. A separate document detailing how to configure Apollo to correctly read Mogul predictions may be found here. Displaying results using Apollo is the best method, as it allows greater interaction with the results.
PS and PDF versions of the GFF file.
A paper on Mogul is currently under preparation. An early, brief description of the package was presented as a poster at ISMB2003.
Rust AG, Ramsey S, Robinson M and Bolouri H, Reconstructing Transcriptional Regulatory Networks Via the Integration and Optimisation of Multiple Binding Site Prediction Algorithms, Int. Conf. on Systems Biology (ISMB2003), 2003 [PDF]
Hamid Bolouri: project inception and direction.
Stephen Ramsey: development and support of Mogul codebase.
Daehee Hwang & Larissa Kamenkovich: integration of Mogul with Pointillist.
Mark Robinson & Christophe Battail: early users and testers of Mogul with yeast and sea urchin sequences.
Vesteinn Thorsson, Bin Li & Mark Gilchrist: suggested improvements from testing on mouse sequence and motifs.
Bill Longabaugh: handy HTML/web-authoring tips.
Alistair Rust 19th May 2004
Institute for Systems Biology
Seattle, WA 98103, USA
(This document was prepared using OpenOffice 1.0.1 and converted to HTML using the built-in convertor.)
Mogul Overview Document: 19th May 2004 Ver.1.0
Last updated: 2004/05/19 42:42:42
Please e-mail comments or corrections regarding this document to: MotifMogul