
Mogul Home Page

Institute for Systems Biology


1 About






1.1 Description

Mogul is a software tool that integrates multiple transcription factor binding site prediction algorithms into a common framework. Mogul provides access to a number of the most widely used prediction algorithms, organising them by the type of prediction they make: scanning, ab-initio, co-regulated / co-expressed, comparative and phylogenetic.

(A complete list of the 22 incorporated algorithms is detailed in the next section.)

The different classes of algorithms are integrated in a highly configurable web interface, shown above.



Mogul is configured to perform analyses on a number of organisms: yeast, sea urchin, mouse and human. The organisms are distinguished by the species-specific matrices and background statistical data used, which are derived from DNA sequence for that organism. Known binding site data is imported from a number of sources, including TRANSFAC (Professional Version 7.3), JASPAR and personal communications with lab biologists.



1.2 Links

1.3 Access to Mogul

Due to licensing agreements with the authors and vendors of the included algorithms and databases, Mogul is NOT publicly accessible. The majority of the algorithms were downloaded under the specific terms that they are to be used internally within an organisation. Hence, as described below, Mogul is ONLY available to researchers within the ISB.



In the near future, the Perl source code that wraps the algorithms and generates the common-format output results will be made open source. Users may then install Mogul locally and independently approach the authors of the included algorithm executables to fully configure the system.



2 Incorporated Algorithms

2.1 Scanning Algorithms

Algorithm

Description

Ahab

Ahab is an algorithm which was designed to detect enhancers and binding sites for transcription factors in the genomes of multicellular organisms. Several Ahab predictions have been successfully tested experimentally.



Rajewsky N, Vergassola M, Gaul U and Siggia E D, Computational detection of genomic cis regulatory modules, applied to body patterning in the early Drosophila embryo, BMC Bioinformatics, 3:30, 2002

Clover

Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.



Frith M C, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z, Detection of functional DNA motifs via statistical over-representation, Nucleic Acids Research, 32(4):1372-81, 2004

fuzznuc

fuzznuc uses PROSITE style patterns to search nucleotide sequences. Patterns are specifications of a (typically short) length of sequence to be found. They can specify a search for an exact sequence or they can allow various ambiguities, matches to variable lengths of sequence and repeated subsections of the sequence. fuzznuc intelligently selects the optimum searching algorithm to use, depending on the complexity of the search pattern specified.

fuzznuc is one program in the EMBOSS suite of biological software tools distributed by the HGMP, UK.

MotifScanner

MotifScanner is a program that can be used to screen DNA sequences with precompiled motif models. The algorithm is based on a probabilistic sequence model in which motifs are assumed to be hidden in a noisy background sequence. To model the background we use higher-order Markov models.



Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouzé P and Moreau Y, A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes, Proceedings Recomb'2001, 305-312, 2001
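As a rough illustration of scanning a sequence with a precompiled motif model, the sketch below scores each window with a log-odds matrix. It is not MotifScanner's implementation: it assumes a zero-order (single-nucleotide) background rather than the higher-order Markov model described above, and the function names, pseudocount and threshold are hypothetical choices made for the example.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background, pseudocount=1.0):
    """Convert per-position A/C/G/T counts into log2 odds against a background.

    Assumption: a zero-order background (one probability per base), which is
    simpler than the higher-order Markov background used by MotifScanner.
    """
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        matrix.append({
            base: math.log2(((count + pseudocount) / total) / background[base])
            for base, count in zip(BASES, row)
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold.

    The sequence is expected to contain only A, C, G and T.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[k][sequence[start + k]] for k in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical four-position motif with consensus ACGT.
counts = [[10, 1, 0, 1], [1, 10, 0, 1], [0, 1, 9, 2], [1, 2, 2, 7]]
uniform = {base: 0.25 for base in BASES}
hits = scan("TTACGTTT", log_odds_matrix(counts, uniform), threshold=4.0)
```

Here the only window that clears the threshold starts at offset 2, where the sequence reads ACGT.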







2.2 Ab-initio Algorithms

Algorithm

Description

DREAM

DREAM uses a higher-order Markov model to analyse DNA sequences for statistically under- and over-represented patterns of binding site motifs. Predicted patterns are further refined by considering the spatial relationship of pattern clusters and reporting putative motif modules.

PARS

The PARS search algorithm utilises a number of complementary approaches to identify potential cis-regulatory binding sites. In combination we are hopeful that this analysis will prove to be a powerful tool for analysing cis-regulatory regions and elucidating the topology of genetic regulatory networks.

RMES Gaussian / Poisson

R'MES is a set of programs to detect words that appear in a given DNA sequence with an unexpected frequency. Two classes of Markov chain models are used for the sequence: either stationary Markov chains of order m (m >= 0) or Markov chains of order m with 3-periodic transition probabilities. This last class of models is particularly adapted for coding DNA sequences because the reading frame (phase) is taken into account.



A word W appears with an unexpected frequency in a sequence if the number of occurrences N(W) of W is significantly different from an estimator of the expected count under the considered model. The significance of the difference between these two counts is assessed using a Gaussian approximation or a compound Poisson approximation (for long words) of the distribution of N(W). In each case, R'MES provides a statistic indicating whether the word is under- or over-represented. R'MES also provides a statistic that tests whether the number of clumps of W occurs with an unexpected frequency in the DNA sequence, using a Poisson approximation.



Schbath S, An efficient statistic to detect over- and under-represented words in DNA sequences, J. Comp. Biol., 4, 189-192, 1997
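The idea of comparing an observed word count N(W) with its expected count can be sketched as follows. This is a deliberately simplified illustration, not the R'MES statistic: it assumes an order-0 model (independent bases with frequencies estimated from the sequence itself) and a binomial variance that ignores correlations between overlapping occurrences. The function names are hypothetical.

```python
import math
from collections import Counter

def count_occurrences(sequence, word):
    """Count occurrences of word in sequence, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == word)

def word_z_score(sequence, word):
    """Return (observed, expected, z) for a word under an order-0 model.

    Simplifying assumptions: bases are independent, and the variance is the
    binomial variance, so self-overlap of the word is not accounted for.
    """
    n, k = len(sequence), len(word)
    freqs = Counter(sequence)
    p = 1.0
    for base in word:
        p *= freqs[base] / n          # probability of the word at one position
    expected = (n - k + 1) * p
    variance = expected * (1.0 - p)
    if variance <= 0:
        raise ValueError("word has probability 0 or 1 under the background model")
    observed = count_occurrences(sequence, word)
    return observed, expected, (observed - expected) / math.sqrt(variance)

observed, expected, z = word_z_score("ACGT" * 25, "ACGT")
```

A large positive z suggests over-representation and a large negative z under-representation; a Markov model of order m, as used by R'MES, would additionally condition the expected count on shorter sub-words.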

Verbumculus

The problem of characterizing and detecting over- or under-represented words in sequences arises ubiquitously in diverse applications and has been studied rather extensively in Computational Molecular Biology. In most approaches to the detection of unusual frequencies of words in sequences, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof.

We take instead the global approach of annotating a suffix trie or automaton of a sequence with some such values and scores, with the objective of using it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to warrant further and more accurate scrutiny.



Apostolico A, Bock M E, Lonardi S, Xu X, Efficient Detection of Unusual Words, Journal of Computational Biology, vol. 7, no. 1/2, 2000

YMF

YMF is a program that detects statistically over-represented words (motifs) in DNA sequences. The user may specify the characteristics of the motifs to be detected. A motif here is a short string of nucleotides, degenerate symbols, and spacers.



Sinha S and Tompa M, Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation, Nucleic Acids Research, vol. 30, no. 24, 5549-5560, 2002

Sinha S and Tompa M, A Statistical Method for Finding Transcription Factor Binding Sites, Eighth International Conference on Intelligent Systems for Molecular Biology, San Diego, CA, 344-354, 2000





2.3 Co-regulated / Co-expressed Algorithms

Algorithm

Description

AlignACE

AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences.



Hughes JD, Estep PW, Tavazoie S, and Church GM, Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, Journal of Molecular Biology, Mar 10;296(5):1205-14, 2000



Roth FP, Hughes JD, Estep PW and Church GM, Finding DNA Regulatory Motifs within Unaligned Non-Coding Sequences Clustered by Whole-Genome mRNA Quantitation, Nature Biotechnology, Oct;16(10):939-45, 1998

BioProspector

BioProspector, a C program using a Gibbs sampling strategy, examines the upstream region of genes in the same gene expression pattern group and looks for regulatory sequence motifs. BioProspector uses zero to third-order Markov background models whose parameters are either given by the user or estimated from a specified sequence file. The significance of each motif found is judged based on a motif score distribution estimated by a Monte Carlo method.



Liu X, Brutlag DL and Liu JS, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pac. Symp. Biocomput., 127-38, 2001

MDscan

To pinpoint the interaction sites down to the base pair level, we introduce a novel computational method, Motif Discovery scan (MDscan), that examines the ChIP-array selected sequences and searches for DNA sequence motifs representing the protein-DNA interaction sites. MDscan combines the advantages of two widely adopted motif search strategies, word enumeration and position-specific weight matrix updating, and incorporates the ChIP-array ranking information to accelerate the search and enhance its success rate.



Liu XS, Brutlag DL and Liu JS, An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments, Nat. Biotechnol., 20(8):835-9, 2002

MEME

MEME may be used to discover motifs (highly conserved regions) in groups of related DNA or protein sequences.



Bailey TL and Elkan C, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36, AAAI Press, 1994

MotifSampler

The MotifSampler tries to find over-represented motifs in the upstream region of a set of co-regulated genes. This motif finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif.



Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P and Moreau Y, A higher order background model improves the detection of regulatory elements by Gibbs Sampling, Bioinformatics, 17(12), 1113-1122, 2001

Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P and Moreau Y, A Gibbs Sampling method to detect over-represented motifs in upstream regions of coexpressed genes, Journal of Computational Biology (special issue Recomb'2001), 9(2), 447-464, 2002

Motsa

Motsa is part of the Java-based "Netmotsa" motif sampling library, implementing basic and network-based motif finding.



To be presented at ISMB 2004.

WConsensus

Consensus is a pattern recognition program that can be used to identify consensus patterns in a set of unaligned DNA, RNA and protein sequences. The algorithm is based on a matrix representation of a consensus pattern and the statistical significance of the corresponding alignment.




2.4 Comparative Algorithms

Algorithm

Description

Bayesaligner

Bayesaligner uses phylogenetic footprinting to make predictions of putative transcription factor binding sites.



Zhu, J, Liu JS and Lawrence CE, Bayesian adaptive sequence alignment algorithms, Bioinformatics, 14:25-39, 1998

Promoterwise / DBA

DNA Block Aligner (DBA) aligns two sequences under the assumption that the sequences share a number of colinear blocks of conservation separated by potentially large and varied lengths of DNA in the two sequences. The approach is intended for syntenic regions of non-coding DNA between, say, mouse and human: for example, the upstream regions of a gene from mouse and human, or the conserved intron of a human-chicken gene. The conserved blocks may be regions important for regulation of the gene.



Promoterwise compares two DNA sequences allowing for inversions and translocations, ideal for promoters.

seqcomp

Seqcomp is a simple C program that takes two FASTA-format sequence files and performs the basic analysis that "dot-plot" DNA comparison programs do.
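For readers unfamiliar with dot-plot comparison, the basic analysis can be sketched as listing every pair of positions at which the two sequences share an identical short word. The code below is an illustrative sketch only and says nothing about how seqcomp itself is implemented; the function name and the default word size are assumptions.

```python
def dot_plot_matches(seq_a, seq_b, word_size=3):
    """Return (i, j) pairs where seq_a[i:i+word_size] == seq_b[j:j+word_size].

    Each pair corresponds to one dot in a dot-plot; runs of dots along a
    diagonal indicate a conserved stretch shared by the two sequences.
    """
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - word_size + 1):
        index.setdefault(seq_b[j:j + word_size], []).append(j)

    matches = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            matches.append((i, j))
    return matches

matches = dot_plot_matches("ACGTAC", "TACGT", word_size=3)
```

In this example the pairs (0, 1) and (1, 2) lie on the same diagonal, reflecting the shared stretch ACGT.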



2.5 Phylogenetic Algorithms

Algorithm

Description

FootPrinter

Phylogenetic footprinting is a method that identifies putative regulatory elements in DNA sequences. It identifies regions of DNA that are unusually well conserved across a set of orthologous sequences.



Blanchette M and Tompa M, FootPrinter: a program designed for phylogenetic footprinting, Nucleic Acids Research, vol. 31, no. 13, 3840-3842, 2003

MAVID


MAVID is a multiple alignment program that is suitable for alignments of large numbers of DNA sequences. The sequences can be small mitochondrial genomes or large genomic regions up to megabases long. The MAVID server integrates MAVID with various phylogenetic tree construction programs and visualization tools to allow biomedical researchers who have a collection of related genomic sequences to rapidly identify conserved regions for further analysis.



Bray N and Pachter L, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research, 14:693-699, 2004

AlignACE

AlignACE-Phylo: see description in Co-regulated table.

BioProspector

BioProspector-Phylo: see description in Co-regulated table.

MDscan

MDscan-Phylo: see description in Co-regulated table.

MEME

MEME-Phylo: see description in Co-regulated table.

MotifSampler

MotifSampler-Phylo: see description in Co-regulated table.

Motsa

Motsa-Phylo: see description in Co-regulated table.

WConsensus

WConsensus-Phylo: see description in Co-regulated table.



3 Usage Guide

3.1 Sequence Requirements

Mogul expects users to supply sequences in FASTA format.
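As a reminder of what FASTA input looks like, the following minimal sketch reads FASTA text into (header, sequence) pairs. It is an illustration of the file format only, not part of Mogul; the function name and the example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    A record starts with a '>' header line; the following lines, up to the
    next header, are joined to form the sequence.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>gene1 upstream region
ACGTACGTAC
GGTTAACC
>gene2 upstream region
ttgacagct
"""
records = parse_fasta(example)
```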











3.2 Defining Motifs for the Scanning Algorithms

Each motif should be listed on a separate line in this file, in the format:

Motif_name tab Actual_Motif

You can include comments provided those lines start with #.
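The file layout described above (one tab-separated name and motif per line, with # comment lines) can be read with a few lines of code. This is a minimal sketch under that description only; the function name and the example motifs are hypothetical.

```python
def parse_motif_list(text):
    """Parse 'Motif_name<TAB>Actual_Motif' lines, skipping blanks and # comments."""
    motifs = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        name, tab, motif = line.partition("\t")
        if not tab or not name.strip() or not motif.strip():
            raise ValueError(f"line {number}: expected 'name<TAB>motif'")
        motifs.append((name.strip(), motif.strip()))
    return motifs

# Hypothetical motif file contents.
example = "# example motifs\nGATA_like\t[AT]GATA[AG]\nTATA_box\tTATAAA\n"
motifs = parse_motif_list(example)
```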

The standard IUPAC one-letter codes for the nucleotides are used. The symbol `n' is used for a position where any nucleotide is accepted.

Ambiguities are indicated by listing the acceptable nucleotides for a given position between square brackets `[ ]'. For example: [ACG] stands for A or C or G. Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the nucleotides that are not accepted at a given position. For example: {AG} stands for any nucleotide except A and G.

Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.

For example, [CG](5)TG{A}N(1,5)C
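To make the pattern syntax concrete, the sketch below translates a pattern of this kind into a Python regular expression. It is an illustration of the rules above, not the code used by fuzznuc or Mogul, and it assumes that only A, C, G and T appear inside `[ ]' and `{ }'. The function name is hypothetical.

```python
import re

# Standard IUPAC one-letter nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def pattern_to_regex(pattern):
    """Translate a motif pattern such as [CG](5)TG{A}N(1,5)C into a regex string."""
    pattern = pattern.upper()
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                      # accepted nucleotides
            j = pattern.index("]", i)
            parts.append("[" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":                    # rejected nucleotides
            j = pattern.index("}", i)
            parts.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                    # repetition of the previous element
            if not parts:
                raise ValueError("repetition with no preceding element")
            j = pattern.index(")", i)
            parts.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch in IUPAC:                  # single nucleotide or ambiguity code
            parts.append("[" + IUPAC[ch] + "]")
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} in pattern")
    return "".join(parts)

regex = re.compile(pattern_to_regex("[CG](5)TG{A}N(1,5)C"))
```

With this translation, `regex.fullmatch("CGCGCTGTAAC")` succeeds, whereas `regex.fullmatch("CGCGCTGAAAC")` fails because the position after TG is an A.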

Matrices should be defined in the following format.

>

Name=Motif_Name

Consensus=ACGT

Width=4

10space1space0space1

1space10space0space1

0space1space9space2

1space2space2space7

<

>

Name=Next_Motif

etc.

Each row or line in a matrix corresponds to a base-pair position in the motif. The columns are in the order A, C, G and T respectively. The frequency counts must be integers, not floating-point numbers.
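The matrix layout above can be checked with a short parser. The sketch below is an illustration based solely on the format as described here: it assumes that `space' in the example rows denotes a single space character, and the function name and dictionary layout are hypothetical.

```python
def parse_matrices(text):
    """Parse '>' ... '<' delimited matrix blocks into a list of dictionaries.

    Each dictionary holds the Key=Value header fields plus a 'rows' list with
    one [A, C, G, T] list of integer counts per motif position.
    """
    matrices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == ">":
            current = {"rows": []}
        elif current is None:
            raise ValueError("content found outside a '>' ... '<' block")
        elif line == "<":
            width = int(current.get("Width", len(current["rows"])))
            if width != len(current["rows"]):
                raise ValueError("Width does not match the number of rows")
            matrices.append(current)
            current = None
        elif "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
        else:
            counts = [int(token) for token in line.split()]  # rejects floats
            if len(counts) != 4:
                raise ValueError("each row needs four counts (A C G T)")
            current["rows"].append(counts)
    return matrices

example = """>
Name=Motif_Name
Consensus=ACGT
Width=4
10 1 0 1
1 10 0 1
0 1 9 2
1 2 2 7
<
"""
matrices = parse_matrices(example)
```

Because `int()` is applied to each token, a floating-point count such as 1.5 raises a ValueError, which matches the integer-only requirement.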



4 Visualising Prediction Results



Mogul reports results back in multiple ways:



5 References



A paper on Mogul is currently in preparation. An early, brief description of the package was presented as a poster at ISMB2003.



Rust AG, Ramsey S, Robinson M and Bolouri H, Reconstructing Transcriptional Regulatory Networks Via the Integration and Optimisation of Multiple Binding Site Prediction Algorithms, Int. Conf. on Systems Biology (ISMB2003), 2003



6 Credits





Alistair Rust 19th May 2004

Institute for Systems Biology

Seattle, WA 98103, USA



(This document was prepared using OpenOffice 1.0.1 and converted to HTML using the built-in convertor.)

Mogul Overview Document: 19th May 2004 Ver.1.0


Last updated: 2004/05/19

Please e-mail comments or corrections regarding this document to: MotifMogul