MotifMogul is a software tool for predicting transcription factor binding sites using experimentally verified Position Weight Matrices (PWMs). The included sets of PWMs are from the TRANSFAC Public 7.0 database and from the JASPAR database (PHYLOFACTS).
The software performs two different functions:
Algorithm Background Models
All three algorithms use sets of DNA sequences to generate a statistical background model. PWM predictions are compared against this background to select those that are likely to be significant.
MotifLocator and MotifScanner
The TRANSFAC matrices used in the scans are from the Public 7.0 release. Matrices were divided into sets based on the species they are derived from: 187 mouse matrices and 209 human matrices.
The set of stringent matrices used in the ATF3 paper is a hand-curated list of 78 matrices derived from the TRANSFAC Professional release 8.3, with a focus on mouse-specific matrices. Due to the TRANSFAC licence agreement, only matrices in the Public release can be used in analyses via a web service. The stringent list of 78 matrices was therefore checked against the Public 7.0 list, which yielded 28 mouse matrices and 27 human matrices.
Full list of included TRANSFAC Public 7.0 matrices: TRANSFACmatrices.txt.
The JASPAR matrices used in the scans are from the open-source JASPAR database. The set incorporated into the website is the 174 matrices of the PHYLOFACTS set, which is derived from the paper:
The following paragraph is the description of the matrices from the JASPAR help page:
In short, the authors used the following strategy. Promoters (defined as the 4-kb region around the TSS) of human genes from the RefSeq database were aligned against the genomes of mouse, rat and dog. Every consensus sequence of length between 6 and 26, defined over an alphabet of 4 unique (A,C,G,T) and 7 degenerate (R, Y, K, M, S, W, N) nucleotides, was scanned over the alignments. A motif is regarded as conserved when it appears in the alignment both for the human and for the other three mammalian species. The conservation rate p is defined as the number of times a motif is conserved divided by the number of times it occurs in man only. This conservation rate is compared to the expected conservation rate p0, estimated from random motifs, which gives the motif conservation score MCS. Only motifs with an MCS>6 were retained, resulting in a list of 174 highly conserved motifs (see supplementary Table S2 of Xie et al.). The count matrices for these 174 motifs were extracted from the downloaded alignments. They were further annotated according to their resemblance with TRANSFAC and JASPAR CORE motifs. For TRANSFAC, the annotation of Xie et al. was used. For comparing to the JASPAR CORE matrices, the Pearson Correlation Coefficient (PCC) was used to define matrix similarity. All PHYLOFACTS matrices were scanned against the JASPAR CORE matrices, and matrices were regarded as being similar when PCC>0.8. When multiple hits were found, only the one with the highest PCC was retained.
Full list of included JASPAR PHYLOFACTS matrices and their mapping to TRANSFAC Public 7.0: JASPARmatrices.txt.
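The PCC-based annotation step described above can be illustrated with a minimal Python sketch. It is not the JASPAR authors' code: the function names (flatten, pearson, best_match) are hypothetical, and the sketch assumes the two matrices have the same width and are compared without any offset or reverse-complement handling, which the description does not specify.

```python
import math

def flatten(matrix):
    """Flatten a position-by-nucleotide matrix into one list of numbers."""
    return [value for row in matrix for value in row]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant matrix carries no signal to correlate
    return cov / math.sqrt(var_x * var_y)

def best_match(query, candidates, cutoff=0.8):
    """Return (name, pcc) of the most similar candidate with PCC > cutoff, or None."""
    best = None
    for name, matrix in candidates.items():
        if len(matrix) != len(query):
            continue  # assumption: only equal-width matrices are compared
        pcc = pearson(flatten(query), flatten(matrix))
        if pcc > cutoff and (best is None or pcc > best[1]):
            best = (name, pcc)
    return best
```

Here each matrix is a list of rows, one row per motif position, holding the counts for A, C, G and T. Keeping only the highest-scoring candidate mirrors the rule that a single best hit is retained when several matrices pass the cutoff.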
During initial, exploratory analyses using MotifLocator, a known binding site would occasionally be missed even though a matrix labelled for that factor should have hit it. On examining these individual cases, it appeared that the matrix did not report the hit because it was too stringent at a single position, e.g. the matrix expected an A at 100%, whereas a C was observed.
In order to reduce this particular stringency, a low level of noise was added to the matrices used by MotifLocator and MotifScanner. Specifically, if the expectation of a nucleotide is 0, then 0.01 is added at that position and the other expectations are adjusted accordingly. For example,
In essence this is akin to the methodology of adding a pseudocount to the nucleotide expectations. Since Clover implements this concept, the matrices used by Clover were NOT modified prior to scanning.
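The noise adjustment can be sketched in Python as follows. The text does not say exactly how the other expectations are "adjusted accordingly", so this sketch assumes that zero entries are set to exactly 0.01 and the non-zero entries are scaled down proportionally so that the column still sums to 1. The function name add_noise is hypothetical.

```python
def add_noise(column, noise=0.01):
    """Set zero expectations to `noise` and rescale the rest so the column sums to 1.

    `column` holds the expectations for A, C, G and T at one matrix position.
    """
    zeros = sum(1 for p in column if p == 0)
    if zeros == 0:
        return list(column)  # nothing to adjust
    nonzero_total = sum(column)
    if nonzero_total == 0:
        raise ValueError("column has no non-zero expectation")
    # Probability mass left for the non-zero entries (assumed adjustment rule).
    remaining = 1.0 - noise * zeros
    return [noise if p == 0 else p * remaining / nonzero_total for p in column]

def add_noise_to_matrix(matrix, noise=0.01):
    """Apply add_noise to every position (row) of a matrix."""
    return [add_noise(column, noise) for column in matrix]
```

Under this assumption, a position that is 100% A, [1.0, 0.0, 0.0, 0.0], becomes approximately [0.97, 0.01, 0.01, 0.01], so an observed C at that position is penalized heavily but no longer rules the site out.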
To reduce the number of potential false positive predictions made by the matrix scanning algorithms, we implemented a method to select only the most stringent predictions. The composition of matrices varies greatly, from those that are very tightly defined (e.g. a position that is always 100% A) to those that are very loose (e.g. a position that can be any of the four nucleotides, and so an N), so a single global threshold value is not always appropriate. We therefore designed a scanning procedure that produces a score distribution for each individual matrix against random sequence, and selected threshold values that give us the topmost, significant quantiles. It is described in more detail below.
To assess the statistical significance of MotifLocator scores, we evaluated scores on randomized sequence. We obtained the random distribution for each binding site matrix model individually, because the distributions and thresholds are sensitive to the choice of matrix (see below). The random sequences were constructed as follows. 200 kb of sequence was sampled from the upstream regions of ~100 immune-related genes. This sequence was randomly shuffled and then split into four equal segments of 50 kb each, to be accommodated by the scanning algorithm. Scanning was then performed on both the + and - strands. The randomization procedure was repeated twice, for a total of 4 x (2 x 50) x (1 + 2) = 1200 kb of randomized sequence scored per matrix. An example distribution is given below. This distribution allows us to convert any MotifLocator score on true sequence to a p-value reflecting the probability of obtaining a score at least that high by chance. These distributions can also be used to set thresholds for filtering scanning results, for example thresholds corresponding to defined quantiles of the random distribution. Quantile values vary considerably between matrices. For example, the 0.1% topmost quantile ranges from 0.711 for TRANSFAC PAX1_B to 1.087 for TRANSFAC GATA_01.
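The randomization and thresholding steps above can be sketched in Python. This is an illustration, not the MotifMogul implementation: the function names are hypothetical, the scores are assumed to have been collected already by scanning the shuffled segments with one matrix, and the p-value is taken to be the simple fraction of random scores at or above the observed score.

```python
import random
from bisect import bisect_left

def shuffled_segments(sequence, n_segments=4, seed=None):
    """Shuffle a sequence and split it into n_segments equal pieces."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    size = len(letters) // n_segments
    return ["".join(letters[i * size:(i + 1) * size]) for i in range(n_segments)]

def empirical_p_value(score, sorted_random_scores):
    """Fraction of random scores that are greater than or equal to `score`.

    `sorted_random_scores` must be sorted in ascending order.
    """
    n = len(sorted_random_scores)
    at_least = n - bisect_left(sorted_random_scores, score)
    return at_least / n

def quantile_threshold(sorted_random_scores, top_fraction):
    """Cut-off such that about `top_fraction` of random scores lie at or above it."""
    n = len(sorted_random_scores)
    k = max(1, int(round(n * top_fraction)))
    return sorted_random_scores[n - k]
```

In use, the random scores for one matrix would be sorted once and kept, after which quantile_threshold(scores, 0.001) gives the 0.1% topmost cut-off for that matrix and empirical_p_value converts any score on true sequence to a p-value. Because the distribution is stored per matrix, each matrix gets its own threshold rather than a single global one.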
FASTA files: FASTA is probably the simplest of formats for unaligned sequences. FASTA files are easily created in a text editor. Each sequence is preceded by a line starting with >. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence (free format). The remaining lines contain the sequence itself. You can put as many letters on a sequence line as you want. Examples are shown below:
>sequenceOne The first example sequence.
GATGGATGGGCTAGATGATCGGATAGAGAGAGAGAGATTGTAG
GATGGTATTTTAGATAGATAGAGAGAG
>sequenceTwo The second example sequence.
ATGGATTGATAGATAGGCTAGCTCCGCATCAGCTACGACTCAG
AGTCATCGATCTGCTAGCATCCTCGACTACTGG
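For readers who want to check a file before uploading it, the FASTA layout described above can be read with a short Python sketch. The function name parse_fasta is hypothetical and is not part of MotifMogul. The sketch follows the rules stated above: a header line starts with >, its first word is the name, the rest is the description, and all following lines up to the next header are joined into one sequence.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span any number of lines
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records
```

Applied to the two-sequence example, this yields two records named sequenceOne and sequenceTwo, each with its description and a single sequence string joined from its two lines.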
If you have any problems using the web server, please read the FAQ first to see whether your problem is answered there.
Otherwise if you have any comments or questions regarding MotifMogul you may email:
The development of
MotifMogul is supported by a grant from the
National Institute of Allergy and Infectious Disease
(NIAID), a division of the National Institutes of