
Using Mogul to Scan DNA Sequences for Known Transcription Factor Binding Sites: A Short Tutorial
The notes on sequence retrieval are specific to mammalian genomes. Yeast is not covered.
Mogul accepts sequence data only in FASTA format. The basic file format is:
>SequenceName
agtgctagatagatag...
When searching the 2 websites below, I recommend using RefSeq mRNA identifiers, i.e. identifiers that begin with NM_. It is therefore best to translate NP_ identifiers to their NM_ equivalents first.
The University of California Santa Cruz (UCSC) Genome Bioinformatics Group provide downloads of upstream DNA regions for a number of different species. The pre-processed data provides 3 sets of FASTA files of upstream regions in the ranges 1K, 2K and 5K bases.
The documentation on the UCSC website states: “Sequences [1000/2000/5000] bases upstream of annotated transcription start of RefSeq genes. This includes only the cases where the transcription start is separately from the coding region start”.
Specific download paths are listed below:
Human DNA sequence:
1K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream1000.zip
2K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream2000.zip
5K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream5000.zip
Mouse DNA sequence:
1K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream1000.zip
2K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream2000.zip
5K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream5000.zip
Rat DNA sequence:
1K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream1000.zip
2K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream2000.zip
5K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream5000.zip
Advantages:
Simple, one-stop source for sequence data.
Disadvantages:
If specific data-sets are required, then the user needs to search for the genes of interest within the downloaded file and then manually copy and paste the sequence data.
Pre-computed sequence files do not allow for much flexibility.
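The manual copy-and-paste step can be scripted. The following is a minimal Python sketch, not part of Mogul, that pulls selected records out of an unzipped upstream file. It assumes that each header line begins with the RefSeq identifier (e.g. ">NM_..."); check this against the file you actually download. The function name and the example headers are hypothetical.

```python
import re

def extract_records(lines, wanted_ids):
    """Return {refseq_id: sequence} for records whose header starts with a wanted ID.

    Assumption: the header looks like '>NM_<digits>...'. If the same ID
    occurs more than once, the last record replaces the earlier ones.
    """
    wanted = set(wanted_ids)
    found = {}
    current = None  # ID of the record being read, or None if it is not wanted
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            match = re.match(r">\s*(NM_\d+)", line)
            current = match.group(1) if match and match.group(1) in wanted else None
            if current is not None:
                found[current] = ""
        elif current is not None:
            found[current] += line
    return found

# Hypothetical two-record file, given here as a list of lines.
example = [
    ">NM_000001_up_1000 hypothetical header",
    "acgtacgt",
    "ttga",
    ">NM_000002_up_1000 hypothetical header",
    "ggccggcc",
]
picked = extract_records(example, ["NM_000002"])
print(picked)
```

For a real file, pass an open file handle instead of the list, e.g. `extract_records(open("upstream1000.fa"), my_ids)`, where the file name is whatever the zip file unpacks to.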
EnsEMBL is a European-based, automated genome annotation project, jointly hosted by the Sanger Institute and the European Bioinformatics Institute, both in Cambridge, UK. As of April 2004, EnsEMBL provides genomic data on 11 different organisms. These include human, mouse, rat and fugu. The EnsEMBL homepage is located at: http://www.ensembl.org/
Sequence and annotation data is available via a graphical user interface wizard that wraps around a search engine called EnsMART. The following is a brief worked example for retrieving 2 mouse DNA sequences from EnsMART.
Open up the following URL in a browser – http://www.ensembl.org/Multi/martview. This will open the following page.

Select mouse (Mus musculus) from the species drop-down window. Click the Next button. This will produce the following:

(At any point during this search process, you can check the status of your search in the “Summary” panel on the right-hand side of the screen. If a particular search returns no results, warnings are given in this panel. If you are unhappy with the outcome of a particular search, you can use the Back button to move back to the previous window.)
In this example, I shall be searching for the 2 genes Sfrs7 (NM_146083) and Adam17 (NM_009615). To do this, uncheck the boxes:
“Limit to(..)” in the REGION sub-window.
“Known Genes” in the GENE sub-window.
Check the button “Limit to genes with these Ids”. Type the NM identifiers into the neighbouring window. Select “RefSeqIDs” from the drop-down menu that originally says “AFFY-UG.....” and then click “Next”.
This should produce the following window. Note that the “Summary” panel states that 2 genes have been found with the required RefSeq IDs.

This is the OUTPUT screen, which offers a large number of options for the data to export. In this case, we are interested in Sequence data, so click on the “Sequence” tab. This gives:

In this window it is possible to select DNA from different regions of a gene. The cartoon graphic of a gene illustrates the current selected sequence. To obtain the upstream regions of the 2 genes we are interested in, check the “5' upstream only” option. The size of the region can also be changed by changing the number in the “5' Flank (bp)” box, which by default is set to 1000 base pairs. Clicking the “Export” button will simply produce a multiple FASTA file in a new browser window, as shown below. A text file can be downloaded by checking the “Text, Fasta” option in the “Select the output format:” panel.

There are 2 points to make from this example.
The names of the genes in the FASTA file are based on references to the EnsEMBL internal database identifiers (i.e. ENSMUSG00...).
There are more than 2 sequences in the output, even though we were only interested in 2 genes.
There are more than 2 sequences in the output because EnsEMBL's gene predictions can include multiple transcripts for a single gene, and its search of the annotation database returns all of them. To see this, go back to the original Output screen and select “Features”.

To verify the number of transcripts that a gene has, check “Ensembl Transcript ID” and “Transcript count” in the “GENE” panel. Then click “Export” again.

This shows that the gene Sfrs7 (NM_146083) has 3 possible transcripts. To visualise these transcripts, click on the relevant record in the “Ensembl Gene ID” column, which will bring up the following page. Here you can see the 3 potential transcripts.

Advantages:
The user interface wizard provides a means to select specific sequences.
Regions other than upstream regions (e.g. intronic) can also be retrieved, which may be of use when analysing higher order species.
Disadvantages:
The identifiers associated with retrieved sequences are internal EnsEMBL references and do not include more general identifiers, such as NM_.
For genes with multiple transcripts, a decision must be made as to the desired upstream region.
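One way to work around the identifier problem is to relabel the FASTA headers yourself before uploading to Mogul. The sketch below is only an illustration: it assumes that the EnsEMBL identifier appears in the header as a token separated by whitespace or "|", and that you supply your own mapping from EnsEMBL identifiers to NM_ identifiers. The function name and the EnsEMBL identifiers shown are placeholders, not real records.

```python
def relabel_headers(lines, id_map):
    """Prefix each FASTA header with a friendlier ID taken from id_map.

    id_map maps an identifier found in the header (assumed to be a token
    separated by whitespace or '|') to the label you want to see first.
    Headers with no matching token are left unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].replace("|", " ").split()
            label = next((id_map[t] for t in tokens if t in id_map), None)
            if label is not None:
                line = f">{label} {line[1:].strip()}"
        out.append(line)
    return out

# Placeholder identifiers; substitute the ones from your own export.
records = [">ENSMUSG_EXAMPLE_A|ENSMUST_EXAMPLE_1", "acgtacgt"]
mapping = {"ENSMUSG_EXAMPLE_A": "NM_146083"}
relabelled = relabel_headers(records, mapping)
print("\n".join(relabelled))
```

Note that a gene with several transcripts will give several sequences carrying the same NM_ label, so the original identifiers are kept in the header to tell them apart.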
Mogul assumes that sequences are presented in FASTA format.
Mogul does not store any sequence data. All sequence data must be supplied by the user.
Motif files presented to Mogul should be simple text files. Mogul will NOT analyse sequences or motif files in Word (.doc) format. This is because Word files contain hidden information about the format (page size, font etc.) of the document, which is redundant for scanning. If you do have a Word document, save it as a text file before using it with Mogul. The scanning process should work for sequence/text files created under Linux and Windows.
Open up a browser and go to http://hamid-lx:8080/MogulHome.html
Scroll down to the portion of the page that is called MultiScan Mogul.
From the drop-down window, select the Species that the sequences are from. This determines a background matrix used by one of the algorithms to identify motifs.
In 'Upload sequences in FASTA Format' click on Browse and upload your FASTA file. The FASTA file should be a simple text file, which can contain just one or many DNA sequences. EACH sequence in the file should start with a name line beginning with >, as in:
>TheNameofTheFirstSequence
atgatgaatta...and the rest of the sequence
>TheNameofTheSecondSequence
atgatgaatta...and the rest of the sequence
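Before uploading, it can save time to check that the file really is plain-text FASTA. The following Python sketch is not part of Mogul; it assumes that only the letters a, c, g, t and n (in either case) are acceptable in sequence lines, which may be stricter than what Mogul itself accepts. The function name is hypothetical.

```python
def check_fasta(text):
    """Return a list of problems found in FASTA text; an empty list means none found."""
    problems = []
    seen_header = False
    allowed = set("acgtnACGTN")  # assumption: Mogul may accept more codes than this
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            seen_header = True
            if len(line) == 1:
                problems.append(f"line {number}: header has no name")
        elif not seen_header:
            problems.append(f"line {number}: sequence data before the first '>' header")
        elif set(line) - allowed:
            bad = "".join(sorted(set(line) - allowed))
            problems.append(f"line {number}: unexpected characters {bad!r}")
    return problems

good = ">TheNameofTheFirstSequence\natgatgaatta\n>TheNameofTheSecondSequence\natgatgaatta\n"
bad = "acgt\n>seq1\nac-t\n"
print(check_fasta(good))
print(check_fasta(bad))
```

A file saved from Word in .doc format would fail this check straight away, since its first bytes are not a > header.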
If no motif files are supplied to Mogul, sets of background motifs are used. These are as follows:
Yeast:
~90 text motifs collected from http://biochemie.web.med.uni-muenchen.de/Yeast_Biology/Promoters.htm and from Jennifer Smith at ISB.
31 matrices from TRANSFAC.
Mouse:
No text motifs are currently defined.
320 matrices extracted from TRANSFAC Professional v7.3 for MotifScanner. AHAB currently uses the yeast matrices.
Human:
No text motifs are currently defined.
367 matrices extracted from TRANSFAC Professional v7.3 for MotifScanner. AHAB currently uses the yeast matrices.
Each motif should be listed on a separate line in this file, in the format:
Motif_name_1 tab Actual_motif_sequence
Motif_name_2 tab Actual_motif_sequence
You may include comments provided those lines start with #.
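As an illustration of this file layout, here is a short Python sketch that reads such a file into a list of (name, motif) pairs and reports badly formed lines. It is not Mogul's own parser; the function name is hypothetical, and skipping blank lines is my assumption.

```python
def read_text_motifs(lines):
    """Parse 'name<TAB>motif' lines, skipping blank lines and # comments."""
    motifs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"line {number}: expected 'name<TAB>motif'")
        motifs.append((parts[0].strip(), parts[1].strip()))
    return motifs

example_file = [
    "# motifs of interest\n",
    "Motif_name_1\tacgtnnacgt\n",
    "Motif_name_2\tttgaca\n",
]
print(read_text_motifs(example_file))
```

A line that uses spaces instead of a tab between the name and the motif raises a ValueError giving the line number, which is the most common mistake when the file has been edited in a word processor.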
The standard one-letter codes (acgt) for the nucleotides are used. The symbol 'n' is used for a position where any nucleotide is accepted.
Ambiguities are indicated by listing the acceptable nucleotides for a given position between square brackets [ ]. For example: [ACG] stands for A or C or G.
Ambiguities are also indicated by listing between a pair of curly brackets { } the nucleotides that are not accepted at a given position. For example: {AG} stands for any nucleotide except A or G.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: N(3) corresponds to N-N-N, N(2,4) corresponds to N-N or N-N-N or N-N-N-N.
For example, the GAL4 binding site in yeast can be defined by CG[CG]N(11)CCG.
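To make the pattern syntax concrete, the Python sketch below translates a pattern written with the rules above into a regular expression and reports where it matches. This is only an illustration of the syntax, not Fuzznuc itself: it searches one strand only, allows no mismatches, and the function names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a text motif (acgt, n, [..], {..}, (n) or (n,m)) into a regex string."""
    p = pattern.upper()
    closers = {"[": "]", "{": "}", "(": ")"}
    out = []
    i = 0
    while i < len(p):
        ch = p[i]
        if ch in "ACGT":
            out.append(ch)
            i += 1
        elif ch == "N":
            out.append("[ACGT]")  # any nucleotide
            i += 1
        elif ch in closers:
            end = p.index(closers[ch], i)  # raises ValueError if the bracket is never closed
            body = p[i + 1:end]
            if ch == "[":
                out.append("[" + body + "]")  # accepted nucleotides
            elif ch == "{":
                out.append("[" + "".join(b for b in "ACGT" if b not in body) + "]")
            else:
                out.append("{" + body + "}")  # repetition count or range
            i = end + 1
        else:
            raise ValueError(f"unexpected character {ch!r} in motif")
    return "".join(out)

def find_hits(pattern, sequence):
    """Return (0-based start, matched text) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))", re.IGNORECASE)
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

gal4 = "CG[CG]N(11)CCG"
test_sequence = "ttCGG" + "a" * 11 + "CCGtt"
print(pattern_to_regex(gal4))
print(find_hits(gal4, test_sequence))
```

The lookahead `(?=...)` is used so that overlapping hits are all reported; a plain search would skip any hit that starts inside a previous one.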
Click the Browse button to the left of “Upload motifs as text sequences” and select the file containing the consensus sequences. If you click on the highlighted text that says “matrices” this will give you the guidelines above as to the required file format.
Click on the box next to the algorithm called Fuzznuc. This empty box should then display an arrow or change in some way depending on the browser being used.
Matrices should be defined using the following format.
>
Name=Motif_Name
Consensus=ACGT
Width=4
10space1space0space1
1space10space0space1
0space1space9space2
1space2space2space7
<
>
Name=Next_Motif
etc.
Each row or line in a matrix corresponds to a basepair position in the motif. The columns are in the order a c g and t respectively.
The frequency counts are integer numbers and should not be floating points.
Due to an internal constraint of the scanner AHAB, a matrix can be a MAXIMUM of 16 base pairs long. If a desired matrix is longer than this, it should either be split into 2 or have its terminal base pairs trimmed off. If AHAB is not selected, MotifScanner is able to accept matrices longer than 16 base pairs.
Another internal constraint of AHAB means that a motif name can be at most 10 characters long. If a longer name is entered in the matrix file, it is truncated to 10 characters.
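The matrix layout and the two AHAB limits can be checked with a few lines of Python before uploading. The sketch below is not Mogul's own code. It writes one matrix block from a list of integer count rows in a, c, g, t order, and takes the consensus to be the most frequent base at each position, which is my assumption (it agrees with the ACGT example above but may not be how your consensus was defined). The function names are hypothetical.

```python
AHAB_MAX_WIDTH = 16  # maximum matrix length accepted by AHAB
AHAB_MAX_NAME = 10   # AHAB truncates longer motif names

def format_matrix(name, counts):
    """Return one matrix block; counts is a list of [a, c, g, t] integer rows."""
    for position, row in enumerate(counts, start=1):
        if len(row) != 4:
            raise ValueError(f"position {position}: need 4 counts (a c g t)")
        if any(not isinstance(value, int) or value < 0 for value in row):
            raise ValueError(f"position {position}: counts must be non-negative integers")
    # Assumption: consensus = most frequent base per position (first one wins ties).
    consensus = "".join("ACGT"[row.index(max(row))] for row in counts)
    lines = [">", f"Name={name}", f"Consensus={consensus}", f"Width={len(counts)}"]
    lines += [" ".join(str(value) for value in row) for row in counts]
    lines.append("<")
    return "\n".join(lines)

def ahab_warnings(name, counts):
    """List the ways in which a matrix breaks the AHAB limits described above."""
    warnings = []
    if len(counts) > AHAB_MAX_WIDTH:
        warnings.append(f"matrix is {len(counts)} bp long; AHAB maximum is {AHAB_MAX_WIDTH}")
    if len(name) > AHAB_MAX_NAME:
        warnings.append(f"name will be truncated to {name[:AHAB_MAX_NAME]!r}")
    return warnings

block = format_matrix("Motif_Name", [[10, 1, 0, 1], [1, 10, 0, 1], [0, 1, 9, 2], [1, 2, 2, 7]])
print(block)
print(ahab_warnings("A_very_long_name", [[1, 0, 0, 0]] * 17))
```

Floating-point frequencies are rejected by the integer check, so a matrix of proportions would need to be converted to counts first.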
To analyse a TRANSFAC-like / matrix-type motif, click the Browse button to the left of “Upload motifs as matrices” and select the file containing your matrices. If you click on the highlighted text that says matrices this will give you guidelines as to the required file format.
Then choose either or both of the boxes next to the algorithms MotifScanner and AHAB. These should then display an arrow or change in some way, again depending on the browser.
Click the button labelled “Submit” to run the selected analyses.
If there are inconsistencies in the formatting of the text or matrices files entered, then an error message will be displayed in the browser window. The line with the formatting error will be displayed and must be corrected before trying to re-run the analyses.
For correctly formatted motif files, Mogul will start scanning the FASTA file containing the multiple sequences. A message will be displayed in the browser window confirming that the analyses have been started. This web page is automatically updated until the scanning process is complete.
The length of time for which the analyses run will depend on the number of sequences in the FASTA file, the length of these sequences and the number of motifs that are to be searched for. If you have many sequences and many motifs to scan for, then you will need to be patient.
When the scanning process is complete, the browser page will automatically change to the results page. At the foot of this page, a message saying “All sequences processed” will appear.
For each sequence a single message line is produced listing how many motifs were found, a link to a GFF file (that gives the coordinate information for each motif 'hit') and a link to a PDF file. If you only have a few hits then the PDF files look very clumsy and child-like. They look better with more hits!
Running large sets of motifs against a “whole genome” will be very slow. “Whole genome” searches are certainly fine to run, but I would suggest choosing only a handful of motifs for them.
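If you want to summarise the GFF files rather than read them by eye, something like the following Python sketch may help. It assumes the usual GFF layout of tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes) and, as a guess, that the motif name is in the third (feature) column; check a Mogul GFF file to see where the name really is. The function name and the example lines are hypothetical.

```python
from collections import Counter

def count_hits(gff_lines, name_column=2):
    """Count hits per motif; name_column is the 0-based column holding the motif name."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete GFF record
        counts[fields[name_column]] += 1
    return counts

# Made-up records for illustration only.
example_gff = [
    "seq1\tFuzznuc\tGAL4\t12\t28\t.\t+\t.\t\n",
    "seq1\tFuzznuc\tGAL4\t140\t156\t.\t-\t.\t\n",
    "seq1\tMotifScanner\tMotif_Name\t30\t33\t0.9\t+\t.\t\n",
]
hit_counts = count_hits(example_gff)
print(hit_counts.most_common())
```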
Running very short motifs of just 4-5 base pairs will result in large numbers of hits. To get better results, try and run with longer consensus motifs. This will generate statistically more reliable hits.
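A rough calculation shows why. For a fixed motif with no ambiguous positions, and assuming (unrealistically) that all 4 bases are equally likely and independent, the expected number of chance matches on one strand is the number of start positions multiplied by (1/4) to the power of the motif length. The Python sketch below does this arithmetic; it is a back-of-the-envelope estimate, not the statistic that any of the Mogul algorithms uses, and the function name is hypothetical.

```python
def expected_random_hits(motif_length, sequence_length, both_strands=False):
    """Expected chance matches of a fixed motif in random, equal-frequency DNA."""
    positions = max(sequence_length - motif_length + 1, 0)
    expected = positions * 0.25 ** motif_length
    return 2 * expected if both_strands else expected

for width in (4, 6, 10):
    print(width, expected_random_hits(width, 1000))
```

For a 1000 bp upstream region this gives roughly 3.9 chance matches for a 4 bp motif, but only about 0.001 for a 10 bp motif, so a hit to the longer motif is far less likely to be background noise.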
Alistair 23/04/2004
Mogul Scanning Tutorial: 21st April 2004 Ver.1.0