Using Mogul to Scan DNA Sequences for Known Transcription Factor Binding Sites: A Short Tutorial

1 Sequence Data Sources

1.1 Basic Assumptions



1.2 UCSC

The University of California Santa Cruz (UCSC) Genome Bioinformatics Group provides downloads of upstream DNA regions for a number of different species. The pre-processed data consist of three sets of FASTA files containing upstream regions of 1K, 2K and 5K bases.

The documentation on the UCSC website states: “Sequences [1000/2000/5000] bases upstream of annotated transcription start of RefSeq genes. This includes only the cases where the transcription start is annotated separately from the coding region start”.
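Before scanning, it can be useful to check that a downloaded upstream file parses cleanly and that the sequences have the expected lengths. The following is a minimal Python sketch, assuming standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The function names `read_fasta` and `length_summary` and the gene names in the example are hypothetical and are not part of Mogul or the UCSC downloads.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Assumes each record starts with a '>' header line; any sequence
    text appearing before the first header is ignored.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_summary(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA header to the length of its sequence."""
    return {header: len(seq) for header, seq in read_fasta(lines)}


if __name__ == "__main__":
    # Hypothetical records standing in for an upstream-region file.
    example = [
        ">hypothetical_gene_1",
        "atgatg",
        "aatta",
        ">hypothetical_gene_2",
        "ccgg",
    ]
    for name, length in length_summary(example).items():
        print(name, length)
```

An open file handle can be passed in place of the list, for example `length_summary(open(path))`, where `path` points to an uncompressed FASTA file.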

Specific download paths are listed below:

1K -

2K -

5K -

1K -

2K -

5K -

1K -

2K -

5K -



1.3 EnsEMBL

EnsEMBL is a European automated genome annotation project, jointly hosted by the Sanger Institute and the European Bioinformatics Institute, both in Cambridge, UK. As of April 2004, EnsEMBL provides genomic data on 11 different organisms, including human, mouse, rat and fugu. The EnsEMBL homepage is located at:

Sequence and annotation data are available via a graphical wizard that wraps a search engine called EnsMART. The following is a brief worked example of retrieving two mouse DNA sequences from EnsMART.



2 Scanning Multiple Sequences for Known TFBS

2.1 Rules of the Mogul Road

2.2 Getting Started


atgatgaatta...and the rest of the sequence


atgatgaatta...and the rest of the sequence

2.3 Default Running of the Scanners

If no motif files are supplied to Mogul, sets of background motifs are used. These are as follows:

2.4 Creating a File of Text Motifs

Motif_name_1 tab Actual_motif_sequence

Motif_name_2 tab Actual_motif_sequence
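A file in this layout can be typed by hand, but generating it from a script avoids stray spaces where a tab is required. The following is a minimal Python sketch, assuming only what is stated above: one motif per line, with the name and the sequence separated by a single tab. The function names `write_motif_file` and `read_motif_file`, the file name and the example motifs are hypothetical, and any further requirements Mogul places on the file are not covered here.

```python
import os
import tempfile
from typing import List, Tuple

Motif = Tuple[str, str]  # (motif name, motif sequence)


def write_motif_file(motifs: List[Motif], path: str) -> None:
    """Write one 'name<TAB>sequence' line per motif."""
    with open(path, "w", newline="") as handle:
        for name, sequence in motifs:
            if "\t" in name or "\t" in sequence:
                raise ValueError(f"tab character not allowed in motif: {name!r}")
            handle.write(f"{name}\t{sequence}\n")


def read_motif_file(path: str) -> List[Motif]:
    """Read the file back, checking that every line has exactly two fields."""
    motifs = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # tolerate blank lines between motifs
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"line {line_number}: expected 2 tab-separated fields, "
                    f"found {len(fields)}"
                )
            motifs.append((fields[0], fields[1]))
    return motifs


if __name__ == "__main__":
    # Illustrative motifs only; substitute the motifs of interest.
    example = [("Example_motif_1", "TATAAA"), ("Example_motif_2", "CACGTG")]
    path = os.path.join(tempfile.mkdtemp(), "text_motifs.txt")
    write_motif_file(example, path)
    print(read_motif_file(path))
```

Reading the file back after writing it is a cheap way to catch lines where the tab separator was lost or duplicated.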

2.5 Preparing to Scan with Text Motifs

2.6 Creating a File of Matrix Motifs













2.7 Preparing to Scan with Matrix Motifs

2.8 Running the Scanners

2.9 Guidelines for Running Searches

Alistair 23/04/2004

Mogul Scanning Tutorial: 21st April 2004 Ver.1.0