Using Mogul to Scan DNA Sequences for Known Transcription Factor Binding Sites: A Short Tutorial


1 Sequence Data Sources

1.1 Basic Assumptions

>SequenceName

agtgctagatagatag...





1.2 UCSC

The University of California Santa Cruz (UCSC) Genome Bioinformatics Group provides downloads of upstream DNA regions for a number of species. The pre-processed data comprise three sets of FASTA files, covering the regions 1,000, 2,000 and 5,000 bases upstream of each gene (referred to below as 1K, 2K and 5K).

The documentation on the UCSC website states: “Sequences [1000/2000/5000] bases upstream of annotated transcription start of RefSeq genes. This includes only the cases where the transcription start is [annotated] separately from the coding region start”.

Specific download paths are listed below:

Human (hg16 assembly):

1K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream1000.zip

2K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream2000.zip

5K - http://genome.ucsc.edu/goldenPath/hg16/bigZips/upstream5000.zip



Mouse (mm4 assembly):

1K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream1000.zip

2K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream2000.zip

5K - http://genome.ucsc.edu/goldenPath/mm4/bigZips/upstream5000.zip



Rat (rnJun2003 assembly):

1K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream1000.zip

2K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream2000.zip

5K - http://genome.ucsc.edu/goldenPath/rnJun2003/bigZips/upstream5000.zip
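
The download paths above all follow one pattern: an assembly name followed by an upstream region size. The short Python sketch below builds the same paths from those two values. The species labels used as dictionary keys, and the function name upstream_url, are illustrative choices made here and are not part of Mogul or of the UCSC site. The sketch only assembles the path strings; it does not check that the files are still hosted at these addresses.

```python
# Assembly names and region sizes are copied from the download paths listed above.
# The species keys are labels chosen for this sketch.
ASSEMBLIES = {"human": "hg16", "mouse": "mm4", "rat": "rnJun2003"}
SIZES = (1000, 2000, 5000)


def upstream_url(species, size):
    """Return the UCSC download path for one species and upstream region size."""
    if species not in ASSEMBLIES:
        raise ValueError(f"unknown species: {species!r}")
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}")
    return (f"http://genome.ucsc.edu/goldenPath/{ASSEMBLIES[species]}"
            f"/bigZips/upstream{size}.zip")


if __name__ == "__main__":
    # List all nine paths, one per species and region size.
    for species in ASSEMBLIES:
        for size in SIZES:
            print(species, size, upstream_url(species, size))
```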





Advantages:



Disadvantages:



1.3 EnsEMBL

EnsEMBL is a European-based, automated genome annotation project, jointly hosted by the Sanger Institute and the European Bioinformatics Institute, both in Cambridge, UK. As of April 2004, EnsEMBL provides genomic data on 11 organisms, including human, mouse, rat and fugu. The EnsEMBL homepage is located at: http://www.ensembl.org/



Sequence and annotation data are available through a graphical wizard built around a search engine called EnsMART. The following is a brief worked example of retrieving two mouse DNA sequences from EnsMART.



































Advantages:



Disadvantages:





2 Scanning Multiple Sequences for Known TFBS

2.1 Rules of the Mogul Road



2.2 Getting Started



>TheNameofTheFirstSequence

atgatgaatta...and the rest of the sequence

>TheNameofTheSecondSequence

atgatgaatta...and the rest of the sequence
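
The layout shown above appears to be ordinary multi-record FASTA: a header line starting with ">" followed by one or more lines of sequence. As a way of checking an input file before scanning, the Python sketch below reads such a file into (name, sequence) pairs. It is not Mogul's own reader. The function name read_fasta, the decision to lower-case the sequence and the tolerance of blank lines are all assumptions made for this example.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.lower())  # assumption: normalise to lower case
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>TheNameofTheFirstSequence
atgatgaatta
>TheNameofTheSecondSequence
atgatgaatta
""".splitlines()

for seq_name, seq in read_fasta(example):
    print(seq_name, len(seq))
```

To check a real file, pass an open file handle instead of the example list, for instance read_fasta(open("my_sequences.fa")), where the file name is a placeholder.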



2.3 Default Running of the Scanners

If no motif files are supplied to Mogul, sets of background motifs are used. These are as follows:



2.4 Creating a File of Text Motifs

Motif_name_1 tab Actual_motif_sequence

Motif_name_2 tab Actual_motif_sequence
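
In the layout above, "tab" stands for a single tab character between the motif name and the motif sequence. The Python sketch below checks a file of this shape and, purely for illustration, reports exact forward-strand matches in a sequence. The matching rules Mogul actually applies (for example ambiguity codes or reverse-strand matching) are not described in this tutorial, so find_exact should not be read as Mogul's method. The function names and the two motif names are hypothetical.

```python
def read_text_motifs(lines):
    """Parse 'name<TAB>motif' lines into a dict of name -> motif string."""
    motifs = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"line {number}: expected name<TAB>motif")
        motifs[parts[0].strip()] = parts[1].strip().lower()
    return motifs


def find_exact(motif, sequence):
    """Return 0-based start positions of exact, forward-strand matches."""
    positions, start = [], sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)  # step by 1 to allow overlaps
    return positions


# Hypothetical motif names; the sequence is the fragment shown in section 1.1.
motifs = read_text_motifs(["TATA_like\tTATAAA\n", "Example_2\tGATA\n"])
sequence = "agtgctagatagatag"
for motif_name, motif in motifs.items():
    print(motif_name, find_exact(motif, sequence))
```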



2.5 Preparing to Scan with Text Motifs



2.6 Creating a File of Matrix Motifs

>

Name=Motif_Name

Consensus=ACGT

Width=4

10space1space0space1

1space10space0space1

0space1space9space2

1space2space2space7

<

>

Name=Next_Motif

etc.
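
In the layout above, "space" stands for a single space character between counts, and each motif sits between a ">" line and a "<" line. The Python sketch below reads blocks of this shape. Two things are assumed rather than taken from the tutorial: that each matrix row is one motif position, and that the four columns are counts for A, C, G and T in that order. Under those assumptions the example matrix gives back the stated consensus, ACGT. The function names are hypothetical and this is not Mogul's own parser.

```python
BASES = "ACGT"  # assumed column order; the tutorial does not state it


def read_matrix_motifs(lines):
    """Parse '>' ... '<' delimited motif blocks into a list of dicts."""
    motifs, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == ">":
            current = {"rows": []}
        elif current is None:
            raise ValueError(f"content outside a '>' ... '<' block: {line!r}")
        elif line == "<":
            # Assumption: one matrix row per motif position, so rows == Width.
            if len(current["rows"]) != current.get("Width"):
                raise ValueError("number of matrix rows does not match Width")
            motifs.append(current)
            current = None
        elif "=" in line:
            key, value = line.split("=", 1)
            current[key] = int(value) if key == "Width" else value
        else:
            row = [int(count) for count in line.split()]
            if len(row) != len(BASES):
                raise ValueError(f"expected {len(BASES)} counts, got: {line!r}")
            current["rows"].append(row)
    if current is not None:
        raise ValueError("unterminated motif block (missing '<')")
    return motifs


def consensus_from_rows(rows):
    """Pick the highest-count base at each position (ties go to the first)."""
    return "".join(BASES[row.index(max(row))] for row in rows)


example = """>
Name=Motif_Name
Consensus=ACGT
Width=4
10 1 0 1
1 10 0 1
0 1 9 2
1 2 2 7
<
""".splitlines()

motif = read_matrix_motifs(example)[0]
print(motif["Name"], motif["Width"], consensus_from_rows(motif["rows"]))
```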





2.7 Preparing to Scan with Matrix Motifs



2.8 Running the Scanners



2.9 Guidelines for Running Searches







Alistair 23/04/2004

Mogul Scanning Tutorial: 21st April 2004 Ver.1.0