Getting Started

Test Simcluster with the example file by following these steps:

  1. source ~/simcluster/bin/
  2. cd ~/simcluster/share/example
  3. simtree -i GDS550.dat
  4. treedraw -l -i GDS550-complete.newick -s

These commands create a PNG file showing the hierarchical cluster. The leaf colors are chosen from the sample information file.

The -i option specifies the input file. This file is a table with the samples in the columns and the variables (genes) in the rows. The columns are separated by Tab characters.
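To make the input layout concrete, here is a minimal Python sketch that reads a table of this shape. It assumes that the first row holds the sample IDs and the first column holds the gene (tag) ID; the text above does not state this, so check it against your own file. The function name read_matrix and the sample data are hypothetical and are not part of Simcluster.

```python
import io

def read_matrix(handle):
    """Read a Tab-separated table with samples in columns and genes in rows.

    Assumption: row 1 is a header of sample IDs and column 1 is the gene ID.
    """
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[1:]  # skip the label above the gene column
    rows = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        rows[fields[0]] = [float(x) for x in fields[1:]]
    return samples, rows

# Hypothetical three-gene, two-sample table (not taken from GDS550.dat).
example = "gene\tsampleA\tsampleB\nG1\t10\t0\nG2\t3\t7\nG3\t0\t5\n"
samples, rows = read_matrix(io.StringIO(example))
print(samples)
print(rows["G2"])
```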

You can select a different cluster linkage with the -m option. The available linkages are:

  a    Average linkage     Average pairwise distance
  s    Single linkage      Minimum pairwise distance
  m    Complete linkage    Maximum pairwise distance
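The three linkages differ only in how the pairwise distances between two clusters are summarized. The following Python sketch illustrates this with a toy distance (absolute difference between numbers). It is not Simcluster's code and does not use Simcluster's distance measure; the function name linkage_distance is hypothetical, and only the option letters a, s, and m come from the table above.

```python
from itertools import product

def linkage_distance(cluster_a, cluster_b, dist, method):
    """Distance between two clusters under the linkage selected by method."""
    pairwise = [dist(x, y) for x, y in product(cluster_a, cluster_b)]
    if method == "a":   # average linkage: average pairwise distance
        return sum(pairwise) / len(pairwise)
    if method == "s":   # single linkage: minimum pairwise distance
        return min(pairwise)
    if method == "m":   # complete linkage: maximum pairwise distance
        return max(pairwise)
    raise ValueError("unknown linkage: " + repr(method))

def toy_dist(x, y):
    # Illustrative distance only; Simcluster uses its own measure.
    return abs(x - y)

left, right = [0, 1], [4, 6]
for method in ("a", "s", "m"):
    print(method, linkage_distance(left, right, toy_dist, method))
```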

For example, the command:

simtree -m a -i GDS550.dat

will generate the file GDS550-average.newick with average linkage.

Output format

Simtree writes the hierarchical cluster in the Newick format. This format is widely used in the phylogenetics community. A formal description is available here.
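A Newick string describes a tree with nested parentheses, comma-separated children, optional branch lengths after a colon, and a closing semicolon. As a rough illustration, the Python sketch below lists the leaf labels of such a string. The tree and the function name leaf_names are made up for this example, and the sketch handles only simple unquoted labels, not the full format.

```python
import re

def leaf_names(newick):
    """Return the leaf labels of a Newick string in order of appearance.

    A leaf label starts right after "(" or ","; a label that follows ")"
    belongs to an internal node and is ignored. Quoted labels are not handled.
    """
    found = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in found if name.strip()]

# Hypothetical tree with three samples and branch lengths.
tree = "((sampleA:0.1,sampleB:0.2):0.3,sampleC:0.5);"
print(leaf_names(tree))
```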


Treedraw is the command-line tree-drawing program distributed with Simcluster. Treedraw uses code from Treedraw X for tree drawing. The underlying graphics engine is Cairo, a vector graphics library that allows us to generate PNG, PostScript, and PDF files.

Treedraw was built to allow the generation of figures by scripts or the web interface. You can see an example of a generated file below:

If you provide a file with information about the samples, treedraw will add the sample descriptions and color the leaves. Below you see the same data set drawn with this info file.

The info file has 3 columns separated by Tabs:

  1. sample_id
  2. class
  3. Sample description
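As an illustration of the three-column layout above, this Python sketch reads an info file and groups the sample IDs by class. It assumes there is no header row, which the text does not specify. The function name read_info and the example rows are hypothetical.

```python
import io
from collections import defaultdict

def read_info(handle):
    """Return ({class: [sample_id, ...]}, {sample_id: description})."""
    by_class = defaultdict(list)
    descriptions = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sample_id, sample_class, description = fields[:3]
        by_class[sample_class].append(sample_id)
        descriptions[sample_id] = description
    return dict(by_class), descriptions

# Hypothetical rows; the IDs, classes, and descriptions are invented.
example = (
    "S1\ttumor\tFirst tumor sample\n"
    "S2\tnormal\tFirst normal sample\n"
    "S3\ttumor\tSecond tumor sample\n"
)
by_class, descriptions = read_info(io.StringIO(example))
print(by_class)
```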

Second Example: Getting data from GEO

You can use the get_samples Perl script to automatically download a list of GEO samples. Run these commands in your ~/simcluster/examples/ directory:

  1. mkdir GEO
  2. cd GEO
  3. get_samples < ../brain_list
  4. make_table GSM*.txt > brain.dat

The make_table script reads all the sample files and creates three files:

  • brain.dat: the Simcluster input matrix.
  • A file with sample_id and description that can be used in treedraw.
  • keywords.txt: the list of keywords found in the GSM files. This file can help you choose appropriate sample classes.

Filtering the data set

It is well known that SAGE, for example, produces tags with sequencing errors. The best method for filtering these tags is an open question. However, we provide a simple script that removes low-count tags. The syntax is:

clean_tags.rb [matrix_file.dat] [threshold] > [matrix_clean.dat]

This command removes all tags whose sum of counts is less than or equal to the threshold value. For example: clean_tags.rb brain.dat 9 > brain_clean.dat
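The filtering rule can be sketched in a few lines of Python. This is not the clean_tags.rb source, only an illustration of the stated rule (drop a tag when the sum of its counts is less than or equal to the threshold). The function name clean_tags and the counts are hypothetical.

```python
def clean_tags(rows, threshold):
    """Keep only tags whose total count is strictly greater than threshold."""
    return {tag: counts for tag, counts in rows.items() if sum(counts) > threshold}

# Hypothetical tag counts across two samples.
rows = {
    "TAG1": [5, 5],   # sum 10: kept
    "TAG2": [4, 5],   # sum 9: removed, equal to the threshold
    "TAG3": [0, 1],   # sum 1: removed
}
kept = clean_tags(rows, 9)
print(sorted(kept))
```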

To cluster this data set, type:

  1. simtree -i brain.dat
  2. treedraw -l -i brain-complete.newick -s

It is much harder to identify cluster patterns in this file without the class colors. Choosing appropriate classes for your samples may well be the heart of your cluster analysis. The make_table script tries to help, but discovering the true classes is, of course, a process of trial and error. In this example, classifying the samples by tumor type is a good first guess, as implemented by this info file.

Discovering the number of clusters

The simpca program can estimate the number of clusters in the data set using PCA, as described in Quackenbush (2001). Simpca calculates the eigenvalues of the simplex covariance matrix and suggests either the number of sources that explain 95% of the total variability, or the number of variability sources at which the difference between eigenvalues is less than 5%. These parameters can be changed with command-line options.

Try: simpca -i brain_clean.dat.

The algorithm suggests 3 as the number of main sources of variability (and "therefore" the potential number of clusters) in this example.
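The 95% rule described above can be illustrated with a short Python sketch: given a list of eigenvalues, find the smallest number of leading components whose cumulative share reaches the target fraction. This is not simpca's implementation; the function name components_for_variance and the eigenvalues are hypothetical, and the sketch covers only the cumulative-variability criterion, not the 5% difference criterion.

```python
def components_for_variance(eigenvalues, fraction=0.95):
    """Smallest k such that the k largest eigenvalues explain >= fraction."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= fraction:
            return k
    return len(ordered)

# Hypothetical eigenvalues; cumulative shares are 60%, 85%, 97%, ...
eigenvalues = [6.0, 2.5, 1.2, 0.2, 0.1]
print(components_for_variance(eigenvalues))
```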

Using Partition Cluster Algorithm

The simpart program is another clustering tool, which builds partition clusters. Simpart can cluster using the K-means, K-medoids, and SOM (Self-Organizing Maps) algorithms.

For example:

simpart -k 3 -i brain_clean.dat

This command divides the samples into 3 clusters. The result is printed on the screen and saved in the file brain_clean.partition.
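For readers unfamiliar with partition clustering, the following is a minimal Python sketch of the K-means idea: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. It is not simpart's implementation. It uses plain Euclidean distance and a naive initialization (the first k points), both of which are assumptions made for illustration, and the function name kmeans and the points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Return a cluster index (0..k-1) for each point."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical two-dimensional points forming three obvious groups.
points = [(0, 0), (10, 10), (20, 0), (1, 0), (10, 11), (21, 1)]
labels = kmeans(points, 3)
print(labels)
```

With real data the result depends on the initialization, so production tools usually try several starting points.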

A more human-readable output can be created if the sample information file is supplied:

simpart -s -k 3 -i brain_clean.dat

A file will be created with the partition cluster information in the second column: each sample is assigned a letter representing its cluster.

This file can be used by the treedraw program to color the hierarchical cluster:

treedraw -l -i brain_clean-complete.newick -s


The same logical structure applies to all of Simcluster's other tools for other types of clustering analysis. Use what you learned here and generalize it to the other tools. More information is available in the man page of each tool.