Getting Started
Test Simcluster with the example file by following these steps:
- source ~/simcluster/bin/env.sh
- cd ~/simcluster/share/example
- simtree -i GDS550.dat
- treedraw -l -i GDS550-complete.newick -s samples.info
These commands create a PNG file showing the hierarchical cluster. The colors are assigned according to the sample information in the file samples.info.
The -i option specifies the input file. This file is a table with the samples in the columns and the variables (genes) in the rows. The columns are separated by Tab characters.
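The table layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that the first row holds the sample IDs and that the first column of each row holds the gene or tag identifier, which this guide does not state explicitly. The helper name read_matrix and the demo values are invented for illustration and are not part of Simcluster.

```python
import csv
import io

def read_matrix(handle):
    """Parse a tab-separated table: genes in rows, samples in columns.

    Assumption (not confirmed by this guide): the first row lists the
    sample IDs and the first column of every row is the gene/tag ID.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]          # skip the corner cell above the gene IDs
    counts = {}
    for row in reader:
        if not row:               # tolerate blank lines
            continue
        counts[row[0]] = [float(value) for value in row[1:]]
    return samples, counts

# Tiny made-up table, only to show the shape of the data.
demo = "tag\tS1\tS2\nAAAC\t3\t0\nGGTA\t1\t7\n"
samples, counts = read_matrix(io.StringIO(demo))
print(samples)          # ['S1', 'S2']
print(counts["GGTA"])   # [1.0, 7.0]
```

To read a real file, pass an open file handle instead of the in-memory demo string.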
You can choose another cluster linkage with the -m option. For example, the command: simtree -m a -i GDS550.dat will generate the file GDS550-average.newick with average linkage.
Flag  Name              Description
a     Average linkage   Average pairwise distance
s     Single linkage    Minimum pairwise distance
m     Complete linkage  Maximum pairwise distance
Output format
Simtree generates the hierarchical cluster in the Newick format. This format is used by the phylogeny community. A formal description is available here.
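To make the linkage flags and the Newick output concrete, here is a minimal Python sketch of agglomerative clustering over a precomputed distance matrix. It is not Simcluster's implementation: the function name cluster_to_newick, the choice to omit branch lengths, and the four demo points are all assumptions made for illustration. Only the flag letters and their meanings are taken from this guide.

```python
from itertools import combinations

# Flag letters as listed in this guide; each function reduces the pairwise
# distances between two clusters to a single number.
LINKAGE = {
    "a": lambda d: sum(d) / len(d),  # average linkage
    "s": min,                        # single linkage
    "m": max,                        # complete linkage
}

def cluster_to_newick(names, dist, flag="m"):
    """Merge the two closest clusters until one remains.

    names: unique leaf labels; dist[i][j]: distance between leaves i and j.
    Returns a Newick string without branch lengths.
    """
    reduce_fn = LINKAGE[flag]
    # Each cluster is (newick_text, [leaf indices]).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        def gap(pair):
            (_, left), (_, right) = pair
            return reduce_fn([dist[i][j] for i in left for j in right])
        a, b = min(combinations(clusters, 2), key=gap)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(("(%s,%s)" % (a[0], b[0]), a[1] + b[1]))
    return clusters[0][0] + ";"

# Four made-up samples placed on a line, so distances are easy to check.
names = ["A", "B", "C", "D"]
pos = [0.0, 1.0, 2.4, 4.0]
dist = [[abs(p - q) for q in pos] for p in pos]
print(cluster_to_newick(names, dist, "s"))  # (D,(C,(A,B)));
print(cluster_to_newick(names, dist, "m"))  # ((A,B),(C,D));
```

On this toy input, single linkage chains C onto the A/B pair, while complete linkage pairs C with D, which shows how the choice of linkage can change the tree.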
Treedraw
Treedraw was built to allow the generation of figures by scripts or the web interface. You can see an example of a generated file below:
If you provide a file with information about the samples, treedraw will add the sample descriptions and color the leaves. Below you can see the same data set drawn with this info file.
The info file has 3 columns separated by Tabs:
- sample_id
- class
- Sample description
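The three-column layout above is simple enough to inspect with a short script, for example to check how many samples fall into each class before drawing. This is a minimal sketch that assumes the file has no header line. The helper name read_info and the sample IDs in the demo are invented for illustration.

```python
import io
from collections import defaultdict

def read_info(handle):
    """Group sample IDs by class from a three-column, tab-separated file.

    Assumption: every non-empty line is sample_id<TAB>class<TAB>description,
    with no header line.
    """
    by_class = defaultdict(list)
    descriptions = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        sample_id, sample_class, description = line.split("\t", 2)
        by_class[sample_class].append(sample_id)
        descriptions[sample_id] = description
    return dict(by_class), descriptions

# Made-up rows, only to show the expected shape of the file.
demo = "s1\ttumor\tfirst sample\ns2\tnormal\tsecond sample\ns3\ttumor\tthird sample\n"
by_class, descriptions = read_info(io.StringIO(demo))
print(by_class)            # {'tumor': ['s1', 's3'], 'normal': ['s2']}
print(descriptions["s2"])  # second sample
```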
Second Example: Getting data from GEO
You can use the get_samples Perl script to automatically download a list of GEO samples. Run these commands in your ~/simcluster/examples/ directory:
- mkdir GEO
- cd GEO
- get_samples < ../brain_list
- make_table GSM*.txt > brain.dat
The make_table script reads all the sample files and creates three files:
- brain.dat: the Simcluster input matrix.
- samples.info-gen.txt: a file with sample_id and description that can be used in treedraw.
- keywords.txt: the list of keywords found in the GSM files. This file can help you choose appropriate sample classes.
Filtering the data set
It is a well-known fact that SAGE, for example, produces tags with sequencing errors. The best method to filter these tags is an open question. However, we provide a simple script which removes low-count tags. The syntax is:
clean_tags.rb [matrix_file.dat] [threshold] > [matrix_clean.dat]
This command removes all tags for which the sum of counts is less than or equal to the threshold value. For example: clean_tags.rb brain.dat 9 > brain_clean.dat
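The filtering rule can be stated in a few lines of Python. This is a sketch of the rule as described above, not a port of clean_tags.rb: the function name and the in-memory row format, a list of (tag, counts) pairs, are assumptions made for illustration.

```python
def clean_tags(rows, threshold):
    """Keep only the tags whose total count is strictly above threshold.

    rows: list of (tag, [count per sample]) pairs.
    Tags whose sum of counts is less than or equal to threshold are dropped.
    """
    return [(tag, counts) for tag, counts in rows if sum(counts) > threshold]

# Made-up counts for three tags across three samples.
rows = [
    ("AAAC", [3, 4, 2]),   # sum 9: removed at threshold 9
    ("GGTA", [5, 5, 0]),   # sum 10: kept
    ("TTTT", [0, 1, 0]),   # sum 1: removed
]
print(clean_tags(rows, 9))   # [('GGTA', [5, 5, 0])]
```

Note that a tag whose total equals the threshold is removed, so a threshold of 9 keeps only tags with at least 10 counts in total.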
To cluster this data set, type:
- simtree -i brain.dat
- treedraw -l -i brain-complete.newick -s samples.info-gen.txt
It is much harder to identify cluster patterns in this file without the class colors. Choosing appropriate classes for your samples can be the heart of your cluster analysis. The make_table script tries to help you but, of course, discovering the true classes is a process of trial and error. In this example, classifying the samples by tumor type is a good first guess, as implemented by this info file.
Discovering the number of clusters
The simpca program can estimate the number of clusters in the data set using PCA, as described in Quackenbush (2001). Simpca calculates the eigenvalues of the simplex covariance matrix and suggests either the number of sources which explain 95% of the total variability, or the number of variability sources where the difference of the eigenvalues is less than 5%. These parameters can be changed by command-line options.
Try: simpca -i brain_clean.dat.
The algorithm suggests 3 as the number of main sources of variability (and "therefore" the potential number of clusters) in this example.
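The 95% rule mentioned above reduces to a cumulative sum over the sorted eigenvalues. The following minimal Python sketch shows that arithmetic only. It does not compute the simplex covariance matrix, the function name is hypothetical, and the eigenvalues in the demo are invented rather than taken from simpca output.

```python
def components_for_share(eigenvalues, share=0.95):
    """Smallest number of leading eigenvalues whose sum reaches
    the given share of the total variability."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= share * total:
            return count
    return len(ordered)

# Made-up eigenvalues: the first three account for 96 of 100 units.
print(components_for_share([50, 30, 16, 3, 1]))        # 3
print(components_for_share([50, 30, 16, 3, 1], 0.75))  # 2
```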
Using Partition Cluster Algorithm
The simpart program is another clustering tool, which builds partition clusters. Simpart can cluster using the K-means, K-medoids and SOM (Self-Organizing Maps) algorithms.
For example:
simpart -k 3 -i brain_clean.dat
This command will divide the samples into 3 clusters. The result will be printed on the screen and saved in the file brain_clean.partition.
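For readers unfamiliar with partition clustering, here is a minimal Python sketch of plain K-means (Lloyd's algorithm) with Euclidean distance. It is an illustration of the general technique, not of simpart itself: how simpart initializes centroids and which distance it uses are not described in this guide, so the explicit starting centroids and the 2-D demo points below are assumptions.

```python
def kmeans(points, centroids, rounds=20):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until nothing changes.

    points, centroids: lists of equal-length numeric tuples.
    Returns (labels, centroids).
    """
    centroids = [tuple(c) for c in centroids]

    def nearest(point):
        # Index of the centroid with the smallest squared Euclidean distance.
        return min(range(len(centroids)),
                   key=lambda c: sum((p - q) ** 2
                                     for p, q in zip(point, centroids[c])))

    labels = []
    for _ in range(rounds):
        labels = [nearest(point) for point in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(old)   # leave an empty cluster where it is
        if updated == centroids:      # converged
            break
        centroids = updated
    return labels, centroids

# Six made-up 2-D points forming three obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (10.0, 0.0), (10.0, 1.0)]
start = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
labels, centers = kmeans(points, start)
print(labels)    # [0, 0, 1, 1, 2, 2]
print(centers)   # [(0.0, 0.5), (5.5, 5.0), (10.0, 0.5)]
```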
A more human readable output can be created if the samples information file is supplied:
simpart -s samples.info-gen.txt -k 3 -i brain_clean.dat
A file named brain_clean-autoclass.info.txt will be created. The partition cluster information is in its second column: each sample is assigned a letter representing its cluster.
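The idea of turning cluster assignments into an info file can be sketched in Python as follows. This is an illustration under stated assumptions, not simpart's code: the function name is hypothetical, the use of uppercase letters starting at A is a guess at the lettering scheme, and the sample IDs are invented. The three-column layout (sample_id, class, description) is the info file format described earlier in this guide.

```python
import string

def autoclass_rows(sample_ids, labels, descriptions):
    """Build tab-separated info-file rows whose class column is a letter.

    labels: one cluster index per sample (0, 1, 2, ...), at most 26 clusters.
    Assumption: cluster 0 maps to 'A', cluster 1 to 'B', and so on.
    """
    rows = []
    for sample_id, label in zip(sample_ids, labels):
        letter = string.ascii_uppercase[label]
        rows.append("\t".join([sample_id, letter,
                               descriptions.get(sample_id, "")]))
    return rows

# Made-up samples and cluster indices.
ids = ["s1", "s2", "s3"]
desc = {"s1": "first sample", "s2": "second sample", "s3": "third sample"}
for row in autoclass_rows(ids, [0, 2, 0], desc):
    print(row)
```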
This file can be used by the treedraw program to colorize the hierarchical cluster (assuming the tree brain_clean-complete.newick was built with simtree -i brain_clean.dat):
treedraw -l -i brain_clean-complete.newick -s brain_clean-autoclass.info.txt
The same kind of logical structure applies to all of Simcluster's other individual tools for other types of clustering analysis. Use what you learned here and generalize to the other tools. Please find more information in the man page of each tool.

