CpGpromoter is a program for large-scale human promoter mapping using CpG
islands. As defined by Gardiner-Garden & Frommer, CpG islands are greater
than 200 bp in length, have a G+C content of more than 50%, and have a CpG
frequency of at least 0.6 of that expected on the basis of the G+C content
of the region. CpG islands are an important signature of the 5' region of
many mammalian genes. The program is based on the results of a discriminant
analysis between promoter-associated CpG islands and non-associated ones
(Ioshikhes & Zhang). It enables efficient mapping of human promoters with
2 kb resolution, provided there is a CpG island within the interval
(-500...+1,500) around a transcription start site (TSS).
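The three criteria quoted above can be illustrated with a minimal Python
sketch. It is not part of CpGpromoter and does not reproduce cpgplot, which
scans a sequence for islands; it only tests one whole sequence against the
thresholds. The function names and the use of (number of C x number of G) /
length as the expected CpG count are assumptions made for illustration.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() sees every CpG
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed definition of the expected CpG count: (#C * #G) / length
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return n, gc_fraction, ratio


def meets_island_criteria(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Check a whole sequence against the thresholds quoted in the text."""
    n, gc_fraction, ratio = cpg_stats(seq)
    return n > min_len and gc_fraction > min_gc and ratio >= min_ratio


if __name__ == "__main__":
    print(meets_island_criteria("CG" * 150))  # CpG-rich toy sequence
    print(meets_island_criteria("AT" * 150))  # no G or C at all
```

The thresholds are exposed as keyword arguments so that other cut-offs can be
tried without editing the function.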
The current version of the program is implemented as a sequence of the
cpgplot, comp and qda programs, with appropriate parsing and reformatting of
the intermediate data. The cpgplot program for mapping CpG islands,
described in (Larsen et al.), is available as part of the EMBOSS sequence
analysis programs on the ftp server of the Sanger Centre, UK
(ftp://ftp.sanger.ac.uk/pub/EMBOSS/
in file EMBOSS-0.0.4.tar.gz).
The comp program replaces the composition program of the GCG program package,
which was used in the previous version of CpG_promoter. The qda program
for quadratic discriminant analysis (McLachlan) is one of the S-Plus programs
for applied statistics (Venables & Ripley). The EMBOSS and S-Plus software
packages must be installed on the user's computer for CpGpromoter to work.
GCG is not required for this version. Perl and gcc are also required; you
may use the cc compiler instead of gcc by changing lines 40, 43, 46 and 49
of the file CpGpromoter.pl accordingly (see below).
The other parts of the program are available for ftp download from this site
to UNIX workstations as the file CpG_promoter.tar.Z. After retrieving the
compressed tar file, do the following (in a shell window):
1. Uncompress the .Z file using: uncompress CpG_promoter.tar.Z. This
should convert the CpG_promoter.tar.Z file to CpG_promoter.tar.
2. Untar the file using: tar xvf CpG_promoter.tar. This should create a
directory called CpG_promoter containing all its files. You can then delete
the tar file with rm CpG_promoter.tar. (Steps 1 and 2 are needed only for
installation; once the program is installed, proceed directly to step 3.)
3. Change your current directory to CpG_promoter. CpG_promoter is then ready
to use. Put the sequence you wish to analyze into the CpG_promoter directory.
The sequence should be in FASTA format, in a file called myquery.seq.
4. Run the CpG_promoter Perl script using perl CpG_promoter.pl. The
components of the script then run consecutively, sending the corresponding
messages to standard output. When the script finishes, information about the
mapped CpG islands should be available in the files query.cpgplot,
cpgplot.ps and toqda-1.dat.
5. As its last step, the script should start the S-Plus software package.
You must attach a .Data directory using its path on your local computer,
modifying line 1 of the file splus.input accordingly. The parameter heart0
in line 4 of this file is the result of the qda classification of our
Training Set (Ioshikhes & Zhang) for the interval (-500..+1500). To use the
combined data set (Training + Test Set 1 + Test Subset 2-1) from this paper
as the training set, use heart1 instead of heart0.
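Before running the script in step 4, it may help to confirm that myquery.seq
is a well-formed FASTA file, as step 3 requires. The following Python sketch
is not part of CpG_promoter. It assumes a single sequence record and an
A/C/G/T/N alphabet, neither of which is stated above, and the function name
check_fasta is hypothetical.

```python
def check_fasta(text):
    """Return (header, sequence) for a single-record FASTA string.

    Raises ValueError if the text does not look like FASTA. The single-record
    and A/C/G/T/N restrictions are assumptions made for this sketch.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected a single sequence record")
    seq = "".join(lines[1:]).upper()
    if not seq:
        raise ValueError("no sequence found after the header")
    bad = set(seq) - set("ACGTN")
    if bad:
        raise ValueError("unexpected characters: " + "".join(sorted(bad)))
    return lines[0][1:], seq


if __name__ == "__main__":
    # myquery.seq is the input file name given in step 3
    with open("myquery.seq") as handle:
        header, seq = check_fasta(handle.read())
    print(header, len(seq))
```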
The output data (on standard output and in the file output.dat) should look
like this, for example:
CpG island pos.     Class
(38989..39225)      "0"
(54844..55044)      "1"
Class "0" means that the respective island is predicted to be promoter-related
(centered in the interval (-500...+1500) around a possible TSS); class "1"
means that it is predicted not to be promoter-related.
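For downstream processing, the output can be read into a simple list of
records. The following Python sketch assumes that output.dat contains island
positions written as (start..end), each followed by a quoted class of "0" or
"1", as in the example above; the actual file may contain other text, which
the pattern ignores. The names ISLAND_RE and parse_islands are hypothetical.

```python
import re

# A position such as (38989..39225) followed by a quoted class "0" or "1".
# Any whitespace, including a line break, may separate the two.
ISLAND_RE = re.compile(r'\((\d+)\.\.(\d+)\)\s*"([01])"')


def parse_islands(text):
    """Return a list of dicts describing the classified CpG islands."""
    islands = []
    for start, end, cls in ISLAND_RE.findall(text):
        islands.append({
            "start": int(start),
            "end": int(end),
            # Class "0" is promoter-related, class "1" is not (see text)
            "promoter_related": cls == "0",
        })
    return islands


if __name__ == "__main__":
    with open("output.dat") as handle:
        for island in parse_islands(handle.read()):
            print(island)
```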
Work on CpG_promoter is in progress to make it more self-sufficient, with
less outside software involved. To let bench scientists have the tool
as soon as possible, we are releasing it before this work is completed
and before our paper is published (see reference below). This
program is free for academic use. For commercial use, please contact Ms.
Carol Dempster at 516-367-6885 (dempster@cshl.org).
If you have problems or comments, please contact:
================================================================================
Michael Q. Zhang, Ph.D.
Associate Professor                     Phone:  (516) 367-8393
Watson School of Biological Sciences    Fax:    (516) 367-8461
Cold Spring Harbor Laboratory           E-mail: mzhang@cshl.org
1 Bungtown Road,                        WinFax: (516) 367-6862
P.O.Box 100,                            WebURL: http://www.cshl.org/mzhanglab
Cold Spring Harbor, NY 11724 USA
================================================================================
Ilya Ioshikhes, Ph.D.
Instructor
Department of Molecular Genetics,       Phone:  (718) 430-3732
Albert Einstein College of Medicine     FAX:    (718) 430-8778
of Yeshiva University,
717 Ullmann Building
1300 Morris Park Avenue                 E-mail: ilya@harryeagle.aecom.yu.edu
Bronx, New York 10461 USA
================================================================================
References:
Ioshikhes, I. & Zhang, M.Q. Large-scale human
promoter mapping using CpG islands. Nature Genetics 26, 61-63 (2000).
Gardiner-Garden, M. & Frommer, M. CpG islands
in vertebrate genomes. J. Mol. Biol. 196, 261-276 (1987).
Larsen, F., Gundersen, R., Lopez, R. & Prydz,
H. CpG islands as gene markers in the human genome. Genomics 13, 1095-1107
(1992).
McLachlan, G.J. Discriminant Analysis and Statistical
Pattern Recognition (Wiley, New York) (1992).
Venables, W.N. & Ripley, B.D. Modern Applied
Statistics with S-Plus. (Springer-Verlag, New York) (1994).