HMMTiling
BPmapRemoveRedundance (Internal)
HMMTiling is a
comprehensive software package for tiling array data analysis. It includes
command line python applications for filtering, mapping, quantile-normalizing
and enriched-region identification from ChIP-chip experiments on tiling arrays.
HMMTiling models the behavior of each individual probe as a baseline for each
ChIP or control experiment to be compared with. It then uses a Hidden Markov
Model to identify the enrichment probability at each probe location, thus can
determine the exact enriched regions.
As input,
HMMTiling expects a set of cel files from ChIP and control experiments as well
as two internal files: 1) a text version non-redundant bpmap file
(.bpmap.NR.txt); 2) Probe Behavior file (.pb). This package uses Affymetrix
FileIO SDK to read the binary format cel files.
The algorithm
used by HMMTiling is described in:
1. Li W, Meyer CA and
Liu XS: A hidden Markov model for analyzing ChIP-chip experiments on genome
tiling arrays and its application to p53 binding sequences. To appear in
Bioinformatics and ISMB2005.
2. Carroll JS, Liu XS,
Brodsky AS, Li W, Szary AJ, Meyer CA, Shao W, Hestermann EV, Geistlinger TR,
Fox EA, Silver PA and Brown M: Chromosome-wide Mapping of Estrogen Receptor
Binding Reveals Long-range Combinatorial Regulation. To appear in Cell.
Users can run
each program without any options for the command-line help.
Usage: %prog bpmapfile1 bpmapfile2 ...
This
program will make non-redundant text-format bpmap files from the original
Affymetrix binary bpmap files. It uses two-stage Redundancy Removal to ensure
unique probe measurement within 1 kb. It will add NR.txt suffix in the
non-redundant bpmap files
Usage:
%prog *.cel ..
This program will take a set
of input cel files and create quantile-normalized cel files. If you want to use
the pre-computed probe behavior file in the HMM algorithm, you must choose a
corresponding normalized cel file as a reference in the quantile normalization.
Usage: %prog options cel1 cel2 ..
This
program will take a set of normalized cel files and the text-version
non-redundant bpmap files to create a matrix file. One matrix file per
chromosome. The cel files and bpmap files should be in the same folder. In the
matrix file, the first two columns are ChipID and Position, the rightwards
columns are different array samples (experiments).
Usage: %prog -i Matrixfile -o PBfile --pcutoff --pcols –ecols
This program will take the
matrix file generated from Cel2Matrix program and create a probe behavior file.
The default output is the chromosome name from the matrix file +
".pb" suffix like chr21.gb. You may specify the experiment index as
partial outlier (you have to specify the percentage cutoffs as well, the top
percentage probes in this experiment are excluded in probe behavior estimation)
or entire probe experiments (all probes in this experiment are included in
estimation). So the rest of the experiments not mentioned are entire outliers
(all probes in this experiment are excluded in estimation). The leftmost
experiment is indexed as 0, the second left is indexed as 1 and so on.
Usage: %prog -i MatrixFile [-p 300|-r 0.001] -t Treat
–c Control -b PBFile
This is the
main program in this package. This program will take the matrix file generated
from Cel2Matrix program to run HMM algorithm to get the log_odds of treatment
and control groups for each probe. Users can run BedCall program afterwards to
get the bed-format enriched region. It will add a ‘.hmm’ suffix to the matrix
file as the output hmm file. You may specify the experiment index in the
treatment and control group. The leftmost experiment is indexed as 0, the
second left is indexed as 1 and so on.
Usage: %prog options hmmfile1 hmmfile2 …
This program will take the a
set of hmm files (one hmm file one chromosome) generated by HMMTiling program
and create One ChIP-enriched regions in UCSC bed format according to user
specified criteria
The HMMTiling package is provided as python source
code. We are going to provide windows and Linux binary release with GUI later.
1. Download the HMMTiling source code.
2. Check the dependencies listed below and install /
upgrade if necessary
a. Python >= 2.4
with modules: numarray >=
b. gcc/g++ >=
c. SWIG >=
3. cd into the HMMTiling/SDK directory
4. Change the first two lines of install.sh to specify
the python install directory in your machine
PYTHONHEAD=~/local/include/python2.4/
PYTHONLIB=~/local/lib/python2.4/config/
5. run install.sh to wrap the c++ codes as python modules
6. Now go back to the HMMTiling directory and test every
program by simply type its name, like ‘python Qnormalize.py’
· Make a new directories as below in your linux machine
lab/ normalized #
normalized cel files and non-redundant bpmap files
lab/ HMMTiling #
HMMTiling software package
lab/ rawdata #
raw cel files and Affymetrix bpmap files
lab/test #
some parameter files used in testing
· Download the raw cel files (A,
B and C arrays) of Chr21/22 ChIP-chip experiments in Cawley et al. 2004. and Affymetrix original bpmap
files (version 33 of the NCBI assembly of the Human Genome). Put all the
files together into directory ‘rawdata’. To make the cel file names
human-readable, try to rename them with rename.sh
##########################
Make non-redundant text-format bpmap files from the original Affymetrix binary
bpmap files
################################
cd rawdata
python ../HMMTiling/BPmapRemoveRedundance.py
P1_CHIP_A.Anti-Sense.hs.NCBIv33.sary.bpmap
P1_CHIP_C.Anti-Sense.hs.NCBIv33.sary.bpmap
P1_CHIP_B.Anti-Sense.hs.NCBIv33.sary.bpmap
##############################
Quantile normalize on A, B and C arrays separately to make the probe
distribution the same across all samples. Put the normalized cel files into
../normalized
###############################
python ../HMMTiling/Qnormalize.py
--outdir ../normalized/ *_A_*CEL
python ../HMMTiling/Qnormalize.py --outdir ../normalized/ *_B_*CEL
python ../HMMTiling/Qnormalize.py --outdir ../normalized/ *_C_*CEL
##############################
Make matrix files
###############################
cd ..
#mv the non-redundant bpmap files to the normalized directory
mv rawdata/P1_CHIP_*.txt normalized
python HMMTiling/Cel2Matrix.py -o CawleyAll -m 1 --dir ./normalized/ --outlier
--cel-file test/Cawley.celfile
##############################
Make probe behavior file from the matrix files
###############################
python
HMMTiling/PBfileMaker.py -i CawleyAll.chr21.matrix --pcutoff 0.005 \
--pcols 24,25,26,27,32,33,34,35,40,41,42,43 --ecols \
0,1,2,3,8,9,10,11,16,17,18,19,28,29,30,31,36,37,38,39,44,45,46,47 &
python
HMMTiling/PBfileMaker.py -i CawleyAll.chr22.matrix --pcutoff 0.005 \
--pcols 24,25,26,27,32,33,34,35,40,41,42,43 --ecols \
0,1,2,3,8,9,10,11,16,17,18,19,28,29,30,31,36,37,38,39,44,45,46,47 &
##############################
Run HMM algorithm on the matrix file to find the p53-FL vs Input enriched
regions.
If we assume 300 peaks in Chr21/22, the transition probability will be defined
as 300/1,000,000(probes) = 0.0003
###############################
python HMMTiling/HMMTiling.py
-i CawleyAll.chr21.matrix \
-r 0.0003 -t 6,7,14,15,22,23 -c 2,3,10,11,18,19 -b chr21.pb &
python HMMTiling/HMMTiling.py -i CawleyAll.chr22.matrix \
-r 0.0003 -t 6,7,14,15,22,23 -c 2,3,10,11,18,19 -b chr22.pb &
##############################
Run bed calling these hmm files to get the bed-format enriched regions.
Set the cutoff as 6.
###############################
python HMMTiling/BedCall.py -o p53.bed
--tp 6 --tc 6 CawleyAll.*.hmm
##############################
Now you can use your probe behavior file and reference cel file (any normalized
cel file in the normalized directory is ok, one reference cel file one array
type) to help analyze other ChIP-chip experiments on human Chr21/22. Let’s use
the same raw cel files in Cawley et al.
2004 as an example. Suppose we only have 6 p53-FL ChIP and 6 genomic input
experiments.
###############################
##############################
Quantile normalize using the referece cel file. Put the normalized cel files
into ../p53FL
###############################
python
../HMMTiling/Qnormalize.py \
--reference ../normalized/HCT116_A_1_controlGST_1.CEL \
--outdir ../p53FL/ \
HCT116_A_1_p53-FL_1.CEL
HCT116_A_2_p53-FL_1.CEL
HCT116_A_3_p53-FL_1.CEL HCT116_A_1_p53-FL_2.CEL HCT116_A_2_p53-FL_2.CEL HCT116_A_3_p53-FL_2.CEL
HCT116_A_1_controlInput_1.CEL
HCT116_A_2_controlInput_1.CEL
HCT116_A_3_controlInput_1.CEL HCT116_A_1_controlInput_2.CEL HCT116_A_2_controlInput_2.CEL HCT116_A_3_controlInput_2.CEL &
python ../HMMTiling/Qnormalize.py \
--reference ../normalized/HCT116_B_1_controlGST_1.CEL \
--outdir ../p53FL/ \
HCT116_B_1_p53-FL_1.CEL
HCT116_B_2_p53-FL_1.CEL
HCT116_B_3_p53-FL_1.CEL HCT116_B_1_p53-FL_2.CEL HCT116_B_2_p53-FL_2.CEL HCT116_B_3_p53-FL_2.CEL HCT116_B_1_controlInput_1.CEL HCT116_B_2_controlInput_1.CEL HCT116_B_3_controlInput_1.CEL HCT116_B_1_controlInput_2.CEL HCT116_B_2_controlInput_2.CEL HCT116_B_3_controlInput_2.CEL &
python ../HMMTiling/Qnormalize.py \
--reference ../normalized/HCT116_C_1_controlGST_1.CEL \
--outdir ../p53FL/ \
HCT116_C_1_p53-FL_1.CEL
HCT116_C_2_p53-FL_1.CEL
HCT116_C_3_p53-FL_1.CEL HCT116_C_1_p53-FL_2.CEL HCT116_C_2_p53-FL_2.CEL HCT116_C_3_p53-FL_2.CEL HCT116_C_1_controlInput_1.CEL HCT116_C_2_controlInput_1.CEL HCT116_C_3_controlInput_1.CEL
HCT116_C_1_controlInput_2.CEL
HCT116_C_2_controlInput_2.CEL
HCT116_C_3_controlInput_2.CEL &
##############################
Make matrix, hmm and bed files
###############################
cp rawdata/P1_CHIP_*.txt p53FL/
python HMMTiling/Cel2Matrix.py -o p53FL -m 1 --dir ./p53FL/ --outlier
--cel-file test/p53FL.celfile
python HMMTiling/HMMTiling.py
-i p53FL.chr21.matrix \
-r 0.0003 -t 2,3,6,7,10,11 -c 0,1,4,5,8,9 -b chr21.pb &
python HMMTiling/HMMTiling.py -i p53FL.chr22.matrix \
-r 0.0003 -t 2,3,6,7,10,11 -c 0,1,4,5,8,9 -b chr22.pb &
python HMMTiling/BedCall.py
-o p53.1.bed --tp 6 --tc 6
p53FL*.hmm
##############################
Now you have a new list of p53 enriched region p53.1.bed which should be almost
the same as p53.bed
###############################