BINOCh
Binding Inference from Nucleosome Occupancy Changes

Sample H3K4me2 ChIP-seq nucleosome positioning data

Here we provide a sample data set from our paper:

"Nucleosome dynamics define transcriptional enhancers" Housheng Hansen He, Clifford A Meyer, et al, Nature Genetics, 42, p343–347, 2010.

To characterize the pattern of nucleosome positioning at enhancers, we used nucleosome-resolution ChIP-seq of H3K4me2 in the prostate cancer cell line LNCaP in response to stimulation by the AR agonist 5α-dihydrotestosterone (DHT). H3K4me2 nucleosome positioning data were generated for the vehicle control condition and the 4 hour DHT condition. Here we provide the BED files of Illumina sequence tags mapped to the human reference genome hg18 for both conditions. Using the 4 hour DHT BED file, we discovered nucleosome positions with the NPS nucleosome position discovery software.

Data preparation using the tagtab script

Using this data and the tagtab script provided with this package, we generate LNCaP_H3K4me2.xls, a summary of tag counts in the flanking nucleosome regions and in the central regions between them. This file can be generated from the tag BED files and the mononucleosome BED file using the tagtab script:
    tagtab LNCaP_DHT_peak.bed LNCaP_H3K4me2_veh.bed,LNCaP_H3K4me2_DHT.bed -o LNCaP_H3K4me2.xls
Here the file LNCaP_DHT_mono_peak.bed specifies the nucleosome positions, and the sequence tag positions are specified by LNCaP_H3K4me2_veh.bed and LNCaP_H3K4me2_DHT.bed. All files MUST be sorted by genomic coordinate; chromosomes may appear in any order. The generated file, LNCaP_H3K4me2.xls, is a table with the format:
    chrom  start   end     veh_nuc  veh_lin  DHT_4h_nuc  DHT_4h_lin
    chr1   828981  829526  53       27       108         81
    chr1   829141  829711  36       30       110         48
    chr1   830376  830861  59       17       159         56
    chr1   830661  831231  48       31       125         110
    chr1   831031  831581  31       4        44          8
    chr1   842141  842721  17       15       47          36

The columns with headers chrom, start, and end contain the genomic locations of nucleosome pairs whose midpoints are separated by 250 to 450 base pairs. The midpoint separation range may also be set using the optional parameters --minsep and --maxsep. Of the remaining columns, those whose headers end in _nuc contain the counts of tags that fall in the nucleosome regions, while those ending in _lin contain the tag counts for the regions between nucleosome pairs. Type tagtab -h for information on the options available when using this script.
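The midpoint-separation rule described above can be illustrated with a minimal Python sketch. This is not tagtab's implementation: the function name pair_nucleosomes, the choice to pair only consecutive nucleosomes, and the example coordinates are all assumptions made for illustration. The default arguments simply mirror the 250 to 450 bp range quoted above.

```python
def pair_nucleosomes(nucleosomes, minsep=250, maxsep=450):
    """Pair consecutive nucleosomes whose midpoints are minsep..maxsep bp apart.

    nucleosomes: list of (chrom, start, end) tuples, sorted by start
    within each chromosome (hypothetical input layout).
    Returns (chrom, start_of_first, end_of_second) tuples.
    """
    pairs = []
    for left, right in zip(nucleosomes, nucleosomes[1:]):
        if left[0] != right[0]:
            continue  # never pair nucleosomes on different chromosomes
        left_mid = (left[1] + left[2]) // 2
        right_mid = (right[1] + right[2]) // 2
        if minsep <= right_mid - left_mid <= maxsep:
            pairs.append((left[0], left[1], right[2]))
    return pairs


# Made-up coordinates, not taken from the sample data set.
example = [
    ("chr1", 1000, 1150),
    ("chr1", 1300, 1450),
    ("chr1", 1400, 1550),
    ("chr1", 2500, 2650),
    ("chr2", 2800, 2950),
]
pairs = pair_nucleosomes(example)
print(pairs)
```

Only the first two nucleosomes are paired here: their midpoints are 300 bp apart, while the other consecutive midpoints are 100 bp and 1100 bp apart, and the last two lie on different chromosomes.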

DNA motif identification using the positional analysis method of the binoch script

To discover DNA sequence motifs that are localized near the midpoint between paired nucleosomes with high nucleosome stabilization-destabilization (NSD) scores, we run the binoch script:

    binoch LNCaP_H3K4me2.xls -a pos -g hg18 -n 0.01 --nuc=veh_nuc,DHT_4h_nuc --lin=veh_lin,DHT_4h_lin -o LNCaP_H3K4me2_DHT_pos.txt
Here the -a pos option selects the motif position analysis. The genome version is specified by the -g hg18 option. The option -n 0.01 specifies that the top 1% of NSD-scoring regions are used in the analysis. The columns containing tag counts in flanking nucleosome regions are specified by their header labels with --nuc=veh_nuc,DHT_4h_nuc. Similarly, the columns containing tag counts in the region between flanking nucleosomes are specified with --lin=veh_lin,DHT_4h_lin. The --nuc and --lin options are comma-delimited lists that must be ordered so that each column of flanking nucleosome counts is matched by the corresponding column of tag counts from the central region.
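The ordering requirement for --nuc and --lin, and the meaning of a fraction such as -n 0.01, can be sketched in Python as follows. This is an illustration only and not code from binoch: the helper names match_columns and top_fraction are hypothetical, and the scores in the example are made-up numbers rather than NSD scores computed by BINOCh.

```python
def match_columns(nuc, lin):
    """Pair each flanking-nucleosome column with its central-region column.

    nuc and lin are comma-delimited header lists, as passed to --nuc and --lin.
    """
    nuc_cols = nuc.split(",")
    lin_cols = lin.split(",")
    if len(nuc_cols) != len(lin_cols):
        raise ValueError("--nuc and --lin must list the same number of columns")
    return list(zip(nuc_cols, lin_cols))


def top_fraction(scores, fraction):
    """Return the indices of the highest-scoring fraction of regions (at least one)."""
    keep = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:keep]


column_pairs = match_columns("veh_nuc,DHT_4h_nuc", "veh_lin,DHT_4h_lin")
print(column_pairs)

# Made-up scores for ten regions; keep the top 20%.
scores = [0.2, 3.1, -1.0, 2.5, 0.0, 1.7, 0.9, -0.3, 2.9, 0.4]
selected = top_fraction(scores, 0.2)
print(selected)
```

Pairing by position is why the order matters: swapping the entries of one list but not the other would match vehicle nucleosome counts with DHT central-region counts.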

The output from this analysis is in the format:

    ID sym consensus numhits mean cutoff zscore pval
    M00957 PR .......G.AC....TGTTCT.... 242 -0.11 6.46 -5.94 1.40e-09
    M00956 AR ......GG.AC....TGTTCT.... 360 -0.09 4.91 -5.59 1.14e-08
    M00447 AR AGTAC.T.WTGTTCT 329 -0.09 5.51 -5.45 2.50e-08
    M00953 AR ......GG.ACA..GTGTTCT.... 179 -0.12 6.08 -5.32 5.12e-08
    M01012 HNF3 .....TGTTTR....... 870 -0.05 5.64 -5.09 1.75e-07
    M00481 AR GG.ACA...TGT.CT 237 -0.10 6.83 -5.01 2.66e-07
    M00791 HNF3 ....ACAAACA.. 968 -0.04 5.29 -4.50 3.38e-06
    M00954 PR .......G.A.....TGTTCT.... 321 -0.08 6.03 -4.44 4.42e-06
    M00724 HNF3alpha TGTTTGTTTT. 685 -0.05 6.56 -4.30 8.38e-06
    M00292 Freac-4 CTTAAGTAAACA.... 835 -0.04 2.49 -3.86 5.56e-05

The columns have the following meanings:

DNA motif identification using the enrichment analysis method of the binoch script

To discover DNA sequence motifs that are enriched in the high NSD-scoring regions relative to those regions that have neutral NSD scores we run the binoch script:

    binoch LNCaP_H3K4me2.xls -a enrich -n 0.01 -g hg18 -w 200 --nuc=veh_nuc,DHT_4h_nuc --lin=veh_lin,DHT_4h_lin -o LNCaP_H3K4me2_DHT_binom.txt
Here the -a enrich option is for the motif enrichment analysis. The -n 0.01 option specifies the fraction of the total number of paired nucleosome regions that is to be used in the comparison of top NSD-scoring and neutral NSD-scoring regions. The option -w 200 specifies 200 bp from the center of candidate regions to be used as the basis of the analysis. The remaining options are as before.
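For readers who want a feel for how hit counts in foreground and background regions could be turned into a p-value, here is a minimal Python sketch of an upper-tail binomial test. This is an assumption about one plausible approach, suggested only by the "binom" in the output file name; this page does not state how binoch computes its p-values, so the sketch should not be expected to reproduce the pval column. The function names and the example counts are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return math.exp(peak) * sum(math.exp(v - peak) for v in logs)


# Hypothetical counts: 30 of 100 foreground regions contain a motif hit,
# against 10 of 100 background regions.
n_regions = 100
fg_hits = 30
bg_hits = 10
background_rate = bg_hits / n_regions  # assumed estimate of the null hit rate
pval = binom_upper_tail(fg_hits, n_regions, background_rate)
print(pval)
```

Working in log space avoids overflow in the binomial coefficients when the number of regions is in the thousands, as it is in the sample output.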

The output from this analysis is in the format:

    ID sym consensus numhits_fg numhits_bg n cutoff pval
    M00481 AR GG.ACA...TGT.CT 283 112 1175 5 4.21e-48
    M00956 AR ......GG.AC....TGTTCT.... 250 95 1175 5 6.39e-45
    M00953 AR ......GG.ACA..GTGTTCT.... 194 64 1175 5 1.69e-42
    M00957 PR .......G.AC....TGTTCT.... 313 157 1175 5 1.38e-33
    M00192 GR ...........TGT.CT.. 655 458 1175 5 2.83e-31
    M00955 GR .......G..C....TGTTCT.... 343 190 1175 5 4.82e-29
    M00921 GR ..TGT.CT 798 615 1175 5 1.67e-27
    M00954 PR .......G.A.....TGTTCT.... 301 164 1175 5 3.90e-26
    M00960 PR,GR ...AGAACA. 664 494 1175 5 1.49e-23
    M00290 Freac-2 .....GTAAACA.... 418 301 1175 5 2.77e-14

The columns have the following meanings:

Transcription factor motif libraries available with BINOCh

Several motif libraries are available for binoch analysis. These can be specified using the -m option:

Options available when using the binoch script

To use:

    binoch [options] TABLEFILENAME

TABLEFILENAME can be either a .xls file or a .bed file. Regions in .xls files are assumed to contain the paired nucleosome positions and are trimmed and padded in the position-based analysis. .xls files can be generated using the tagtab script from nucleosome position .bed files and mapped ChIP-seq histone modification .bed files. Regions in .bed files are analysed with no sequence splicing in the position-based analysis. If the --lin option is unspecified, scores are computed from the --nuc columns only.

Type binoch -h for information on the options.

Options:

Registration is simple:

note: If you don't want to receive any email from the group, please remember to set the 'Delivery' type of your account as 'No Email'.
