This manual is intended for two groups of users.
Biologists: to learn where to download the data and tools, and what each output file represents.
Developers: to make sure they are testing with the right data format and the right tool versions, to follow a consistent naming convention, and to find each file easily.
The pipeline generates several temporary files along the way: some are used only for settings and transitions, some feed the next step, and the rest are meant for publishing and for interpreting the biological story. Below are three sections of tables that give the universal naming rules.
Note
%(DatasetID)s denotes the ID you enter in the basis section of ChiLin.conf, %(treat_rep)s the treatment replicate number, and %(control_rep)s the control replicate number; see Get Started. Use a clear term for the ID. For example, replace DatasetID below with the factor name plus a number of your choice. If the data is published, the GSE ID is recommended.
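The placeholders above follow Python's `%`-style dictionary formatting, so a file name can be derived from a template as in the sketch below. The ID `ESR1_01` and the replicate number are made-up example values, not defaults shipped with ChiLin.

```python
# Hypothetical example values; in practice they come from ChiLin.conf.
names = {"DatasetID": "ESR1_01", "treat_rep": 1, "control_rep": 1}

# Templates taken from the file name tables in this manual.
rep_peaks = "%(DatasetID)s_rep%(treat_rep)s_peaks.bed" % names
all_peaks = "%(DatasetID)s_peaks.bed" % names
control_bw = "%(DatasetID)s_control.bw" % names

print(rep_peaks)   # ESR1_01_rep1_peaks.bed
print(all_peaks)   # ESR1_01_peaks.bed
print(control_bw)  # ESR1_01_control.bw
```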
Single-end FASTQ files and AB SOLiD color-space FASTQ files are the supported raw read formats. Users can also supply BAM files that have already been mapped to the genome.
ChiLin supports the following formats as input:
Format | Type | Instruction |
---|---|---|
FASTQ | Seq | Single-end FASTQ or AB SOLiD color-space FASTQ |
BAM | Mapped | Skip mapping |
ChiLin does not support the following formats in the current version:
Format | Type | Solution |
---|---|---|
SRA | Seq | Use SRA Toolkit to convert to FASTQ format |
BED | Summit | |
BED | Peak | Could be converted to bam files using bedToBam |
wig | Profile | Use wigToBigWig to convert to bigWig |
Bigwig | Profile | Compressed Bigwiggle |
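As a rough illustration of the two tables above, the sketch below checks an input file by its extension before a run. The function name `check_input` and the extension-based test are assumptions made for this example; they are not part of ChiLin.

```python
import os

# Advice strings paraphrase the tables above.
SUPPORTED = {
    ".fastq": "raw reads, mapped with Bowtie",
    ".fq": "raw reads, mapped with Bowtie",  # assumption: .fq is treated like .fastq
    ".bam": "already mapped, mapping is skipped",
}
UNSUPPORTED = {
    ".sra": "convert to FASTQ with the SRA Toolkit",
    ".bed": "peak BED files could be converted to BAM with bedToBam",
}

def check_input(path):
    """Return (is_supported, advice) for a file, judged by extension only."""
    ext = os.path.splitext(path)[1].lower()
    if ext in SUPPORTED:
        return True, SUPPORTED[ext]
    return False, UNSUPPORTED.get(ext, "format is not listed as supported")

print(check_input("treat_rep1.fastq"))
print(check_input("GSM000000.sra"))
```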
This part uses only the tools for quality control of raw FASTQ files.
Mapping the raw reads is the first step in analyzing ChIP-seq data, and every later analysis depends on it.
Modern high-throughput sequencers can generate tens of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, you should always perform some simple quality control checks to ensure that the raw data looks good and that there are no problems or biases in your data which may affect how you can usefully use it.
Here, we have chosen Bowtie to map the raw reads with standard parameters. Below is the example command line that the Python script sets for hg19:
> /usr/local/bin/bowtie -p 1 -S -m 1 /pathtoIndex/hg19 /pathto/treat /outputdirectory/treat.sam
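A Python script could assemble the same command as an argument list, which avoids shell quoting problems when it is later passed to `subprocess.run`. This is only a sketch: the helper name `bowtie_command` is hypothetical, and the paths are the placeholders from the command above.

```python
import shlex

def bowtie_command(index, reads, out_sam, threads=1,
                   bowtie="/usr/local/bin/bowtie"):
    """Build the Bowtie argument list used in the example above.

    -p  number of threads
    -S  write SAM output
    -m 1  drop reads that have more than one reportable alignment
    """
    return [bowtie, "-p", str(threads), "-S", "-m", "1",
            index, reads, out_sam]

cmd = bowtie_command("/pathtoIndex/hg19", "/pathto/treat",
                     "/outputdirectory/treat.sam")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```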
Built-in tools extract quality-control information from Bowtie's standard output and from the SAM files to compute the following descriptive statistics.
1. temporary files
Content | File Name | Tool used |
---|---|---|
Bowtie treat files | %(DatasetID)s_treat_rep%(treat_rep)s.sam | Bowtie |
Bowtie control files | %(DatasetID)s_control_rep%(control_rep)s.sam | Bowtie |
Bowtie temporary summary | bowtie.tmp | Bowtie |
2. output files
Bowtie's SAM results are converted to binary BAM format with samtools to reduce file size:
samtools view -bt chrom_len -o treat.bam treat.sam
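The mapping statistics can be read from the summary that Bowtie prints at the end of a run (kept as bowtie.tmp above). The sketch below assumes the usual Bowtie 1 summary wording; the function name and the numbers in the example text are made up for illustration, and this is not ChiLin's own parser.

```python
import re

def parse_bowtie_summary(text):
    """Extract total and mapped read counts from a Bowtie 1 summary."""
    total = re.search(r"# reads processed: (\d+)", text)
    mapped = re.search(
        r"# reads with at least one reported alignment: (\d+)", text)
    if not (total and mapped):
        raise ValueError("text does not look like a Bowtie summary")
    total_reads = int(total.group(1))
    mapped_reads = int(mapped.group(1))
    ratio = mapped_reads / total_reads if total_reads else 0.0
    return {"total": total_reads, "mapped": mapped_reads,
            "mapped_ratio": ratio}

# Made-up summary, for illustration only.
example = """# reads processed: 1000
# reads with at least one reported alignment: 800 (80.00%)
# reads that failed to align: 150 (15.00%)
# reads with alignments suppressed due to -m: 50 (5.00%)
Reported 800 alignments to 1 output stream(s)
"""
stats = parse_bowtie_summary(example)
print(stats)
```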
This part is for users who have BAM files instead of raw-read FASTQ files. ChiLin converts the BAM files to FASTQ files so that they can go through the whole pipeline. The only difference in usage is that you enter files with the bam suffix in ChiLin.conf.
The conversion tool used here is bedtools bamToFastq:
bamToFastq -i x.bam -fq test.fq
We call peaks with MACS2, with parameters set so that only non-redundant tags are used in further analysis. The false discovery rate (FDR) cutoff is set to 0.01, and only one tag is kept among duplicate tags to remove possible bias. In the default setting, MACS2 is run without building a model. The standard command line is:
macs2 callpeak -B -q 0.01 --keep-dup 1 --shiftsize=73 --nomodel -t /pathto/treat.bam -c /pathto/control.bam -n macsname
1. Before peak calling, the BAM files are passed to the MACS2 filterdup subcommand to compute the non-redundant rate of mapped reads; the higher this rate, the better the data quality.
2. After peak calling, three measurements are reported: total peak count, confident peak count, and shift size (optional, used when model=Yes; see Dive into conf).
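The non-redundant rate can be thought of as the fraction of mapped tags that remain when only one tag is kept per position and strand. The sketch below illustrates that idea on a toy list; it is not the MACS2 filterdup implementation, and the tag tuples are invented.

```python
def non_redundant_rate(tags):
    """Fraction of tags left after keeping one tag per (chrom, start, strand)."""
    tags = list(tags)
    if not tags:
        return 0.0
    return len(set(tags)) / len(tags)

# Toy tags: the first two are exact duplicates.
tags = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr2", 500, "+"),
]
rate = non_redundant_rate(tags)
print(rate)  # 0.75, since 3 of the 4 tags are unique
```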
Content | File Name | Tool used |
---|---|---|
separate treat bedGraph file | %(DatasetID)s_treat_rep%(treat_rep)s.bdg | MACS2 |
separate control MACS bedGraph file | %(DatasetID)s_control_rep%(control_rep)s.bdg | MACS2 |
Overall MACS bedGraph file | %(DatasetID)s_treat.bdg | MACS2 |
bedGraph temporary file (remove exceptions) | %(DatasetID)s_treat.bdg.tmp | MACS2 |
sorted bed (to get top peaks) | %(DatasetID)s_sorted.bed | Linux sort |
top 1000 peaks (for later MDSeqPos) | %(DatasetID)s_top1000_summits.bed | MACS |
MACS encodePeak (macs2 output) | %(DatasetID)s_treat_rep%(treat_rep)s_peaks.encodePeak | MACS |
treatrep_pq_table | %(DatasetID)s_rep%(treat_rep)s_pq_table.txt | MACS |
pq_table | %(DatasetID)s_pq_table.txt | MACS |
treatrep_pvalue | %(DatasetID)s_rep%(treat_rep)s_treat_pvalue.bdg | MACS |
treat_pvalue | %(DatasetID)s_treat_pvalue.bdg | MACS |
treatrep_qvalue | %(DatasetID)s_rep%(treat_rep)s_treat_qvalue.bdg | MACS |
lambda_bdg | %(DatasetID)s_rep%(control_rep)s_control_lambda.bdg | MACS |
Content | File Name | Tool used |
---|---|---|
treatreppeaks | %(DatasetID)s_rep%(treat_rep)s_peaks.bed | MACS |
treatpeaks | %(DatasetID)s_peaks.bed | MACS |
treatrepbw | %(DatasetID)s_treat%(treat_rep)s.bw | MACS |
treatbw | %(DatasetID)s_treat.bw | MACS |
controlrepbw | %(DatasetID)s_rep%(treat_rep)s_control | MACS |
controlbw | %(DatasetID)s_control.bw | MACS |
peaksrepxls | %(DatasetID)s_rep%(treat_rep)s_peaks.xls | MACS |
peaksxls | %(DatasetID)s_peaks.xls | MACS |
summitsrep | %(DatasetID)s_rep%(treat_rep)s_summits.bed | MACS |
summits | %(DatasetID)s_summits.bed | MACS |
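The sorted BED file and the top 1000 summits listed above come from ranking summits by score. A minimal Python equivalent of that sort-and-take-top step is sketched below, assuming a five-column BED with the score in column 5; the function name `top_peaks` and the sample lines are hypothetical.

```python
def top_peaks(bed_lines, n=1000):
    """Return the n highest-scoring BED lines (score read from column 5)."""
    records = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        records.append((float(fields[4]), fields))
    records.sort(key=lambda rec: rec[0], reverse=True)
    return ["\t".join(fields) for _, fields in records[:n]]

# Invented summit lines, for illustration only.
summits = [
    "chr1\t100\t101\tpeak_1\t12.5",
    "chr1\t900\t901\tpeak_2\t48.0",
    "chr2\t300\t301\tpeak_3\t30.1",
]
best = top_peaks(summits, n=2)
print("\n".join(best))
```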
Focus on visualizing the similarity between replicates.
* Draw a Venn diagram of the peaks if there are fewer than 3 replicates (treatment or control)
* Plot the correlation of average peak scores across whole-genome regions
The R code is searched with regular expressions to extract the parts needed to generate the QC report.
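The replicate correlation mentioned above compares per-region average scores between two replicates. ChiLin produces this through generated R code; the Python sketch below only illustrates the calculation, and the choice of the Pearson coefficient is an assumption, as are the example score vectors.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)

# Invented average scores for the same regions in two replicates.
rep1 = [1.0, 2.0, 3.0, 4.0]
rep2 = [2.0, 4.0, 6.0, 8.0]
r = pearson(rep1, rep2)
print(round(r, 6))
```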
Content | File Name | Tool used |
---|---|---|
correlation plot code | %(DatasetID)s_cor.R | Built-in tools |
DHS peaks intersection | %(DatasetID)s_bedtools_dhs.txt | External Tools |
overlap with velcro region | %(DatasetID)s_bedtools_velcro.txt | External Tools |
peaks overlapped | %(DatasetID)s_overlapped_bed | External Tools |
AnnotationQC | %(DatasetID)s_Metagene_distribution.pdf | R |
AnnotationQC | %(DatasetID)s_peak_height_distribution.pdf | R |
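The DHS and velcro entries above record how many peaks overlap a reference set of regions. The pipeline relies on external tools for this; the sketch below only shows the underlying overlap test on half-open BED intervals, using a simple quadratic scan and invented coordinates.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def overlap_fraction(peaks, regions):
    """Fraction of peaks that overlap at least one reference region."""
    if not peaks:
        return 0.0
    hits = sum(1 for p in peaks if any(overlaps(p, r) for r in regions))
    return hits / len(peaks)

# Invented intervals, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
dhs = [("chr1", 150, 300), ("chr2", 200, 400)]
fraction = overlap_fraction(peaks, dhs)
print(fraction)
```

Only the first peak counts as overlapping here: the chr2 peak ends exactly where the chr2 region starts, and half-open intervals that merely touch do not overlap.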
This part focuses on the association between intervals (the result of peak calling) and traits such as genome annotation.
CEAS part
Content | File Name | Tool used |
---|---|---|
CEAS script | %(DatasetID)s_ceas_CI.R | CEAS |
CEAS script | %(DatasetID)s_ceas_CI.pdf | CEAS |
CEAS xls | %(DatasetID)s_ceas.xls | CEAS |
CEAS R script | %(DatasetID)s_ceas.R | CEAS |
CEAS result pdf | %(DatasetID)s_ceas.pdf | CEAS |
Conservation analysis part
Content | File Name | Tool used |
---|---|---|
conservtopsummits | %(DatasetID)s_top3000summits.bed | built-in tools |
conservR | %(DatasetID)s_conserv.R | built-in tools |
conservpng | %(DatasetID)s_conserv.png | built-in tools |
For motif analysis, we combine a de novo motif-finding algorithm, MDscan, with a database-based search algorithm, SeqPos.
Content | File Name | Tool used |
---|---|---|
summitspeaks1000 | %(DatasetID)s_summits_p1000.bed | linux tools |
bgfreq | %(DatasetID)s_bgfreq | External Tools |
seqpos | %(DatasetID)s_seqpos.zip | |
Extract all genes upstream or downstream of the predicted peaks for functional clustering or annotation.
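One simple way to picture this step is to collect every gene whose transcription start site lies within a fixed distance of a peak summit. The sketch below does that; the 10 kb window, the function name, and the gene records are all hypothetical and do not reflect ChiLin's actual settings.

```python
def genes_near_peaks(summits, genes, max_distance=10000):
    """Return sorted names of genes whose TSS is within max_distance of a summit.

    summits: iterable of (chrom, position)
    genes:   iterable of (name, chrom, tss_position)
    """
    found = set()
    for chrom, pos in summits:
        for name, gene_chrom, tss in genes:
            if gene_chrom == chrom and abs(tss - pos) <= max_distance:
                found.add(name)
    return sorted(found)

# Invented summits and gene records, for illustration only.
summits = [("chr1", 50000), ("chr2", 120000)]
genes = [
    ("geneA", "chr1", 45000),   # 5 kb upstream of the chr1 summit
    ("geneB", "chr1", 90000),   # 40 kb away, outside the window
    ("geneC", "chr2", 125000),  # 5 kb downstream of the chr2 summit
]
nearby = genes_near_peaks(summits, genes)
print(nearby)  # ['geneA', 'geneC']
```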
You can check your top-rated peaks in the Cistrome Radar and Finder to find interesting associated results.
An example QC report is available here: QC.
Note
The output format is optional (default: PDF). The output below is written to the root directory, that is, the folder named after ${DatasetID}.
Provides an overall report of the whole pipeline for viewing the general results.
Based on the ChIP-seq pipeline and the Cistrome DC database, the QC program generates a comprehensive quality control report for a particular dataset, together with its result relative to the whole DC database.
* QC report summary information
Gives an overview of the pass or fail status of every measurement.