Manual

This manual is intended for two groups of users.

Biologists: to learn where to download the data and tools, and what each output file represents.

Developers: to make sure they are using the right data format and the right tool version for testing, and to follow a consistent naming convention so that each file is easy to find.

The pipeline generates several kinds of files along the way: some are used only for settings and transitions, others feed the next step, and the rest are final results for publishing and interpreting a biological story. The sections below list the universal naming rules in tables.

Note

%(DatasetID)s denotes the ID you enter in the basis section of ChiLin.conf, %(treat_rep)s the treatment replicate number, and %(control_rep)s the control replicate number (see Get Started). Use a clear term for the ID, for example the factor name plus a number of your choice, in place of DatasetID below. If the data are published, the GSE ID is recommended.
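The placeholders look like Python's named %-style formatting, so a file name can be expanded from a dictionary of values. The sketch below rests on that assumption; the helper name `expand_name` and the ID `ESR1_01` are made up for illustration and are not part of ChiLin.

```python
# Hypothetical helper: expand a ChiLin-style file name template.
def expand_name(template, dataset_id, treat_rep=None, control_rep=None):
    values = {"DatasetID": dataset_id}
    if treat_rep is not None:
        values["treat_rep"] = treat_rep
    if control_rep is not None:
        values["control_rep"] = control_rep
    # Named %-formatting raises KeyError if the template needs a missing value.
    return template % values


print(expand_name("%(DatasetID)s_treat_rep%(treat_rep)s.sam", "ESR1_01", treat_rep=1))
print(expand_name("%(DatasetID)s_peaks.bed", "ESR1_01"))
```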

Raw Data

Supported formats

Single-end FASTQ files and ABI SOLiD color-space FASTQ files are the supported raw read formats. Users can also supply BAM files that have already been mapped to the genome.

ChiLin supports the following input formats:

Format ; Type ; Instruction
FASTQ ; Seq ; single-end FASTQ or ABI SOLiD color-space FASTQ
BAM ; Mapped ; Skip mapping

ChiLin does not support the following formats in the current version:

Format ; Type ; Solution
SRA ; Seq ; Use SRA Toolkit to convert to FASTQ format
BED ; Summit ;
BED ; Peak ; Can be converted to BAM files using bedToBam
wig ; Profile ; Use wigToBigWig to convert to bigWig
bigWig ; Profile ; Compressed bigWig

Quality Control

This part only uses tools for quality control of the raw FASTQ files.

  1. Tools involved: we integrated Babraham’s FastQC to assess the raw FASTQ files and extract the sequence quality scores from its summary.
  2. Historic data: the scores of new data are stored and compared with the historic data of the whole Cistrome DC project, which is collected by all DC team members and saved in an sqlite3 database for indexing.
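A comparison against historic scores held in sqlite3 could look like the sketch below. The table name `fastqc_history`, its columns, and the scores are all invented for illustration; the real Cistrome DC schema is not described in this manual.

```python
import sqlite3


def fraction_below(conn, score):
    """Return the fraction of historic datasets scoring below `score`."""
    total, below = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(score < ?), 0) FROM fastqc_history",
        (score,),
    ).fetchone()
    return below / total if total else 0.0


# In-memory database with a hypothetical schema and made-up scores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fastqc_history (dataset_id TEXT, score REAL)")
conn.executemany(
    "INSERT INTO fastqc_history VALUES (?, ?)",
    [("d1", 20.0), ("d2", 28.0), ("d3", 32.0), ("d4", 36.0)],
)
print(fraction_below(conn, 30.0))
```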

Output of Raw

  1. temporary files
  2. final results: none are packaged in this step with the default settings

Reads Mapping

Mapping the raw reads is the first step in analyzing ChIP-seq data, and all following analysis depends on it.

Data analysis

Modern high-throughput sequencers can generate tens of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, you should always perform some simple quality control checks to ensure that the raw data look good and that there are no problems or biases in the data which may affect how you can usefully use it.

Here, we use Bowtie to map the raw reads with standard parameters. Below is an example of the command line set in the Python script for hg19:

> /usr/local/bin/bowtie -p 1 -S -m 1 /pathtoIndex/hg19 /pathto/treat /outputdirectory/treat.sam
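As a rough sketch of how a Python script might assemble this command, the function below builds the same argument list. The function name and its parameters are hypothetical; only the flags come from the command above (`-p` sets the number of threads, `-S` requests SAM output, and `-m 1` suppresses reads with more than one reportable alignment).

```python
def bowtie_command(index, reads, out_sam, threads=1,
                   bowtie="/usr/local/bin/bowtie"):
    """Return the argument list for the Bowtie call shown above."""
    return [bowtie, "-p", str(threads), "-S", "-m", "1", index, reads, out_sam]


cmd = bowtie_command("/pathtoIndex/hg19", "/pathto/treat",
                     "/outputdirectory/treat.sam")
print(" ".join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```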

Quality Control

Built-in tools extract the information needed for quality control from the standard output of Bowtie and from the SAM files, and compute descriptive statistics from it.
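The extraction step could be approximated as follows. This is only a sketch: it assumes the usual Bowtie 1 summary layout, the example numbers are invented, and `parse_bowtie_summary` is a hypothetical name rather than the ChiLin built-in tool.

```python
import re

# Example summary in the style printed by Bowtie 1; the numbers are invented.
SUMMARY = """\
# reads processed: 1000
# reads with at least one reported alignment: 700 (70.00%)
# reads that failed to align: 200 (20.00%)
# reads with alignments suppressed due to -m: 100 (10.00%)
Reported 700 alignments to 1 output stream(s)
"""


def parse_bowtie_summary(text):
    """Return total reads, mapped reads and the mappable reads ratio."""
    total = int(re.search(r"reads processed:\s*(\d+)", text).group(1))
    mapped = int(
        re.search(r"at least one reported alignment:\s*(\d+)", text).group(1)
    )
    return {"total": total, "mapped": mapped, "ratio": mapped / total}


stats = parse_bowtie_summary(SUMMARY)
print(stats)
```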

Output of Mapping

  1. temporary files
Content ; File Name ; Tool used
Bowtie treat files ; %(DatasetID)s_treat_rep%(treat_rep)s.sam ; Bowtie
Bowtie control files ; %(DatasetID)s_control_rep%(control_rep)s.sam ; Bowtie
Bowtie temporary summary ; bowtie.tmp ; Bowtie

  2. output files: the Bowtie SAM results are converted to the binary BAM format with samtools to minimize file size:

samtools view -bt chrom_len -o out.bam in.sam

Groom

This part is designed for users who do not have raw read FASTQ files but have BAM files instead. ChiLin converts the BAM files to FASTQ files so that they can go through the whole pipeline. The only difference in usage is to give files with the bam suffix as input in ChiLin.conf.

The conversion tool used here is bedtools bamToFastq:

bamToFastq -i x.bam -fq test.fq
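A suffix check of this kind might be sketched as below. The function name and the rule for deriving the FASTQ file name are assumptions for illustration; only the `bamToFastq -i ... -fq ...` call comes from the command above.

```python
def groom_command(path):
    """Return a bamToFastq argument list for BAM input, or None otherwise."""
    if not path.lower().endswith(".bam"):
        return None  # not a BAM file, so no conversion is needed
    fastq = path[:-len(".bam")] + ".fq"  # assumed naming rule
    return ["bamToFastq", "-i", path, "-fq", fastq]


print(groom_command("x.bam"))
print(groom_command("treat.fastq"))
```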

Peak Calling

Data Analysis

We call peaks with MACS2, with parameters set to meet the requirement of non-redundant tags for further analysis. The false discovery rate (FDR) cutoff is set to 0.01, and only one tag is kept among duplicate tags to remove possible bias. With the default settings, MACS2 does not build a model. The standard command line is:

macs2 callpeak -B -q 0.01 --keep-dup 1 --shiftsize=73 --nomodel  -t /pathto/treat.bam  -c /pathto/control.bam -n macsname
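The same call can be assembled in Python, which makes the tunable values explicit. This is a sketch only; the function and parameter names are hypothetical, and the flags are copied from the command above.

```python
def macs2_command(treat_bam, control_bam, name, qvalue=0.01, shiftsize=73):
    """Return the argument list for the MACS2 call shown above."""
    return [
        "macs2", "callpeak", "-B",
        "-q", str(qvalue),               # FDR cutoff
        "--keep-dup", "1",               # keep one tag per duplicate set
        "--shiftsize=%d" % shiftsize,
        "--nomodel",                     # do not build the shifting model
        "-t", treat_bam, "-c", control_bam, "-n", name,
    ]


cmd = macs2_command("/pathto/treat.bam", "/pathto/control.bam", "macsname")
print(" ".join(cmd))
```

Option names can differ between MACS2 releases, so it is worth checking `macs2 callpeak --help` for the installed version before reusing these flags.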

Quality Control

  1. Before peak calling, the BAM files are passed to the MACS2 filterdup subcommand to compute the non-redundant rate of the mapped reads; the higher this rate, the better the data quality.
  2. After peak calling, three measurements are involved: total peak count, confident peak count, and shift size (optional, used when model=Yes; see Dive into conf).
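To make the non-redundant rate concrete, the sketch below computes it for a small list of mapped tags. It assumes the rate is the fraction of reads left after keeping one tag per chromosome, position and strand; the exact definition used by the pipeline may differ, and the reads are invented.

```python
def non_redundant_rate(tags):
    """tags: iterable of (chrom, position, strand), one per mapped read."""
    tags = list(tags)
    if not tags:
        return 0.0
    # A set keeps a single copy of each distinct tag.
    return len(set(tags)) / len(tags)


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the first read
    ("chr1", 100, "-"),
    ("chr2", 50, "+"),
]
print(non_redundant_rate(reads))  # 3 distinct tags out of 4 reads
```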

Output files

  1. temporary files
Content ; File Name ; Tool used
separate treat bedGraph file ; %(DatasetID)s_treat_rep%(treat_rep)s.bdg ; MACS2
separate control MACS bedGraph file ; %(DatasetID)s_control_rep%(control_rep)s.bdg ; MACS2
Overall MACS bedGraph file ; %(DatasetID)s_treat.bdg ; MACS2
bedGraph temporary file (remove exceptions) ; %(DatasetID)s_treat.bdg.tmp ; MACS2
sorted bed (for getting top peaks) ; %(DatasetID)s_sorted.bed ; Linux sort
top 1000 peaks (for later MDSeqpos) ; %(DatasetID)s_top1000_summits.bed ; MACS
MACS encode Peak (macs2 output) ; %(DatasetID)s_treat_rep%(treat_rep)s_peaks.encodePeak ; MACS
treatrep_pq_table ; %(DatasetID)s_rep%(treat_rep)s_pq_table.txt ; MACS
pq_table ; %(DatasetID)s_pq_table.txt ; MACS
treatrep_pvalue ; %(DatasetID)s_rep%(treat_rep)s_treat_pvalue.bdg ; MACS
treat_pvalue ; %(DatasetID)s_treat_pvalue.bdg ; MACS
treatrep_qvalue ; %(DatasetID)s_rep%(treat_rep)s_treat_qvalue.bdg ; MACS
lambda_bdg ; %(DatasetID)s_rep%(control_rep)s_control_lambda.bdg ; MACS
  2. final results
Content ; File Name ; Tool used
treatreppeaks ; %(DatasetID)s_rep%(treat_rep)s_peaks.bed ; MACS
treatpeaks ; %(DatasetID)s_peaks.bed ; MACS
treatrepbw ; %(DatasetID)s_treat%(treat_rep)s.bw ; MACS
treatbw ; %(DatasetID)s_treat.bw ; MACS
controlrepbw ; %(DatasetID)s_rep%(treat_rep)s_control ; MACS
controlbw ; %(DatasetID)s_control.bw ; MACS
peaksrepxls ; %(DatasetID)s_rep%(treat_rep)s_peaks.xls ; MACS
peaksxls ; %(DatasetID)s_peaks.xls ; MACS
summitsrep ; %(DatasetID)s_rep%(treat_rep)s_summits.bed ; MACS
summits ; %(DatasetID)s_summits.bed ; MACS
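The top-peaks file listed among the temporary files can be thought of as the highest-scoring rows of the summits file. The sketch below assumes a tab-separated BED file with the score in the fifth column; the pipeline itself uses Linux sort, and the function name and example rows are illustrative only.

```python
def top_summits(bed_lines, n=1000):
    """Return the n BED lines with the highest score (fifth column)."""
    rows = [line.rstrip("\n").split("\t") for line in bed_lines if line.strip()]
    rows.sort(key=lambda fields: float(fields[4]), reverse=True)
    return ["\t".join(fields) for fields in rows[:n]]


example = [
    "chr1\t100\t101\tpeak_1\t5.2\n",
    "chr1\t900\t901\tpeak_2\t42.0\n",
    "chr2\t300\t301\tpeak_3\t17.5\n",
]
for line in top_summits(example, n=2):
    print(line)
```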

Replicates analysis

Data analysis

Focus on the visualization of similarity between replicates.

  • Draw the Venn diagram of the peaks if there are fewer than 3 replicates (treatment or control)
  • Plot the correlation score of the average peak scores over whole-genome regions
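The manual does not say which correlation measure is used for the replicates. Assuming a Pearson correlation between per-region average scores, the calculation could be sketched as follows, with invented scores.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (sd_x * sd_y)


# Average scores of the same four regions in two replicates (made up).
rep1 = [1.0, 2.0, 3.0, 4.0]
rep2 = [2.0, 4.1, 5.9, 8.2]
print(round(pearson(rep1, rep2), 3))
```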

Quality Control

The R code is searched with regular expressions to extract the parts needed for generating the QC report.

Content ; File Name ; Tool used
correlation plot code ; %(DatasetID)s_cor.R ; Built-in tools
DHS peaks intersection ; %(DatasetID)s_bedtools_dhs.txt ; External Tools
overlap with velcro region ; %(DatasetID)s_bedtools_velcro.txt ; External Tools
peaks overlapped ; %(DatasetID)s_overlapped_bed ; External Tools
AnnotationQC ; %(DatasetID)s_Metagene_distribution.pdf ; R
AnnotationQC ; %(DatasetID)s_peak_height_distribution.pdf ; R

Meta genomics Study

Focuses on the association between intervals (the result of peak calling) and features such as genome annotation.

  • CEAS: Annotates the given intervals and scores with genome features
  • Conservation Plot: Calculates the PhastCons scores in several interval sets

output files

CEAS part

Content ; File Name ; Tool used
CEAS script ; %(DatasetID)s_ceas_CI.R ; CEAS
CEAS script ; %(DatasetID)s_ceas_CI.pdf ; CEAS
CEAS xls ; %(DatasetID)s_ceas.xls ; CEAS
CEAS R script ; %(DatasetID)s_ceas.R ; CEAS
CEAS result pdf ; %(DatasetID)s_ceas.pdf ;

Conservation analysis part

Content ; File Name ; Tool used
conservtopsummits ; %(DatasetID)s_top3000summits.bed ; built-in tools
conservR ; %(DatasetID)s_conserv.R ; built-in tools
conservpng ; %(DatasetID)s_conserv.png ; built-in tools

Motif

Here, we combine the de novo motif finding algorithm MDscan with the database-based search algorithm Seqpos for motif analysis.

output files

Content ; File Name ; Tool used
summitspeaks1000 ; %(DatasetID)s_summits_p1000.bed ; linux tools
bgfreq ; %(DatasetID)s_bgfreq ; External Tools
seqpos ; %(DatasetID)s_seqpos.zip ; MDSeqpos

Other analysis type

GO analysis

Extract all the genes upstream or downstream of the predicted peaks for functional clustering or annotation.
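One simple way to collect such genes is to keep every gene whose transcription start site (TSS) lies within a fixed distance of a peak summit. The sketch below takes that approach; the 10 kb window, the gene names and the coordinates are assumptions for illustration, not values taken from ChiLin.

```python
def genes_near_peaks(summits, genes, max_distance=10000):
    """summits: (chrom, position); genes: (name, chrom, tss).

    Return the sorted names of genes whose TSS is within max_distance
    of any summit, upstream or downstream.
    """
    hits = set()
    for chrom, position in summits:
        for name, gene_chrom, tss in genes:
            if gene_chrom == chrom and abs(tss - position) <= max_distance:
                hits.add(name)
    return sorted(hits)


summits = [("chr1", 15000), ("chr2", 500000)]
genes = [
    ("GENE_A", "chr1", 20000),
    ("GENE_B", "chr1", 90000),
    ("GENE_C", "chr2", 495000),
]
print(genes_near_peaks(summits, genes))
```

The resulting gene list is what would then be passed to a GO annotation or clustering tool.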

Cistrome Radar/ Finder

You can check your top-rated peaks in the Cistrome Radar and Finder to find interesting associated results.

Summary Report

Data Analysis Summary text

Quality report Instruction

An example QC report is here QC.

Note

The output format is optional (default: PDF). The outputs below are written to the root directory, that is, the folder named after ${DatasetID}.

The summary report covers the whole pipeline and gives a general view of the results.

Based on the ChIP-seq pipeline and the Cistrome DC database, the QC program generates a comprehensive quality control report for a particular dataset, together with its result relative to the whole DC database.

  • QC report summary information

Gives an overview of the pass or fail status of every measurement:
  • Basic information: Species, Cell Type, Tissue Origin, Cell line, Factor, Experiment, Platform, Treatment and Control.
  • Reads Genomic Mapping QC measurement: QC of raw sequence data with FastQC, FastQC score distribution, Basic mapping QC statistics, Mappable reads ratio, Mappable Redundant rate.
  • Peak calling QC measurement: Peak calling summary, High confident Peak, Peaks overlapped with DHS (DNase hypersensitivity sites), Velcro ratio (human only), Profile correlation within union peak regions, Peaks overlap between Replicates.
  • Functional Genomic QC measurement: Peak Height distribution, Meta Gene distribution, Peak conservation score, Motif QC measurement analysis.