Manual¶

This manual is intended for two groups of users.

Biologists: to learn where to download the data and tools, and what each output file represents.

Developers: to make sure they’re using the right data format and the right tool version for testing, and to follow a consistent naming convention so that each file is easy to find.

The pipeline generates several temporary files along the way. Some are used only for settings and transitions, others feed the next step, and the rest are for publishing and interpreting a biological story. The sections below list the universal naming rules in tables.

Note

%(DatasetID)s denotes the id you enter in the basis section of ChiLin.conf, %(treat_rep)s the treatment replicate number, and %(control_rep)s the control replicate number (see Get Started). Use a clear term for the id; for example, use the factor name plus a number of your choice in place of DatasetID below. If the data is published, the GSE ID is recommended.
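The placeholders in this note follow Python's dictionary-based `%` string formatting. The sketch below shows how such a file name template could be expanded. The helper `expand_name` and the id `GSE12345` are hypothetical and only serve as an illustration; they are not part of ChiLin.

```python
def expand_name(template, **fields):
    """Fill %(name)s placeholders in a file name template."""
    return template % fields


# Hypothetical values, for illustration only.
sam_name = expand_name(
    "%(DatasetID)s_treat_rep%(treat_rep)s.sam",
    DatasetID="GSE12345",
    treat_rep=1,
)
print(sam_name)
```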

Raw Data¶

supported format¶

Single-end FASTQ files and AB SOLiD color-space FASTQ files are the supported raw read formats. Users can also supply BAM files that have already been mapped to the genome.

ChiLin supports the following formats as input:

Format	Type	Instruction
FASTQ	Seq	Single-end FASTQ or AB SOLiD color-space FASTQ
BAM	Mapped	Skip mapping

ChiLin does not support the following formats in the current version:

Format	Type	Solution
SRA	Seq	Use SRA Toolkit to convert to FASTQ format
BED	Summit
BED	Peak	Could be converted to BAM files using bedToBam
wig	Profile	Use wigToBigWig to convert to bigWig
Bigwig	Profile	Compressed Bigwiggle
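As a rough illustration of the two tables above, the sketch below maps a file extension to the suggested action. The function name `input_action` and the extension list (for example `.fq` and `.bw`) are assumptions made for this example, not ChiLin behaviour.

```python
import os

# Suggested action per file extension, following the two tables above.
# The extensions themselves are assumptions for this sketch.
ACTIONS = {
    ".fastq": "map",
    ".fq": "map",
    ".bam": "skip mapping",
    ".sra": "convert to FASTQ with SRA Toolkit",
    ".bed": "unsupported",
    ".wig": "unsupported",
    ".bw": "unsupported",
}


def input_action(path):
    """Return the suggested action for an input file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    return ACTIONS.get(ext, "unknown format")
```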

Quality Control¶

This part uses tools for quality control of the raw FASTQ files only.

Tools involved: We integrated Babraham’s FastQC to assess the raw FASTQ files and extract sequence quality scores from its summary.
Historic data: New data is stored for comparison with all Cistrome DC project historic data, which is collected by all DC team members and saved in a sqlite3 database for indexing.

Output of Raw¶

temporary files

By default, this step produces no final results for packaging.

Reads Mapping¶

Mapping the raw reads is the first step in analyzing ChIP-seq data, and all subsequent analysis depends on it.

Data analysis¶

Modern high-throughput sequencers can generate tens of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, you should always perform some simple quality control checks to ensure that the raw data looks good and that there are no problems or biases in your data which may affect how you can usefully use it.

We use Bowtie to map the raw reads with standard parameters. Below is the example command line set in the Python script for hg19:

> /usr/local/bin/bowtie -p 1 -S -m 1 /pathtoIndex/hg19 /pathto/treat /outputdirectory/treat.sam
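A minimal sketch of how the same command could be assembled as an argument list in Python. The function name `bowtie_command` is hypothetical; the flags and paths are taken from the example command above.

```python
def bowtie_command(index, reads, sam_out, threads=1,
                   bowtie="/usr/local/bin/bowtie"):
    """Return the Bowtie argument list matching the example command."""
    return [
        bowtie,
        "-p", str(threads),  # number of threads
        "-S",                # write SAM output
        "-m", "1",           # drop reads with more than one reportable alignment
        index,
        reads,
        sam_out,
    ]


cmd = bowtie_command("/pathtoIndex/hg19", "/pathto/treat",
                     "/outputdirectory/treat.sam")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real ones.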

Quality Control¶

Built-in tools extract quality control information from the standard output of Bowtie and from the SAM files to compute descriptive statistics.
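The sketch below illustrates one way such statistics could be pulled out of a Bowtie summary. It assumes the summary uses the usual Bowtie 1 wording (`# reads processed: ...` and so on); the function name, the dictionary keys, and the sample numbers are hypothetical and not taken from ChiLin.

```python
import re

# Assumed Bowtie 1 summary wording; adjust the patterns if your version differs.
PATTERNS = {
    "total": r"# reads processed: (\d+)",
    "mapped": r"# reads with at least one reported alignment: (\d+)",
    "unmapped": r"# reads that failed to align: (\d+)",
    "suppressed": r"# reads with alignments suppressed due to -m: (\d+)",
}


def parse_bowtie_summary(text):
    """Extract read counts and the mapped ratio from a Bowtie summary."""
    stats = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            stats[key] = int(match.group(1))
    if stats.get("total"):
        stats["mapped_ratio"] = stats.get("mapped", 0) / stats["total"]
    return stats


# Made-up numbers, for illustration only.
SAMPLE = """# reads processed: 1000
# reads with at least one reported alignment: 800 (80.00%)
# reads that failed to align: 150 (15.00%)
# reads with alignments suppressed due to -m: 50 (5.00%)
"""
stats = parse_bowtie_summary(SAMPLE)
print(stats)
```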

Output of Mapping¶

1. temporary files

Content	File Name	Tool used
Bowtie treat files	%(DatasetID)s_treat_rep%(treat_rep)s.sam	Bowtie
Bowtie control files	%(DatasetID)s_control_rep%(control_rep)s.sam	Bowtie
Bowtie temporary summary	bowtie.tmp	Bowtie

2. output files

Bowtie SAM results are converted to the binary BAM format with samtools to minimize file size:

samtools view -bt chrom_len sam bam

Groom¶

This part is designed for users who don’t have raw FASTQ read files but have BAM files instead. ChiLin converts the BAM files to FASTQ files so that the whole pipeline can process them. The only difference in usage is that files with the bam suffix are given as input in ChiLin.conf.

The conversion tool used here is bamToFastq from bedtools:

bamToFastq -i x.bam -fq test.fq

Peak Calling¶

Data Analysis¶

Peak calling is performed with MACS2, with parameters set to meet the requirement of non-redundant tags for further analysis. The false discovery rate (FDR) cutoff is set to 0.01, and only one tag is kept among duplicate tags to remove possible bias. By default, MACS2 does not build a model. The standard command line is:

macs2 callpeak -B -q 0.01 --keep-dup 1 --shiftsize=73 --nomodel  -t /pathto/treat.bam  -c /pathto/control.bam -n macsname
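For scripted use, the command above could be assembled as shown in this sketch. The function name `macs2_command` is hypothetical. The flags are copied from the command line above and may differ between MACS2 versions, so check them against the version you have installed.

```python
def macs2_command(treat_bam, control_bam, name, qvalue=0.01, shiftsize=73):
    """Return the macs2 callpeak argument list matching the command above."""
    return [
        "macs2", "callpeak",
        "-B",                          # write bedGraph files
        "-q", str(qvalue),             # FDR (q-value) cutoff
        "--keep-dup", "1",             # keep one tag per duplicate position
        "--shiftsize=%d" % shiftsize,  # fixed shift size, used with --nomodel
        "--nomodel",                   # do not build the shifting model
        "-t", treat_bam,
        "-c", control_bam,
        "-n", name,
    ]


cmd = macs2_command("/pathto/treat.bam", "/pathto/control.bam", "macsname")
print(" ".join(cmd))
```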

Quality Control¶

1. Before peak calling, BAM files are passed to the MACS2 subparser filterdup for statistical analysis of the non-redundant rate of mapped reads. The higher this measurement, the better the data quality.
2. After peak calling, three measurements are involved: total peak count, confident peak count, and shift size (optional, used when model=Yes; see Dive into conf).
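To make the first measurement concrete, the sketch below computes a non-redundant rate under the assumption that it is the fraction of mapped reads left when only one read is kept per (chromosome, start, strand) position. This definition and the function name are assumptions for illustration; MACS2 filterdup reports its own statistics.

```python
def non_redundant_rate(tags):
    """Fraction of reads kept when duplicates at the same position are removed.

    tags: iterable of (chrom, start, strand) tuples, one per mapped read.
    """
    tags = list(tags)
    if not tags:
        return 0.0
    return len(set(tags)) / len(tags)


# Made-up reads: two of the four share the same position and strand.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 200, "-"),
    ("chr2", 100, "+"),
]
rate = non_redundant_rate(reads)
print(rate)
```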

Output files¶

temporary files

Content	File Name	Tool used
separate treat bedGraph file	%(DatasetID)s_treat_rep%(treat_rep)s.bdg	MACS2
separate control MACS bedGraph file	%(DatasetID)s_control_rep%(control_rep)s.bdg	MACS2
Overall MACS bedGraph file	%(DatasetID)s_treat.bdg	MACS2
bedGraph temporary file (remove exceptions)	%(DatasetID)s_treat.bdg.tmp	MACS2
sortedbed (for getting top peaks)	%(DatasetID)s_sorted.bed	Linux sort
top 1000 peaks (for later MDSeqpos)	%(DatasetID)s_top1000_summits.bed	MACS
MACS encode Peak (macs2 output)	%(DatasetID)s_treat_rep%(treat_rep)s_peaks.encodePeak	MACS
treatrep_pq_table	%(DatasetID)s_rep%(treat_rep)s_pq_table.txt	MACS
pq_table	%(DatasetID)s_pq_table.txt	MACS
treat_rep	%(DatasetID)s_rep%(treat_rep)s_treat_pvalue.bdg	MACS
treat_pvalue	%(DatasetID)s_treat_pvalue.bdg	MACS
treatrep_qvalue	%(DatasetID)s_rep%(treat_rep)s_treat_qvalue.bdg	MACS
lambda_bdg	%(DatasetID)s_rep%(control_rep)s_control_lambda.bdg	MACS
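The sorted BED and top 1000 peaks files in the table above come from sorting peaks by score and keeping the best ones. A minimal Python sketch of that idea follows, assuming tab-separated BED lines with the score in the fifth column; the function name and the sample lines are hypothetical.

```python
def top_peaks(bed_lines, n=1000, score_column=4):
    """Return the n highest-scoring BED lines, best first.

    score_column is the zero-based index of the score field
    (assumed here to be the fifth BED column).
    """
    records = [line.rstrip("\n").split("\t") for line in bed_lines if line.strip()]
    records.sort(key=lambda fields: float(fields[score_column]), reverse=True)
    return ["\t".join(fields) for fields in records[:n]]


# Made-up summit lines, for illustration only.
lines = [
    "chr1\t10\t11\tpeak_1\t5.0\n",
    "chr1\t20\t21\tpeak_2\t9.5\n",
    "chr2\t30\t31\tpeak_3\t7.1\n",
]
best = top_peaks(lines, n=2)
print("\n".join(best))
```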

final results

Content	File Name	Tool used
treatreppeaks	%(DatasetID)s_rep%(treat_rep)s_peaks.bed	MACS
treatpeaks	%(DatasetID)s_peaks.bed	MACS
treatrepbw	%(DatasetID)s_treat%(treat_rep)s.bw	MACS
treatbw	%(DatasetID)s_treat.bw	MACS
controlrepbw	%(DatasetID)s_rep%(treat_rep)s_control	MACS
controlbw	%(DatasetID)s_control.bw	MACS
peaksrepxls	%(DatasetID)s_rep%(treat_rep)s_peaks.xls	MACS
peaksxls	%(DatasetID)s_peaks.xls	MACS
summitsrep	%(DatasetID)s_rep%(treat_rep)s_summits.bed	MACS
summits	%(DatasetID)s_summits.bed	MACS

Replicates analysis¶

Data analysis¶

Focus on the visualization of similarity between replicates.

* Draw the Venn diagram for peaks if there are fewer than 3 replicates (treatment or control)
* Plot the correlation score for whole genome region average peak scores
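The manual does not say which correlation measure is plotted. As an illustration only, the sketch below computes a Pearson correlation between the average scores of two replicates over the same regions; the function name and the sample scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up average scores per region for two replicates.
rep1 = [1.0, 2.0, 3.0, 4.0]
rep2 = [2.0, 4.0, 6.0, 8.0]
score = pearson(rep1, rep2)
print(score)
```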

Quality Control¶

The R code is searched with regular expressions to extract the parts needed for generating the QC report.

Content	File Name	Tool used
correlation plot code	%(DatasetID)s_cor.R	Built-in tools
DHS peaks intersection	%(DatasetID)s_bedtools_dhs.txt	External Tools
overlap with velcro region	%(DatasetID)s_bedtools_velcro.txt	External Tools
peaks overlapped	%(DatasetID)s_overlapped_bed	External Tools
AnnotationQC	%(DatasetID)s_Metagene_distribution.pdf	R
AnnotationQC	%(DatasetID)s_peak_height_distribution.pdf	R

Meta genomics Study¶

This step focuses on the association between intervals (the result of peak calling) and traits such as genome annotation.

CEAS: Annotates the given intervals and scores with genome features
Conservation Plot: Calculates the PhastCons scores in several interval sets

output files¶

CEAS part

Content	File Name	Tool used
CEAS script	%(DatasetID)s_ceas_CI.R	CEAS
CEAS script	%(DatasetID)s_ceas_CI.pdf	CEAS
CEAS xls	%(DatasetID)s_ceas.xls	CEAS
CEAS R script	%(DatasetID)s_ceas.R	CEAS
CEAS result pdf	%(DatasetID)s_ceas.pdf

Conservation analysis part

Content	File Name	Tool used
conservtopsummits	%(DatasetID)s_top3000summits.bed	built-in tools
conservR	%(DatasetID)s_conserv.R	built-in tools
conservpng	%(DatasetID)s_conserv.png	built-in tools

Motif¶

For motif analysis, we combine a de novo motif finding algorithm, MDscan, with a database-based search algorithm, Seqpos.

output files¶

Content	File Name	Tool used
summitspeaks1000	%(DatasetID)s_summits_p1000.bed	linux tools
bgfreq	%(DatasetID)s_bgfreq	External Tools
seqpos	%(DatasetID)s_seqpos.zip	MDSeqpos

Other analysis type¶

GO analysis¶

Extract all genes upstream or downstream of the predicted peaks for functional clustering or annotation.

Cistrome Radar/ Finder¶

You can check your top-rated peaks in Cistrome Radar and Finder to find interesting associated results.

Summary Report¶

Data Analysis Summary text¶

Quality report Instruction¶

An example QC report is here QC.

Note

The output format is optional (default: PDF). The outputs below are in the root directory, that is, the folder named after ${DatasetID}.

Provides an overall report of the whole pipeline for viewing the general results.

Based on the ChIP-seq pipeline and the Cistrome DC database, the QC program generates a comprehensive quality control report for a particular dataset, along with its results relative to the whole DC database.

* QC report summary information

Gives an overview of the pass or fail status of every measurement.

Basic information: Species, Cell Type, Tissue Origin, Cell line, Factor, Experiment, Platform, Treatment and Control.
Reads Genomic Mapping QC measurement: QC of raw sequence data with FastQC, FastQC score distribution, Basic mapping QC statistics, Mappable reads ratio, Mappable Redundant rate.
Peak calling QC measurement: Peak calling summary, High confident Peak, Peaks overlapped with DHS (DNase hypersensitivity sites), Velcro ratio (human only), Profile correlation within union peak regions, Peaks overlap between Replicates.
Functional Genomic QC measurement: Peak Height distribution, Meta Gene distribution, Peak conservation score, Motif QC measurement analysis.
