Deeptools 神器之 bamCoverage
Part1Deeptools 神器之 bamCoverage
1Deeptools
Deeptools2 软件在 2016 年发表在 NAR(Nucleic Acids Research)上,它第一次是在 2014 年部署在 Galaxy 服务器上,16 年推出的 Deeptools2 在计算和可视化方面做出了很多更新和升级,旨在更快速和便捷的分析和可视化高通量测序数据。
deepTools2: a next generation web server fordeep-sequencing data analysis[1]
主要功能:
1、处理 bam 和 bigwig 文件
multiBamSummary multiBigwigSummary correctGCBias bamCoverage bamCompare bigwigCompare computeMatrix alignmentSieve
2、对 bam 文件 QC 质量监控
plotCorrelation plotPCA plotFingerprint bamPEFragmentSize computeGCBias plotCoverage
3、可视化
plotHeatmap plotProfile plotEnrichment Miscellaneous computeMatrixOperations estimateReadFiltering
在比对得到的 bam 文件需要查看某些基因的比对情况
及覆盖度
,直接在 igv 里加载 bam 文件会比较大,而且慢,一般需要转换为 bigwig
、wig
或 bedgraph
等文件,这些文件里只包含了每个位置的测序深度的信息,igv 里可以快速查找每个基因。
deeptools 里的 bamCoverage 命令可以将 bam 文件转为 bigwig 或 bedgraph 文件。
软件安装
软件需要依赖这些软件:
Python 2.7 or Python 3.x numpy >= 1.8.0 scipy >= 0.17.0 py2bit >= 0.1.0 pyBigWig >= 0.2.1 pysam >= 0.8 matplotlib >= 1.4.0
# 快捷安装方式
$ conda install -c bioconda deeptools
# 用pip安装
$ pip install deeptools
# 安装到指定路径
$ pip install --install-option="--prefix=/MyPath/Tools/deepTools2.0" git+https://github.com/deeptools/deepTools.git
2bamCoverage
bamCoverage 参数:
$ bamCoverage -h
usage: An example usage is:$ bamCoverage -b reads.bam -o coverage.bw
This tool takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. The coverage is
calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size. It is possible to extended the length of the
reads to better reflect the actual fragment length. *bamCoverage* offers normalization by scaling factor, Reads Per Kilobase per Million mapped reads
(RPKM), counts per million (CPM), bins per million mapped reads (BPM) and 1x depth (reads per genome coverage, RPGC).
Required arguments:
--bam BAM file, -b BAM file
BAM file to process (default: None)
Output:
--outFileName FILENAME, -o FILENAME
Output file name. (default: None)
--outFileFormat {bigwig,bedgraph}, -of {bigwig,bedgraph}
Output file type. Either "bigwig" or "bedgraph". (default: bigwig)
Optional arguments:
--help, -h show this help message and exit
--scaleFactor SCALEFACTOR
The computed scaling factor (or 1, if not applicable) will be multiplied by this. (Default: 1.0)
--MNase Determine nucleosome positions from MNase-seq data. Only 3 nucleotides at the center of each fragment are counted. The fragment ends
are defined by the two mate reads. Only fragment lengthsbetween 130 - 200 bp are considered to avoid dinucleosomes or other
artifacts. By default, any fragments smaller or larger than this are ignored. To over-ride this, use the --minFragmentLength and
--maxFragmentLength options, which will default to 130 and 200 if not otherwise specified in the presence of --MNase. *NOTE*:
Requires paired-end data. A bin size of 1 is recommended. (default: False)
--Offset INT [INT ...]
Uses this offset inside of each read as the signal. This is useful in cases like RiboSeq or GROseq, where the signal is 12, 15 or 0
bases past the start of the read. This can be paired with the --filterRNAstrand option. Note that negative values indicate offsets
from the end of each read. A value of 1 indicates the first base of the alignment (taking alignment orientation into account).
Likewise, a value of -1 is the last base of the alignment. An offset of 0 is not permitted. If two values are specified, then they
will be used to specify a range of positions. Note that specifying something like --Offset 5 -1 will result in the 5th through last
position being used, which is equivalent to trimming 4 bases from the 5-prime end of alignments. Note that if you specify
--centerReads, the centering will be performed before the offset. (default: None)
--filterRNAstrand {forward,reverse}
Selects RNA-seq reads (single-end or paired-end) originating from genes on the given strand. This option assumes a standard dUTP-
based library preparation (that is, --filterRNAstrand=forward keeps minus-strand reads, which originally came from genes on the
forward strand using a dUTP-based method). Consider using --samExcludeFlag instead for filtering by strand in other contexts.
(default: None)
--version show program's version number and exit
--binSize INT bp, -bs INT bp
Size of the bins, in bases, for the output of the bigwig/bedgraph file. (Default: 50)
--region CHR:START:END, -r CHR:START:END
Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time. The format is
chr:start:end, for example --region chr10 or --region chr10:456700:891000. (default: None)
--blackListFileName BED file [BED file ...], -bl BED file [BED file ...]
A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks
that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans
over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if
relevant. (default: None)
--numberOfProcessors INT, -p INT
Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors.
(Default: 1)
--verbose, -v Set to see processing messages. (default: False)
Read coverage normalization options:
--effectiveGenomeSize EFFECTIVEGENOMESIZE
The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that
should be discarded. Also, if repetitive regions were not included in the mapping of reads, the effective genome size needs to be
adjusted accordingly. A table of values is available here:
http://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html . (default: None)
--normalizeUsing {RPKM,CPM,BPM,RPGC,None}
Use one of the entered methods to normalize the number of reads per bin. By default, no normalization is performed. RPKM = Reads Per
Kilobase per Million mapped reads; CPM = Counts Per Million mapped reads, same as CPM in RNA-seq; BPM = Bins Per Million mapped
reads, same as TPM in RNA-seq; RPGC = reads per genomic content (1x normalization); Mapped reads are considered after blacklist
filtering (if applied). RPKM (per bin) = number of reads per bin / (number of mapped reads (in millions) * bin length (kb)). CPM
(per bin) = number of reads per bin / number of mapped reads (in millions). BPM (per bin) = number of reads per bin / sum of all
reads per bin (in millions). RPGC (per bin) = number of reads per bin / scaling factor for 1x average coverage. None = the default
and equivalent to not setting this option at all. This scaling factor, in turn, is determined from the sequencing depth: (total
number of mapped reads * fragment length) / effective genome size. The scaling factor used is the inverse of the sequencing depth
computed for the sample to match the 1x coverage. This option requires --effectiveGenomeSize. Each read is considered independently,
if you want to only count one mate from a pair in paired-end data, then use the --samFlagInclude/--samFlagExclude options. (Default:
None)
--exactScaling Instead of computing scaling factors based on a sampling of the reads, process all of the reads to determine the exact number that
will be used in the output. This requires significantly more time to compute, but will produce more accurate scaling factors in
cases where alignments that are being filtered are rare and lumped together. In other words, this is only needed when region-based
sampling is expected to produce incorrect results. (default: False)
--ignoreForNormalization IGNOREFORNORMALIZATION [IGNOREFORNORMALIZATION ...], -ignore IGNOREFORNORMALIZATION [IGNOREFORNORMALIZATION ...]
A list of space-delimited chromosome names containing those chromosomes that should be excluded for computing the normalization.
This is useful when considering samples with unequal coverage across chromosomes, like male samples. An usage examples is
--ignoreForNormalization chrX chrM. (default: None)
--skipNonCoveredRegions, --skipNAs
This parameter determines if non-covered regions (regions without overlapping reads) in a BAM file should be skipped. The default is
to treat those regions as having a value of zero. The decision to skip non-covered regions depends on the interpretation of the
data. Non-covered regions may represent, for example, repetitive regions that should be skipped. (default: False)
--smoothLength INT bp
The smooth length defines a window, larger than the binSize, to average the number of reads. For example, if the --binSize is set to
20 and the --smoothLength is set to 60, then, for each bin, the average of the bin and its left and right neighbors is considered.
Any value smaller than --binSize will be ignored and no smoothing will be applied. (default: None)
Read processing options:
--extendReads [INT bp], -e [INT bp]
This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. *NOTE*: This
feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions. *Single-
end*: Requires a user specified value for the final fragment length. Reads that already exceed this fragment length will not be
extended. *Paired-end*: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated
reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end
reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the
fragment size of all mate reads). (default: False)
--ignoreDuplicates If set, reads that have the same orientation and start position will be considered only once. If reads are paired, the mate's
position also has to coincide to ignore a read. (default: False)
--minMappingQuality INT
If set, only reads that have a mapping quality score of at least this are considered. (default: None)
--centerReads By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the
fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is
useful to get a sharper signal around enriched regions. (default: False)
--samFlagInclude INT Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to
count properly paired reads only once, as otherwise the second mate will be also considered for the coverage. (Default: None)
--samFlagExclude INT Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where
16 is the SAM flag for reads that map to the reverse strand. (Default: None)
--minFragmentLength INT
The minimum fragment length needed for read/pair inclusion. This option is primarily useful in ATACseq experiments, for filtering
mono- or di-nucleosome fragments. (Default: 0)
--maxFragmentLength INT
The maximum fragment length needed for read/pair inclusion. (Default: 0)
最简单的命令:
$ bamCoverage -b reads.bam -o coverage.bw
参数介绍:
1、--bam/-b: 需要转换的 bam 文件
2、--outFileName/-o: 输出文件名
3、--outFileFormat/-of: 输出文件格式,bigwig 或 bedgraph,默认 bigwig
可选参数:
4、--help/-h: 查看帮助信息
5、--scaleFactor: 缩放因子,默认是 1
6、--MNase: MNase-seq 数据确定核小体的位置,具体见参考文档
7、--Offset: 整数,reads 的偏移,针对 RiboSeq 或 GROseq
8、--filterRNAstrand: 过滤正负链,针对连特异性测序数据
9、 --version: 输出软件版本信息
10、--binSize/-bs: 划分区间大小,整数型,默认 50bp 一个区间,越小分辨率越高
11、--region/-r: 对特定的区间计算,用于测试
12、--blackListFileName: bed 格式文件,用于排除不需要的区间
13、--numberOfProcessors/-p: 线程数,整数型
14、--verbose/-v: 输出计算过程中的信息,默认不输出
reads 归一化参数:
15、--effectiveGenomeSize: 有效基因组大小
16、--normalizeUsing: {RPKM,CPM,BPM,RPGC,None}五个选项可选,默认 none,具体见参考文档
17、--exactScaling: 精确计算缩放因子,所需时间更长
18、--ignoreForNormalization: 排除不需要归一化的染色体
19、--skipNonCoveredRegions/--skipNAs: 跳过没有 reads 覆盖的区域
20、--smoothLength: 整数型,平滑化窗口,必须大于 binsize
reads 处理参数:
21、--extendReads/-e: 延长 reads 的长度,对应 RNA-seq 数据不适合
22、--ignoreDuplicates: 忽略重复的 reads
23、--minMappingQuality: reads 最小比对质量值,整数型
24、--centerReads: 中心化 reads,可获得更陡峭的峰
25、--samFlagInclude: 根据 sam 格式的 flag 来保留对应的 reads 用于分析
26、--samFlagExclude: 根据 sam 格式的 flag 来剔除对应的 reads 用于分析
27、--minFragmentLength: reads 的 Fragment 最小长度,对应于 ATAC-seq 数据可用来筛选单核小体和多核小体
28、--maxFragmentLength: reads 的 Fragment 最大长度
使用示例
# 输出bigwig
$ bamCoverage -p 8 \
--bam test.bam \
--binSize 10 \
--centerReads \
--smoothLength 14 \
--normalizeUsing RPKM \
-o test.bigwig
# 输出bedgraph
$ bamCoverage -p 8 \
--bam test.bam \
--binSize 10 \
--centerReads \
--smoothLength 14 \
--normalizeUsing RPKM \
--outFileFormat bedgraph \
-o test.bedgraph
igv 可视化
参考资料
deepTools2: a next generation web server fordeep-sequencing data analysis: https://academic.oup.com/nar/article/44/W1/W160/2499308
欢迎小伙伴留言评论!
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,打赏一下吧!