MeRIP-seq 数据分析之 callpeak 及 peak 可视化

原创 JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集

点击上方关注“公众号”

1前言

前面我们比对得到了 高质量的 bam 文件，RNA-seq 到这一步应该走 定量分析 了，但是 m6A 修饰是 单位点修饰 ，显然对基因或转录本定量并不合适，而是应该去寻找基因上某个位点的 m6A 修饰的位置，也就是 callpeak 过程：

这里使用文章中的 macs2 和 exomePeak2 软件进行 callpeak，区别是 macs2 最早是专门针对分析 Chip-seq 数据的，exomePeak 是针对分析 m6A-seq 数据的。

2macs2 callpeak

macs2 其实已经更新到 macs3 了，下面是更新的特点：

主要参数介绍：

-t： 处理组样本，IP 样本。
-c： 对照组样本，Input 样本。
-f： 输入文件格式，默认自动识别。
-g： 有效基因组大小，可以是数字，也可以是：hs，mm，ce，dm。
--keep-dup： 保留重复的 tag。
--outdir： 文件输出路径。
-n： 输出文件名称前缀。
-B： 保留 the fragment pileup, control lambda 到 bedGraph 文件。
--SPMR： normalized pileup tracks across many datasets，可以在样本间进行比较 peaks。
--nomodel： 不使用 macs2 建立 shifting model。
--extsize： 与--nomodel 一起使用，从 5'->3'延伸 fragment 到指定长度。
-q： qvalue。
--fe-cutoff： fold-enrichment。

callpeak：

$ vi callpeak.sh

#!/bin/bash
mkdir 6.macs-data
# IVT sample
macs2 callpeak -t 4.map-data/SRR14765639.high_quality_sorted.bam \
          -c 4.map-data/SRR14765638.high_quality_sorted.bam \
          -B --SPMR --keep-dup all \
          -n IVT --outdir 6.macs-data -f BAM \
          -g 105516336 --nomodel --extsize 150 \
          -q 0.05 --fe-cutoff 2

# METTL3 ko rep1
macs2 callpeak -t 4.map-data/SRR14765641.high_quality_sorted.bam \
          -c 4.map-data/SRR14765640.high_quality_sorted.bam \
          -B --SPMR --keep-dup all \
          -n M3_KO_rep1 --outdir 6.macs-data -f BAM \
          -g 105516336 --nomodel --extsize 150 \
          -q 0.05 --fe-cutoff 2

# METTL3 ko rep2
macs2 callpeak -t 4.map-data/SRR14765643.high_quality_sorted.bam \
          -c 4.map-data/SRR14765642.high_quality_sorted.bam \
          -B --SPMR --keep-dup all \
          -n M3_KO_rep2 --outdir 6.macs-data -f BAM \
          -g 105516336 --nomodel --extsize 150 \
          -q 0.05 --fe-cutoff 2

# WT rep1
macs2 callpeak -t 4.map-data/SRR14765645.high_quality_sorted.bam \
          -c 4.map-data/SRR14765644.high_quality_sorted.bam \
          -B --SPMR --keep-dup all \
          -n WT_rep1 --outdir 6.macs-data -f BAM \
          -g 105516336 --nomodel --extsize 150 \
          -q 0.05 --fe-cutoff 2

# WT rep2
macs2 callpeak -t 4.map-data/SRR14765647.high_quality_sorted.bam \
          -c 4.map-data/SRR14765646.high_quality_sorted.bam \
          -B --SPMR --keep-dup all \
          -n WT_rep2 --outdir 6.macs-data -f BAM \
          -g 105516336 --nomodel --extsize 150 \
          -q 0.05 --fe-cutoff 2

# 后台运行
$ nohup ./callpeak.sh &

产生的结果文件：

文件说明：

1、_control_lambda.bdg/_treat_pileup.bdg： Input 和 IP 样本的 bedGraph 格式数据，可载入 IGV 里可视化。
2、_peaks.narrowPeak： BED6+4 格式的数据，分别为：染色体-peak 起始位置-peak 终止位置-peak 名称-peak 长度-正负链-FoldChange- -log10pvalue- -log10qvalue-peak 起始位置距离 summit 的距离 ：
3、_peaks.xls： xls 格式的 peak 数据，abs_summit为顶峰的位置，pileup为顶峰的值：
4、_summits.bed： 记录了每个 peak 顶峰的位置，最后一列为 -log10qvalue，可以用这个去找 motif：

对 peak 取交集：

对于有两个生物学重复样本，使用 bedtools intersect 取交集，取至少有 50% 区域重叠的 peak 区域：

$ bedtools intersect -a WT_rep1_peaks.narrowPeak \
                     -b WT_rep2_peaks.narrowPeak \
                     -f 0.5 -F 0.5 -e > WT_macs2_intersect.bed
$ bedtools intersect -a M3_KO_rep1_peaks.narrowPeak \
                     -b M3_KO_rep2_peaks.narrowPeak \
                     -f 0.5 -F 0.5 -e > M3_KO_macs2_intersect.bed

导入 IGV 进行验证一下，可以看到去的是两个 replicate 的交集：

加载 bedgraph 文件和交集 peak 文件导入看看：

可以明显看到 Jun 基因在 3'UTR 区域有个 m6A，敲除 METTL3 后，这个 m6A 修饰就消失了。

Junb 基因：

3exomePeak2 callpeak

打开我们之前创建的 R project，创建一个 step2_exomepeak2_callpeak.R 脚本。

下载 GTF 注释文件：

$ wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.ncbiRefSeq.gtf.gz
$ gunzip mm10.ncbiRefSeq.gtf.gz

批量 callpeak：

# install
BiocManager::install('exomePeak2')
# library
library(exomePeak2)

# 注释文件
gtf = '../mm10.ncbiRefSeq.gtf'

# input样本
input_bam = c('../4.map-data/SRR14765638.high_quality_sorted.bam',
              '../4.map-data/SRR14765640.high_quality_sorted.bam',
              '../4.map-data/SRR14765642.high_quality_sorted.bam',
              '../4.map-data/SRR14765644.high_quality_sorted.bam',
              '../4.map-data/SRR14765646.high_quality_sorted.bam')

# ip样本
ip_bam = c('../4.map-data/SRR14765639.high_quality_sorted.bam',
           '../4.map-data/SRR14765641.high_quality_sorted.bam',
           '../4.map-data/SRR14765643.high_quality_sorted.bam',
           '../4.map-data/SRR14765645.high_quality_sorted.bam',
           '../4.map-data/SRR14765647.high_quality_sorted.bam')

# 创建保存结果文件夹
dir.create('../7.exomepeak2-data')

# 变量名
name = c('IVT','M3KO_rep1','M3KO_rep2','WT_rep1','WT_rep2')

# 批量callpeak
lapply(1:5, function(x){
           exomePeak2(gff_dir = gtf,
                      bam_ip = ip_bam[x],
                      bam_input = input_bam[x],
                      save_dir = paste('../7.exomepeak2-data/',name[x],sep = ''),
                      paired_end = T,parallel = T,
                      p_cutoff = 0.00001,log2FC_cutoff = 1,
                      fragment_length = 150)
}) -> result
## Make the TxDb object ... OK
## Warning: Missing BSgenome or UCSC genome name, peak calling without GC content correction.
## Generate bins on exons ... OK
## Count reads on bins ... OK
## Identify background with Gaussian Mixture Model ... OK
## Peak Calling with Poisson GLM ... OK
## Count reads on the merged peaks and the control regions ... OK
## Calculate peaks/sites statistics with Poisson GLM ... OK
## Result files have been saved in the directory: '../7.exomepeak2-data/IVT'.
## ...

产生的结果文件：

Mod.csv 文件：

Mod.bed 文件：

每列说明：

生物学重复取交集：

$ cd 7.exomepeak2-data/

# 复制peaks文件到当前文件夹
$ cp M3KO_rep1/Mod.bed ./M3KO_rep1.bed
$ cp M3KO_rep2/Mod.bed ./M3KO_rep2.bed
$ cp WT_rep1/Mod.bed ./WT_rep1.bed
$ cp WT_rep2/Mod.bed ./WT_rep2.bed

同样过滤掉 overlap 小于 50% 的 peaks，使用作者提供的一个 python 小脚本：

$ vi exomePeak2_intersect.py
#to filter intersect with proportion less then 0.5
from __future__ import division
import sys
for line in open(sys.argv[1],"r") :
        line = line.strip()
        info = line.split("\t")
        a_len = sum([int(i) for i in info[10].split(",")])
        b_len = sum([int(i) for i in info[22].split(",")])
        o_len = int(info[24])
        if (int(info[1])-int(info[13]))*(int(info[2])-int(info[14])) <= 0 :
                print line
        else:
                if (o_len/a_len) >= 0.5 or (o_len/b_len) >= 0.5 :
                        print line

过滤，这个 python 脚本需要在 python2 环境下运行：

$ conda create -n py2 python=2 bedtools

# METTL3
$ bedtools intersect -a M3KO_rep1.bed -b M3KO_rep2.bed -s -split -wo > tmp_file
$ python exomePeak2_intersect.py tmp_file | cut -f 1,2,3,4,5,6,7,8,9,10,11,12 | sort | uniq > M3KO_reps_intersect.bed

# WT
$ bedtools intersect -a WT_rep1.bed -b WT_rep2.bed -s -split -wo > tmp_file
$ python exomePeak2_intersect.py tmp_file | cut -f 1,2,3,4,5,6,7,8,9,10,11,12 | sort | uniq > WT_reps_intersect.bed

$ rm tmp_file

IGV 里查看检查，没有问题：

4矫正假阳性 peak

由上面我们得到了 macs2 和 exomePeak2 的 IVT 和 METTL3 及 WT 的 m6A peak，接下来需要使用 IVT 的 peak 进行去除假阳性矫正，因为 IVT 得到的 peak 肯定都是 m6A 抗体非特异性结合及其它原因 富集下来的片段。

1、矫正 macs2 的 peak

$ cd ../6.macs-data/
$ mkdir calibrated-data

$ vi calibrate.sh
#!/bin/bash
for i in WT M3_KO
do
 #  IVT与 m6A peak取交集，获取假阳性peak
 bedtools intersect -a IVT_peaks.narrowPeak -b ${i}_macs2_intersect.bed \
 -wa -f 0.5 -F 0.5 -e | sort | uniq \
 > calibrated-data/${i}_false_positive.bed

 # 去除 m6A peak中的 IVT 的peak，矫正假阳性peak
 bedtools intersect -a ${i}_macs2_intersect.bed -b IVT_peaks.narrowPeak \
 -v -wa -f 0.5 -F 0.5 -e | sort | uniq \
 > calibrated-data/${i}_calibrated.bed

 # 提取 IVT 唯一的peak
 bedtools intersect -a IVT_peaks.narrowPeak -b ${i}_macs2_intersect.bed \
 -v -wa -f 0.5 -F 0.5 -e | sort | uniq \
 > calibrated-data/${i}_IVT_uniq.bed
done

# run
$ ./calibrate.sh

2、矫正 exomePeak2 的 peak

$ cd ../7.exomepeak2-data/
$ cp IVT/Mod.bed ./IVT.bed

# 写脚本批量矫正
$ vi calibrate.sh
#!/bin/bash
mkdir calibrated-data
# for loop
for i in WT M3KO
do
 bedtools intersect -a IVT.bed -b ${i}_reps_intersect.bed -s -split -wo > tmp_file
 #  IVT与 m6A peak取交集，获取假阳性peak
 python exomePeak2_intersect.py tmp_file | cut -f 1,2,3,4,5,6,7,8,9,10,11,12 | \
 sort | uniq > calibrated-data/${i}_false_positive.bed

 # 去除 m6A peak中的 IVT 的peak，矫正假阳性peak
 python exomePeak2_intersect.py tmp_file | cut -f 16 > tmp2
 awk 'BEGIN{OFS=FS="\t"}ARGIND==1{a[$0]=FNR}ARGIND==2{if(!($4 in a)){print $0}}' tmp2 ${i}_reps_intersect.bed | \
 sort | uniq > calibrated-data/${i}_calibrated.bed

 # 提取 IVT 唯一的peak
 python exomePeak2_intersect.py tmp_file | cut -f 4 > tmp3
 awk 'BEGIN{OFS=FS="\t"}ARGIND==1{a[$0]=FNR}ARGIND==2{if(!($4 in a)){print $0}}' tmp3 IVT.bed | \
 sort | uniq > calibrated-data/${i}_IVT_uniq.bed
done

# run
$ ./calibrate.sh

矫正的结果 peak 数量：

$ wc -l WT_reps_intersect.bed
14439 WT_reps_intersect.bed
$ wc -l calibrated-data/WT_calibrated.bed
7342 calibrated-data/WT_calibrated.bed

可以看到矫正后只剩差不多一半的 peak 了，说明假阳性率还是很高的。

5bam 转 bw 文件可视化 m6A coverage

把 bam 文件转为 bw 文件，结合 peak 文件进行可视化，之前装的 deeptools 好像用不了，换个环境重新装一下：

$ conda create -n deeptools deeptools=3.5.0 python=3
$ conda activate deeptools

$ vi bam2bw.sh
#!/bin/bash
# make dir to save bw
mkdir 5.bigwig-data
# bam to bigwig
for i in SRR147656{38..47}
do
 bamCoverage -p 10 --binSize 10 --normalizeUsing BPM \
 --exactScaling --smoothLength 15 \
 --centerReads --bam 4.map-data/${i}.high_quality_sorted.bam \
 -o 5.bigwig-data/${i}.bw
done

# run
$ nohup ./bam2bw.sh &

# 查看文件
$ ls -lh 5.bigwig-data/ | cut -d ' ' -f5,6,7,8,9
28M Oct 13 16:45 SRR14765638.bw
20M Oct 13 16:59 SRR14765639.bw
38M Oct 13 17:15 SRR14765640.bw
31M Oct 13 17:31 SRR14765641.bw
31M Oct 13 17:45 SRR14765642.bw
33M Oct 13 18:01 SRR14765643.bw
37M Oct 13 18:17 SRR14765644.bw
31M Oct 13 18:31 SRR14765645.bw
41M Oct 13 18:46 SRR14765646.bw
36M Oct 13 19:01 SRR14765647.bw

导入 IGV 进行可视化，结合我们的 peak 文件，可以看到这个基因由 两个 m6A 修饰，METTL3 敲除以后，m6A 修饰就消失了：

下面这张图有一个假阳性 peak：

macs2 和 exomePeak2 矫正之后的 peak 数量差的不大：

$ wc -l 6.macs-data/calibrated-data/WT_calibrated.bed
7911 6.macs-data/calibrated-data/WT_calibrated.bed
$ wc -l 7.exomepeak2-data/calibrated-data/WT_calibrated.bed
7342 7.exomepeak2-data/calibrated-data/WT_calibrated.bed

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 哦,代码已上传至QQ群文件夹，欢迎下载。

群二维码：

老俊俊微信：

知识星球：

所以今天你学习了吗？

欢迎小伙伴留言评论！