QC and adapter trimming in one step: the fastp tool

Original JunJunLab 老俊俊的生信笔记 2022-08-15

1 Introduction

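In earlier posts we used three tools, fastqc, multiqc and cutadapt, to check data quality and remove adapters. fastp is another widely used tool that combines quality control, automatic adapter detection and read filtering in a single program, so here is a short introduction to it.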

The project has quite a few stars on GitHub, and the author updated it recently.

GitHub repository: https://github.com/OpenGene/fastp

2 Installation

Installing with conda:

The most convenient way to install fastp is, of course, conda:

# note: the fastp version in bioconda may be not the latest
conda install -c bioconda fastp

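The version installed this way does not seem to be the latest one, so downloading the prebuilt binary is the better option.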

Installing the prebuilt binary

The precompiled binary:

# download the latest build
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp

# or download specified version, i.e. fastp v0.23.0
wget http://opengene.org/fastp/fastp.0.23.0
mv fastp.0.23.0 fastp
chmod a+x ./fastp
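
After downloading, it is worth confirming that the binary actually runs on your machine. Below is a minimal sketch: the helper name `show_fastp_version` is made up for this example, and it only assumes that `fastp --version` prints a version string (possibly to stderr, which is why both streams are captured).

```shell
#!/bin/bash
# Hypothetical helper (not part of fastp): print the version reported by a
# fastp binary, or fail if the given path/name cannot be executed.
show_fastp_version() {
    local bin="${1:-./fastp}"
    # command -v succeeds for an executable path or a command found on PATH
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "cannot execute: $bin" >&2
        return 1
    fi
    # capture stderr as well, since the version line may be written there
    "$bin" --version 2>&1
}
```

Usage: `show_fastp_version ./fastp` for the downloaded binary, or `show_fastp_version fastp` once it has been moved to a directory on your PATH.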

Building from source

Step 1: download and build libisal:

git clone https://github.com/intel/isa-l.git
cd isa-l
./autogen.sh
./configure --prefix=/usr --libdir=/usr/lib64
make
sudo make install

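Step 2: download and build libdeflate:

git clone https://github.com/ebiggers/libdeflate.git
cd libdeflate
# note: recent libdeflate releases build with CMake rather than a plain Makefile;
# if "make" fails, try: cmake -B build && cmake --build build && sudo cmake --install build
make
sudo make install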

Step 3: download and build fastp:

# get source (you can also use browser to download from master or releases)
git clone https://github.com/OpenGene/fastp.git

# build
cd fastp
make

# Install
sudo make install

3 Usage

Features:

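The feature overview (a figure in the original post) shows that fastp does quite a lot. Its built-in adapter detection and removal is particularly convenient, and it is reported to be faster than fastqc.

It can also remove duplicate reads.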

Basic usage:

Single-end (SE) data:

fastp -i in.fq -o out.fq

Paired-end (PE) data:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz \
--detect_adapter_for_pe
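
Paired-end processing expects the two files to contain the same reads in the same order, so a quick read-count comparison before running fastp can catch truncated downloads. The sketch below is illustrative only: `count_reads` and `pairs_match` are hypothetical helper names, and the count assumes standard four-line FASTQ records.

```shell
#!/bin/bash
# Hypothetical helper: count reads in a FASTQ file (plain or gzip-compressed),
# assuming every record occupies exactly four lines.
count_reads() {
    local f="$1" lines
    case "$f" in
        *.gz) lines=$(gzip -cd "$f" | wc -l) ;;
        *)    lines=$(wc -l < "$f") ;;
    esac
    echo $(( lines / 4 ))
}

# Hypothetical helper: report both counts and succeed only if they are equal.
pairs_match() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1: $n1 reads, R2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

Usage: `pairs_match in.R1.fq.gz in.R2.fq.gz && echo "pair looks consistent"`. Equal counts do not prove the mates are in the same order, so this is only a first sanity check.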

Specify the adapter sequence to remove for single-end data:

fastp -i in.fq -o out.fq --adapter_sequence=AGATCGGAAGAGC

Specify the adapter sequences to remove for paired-end data:

fastp -i in.R1.fq.gz -I in.R2.fq.gz \
-o out.R1.fq.gz -O out.R2.fq.gz \
--adapter_sequence=AGATCGGAAGAGC \
--adapter_sequence_r2=AGATCGGAAGAGC

Set the quality value a base must reach to count as qualified (-q) and the number of worker threads (-w):

fastp -i in.fq -o out.fq \
-q 20 -w 10
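
Instead of hard-coding the value passed to `-w`, the thread count can be derived from the machine. A minimal sketch follows; `pick_threads` is a hypothetical helper, and the cap of 16 rests on the assumption that fastp does not use more than 16 worker threads even when asked for more.

```shell
#!/bin/bash
# Hypothetical helper: choose a value for fastp's -w option.
# With no argument it uses the number of online CPUs; an explicit number can
# be passed instead. The result is capped at 16 (assumed fastp upper limit).
pick_threads() {
    local cpus="${1:-}"
    if [ -z "$cpus" ]; then
        cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    fi
    if [ "$cpus" -gt 16 ]; then
        cpus=16
    fi
    echo "$cpus"
}
```

Usage: `fastp -i in.fq -o out.fq -q 20 -w "$(pick_threads)"`. On a shared server you may prefer to pass a smaller explicit number, for example `pick_threads 4`.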

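Deduplication. With --dedup the accuracy level defaults to 3; levels range from 1 to 6, and the higher the level, the more memory is used: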

fastp -i in.fq -o out.fq \
--dedup --dup_calc_accuracy 3

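Besides the filtered fastq files, fastp also writes an HTML report and a JSON report to the current directory by default (fastp.html and fastp.json unless -h and -j are given).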

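QC and filtering results, as shown in the HTML report:

Overall summary statistics.

Adapter sequences and duplication rate.

Base quality and GC content distributions. These are the plots before filtering; the report also includes the same plots after filtering.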
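
The JSON report carries the same numbers as the HTML page, which makes it easy to pull a few key values out on the command line. The sketch below assumes that `jq` is installed and that the report contains the fields `summary.before_filtering.total_reads`, `summary.after_filtering.total_reads` and `duplication.rate`; check one of your own JSON files before relying on these paths. `summarize_fastp_json` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print one tab-separated line per fastp JSON report:
#   file name, reads before filtering, reads after filtering, duplication rate
# Requires jq; the field paths are assumptions about the report layout.
summarize_fastp_json() {
    jq -r '[input_filename,
            .summary.before_filtering.total_reads,
            .summary.after_filtering.total_reads,
            .duplication.rate] | @tsv' "$@"
}
```

Usage: `summarize_fastp_json fastp.json`, or pass several reports at once to get one line per sample.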

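4 Batch QC and filtering

Building on the code from the earlier posts, we can use fastp to run QC and filtering on all samples in one go: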

# create the output directory used by the script below
mkdir -p 3.fastp_res

# write script (put the lines from #!/bin/bash through done into fastp.sh)
vi fastp.sh
#!/bin/bash
for i in SRR147656{38..47}
do
 fastp -i 1.raw-data/${i}_1.fastq.gz \
          -I 1.raw-data/${i}_2.fastq.gz \
          -o 3.fastp_res/${i}.trimmed_1.fastq.gz \
          -O 3.fastp_res/${i}.trimmed_2.fastq.gz \
          --adapter_sequence=AGATCGGAAGAGC \
          --adapter_sequence_r2=AGATCGGAAGAGC \
          -q 20 -w 10 \
          -h 3.fastp_res/${i}.html \
          -j 3.fastp_res/${i}.json
done

# make the script executable, then run it in the background
chmod +x fastp.sh
nohup ./fastp.sh &
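
Before launching a long batch run in the background, it can help to confirm that every expected input file is present, since a missing file would otherwise only show up later in nohup.out. A minimal sketch, assuming the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming used in the loop above; `missing_inputs` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print every paired-end input file that is missing or
# empty. First argument is the input directory, the rest are sample names.
# Returns 0 if all files are present, 1 otherwise.
missing_inputs() {
    local dir="$1" s mate f status=0
    shift
    for s in "$@"; do
        for mate in 1 2; do
            f="$dir/${s}_${mate}.fastq.gz"
            if [ ! -s "$f" ]; then
                echo "$f"
                status=1
            fi
        done
    done
    return "$status"
}
```

Usage: `missing_inputs 1.raw-data SRR147656{38..47} && nohup ./fastp.sh &` starts the batch only when nothing is missing; otherwise the missing paths are listed.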