宏基因组分析01. KneadData质控宏基因组数据
本系列课程前情回顾
KneadData 简介
Kneaddata是专门质控宏基因组数据的工具。与其它高通量数据仅需进行测序质量控制相比,宏基因组数据通常含有大量宿主,对于生物学家如何去除这些宿主过程中软件(blast、bowtie2)、参数(相似度、覆盖度阈值)的选择还是比较困难的。为解决这一难题,哈佛大学Huttenhower实验室近期推出了质控、去宿主的全套流程,方便同行一步完成以上分析。
软件主页:http://huttenhower.sph.harvard.edu/kneaddata ,目前软件正在投稿中,需要引用时请引看主页是否有更新。
KneadData is a tool designed to perform quality control on metagenomic sequencing data, especially data from microbiome experiments. In these experiments, samples are typically taken from a host in hopes of learning something about the microbial community on the host. However, metagenomic sequencing data from such experiments will often contain a high ratio of host to bacterial reads. This tool aims to perform principled in silico separation of bacterial reads from these “contaminant” reads, be they from the host, from bacterial 16S sequences, or other user-defined sources.
KneadData 安装
KneadData代码主页 https://bitbucket.org/biobakery/kneaddata/wiki/Home
kneaddata可用Conda安装,如果系统没有安装过Conda,请参考 《Nature Method:Bioconda解决生物软件安装的烦恼 https://blog.csdn.net/woodcorpse/article/details/82226929》 文章安装和配置Conda。
建议python环境使用2.7版本,成功率更高。
###创建环境并安装###
conda create -n kneaddata kneaddata=0.6.1 python=2.7
###列出该环境下安装的包
conda list -n kneaddata
###进入环境
source activate kneaddata
宿主参考基因组准备
###建立索引
###Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>###
mkdir -p ~/meta/index_base.dir && cd ~/meta/index_base.dir
# 以水稻为例 rice from ensemble plant
wget -c ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa.gz &
gunzip Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa.gz &
nohup bowtie2-build --threads 20 ./Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa rice_genome &
INDEX=~/meta/index_base.dir/rice_genome
###测试数据下载
mkdir -p ~/meta/raw_data.dir && cd ~/meta/raw_data.dir
wget http://bailab.genetics.ac.cn/note/zhiwen/basic/05/10_19_top_1.fq.gz
wget http://bailab.genetics.ac.cn/note/zhiwen/basic/05/10_19_top_2.fq.gz
wget http://bailab.genetics.ac.cn/note/zhiwen/basic/05/10_20_top_1.fq.gz
wget http://bailab.genetics.ac.cn/note/zhiwen/basic/05/10_20_top_2.fq.gz
KneadData运行示例
###单个文件使用kneaddata 的命令行如下所示:
kneaddata -h
##需要用到trimmomatic、bowtie2、bmtagger、trf、fastqc
##关于trimmomatic的参数
##ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
##SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the ##window falls below a threshold.
##LEADING: Cut bases off the start of a read, if below a threshold quality
##TRAILING: Cut bases off the end of a read, if below a threshold quality
##CROP: Cut the read to a specified length
##HEADCROP: Cut the specified number of bases from the start of the read
##MINLEN: Drop the read if it is below a specified length
##TOPHRED33: Convert quality scores to Phred-33
##TOPHRED64: Convert quality scores to Phred-64
kneaddata -i ~/meta/raw_data.dir/10_19_top_1.fq.gz \
-i ~/meta/raw_data.dir/10_19_top_1.fq.gz \
-o ~/meta/raw_data.dir/temp -v -t 20 --remove-intermediate-output \
--trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" -db ~/meta/index_base.dir/rice_genome
##-i 表示输入的fastq文件
##-o 表示输出的文件夹
##-v 输出进程
##-t 线程数
##--remove-intermediate-output 移除中间文件
运行结果包括以下文件:
##不包含宿主基因组的序列
10_19_top_1_kneaddata_paired_1.fastq
10_19_top_1_kneaddata_paired_2.fastq
10_19_top_1_kneaddata_unmatched_1.fastq
10_19_top_1_kneaddata_unmatched_2.fastq
##包含宿主基因组的序列
10_19_top_1_kneaddata_rice_genome_bowtie2_paired_contam_1.fastq
10_19_top_1_kneaddata_rice_genome_bowtie2_paired_contam_2.fastq
10_19_top_1_kneaddata_rice_genome_bowtie2_unmatched_1_contam.fastq
10_19_top_1_kneaddata_rice_genome_bowtie2_unmatched_2_contam.fastq
多任务使用parallel并行运行
parallel -j 3 --xapply \
'kneaddata -i {1} -i {2} \
-o temp -v -t 10 --remove-intermediate-output \
--trimmomatic-options "SLIDINGWINDOW:4:20 MINLEN:50" \
--bowtie2-options "--very-sensitive --dovetail" \
-db ~/meta/index_base.dir/rice_genome' \
::: ~/meta/raw_data.dir/*_top_1.fq.gz ::: ~/meta/raw_data.dir/*_top_2.fq.gz
##{}表示文件名
##-j表示同时运行的命令数
##--xapply表示参数成对的输入,如果不使用,则输入参数会两两组合
质控结果统计
统计所有样品质控过程中数据量
kneaddata_read_count_table --input ~/meta/raw_data.dir/temp --output ./kneaddata_read_counts.txt
如下为两个测试文件对的统计结果:我们要关注最的还剩多少数据(final pair1/2列)、可进一步统计低质量去除后剩余(trimmed pair1/2)、去完宿主还剩多少(decontaminated rice_genome pair1/2)。对于这些单端序列(orphan2),一般不建议使用。
Sample raw pair1 raw pair2 trimmed pair1 trimmed pair2 trimmed orphan2 decontaminated rice_genome pair1 decontaminated rice_genome pair2 decontaminated rice_genome orphan1 decontaminated rice_genome orphan2 final pair1 inal pair2 final orphan1 final orphan2
10_19_top_1_kneaddata 25000 25000 24999 24999 1 22181 22181 16 21 22181 22181 16 21
10_20_top_1_kneaddata 25000 25000 25000 25000 NA 22006 22006 13 16 22006 22006 13 16
有参分析双端序列合并
mkdir -p ~/meta/raw_data.dir/concat
for i in ~/meta/raw_data.dir/temp/*top_1_kneaddata_paired_1.fastq; do
base=$(basename ${i} top_1_kneaddata_paired_1.fastq)
cat ~/meta/raw_data.dir/temp/${base}top_1_kneaddata_paired* > ~/meta/raw_data.dir/concat/${base}.fq
done
本周就到这里,下周我们将进一步学习后续分析。