其他
前面我带领大家通过IMGT数据库认知免疫组库,而且也一起从IMGT数据库下载免疫组库相关fasta序列,免疫组库重要的研究对象就是分成BCR的IGH,IGK,IGL这3类,以及TCR的TRA,TRB,TRD,TRG,它们各自都有V,D(可选),J,C基因。
igblast因为是ncbi出品,所以在免疫组库分析领域还算是使用频率较高的,值得注意的是igblast软件虽然下载即可使用,但是软件用法超级复杂,软件输出的结果文件需要耗费至少五六个小时去理解。
首先看看网页版igblast
imgt.TR.Homo_sapiens.J.f.orf.p
376 sequences; 82,149 total letters
软件安装及数据库文件准备
cd ~/biosoft/igblast
# 软件是39M,下载速度可能会比较慢
wget -c ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.9.0/ncbi-igblast-1.9.0-x64-linux.tar.gz
tar -xzf ncbi-igblast-1.9.0-x64-linux.tar.gz
# 因为我电脑是Mac,所以这个 ncbi-igblast-1.9.0-x64-linux.tar.gz 没有用。
# 然后因为网速问题,我其实最后选择了conda安装igblast-1.7.0
conda install igblast
使用igblast进行序列比对
拼接双端测序数据
FLASH (F ast L ength A djustment of SH ort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments.
cd ~/biosoft/FLASH
wget http://ccb.jhu.edu/software/FLASH/FLASH-1.2.11-Linux-x86_64.tar.gz
tar zxvf FLASH-1.2.11-Linux-x86_64.tar.gz
./FLASH-1.2.11-Linux-x86_64/flash
# Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
# Run `./FLASH-1.2.11-Linux-x86_64/flash --help | less' for more information.
flash raw/ERR3445170_1.fastq.gz raw/ERR3445170_2.fastq.gz -M 250
output.extendeFrags.fastq 为拼接后的扩增片段序列文件; output.flash.log 为日志文件,详细记录了拼接过程中的参数和拼接统计的数据; output.hist 为拼接后的reads长度的统计信息文件; output.histogram 为拼接后的reads长度直方图文件; output.notCombined_1.fastq 为拼接不上的reads1序列文件; output.notCombined_2.fastq 为拼接不上的reads2序列文件;
[FLASH] Total pairs: 84833
[FLASH] Combined pairs: 30753
[FLASH] Uncombined pairs: 54080
[FLASH] Percent combined: 36.25%
-m 拼接时overlap区的最小bp阈值,默认10bp; -M 拼接时overlap区的最长bp阈值,默认65,我们这里调整为250; -x overlap区允许的最大碱基错配比率(最大碱基错配数目/overlap区长度),默认为0.25; -p 碱基质量值类型,phred64或者33,默认是33; -r reads长度; -f 片段长度,也就是测序的文库大小; -s 文库的偏差; -o 输出文件前缀; -t 设置线程数,默认为1,FLASH软件支持多线程,速度快;
source activate qc
conda install -y trim-galore
trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o clean raw/ERR3445170_1.fastq.gz raw/ERR3445170_2.fastq.gz
# 然后再使用flash对PE数据进行拼接
flash raw/ERR3445170_1.fastq.gz raw/ERR3445170_2.fastq.gz -M 250 -o clean
[FLASH] Total pairs: 84047
[FLASH] Combined pairs: 81600
[FLASH] Uncombined pairs: 2447
[FLASH] Percent combined: 97.09%
构建人类的免疫组库数据库
cd ~/biosoft/igblast/imgt
wget http://www.imgt.org/download/V-QUEST/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR/TR{A,B,D,G}{V,J}.fasta
wget http://www.imgt.org/download/V-QUEST/IMGT_V-QUEST_reference_directory/Homo_sapiens/TR/TR{B,D}D.fasta
perl edit_imgt_file.pl imgt/TRBV.fasta > database/human_TRBV.fa
makeblastdb -parse_seqids -dbtype nucl -in database/human_TRBV.fa
perl edit_imgt_file.pl imgt/TRBD.fasta > database/human_TRBD.fa
makeblastdb -parse_seqids -dbtype nucl -in database/human_TRBD.fa
perl edit_imgt_file.pl imgt/TRBJ.fasta > database/human_TRBJ.fa
makeblastdb -parse_seqids -dbtype nucl -in database/human_TRBJ.fa
运行igblast
seqkit fq2fa clean.extendedFrags.fastq > in.fa
# 因为igblastn的输入数据需要是fasta格式
igblastn \
-germline_db_V ~/biosoft/igblast/database/human_TRBV.fa \
-germline_db_D ~/biosoft/igblast/database/human_TRBD.fa \
-germline_db_J ~/biosoft/igblast/database/human_TRBJ.fa \
-domain_system imgt \
-organism human \
-ig_seqtype TCR \
-auxiliary_data ~/biosoft/igblast/optional_file/human_gl.aux \
-show_translation \
-outfmt 7 \
-num_threads 4 \
-query in.fa \
-out test.blast.output.7
免疫组库的测试数据
Change-O is a collection of tools for processing the output of V(D)J alignment tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and reconstructing germline sequences.
We have hosted a small example data set resulting from the Roche 454 example workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results.
https://ncbi.github.io/igblast/cook/examples.html http://www.bio-info-trainee.com/352.html https://www.ncbi.nlm.nih.gov/igblast/ https://changeo.readthedocs.io/en/version-0.3.4---license-change/examples/igblast.html
文末友情宣传
生信爆款入门-全球听(买一得五)(第5期)(可能是最后一期)你的生物信息学入门课 (必看!)数据挖掘第3期(两天变三周,实力加量),医学生/临床医师首选技能提高课 生信技能树的2019年终总结 ,你的生物信息学成长宝藏 2020学习主旋律,B站74小时免费教学视频为你领路,还等什么,看啊!!!