PANDA姐的转录组入门(5)：序列比对

原创 2017-07-22 沈梦圆 沈梦圆

明天要去深圳开会一个星期，所以提前把下周的序列比对做了。为了节约时间，我把数据从笔记本拷贝到服务器上进行操作。（用的是WinSCP，用来服务器与本地电脑的数据传输。）关于转录组的比对，我还没有接触过，以下仅为个人学习笔记，可能存在错误，仅供参考。

任务列表

比对软件
hisat2的用法
下载index文件
比对、排序、索引
质量控制
载入IGV，截图几个基因

比对软件

详见我的文献阅读报告1

出来以上图片里的软件，还有Bowtie2 ,BBMap,BWA,CLC,Novoalign,SMALT等，维基百科上还有一大堆2，以上我接触过Bowtie、Bowtie2、BBMap、BWA 这四个比对软件，但是都用我平时用来分析宏基因组的，跟转录组的比对会有点不同，根据不同的目的选择不同优势的比对软件。（对于几款常用的比对软件肯定有人做过测评，可以查查文献，选适合自己的。）

hisat2的用法

本作业是比对到基因组，所以使用gapped or splices mapper，此流程已经更新。TopHat首次被发表已经是7年前，STAR的比对速度是TopHat的50倍，HISAT更是STAR的1.2倍。3HISAT2是TopHat2/Bowti2的继任者，使用改进的BWT算法，实现了更快的速度和更少的资源占用，作者推荐TopHat2/Bowti2和HISAT的用户转换到HISAT2。4

官网：https://ccb.jhu.edu/software/hisat2/index.shtml（学习一个软件最好的方法就是结合现有中文博文资料，加上阅读官方说明书和HELP文档，一般刚开始学习的时候先使用默认参数，不要乱调参数。）

# HELP文档
$ ./hisat2 -h
HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<SRA accession number> Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
--sra-acc SRA accession ID
Alignment:
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
Spliced Alignment:
--pen-cansplice <int> penalty for a canonical splice site (0)
--pen-noncansplice <int> penalty for a non-canonical splice site (12)
--pen-canintronlen <func> penalty for long introns (G,-8,1) with canonical splice sites
--pen-noncanintronlen <func> penalty for long introns (G,-8,1) with noncanonical splice sites
--min-intronlen <int> minimum intron length (20)
--max-intronlen <int> maximum intron length (500000)
--known-splicesite-infile <path> provide a list of known splice sites
--novel-splicesite-outfile <path> report a list of splice sites
--novel-splicesite-infile <path> provide a list of novel splice sites
--no-temp-splicesite disable the use of splice sites found
--no-spliced-alignment disable spliced alignment
--rna-strandness <string> specify strand-specific information (unstranded)
--tmo reports only those alignments within known transcriptome
--dta reports alignments tailored for transcript assemblers
--dta-cufflinks reports alignments tailored specifically for cufflinks
--avoid-pseudogene tries to avoid aligning reads to pseudogenes (experimental option)
--no-templatelen-adjustment disables template length adjustment for RNA-seq reads
Scoring:
--mp <int>,<int> max and min penalties for mismatch; lower qual = lower penalty <6,2>
--sp <int>,<int> max and min penalties for soft-clipping; lower qual = lower penalty <2,1>
--no-softclip no soft-clipping
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(L,0.0,-0.2)
Reporting:
-k <int> (default: 5) report up to <int> alns per read
Paired-end:
-I/--minins <int> minimum fragment length (0), only valid with --no-spliced-alignment
-X/--maxins <int> maximum fragment length (500), only valid with --no-spliced-alignment
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--summary-file print alignment summary to this file.
--new-summary print alignment summary in a new style, which is more machine-friendly.
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-head supppress header lines, i.e. lines starting with @
--no-sq supppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
Performance:
-o/--offrate <int> override offrate of index; must be >= index's offrate
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'hisat2's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--remove-chrname remove 'chr' from reference names in alignment
--add-chrname add 'chr' to reference names in alignment
--version print version information and quit
-h/--help print this usage message

下载index文件

cd ~/reference
mkdir -p index/hisat && cd index/hisat
wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
tar zxvf hg19.tar.gz
tar xvzf mm10.tar.gz

关于index文件的建立，也可以自己根据基因组用比对软件进行创建，详细步骤仍是要看软件说明文档。

比对、排序、索引

把fastq格式的reads比对上去得到sam文件，接着用samtools把它转为bam文件，并且排序(注意N和P两种排序区别)索引好（可以使用管道实现，省去中间SAM保存的过程，直接输出BAM文件5）。

我用的是hisat2工具来比对，一般情况下我就用默认参数啦！6

reference=/home/jianmingzeng/reference/index/hisat/grcm38/genome
~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589959.fastq -S control_1.sam 2>control_1.log
~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589960.fastq -S control_2.sam 2>control_2.log
~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589961.fastq -S Akap95_1.sam 2>Akap95_1.log
~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589962.fastq -S Akap95_2.sam 2>Akap95_2.log
ls *sam |while read id;do (nohup samtools sort -n -@ 5 -o ${id%%.*}.Nsort.bam $id &);done
# 注意，以上代码是错的！！！

在上面的基础进行修改，编写bash脚本：
软件版本：

map.sh

#! usr/bin/bash
set -u
set -e
set -o pipefail
hg19_ref=~/rna_seq/data/reference/index/hisat/hg19/genome
mm10_ref=~/rna_seq/data/reference/index/hisat/mm10/genome
data_path=~/rna_seq/data
NUM_THREADS=25
ls --color=never Homo*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t -p $NUM_THREADS -x $hg19_ref -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb - >${id%_*}.bam);done
ls --color=never Mus*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t -p $NUM_THREADS -x $mm10_ref -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb - >${id%_*}.bam);done
ls --color=never *.bam|while read id;do(samtools sort --threads $NUM_THREADS $id -o ${id%.*}_sorted.bam);done
ls --color=never *_sorted.bam | while read id;do(samtools index $id);done

检查脚本： （检错bash脚本是否写得正确，其实有个命令的，我给忘了，只能把运行命令行输出来，肉眼检查下。）

运行脚本：
bash map.sh

质量控制

对bam文件进行简单QC

Reads比对后的质量控制(评估比对质量的指标)：
比对上的reads占总reads的百分比；
Reads比对到外显子和参考链上的覆盖度是否一致；
比对到基因组序列,多重比对reads;

相关质控软件除了Picard,RSeQC,Qualimap还有一大堆7。
下面是我简单尝试了下，没有深入研究解读，具体参数详情，要花时间研究。

~/biosoft/qualimap_v2.2.1/qualimap rnaseq -bam Homo_sapiens_Control_293_cell_sorted.bam -outdir Homo_sapiens_Control_293_cell_sorted_QC -pe -s --java-mem-size=60G -gtf ../data/reference/genome/hg19/gencode.v26lift37.annotation.sorted.gtf

conda install ucsc-gtftogenepred
gtfToGenePred -genePredExt -ignoreGroupsWithoutExons -geneNameAsName2 gencode.v26lift37.annotation.gtf gencode.v26lift37.annotation.gpd
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
gzip -d refFlat.txt.gz
java -jar ~/biosoft/picard_2.8.0/picard.jar CollectRnaSeqMetrics I=Homo_sapiens_Control_293_cell_sorted.bam O=Homo_sapiens_Control_293_cell_RNA_Metrics REF_FLAT=~/rna_seq/data/reference/genome/hg19/refFlat.txt STRAND_SPECIFICITY=NONE

bam_stat.py -i Homo_sapiens_Control_293_cell_sorted.bam

载入IGV，截图几个基因

【读文献】RNA-seq数据分析指南:http://www.biotrainee.com/thread-166-1-1.html ↩
List of RNA-Seq bioinformatics tools:https://wiki2.org/en/List_of_RNA-Seq_bioinformatics_tools#Alignment_tools ↩
RNA-seq流程需要进化啦！:http://www.bio-info-trainee.com/1022.html ↩
RNA-Seq比对新流程HISAT2:http://www.biotrainee.com/thread-415-1-1.html ↩
sam2bam工具把NGS数据分析耗时部分给改进了：http://www.biotrainee.com/thread-1076-1-1.html ↩
一个RNA-seq的反思:http://www.bio-info-trainee.com/2275.html ↩
QUALITY CONTROL SOFTWARE TOOLS | RNA SEQUENCING DATA ANALYSIS:https://omictools.com/quality-control2-category ↩

历史记录

PANDA姐的转录组入门(1)：计算机资源的准备

PANDA姐的转录组入门(2)：读文章拿到测序数据

PANDA姐的转录组入门(3):了解fastq测序数据

PANDA姐的转录组入门(4):了解参考基因组及基因注释

~ 热成一条闪电 ~

有问题请联系我

个人微信ID：
Shenmengyuan1993

一把短刀，怎么就让他连捅18人？！

上海超市血案：背后缘由让人揪心

为啥一线城市只有广州取消限购？是因为穷吗

当前三大问题：国家的方向感、精英的安全感、百姓的希望感

这次我怀疑邱成桐已经“学阀化”了