查看原文
其他

PANDA姐的转录组入门(5):序列比对

2017-07-22 沈梦圆 沈梦圆

明天要去深圳开会一个星期,所以提前把下周的序列比对做了。为了节约时间,我把数据从笔记本拷贝到服务器上进行操作。(用的是WinSCP,用来服务器与本地电脑的数据传输。)关于转录组的比对,我还没有接触过,以下仅为个人学习笔记,可能存在错误,仅供参考。


任务列表

  • 比对软件

  • hisat2的用法

  • 下载index文件

  • 比对、排序、索引

  • 质量控制

  • 载入IGV,截图几个基因

比对软件

详见我的文献阅读报告1

出来以上图片里的软件,还有Bowtie2 ,BBMap,BWA,CLC,Novoalign,SMALT等,维基百科上还有一大堆2,以上我接触过Bowtie、Bowtie2、BBMap、BWA 这四个比对软件,但是都用我平时用来分析宏基因组的,跟转录组的比对会有点不同,根据不同的目的选择不同优势的比对软件。(对于几款常用的比对软件肯定有人做过测评,可以查查文献,选适合自己的。)

hisat2的用法

本作业是比对到基因组,所以使用gapped or splices mapper,此流程已经更新。TopHat首次被发表已经是7年前,STAR的比对速度是TopHat的50倍,HISAT更是STAR的1.2倍。3HISAT2是TopHat2/Bowti2的继任者,使用改进的BWT算法,实现了更快的速度和更少的资源占用,作者推荐TopHat2/Bowti2和HISAT的用户转换到HISAT2。4

官网:https://ccb.jhu.edu/software/hisat2/index.shtml(学习一个软件最好的方法就是结合现有中文博文资料,加上阅读官方说明书和HELP文档,一般刚开始学习的时候先使用默认参数,不要乱调参数。)

  1. # HELP文档

  2. $ ./hisat2 -h

  3. HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)

  4. Usage:

  5.  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

  6.  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).

  7.  <m1>       Files with #1 mates, paired with files in <m2>.                      

  8.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  9.  <m2>       Files with #2 mates, paired with files in <m1>.

  10.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  11.  <r>        Files with unpaired reads.

  12.             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  13.  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.

  14.  <sam>      File for SAM output (default: stdout)

  15.  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be

  16.  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

  17. Options (defaults in parentheses):

  18. Input:

  19.  -q                 query input files are FASTQ .fq/.fastq (default)

  20.  --qseq             query input files are in Illumina's qseq format

  21.  -f                 query input files are (multi-)FASTA .fa/.mfa

  22.  -r                 query input files are raw one-sequence-per-line

  23.  -c                 <m1>, <m2>, <r> are sequences themselves, not files

  24.  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)

  25.  -u/--upto <int>    stop after first <int> reads/pairs (no limit)

  26.  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)

  27.  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)

  28.  --phred33          qualities are Phred+33 (default)

  29.  --phred64          qualities are Phred+64

  30.  --int-quals        qualities encoded as space-delimited integers

  31.  --sra-acc          SRA accession ID

  32. Alignment:

  33.  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)

  34.  --ignore-quals     treat all quality values as 30 on Phred scale (off)

  35.  --nofw             do not align forward (original) version of read (off)

  36.  --norc             do not align reverse-complement version of read (off)

  37. Spliced Alignment:

  38.  --pen-cansplice <int>              penalty for a canonical splice site (0)

  39.  --pen-noncansplice <int>           penalty for a non-canonical splice site (12)

  40.  --pen-canintronlen <func>          penalty for long introns (G,-8,1) with canonical splice sites

  41.  --pen-noncanintronlen <func>       penalty for long introns (G,-8,1) with noncanonical splice sites

  42.  --min-intronlen <int>              minimum intron length (20)

  43.  --max-intronlen <int>              maximum intron length (500000)

  44.  --known-splicesite-infile <path>   provide a list of known splice sites

  45.  --novel-splicesite-outfile <path>  report a list of splice sites

  46.  --novel-splicesite-infile <path>   provide a list of novel splice sites

  47.  --no-temp-splicesite               disable the use of splice sites found

  48.  --no-spliced-alignment             disable spliced alignment

  49.  --rna-strandness <string>          specify strand-specific information (unstranded)

  50.  --tmo                              reports only those alignments within known transcriptome

  51.  --dta                              reports alignments tailored for transcript assemblers

  52.  --dta-cufflinks                    reports alignments tailored specifically for cufflinks

  53.  --avoid-pseudogene                 tries to avoid aligning reads to pseudogenes (experimental option)

  54.  --no-templatelen-adjustment        disables template length adjustment for RNA-seq reads

  55. Scoring:

  56.  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>

  57.  --sp <int>,<int>   max and min penalties for soft-clipping; lower qual = lower penalty <2,1>

  58.  --no-softclip      no soft-clipping

  59.  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)

  60.  --rdg <int>,<int>  read gap open, extend penalties (5,3)

  61.  --rfg <int>,<int>  reference gap open, extend penalties (5,3)

  62.  --score-min <func> min acceptable alignment score w/r/t read length

  63.                     (L,0.0,-0.2)

  64. Reporting:

  65.  -k <int> (default: 5) report up to <int> alns per read

  66. Paired-end:

  67.  -I/--minins <int>  minimum fragment length (0), only valid with --no-spliced-alignment

  68.  -X/--maxins <int>  maximum fragment length (500), only valid with --no-spliced-alignment

  69.  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)

  70.  --no-mixed         suppress unpaired alignments for paired reads

  71.  --no-discordant    suppress discordant alignments for paired reads

  72. Output:

  73.  -t/--time          print wall-clock time taken by search phases

  74.  --un <path>           write unpaired reads that didn't align to <path>

  75.  --al <path>           write unpaired reads that aligned at least once to <path>

  76.  --un-conc <path>      write pairs that didn't align concordantly to <path>

  77.  --al-conc <path>      write pairs that aligned concordantly at least once to <path>

  78.  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.

  79.  --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)

  80.  --summary-file     print alignment summary to this file.

  81.  --new-summary      print alignment summary in a new style, which is more machine-friendly.

  82.  --quiet            print nothing to stderr except serious errors

  83.  --met-file <path>  send metrics to file at <path> (off)

  84.  --met-stderr       send metrics to stderr (off)

  85.  --met <int>        report internal counters & metrics every <int> secs (1)

  86.  --no-head          supppress header lines, i.e. lines starting with @

  87.  --no-sq            supppress @SQ header lines

  88.  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field

  89.  --rg <text>        add <text> ("lab:value") to @RG line of SAM header.

  90.                     Note: @RG line only printed when --rg-id is set.

  91.  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.

  92. Performance:

  93.  -o/--offrate <int> override offrate of index; must be >= index's offrate

  94.  -p/--threads <int> number of alignment threads to launch (1)

  95.  --reorder          force SAM output order to match order of input reads

  96.  --mm               use memory-mapped I/O for index; many 'hisat2's can share

  97. Other:

  98.  --qc-filter        filter out reads that are bad according to QSEQ filter

  99.  --seed <int>       seed for random number generator (0)

  100.  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes

  101.  --remove-chrname   remove 'chr' from reference names in alignment

  102.  --add-chrname      add 'chr' to reference names in alignment

  103.  --version          print version information and quit

  104.  -h/--help          print this usage message

下载index文件

  1. cd ~/reference

  2. mkdir -p index/hisat && cd index/hisat

  3. wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz  

  4. wget -c ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz

  5. tar zxvf hg19.tar.gz

  6. tar xvzf mm10.tar.gz

关于index文件的建立,也可以自己根据基因组用比对软件进行创建,详细步骤仍是要看软件说明文档。

比对、排序、索引

把fastq格式的reads比对上去得到sam文件,接着用samtools把它转为bam文件,并且排序(注意N和P两种排序区别)索引好(可以使用管道实现,省去中间SAM保存的过程,直接输出BAM文件5)

我用的是hisat2工具来比对,一般情况下我就用默认参数啦!6

  1. reference=/home/jianmingzeng/reference/index/hisat/grcm38/genome

  2. ~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589959.fastq -S control_1.sam 2>control_1.log

  3. ~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589960.fastq -S control_2.sam 2>control_2.log

  4. ~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589961.fastq -S Akap95_1.sam 2>Akap95_1.log

  5. ~/biosoft/HISAT/current/hisat2 -p 5 -x $reference -U SRR3589962.fastq -S Akap95_2.sam 2>Akap95_2.log

  6. ls *sam |while read id;do (nohup samtools sort -n -@ 5 -o ${id%%.*}.Nsort.bam $id &);done

  7. # 注意,以上代码是错的!!!

在上面的基础进行修改,编写bash脚本: 
软件版本: 

map.sh

  1. #! usr/bin/bash

  2. set -u

  3. set -e

  4. set -o pipefail

  5. hg19_ref=~/rna_seq/data/reference/index/hisat/hg19/genome

  6. mm10_ref=~/rna_seq/data/reference/index/hisat/mm10/genome

  7. data_path=~/rna_seq/data

  8. NUM_THREADS=25

  9. ls --color=never Homo*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t -p $NUM_THREADS -x  $hg19_ref  -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb  - >${id%_*}.bam);done

  10. ls --color=never  Mus*1.fastq.gz|while read id;do(~/biosoft/hisat2-2.1.0/hisat2 -t  -p $NUM_THREADS -x  $mm10_ref  -1 $data_path/${id%_*}_1.fastq.gz -2 $data_path/${id%_*}_2.fastq.gz 2>${id%_*}_map.log | samtools view -Sb  -  >${id%_*}.bam);done

  11. ls --color=never  *.bam|while read id;do(samtools sort --threads $NUM_THREADS $id  -o ${id%.*}_sorted.bam);done

  12. ls --color=never  *_sorted.bam | while read id;do(samtools index $id);done

检查脚本: (检错bash脚本是否写得正确,其实有个命令的,我给忘了,只能把运行命令行输出来,肉眼检查下。)

运行脚本: 
bash map.sh 

质量控制

对bam文件进行简单QC 

Reads比对后的质量控制(评估比对质量的指标): 
比对上的reads占总reads的百分比; 
Reads比对到外显子和参考链上的覆盖度是否一致; 
比对到基因组序列,多重比对reads;

相关质控软件除了Picard,RSeQC,Qualimap还有一大堆7。 
下面是我简单尝试了下,没有深入研究解读,具体参数详情,要花时间研究。

~/biosoft/qualimap_v2.2.1/qualimap rnaseq -bam Homo_sapiens_Control_293_cell_sorted.bam -outdir Homo_sapiens_Control_293_cell_sorted_QC -pe -s --java-mem-size=60G -gtf ../data/reference/genome/hg19/gencode.v26lift37.annotation.sorted.gtf 

  1. conda install ucsc-gtftogenepred

  2. gtfToGenePred -genePredExt -ignoreGroupsWithoutExons -geneNameAsName2 gencode.v26lift37.annotation.gtf gencode.v26lift37.annotation.gpd

  3. wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz

  4. gzip -d refFlat.txt.gz

  5. java -jar ~/biosoft/picard_2.8.0/picard.jar CollectRnaSeqMetrics I=Homo_sapiens_Control_293_cell_sorted.bam O=Homo_sapiens_Control_293_cell_RNA_Metrics REF_FLAT=~/rna_seq/data/reference/genome/hg19/refFlat.txt STRAND_SPECIFICITY=NONE

bam_stat.py -i Homo_sapiens_Control_293_cell_sorted.bam 

载入IGV,截图几个基因


  1. 【读文献】RNA-seq数据分析指南:http://www.biotrainee.com/thread-166-1-1.html 

  2. List of RNA-Seq bioinformatics tools:https://wiki2.org/en/List_of_RNA-Seq_bioinformatics_tools#Alignment_tools 

  3. RNA-seq流程需要进化啦!:http://www.bio-info-trainee.com/1022.html 

  4. RNA-Seq比对新流程HISAT2:http://www.biotrainee.com/thread-415-1-1.html 

  5. sam2bam工具把NGS数据分析耗时部分给改进了:http://www.biotrainee.com/thread-1076-1-1.html 

  6. 一个RNA-seq的反思:http://www.bio-info-trainee.com/2275.html 

  7. QUALITY CONTROL SOFTWARE TOOLS | RNA SEQUENCING DATA ANALYSIS:https://omictools.com/quality-control2-category 

历史记录

PANDA姐的转录组入门(1):计算机资源的准备

PANDA姐的转录组入门(2):读文章拿到测序数据

PANDA姐的转录组入门(3):了解fastq测序数据

PANDA姐的转录组入门(4):了解参考基因组及基因注释


~ 热成一条闪电 ~

有问题请联系我

个人微信ID:
Shenmengyuan1993

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存