什么? R 版 Hisat2

Original JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集

R 版 Hisat2：Rhisat2

最近逛 bioconductor 网站上，偶然发现了居然还有 R 版本的 hisat2 比对软件！名字叫 Rhisat2。于是就探索了以下，差不多基本和 linux 中使用的 hisat2 差不多，唯一变化的就是运行环境在 R 里而已。

1、安装

# 安装bioconductor上的
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rhisat2")
# 安装最新版
BiocManager::install("fmicompbio/Rhisat2")

# 加载
library(Rhisat2)
# 查看使用手册
browseVignettes("Rhisat2")

2、数据准备

去 UCSC 数据库下载了 Human 的 3 条染色体序列：

# 下载
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr2.fa.gz
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr3.fa.gz
# 解压
$ gunzip ./*.gz

3、软件使用

3.1、构建索引

首先我们先对 fasta 文件构建索引，使用 hisat2_build()函数，首先看看用法：

?hisat2_build()
# This function can be used to call the hisat2-build binary.
# Usage
hisat2_build(
  references,
  outdir,
  ...,
  prefix = "index",
  force = FALSE,
  strict = TRUE,
  execute = TRUE
)

参数说明：

1、references： 字符串向量，参考序列的路径
2、outdir： 输出索引文件的文件夹名称，如果文件夹已经存在，则会报错，可以使用 force=TRUE 解决
3、prefix： 输出索引文件名的前缀
4、force： 逻辑值，是否需要强制重写输出文件夹
5、strict： 逻辑值，是否需要严格检查输入参数
6、execute： 逻辑值，是否需要执行组合好的 shell 命令，FALSE 则返回字符串形式的 shell 命令，不执行命令
7、...： 二进制软件的其他命令参数

我们可以查看 linux 版的 hisat2_buid 的所有参数：

hisat2_build_usage()
 [1] "HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)"
 [2] "Usage: hisat2-build [options]* <reference_in> <ht2_index_base>"
 [3] "    reference_in            comma-separated list of files with ref sequences"
 [4] "    hisat2_index_base       write ht2 data to files with this dir/basename"
 [5] "Options:"
 [6] "    -c                      reference sequences given on cmd line (as"
 [7] "                            <reference_in>)"
 [8] "    -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting"
 [9] "    -p                      number of threads"
[10] "    --bmax <int>            max bucket sz for blockwise suffix-array builder"
[11] "    --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)"
[12] "    --dcv <int>             diff-cover period for blockwise (default: 1024)"
[13] "    --nodc                  disable diff-cover (algorithm becomes quadratic)"
[14] "    -r/--noref              don't build .3/.4.ht2 (packed reference) portion"
[15] "    -3/--justref            just build .3/.4.ht2 (packed reference) portion"
[16] "    -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)"
[17] "    -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)"
[18] "    --localoffrate <int>    SA (local) is sampled every 2^offRate BWT chars (default: 3)"
[19] "    --localftabchars <int>  # of chars consumed in initial lookup in a local index (default: 6)"
[20] "    --snp <path>            SNP file name"
[21] "    --haplotype <path>      haplotype file name"
[22] "    --ss <path>             Splice site file name"
[23] "    --exon <path>           Exon file name"
[24] "    --seed <int>            seed for random number generator"
[25] "    -q/--quiet              verbose output (for debugging)"
[26] "    -h/--help               print detailed description of tool and its options"
[27] "    --usage                 print this usage message"
[28] "    --version               print version information and quit"
[29] ""
[30] "*** Warning ***"
[31] "'hisat2-build-s' was run directly.  It is recommended that you run the wrapper script 'hisat2-build' instead."

那么怎么把这些参数用到 R 里面呢？假如我们使用 " -k 2 " 或者 " -p 6 " 这样的参数，我们只需要使用 " k=2 " 或 " p=6 " 就可以了。

下面正式建立索引：

# 设置工作路径
setwd('C:/Users/admin/Desktop/hisat2')
# 查看当前目录文件
list.files()
[1] "chr1.fa" "chr2.fa" "chr3.fa"
# 保存文件名给fa_files
fa_files <- list.files()
# 使用8个线程构建索引，我们先打印shell命令看看
print(
hisat2_build(references = fa_files,
             outdir = 'my_index',
             prefix = 'Rhisat2_index',
             strict = TRUE,
             execute = FALSE,
             p =8 )
)
[1] "\"C:/Users/admin/Documents/R/win-library/4.0/Rhisat2/hisat2-build\" -p 8 \"chr1.fa\",\"chr2.fa\",\"chr3.fa\" \"my_index/Rhisat2_index\""

# 好像没什么问题，我们运行一下
hisat2_build(references = fa_files,
             outdir = 'my_index',
             prefix = 'Rhisat2_index',
             strict = TRUE,
             execute = TRUE,
             p =8 )

我们的后台疯狂的运行起来了！我 12 个线程怎么 8 个就跑的满满的。运行了五六分钟就结束了。

# 查看当前目录的文件夹，可以看到多生成了一个my_index的文件夹
list.dirs()
[1] "."          "./my_index"
# 查看文件夹内容
list.files(path = './my_index/')
[1] "Rhisat2_index.1.ht2" "Rhisat2_index.2.ht2" "Rhisat2_index.3.ht2"
[4] "Rhisat2_index.4.ht2" "Rhisat2_index.5.ht2" "Rhisat2_index.6.ht2"
[7] "Rhisat2_index.7.ht2" "Rhisat2_index.8.ht2"
# 生成了8个索引文件

3.2、比对

还是先查看比对用法：

?hisat2()
# Usage
hisat2(
  sequences,
  index,
  ...,
  type = c("single", "paired"),
  outfile,
  force = FALSE,
  strict = TRUE,
  execute = TRUE
)

参数说明：

1、sequences： 如果是单端测序数据，直接跟文件名就行了，如果是双端测序数据，跟上为含有 2 个长度的 list，其中 list 中两个元素分别为两个测序文件名。
2、index： 字符串，跟上索引的路径加索引文件名前缀
3、type： 字符串，测序数据类型，单端为 'single'，双端为 'paired'
4、outfile： 字符串，输出文件的路径，可选，如果缺失该参数，则默认返回 R 的字符串向量

其他参数和上面相同，我们看看 hisat2 软件的参数，参数太多，中间省略：

hisat2_usage()
  [1] "HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)"
  [2] "Usage: "
  [3] "  hisat2-align [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]"
  [4] ""
  [5] "  <ht2-idx>  Index filename prefix (minus trailing .X.ht2)."
  [6] "  <m1>       Files with #1 mates, paired with files in <m2>."
  [7] "  <m2>       Files with #2 mates, paired with files in <m1>."
  [8] "  <r>        Files with unpaired reads."
  [9] "  <sam>      File for SAM output (default: stdout)"
 [10] ""
 [11] "  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be"
 [12] "  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'."
 ...
[101] "  --version          print version information and quit"
[102] "  -h/--help          print this usage message"
[103] ""
[104] "*** Warning ***"
[105] "'hisat2-align' was run directly.  It is recommended that you run the wrapper script 'hisat2' instead."

单端比对：

# 查看fastq文件
list.files(pattern = '\\.fq$')
[1] "test_single.fq" "test1_R1.fq"    "test1_R2.fq"
# 比对
hisat2(sequences = './test_single.fq',
       index = 'my_index/Rhisat2_index',
       type = 'single',
       outfile = 'map_data/single.sam',
       force = TRUE,
       strict = TRUE,
       execute = TRUE,
       p = 8)

双端比对：

# 准备list文件名
paired_fa <- as.list(c("test1_R1.fq","test1_R2.fq"))
paired_fa
[[1]]
[1] "test1_R1.fq"

[[2]]
[1] "test1_R2.fq"
# 比对
hisat2(sequences = paired_fa,
       index = 'my_index/Rhisat2_index',
       type = 'paired',
       outfile = 'map_data/test_paired.sam',
       force = TRUE,
       strict = TRUE,
       execute = TRUE,
       p = 8)

还可以在比对过程中提供可变剪接位点信息，帮助在比对过程找到正确的剪接。extract_splice_sites() 函数可以从 gtf、gff3、GRanges、TxDb 等格式的文件提取。

# 提取剪接位点信息，输出到当前目录
gtf <- './hg19.ncbiRefSeq.gtf.gz'
extract_splice_sites(features= gtf, outfile= './sp_file')
# 比对
hisat2(sequences = paired_fa,
       index = 'my_index/Rhisat2_index',
       type = 'paired',
       outfile = 'map_data/test_paired.sam',
       force = TRUE,
       strict = TRUE,
       execute = TRUE,
       `known-splicesite-infile`= 'sp_file',
       p = 8)

使用感

觉得在 R 里处理没有 linux 里方便，也有可能前者我用的比较多吧，R 里也可以批量，后续接 Rsamtools 转为 bam 文件、Rsubread 定量、DESeq2 差异分析可以整条流水线。

欢迎小伙伴留言评论！

今天的分享就到这里了，敬请期待下一篇！

最后欢迎大家分享转发，您的点赞是对我的鼓励和肯定！

如果觉得对您帮助很大，打赏一下吧！

震撼！上海开放Google、Meta等国际平台，中国数字化迈出历史性一步！

战争烈度再升级，特朗普加征10%对华关税

特朗普称计划对中国商品征收25%的关税，外交部回应

从地方目标看全国！今年经济增长还是有很高要求

二姐李思林早知道李玟有轻生的念头，居然没一点防备措施！

什么? R 版 Hisat2

R 版 Hisat2：Rhisat2

1、安装

2、数据准备

3、软件使用

使用感

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣

震撼！上海开放Google、Meta等国际平台，中国数字化迈出历史性一步！

战争烈度再升级，特朗普加征10%对华关税

特朗普称计划对中国商品征收25%的关税，外交部回应

从地方目标看全国！今年经济增长还是有很高要求

二姐李思林早知道李玟有轻生的念头，居然没一点防备措施！

生成图片，分享到微信朋友圈

什么? R 版 Hisat2

R 版 Hisat2：Rhisat2

1、安装

2、数据准备

3、软件使用

使用感

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣