什么? R 版 Hisat2
R 版 Hisat2:Rhisat2
最近逛 bioconductor 网站上,偶然发现了居然还有 R 版本的 hisat2 比对软件!名字叫 Rhisat2
。于是就探索了以下,差不多基本和 linux 中使用的 hisat2 差不多,唯一变化的就是运行环境在 R 里而已。
1、安装
# 安装bioconductor上的
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rhisat2")
# 安装最新版
BiocManager::install("fmicompbio/Rhisat2")
# 加载
library(Rhisat2)
# 查看使用手册
browseVignettes("Rhisat2")
2、数据准备
去 UCSC 数据库下载了 Human 的 3 条染色体序列:
# 下载
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr2.fa.gz
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr3.fa.gz
# 解压
$ gunzip ./*.gz
3、软件使用
3.1、构建索引
首先我们先对 fasta 文件构建索引,使用 hisat2_build()
函数,首先看看用法:
?hisat2_build()
# This function can be used to call the hisat2-build binary.
# Usage
hisat2_build(
references,
outdir,
...,
prefix = "index",
force = FALSE,
strict = TRUE,
execute = TRUE
)
参数说明:
1、references: 字符串向量,参考序列的路径
2、outdir: 输出索引文件的文件夹名称,如果文件夹已经存在,则会报错,可以使用 force=TRUE
解决
3、prefix: 输出索引文件名的前缀
4、force: 逻辑值,是否需要强制重写输出文件夹
5、strict: 逻辑值,是否需要严格检查输入参数
6、execute: 逻辑值,是否需要执行组合好的 shell 命令,FALSE 则返回字符串形式的 shell 命令,不执行命令
7、...: 二进制软件的其他命令参数
我们可以查看 linux 版的 hisat2_buid 的所有参数:
hisat2_build_usage()
[1] "HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)"
[2] "Usage: hisat2-build [options]* <reference_in> <ht2_index_base>"
[3] " reference_in comma-separated list of files with ref sequences"
[4] " hisat2_index_base write ht2 data to files with this dir/basename"
[5] "Options:"
[6] " -c reference sequences given on cmd line (as"
[7] " <reference_in>)"
[8] " -a/--noauto disable automatic -p/--bmax/--dcv memory-fitting"
[9] " -p number of threads"
[10] " --bmax <int> max bucket sz for blockwise suffix-array builder"
[11] " --bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)"
[12] " --dcv <int> diff-cover period for blockwise (default: 1024)"
[13] " --nodc disable diff-cover (algorithm becomes quadratic)"
[14] " -r/--noref don't build .3/.4.ht2 (packed reference) portion"
[15] " -3/--justref just build .3/.4.ht2 (packed reference) portion"
[16] " -o/--offrate <int> SA is sampled every 2^offRate BWT chars (default: 5)"
[17] " -t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)"
[18] " --localoffrate <int> SA (local) is sampled every 2^offRate BWT chars (default: 3)"
[19] " --localftabchars <int> # of chars consumed in initial lookup in a local index (default: 6)"
[20] " --snp <path> SNP file name"
[21] " --haplotype <path> haplotype file name"
[22] " --ss <path> Splice site file name"
[23] " --exon <path> Exon file name"
[24] " --seed <int> seed for random number generator"
[25] " -q/--quiet verbose output (for debugging)"
[26] " -h/--help print detailed description of tool and its options"
[27] " --usage print this usage message"
[28] " --version print version information and quit"
[29] ""
[30] "*** Warning ***"
[31] "'hisat2-build-s' was run directly. It is recommended that you run the wrapper script 'hisat2-build' instead."
那么怎么把这些参数用到 R 里面呢?假如我们使用 " -k 2 " 或者 " -p 6 " 这样的参数,我们只需要使用 " k=2 " 或 " p=6 " 就可以了。
下面正式建立索引:
# 设置工作路径
setwd('C:/Users/admin/Desktop/hisat2')
# 查看当前目录文件
list.files()
[1] "chr1.fa" "chr2.fa" "chr3.fa"
# 保存文件名给fa_files
fa_files <- list.files()
# 使用8个线程构建索引,我们先打印shell命令看看
print(
hisat2_build(references = fa_files,
outdir = 'my_index',
prefix = 'Rhisat2_index',
strict = TRUE,
execute = FALSE,
p =8 )
)
[1] "\"C:/Users/admin/Documents/R/win-library/4.0/Rhisat2/hisat2-build\" -p 8 \"chr1.fa\",\"chr2.fa\",\"chr3.fa\" \"my_index/Rhisat2_index\""
# 好像没什么问题,我们运行一下
hisat2_build(references = fa_files,
outdir = 'my_index',
prefix = 'Rhisat2_index',
strict = TRUE,
execute = TRUE,
p =8 )
我们的后台疯狂的运行起来了!我 12 个线程怎么 8 个就跑的满满的。运行了五六分钟就结束了。
# 查看当前目录的文件夹,可以看到多生成了一个my_index的文件夹
list.dirs()
[1] "." "./my_index"
# 查看文件夹内容
list.files(path = './my_index/')
[1] "Rhisat2_index.1.ht2" "Rhisat2_index.2.ht2" "Rhisat2_index.3.ht2"
[4] "Rhisat2_index.4.ht2" "Rhisat2_index.5.ht2" "Rhisat2_index.6.ht2"
[7] "Rhisat2_index.7.ht2" "Rhisat2_index.8.ht2"
# 生成了8个索引文件
3.2、比对
还是先查看比对用法:
?hisat2()
# Usage
hisat2(
sequences,
index,
...,
type = c("single", "paired"),
outfile,
force = FALSE,
strict = TRUE,
execute = TRUE
)
参数说明:
1、sequences: 如果是单端测序数据,直接跟文件名就行了,如果是双端测序数据,跟上为含有 2 个长度的 list,其中 list 中两个元素分别为两个测序文件名。
2、index: 字符串,跟上索引的路径加索引文件名前缀
3、type: 字符串,测序数据类型,单端为 'single'
,双端为 'paired'
4、outfile: 字符串,输出文件的路径,可选,如果缺失该参数,则默认返回 R 的字符串向量
其他参数和上面相同,我们看看 hisat2 软件的参数,参数太多,中间省略:
hisat2_usage()
[1] "HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)"
[2] "Usage: "
[3] " hisat2-align [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]"
[4] ""
[5] " <ht2-idx> Index filename prefix (minus trailing .X.ht2)."
[6] " <m1> Files with #1 mates, paired with files in <m2>."
[7] " <m2> Files with #2 mates, paired with files in <m1>."
[8] " <r> Files with unpaired reads."
[9] " <sam> File for SAM output (default: stdout)"
[10] ""
[11] " <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be"
[12] " specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'."
...
[101] " --version print version information and quit"
[102] " -h/--help print this usage message"
[103] ""
[104] "*** Warning ***"
[105] "'hisat2-align' was run directly. It is recommended that you run the wrapper script 'hisat2' instead."
单端比对:
# 查看fastq文件
list.files(pattern = '\\.fq$')
[1] "test_single.fq" "test1_R1.fq" "test1_R2.fq"
# 比对
hisat2(sequences = './test_single.fq',
index = 'my_index/Rhisat2_index',
type = 'single',
outfile = 'map_data/single.sam',
force = TRUE,
strict = TRUE,
execute = TRUE,
p = 8)
双端比对:
# 准备list文件名
paired_fa <- as.list(c("test1_R1.fq","test1_R2.fq"))
paired_fa
[[1]]
[1] "test1_R1.fq"
[[2]]
[1] "test1_R2.fq"
# 比对
hisat2(sequences = paired_fa,
index = 'my_index/Rhisat2_index',
type = 'paired',
outfile = 'map_data/test_paired.sam',
force = TRUE,
strict = TRUE,
execute = TRUE,
p = 8)
还可以在比对过程中提供可变剪接位点信息,帮助在比对过程找到正确的剪接。extract_splice_sites() 函数可以从 gtf、gff3、GRanges、TxDb 等格式的文件提取。
# 提取剪接位点信息,输出到当前目录
gtf <- './hg19.ncbiRefSeq.gtf.gz'
extract_splice_sites(features= gtf, outfile= './sp_file')
# 比对
hisat2(sequences = paired_fa,
index = 'my_index/Rhisat2_index',
type = 'paired',
outfile = 'map_data/test_paired.sam',
force = TRUE,
strict = TRUE,
execute = TRUE,
`known-splicesite-infile`= 'sp_file',
p = 8)
使用感
觉得在 R 里处理没有 linux 里方便,也有可能前者我用的比较多吧,R 里也可以批量,后续接 Rsamtools 转为 bam 文件、Rsubread 定量、DESeq2 差异分析可以整条流水线。
欢迎小伙伴留言评论!
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,打赏一下吧!