A Detailed Introduction to the Latest Version of the Alternative Splicing Tool rMATS

Original 生信阿拉丁 2022-05-16

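Alternative splicing can produce several types of mRNA from one gene, so a single gene can give rise to several different proteins. This greatly increases the diversity of mRNAs and proteins. Alternative splicing is a post-transcriptional process with important and wide-ranging effects on cellular activity and disease. Studies indicate that an estimated 90-95% of human multi-exon genes undergo alternative splicing. Many tools are now available to detect it; today we look at the latest version of one commonly used tool, rMATS.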

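Software overview

rMATS is one of the commonly used tools for detecting alternative splicing events. From RNA-seq data it detects several types of alternative splicing events, quantifies them, and tests for differences between groups, including groups with biological replicates. Version 4.1, released in June 2020, refined the software further: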

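1. Added parameters such as --task and --tmp, so that parts of the computation can be run on different machines;

2. Added --variable-read-length, which allows reads of different lengths to be analyzed;

3. Added --paired-stats for paired statistical analysis;

4. Added --novelSS, --mil and --mel for detecting novel (unannotated) splice sites;

5. In the output, fromGTF.novelJunction and fromGTF.novelSpliceSite replace fromGTF.novelEvents;

6. The release is compatible with both Python 2 and Python 3;

7. When a group contains only one sample, or only one group is supplied, be sure to add --statoff;

8. Fixed some bugs from earlier versions.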

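Software homepage: http://rnaseq-mats.sourceforge.net/

rMATS detects the following types of alternative splicing events: SE (skipped exon), MXE (mutually exclusive exons), A3SS (alternative 3' splice site), A5SS (alternative 5' splice site) and RI (retained intron).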

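Installation

rMATS turbo is the C/Cython version of rMATS. The main differences are speed and storage: according to its documentation, rMATS turbo is about 100 times faster than the original rMATS and its output files are about 1000 times smaller. See https://github.com/Xinglab/rmats-turbo/blob/v4.1.0/README.md for details. We therefore install rMATS turbo.

Dependencies: Python (either 2.7 or 3.6), BLAS, LAPACK, GNU Scientific Library, GCC, gfortran, CMake and others. Once all of these are present, installation can proceed. In practice, installing through conda takes care of these basic packages.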

conda create --name py2 python=2.7
conda activate py2
conda install -c bioconda rmats

Once installation finishes, the software can be tested.

Usage and testing

Parameter description:

python rmats.py -h

usage: rmats.py [options]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --gtf GTF             An annotation of genes and transcripts in GTF format
  --b1 B1               A text file containing a comma separated list of the
                        BAM files for sample_1. (Only if using BAM)
  --b2 B2               A text file containing a comma separated list of the
                        BAM files for sample_2. (Only if using BAM)
  --s1 S1               A text file containing a comma separated list of the
                        FASTQ files for sample_1. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --s2 S2               A text file containing a comma separated list of the
                        FASTQ files for sample_2. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --od OD               The directory for final output
  --tmp TMP             The directory for intermediate output such as ".rmats"
                        files from the prep step
  -t {paired,single}    Type of read used in the analysis: either "paired" for
                        paired-end data or "single" for single-end data.
                        Default: paired
  --libType {fr-unstranded,fr-firststrand,fr-secondstrand}
                        Library type. Use fr-firststrand or fr-secondstrand
                        for strand-specific data. Default: fr-unstranded
  --readLength READLENGTH
                        The length of each read
  --variable-read-length
                        Allow reads with lengths that differ from --readLength
                        to be processed. --readLength will still be used to
                        determine IncFormLen and SkipFormLen
  --anchorLength ANCHORLENGTH
                        The anchor length. Default is 1
  --tophatAnchor TOPHATANCHOR
                        The "anchor length" or "overhang length" used in the
                        aligner. At least "anchor length" NT must be mapped to
                        each end of a given junction. The default is 6. (Only
                        if using fastq)
  --bi BINDEX           The directory name of the STAR binary indices (name of
                        the directory that contains the SA file). (Only if
                        using fastq)
  --nthread NTHREAD     The number of threads. The optimal number of threads
                        should be equal to the number of CPU cores. Default: 1
  --tstat TSTAT         The number of threads for the statistical model.
                        Default: 1
  --cstat CSTAT         The cutoff splicing difference. The cutoff used in the
                        null hypothesis test for differential splicing. The
                        default is 0.0001 for 0.01% difference. Valid: 0 <=
                        cutoff < 1. Does not apply to the paired stats model
  --task {prep,post,both,inte}
                        Specify which step(s) of rMATS to run. Default: both.
                        prep: preprocess BAMs and generate a .rmats file.
                        post: load .rmats file(s) into memory, detect and
                        count alternative splicing events, and calculate P
                        value (if not --statoff). both: prep + post. inte
                        (integrity): check that the BAM filenames recorded by
                        the prep task(s) match the BAM filenames for the
                        current command line
  --statoff             Skip the statistical analysis
  --paired-stats        Use the paired stats model
  --novelSS             Enable detection of novel splice sites (unannotated
                        splice sites). Default is no detection of novel splice
                        sites
  --mil MIL             Minimum Intron Length. Only impacts --novelSS
                        behavior. Default: 50
  --mel MEL             Maximum Exon Length. Only impacts --novelSS behavior.
                        Default: 500

Running a single sample

Write the full path of the NA12878 BAM file into /path/to/b1.txt:

condadir/envs/py2/bin/python condadir/envs/py2/rMATS/rmats.py --nthread 4 --b1 /path/to/b1.txt --gtf Homo_sapiens.hg19_ucsc.gtf --od NA12878 -t paired --readLength 101 --libType fr-unstranded --statoff

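where

--b1 is a text file listing the BAM file paths; biological replicates are separated by commas. For a single group, supply only --b1 (or --s1);
--gtf is the GTF file of known genes and transcripts; --od is the output directory; -t is the read type, single or paired;
--readLength is the length of each read. If read lengths vary, add --variable-read-length so that reads whose length differs from --readLength are still processed (--readLength is still used to determine IncFormLen and SkipFormLen); --libType is the library type, which selects unstranded or strand-specific data;
--statoff skips the statistical step, as needed for a single sample or a single group.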
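
The list file that --b1 expects can also be generated programmatically. The following is a minimal sketch, not part of rMATS: the helper names `write_bam_list` and `build_single_group_cmd` are hypothetical, and only flags shown in the help text above are used. A --tmp directory is passed as well, mirroring the two-group command in this article; that choice is an assumption of the sketch.

```python
import os
import tempfile


def write_bam_list(bam_paths, list_file):
    """Write BAM paths as one comma-separated line, the format --b1/--b2 read."""
    with open(list_file, "w") as handle:
        handle.write(",".join(bam_paths) + "\n")
    return list_file


def build_single_group_cmd(list_file, gtf, out_dir, tmp_dir, read_length, threads=4):
    """Assemble the argument list for a single-group run with statistics skipped."""
    return [
        "python", "rmats.py",
        "--b1", list_file,
        "--gtf", gtf,
        "--od", out_dir,
        "--tmp", tmp_dir,
        "-t", "paired",
        "--readLength", str(read_length),
        "--libType", "fr-unstranded",
        "--nthread", str(threads),
        "--statoff",  # only one group, so the statistical test is skipped
    ]


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    b1 = write_bam_list(["/path/to/NA12878.bam"], os.path.join(work_dir, "b1.txt"))
    cmd = build_single_group_cmd(b1, "Homo_sapiens.hg19_ucsc.gtf",
                                 "NA12878", "NA12878_tmp", 101)
    # Hand cmd to subprocess.run(cmd, check=True) to actually launch rMATS.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.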

Running a two-group comparison

Contents of /path/to/b1.txt:

/path/to/1_1.bam,/path/to/1_2.bam

Contents of /path/to/b2.txt:

/path/to/2_1.bam,/path/to/2_2.bam

python rmats.py --b1 /path/to/b1.txt --b2 /path/to/b2.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output --paired-stats

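where

--b1 is a text file listing the BAM file paths for group 1; biological replicates are separated by commas. For a single group, supply only --b1 (or --s1);
--b2 is a text file listing the BAM file paths for group 2; biological replicates are separated by commas;
--gtf is the GTF file of known genes and transcripts;
--od is the output directory;
-t is the read type, single or paired;
--readLength is the length of each read; to process reads whose length differs from this value, add --variable-read-length;
--libType is the library type, unstranded or strand-specific (not set in the command above, so the default fr-unstranded applies);
--tmp is the directory for intermediate files;
--paired-stats uses the paired statistical model.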
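
According to the help text, --task lets the prep and post steps run as separate invocations. The sketch below is a hypothetical helper (`build_rmats_cmd` is not part of rMATS) that assembles the two argument lists using only flags from the help output. How the .rmats files written by prep are gathered into the --tmp directory that post reads is not covered here and should be checked against the README.

```python
def build_rmats_cmd(task, gtf, out_dir, tmp_dir, read_length, b1, b2=None, extra=()):
    """Return an rmats.py argument list for one --task value."""
    if task not in {"prep", "post", "both", "inte"}:
        raise ValueError("unknown task: " + task)
    cmd = ["python", "rmats.py", "--task", task, "--b1", b1]
    if b2 is not None:
        cmd += ["--b2", b2]
    cmd += ["--gtf", gtf, "-t", "paired", "--readLength", str(read_length),
            "--od", out_dir, "--tmp", tmp_dir]
    cmd += list(extra)
    return cmd


# Placeholder paths taken from the comparison command in this article.
common = dict(gtf="/path/to/the.gtf", out_dir="/path/to/output",
              tmp_dir="/path/to/tmp_output", read_length=50,
              b1="/path/to/b1.txt", b2="/path/to/b2.txt")

prep_cmd = build_rmats_cmd("prep", **common)                           # writes .rmats files
post_cmd = build_rmats_cmd("post", extra=["--paired-stats"], **common)  # counts events, runs stats

print(" ".join(prep_cmd))
print(" ".join(post_cmd))
```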

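Notes

Besides BAM files, FASTQ files can be used as input through --s1 and --s2. Within a sample, the two paired-end read files are separated by a colon; biological replicates are separated by commas. As the help text notes, FASTQ input also requires --bi, the directory containing the STAR index.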
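
The colon and comma rule for the --s1/--s2 list files can be sketched as follows. The function name `format_fastq_list` and the FASTQ file names are hypothetical, chosen only for illustration.

```python
def format_fastq_list(replicates):
    """Join the mates of a pair with ':' and join replicates with ','.

    Each replicate is either a single path (single-end data) or a
    (read1, read2) tuple (paired-end data).
    """
    entries = []
    for files in replicates:
        if isinstance(files, str):
            entries.append(files)
        else:
            entries.append(":".join(files))
    return ",".join(entries)


s1_line = format_fastq_list([
    ("/path/to/1_1.R1.fastq", "/path/to/1_1.R2.fastq"),
    ("/path/to/1_2.R1.fastq", "/path/to/1_2.R2.fastq"),
])
print(s1_line)  # write this single line into the file passed to --s1
```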

Output description

Each type of alternative splicing event has its own set of output files, whose names contain the event type. Below, [AS_Event] stands for any of SE (skipped exon), MXE (mutually exclusive exons), A3SS (alternative 3' splice site), A5SS (alternative 5' splice site) and RI (retained intron):

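[AS_Event].MATS.JC.txt: results based on reads that span the junctions (Junction Counts);
[AS_Event].MATS.JCEC.txt: results based on reads that span the junctions (Junction Counts) plus reads that fall on the exon without crossing a junction (Exon Counts); when working with known alternative splicing events, this file deserves the most attention;
fromGTF.[AS_Event].txt: all alternative splicing events identified from the RNA data and the GTF;
fromGTF.novelJunction.[AS_Event].txt: events identified only after taking the RNA data into account, as opposed to analyzing the GTF alone; events with unannotated splice sites are not included;
fromGTF.novelSpliceSite.[AS_Event].txt: only events that involve an unannotated splice site; produced only when --novelSS is used;
JC.raw.input.[AS_Event].txt: the raw input counts behind [AS_Event].MATS.JC.txt;
JCEC.raw.input.[AS_Event].txt: the raw input counts behind [AS_Event].MATS.JCEC.txt.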
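
The naming scheme above can be expanded into concrete file names, for example to check an output directory after a run. This is a small sketch; `expected_outputs` and `missing_outputs` are hypothetical helpers, and whether every file is written depends on the options used (for instance, the novelSpliceSite files only appear with --novelSS).

```python
EVENT_TYPES = ["SE", "MXE", "A3SS", "A5SS", "RI"]

FILE_PATTERNS = [
    "{event}.MATS.JC.txt",
    "{event}.MATS.JCEC.txt",
    "fromGTF.{event}.txt",
    "fromGTF.novelJunction.{event}.txt",
    "fromGTF.novelSpliceSite.{event}.txt",
    "JC.raw.input.{event}.txt",
    "JCEC.raw.input.{event}.txt",
]


def expected_outputs(event):
    """File names described above for one event type."""
    return [pattern.format(event=event) for pattern in FILE_PATTERNS]


def missing_outputs(present_names):
    """Expected file names that are absent from a collection of file names."""
    present = set(present_names)
    return [name
            for event in EVENT_TYPES
            for name in expected_outputs(event)
            if name not in present]


# Typical use: missing_outputs(os.listdir("/path/to/output"))
print(expected_outputs("SE"))
```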

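Columns shared by all event files

ID: rMATS event ID;
GeneID: gene ID;
geneSymbol: gene name;
chr: chromosome;
strand: strand of the gene (+ or -);
IJC_SAMPLE_1: number of reads supporting the inclusion form in sample 1; replicates are comma-separated;
SJC_SAMPLE_1: number of reads supporting the skipping form in sample 1; replicates are comma-separated;
IJC_SAMPLE_2: number of reads supporting the inclusion form in sample 2; replicates are comma-separated;
SJC_SAMPLE_2: number of reads supporting the skipping form in sample 2; replicates are comma-separated;
IncFormLen: length of the inclusion form, used for normalization;
SkipFormLen: length of the skipping form, used for normalization;
PValue: significance of the splicing difference between the two groups (present only when the statistical model is used);
FDR: false discovery rate calculated from the p-value (present only when the statistical model is used);
IncLevel1: inclusion level of sample 1, calculated from the normalized read counts; replicates are comma-separated;
IncLevel2: inclusion level of sample 2, calculated from the normalized read counts; replicates are comma-separated;
IncLevelDifference: average(IncLevel1) - average(IncLevel2).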
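
To make the relationship between the count columns and the inclusion levels concrete, the sketch below recomputes IncLevelDifference from one row. It assumes the inclusion level is the length-normalized fraction of inclusion reads, (IJC/IncFormLen) / (IJC/IncFormLen + SJC/SkipFormLen); the function names and the example numbers are made up for illustration, and rMATS itself may round values or report NA.

```python
def parse_counts(field):
    """Split a comma-separated count column such as IJC_SAMPLE_1 into integers."""
    return [int(value) for value in field.split(",")]


def inclusion_level(ijc, sjc, inc_form_len, skip_form_len):
    """Length-normalized inclusion fraction; None when no reads support either form."""
    inc = float(ijc) / inc_form_len
    skip = float(sjc) / skip_form_len
    if inc + skip == 0:
        return None
    return inc / (inc + skip)


def mean_or_none(values):
    """Average of the values that are not None, or None if nothing is left."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


def inc_level_difference(row):
    """average(IncLevel1) - average(IncLevel2), recomputed from the count columns."""
    inc_len = int(row["IncFormLen"])
    skip_len = int(row["SkipFormLen"])
    means = []
    for sample in ("1", "2"):
        ijc = parse_counts(row["IJC_SAMPLE_" + sample])
        sjc = parse_counts(row["SJC_SAMPLE_" + sample])
        levels = [inclusion_level(i, s, inc_len, skip_len) for i, s in zip(ijc, sjc)]
        means.append(mean_or_none(levels))
    if means[0] is None or means[1] is None:
        return None
    return means[0] - means[1]


# Made-up example row with two replicates per group.
row = {
    "IJC_SAMPLE_1": "30,40", "SJC_SAMPLE_1": "10,20",
    "IJC_SAMPLE_2": "10,10", "SJC_SAMPLE_2": "30,40",
    "IncFormLen": "200", "SkipFormLen": "100",
}
print(inc_level_difference(row))
```

A positive difference means the inclusion form is more abundant in sample 1 than in sample 2.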

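Event-specific columns

SE: exonStart_0base, exonEnd, upstreamES, upstreamEE, downstreamES, downstreamEE

The inclusion form contains the target exon (exonStart_0base, exonEnd).

MXE: 1stExonStart_0base, 1stExonEnd, 2ndExonStart_0base, 2ndExonEnd, upstreamES, upstreamEE, downstreamES, downstreamEE

On the + strand, the inclusion form contains the 1st exon (1stExonStart_0base, 1stExonEnd) and skips the 2nd exon.
On the - strand, the inclusion form contains the 2nd exon (2ndExonStart_0base, 2ndExonEnd) and skips the 1st exon.

A3SS, A5SS: longExonStart_0base, longExonEnd, shortES, shortEE, flankingES, flankingEE

The inclusion form uses the long exon (longExonStart_0base, longExonEnd) instead of the short exon (shortES, shortEE).

RI: riExonStart_0base, riExonEnd, upstreamES, upstreamEE, downstreamES, downstreamEE

The inclusion form retains the intron, which spans (upstreamEE, downstreamES).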
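
As the column name exonStart_0base suggests, start coordinates are 0-based. The sketch below turns the target exon of an SE event into a 1-based region string of the kind genome browsers accept. It assumes the end column is end-exclusive in 0-based terms (equivalently, 1-based inclusive), as in BED files; the function names and the example row are hypothetical.

```python
def se_exon_region(row):
    """Format the target exon of an SE event as chr:start-end with a 1-based start."""
    start_1based = int(row["exonStart_0base"]) + 1
    end = int(row["exonEnd"])
    return "{}:{}-{} ({})".format(row["chr"], start_1based, end, row["strand"])


def se_exon_length(row):
    """Exon length under the half-open coordinate assumption stated above."""
    return int(row["exonEnd"]) - int(row["exonStart_0base"])


# Made-up example row containing only the columns the helpers read.
example = {"chr": "chr1", "strand": "+", "exonStart_0base": "100", "exonEnd": "200"}
print(se_exon_region(example))
print(se_exon_length(example))
```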

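Summary

Overall, rMATS 4.1 handles single-end or paired-end sequencing, reads of varying length, data with or without biological replicates, one group or two, optional detection of novel splice sites, and stranded or unstranded libraries. It can also be run in separate steps across machines. It is a full-featured tool that covers the main types of alternative splicing events and ranks among the best tools for detecting alternative splicing from second-generation sequencing data. We hope this introduction helps with your alternative splicing analyses.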

References:

Mehmood A, Laiho A, Venäläinen M S, et al. Systematic evaluation of differential splicing tools for RNA-seq studies[J]. Briefings in Bioinformatics, 2019.
Shen S, Park J W, Lu Z, et al. rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data[J]. Proc Natl Acad Sci U S A, 2014, 111(51):E5593-E5601.
Park J W, Tokheim C, Shen S, et al. Identifying Differential Alternative Splicing Events from RNA Sequencing Data Using RNASeq-MATS[M]// Deep Sequencing Data Analysis. Humana Press, 2013.
Shen S, Park J W, Huang J, et al. MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data[J]. Nucleic Acids Research, 2012, 40(8):e61.
http://rnaseq-mats.sourceforge.net/
https://github.com/Xinglab/rmats-turbo/blob/v4.1.0

Author: 椰子糖

Reviewer: 童蒙

Editor: amethyst
