RNA-seq：Salmon 定量结果差异分析

Original JunJunLab 老俊俊的生信笔记 2022-08-15

Part1RNA-seq：Salmon 定量结果差异分析

上篇文章讲到如何使用 salmon 快速进行定量分析，这回讲解如何将定量结果导入到 DESeq2 软件中进行差异分析。
Salmon 下游分析主要有：

differential transcript usage (DTU) 差异转录本的使用
differential gene expression (DGE) 差异基因表达分析

这里主要介绍差异基因表达分析，DTU 分析可参考：Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification^[1] 或者 https://f1000research.com/articles/7-952/v3^[2]

1安装并加载 R 包

BiocManager::install("tximport")
BiocManager::install("GenomicFeatures")
library(tximport)
library(GenomicFeatures)

2构造输入文件

# 设置定量结果存放目录
setwd('D:/rnaseq/salmon/quants')
# 获取当前工作路径
dir <- getwd()
# 保存当前文件夹名给file
file <- c('test1_rep1_quant','test1_rep2_quant',
          'test2_rep1_quant','test2_rep2_quant')
# 或者
file <- list.files(pattern = '*.quant')

# 构建定量结果文件路径
file_dir <- file.path(dir, file, 'quant.sf')
file_dir
[1] "D:/rnaseq/salmon/quants/test1_rep1_quant/quant.sf"
[2] "D:/rnaseq/salmon/quants/test1_rep2_quant/quant.sf"
[3] "D:/rnaseq/salmon/quants/test2_rep1_quant/quant.sf"
[4] "D:/rnaseq/salmon/quants/test2_rep2_quant/quant.sf"
# 给文件命名
names(file_dir) <- c('test1_rep1','test1_rep2','test2_rep1','test2_rep2')
file_dir
                                         test1_rep1                                          test1_rep2
"D:/rnaseq/salmon/quants/test1_rep1_quant/quant.sf" "D:/rnaseq/salmon/quants/test1_rep2_quant/quant.sf"
                                         test2_rep1                                          test2_rep2
"D:/rnaseq/salmon/quants/test2_rep1_quant/quant.sf" "D:/rnaseq/salmon/quants/test2_rep2_quant/quant.sf"

3准备转录本 id 和基因 id 对应关系文件

可以从 TxDb 包中获得，也可以自己从 gtf 注释文件中获得，推荐后者，尽量和上游分析注释文件保持一致，注释文件 R 包都是 UCSC 的注释类型，可能注释完整性每个数据库也不一样。

# 从自己gtf文件获取，把gtf文件放到当前工作目录
# 读取gtf文件
txdb <- makeTxDbFromGFF(file = 'gencode.vM27.annotation.gtf',format = 'gtf',organism = 'Mus musculus')

# 查看keys类型
keytypes(txdb)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

# 选择GENEID和TXNAME
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")

# 查看内容
head(tx2gene,4)
                TXNAME               GENEID
1 ENSMUST00000193812.2 ENSMUSG00000102693.2
2 ENSMUST00000082908.3 ENSMUSG00000064842.3
3 ENSMUST00000192857.2 ENSMUSG00000102851.2
4 ENSMUST00000161581.2 ENSMUSG00000089699.2

# 从注释文件R包中获取
# 人hg19注释信息
BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
# 人hg38注释信息
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
# 小鼠mm9注释信息
BiocManager::install("TxDb.Mmusculus.UCSC.mm9.knownGene")
# 小鼠mm10注释信息
BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
##########################################################
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
keytypes(txdb)
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
head(tx2gene,4)

4使用 tximport^[3] 导入

tximport 包能导入以下软件的定量结果：

Salmon (Patro et al. 2017)
Alevin (Srivastava et al. 2019)
Sailfish (Patro, Mount, and Kingsford 2014)
kallisto (Bray et al. 2016)
RSEM (Li and Dewey 2011)
StringTie (Pertea et al. 2015)

具体可参考 tximport 的 bioconductor 帮助文档。

# 读取定量结果,保存为一个list
txi <- tximport(file_dir, type = "salmon", tx2gene = tx2gene)
reading in files with read_tsv
1 2 3 4
summarizing abundance
summarizing counts
summarizing length

# 查看list
names(txi)
[1] "abundance"   "counts"    "length"    "countsFromAbundance"
# 查看abundance数据，代表tpm值
head(txi$abundance,3)
                      test1_rep1 test1_rep2 test2_rep1 test2_rep2
ENSMUSG00000000001.5    62.21182   55.25616   53.45963   60.46954
ENSMUSG00000000003.16    0.00000    0.00000    0.00000    0.00000
ENSMUSG00000000028.16   52.10330   39.86649   35.43731   37.04251
# 查看counts数据
head(txi$counts,3)
                      test1_rep1 test1_rep2 test2_rep1 test2_rep2
ENSMUSG00000000001.5    2695.758   2605.763   2311.899   2838.876
ENSMUSG00000000003.16      0.000      0.000      0.000      0.000
ENSMUSG00000000028.16   1332.969   1139.000    919.746   1047.000
# 查看有效基因长度
head(txi$length,3)
                      test1_rep1 test1_rep2 test2_rep1 test2_rep2
ENSMUSG00000000001.5    2962.254   2998.074   2957.301   2981.350
ENSMUSG00000000003.16    512.772    512.772    512.772    512.772
ENSMUSG00000000028.16   1748.917   1816.368   1774.842   1794.938

Part2差异分析

后续可接三款主流的差异分析软件，下面分别介绍：

1、DESeq2 差异分析代码

# 加载包
library(DESeq2)
# 构建样本信息表
sampleTable <- data.frame(condition = factor(rep(c("control", "treat"), each = 2)))
rownames(sampleTable) <- colnames(txi$counts)

# 使用DESeqDataSetFromTximport函数传入给DESeq2
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
using counts and average transcript lengths from tximport

# 差异分析
dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
res <- results(dds)
# 把res转为data.frame格式
diff_res <- as.data.frame(res)
# 查看差异结果
head(diff_res,3)
                      baseMean log2FoldChange     lfcSE       stat    pvalue      padj
ENSMUSG00000000001.5  2610.399    -0.02393157 0.2183073 -0.1096233 0.9127081 0.9673216
ENSMUSG00000000003.16    0.000             NA        NA         NA        NA        NA
ENSMUSG00000000028.16 1110.566    -0.32164574 0.2437125 -1.3197754 0.1869100 0.4593252

# 导出保存文件
diff_res$gene_id <- rownames(diff_res)
write.table(diff_res,file = 'DESeq2_results.csv',sep = ',',row.names = FALSE)

2、edgeR 差异分析代码

# 加载包
library(edgeR)
# 取出counts
cts <- txi$counts
# 取出基因有效长度
normMat <- txi$length

# 得到每个基因长度的缩放因子
normMat <- normMat/exp(rowMeans(log(normMat)))
normCts <- cts/normMat
# 从缩放的counts来计算有效文库大小，避免样本间的组成偏差
eff.lib <- calcNormFactors(normCts) * colSums(normCts)

# Combining effective library sizes with the length factors, and calculating offsets for a log-link GLM.
# 结合有效文库大小和长度因子，用广义线性模型（GLM)计算偏移量
normMat <- sweep(normMat, 2, eff.lib, "*")
normMat <- log(normMat)

# 构建 DGEList object
y <- DGEList(cts)
y <- scaleOffset(y, normMat)
# 自动过滤低表达基因
keep <- filterByExpr(y)
y <- y[keep, ]

# 设计实验
sampleTable <- data.frame(condition = factor(rep(c("control", "treat"), each = 2)))
rownames(sampleTable) <- colnames(txi$counts)

design_full <- model.matrix(~condition, data = sampleTable)
           (Intercept) conditiontreat
test1_rep1           1              0
test1_rep2           1              0
test2_rep1           1              1
test2_rep2           1              1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"
# 估计离散度
y <- estimateDisp(y, design_full)
# 拟合模型
fit <- glmFit(y, design_full)
# 统计检验
lrt <- glmLRT(fit)
# 提取差异结果
tt <- topTags(lrt, n=nrow(y), sort="none")[[1]]
# 查看差异结果
head(tt,3)
                            logFC   logCPM         LR      PValue        FDR
ENSMUSG00000000001.5  -0.03700919 7.126297 0.03409904 0.853496315 0.93655267
ENSMUSG00000000028.16 -0.33464260 5.895729 2.18261856 0.139576806 0.36365238
ENSMUSG00000000031.17  2.67117263 3.563905 9.90557402 0.001647789 0.01718821

# 输出保存差异结果
tt$gene_id <- rownames(tt)
write.table(tt,file = 'edgeR_results.csv',sep = ',',row.names = FALSE)

3、limma-voom 差异分析代码

limma 包最早开发用来分析微整列芯片的表达数据（microarray），后来又开发了 voom 功能来针对 RNA-seq 数据进行差异分析。
因为 lima-voom 不使用存储在y$offset中的偏移矩阵，所以建议使用从丰度生成的缩放计数，scaledTPM或lengthScaledTPM。

# 加载R包
library(limma)
# 重新导入数据
txi <- tximport(file_dir, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
reading in files with read_tsv
1 2 3 4
summarizing abundance
summarizing counts
summarizing length

# 构建 DGEList object
y <- DGEList(txi$counts)
# 自动过滤低表达基因
keep <- filterByExpr(y)
y <- y[keep, ]
# 计算标准化因子
y <- calcNormFactors(y)

# 设计实验
sampleTable <- data.frame(condition = factor(rep(c("control", "treat"), each = 2)))
rownames(sampleTable) <- colnames(txi$counts)

design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)
# 差异分析，先拟合在统计检验
fit <- lmFit(v,design)
fit2 <- eBayes(fit)
lim_res = topTable(fit2, coef=2, n=Inf)
# 查看差异结果
head(lim_res,3)
                          logFC  AveExpr         t      P.Value  adj.P.Val        B
ENSMUSG00000031548.8  -2.488670 5.957044 -14.20956 2.321833e-06 0.01446637 5.525443
ENSMUSG00000000037.18  2.028352 6.782320  12.33597 5.956494e-06 0.01446637 4.739079
ENSMUSG00000010796.12  3.766355 4.153263  13.39382 3.446536e-06 0.01446637 4.622380

# 输出保存差异结果
lim_res$gene_id <- rownames(lim_res)
write.table(lim_res,file = 'limma_results.csv',sep = ',',row.names = FALSE)

Part3差异结果比较

用 DESeq2、edgeR 和 limma 三种差异软件鉴定到的显著性结果：

软件	DESeq2	edgeR	limma
All significant gene numbers	1918	1627	1503
Significant upregulated genes	842	641	625
Significant downregulated genes	1076	986	878

筛选标准：log2FC > 1 或 log2FC < -1 且 p < 0.05

看起来鉴定到的差异基因的数量 DESeq2 是最多的，其次是 edgeR，limma 在这里面鉴定到的是最少的。
差异基因总数和显著上调的基因数量中，edgeR 和 limma 鉴定到的比较相近，显著性下调基因的数量中，DESeq2 和 edgeR 比较相近。

1、差异基因韦恩图
在线用 vennDiagram 绘制韦恩图点击：shiny VennDiagram

从 overlap 的基因数量来看，edgeR 和 limma overlap 的更多一些，其次是 DESeq2 和 edgeR，limma 与 DESeq2 overlap 到的基因是这里面最少的。

2、差异基因火山图
在线绘制火山图点击：画个 CNS 级别火山图！

三个火山图里面都标注了一样的两个基因，可以看到在三个图里分布的趋势是一致的，log2FoldChange 的话，感觉都差不多，p 值的话，DESeq2 和 edgeR 比较相近，而整体 limma 结果的 p 值都比前两者要大很多，可能因为统计检验的方法不一样吧。

参考资料

[1]

Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification: http://www.bioconductor.org/packages/devel/workflows/vignettes/rnaseqDTU/inst/doc/rnaseqDTU.html#dge-analysis-with-deseq2

[2]

https://f1000research.com/articles/7-952/v3: https://f1000research.com/articles/7-952/v3

[3]

tximport: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.htm

欢迎小伙伴留言评论！

今天的分享就到这里了，敬请期待下一篇！

最后欢迎大家分享转发，您的点赞是对我的鼓励和肯定！

如果觉得对您帮助很大，打赏一下吧！

震撼！上海开放Google、Meta等国际平台，中国数字化迈出历史性一步！

战争烈度再升级，特朗普加征10%对华关税

特朗普称计划对中国商品征收25%的关税，外交部回应

从地方目标看全国！今年经济增长还是有很高要求

二姐李思林早知道李玟有轻生的念头，居然没一点防备措施！

RNA-seq：Salmon 定量结果差异分析

Part1RNA-seq：Salmon 定量结果差异分析

1安装并加载 R 包

2构造输入文件

3准备转录本 id 和基因 id 对应关系文件

4使用 tximport^[3] 导入

Part2差异分析

Part3差异结果比较

用 DESeq2、edgeR 和 limma 三种差异软件鉴定到的显著性结果：

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣

震撼！上海开放Google、Meta等国际平台，中国数字化迈出历史性一步！

战争烈度再升级，特朗普加征10%对华关税

特朗普称计划对中国商品征收25%的关税，外交部回应

从地方目标看全国！今年经济增长还是有很高要求

二姐李思林早知道李玟有轻生的念头，居然没一点防备措施！

生成图片，分享到微信朋友圈

RNA-seq：Salmon 定量结果差异分析

Part1RNA-seq：Salmon 定量结果差异分析

1安装并加载 R 包

2构造输入文件

3准备转录本 id 和基因 id 对应关系文件

4使用 tximport[3] 导入

Part2差异分析

Part3差异结果比较

用 DESeq2、edgeR 和 limma 三种差异软件鉴定到的显著性结果：

今天的分享就到这里了，敬请期待下一篇！

您可能也对以下帖子感兴趣

4使用 tximport^[3] 导入