查看原文
其他

单细胞转录组3大R包之scater

jimmy 生信技能树 2022-06-06

scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:

  • Automated computation of QC metrics

  • Transcript quantification from read data with pseudo-alignment

  • Data format standardisation

  • Rich visualizations for exploratory analysis

  • Seamless integration into the Bioconductor universe

  • Simple normalisation methods

R包工作流程图

S4对象

主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:

  • exprs, a numeric matrix of expression values, where rows are features, and columns are cells

  • phenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)

  • featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.

主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。

最新的文档如下:

HTMLR ScriptAn introduction to the scater package
HTMLR ScriptData visualisation methods in scater
HTMLR ScriptExpression quantification and import
HTMLR ScriptQuality control with scater
HTMLR ScriptTransition from SCESet to SingleCellExperiment
PDF

其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/

测试数据

suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

example_sce <- SingleCellExperiment(
   assays = list(counts = sc_example_counts),
   colData = sc_example_cell_info
)

exprs(example_sce) <- log2(
   calculateCPM(example_sce, use.size.factors = FALSE) + 1
)

keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sce <- calculateQCMetrics(example_sce,
                                 feature_controls = list(eg = 1:40))

#scater_gui(example_sce)

但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:

  • plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.

  • plotQC: various methods are available for producing QC diagnostic plots.

  • plotPCA: produce a principal components plot for the cells.

  • plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.

  • plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.

  • plotMDS: produce a multi-dimensional scaling plot for the cells.

  • plotReducedDim: plot a reduced-dimension representation of the cells.

  • plotExpression: plot expression levels for a defined set of features.

  • plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.

  • plotColData: plot cell metadata and QC metrics.

  • plotRowData: plot feature metadata and QC metrics.

可以充分的探索自己的数据,随便看一个可视化函数的结果:

## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
              x = "Mutation_Status", exprs_values = "exprs",
              colour = "Treatment")

详细的QC

做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。

library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))

## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")

## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),
                                   colData = sc_example_cell_info,
                                   rowData = gene_df)

example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
   calculateCPM(example_sce, use.size.factors = FALSE) + 1)

## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40))


colnames(colData(example_sceset))
##  [1] "Cell"                                  
##  [2] "Mutation_Status"                      
##  [3] "Cell_Cycle"                            
##  [4] "Treatment"                            
##  [5] "total_features"                        
##  [6] "log10_total_features"                  
##  [7] "total_counts"                          
##  [8] "log10_total_counts"                    
##  [9] "pct_counts_top_50_features"            
## [10] "pct_counts_top_100_features"          
## [11] "pct_counts_top_200_features"          
## [12] "pct_counts_top_500_features"          
## [13] "total_features_endogenous"            
## [14] "log10_total_features_endogenous"      
## [15] "total_counts_endogenous"              
## [16] "log10_total_counts_endogenous"        
## [17] "pct_counts_endogenous"                
## [18] "pct_counts_top_50_features_endogenous"
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"        
## [23] "log10_total_features_feature_control"  
## [24] "total_counts_feature_control"          
## [25] "log10_total_counts_feature_control"    
## [26] "pct_counts_feature_control"            
## [27] "total_features_eg"                    
## [28] "log10_total_features_eg"              
## [29] "total_counts_eg"                      
## [30] "log10_total_counts_eg"                
## [31] "pct_counts_eg"                        
## [32] "is_cell_control"
colnames(rowData(example_sceset))
##  [1] "Gene"                  "is_feature_control"  
##  [3] "is_feature_control_eg" "mean_counts"          
##  [5] "log10_mean_counts"     "rank_counts"          
##  [7] "n_cells_counts"        "pct_dropout_counts"  
##  [9] "total_counts"          "log10_total_counts"

首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况

  • total_counts: total number of counts for the cell (aka ‘library size’)

  • log10_total_counts: total_counts on the log10-scale

  • total_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)

  • filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?

  • filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?

  • counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.

  • counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.

  • log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.

  • log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.

  • n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况

  • mean_exprs: the mean expression level of the gene/feature.

  • exprs_rank: the rank of the feature’s expression level in the cell.

  • total_feature_counts: the total number of counts mapped to that feature across all cells.

  • log10_total_feature_counts: total feature counts on the log10-scale.

  • pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.

  • is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.

  • n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

scater一站式过滤低质量样本

scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。

默认的筛选条件如下:

  • pct_counts_top100features

  • total_features

  • pct_counts_feature_controls

  • n_detected_feature_controls

  • log10_counts_endogenous_features

  • log10_counts_feature_controls

一站式QC函数如下:

dat_pca <- scater::plotPCA(dat_qc,
                 size_by = "total_features",
                 shape_by = "use",
                 pca_data_input = "pdata",
                 detect_outliers = TRUE,
                 return_SCESet = TRUE)

还有更详细的教程,需要看

  • https://www.bioconductor.org/help/workflows/simpleSingleCell/

  • http://hemberg-lab.github.io/scRNA.seq.course/index.html

sessionInfo()


过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。


您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存