其他
答读者问(八)为什么Read10X也会报错?
往期回顾
问题
最近有读者表示看了我的这一讲(手把手教你做单细胞测序数据分析(二)——各类输入文件读取)以后并不能读入自己下载的10X数据(GSE176078),读取时遇到了以下的错误。显然我是不能背这个锅的,让我们来找找问题在哪,并给出解决办法。
怎么解决问题
首先,我反复强调过,遇到报错不要慌,要先看看R反馈给你什么错误,这里Error明显发生在feature文件中。在数据读取那一讲中,我们是把barcodes.tsv、feature.tsv(或gene.tsv)、matrix.tsv都用excel打开给大家看过数据结构,正常的是下图左的形式,而本次用到的数据打开是下图右的形式,这可能是版本因素造成的。所以问题很简单,正常的feature文件拥有两列,一列是ensemble ID,一列是基因的symbol(这个我们当时也给大家展示过,从2分钟开始看:https://www.bilibili.com/video/BV1S44y1b76Z?p=3),而这里少了一列。最简单的方法就是复制粘贴,把这个一列的文件变两列
改成两列后重新运行,就可以成功读入了。
那如果想以修改代码的方式解决这个问题呢?其实也很简单,我们来看一下Read10X()函数的源码,虽然这个代码很长,大概一百多行,但是在参数栏中已经指出了gene.column为2,那么我们只需要将这个参数修改为1,即可读入GSE176078中的数据。不过注意,修改过的函数可能无法调用Seurat的依赖包,所以需要额外加载,按我下面的代码操作即可。
更改后的代码:
library(Seurat)
library(Matrix)
myread10x <- function (data.dir, gene.column = 1, cell.column = 1, unique.features = TRUE,
strip.suffix = FALSE)
{
full.data <- list()
for (i in seq_along(along.with = data.dir)) {
run <- data.dir[i]
if (!dir.exists(paths = run)) {
stop("Directory provided does not exist")
}
barcode.loc <- file.path(run, "barcodes.tsv")
gene.loc <- file.path(run, "genes.tsv")
features.loc <- file.path(run, "features.tsv.gz")
matrix.loc <- file.path(run, "matrix.mtx")
pre_ver_3 <- file.exists(gene.loc)
if (!pre_ver_3) {
addgz <- function(s) {
return(paste0(s, ".gz"))
}
barcode.loc <- addgz(s = barcode.loc)
matrix.loc <- addgz(s = matrix.loc)
}
if (!file.exists(barcode.loc)) {
stop("Barcode file missing. Expecting ", basename(path = barcode.loc))
}
if (!pre_ver_3 && !file.exists(features.loc)) {
stop("Gene name or features file missing. Expecting ",
basename(path = features.loc))
}
if (!file.exists(matrix.loc)) {
stop("Expression matrix file missing. Expecting ",
basename(path = matrix.loc))
}
data <- readMM(file = matrix.loc)
cell.barcodes <- read.table(file = barcode.loc, header = FALSE,
sep = "\t", row.names = NULL)
if (ncol(x = cell.barcodes) > 1) {
cell.names <- cell.barcodes[, cell.column]
}
else {
cell.names <- readLines(con = barcode.loc)
}
if (all(grepl(pattern = "\\-1$", x = cell.names)) &
strip.suffix) {
cell.names <- as.vector(x = as.character(x = sapply(X = cell.names,
FUN = ExtractField, field = 1, delim = "-")))
}
if (is.null(x = names(x = data.dir))) {
if (length(x = data.dir) < 2) {
colnames(x = data) <- cell.names
}
else {
colnames(x = data) <- paste0(i, "_", cell.names)
}
}
else {
colnames(x = data) <- paste0(names(x = data.dir)[i],
"_", cell.names)
}
feature.names <- read.delim(file = ifelse(test = pre_ver_3,
yes = gene.loc, no = features.loc), header = FALSE,
stringsAsFactors = FALSE)
if (any(is.na(x = feature.names[, gene.column]))) {
warning("Some features names are NA. Replacing NA names with ID from the opposite column requested",
call. = FALSE, immediate. = TRUE)
na.features <- which(x = is.na(x = feature.names[,
gene.column]))
replacement.column <- ifelse(test = gene.column ==
2, yes = 1, no = 2)
feature.names[na.features, gene.column] <- feature.names[na.features,
replacement.column]
}
if (unique.features) {
fcols = ncol(x = feature.names)
if (fcols < gene.column) {
stop(paste0("gene.column was set to ", gene.column,
" but feature.tsv.gz (or genes.tsv) only has ",
fcols, " columns.", " Try setting the gene.column argument to a value <= to ",
fcols, "."))
}
rownames(x = data) <- make.unique(names = feature.names[,
gene.column])
}
if (ncol(x = feature.names) > 2) {
data_types <- factor(x = feature.names$V3)
lvls <- levels(x = data_types)
if (length(x = lvls) > 1 && length(x = full.data) ==
0) {
message("10X data contains more than one type and is being returned as a list containing matrices of each type.")
}
expr_name <- "Gene Expression"
if (expr_name %in% lvls) {
lvls <- c(expr_name, lvls[-which(x = lvls ==
expr_name)])
}
data <- lapply(X = lvls, FUN = function(l) {
return(data[data_types == l, , drop = FALSE])
})
names(x = data) <- lvls
}
else {
data <- list(data)
}
full.data[[length(x = full.data) + 1]] <- data
}
list_of_data <- list()
for (j in 1:length(x = full.data[[1]])) {
list_of_data[[j]] <- do.call(cbind, lapply(X = full.data,
FUN = `[[`, j))
list_of_data[[j]] <- as(object = list_of_data[[j]],
Class = "dgCMatrix")
}
names(x = list_of_data) <- names(x = full.data[[1]])
if (length(x = list_of_data) == 1) {
return(list_of_data[[1]])
}
else {
return(list_of_data)
}
}
myread10xcount <- myread10x('GSE176078_Wu_etal_2021_BRCA_scRNASeq/onecols/')
所以呢,如果大家按照我的教程走,出了错,那肯定是你没有完全跟着我的教程走。遇到报错,核心问题无法是“代码敲错”、“数据格式不对”、“版本不一致”。我的教程还是比较详细的,几乎涵盖所有常见问题,自己细心的跟着我的视频核对一下其实就能发现问题所在。当然,实在找不到问题在哪,也别自己硬扛着浪费时间,欢迎带着你的问题来找我。
如何联系我们
答疑公约
笑一笑也就算了