查看原文
其他

答读者问(八)为什么Read10X也会报错?

BIOMAMBA Biomamba 生信基地 2023-06-15


往期回顾




答读者问 (一)单基因究竟能否进行GSEA

答读者问 (二)为什么我的PCA分析报错了?

答读者问 (三)单细胞测序前景

答读者问 (四)如何分析细胞亚群

答读者问 (五)如何实现各物种基因的ID/symbol的转换

答读者问 (六)Seurat中如何让细胞听你指挥

答读者问 (七)有人问我Biomamba何解




问题




最近有读者表示看了我的这一讲(手把手教你做单细胞测序数据分析(二)——各类输入文件读取)以后并不能读入自己下载的10X数据(GSE176078),读取时遇到了以下的错误。显然我是不能背这个锅的,让我们来找找问题在哪,并给出解决办法。




怎么解决问题




首先,我反复强调过,遇到报错不要慌,要先看看R反馈给你什么错误,这里Error明显发生在feature文件中。在数据读取那一讲中,我们是把barcodes.tsv、feature.tsv(或gene.tsv)、matrix.tsv都用excel打开给大家看过数据结构,正常的是下图左的形式,而本次用到的数据打开是下图右的形式,这可能是版本因素造成的。所以问题很简单,正常的feature文件拥有两列,一列是ensemble ID,一列是基因的symbol(这个我们当时也给大家展示过,从2分钟开始看:https://www.bilibili.com/video/BV1S44y1b76Z?p=3),而这里少了一列。最简单的方法就是复制粘贴,把这个一列的文件变两列


                                       



改成两列后重新运行,就可以成功读入了。





那如果想以修改代码的方式解决这个问题呢?其实也很简单,我们来看一下Read10X()函数的源码,虽然这个代码很长,大概一百多行,但是在参数栏中已经指出了gene.column为2,那么我们只需要将这个参数修改为1,即可读入GSE176078中的数据。不过注意,修改过的函数可能无法调用Seurat的依赖包,所以需要额外加载,按我下面的代码操作即可。


更改后的代码:
library(Seurat)library(Matrix)myread10x <- function (data.dir, gene.column = 1, cell.column = 1, unique.features = TRUE, strip.suffix = FALSE) { full.data <- list() for (i in seq_along(along.with = data.dir)) { run <- data.dir[i] if (!dir.exists(paths = run)) { stop("Directory provided does not exist") } barcode.loc <- file.path(run, "barcodes.tsv") gene.loc <- file.path(run, "genes.tsv") features.loc <- file.path(run, "features.tsv.gz") matrix.loc <- file.path(run, "matrix.mtx") pre_ver_3 <- file.exists(gene.loc) if (!pre_ver_3) { addgz <- function(s) { return(paste0(s, ".gz")) } barcode.loc <- addgz(s = barcode.loc) matrix.loc <- addgz(s = matrix.loc) } if (!file.exists(barcode.loc)) { stop("Barcode file missing. Expecting ", basename(path = barcode.loc)) } if (!pre_ver_3 && !file.exists(features.loc)) { stop("Gene name or features file missing. Expecting ", basename(path = features.loc)) } if (!file.exists(matrix.loc)) { stop("Expression matrix file missing. Expecting ", basename(path = matrix.loc)) } data <- readMM(file = matrix.loc) cell.barcodes <- read.table(file = barcode.loc, header = FALSE, sep = "\t", row.names = NULL) if (ncol(x = cell.barcodes) > 1) { cell.names <- cell.barcodes[, cell.column] } else { cell.names <- readLines(con = barcode.loc) } if (all(grepl(pattern = "\\-1$", x = cell.names)) & strip.suffix) { cell.names <- as.vector(x = as.character(x = sapply(X = cell.names, FUN = ExtractField, field = 1, delim = "-"))) } if (is.null(x = names(x = data.dir))) { if (length(x = data.dir) < 2) { colnames(x = data) <- cell.names } else { colnames(x = data) <- paste0(i, "_", cell.names) } } else { colnames(x = data) <- paste0(names(x = data.dir)[i], "_", cell.names) } feature.names <- read.delim(file = ifelse(test = pre_ver_3, yes = gene.loc, no = features.loc), header = FALSE, stringsAsFactors = FALSE) if (any(is.na(x = feature.names[, gene.column]))) { warning("Some features names are NA. Replacing NA names with ID from the opposite column requested", call. = FALSE, immediate. = TRUE) na.features <- which(x = is.na(x = feature.names[, gene.column])) replacement.column <- ifelse(test = gene.column == 2, yes = 1, no = 2) feature.names[na.features, gene.column] <- feature.names[na.features, replacement.column] } if (unique.features) { fcols = ncol(x = feature.names) if (fcols < gene.column) { stop(paste0("gene.column was set to ", gene.column, " but feature.tsv.gz (or genes.tsv) only has ", fcols, " columns.", " Try setting the gene.column argument to a value <= to ", fcols, ".")) } rownames(x = data) <- make.unique(names = feature.names[, gene.column]) } if (ncol(x = feature.names) > 2) { data_types <- factor(x = feature.names$V3) lvls <- levels(x = data_types) if (length(x = lvls) > 1 && length(x = full.data) == 0) { message("10X data contains more than one type and is being returned as a list containing matrices of each type.") } expr_name <- "Gene Expression" if (expr_name %in% lvls) { lvls <- c(expr_name, lvls[-which(x = lvls == expr_name)]) } data <- lapply(X = lvls, FUN = function(l) { return(data[data_types == l, , drop = FALSE]) }) names(x = data) <- lvls } else { data <- list(data) } full.data[[length(x = full.data) + 1]] <- data } list_of_data <- list() for (j in 1:length(x = full.data[[1]])) { list_of_data[[j]] <- do.call(cbind, lapply(X = full.data, FUN = `[[`, j)) list_of_data[[j]] <- as(object = list_of_data[[j]], Class = "dgCMatrix") } names(x = list_of_data) <- names(x = full.data[[1]]) if (length(x = list_of_data) == 1) { return(list_of_data[[1]]) } else { return(list_of_data) }}
myread10xcount <- myread10x('GSE176078_Wu_etal_2021_BRCA_scRNASeq/onecols/')


所以呢,如果大家按照我的教程走,出了错,那肯定是你没有完全跟着我的教程走。遇到报错,核心问题无法是“代码敲错”、“数据格式不对”、“版本不一致”。我的教程还是比较详细的,几乎涵盖所有常见问题,自己细心的跟着我的视频核对一下其实就能发现问题所在。当然,实在找不到问题在哪,也别自己硬扛着浪费时间,欢迎带着你的问题来找我。


如何联系我们


最近发现后台中有一些消息我没能及时看到并答复,微信后台中超过48h后便不允许回复读者消息,这里还是再给大家留一下答疑的扣扣号,方便大家随时交流:1913507043。微信号可以点击喜欢作者后自动回复里有。欢迎大家向我咨询或者提供建议。大家可以阅读完这几篇之后添加我:如何搜索公众号过往发布内容
答疑公约
笑一笑也就算了






您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存