不知道10x单细胞转录组样品和fastq文件的对应关系
我们在单细胞天地多次分享过cellranger流程的笔记,大家可以自行前往学习,如下:
单细胞实战(一)数据下载 单细胞实战(二) cell ranger使用前注意事项 单细胞实战(三) Cell Ranger使用初探 单细胞实战(四) Cell Ranger流程概览 单细胞实战(五) 理解cellranger count的结果
因为这个流程其实是需要10X单细胞转录组的fastq文件,而且呢,命名是有规则的!
如果你的样品被分散到了多个library、flowcell,就会出现一个样品有84个fastq文件的情况,恰好我看到了一个文献里面的数据就是这样的情况。该研究于2021年3月发表在《Nature Communications》杂志的文章, 标题是:《Time-resolved single-cell analysis of Brca1 associated mammary tumourigenesis reveals aberrant differentiation of luminal progenitors》,链接是:https://www.nature.com/articles/s41467-021-21783-3
如下所示的一个样品,是SIGAA11, 足足有84个fastq文件,如果你仔细观察这84个fastq文件的名字,就会发现规律,如果以下划线为分隔符,那么
第2列是S37到S40这4种情况 第3列是L003到L009这7种情况 第五列是R1,R1,I1这样的3种情况
总共就是 4x7x3=84个fastq文件。
当然了,并不是每个10X样品都有84个fastq文件哈。甚至绝大多数情况下,就3个文件,或者两个文件,都是可以跑我们前面分享过cellranger流程。
接下来我要介绍的一种特殊情况是,有44个fastq文件,但是却没办法对应到10x的样品:
SRR15860129 week 11 MMTV-PyMT Week11-1
SRR15860128 week 11 MMTV-PyMT Week11-2
SRR15860127 week 11 MMTV-PyMT Week11-3
SRR15860126 week 11 MMTV-PyMT Week11-4
SRR15860125 week 17 MMTV-PyMT Week17-1
SRR15860124 week 17 MMTV-PyMT Week17-2
SRR15860123 week 17 MMTV-PyMT Week17-3
SRR15860122 week 17 MMTV-PyMT Week17-4
SRR15860120 week 17 MMTV-PyMT Week17-5
SRR15860119 week 17 MMTV-PyMT Week17-6
SRR15860118 week 17 MMTV-PyMT Week17-7
SRR15860117 week 17 MMTV-PyMT Week17-8
SRR15860155 week 7 MMTV-PyMT Week7-1
SRR15860154 week 7 MMTV-PyMT Week7-2
SRR15860143 week 7 MMTV-PyMT Week7-3
SRR15860132 week 7 MMTV-PyMT Week7-4
SRR15860121 week 9 MMTV-PyMT Week9-1
SRR15860150 week 9 MMTV-PyMT Week9-10
SRR15860149 week 9 MMTV-PyMT Week9-11
SRR15860148 week 9 MMTV-PyMT Week9-12
SRR15860147 week 9 MMTV-PyMT Week9-13
SRR15860146 week 9 MMTV-PyMT Week9-14
SRR15860145 week 9 MMTV-PyMT Week9-15
SRR15860144 week 9 MMTV-PyMT Week9-16
SRR15860142 week 9 MMTV-PyMT Week9-17
SRR15860141 week 9 MMTV-PyMT Week9-18
SRR15860140 week 9 MMTV-PyMT Week9-19
SRR15860116 week 9 MMTV-PyMT Week9-2
SRR15860139 week 9 MMTV-PyMT Week9-20
SRR15860138 week 9 MMTV-PyMT Week9-21
SRR15860137 week 9 MMTV-PyMT Week9-22
SRR15860136 week 9 MMTV-PyMT Week9-23
SRR15860135 week 9 MMTV-PyMT Week9-24
SRR15860134 week 9 MMTV-PyMT Week9-25
SRR15860133 week 9 MMTV-PyMT Week9-26
SRR15860131 week 9 MMTV-PyMT Week9-27
SRR15860130 week 9 MMTV-PyMT Week9-28
SRR15860115 week 9 MMTV-PyMT Week9-3
SRR15860114 week 9 MMTV-PyMT Week9-4
SRR15860113 week 9 MMTV-PyMT Week9-5
SRR15860112 week 9 MMTV-PyMT Week9-6
SRR15860153 week 9 MMTV-PyMT Week9-7
SRR15860152 week 9 MMTV-PyMT Week9-8
SRR15860151 week 9 MMTV-PyMT Week9-9
可以看到,如果是按照第二列,这44个fastq文件,应该是属于4个10x样品,所以跑四次cellranger流程即可,但是如果你这样改名,去运行cellranger流程,就会报错。
为了解决这个问题,我首先对这44个fastq文件各自独立跑cellranger流程,得到的结果批量读取,代码如下所示:
rm(list=ls())
library(data.table)
dir='/home/data/jmzeng/scRNA/mice-4-stage/tmp/matrix'
# 这个文件夹下面是44个fastq文件各自独立跑cellranger流程的矩阵文件夹
samples=list.files( dir )
samples
sceList = lapply(samples,function(pro){
# pro=samples[1]
folder=file.path( dir ,pro)
print(pro)
print(folder)
print(list.files(folder))
sce=CreateSeuratObject(counts = Read10X(folder),
project = pro ,
min.cells = 5,
min.features = 300)
return(sce)
})
names(sceList)
如果这44个fastq文件属于不同的10x样品,它们独立走cellranger流程的时候的细胞barcodes理论上应该是会有交集,所以我做了一个统计。
out = do.call(rbind,
lapply(1:length(sceList), function(i){
do.call(rbind,
lapply(1:length(sceList), function(j){
sample1 <- as.data.frame(sceList[[i]]@meta.data)
sample1_barcodes <- as.numeric(length(sample1$orig.ident))
sample2 <- as.data.frame(sceList[[j]]@meta.data)
sample2_barcodes <- as.numeric(length(sample2$orig.ident))
both <- intersect(rownames(sample1),rownames(sample2))
both_barcodes <- as.numeric(length(both))
rate <- as.numeric(both_barcodes/(sample1_barcodes+sample2_barcodes))
c(sample1_name=as.character(unique(sample1$orig.ident)),
sample1_barcodes=sample1_barcodes,
sample2_name=as.character(unique(sample2$orig.ident)),
sample2_barcodes=sample2_barcodes,
rate=rate)
}))
}))
save(out,file = 'out.Rdata')
对这44个fastq文件的单细胞矩阵进行组合比较:
> head(out)
sample1_name sample1_barcodes sample2_name sample2_barcodes rate
[1,] "SRR15860112" "2197" "SRR15860112" "2197" "0.5"
[2,] "SRR15860112" "2197" "SRR15860113" "1923" "0.465776699029126"
[3,] "SRR15860112" "2197" "SRR15860114" "2746" "0.00101153145862836"
[4,] "SRR15860112" "2197" "SRR15860115" "2768" "0.00100704934541793"
[5,] "SRR15860112" "2197" "SRR15860116" "2811" "0.000998402555910543"
[6,] "SRR15860112" "2197" "SRR15860117" "5100" "0.00315198026586268"
如果两个结果配对的细胞barcodes重合度很高,就说明是同一个样品。如下所示,可以看到SRR15860112和SRR15860113就是同一个样品。
library(tidyr)
outs <- as.data.frame(out[,-c(2,4)] )
library(reshape2)
out_wide <- dcast(outs,sample1_name~sample2_name)
name=out_wide[,1]
out_wide=out_wide[,-1]
out_wide <- apply(out_wide,2,as.numeric)
rownames(out_wide) <- name
library(pheatmap)
如下所示:
可以看到我们的44个fastq文件,应该是属于7个样品,所以我输出如下所示的文件列表:
head(tmp)
srr hc age
SRR15860126 SRR15860126 5 week 11
SRR15860127 SRR15860127 5 week 11
SRR15860128 SRR15860128 5 week 11
SRR15860129 SRR15860129 5 week 11
SRR15860117 SRR15860117 3 week 17
SRR15860118 SRR15860118 3 week 17
然后去改名即可:
1_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860146_1.fastq.gz
1_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860146_2.fastq.gz
1_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860147_1.fastq.gz
1_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860147_2.fastq.gz
1_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860148_1.fastq.gz
1_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860148_2.fastq.gz
1_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860149_1.fastq.gz
1_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860149_2.fastq.gz
1_S1_L005_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860150_1.fastq.gz
1_S1_L005_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860150_2.fastq.gz
1_S1_L006_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860151_1.fastq.gz
1_S1_L006_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860151_2.fastq.gz
1_S1_L007_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860152_1.fastq.gz
1_S1_L007_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860152_2.fastq.gz
1_S1_L008_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860153_1.fastq.gz
1_S1_L008_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860153_2.fastq.gz
1_S2_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860112_1.fastq.gz
1_S2_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860112_2.fastq.gz
1_S2_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860113_1.fastq.gz
1_S2_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860113_2.fastq.gz
1_S2_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860144_1.fastq.gz
1_S2_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860144_2.fastq.gz
1_S2_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860145_1.fastq.gz
1_S2_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860145_2.fastq.gz
2_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860114_1.fastq.gz
2_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860114_2.fastq.gz
2_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860115_1.fastq.gz
2_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860115_2.fastq.gz
2_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860116_1.fastq.gz
2_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860116_2.fastq.gz
2_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860121_1.fastq.gz
2_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860121_2.fastq.gz
3_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860117_1.fastq.gz
3_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860117_2.fastq.gz
3_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860118_1.fastq.gz
3_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860118_2.fastq.gz
3_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860119_1.fastq.gz
3_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860119_2.fastq.gz
3_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860120_1.fastq.gz
3_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860120_2.fastq.gz
4_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860122_1.fastq.gz
4_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860122_2.fastq.gz
4_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860123_1.fastq.gz
4_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860123_2.fastq.gz
4_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860124_1.fastq.gz
4_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860124_2.fastq.gz
4_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860125_1.fastq.gz
4_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860125_2.fastq.gz
5_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860126_1.fastq.gz
5_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860126_2.fastq.gz
5_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860127_1.fastq.gz
5_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860127_2.fastq.gz
5_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860128_1.fastq.gz
5_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860128_2.fastq.gz
5_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860129_1.fastq.gz
5_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860129_2.fastq.gz
6_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860130_1.fastq.gz
6_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860130_2.fastq.gz
6_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860131_1.fastq.gz
6_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860131_2.fastq.gz
6_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860133_1.fastq.gz
6_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860133_2.fastq.gz
6_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860134_1.fastq.gz
6_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860134_2.fastq.gz
6_S1_L005_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860135_1.fastq.gz
6_S1_L005_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860135_2.fastq.gz
6_S1_L006_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860136_1.fastq.gz
6_S1_L006_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860136_2.fastq.gz
6_S1_L007_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860137_1.fastq.gz
6_S1_L007_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860137_2.fastq.gz
6_S1_L008_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860138_1.fastq.gz
6_S1_L008_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860138_2.fastq.gz
6_S2_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860139_1.fastq.gz
6_S2_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860139_2.fastq.gz
6_S2_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860140_1.fastq.gz
6_S2_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860140_2.fastq.gz
6_S2_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860141_1.fastq.gz
6_S2_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860141_2.fastq.gz
6_S2_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860142_1.fastq.gz
6_S2_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860142_2.fastq.gz
7_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860132_1.fastq.gz
7_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860132_2.fastq.gz
7_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860143_1.fastq.gz
7_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860143_2.fastq.gz
7_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860154_1.fastq.gz
7_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860154_2.fastq.gz
7_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860155_1.fastq.gz
7_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860155_2.fastq.gz
然后,这样的文件,就可以跑cellranger流程流程啦。
最后得到的降维聚类分群和生物学命名,如下所示:
如果是单细胞常规分析可以参考前面的例子:人人都能学会的单细胞聚类分群注释 ,我们演示了第一层次的分群。如果你对单细胞数据分析还没有基础认知,可以看基础10讲:
文末友情宣传
强烈建议你推荐给身边的博士后以及年轻生物学PI,多一点数据认知,让他们的科研上一个台阶:
数据挖掘(GEO,TCGA,单细胞)2022年3月场,快速了解一些生物信息学应用图表 生信入门课-2022年3月场,你的生物信息学第一课