scRNA小鼠发育Smartseq2流程—前言及上游介绍

Original 刘小泽单细胞天地 2022-06-07

收录于合集 #第二期单细胞视频笔记汇总 16个

课程笔记

粉丝：有单细胞线上课程吗？

小编：什么？我们的单细胞转录组分析线上课程已经上线好久了，你们竟然都不知道吗，每篇推文后面的课程推荐没人看的吗，小编已哭晕在厕所

好了，戏演完了，下面郑重介绍下我们的单细胞线上课程：（详情戳下方链接）

全网第二个单细胞视频课程预售

这个课程笔记栏目记录了学员们学习单细胞转录组课程的学习笔记

希望大家能有所收获！

作者 | 单细胞天地小编刘小泽

课程链接在：http://jm.grazy.cn/index/mulitcourse/detail.html?cid=55
这次会介绍文章的两张主图，还有上游分析，算是引言部分。对应视频第三单元1-4讲

前言

这次要重复的文章是：Dissecting Cell Lineage Specification and Sex Fate Determination in Gonadal Somatic Cells Using Single-Cell Transcriptomics

文章方法思路

为了研究卵巢发育过程中体细胞谱系特征，作者结合Fluidigm C1分选和smartseq2建库，在Tg(Nr5a1-GFP)转基因小鼠性腺分化的6个时期（E10.5, E11.5, E12.5, E13.5, E16.5, and post-natalday6）提取了663个细胞（GSE119766），然后使用Hiseq2000 进行双端100bp测序，每个细胞测10M reads。
经过比对、定量得到表达矩阵，然后对细胞进行过滤，最后剩下563个。进行PCA，用Jackstraw提取了最显著的几个主成分，并用FactoMineR进行HCPC 聚类，使用Rtsne 进行tSNE，然后分群，得到了4个胚胎发育时期（C1到C4），之后对标记基因进行可视化（包括等高线图和热图，使用了自己包装的大量代码）。
然后对C1-C4每个群进行差异分析（使用monocle2），差异基因进行功能注释（使用clusterProfiler和PANTHER）
根据高变化基因进行细胞谱系推断，利用diffusion map的Destiny 和Slingshot 。后来谱系发育基因进行了可视化、聚类和注释

综上，文章得到了6个发育时期、4群细胞、2个发育轨迹这三种细胞属性

重要的信息如下：

分群信息（对应课程下游分析之第5、6讲）

标记基因可视化（对应课程下游分析之第7、8讲）

差异分析及注释（对应课程下游分析之第9、10讲）

比较不同的差异分析方法，如monocle、ROST（对应课程下游分析之第11讲）

谱系推断（对应课程下游分析之第12讲）

谱系发育基因可视化（对应课程下游分析之第13讲）

谱系发育基因聚类和注释（对应课程下游分析之第14讲）

上游分析

将会对应课程的全部上游分析第1-4讲
这部分和转录组分析很相似，所以也是简单带过

smartseq2得到的文件和10X的不一样，10X的数据虽然也有R1、R2两个数据，但第一个存储的是barcode和UMI信息，而只有第二个才是真正的测序信息，也就是单端测序；而smartseq2得到的两个R1、R2都是测序信息，于是它的操作和我们常规bulk转录组是类似的。可以用hisat2+featureCounts进行操作；或者如果不想比对的话，也可以用salmon进行操作

配置conda

 1# 安装conda
 2wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
 3bash Miniconda3-latest-Linux-x86_64.sh 
 4# 激活
 5source ~/.bashrc
 6
 7# 添加镜像
 8conda config --add channels r
 9conda config --add channels conda-forge
10conda config --add channels bioconda
11conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
12conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
13conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
14conda config --set show_channel_urls yes
15
16# 创建环境
17conda  create -n rna  python=2
18conda activate rna
19conda install -y  sra-tools trim-galore fastqc multiqc hisat2 samtools subread salmon qualimap
20
21#注销当前的rna环境 
22# conda deactivate

下载数据

XX全部的数据比较大，数据在：https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA490198&o=acc_s%3Aa

因此简单下几个体验一下就好，我这里选取了这些：

 1# wkd=/YOUR_PATH/
 2# cd $wkd/sra
 3# 这些数据全部都放在了EBI，而不是NCBI
 4SRR7815790
 5SRR7815850
 6SRR7815870
 7SRR7815890
 8SRR7815910
 9SRR7815960
10SRR7815980
11SRR7816020
12SRR7816100
13SRR7816120
14SRR7816130
15SRR7816140
16SRR7816160

如果prefetch + ascp可以用的话，那是最好，因为那是最省事的下载办法，而且不用管SRA数据是来自欧洲的EBI还是美国的NCBI。如果出现了使用https下载方式，而非快速的fasp下载，可以看这个：来吧，加速你的下载

需要注意EBI和NCBI的SRA数据存储结构有些区别：

NCBI的文件存在ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/
EBI文件存在ftp://ftp.sra.ebi.ac.uk/vol1/srr
例如要下载SRR7815792.sra，它就是ftp://ftp.sra.ebi.ac.uk/vol1/srr/SRR781/002/SRR7815792，倒数第二个位置的002是和SRR ID的最后一个数字对应的，另外EBI下载的名称没有后缀.sra

sra2fq

1wkd=/YOUR_PATH/
2ls $wkd/sra/SRR* | while read i;do \
3    (fastq-dump --gzip --split-3 -A `basename $i` -O $wkd/rawdata $i &);done

下载参考数据

1# 从Gencode下载参考基因组
2wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
3# 从Gencode下载参考转录组
4wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.transcripts.fa.gz
5# 下载GTF文件
6wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz

基于比对的hisat2流程

构建索引

1ref=$wkd/reference/mm10.fasta
2hisat2-build -p 40 $ref hisat.mm10
3# 历时：00:14:39

如果不想自己构建索引，可以直接下载hisat官网数据

1wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz

比对过程

1# 以其中SRR7815790为例
2index=$wkd/hisat/hisat.mm10
3hisat2 -p 10 -x $index -1 SRR7815790_1.fastq.gz -2 SRR7815790_2.fastq.gz  --new-summary -S SRR7815790_hisat.sam
4
5samtools sort -O bam -@ 10  -o SRR7815790_hisat.bam SRR7815790_hisat.sam 
6
7samtools index -@ 10 -b SRR7815790_hisat.bam

比对结果

 1# 看到有一个细胞（SRR7816020）质量不好，后面需要去除
 2SRR7815790.hisat.log:    Overall alignment rate: 90.14%
 3SRR7815850.hisat.log:    Overall alignment rate: 87.49%
 4SRR7815870.hisat.log:    Overall alignment rate: 87.39%
 5SRR7815890.hisat.log:    Overall alignment rate: 90.08%
 6SRR7815910.hisat.log:    Overall alignment rate: 90.32%
 7SRR7815960.hisat.log:    Overall alignment rate: 91.73%
 8SRR7815980.hisat.log:    Overall alignment rate: 86.01%
 9SRR7816020.hisat.log:    Overall alignment rate: 61.82%
10SRR7816100.hisat.log:    Overall alignment rate: 88.14%
11SRR7816120.hisat.log:    Overall alignment rate: 90.21%
12SRR7816130.hisat.log:    Overall alignment rate: 87.30%
13SRR7816140.hisat.log:    Overall alignment rate: 89.59%
14SRR7816160.hisat.log:    Overall alignment rate: 84.33%
15SRR7815790.hisat.log:Execution time for SRR7815790 hisat and sam2bam : 115.350829 seconds
16SRR7815850.hisat.log:Execution time for SRR7815850 hisat and sam2bam : 149.416581 seconds
17SRR7815870.hisat.log:Execution time for SRR7815870 hisat and sam2bam : 218.976666 seconds
18SRR7815890.hisat.log:Execution time for SRR7815890 hisat and sam2bam : 168.738353 seconds
19SRR7815910.hisat.log:Execution time for SRR7815910 hisat and sam2bam : 214.020996 seconds
20SRR7815960.hisat.log:Execution time for SRR7815960 hisat and sam2bam : 169.597353 seconds
21SRR7815980.hisat.log:Execution time for SRR7815980 hisat and sam2bam : 146.707250 seconds
22SRR7816020.hisat.log:Execution time for SRR7816020 hisat and sam2bam : 3.484966 seconds
23SRR7816100.hisat.log:Execution time for SRR7816100 hisat and sam2bam : 131.300152 seconds
24SRR7816120.hisat.log:Execution time for SRR7816120 hisat and sam2bam : 133.382700 seconds
25SRR7816130.hisat.log:Execution time for SRR7816130 hisat and sam2bam : 124.014219 seconds
26SRR7816140.hisat.log:Execution time for SRR7816140 hisat and sam2bam : 166.373111 seconds
27SRR7816160.hisat.log:Execution time for SRR7816160 hisat and sam2bam : 209.587995 seconds

定量

1gtf=$wkd/reference/gencode.vM23.annotation.gtf
2featureCounts  -T 10 -p -t exon -g gene_name -a $gtf -o $wkd/hisat/count/hisat2_counts.txt $wkd/hisat/*.bam 1>featureCounts.log 2>&1

不基于比对的salmon流程

salmon需要对转录本构建索引。因此只能使用参考转录组，而不能使用基因组

构建索引

1ref=$wkd/reference/gencode.vM23.transcripts.fa
2salmon index -t $ref -i salmon.mm10

定量

1# 以其中SRR7815790为例
2index=$wkd/salmon/salmon.mm10
3salmon quant -i $index -l A --validateMappings \
4        -1 SRR7815790_1.fastq.gz -2 SRR7815790_2.fastq.gz \
5        -p 10 -o SRR7815790_quant 1>SRR7815790.salmon.log 2>&1

运行速度的确很快

 1SRR7815790.salmon.log:Execution time for SRR7815790 salmon : 36.016640 seconds
 2SRR7815850.salmon.log:Execution time for SRR7815850 salmon : 54.478129 seconds
 3SRR7815870.salmon.log:Execution time for SRR7815870 salmon : 76.499805 seconds
 4SRR7815890.salmon.log:Execution time for SRR7815890 salmon : 58.166234 seconds
 5SRR7815910.salmon.log:Execution time for SRR7815910 salmon : 56.277339 seconds
 6SRR7815960.salmon.log:Execution time for SRR7815960 salmon : 52.918936 seconds
 7SRR7815980.salmon.log:Execution time for SRR7815980 salmon : 47.993982 seconds
 8SRR7816020.salmon.log:Execution time for SRR7816020 salmon : 8.981974 seconds
 9SRR7816100.salmon.log:Execution time for SRR7816100 salmon : 40.505802 seconds
10SRR7816120.salmon.log:Execution time for SRR7816120 salmon : 52.856553 seconds
11SRR7816130.salmon.log:Execution time for SRR7816130 salmon : 38.405255 seconds
12SRR7816140.salmon.log:Execution time for SRR7816140 salmon : 53.823648 seconds
13SRR7816160.salmon.log:Execution time for SRR7816160 salmon : 75.848338 seconds

Salmon的结果和Hisat的不同，它的每个数据都是独立的文件夹，其中quant.sf 就是每个样本的定量结果

对于salmon的定量结果，需要用到R里面的`tximport` 导入

tximport函数主要需要两个参数：定量文件files和转录本与基因名的对应文件tx2gene

 1rm(list = ls())
 2options(stringsAsFactors = F)
 3
 4####################
 5# 配置files路径
 6####################
 7dir <- file.path(getwd(),'quant/')
 8dir
 9files <- list.files(pattern = '*sf',dir,recursive = T)
10files <- file.path(dir,files)
11all(file.exists(files))
12
13####################
14# 配置tx2gene
15####################
16# https://support.bioconductor.org/p/101156/
17# BiocManager::install("EnsDb.Mmusculus.v79")
18# 如何构建？
19if(F){
20  library(EnsDb.Mmusculus.v79)
21  txdf <- transcripts(EnsDb.Mmusculus.v79, return.type="DataFrame")
22  mm10_tx2gene <- as.data.frame(txdf[,c("tx_id", "gene_id")])
23  head(mm10_tx2gene)
24  write.csv(mm10_tx2gene,file = 'mm10_tx2gene.csv')
25}
26tx2_gene_file <- 'mm10_tx2gene.csv'
27tx2gene <- read.csv(tx2_gene_file,row.names = 1)
28head(tx2gene)
29
30####################
31# 开始整合
32####################
33library(tximport)
34library(readr)
35txi <- tximport(files, type = "salmon", tx2gene = tx2gene,ignoreTxVersion = T)
36names(txi)
37head(txi$length)
38head(txi$counts)
39# 发现目前counts的列名还没有指定
40
41####################
42# 添加列名
43####################
44# 获取导入文件名称的ID，如SRR7815870
45library(stringr)
46# 先得到SRRxxxx_quant这一部分
47n1 <- sapply(strsplit(files,'\\/'), function(x)x[12])
48# 再得到SRRxxxx这一部分
49n2 <- sapply(strsplit(n1,'_'), function(x)x[1])
50
51colnames(txi$counts) <- n2
52head(txi$counts)
53
54
55####################
56# 操作新得到的表达矩阵
57####################
58salmon_expr <- txi$counts
59# 表达量取整
60salmon_expr <- apply(salmon_expr, 2, as.integer)
61head(salmon_expr)
62# 添加行名
63rownames(salmon_expr) <- rownames(txi$counts)
64dim(salmon_expr)
65
66save(salmon_expr,file = 'salmon-aggr.Rdata')

好，现在有了salmon和featureCounts两种方法得到的矩阵，就可以来比较一下

 1rm(list = ls())
 2options(stringsAsFactors = F)
 3load('salmon-aggr.Rdata')
 4dim(salmon_expr)
 5colnames(salmon_expr)
 6salmon_expr[1:3,1:3]
 7
 8# 为了方便比较，我们选第一个样本SRR7815790
 9salmon_SRR7815790 <- salmon_expr[,1]
10names(salmon_SRR7815790) <- rownames(salmon_expr)
11
12###########################
13# 然后找featureCounts结果
14###########################
15ft <- read.table('../hisat/hisat2_counts.txt',row.names = 1,comment.char = '#',header = T)
16ft[1:4,1:4]
17colnames(ft)
18# 两种取子集的方法，grep返回结果的序号；grepl返回逻辑值
19ft_SRR7815790 <-  ft[,grep(colnames(salmon_expr)[1], colnames(ft))]
20ft_SRR7815790 <-  ft[,grepl(colnames(salmon_expr)[1], colnames(ft))]
21
22names(ft_SRR7815790) <- rownames(ft)
23
24###########################
25# 取salmon和featureCounts公共基因
26###########################
27# salmon使用的Ensembl基因，而featureCounts得到的是Symbol基因
28# 先进行基因名转换
29library(org.Mm.eg.db)
30columns(org.Mm.eg.db)
31gene_tr <- clusterProfiler::bitr(names(salmon_SRR7815790),
32                      "ENSEMBL","SYMBOL",
33                      org.Mm.eg.db)
34# 要找salmon在gene_tr中的对应位置，然后取gene_tr的第二列symbol，因此match中把salmon放第一个位置
35names(salmon_SRR7815790) <- gene_tr[match(names(salmon_SRR7815790),gene_tr$ENSEMBL),2]
36
37# 找共有基因
38uni_gene <- intersect(names(salmon_SRR7815790),names(ft_SRR7815790))
39length(uni_gene)
40
41plot(log1p(salmon_SRR7815790[uni_gene]),
42     log1p(ft_SRR7815790[uni_gene]))

结论就是：有些基因用不同的流程检测效果是不一样的，使用不同的参考数据得到的结果也是有区别的，但整体上一致

“

如果你对单细胞转录组研究感兴趣，但又不知道如何入门，也许你可以关注一下下面的课程

生信技能树（爆款入门培训课）全国巡讲约你

（上海见！）全国巡讲第19-20站（生信入门课加量不加价）

”

高三女生醉酒后被强奸致死？检方回应

常德悲剧：让谴责无差别杀戮之声更加响亮一点

2024【公共营养师】培训报名通道已开启，不限学历，23岁及以上可报！还能领2000补贴

【惊】"以为要写遗书"! 飞温哥华航班遇炸弹气旋华人崩溃大哭连空姐都吐了; 客机颠簸盘旋3小时

女人最偏爱的十种男人

scRNA小鼠发育Smartseq2流程—前言及上游介绍

前言

文章方法思路

重要的信息如下：

分群信息（对应课程下游分析之第5、6讲）

标记基因可视化（对应课程下游分析之第7、8讲）

差异分析及注释（对应课程下游分析之第9、10讲）

比较不同的差异分析方法，如monocle、ROST（对应课程下游分析之第11讲）

谱系推断（对应课程下游分析之第12讲）

谱系发育基因可视化（对应课程下游分析之第13讲）

谱系发育基因聚类和注释（对应课程下游分析之第14讲）

上游分析

配置conda

下载数据

sra2fq

下载参考数据

基于比对的hisat2流程

构建索引

比对过程

定量

不基于比对的salmon流程

构建索引

定量

对于salmon的定量结果，需要用到R里面的`tximport` 导入

结论就是：有些基因用不同的流程检测效果是不一样的，使用不同的参考数据得到的结果也是有区别的，但整体上一致

您可能也对以下帖子感兴趣

高三女生醉酒后被强奸致死？检方回应

常德悲剧：让谴责无差别杀戮之声更加响亮一点

2024【公共营养师】培训报名通道已开启，不限学历，23岁及以上可报！还能领2000补贴

【惊】"以为要写遗书"! 飞温哥华航班遇炸弹气旋 华人崩溃大哭 连空姐都吐了; 客机颠簸盘旋3小时

女人最偏爱的十种男人

生成图片，分享到微信朋友圈

scRNA小鼠发育Smartseq2流程—前言及上游介绍

前言

文章方法思路

重要的信息如下：

分群信息（对应课程下游分析之第5、6讲）

标记基因可视化（对应课程下游分析之第7、8讲）

差异分析及注释（对应课程下游分析之第9、10讲）

比较不同的差异分析方法，如monocle、ROST（对应课程下游分析之第11讲）

谱系推断（对应课程下游分析之第12讲）

谱系发育基因可视化（对应课程下游分析之第13讲）

谱系发育基因聚类和注释（对应课程下游分析之第14讲）

上游分析

配置conda

下载数据

sra2fq

下载参考数据

基于比对的hisat2流程

构建索引

比对过程

定量

不基于比对的salmon流程

构建索引

定量

对于salmon的定量结果，需要用到R里面的`tximport` 导入

结论就是：有些基因用不同的流程检测效果是不一样的，使用不同的参考数据得到的结果也是有区别的，但整体上一致

您可能也对以下帖子感兴趣

【惊】"以为要写遗书"! 飞温哥华航班遇炸弹气旋华人崩溃大哭连空姐都吐了; 客机颠簸盘旋3小时