PANDA姐的转录组入门(6)：reads计数

查看原文

其他

PANDA姐的转录组入门(6)：reads计数

原创 2017-07-31 沈梦圆 沈梦圆

任务

前期补充：bam文件排序及sam/bam格式
搜索实现reads计数的软件教程
用htseq-count，每个样本都会输出一个表达量文件
编写脚本合并所有的样本为表达矩阵
对表达矩阵可进行简单的探索
看看一些生物学意义特殊的基因表现如何，比如GAPDH,β-ACTIN等等

代码记录

前期补充：bam文件排序及sam/bam格式

（要透彻地理解，一定要自己动手，光看不练是假把式！）

samtools sort 默认参数
samtools view Homo_sapiens_AKAP95_KD_miR_12_293_cell_sorted.bam |less -S

看上去很类似fastq文件，它也有read名称，序列，质量等信息，但是又不完全一样。

首先，每个read只占一行，只是它被tab分成了很多列，一共有12列，分别记录了1：
read名称
SAM标记
chromosome
5′端起始位置
MAPQ（mapping quality，描述比对的质量，数字越大，特异性越高）
CIGAR字串，记录插入，删除，错配以及splice junctions（后剪切拼接的接头）
mate名称，记录mate pair信息
mate的位置
模板的长度
read序列
read质量
程序用标记
显然，其中chromosome至CIGAR的信息都是非常重要的。但是这些对我们不重要，我们只需要了解SAM/BAM文件是什么，就可以了。重要的是如果进行下游的操作。更多详情见了解sam格式比对结果

samtools sort -n --threads 30 Homo_sapiens_AKAP95_KD_miR_12_293_cell.bam -o Homo_sapiens_AKAP95_KD_miR_12_293_cell_sorted_name.bam
samtools view Homo_sapiens_AKAP95_KD_miR_12_293_cell_sorted_name.bam |less -S

# 按read name重新进行排序，后续reads count会更快，出错率低
ls --color=never *.bam|while read id;do(samtools sort -n --threads 40 $id -o ${id%.*}_sorted_name.bam);done
ls --color=never *_sorted_name.bam | while read id;do(samtools index $id);done
# 不能根据reads name 排序的进行index
# [E::hts_idx_push] unsorted positions
# samtools index: "Homo_sapiens_AKAP95_KD_miR_12_293_cell_sorted_name.bam" is corrupted or unsorted

pos排序的文件小些，为什么呢？因为按照染色体位置排序相似的内容会排在一起，按照reads name排序内容则更加杂乱，跟为排序之前的内容差不多。所以一般没特殊需求，可以采用默认的pos排序。提高文件的压缩比例。

为什么 BAM 文件 sort 之后体积会变小
因为BAM 文件是压缩的二进制文件，对文件内容排序之后相似的内容排在一起，使得文件压缩比提高了，因此排序之后的 BAM 文件变小了，相对应的 SAM 文件就是纯文本文件，对 SAM 文件进行排序就不会改变文件大小。而且由于 RNA-seq 中由于基因表达量的关系，RNA-seq 的数据比对结果 BAM 文件使用 samtools 进行 sort 之后文件压缩比例变化会比 DNA-seq 更甚。另外，samtools 对 BAM 文件进行排序之后那些没有比对上的 reads 会被放在文件的末尾。

搜索实现reads计数的软件教程

（终有一天，会派上大用场，用心学！）

From: Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis

Figure 1
The RNACocktail analysis protocol. RNACocktail is a comprehensive protocol of RNA-seq data analysis. The figure summarizes the widely used approaches for the key steps over the broad spectrum of RNA-seq analysis and also succinctly captures the possible workflows one can use to analyse RNA-seq data

Salmon2

bedtools multicov3
HTseq4

usage: htseq-count [options] alignment_file gff_file
<alignment_file>：比对到基因组后得到的SAM文件（SAMtools 包含一些perl脚本可以将大多数的比对文件转换成SAM格式），注意基因组mapping时一定要用支持剪接的比对软件（splicing-aware aligner）进行比对软件如TopHat. HTSeq-count 需要用到SAM格式中的 CIGAR 区域的信息。
<gff_file>: 包含单位信息的gff/GTF文件（gff文件格式），大多数情况下就是指注释文件; 由于GTF文件其实就是gff文件格式的变形，在这里同样可以传入GTF格式文件。
This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it. See http://htseq.readthedocs.io/en/master/count.html for details.
positional arguments:
samfilenames Path to the SAM/BAM files containing the mapped reads.
If '-' is selected, read from standard input 想要通过标准输入来传入基因组mapping得到SAM文件，用 – 替换 <alignment_file> 即可
featuresfilename Path to the file containing the features
optional arguments:
-h, --help show this help message and exit
-f {sam,bam}, --format {sam,bam}
type of <alignment_file> data, either 'sam' or 'bam'
(default: sam)
指定输入文件的格式，<format> 可以是 sam (for text SAM files) 或 bam (for binary BAM files)格式，默认是sam.
-r {pos,name}, --order {pos,name}
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). **Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. **Ignored for single-end data. 如果你是双端测序，必须要对SAM进行排序（单端可不必排序，但这里我也推荐对单端测序结果排序已减少内存消耗并提高软件效率），对read name或染色体位置进行排序皆可（这里我推荐按read name排序，因为通过位置排序我遇到过错误）。具体需要通过 -r 参数指定，所以排序请详见参数 -r，在sort时用-n则不用修改，默认的排序则修改成pos。
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
whether the data is from a strand-specific assay.Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand interpretation
是否链特异性，如果不是修改成no
-a MINAQUAL, --minaqual MINAQUAL
skip all reads with alignment quality lower than the given minimum value (default: 10)
指定一个最低 read mapping质量值，低于<minaqual>值会被过滤掉
-t FEATURETYPE, --type FEATURETYPE
feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)
指定最小计数单位类型（gff文件的第3列中的类型如： exon）, 指定后其他单位类型将被忽略（默认情况下，对于rna-seq分析采用Ensembl GTF：http://mblab.wustl.edu/GTF22.html文件类型时,默认值是：exon）。
-i IDATTR, --idattr IDATTR
GFF attribute to be used as feature ID (default,suitable for Ensembl GTF files: gene_id)
最终的计数单位，一般为基因。默认为：gene_id，也可以设置转录本，但由于模型问题，计数效果不佳
需要改成gene_name，可以直接识别，如果是gene_id不一定认识
--additional-attr ADDITIONAL_ATTR [ADDITIONAL_ATTR ...]
Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name)
-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-nonempty; default: union)
判断一个reads属于某个基因的模型，用来判断统计reads的时候对一些比较特殊的reads定义是否计入。
<mode> 包括：默认的union和intersection-strict、 intersection-nonempty （默认：union）。
--nonunique {none,all}
Whether to score reads that are not uniquely aligned or ambiguously assigned to features
-o SAMOUTS [SAMOUTS ...], --samout SAMOUTS [SAMOUTS ...]
write out all SAM alignment records into an output SAM file called SAMOUT, annotating each line with its feature assignment (as an optional field with tag 'XF')
输出所有alignment的reads到一个叫 <samout>的sam文件中，通过一个可选的sam标签 ‘ XF ’来标注每一行对应的单位和计数，可以不设置。
-q, --quiet suppress progress report 屏蔽程序报告和警告
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.8.0.

更详细参考htseq-count的用法5

如何判断一个reads属于某个基因， htseq-count 提供了union, intersection_strict, intersection_nonempty 3 种模型，如图（大多数情况下作者推荐用union模型）：

如果这3种模型不能满足您的需求，您还可以通过htseq-count 自定义模型6。

用htseq-count，每个样本都会输出一个表达量文件

（参考别人的，仍要用心写自己的！）

使用版本：version 0.7.2

ls *.Nsort.bam |while read id;do (nohup samtools view $id | ~/.local/bin/htseq-count -f sam -s no -i gene_name - ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf 1>${id%%.*}.geneCounts 2>${id%%.*}.HTseq.log&);done

一个RNA-seq实战-超级简单-2小时搞定！7过程比较复杂，中间有个bam/sam转换过程，下面是我的命令8：

# 人类
# mkdir matrix
Homo_GTF=~/rna_seq/data/reference/genome/hg19/gencode.v26lift37.annotation.gtf
ls --color=never Homo_sapiens*sorted_name.bam|while read id;do(htseq-count -f bam -i gene_name -s no $id $Homo_GTF > matrix/${id%_*}.count 2> matrix/${id%_*}.log);done
# 老鼠
# 下载gtf：http://www.gencodegenes.org/mouse_stats/archive.html
# wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M10/gencode.vM10.annotation.gtf.gz
# gzip -d gencode.vM10.annotation.gtf.gz
Mus_GTF=~/rna_seq/data/reference/genome/mm10/gencode.vM10.annotation.gtf
ls --color=never Mus_musculus*sorted_name.bam|while read id;do(htseq-count -f bam -i gene_name -s no $id $Mus_GTF > matrix/${id%_*}.count 2> matrix/${id%_*}.log);done

出现警告，我在网上检索了一下，应该没什么问题，可以忽略9

编写脚本合并所有的样本为表达矩阵

（看，之前学的就用到了吧！）

wc -l Homo_sapiens*.count
head -n 4 Homo_sapiens*.count

具体参考生信编程直播第四题：多个同样的行列式文件合并起来10

perl -lne 'if($ARGV=~/Homo_sapiens_(.*)_sorted.count/){print "$1\t$_"}' *|grep -v Homo_sapiens>hg.count
# 先把所有文件进行合并
setwd("~/rna_seq/work/matrix")
hg <- read.csv(file = "hg.count",header = F,sep = "\t")
colnames(hg) <- c('sample','gene','count')
library(reshape2)
reads <- dcast(hg,formula = gene ~ sample)
write.table(reads,file = "hg_join.count",sep = "\t",quote = FALSE,row.names = FALSE)

对表达矩阵可进行简单的探索

summary(reads)

sample_mean <- group_by(hg,sample)%>% summarize(mean=mean(count))

gene_mean <- group_by(hg,gene)%>% summarize(mean=mean(count))

tmp <- full_join(hg,gene_mean,by='gene')

看看一些生物学意义特殊的基因表现如何，比如GAPDH,β-ACTIN等等

（生物学意义，你懂么？）

#软件工具#比较Samtools和HTSeq读取和处理SAM/BAM格式文件
基因表达分析（前传）-准备count矩阵
使用Bedtools对RNA-seq进行基因计数：http://www.bio-info-trainee.com/745.html
转录组HTseq对基因表达量进行计数：http://www.bio-info-trainee.com/244.html
htseq-count的用法：http://yangl.net/2016/09/21/htseq-count_manual/
htseq官网：http://htseq.readthedocs.io/en/release_0.9.1/
一个RNA-seq实战-超级简单-2小时搞定！：http://www.bio-info-trainee.com/2218.html
htseq-counts跟bedtools的区别：http://www.bio-info-trainee.com/2022.html
Question: Htseq Count:https://www.biostars.org/p/84907/
生信编程直播第四题：多个同样的行列式文件合并起来：http://www.biotrainee.com/thread-603-1-1.html

历史记录

【PPT】文献报告-A survey of best practices for RNA-seq data analysis

PANDA姐的转录组入门(1)：计算机资源的准备

PANDA姐的转录组入门(2)：读文章拿到测序数据

PANDA姐的转录组入门(3):了解fastq测序数据

PANDA姐的转录组入门(4):了解参考基因组及基因注释

PANDA姐的转录组入门(5)：序列比对

~ 牛郎织女，千里相会呀 ~

<< 滑动查看下一张图片 >>

100 34914 100 34914 0 0 7185 0 0:00:04 0:00:04 --:--:-- 7591 * Connection #0 to host 37.48.118.90 left intact

一把短刀，怎么就让他连捅18人？！

上海超市血案：背后缘由让人揪心

为啥一线城市只有广州取消限购？是因为穷吗

当前三大问题：国家的方向感、精英的安全感、百姓的希望感

这次我怀疑邱成桐已经“学阀化”了