Understanding the BAM File Format
Header section
https://www.slideserve.com/anaya/read-processing-and-mapping-from-raw-to-analysis-ready-reads
Alignment section
An example
Real data:
SRR581720.2237957 147 chr11 5226998 60 72M = 5226897 -173 CTCCTCAGGAGTCAGATGCACCATGGTGTCTGTTTGAGGTTGCTAGTGAACACAGTTGTGTCAGAAGCAAAT @7@7...88784B8<B989*9:*)?;F;HG?*D9GEE;:9A9<<+D@H>F<3<+BHE?BABFBBB=::B;?B NM:i:0 MD:Z:72 MC:Z:76M AS:i:72 XS:i:57 RG:Z:SRRxx
SRR581247.2252703 83 chr11 5226999 60 75M = 5226943 -131 TCCTCAGGAGTCAGATGCACCATGGTGTCTGTTTGAGGTTGCTAGTGAACACAGTTGTGTCAGAAGCAAATGTAA :GHAEHC@@>GEGDIGCIIJHF=GFGIJJIIGJIHFEEHGADAHFJJJJIHIIJIIGIIGGIHHDDAFFFDD@@@ NM:i:0 MD:Z:75 MC:Z:76M AS:i:75 XS:i:56 RG:Z:SRRxx
SRR582147.2312466 83 chr11 5226999 60 76M = 5226926 -149 TCCTCAGGAGTCAGATGCACCATGGTGTCTGTTTGAGGTTGCTAGTGAACACAGTTGTGTCAGAAGCAAATGTAAG HHIGIJJJIJGJJIJJIGFIIJJJJJJIJJJJJJJJJJJJJJJJIJJJJIGJJJJIJJJJJJJHHHHHFFFFFCCC NM:i:0 MD:Z:76 MC:Z:76M AS:i:76 XS:i:56 RG:Z:SRRxx
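Column 1: QNAME, the read ID.
Column 2: FLAG. For example, 83 (decimal) = 000001010011 (binary) = 1 + 2 + 16 + 64.
That is, 83 decomposes into a sum of four powers of two; look up each of the four values (1, 2, 16, 64) in the table below: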
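Decimal  Description of read
1        Read paired
2        Read mapped in proper pair (both reads of the pair aligned to the reference in the orientation and at the distance the aligner expects)
4        Read unmapped (not aligned to the reference)
8        Mate unmapped (the other read of the pair is not aligned to the reference)
16       Read reverse strand (aligned to the reverse strand of the reference)
32       Mate reverse strand (the other read of the pair is aligned to the reverse strand of the reference)
64       First in pair
128      Second in pair
256      Not primary alignment (a secondary alignment, not the best one for this read)
512      Read fails platform/vendor quality checks (the read did not pass base-quality or sequencing-platform filters)
1024     Read is PCR or optical duplicate (a duplicate introduced by PCR, library construction, or sequencing)
2048     Supplementary alignment (the read may be chimeric; this record aligns only part of the read)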
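FLAG = 83 therefore means: (1) read paired; (2) read mapped in proper pair; (16) read aligned to the reverse strand of the reference; (64) first in pair.
Likewise, FLAG 147 (decimal) = 000010010011 (binary) = 1 + 2 + 16 + 128: the same as 83, except that the read is the second in its pair.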
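The decomposition above can be reproduced with a bitwise AND against each value in the table. The following is a minimal sketch; the names FLAG_BITS and decode_flag are illustrative and do not come from any library.

```python
# Bit values and meanings, copied from the FLAG table above.
FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read reverse strand",
    32: "mate reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails platform/vendor quality checks",
    1024: "read is PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """Return the (bit, meaning) pairs that are set in a SAM FLAG value."""
    return [(bit, meaning) for bit, meaning in FLAG_BITS.items() if flag & bit]


for flag in (83, 147):
    bits = [bit for bit, _ in decode_flag(flag)]
    print(flag, "=", " + ".join(str(b) for b in bits))
    for bit, meaning in decode_flag(flag):
        print("   ", bit, meaning)
```

For 83 this lists the bits 1, 2, 16 and 64, and for 147 the bits 1, 2, 16 and 128, matching the sums worked out by hand.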
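Column 5: MAPQ, the mapping quality. The higher the value, the more confident the aligner is that the position it reports for the read on the reference genome is the unique, correct one.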
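The SAM specification defines MAPQ as a Phred-scaled value, -10 * log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning that the mapping quality is not available. A minimal sketch of the reverse conversion follows; the function name is illustrative, and since aligners differ in how they compute and cap MAPQ, the result is only a nominal probability.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled MAPQ into the nominal probability that the position is wrong."""
    if mapq == 255:
        return None  # 255 means "mapping quality not available"
    return 10 ** (-mapq / 10)


for mapq in (0, 20, 60):
    print(mapq, mapq_to_error_probability(mapq))
```

The three example records all have MAPQ 60, which corresponds to a nominal error probability of about one in a million.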
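Column 6: CIGAR. It describes the alignment relative to the reference sequence as a series of numbers, each followed by an operation letter.
Alignment match/mismatch, insertion, and deletion correspond to the letters M, I, and D, respectively.
Example: 36M means 36 consecutive bases are aligned to the reference with no insertions or deletions (M covers both matches and mismatches).
Example: 3S6M1P1I4M means the first 3 bases are soft-clipped (kept in SEQ but not aligned), the next 6 bases are aligned, then comes 1 padding position (P, a silent deletion from the padded reference),
then 1 base is inserted relative to the reference, and the last 4 bases are aligned.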
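A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below is a minimal example; parse_cigar, query_length and reference_length are illustrative names, and the sets of operations that consume read or reference bases follow the SAM specification. It does not handle the placeholder CIGAR "*" used for unmapped reads.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read (SEQ) or of the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '3S6M1P1I4M' into (length, op) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Guard against malformed input: the parsed pieces must rebuild the string.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(ops):
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def reference_length(ops):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


ops = parse_cigar("3S6M1P1I4M")
print(ops)
print("read bases:", query_length(ops), "reference bases:", reference_length(ops))
```

For 3S6M1P1I4M the read contributes 14 bases (3 + 6 + 1 + 4) while the alignment spans only 10 reference bases (6 + 4), because soft-clipped and inserted bases do not consume the reference.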
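Column 8: PNEXT, the position on the reference genome where the other read of the pair (the mate) aligns. Coordinates are 1-based and refer to the leftmost aligned base, so the smallest valid position is 1 (0 means the information is unavailable).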
Column 9: TLEN, the observed template length.
It is the length of the reference region covered by the two reads of a pair,
that is, the distance from the leftmost mapped base to the rightmost mapped base of the paired reads.
For unpaired reads it is 0.
The sign is positive if the current read is the leftmost read of the pair and negative if it is the rightmost.
https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/file-formats-tutorial/
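The TLEN definition can be checked against the three example records shown earlier. The sketch below makes several assumptions: the CIGAR contains only M operations (true for the examples, so the reference span equals the CIGAR length), the mate's leftmost base (PNEXT) is the leftmost base of the template, and the current read ends at the rightmost base. The function name is illustrative.

```python
def tlen_for_rightmost_read(pos, ref_span, pnext):
    """TLEN of the rightmost read of a pair (negative by convention).

    pos      -- 1-based leftmost position of this read (column 4, POS)
    ref_span -- number of reference bases this read covers
    pnext    -- 1-based leftmost position of the mate (column 8, PNEXT)
    """
    rightmost_base = pos + ref_span - 1
    return -(rightmost_base - pnext + 1)


# (POS, reference span from CIGAR, PNEXT, TLEN reported in the record)
examples = [
    (5226998, 72, 5226897, -173),
    (5226999, 75, 5226943, -131),
    (5226999, 76, 5226926, -149),
]

for pos, span, pnext, reported in examples:
    computed = tlen_for_rightmost_read(pos, span, pnext)
    print(f"computed={computed} reported={reported}")
```

Under these assumptions the computed value agrees with the TLEN reported in each of the three records. All three are reverse-strand reads whose mates start further left, which is why the values are negative.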
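Column 10: SEQ, the read's base sequence, i.e. line 2 of the FASTQ record (stored reverse-complemented when the read aligns to the reverse strand).
Column 11: QUAL, the base qualities, i.e. line 4 of the FASTQ record (stored reversed when the read aligns to the reverse strand).
Column 12 and beyond: optional fields in TAG:TYPE:VALUE form. There may be several of them, separated by tabs (\t).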
NM:i --> Number of differences (mismatches plus inserted and deleted bases) between
the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence
and reference as potential matches, with everything else being a mismatch.
Note this means that ambiguity codes in both sequence and reference that match each other (such as N in both) are still counted as mismatches.
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* --> String for mismatching positions: each number counts matching bases, a letter is the reference base at a mismatch, and ^ followed by letters gives reference bases deleted from the read.
MC:Z --> CIGAR string for mate/next segment.
AS:i --> Alignment score generated by aligner.
XS:i --> Suboptimal alignment score, essentially the score of the best secondary alignment (an aligner-specific tag).
If AS and XS are close or equal, the read aligns about equally well to more than one location.
RG:Z --> The read group to which the read belongs. If @RG headers are present, then the value must match the RG-ID field of one of the headers.
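Because every optional field has the form TAG:TYPE:VALUE, the tags listed above can be collected into a dictionary. This is a minimal sketch with an illustrative function name; it converts only the i (integer) and f (float) types and leaves every other type (A, Z, H, B) as a string.

```python
def parse_optional_fields(fields):
    """Turn ['NM:i:0', 'MD:Z:75', ...] into {'NM': 0, 'MD': '75', ...}."""
    tags = {}
    for field in fields:
        # Split at most twice so that a ':' inside the value is preserved.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    return tags


# Optional fields of the second example record (tab-separated in a real SAM line).
line_tail = "NM:i:0\tMD:Z:75\tMC:Z:76M\tAS:i:75\tXS:i:56\tRG:Z:SRRxx"
tags = parse_optional_fields(line_tail.split("\t"))
print(tags)
print("AS - XS =", tags["AS"] - tags["XS"])
```

Here AS exceeds XS by 19, so the reported position scores clearly better than the best alternative, which is consistent with the record's MAPQ of 60.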
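Note that the differences from the reference (mismatches and deletions) recorded in a BAM file are stored among the optional fields (column 12 onward), in the following tag:
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*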
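The MD pattern above can be tokenized into runs of matches, single mismatched reference bases, and deleted reference bases. The sketch below is a minimal example with illustrative names; the value 10A5^AC6 is a made-up input, not taken from the example records, whose MD values (72, 75, 76) contain matches only.

```python
import re

# A token is a run of digits, a '^' followed by deleted bases, or one mismatched base.
MD_TOKEN = re.compile(r"[0-9]+|\^[A-Z]+|[A-Z]")


def parse_md(md):
    """Return events: ('match', n), ('mismatch', ref_base) or ('deletion', ref_bases)."""
    events = []
    for token in MD_TOKEN.findall(md):
        if token.isdigit():
            if int(token) > 0:  # a 0 only separates adjacent mismatches/deletions
                events.append(("match", int(token)))
        elif token.startswith("^"):
            events.append(("deletion", token[1:]))
        else:
            events.append(("mismatch", token))
    return events


print(parse_md("72"))        # MD value of the first example record
print(parse_md("10A5^AC6"))  # hypothetical value with a mismatch and a deletion
```

The second call describes 10 matching bases, a mismatch where the reference has A, 5 more matches, a deletion of the reference bases AC, and 6 final matches. Inserted bases do not appear in MD; they are recorded in the CIGAR string.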
Written by: 叶明皓