提取酵母两端扩展 50nt 的 CDS 序列
戴上金箍你就不是凡人了
1引言
酵母虽然是真核生物,但是与高等真核生物相差还是很大的,例如基因大部分没有 UTR 区域, CDS 区域基本上是 连续的, 异构体很少,可变剪接也较少。
酵母 IGV 基因结构:
Ribominer 有这样一段话:
Because there is almost no UTR regions for Saccharomyces cerevisiae, for better observation I extended 50 nt away from start codon and stop codon, which means coordinates in gtf file are increased about 50 nt for both start and end position.
今日尝试自己写代码来做这样一件事。
2下载基因组及注释文件
从 ensembl
官网下载酵母的基因组及注释文件:
genome:
http://ftp.ensembl.org/pub/release-106/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
GTF::
http://ftp.ensembl.org/pub/release-106/fasta/saccharomyces_cerevisiae/cds/Saccharomyces_cerevisiae.R64-1-1.cds.all.fa.gz
3提取基因长度及 CDS 坐标信息
from pyfaidx import Fasta
import pandas as pd
cds_info = []
exon_info = {}
with open('Saccharomyces_cerevisiae.R64-1-1.105.gtf','r') as gtf:
for line in gtf:
# skip #
if line.startswith('#'):
continue
# split
fields = line.split()
# type
type = fields[2]
if type == 'CDS':
# ser no UTR,extend +- 50 nt
start = int(fields[3])
end = int(fields[4])
chrom = fields[0]
strand = fields[6]
if fields[14] == 'gene_name':
gene_name = fields[15].replace('"','').replace(';','')
else:
gene_name = fields[9].replace('"','').replace(';','')
cds_info.append([gene_name,chrom,start,end,strand])
elif type == 'exon':
# exon length
length = int(fields[4]) - int(fields[3]) + 1
if fields[14] == 'gene_name':
gene_name = fields[15].replace('"','').replace(';','')
else:
gene_name = fields[9].replace('"','').replace(';','')
# sum
exon_info.setdefault(gene_name,0)
exon_info[gene_name] += length
else:
pass
转为 dataframe:
# to dataframe
df = pd.DataFrame(cds_info,columns=['gene_name','chrom','start','end','strand'])
# ascend coord by + strand gene
dfinfo_1_strand = df[df['strand'] == '+'].sort_values(by = ['gene_name','start','end'],ascending = True)
# descrese coord by - strand gene
dfinfo_2_strand = df[df['strand'] == '-'].sort_values(by = ['gene_name','start','end'],ascending = False)
# merge
df_fianl = pd.concat([dfinfo_1_strand,dfinfo_2_strand],axis=0).reset_index(drop=True)
df_fianl
4extend 50 nt
对 CDS 区域上下游拓展 50 nt 时,需要注意一个基因如果有多个不连续的 CDS 区域时,则需要考虑第一个和最后一个 CDS 区域左端和右端进行拓展,还需要考虑正负链两种情况对两端进行拓展。 如果只有一个 CDS 区域,则上下游减加 50 即可。
我们测试一个有两个 CDS 的正链基因:
# save in list
extend_df = []
# unique gene name
# gene = df_fianl.gene_name.unique().tolist()
for g in ['ABP140']:
# for g in ['ACT1']:
# for g in gene:
tmp = df_fianl[df_fianl.gene_name == g].reset_index(drop=True)
print(tmp)
# whether multiple CDS
if tmp.shape[0] == 1:
tmp.start = tmp.start - 50
tmp.end = tmp.end + 50
extend_df.append(tmp)
else:
# multiple CDS extend +- 50 nt
if tmp.iloc[0,4] == '+':
nrow = tmp.shape[0] - 1
tmp.iloc[0,2] = tmp.iloc[0,2] - 50
tmp.iloc[nrow,3] = tmp.iloc[nrow,3] + 50
extend_df.append(tmp)
else:
nrow = tmp.shape[0] - 1
tmp.iloc[0,3] = tmp.iloc[0,3] + 50
tmp.iloc[nrow,2] = tmp.iloc[nrow,2] - 50
extend_df.append(tmp)
# combine
extend_dfame = pd.concat(extend_df).reset_index(drop=True)
extend_dfame
可以看到正确的对特定位置进行了拓展。
同样测试一个有两个 CDS 的负链基因:
# save in list
extend_df = []
# unique gene name
# gene = df_fianl.gene_name.unique().tolist()
# for g in ['ABP140']:
for g in ['ACT1']:
# for g in gene:
tmp = df_fianl[df_fianl.gene_name == g].reset_index(drop=True)
print(tmp)
# whether multiple CDS
if tmp.shape[0] == 1:
tmp.start = tmp.start - 50
tmp.end = tmp.end + 50
extend_df.append(tmp)
else:
# multiple CDS extend +- 50 nt
if tmp.iloc[0,4] == '+':
nrow = tmp.shape[0] - 1
tmp.iloc[0,2] = tmp.iloc[0,2] - 50
tmp.iloc[nrow,3] = tmp.iloc[nrow,3] + 50
extend_df.append(tmp)
else:
nrow = tmp.shape[0] - 1
tmp.iloc[0,3] = tmp.iloc[0,3] + 50
tmp.iloc[nrow,2] = tmp.iloc[nrow,2] - 50
extend_df.append(tmp)
# combine
extend_dfame = pd.concat(extend_df).reset_index(drop=True)
extend_dfame
可以看到正确的对特定位置进行了拓展。
把 for 这句替换成所有基因的即可:
# unique gene name
gene = df_fianl.gene_name.unique().tolist()
# for g in ['ABP140']:
# for g in ['ACT1']:
for g in gene:
tmp = df_fianl[df_fianl.gene_name == g].reset_index(drop=True)
...
5读取基因组提取序列
# extact sequnece from genome
# load genome
genome = Fasta('Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa')
# chrmosome info
chrmosome_list = genome.keys()
# save in dict
res = {}
for line in range(0,df_fianl.shape[0]):
# chromosome strand
fileds = extend_dfame.iloc[line]
chrom = fileds['chrom']
strand = fileds['strand']
start = int(fileds['start'])
end = int(fileds['end'])
# key
key = '|'.join([fileds['gene_name'],fileds['strand']])
# filter chromoseome
if chrom in chrmosome_list:
# extarct sequence
if strand == '+':
seq = genome[chrom][(start-1):end].seq
elif strand == '-':
seq = genome[chrom][(start-1):end].complement.reverse.seq
else:
pass
# save in dict
res.setdefault(key,'')
res[key] += seq
else:
pass
添加基因长度信息
new_res = {}
for key,val in res.items():
length = len(val)
cds_start = 51
cds_end = length - 50
# add gene length
gene_length = exon_info[key.split(sep='|')[0]]
key1 = '|'.join([key,str(cds_start),str(cds_end),str(gene_length)])
new_res[key1] = val
最后输出序列
# 输出序列
outputfile = open('extened-CDS-seq.fa','w')
# 输出
for key,val in new_res.items():
outputfile.write('>' + key + '\n')
outputfile.write(val + '\n')
# 关闭文件
outputfile.close()
6检查
正链基因起始:
正链基因末尾:
负链基因起始:
负链基因末尾:
IGV 检查
7结尾
stop codon 因为
不属于 CDS 区域
,所以也处于拓展的 50nt 里面。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:
老俊俊微信:
知识星球:
所以今天你学习了吗?
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀南京的春
◀...