Gene annotation of genomic interval files in BED format

Original. 曾健明, 生信技能树, 2022-06-06

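Genomic interval files are usually copy number variation (CNV) records or peak files. They are most often annotated against genes, because protein-coding genes are what most of us know best: once a region is linked to a gene, you have a rough idea of what it does.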

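Take a CNV text file in BED format as an example. Its first four columns are chromosome, start, end, and sample ID, as in the annotated output further down.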

Next, build a BED-style file of gene coordinates with four columns: chromosome, start, end, and gene name.

The steps are simple. First, download the GTF files from the GENCODE database.

The download commands are:

mkdir -p ~/reference/gtf/gencode
cd ~/reference/gtf/gencode
## https://www.gencodegenes.org/releases/current.html
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.2wayconspseudos.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.long_noncoding_RNAs.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.polyAs.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz
## https://www.gencodegenes.org/releases/25lift37.html
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gtf.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.HGNC.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.EntrezGene.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.metadata.RefSeq.gz

Then use a short script to pull out each gene's chromosome and its start and end coordinates.

The code is:

## GTF coordinates are 1-based and inclusive; BED is 0-based and half-open, so 1 is subtracted from the start
## hg38: lncRNA genes
zcat gencode.v25.long_noncoding_RNAs.gtf.gz | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > lncRNA.hg38.position
## hg38: pseudogenes, filtered on "transcript" records instead of "gene"
zcat gencode.v25.2wayconspseudos.gtf.gz | perl -alne '{next unless $F[2] eq "transcript" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > pseudos.hg38.position
## hg38: protein-coding genes only, then all genes
zcat gencode.v25.annotation.gtf.gz | grep protein_coding | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > protein_coding.hg38.position
zcat gencode.v25.annotation.gtf.gz | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > allGene.hg38.position
## hg19 (GRCh37 mapping): protein-coding genes only, then all genes
zcat gencode.v25lift37.annotation.gtf.gz | grep protein_coding | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > protein_coding.hg19.position
zcat gencode.v25lift37.annotation.gtf.gz | perl -alne '{next unless $F[2] eq "gene" ;/gene_name \"(.*?)\";/; print join("\t", $F[0], $F[3]-1, $F[4], $1) }' > allGene.hg19.position

PS: there is one small catch here. GENCODE records come from two sources, HAVANA and ENSEMBL, and the two need to be handled separately, especially in hg38!
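One way to see how much this matters for a given file is to count gene records per source, which is the second column of a GTF line. Here is a minimal sketch on made-up records; the file name `toy_src.gtf`, the gene names, and all coordinates are hypothetical.

```shell
# Toy records (made up); column 2 is the annotation source, column 3 the feature type
printf 'chrT\tHAVANA\tgene\t101\t200\t.\t+\t.\tgene_name "TOY1";\n' > toy_src.gtf
printf 'chrT\tENSEMBL\tgene\t301\t400\t.\t-\t.\tgene_name "TOY2";\n' >> toy_src.gtf
printf 'chrT\tHAVANA\tgene\t501\t600\t.\t+\t.\tgene_name "TOY3";\n' >> toy_src.gtf
printf 'chrT\tHAVANA\texon\t101\t150\t.\t+\t.\tgene_name "TOY1";\n' >> toy_src.gtf

# Count gene records per source; exon and other feature types are ignored
src_counts=$(awk -F'\t' '$3 == "gene" { n[$2]++ } END { for (s in n) print s "\t" n[s] }' toy_src.gtf | sort)
printf '%s\n' "$src_counts"
```

On a real download, the same awk program can read from `zcat gencode.v25.annotation.gtf.gz` through a pipe instead of the toy file.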

To avoid reinventing the wheel, I'll do the annotation itself with bedtools, which I know well. The command is simple:

bedtools intersect -a Features.bed -b ~/reference/gtf/gencode/protein_coding.hg19.position -wa -wb \
| bedtools groupby -i - -g 1-4 -c 10 -o collapse
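To make explicit what the two bedtools steps compute, here is a minimal awk sketch on made-up toy intervals. It is a naive all-against-all comparison, meant only to illustrate the overlap test and the collapse step; it is not a replacement for bedtools. The file names, coordinates, and gene names are hypothetical.

```shell
# Toy CNV segments and toy genes (made-up coordinates and names)
printf 'chrT\t100\t500\tS1\nchrT\t900\t1000\tS1\nchrT\t2000\t2100\tS1\n' > toy_cnv.bed
printf 'chrT\t50\t150\tGENE_A\nchrT\t400\t600\tGENE_B\nchrT\t950\t1200\tGENE_C\nchrU\t100\t200\tGENE_D\n' > toy_genes.bed

# First file: load the genes. Second file: collect overlapping gene names per segment.
annotated=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { chr[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; name[NR] = $4; ng = NR; next }
  {
    hits = ""
    for (i = 1; i <= ng; i++)
      # half-open intervals overlap when each one starts before the other ends
      if (chr[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0)
        hits = (hits == "" ? name[i] : hits "," name[i])
    # segments that overlap no gene are dropped, as with a plain intersect
    if (hits != "") print $1, $2, $3, $4, hits
  }' toy_genes.bed toy_cnv.bed)
printf '%s\n' "$annotated"
```

The third toy segment overlaps no gene and is missing from the result; the bedtools command above drops such CNV segments in the same way. Note also that `-g 1-4 -c 10` only works if the gene name is column 10 of the intersect output, that is, if `Features.bed` has six columns. With a four-column file the gene name would be column 8.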

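Here are a few readable rows from the annotated result. Each CNV segment listed is annotated with the genes it overlaps, and some very large segments pick up a great many genes.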

chr8 42584924 42783715 TCGA-5T-A9QA-01A-11D-A41E-01 CHRNB3,CHRNA6,THAP1,RNF170,HOOK3
chr8 42789728 42793594 TCGA-5T-A9QA-01A-11D-A41E-01 HOOK3
chr8 42797957 42933372 TCGA-5T-A9QA-01A-11D-A41E-01 RP11-598P20.5,FNTA,HOOK3
chr8 70952673 70964372 TCGA-5T-A9QA-01A-11D-A41E-01 PRDM14
chr10 42947970 43833200 TCGA-5T-A9QA-01A-11D-A41E-01 BMS1,RET,RASGEF1A,ZNF33B,CSGALNACT2
chr10 106384615 106473355 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 106478366 107298256 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 117457285 117457859 TCGA-5T-A9QA-01A-11D-A41E-01 ATRNL1
chr11 68990173 69277078 TCGA-5T-A9QA-01A-11D-A41E-01 MYEOV
chr11 76378708 76926535 TCGA-5T-A9QA-01A-11D-A41E-01 LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5
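To find the segments that pick up the most genes, the collapsed gene column can be split and counted. The sketch below is one possible way to do that with awk; it uses three rows copied from the output above, rewritten as tab-separated text, and the file name `annotated_subset.tsv` is hypothetical.

```shell
# Three rows copied from the annotated output above, one row per group of five arguments
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr8 42584924 42783715 TCGA-5T-A9QA-01A-11D-A41E-01 CHRNB3,CHRNA6,THAP1,RNF170,HOOK3 \
  chr8 42789728 42793594 TCGA-5T-A9QA-01A-11D-A41E-01 HOOK3 \
  chr11 76378708 76926535 TCGA-5T-A9QA-01A-11D-A41E-01 LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5 \
  > annotated_subset.tsv

# split() returns the number of comma-separated gene names; print it first, then sort descending
gene_counts=$(awk 'BEGIN { FS = OFS = "\t" } { print split($5, genes, ","), $1, $2, $3 }' annotated_subset.tsv | sort -k1,1nr)
printf '%s\n' "$gene_counts"
```

The chr11 segment comes first with 7 genes, followed by the two chr8 segments with 5 and 1.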

If you would rather write a program for this, that works too. I have done it in Perl before, and the code is on my blog.