Source article: Landscape of somatic mutations in 560 breast cancer whole genome sequences
The genome was partitioned according to different sets of regulatory elements/gene features, with a separate analysis performed for each set of elements, including:
exons (n=20,245 genes)
core promoters (n=20,245 genes, where a core promoter is the interval [−250,+250] bp from any transcription start site (TSS) of a coding transcript of the gene, excluding any overlap with coding regions)
5’ UTR (n=9,576 genes)
3’ UTR (n=19,502 genes)
intronic regions flanking exons (n=20,212 genes, where a flanking region is any intronic sequence within 75 bp of an exon, excluding any base overlapping any of the above elements)
ncRNAs (n=10,684, full length lincRNAs, miRNAs or rRNAs)
enhancers (n=194,054)
ultra-conserved regions (n=187,057, a collection of regions under negative selection based on 1000 Genomes data)
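As a rough illustration of the core promoter definition in the list above, a window of 250 bp on either side of each TSS can be computed with a small awk helper. The function name `tss_to_promoter`, the three-column input layout (chromosome, TSS, gene) and the 0-based TSS convention are assumptions made for this sketch and are not taken from the paper. Removing the overlap with coding regions is not covered here.

```shell
# Hypothetical helper, not from the paper.
# Input (stdin): tab-separated "chrom  tss  gene", with tss assumed to be 0-based.
# Output: BED-style half-open interval covering the TSS base plus `flank` bp each side.
# Optional first argument: flank size in bp (default 250).
tss_to_promoter() {
  awk -v flank="${1:-250}" 'BEGIN { FS = OFS = "\t" }
  {
    start = $2 - flank
    if (start < 0) start = 0            # clamp at the start of the chromosome
    print $1, start, $2 + flank + 1, $3
  }'
  # Overlap with coding exons is NOT removed here; a tool such as
  # `bedtools subtract` could be used for that step.
}

# Usage (tss.tsv is a hypothetical file name):
#   tss_to_promoter < tss.tsv > core_promoter.bed
```

A gene with several coding transcripts would contribute one line per TSS, so overlapping windows for the same gene may still need merging afterwards.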
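# Exon coordinates can be extracted from the CCDS text files. Each record lists its exons as [start-end, start-end, ...];
# the one-liner prints one line per exon (chromosome, start, end, gene), then removes duplicates and sorts.
# hg19 build:
perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' CCDS.20110907.txt | sort -u | bedtools sort -i - > exon_probe.hg19.gene.bed
# hg38 build:
perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}' CCDS.20160908.txt | sort -u | bedtools sort -i - > exon_probe.hg38.gene.bed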
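For readers who find the Perl one-liner hard to follow, the same idea can be sketched step by step in awk. This is a minimal sketch and not a drop-in replacement: the function name `ccds_to_bed` is made up, the input is assumed to be tab-delimited with the chromosome in column 1 and the gene symbol in column 3, and plain `sort` stands in for `bedtools sort`. Coordinates are passed through exactly as they appear in the file, as in the original; whether the end coordinate needs adjusting for strict BED half-open intervals depends on the CCDS coordinate convention, which is worth checking in the CCDS README.

```shell
# Hypothetical helper: expand the bracketed exon list of each CCDS record
# into one "chrom  start  end  gene" line per exon.
ccds_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^#/ { next }                                # skip the header line
  {
    lb = index($0, "[")                        # position of the opening bracket
    rb = index($0, "]")                        # position of the closing bracket
    if (lb == 0 || rb <= lb + 1) next          # no exon list on this record
    locs = substr($0, lb + 1, rb - lb - 1)     # text between the brackets
    gsub(/[ \t]/, "", locs)                    # drop the spaces after commas
    n = split(locs, exon, ",")
    for (i = 1; i <= n; i++) {
      split(exon[i], se, "-")                  # se[1] = start, se[2] = end
      print $1, se[1], se[2], $3
    }
  }' "$@" | LC_ALL=C sort -u | LC_ALL=C sort -k1,1 -k2,2n
}

# Usage:
#   ccds_to_bed CCDS.20110907.txt > exon_probe.hg19.gene.bed
```

Records without a bracketed exon list are skipped, which mirrors the `next unless $1` guard in the Perl version.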
For example, open the exon coordinate file produced above, exon_probe.hg19.gene.bed, which has nearly 200,000 lines:
1 69090 70007 OR4F5
1 367658 368596 OR4F29
1 621095 622033 OR4F16
1 801942 802433 LINC00115
1 861321 861392 SAMD11
1 865534 865715 SAMD11
1 866418 866468 SAMD11
1 871151 871275 SAMD11
1 874419 874508 SAMD11
1 874654 874839 SAMD11
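A quick sanity check on a file like this is to count exons and summed exon length per gene. The helper below is only a sketch: the name `exons_per_gene` is hypothetical, and the length is computed as end minus start, which assumes BED-style half-open intervals.

```shell
# Hypothetical helper: per-gene exon count and summed exon length.
# Input: whitespace-separated "chrom start end gene" lines (file arguments or stdin).
# Output: tab-separated "gene  n_exons  total_bp", genes with most exons first.
exons_per_gene() {
  awk '{ n[$4]++; bp[$4] += $3 - $2 }          # length assumes half-open intervals
       END { for (g in n) printf "%s\t%d\t%d\n", g, n[g], bp[g] }' "$@" |
    LC_ALL=C sort -k2,2nr -k1,1
}

# Usage:
#   exons_per_gene exon_probe.hg19.gene.bed | head
```

On the ten lines shown above, SAMD11 would be reported with 6 exons and each of the other four genes with 1.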