其他
数据处理过程中有的是意外
明天和意外,哪一个先到来呢?
在运行picard的时候,总是报错,如下;
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Start on sequence 'chr10' was past the end: 133797422 < 133805459
从字面上很难理解, 我查看了一些,最新版的hg38基因组里面的10号染色体长度是133797422 ,那么为什么我的bed文件里面会超出呢?
再查看,发现是这个基因在作怪:
10 132104808 132104942 JAKMIP3
10 132117076 132117573 JAKMIP3
10 132133311 132133526 JAKMIP3
10 132135040 132135159 JAKMIP3
10 132135929 132136075 JAKMIP3
10 132137018 132137149 JAKMIP3
10 132137253 132137288 JAKMIP3
10 132138118 132138177 JAKMIP3
10 132140450 132140578 JAKMIP3
10 132141919 132142047 JAKMIP3
10 132145106 132145189 JAKMIP3
10 132145517 132145579 JAKMIP3
10 132147951 132148049 JAKMIP3
10 132149411 132149509 JAKMIP3
10 132149981 132150034 JAKMIP3
10 132152957 132153022 JAKMIP3
10 132153758 132153826 JAKMIP3
10 132153912 132153989 JAKMIP3
10 132163208 132163411 JAKMIP3
10 132164669 132164734 JAKMIP3
10 132166982 132167032 JAKMIP3
10 133805458 133805541 JAKMIP3
10 133808600 133808683 JAKMIP3
10 133809011 133809073 JAKMIP3
10 133811445 133811543 JAKMIP3
10 133812905 133813003 JAKMIP3
10 133813475 133813528 JAKMIP3
10 133816451 133816516 JAKMIP3
10 133817252 133817320 JAKMIP3
10 133817406 133817483 JAKMIP3
10 133826702 133826905 JAKMIP3
10 133828163 133828228 JAKMIP3
10 133830476 133830526 JAKMIP3
为什么这个基因有两个完全不一样的坐标呢? http://www.genecards.org/cgi-bin/carddisp.pl?gene=JAKMIP3
Genomic Locations for JAKMIP3 Gene
Genomic Locations for JAKMIP3 Gene
chr10:132,036,107-132,184,809 (GRCh38/hg38)
Size:148,703 bases
Orientation:Plus strand
chr10:133,918,175-133,998,313 (GRCh37/hg19)
似乎是我把hg19的这个坐标混入了hg38,那我为什么会犯这个错误呢?
我是在CCDS数据库里面下载了 CCDS.20160908.txt
文件,然后:
cat CCDS.20160908.txt |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed
我以为我理解了 CCDS.20160908.txt
文件的规律,所以上面的代码是正确的,但是它并不准备,我搜索 JAKMIP3这个基因如下:
10 NC_000010.9 JAKMIP3 282973 CCDS7664.1 Withdrawn + 133805458 133830526 [133805458-133805541, 133808600-133808683, 133809011-133809073, 133811445-133811543, 133812905-133813003, 133813475-133813528, 133816451-133816516, 133817252-133817320, 133817406-133817483, 133826702-133826905, 133828163-133828228, 133830476-133830526] Identical
10 NC_000010.11 JAKMIP3 282973 CCDS44494.1 Public + 132104808 132167032 [132104808-132104942, 132117076-132117573, 132133311-132133526, 132135040-132135159, 132135929-132136075, 132137018-132137149, 132137253-132137288, 132138118-132138177, 132140450-132140578, 132141919-132142047, 132145106-132145189, 132145517-132145579, 132147951-132148049, 132149411-132149509, 132149981-132150034, 132152957-132153022, 132153758-132153826, 132153912-132153989, 132163208-132163411, 132164669-132164734, 132166982-132167032] Identical
很意外吧,它的确在CCDS数据库里面被记录了两个坐标,O(∩_∩)O哈哈~
值得注意的是第五列表明这两个记录其实是不一样的,我再统计一下:
32534 Public
7 Reviewed, update pending
4 Under review, update
15 Under review, withdrawal
1731 Withdrawn
6 Withdrawn, inconsistent annotation
既然Withdrawn的记录那么少,而且在JAKMIP3这个基因我可以肯定它是错误的,所以应该是要舍弃的。
新的代码如下:
cat CCDS.20160908.txt |grep -w -v Withdrawn|perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed
问题解决咯~
细思极恐,为什么这个错误我直到现在才发现呢?而CCDS数据库我使用了超过两年了,那么之前的其它软件为什么不报错呢?他们是忽略了我的input错误吗?