查看原文
其他

数据处理过程中有的是意外

jimmy 生信技能树 2022-06-07

明天和意外,哪一个先到来呢?

在运行picard的时候,总是报错,如下;

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Start on sequence 'chr10' was past the end: 133797422 < 133805459

从字面上很难理解, 我查看了一些,最新版的hg38基因组里面的10号染色体长度是133797422 ,那么为什么我的bed文件里面会超出呢?

再查看,发现是这个基因在作怪:

10    132104808   132104942   JAKMIP3
10    132117076   132117573   JAKMIP3
10    132133311   132133526   JAKMIP3
10    132135040   132135159   JAKMIP3
10    132135929   132136075   JAKMIP3
10    132137018   132137149   JAKMIP3
10    132137253   132137288   JAKMIP3
10    132138118   132138177   JAKMIP3
10    132140450   132140578   JAKMIP3
10    132141919   132142047   JAKMIP3
10    132145106   132145189   JAKMIP3
10    132145517   132145579   JAKMIP3
10    132147951   132148049   JAKMIP3
10    132149411   132149509   JAKMIP3
10    132149981   132150034   JAKMIP3
10    132152957   132153022   JAKMIP3
10    132153758   132153826   JAKMIP3
10    132153912   132153989   JAKMIP3
10    132163208   132163411   JAKMIP3
10    132164669   132164734   JAKMIP3
10    132166982   132167032   JAKMIP3
10    133805458   133805541   JAKMIP3
10    133808600   133808683   JAKMIP3
10    133809011   133809073   JAKMIP3
10    133811445   133811543   JAKMIP3
10    133812905   133813003   JAKMIP3
10    133813475   133813528   JAKMIP3
10    133816451   133816516   JAKMIP3
10    133817252   133817320   JAKMIP3
10    133817406   133817483   JAKMIP3
10    133826702   133826905   JAKMIP3
10    133828163   133828228   JAKMIP3
10    133830476   133830526   JAKMIP3

为什么这个基因有两个完全不一样的坐标呢? http://www.genecards.org/cgi-bin/carddisp.pl?gene=JAKMIP3

Genomic Locations for JAKMIP3 Gene

Genomic Locations for JAKMIP3 Gene

  • chr10:132,036,107-132,184,809 (GRCh38/hg38)

  • Size:148,703 bases

  • Orientation:Plus strand

  • chr10:133,918,175-133,998,313 (GRCh37/hg19)

似乎是我把hg19的这个坐标混入了hg38,那我为什么会犯这个错误呢?

我是在CCDS数据库里面下载了 CCDS.20160908.txt 文件,然后:

 cat CCDS.20160908.txt |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed

我以为我理解了 CCDS.20160908.txt 文件的规律,所以上面的代码是正确的,但是它并不准备,我搜索 JAKMIP3这个基因如下:

10    NC_000010.9 JAKMIP3 282973  CCDS7664.1  Withdrawn   +   133805458   133830526   [133805458-133805541, 133808600-133808683, 133809011-133809073, 133811445-133811543, 133812905-133813003, 133813475-133813528, 133816451-133816516, 133817252-133817320, 133817406-133817483, 133826702-133826905, 133828163-133828228, 133830476-133830526]    Identical
10    NC_000010.11    JAKMIP3 282973  CCDS44494.1 Public  +   132104808   132167032   [132104808-132104942, 132117076-132117573, 132133311-132133526, 132135040-132135159, 132135929-132136075, 132137018-132137149, 132137253-132137288, 132138118-132138177, 132140450-132140578, 132141919-132142047, 132145106-132145189, 132145517-132145579, 132147951-132148049, 132149411-132149509, 132149981-132150034, 132152957-132153022, 132153758-132153826, 132153912-132153989, 132163208-132163411, 132164669-132164734, 132166982-132167032]   Identical

很意外吧,它的确在CCDS数据库里面被记录了两个坐标,O(∩_∩)O哈哈~

值得注意的是第五列表明这两个记录其实是不一样的,我再统计一下:

 32534 Public
     7 Reviewed, update pending
     4 Under review, update
    15 Under review, withdrawal
  1731 Withdrawn
     6 Withdrawn, inconsistent annotation

既然Withdrawn的记录那么少,而且在JAKMIP3这个基因我可以肯定它是错误的,所以应该是要舍弃的。

新的代码如下:

cat CCDS.20160908.txt |grep -w -v Withdrawn|perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i >exon_probe.hg38.gene.bed

问题解决咯~

细思极恐,为什么这个错误我直到现在才发现呢?而CCDS数据库我使用了超过两年了,那么之前的其它软件为什么不报错呢?他们是忽略了我的input错误吗?

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存