其他
Perl学习19之生信简单运用(三)
"pythonic生物人"的第33篇分享
原创不易,点个“赞“或"在看"鼓励下呗
摘要
Perl计算SAM文件中reads落在,每个1M bin区间中的reads数目,GC碱基数目,GC比。
正文开始啦
输入文件1
file1,kemr区间文件,每个bin区间长1M。
输入文件2
file2,sam文件
问题解决代码:reads_inbin.pl
#!/usr/bin/perl
use strict;
use warnings;
#存储每个bin的区间{chr1=>{1000001=>0,2000001>0.......}}
my (%bin,%GC,%ATGC);
open BIN,"<",$ARGV[0];
while(<BIN>){
chomp;
my @line = split /\t/;
$bin{$line[0]}{$line[1]}=0;
}
close BIN;
open IN,"<",$ARGV[1];
while(<IN>){
chomp;
next if /^@/;
my @line = split/\t/,$_;
next if $line[2] eq "*";
next if $line[4] < 25;
#求每个bin区间例如,[1000001,1000001+1000000)上的reads数目
#keys %{$hash{line[2]}}取出所有1000001,二维哈希的循环写法
#foreach my $start (keys $bin{$line[2]}){#该写法报如下错误
#Type of argument to keys on reference must be unblessed hashref or arrayref
foreach my $s (sort keys %{$bin{$line[2]}}){
if($line[3]>=$s && $line[3]<($s+1000000)){
$bin{$line[2]}{$s} ++;
my $G = ($line[9] =~ s/G/G/g);
my $C = ($line[9] =~ s/C/C/g);
my $A = ($line[9] =~ s/A/A/g);
my $T = ($line[9] =~ s/T/T/g);
$GC{$line[2]}{$s} += ($G + $C);
$ATGC{$line[2]}{$s} += ($G + $C + $A + $T);
}
}
}
close IN;
foreach (keys %bin){
my $chr=$_;
foreach (keys $bin{$chr}){
printf "$chr\t$_\t$bin{$chr}{$_}\t$GC{$chr}{$_}\t$ATGC{$chr}{$_}\t%.2f%%\n",($GC{$chr}{$_}/$ATGC{$chr}{$_})*100;
}
}
perl reads_inbin.pl file1 test.sam
chr7 131000001 790 17650 39500 44.68%
chr7 79000001 706 12653 35300 35.84%
chr7 36000001 710 15303 35500 43.11%
chr7 24000001 725 14647 36250 40.41%
chr7 97000001 681 14593 34050 42.86%
chr7 47000001 796 18134 39800 45.56%
chr7 64000001 512 10840 25600 42.34%
chr7 33000001 749 14887 37450 39.75%
chr7 19000001 652 12071 32600 37.03%
。。。。。。。。。。。。。。。。。。。