Ribo-seq 质控之 reads 分布特征

Original JunJunLab 老俊俊的生信笔记 2022-08-17

收录于合集 #Ribo-seq 39个

你的一切社会关系的总和让你活着

1引言

核糖体结合的片段是 mRNA 的 CDS 区域,所以 Ribo-seq 测序得到的 reads 的分布特征也主要是在 CDS 区最多。

在一些相关文献里也会展现一些类似的分布图。

2分配

思路:

这里我采用 reads 的中点位置 来分配不同的区域, 中点位置位于哪个区域,该区域就加一。

定义函数:

# load package
using XAM

# define function
function featureDistribution(inputFile,outputFile)
    #####################################################
    # 1.count gene numbers
    #####################################################

    # save in dict
    feature_dict = Dict{Int64,Int64}(1 => 0,2 => 0,3 =>0)

    # open sam file
    reader = open(SAM.Reader,inputFile)
    record = SAM.Record()

    # loop
    while !eof(reader)
        empty!(record)
        read!(reader, record)
        # do something
        if SAM.ismapped(record)

            # tags
            refname = SAM.refname(record)

            # tags
            _,_,_,cdsST,cdsSP,_ = split(refname,"|")

            # tags
            align_pos = SAM.position(record)
            read_length = SAM.seqlength(record)

            # read center position
            align_pos_center = align_pos + (read_length ÷ 2)

            # cds
            cdsST,cdsSP = parse(Int64,cdsST),parse(Int64,cdsSP)

            # count reads ribo on CDS region
            if align_pos_center <= cdsST
                key = 1
                if !haskey(feature_dict,key)
                    feature_dict[key] = 1
                else
                    feature_dict[key] += 1
                end
            elseif cdsST < align_pos_center <= cdsSP
                key = 2
                if !haskey(feature_dict,key)
                    feature_dict[key] = 1
                else
                    feature_dict[key] += 1
                end
            elseif align_pos_center > cdsSP
                key = 3
                if !haskey(feature_dict,key)
                    feature_dict[key] = 1
                else
                    feature_dict[key] += 1
                end
            end
        end
    end

    # output
    output = open(outputFile,"w")

    for (key,val) in feature_dict
        # save
        write(output,join([key,val],"\t")*"\n")
    end
    close(output)
end

3运行

featureDistribution("rCN.sam","rCN.feature.txt")
featureDistribution("ICN.sam","ICN.feature.txt")

结果:

2 738842
3 42178
1 23738

1代表 5'UTR, 2代表 CDS,3代表 3'UTR, 第二列为 reads 数量。

4可视化

读取数据:

library(tidyverse)
library(ggplot2)

# load data
file <- list.files(pattern = '.feature.txt')

map_df(1:length(file),function(x){
  tmp = read.table(file[x])
  colnames(tmp) <- c('type','numbers')
  # add samle
  tmp$sample <- sapply(strsplit(file[x],split = '\\.'),'[',1)
  # add percent
  tmp$per <- tmp$numbers/sum(tmp$numbers)
  return(tmp)
}) -> fdf

绘图:

# order
fdf$type <- factor(fdf$type,levels = c(2,1,3))

# plot
ggplot(fdf,aes(x = factor(type),y = per)) +
  geom_col(aes(fill = factor(type)),
           show.legend = F,
           width = 0.7,
           position = position_dodge2()) +
  scale_fill_brewer(palette = 'Reds',name = '',direction = -1) +
  scale_x_discrete(labels = c('2' = "CDS",'1' = "5'UTR",'3' = "3'UTR")) +
  theme_classic(base_size = 16) +
  xlab('') + ylab('Reads percent') +
  scale_y_continuous(expand = c(0,0)) +
  theme(aspect.ratio = 1.7,
        strip.background = element_rect(fill = 'grey85',colour = NA)) +
  facet_wrap(~sample,scales = 'free_x')