查看原文
其他

跟着 Cell 学 Ribo-seq 分析 四 (终)

JunJunLab 老俊俊的生信笔记 2022-08-15


随缘

1引言

为了去除测序文库大小带来的差异,往往需要对文库大小进行归一化,计算 RPM使不同样本之间的基因具有可比性

之前我们计算了 每个位置处的的 density, 但是没有进行归一化,所以需要计算一下 RPM 的值。

2calculate RPM

输入文件为 density 文件,输出文件为每个位置处的 RPM 值:

"""
Supplementary Note 6: RPM-normalized read densities

Author: Annemarie Becker

inputFileP:
read density file for plus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

inputFileM:
read density file for minus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

inputNumber:
total read number as float (Supplementary Note 3)

outputFileP:
RPM-normalized read density file for plus strand
    col0: position along genome
    col1: RPM-normalized read density at that position

outputFileM:
RPM-normalized read density file for minus strand
    col0: position along genome
    col1: RPM-normalized read density at that position

"""

定义函数:

def norm(inputFileP, inputFileM, inputNumber, outputFileP, outputFileM):

### PLUS STRAND ###

    inFile = open(inputFileP, 'r')
    inNumber = open(inputNumber, 'r')
    outFile = open(outputFileP, 'w')

    line = inFile.readline()
    number = inNumber.readline()
    totalReads = int(float(number))

    i = 0
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        RPM = col1 / totalReads * 1000000
        outFile.write(str(col0) + '\t' + str(RPM) + '\n')
        line = inFile.readline()


### MINUS STRAND ###

    inFile = open(inputFileM, 'r')
    inNumber = open(inputNumber, 'r')
    outFile = open(outputFileM, 'w')

    line = inFile.readline()
    number = inNumber.readline()
    totalReads = int(float(number))

    i = 0
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        RPM = col1 / totalReads * 1000000
        outFile.write(str(col0) + '\t' + str(RPM) + '\n')
        line = inFile.readline()

代码比较好懂,接下来批量运行:

# norm(inputFileP, inputFileM, inputNumber, outputFileP, outputFileM)

# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
          'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']

inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
          'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']

inputNumber = ['Ssb1-inter-rep1.totalReads.txt','Ssb1-inter-rep2.totalReads.txt','Ssb2-inter-rep1.totalReads.txt','Ssb2-inter-rep2.totalReads.txt',
          'Ssb1-trans-rep1.totalReads.txt','Ssb1-trans-rep2.totalReads.txt','Ssb2-trans-rep1.totalReads.txt','Ssb2-trans-rep2.totalReads.txt']

outputFileP = ['Ssb1-inter-rep1.PRPM.txt','Ssb1-inter-rep2.PRPM.txt','Ssb2-inter-rep1.PRPM.txt','Ssb2-inter-rep2.PRPM.txt',
          'Ssb1-trans-rep1.PRPM.txt','Ssb1-trans-rep2.PRPM.txt','Ssb2-trans-rep1.PRPM.txt','Ssb2-trans-rep2.PRPM.txt']

outputFileM = ['Ssb1-inter-rep1.MRPM.txt','Ssb1-inter-rep2.MRPM.txt','Ssb2-inter-rep1.MRPM.txt','Ssb2-inter-rep2.MRPM.txt',
          'Ssb1-trans-rep1.MRPM.txt','Ssb1-trans-rep2.MRPM.txt','Ssb2-trans-rep1.MRPM.txt','Ssb2-trans-rep2.MRPM.txt']

# run
for i in range(0,8):
    norm(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
         ''.join(['3.totalReads-data/',inputNumber[i]]),
         ''.join(['6.RPM-data/',outputFileP[i]]),''.join(['6.RPM-data/',outputFileM[i]]))

结果类似于下面这样:

726 0.0807090320751414
742 0.008967670230571266
743 0.008967670230571266
744 0.008967670230571266
745 0.008967670230571266

3extend positions

为了后面方便计算,需要对把数据 拓展为完整的基因组坐标位置, 因为现在是断断续续的位置:

输入文件为上一步的 RPM 文件:

"""
Supplementary Note 7: Complete RPM-normalized read densities

Author: Annemarie Becker

inputFileP:
RPM-normalized read density file for plus strand (Supplementary Note 6)
    col0: position along genome
    col1: RPM-normalized read density at that position

inputFileM:
RPM-normalized read density file for minus strand (Supplementary Note 6)
    col0: position along genome
    col1: RPM-normalized read density at that position

outputFileP:
complete RPM-normalized read density file for all positions along the genome on plus strand
    col0: position along genome
    col1: RPM-normalized read density at that position

outputFileM:
complete RPM-normalized read density file for all positions along the genome on minus strand
    col0: position along genome
    col1: RPM-normalized read density at that position
"""

定义函数:

def RPMcomplete(inputFileP, inputFileM, outputFileP, outputFileM):

### PLUS STRAND###

    DictP = {}

    inFile = open(inputFileP, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        DictP[col0] = col1
        line = inFile.readline()

    outFile = open(outputFileP, 'w')

    for elem in range(1,4578160):          #genome length, change for different genome
        if elem not in DictP:
            DictP[elem] = 0.0

    ListP = list(DictP.items())
    ListP.sort()

    for J in ListP:
        outFile.write(str(J[0]) + '\t' + str(J[1]) + '\n')


### MINUS STRAND###

    DictM = {}

    inFile = open(inputFileM, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        DictM[col0] = col1
        line = inFile.readline()

    outFile = open(outputFileM, 'w')

    for elem in range(1,4578160):          #genome length, change for different genome
        if elem not in DictM:
            DictM[elem] = 0.0

    ListM = list(DictM.items())
    ListM.sort()

    for J in ListM:
        outFile.write(str(J[0]) + '\t' + str(J[1]) + '\n')

这里作者设置的酵母基因组长度为 range(1,4578160)

注意,这里依然会出现我前面提到的问题,应该是对每个染色体对应的长度进行分析。所以我觉得代码是有问题的。

批量运行:

# RPMcomplete(inputFileP, inputFileM, outputFileP, outputFileM)

# sample name
inputFileP = ['Ssb1-inter-rep1.PRPM.txt','Ssb1-inter-rep2.PRPM.txt','Ssb2-inter-rep1.PRPM.txt','Ssb2-inter-rep2.PRPM.txt',
          'Ssb1-trans-rep1.PRPM.txt','Ssb1-trans-rep2.PRPM.txt','Ssb2-trans-rep1.PRPM.txt','Ssb2-trans-rep2.PRPM.txt']

inputFileM = ['Ssb1-inter-rep1.MRPM.txt','Ssb1-inter-rep2.MRPM.txt','Ssb2-inter-rep1.MRPM.txt','Ssb2-inter-rep2.MRPM.txt',
          'Ssb1-trans-rep1.MRPM.txt','Ssb1-trans-rep2.MRPM.txt','Ssb2-trans-rep1.MRPM.txt','Ssb2-trans-rep2.MRPM.txt']

outputFileP = ['Ssb1-inter-rep1.PRPM-full.txt','Ssb1-inter-rep2.PRPM-full.txt','Ssb2-inter-rep1.PRPM-full.txt','Ssb2-inter-rep2.PRPM-full.txt',
          'Ssb1-trans-rep1.PRPM-full.txt','Ssb1-trans-rep2.PRPM-full.txt','Ssb2-trans-rep1.PRPM-full.txt','Ssb2-trans-rep2.PRPM-full.txt']

outputFileM = ['Ssb1-inter-rep1.MRPM-full.txt','Ssb1-inter-rep2.MRPM-full.txt','Ssb2-inter-rep1.MRPM-full.txt','Ssb2-inter-rep2.MRPM-full.txt',
          'Ssb1-trans-rep1.MRPM-full.txt','Ssb1-trans-rep2.MRPM-full.txt','Ssb2-trans-rep1.MRPM-full.txt','Ssb2-trans-rep2.MRPM-full.txt']

# run
for i in range(0,8):
    RPMcomplete(''.join(['6.RPM-data/',inputFileP[i]]),''.join(['6.RPM-data/',inputFileM[i]]),
               ''.join(['6.RPM-data/',outputFileP[i]]),''.join(['6.RPM-data/',outputFileM[i]]))

输出结果,可以看到从 1 开始:

1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0

4Enrichment efficiency

计算富集效率,输入文件为 IP 和 Input 的 RPM 拓展坐标的文件:

"""
Supplementary Note 14: Enrichment efficiency

Author: Annemarie Becker

inputFile1:
RPM-normalized read densities along the whole genome or in protein coding regions on plus or minus strand from sample 1 (Supplementary Note 7 or 8)
    col0: position along genome
    col1: RPM-normalized read density at that position

inputFile2:
RPM-normalized read densities along the whole genome or in protein coding regions on plus or minus strand from sample 2 (Supplementary Note 7 or 8)
    col0: position along genome
    col1: RPM-normalized read density at that position

outputFile:
ratio of RPM-normalized read densities for protein coding regions on plus or minus strand from samples 1 and 2
    col0: position along genome
    col1: ratio of RPM-normalized read densities from samples 1 and 2

"""

因为 很多位置出的 density 并不连续, 作者采用了滑动窗口的方法,计算每个位置上下游 20nt 的 density 的总和,再除以 Input 的上下游 20nt 的 density 的总和得到 ratio,这样出来的图的曲线会更光滑一些:

作者使用滑动窗口的方法在整个基因组上进行滑动计算比值, 分为三个区间,前 20nt,中间区域末尾 20nt

定义函数:

def ratio(inputFile1, inputFile2, outputFile):

  # Upload input1

    Dict1 = {}
    inFile = open(inputFile1, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        Dict1[col0] = col1
        line = inFile.readline()

  # Upload input2

    Dict2 = {}
    inFile = open(inputFile2, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        Dict2[col0] = col1
        line = inFile.readline()


  # calculate window +-20

    outFile = open(outputFile, 'w')

    sum1 = 0
    sum2 = 0
    start = 1
    end = 4578159                  #change this value dependent on the genome
    start_sum = start + 20                  #start_sum = 21
    end_sum = end - 20                      #end_sum = 4578139


    for J in range(start, start_sum + 1):       #lines 1-21
        for X in range(start, J+20+1):          #sum 1-(J+20)
            sum1 += float(Dict1[X])
            sum2 += float(Dict2[X])
        if sum2 != 0:
            ratio = sum1 / sum2
        else:
            ratio = 0.0
        outFile.write(str(J) + '\t' + str(ratio) + '\n')
        sum1 = 0
        sum2 = 0

    for K in range(start_sum + 1, end_sum + 1):         #lines 22-4578139
        for Y in range(K-20, K+20+1):                   #sum (K-20)-(K+20)
            sum1 += float(Dict1[Y])
            sum2 += float(Dict2[Y])
        if sum2 != 0:
            ratio = sum1 / sum2
        else:
            ratio = 0.0
        outFile.write(str(K) + '\t' + str(ratio) + '\n')
        sum1 = 0
        sum2 = 0

    for L in range(end_sum + 1, end + 1):               #lines 4578140-4578159
        for Z in range(L-20, end + 1):
            sum1 += float(Dict1[Z])
            sum2 += float(Dict2[Z])
        if sum2 != 0:
            ratio = sum1 / sum2
        else:
            ratio = 0.0
        outFile.write(str(L) + '\t' + str(ratio) + '\n')
        sum1 = 0
        sum2 = 0

批量运行:

inputFile1 = ['Ssb1-inter-rep1.PRPM-full.txt','Ssb1-inter-rep2.PRPM-full.txt','Ssb2-inter-rep1.PRPM-full.txt','Ssb2-inter-rep2.PRPM-full.txt',
            'Ssb1-inter-rep1.MRPM-full.txt','Ssb1-inter-rep2.MRPM-full.txt','Ssb2-inter-rep1.MRPM-full.txt','Ssb2-inter-rep2.MRPM-full.txt']

inputFile2 = ['Ssb1-trans-rep1.PRPM-full.txt','Ssb1-trans-rep2.PRPM-full.txt','Ssb2-trans-rep1.PRPM-full.txt','Ssb2-trans-rep2.PRPM-full.txt',
             'Ssb1-trans-rep1.MRPM-full.txt','Ssb1-trans-rep2.MRPM-full.txt','Ssb2-trans-rep1.MRPM-full.txt','Ssb2-trans-rep2.MRPM-full.txt']

outputFile = ['Ssb1-rep1-Penrichment.txt','Ssb1-rep2-Penrichment.txt','Ssb2-rep1-Penrichment.txt','Ssb2-rep2-Penrichment.txt',
            'Ssb1-rep1-Menrichment.txt','Ssb1-rep2-Menrichment.txt','Ssb2-rep1-Menrichment.txt','Ssb2-rep2-Menrichment.txt']

# run
for i in range(0,8):
    ratio(''.join(['6.RPM-data/',inputFile1[i]]),
          ''.join(['6.RPM-data/',inputFile2[i]]),
          ''.join(['8.enrichment-ratio-data/',outputFile[i]]))

得到每个位置出的比值:

8750 0.0
8751 0.0
8752 0.0
8753 0.11379265589085022
8754 0.22758531178170044
8755 0.3413779676725507

5可视化

这里准备一个基因信息文件,挑选了 三个正链基因两个负链基因:

PMT1 287059 289512
CDC37 790328 791848
CCT3 407558 409162
SSC1 519638 521602
PDI1 48653 50221

正链基因

微信扫一扫付费阅读本文

可试读56%

微信扫一扫付费阅读本文

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存