跟着 Cell 学 Ribo-seq 分析二

Original JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集 #Ribo-seq 39个

随缘

1前言

昨天早晨醒来看到一些句子说的挺好,分享给大家:

你只是心动过，并没有坚定地走向我，也只是短暂的喜欢了我一下，我却记了很久很久… 确实啊，当初对我的感情那么浓烈，我想不到是你不爱我，心动这个词很好，只是心动，从来没有坚定的选择我。

难过是后知后觉的，可以是在一个阳光灿烂的午后，也可以是在一个大雨滂沱的夜晚,回想生命中错过的片段。

2引言

上一节我们看了 reads 的长度分布, 接下来需要计算每个位置的 reads 的 density。采用的不是传统的 5'end 的模式,而是采用 center weighting 的方法。具体原理我画个示意图阐述:

基本原理就是对 reads 的上下游缩减 11nt, 然后中间的位置的得分为剩余长度的倒数, 最后叠加这个位置的分数即为该位置的 density。

文章介绍:

A critical step in the analysis involves the use of a specific scoring system, a center-weighted strategy for counting each read. Center-weighting can better locate the position of the ribosomal A-site and P-site than a simple counting at the 5′ end of the read, because only the midpoint (center) of a footprint fragment is scored. Center-weighting also takes into account the heterogeneous read lengths after MNase digestion and size selection. To this end, every footprint receives the same score regardless of length. If the read is longer than the defined minimum length (23 nt), the position of the ribosome cannot be clearly assigned. To this end, 11 nt from either end are removed and the score of the footprint is distributed equally among the remaining nucleotides. This scoring system was successfully used to verify well-known pause sites,such as stop codons and nascent chain–mediated stalling sites15. In addition, it facilitated the detection of pausing at serine codons upon starvation and of Shine-Dalgarno-like sequences as general ribosome pausing sites15. Center-weighted scores can be used for downstream analyses, which include the calculation of gene expression levels, normalized read densities along the genome or in protein coding regions and the average read density in a meta-gene analysis.

3center weighting 计算

下面是代码实现过程,输入文件为 bowtie 默认输出的结果文件, 输出为正负链的 density:

"""
Supplementary Note 2: Center Weighting

Authors: Eugene Oh, Annemarie Becker

inputFile:
.map file generated by Bowtie default output.

outputFileP:
read density file for plus strand
    col0: position along genome
    col1: read density at that position

outputFileM:
read density file for minus strand
    col0: position along genome
    col1: read density at that position

"""
def rawdata(inputFile, outputFileP, outputFileM):

    pDict = {}
    mDict = {}

    inFile = open(inputFile, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col2 = str(fields[2])   #strand; note: if sequencing was performed without barcode reading, the column numbering is changed
        col4 = int(fields[4])   #left-most position
        col5 = str(fields[5])   #footprint seq
        length = len(col5)      #footprint length

        if 22 < length < 42:    #select range of footprint read lengths
            if col2 == '+': #for plus strand
                columns = len(fields)   #count number of columns to check if alignment contains mismatches
                if columns > 8:
                    col8 = str(fields[8])
                    if col8.startswith("0"): #if there is a mismatch in the 1st position
                        length0 = length - 1    #subtract wrong base at 1st position
                        end5 = col4 + 2  #Bowtie uses 0-based offset: transform to 1-based and subtract 1st base
                        end3 = end5 + length0 - 1
                        centerEnd5 = end5 + 11 #define center
                        centerEnd3 = end3 - 11
                        centerLength = centerEnd3 - centerEnd5 + 1
                    else:
                        end5 = col4 + 1     #Bowtie uses zero-based offset, transform to 1-based
                        end3 = end5 + length - 1
                        centerEnd5 = end5 + 11
                        centerEnd3 = end3 - 11
                        centerLength = centerEnd3 - centerEnd5 + 1
                else:
                    end5 = col4 + 1
                    end3 = end5 + length - 1
                    centerEnd5 = end5 + 11
                    centerEnd3 = end3 - 11
                    centerLength = centerEnd3 - centerEnd5 + 1

                for elem in range(centerEnd5, centerEnd3 + 1):
                    if elem in pDict:
                        pDict[elem] += (1.0 / centerLength)
                    else:
                        pDict[elem] = (1.0 / centerLength)

            elif col2 == '-':  #for minus strand
                columns = len(fields)
                if columns > 8:
                    col8 = str(fields[8])
                    if col8.startswith("0"):
                        length0 = length - 1
                        end3 = col4 + 1         #for minus strand, Bowtie gives leftmost position (3' end) with zero-based numbering
                        end5 = end3 + length0 - 1
                        centerEnd5 = end5 - 11
                        centerEnd3 = end3 + 11
                        centerLength = centerEnd5 - centerEnd3 + 1
                    else:
                        end3 = col4 + 1
                        end5 = end3 + length - 1
                        centerEnd5 = end5 - 11
                        centerEnd3 = end3 + 11
                        centerLength = centerEnd5 - centerEnd3 + 1
                else:
                    end3 = col4 + 1
                    end5 = end3 + length - 1
                    centerEnd5 = end5 - 11
                    centerEnd3 = end3 + 11
                    centerLength = centerEnd5 - centerEnd3 + 1

                for elem in range(centerEnd3, centerEnd5 + 1):
                    if elem in mDict:
                        mDict[elem] += (1.0 / centerLength)
                    else:
                        mDict[elem] = (1.0 / centerLength)

        line = inFile.readline()

    pList = list(pDict.items())
    pList.sort()
    outFileP = open(outputFileP, 'w')
    for J in pList:
        outFileP.write(str(J[0]) + '\t' + str(J[1]) + '\n')

    mList = list(mDict.items())
    mList.sort()
    outFileM = open(outputFileM, 'w')
    for J in mList:
        outFileM.write(str(J[0]) + '\t' + str(J[1]) + '\n')

    inFile.close()
    outFileP.close()
    outFileM.close()

代码也比较容易看懂, 作者也考虑了错配的碱基,并不纳入该位置进行计算。

值得注意的是,保存每个位置的 density 作者采用的字典的 key 为每个 read 的比对位置, 那么这就会有一个问题,每条 read 比对到不同的染色体同一位置处,也需要进行叠加 density?
这么计算是不合理的,应该把字典的 key 加上染色体的信息 才能消除这种影响。

作者的代码是下面这样,elem 应该加上染色体的信息作为字典的 key:

for elem in range(centerEnd3, centerEnd5 + 1):
    if elem in mDict:
        mDict[elem] += (1.0 / centerLength)
    else:
        mDict[elem] = (1.0 / centerLength)

咱们先不管,按着原来的代码往后分析:

批量计算:

# sample name
sample = ['Ssb1-inter-rep1.map','Ssb1-inter-rep2.map','Ssb2-inter-rep1.map','Ssb2-inter-rep2.map',
          'Ssb1-trans-rep1.map','Ssb1-trans-rep2.map','Ssb2-trans-rep1.map','Ssb2-trans-rep2.map']

outPlus = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
          'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']

outMinus = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
          'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']

# run
for i in range(0,8):
    rawdata(sample[i],''.join(['2.ribo-density-data/',outPlus[i]]),''.join(['2.ribo-density-data/',outMinus[i]]))

4计算每个样本的 total density

"""
Supplementary Note 3: Total reads

Author: Annemarie Becker

inputFileP:
read density file for plus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

inputFileM:
read density file for minus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

outputFile:
total read number as float

"""

def countReads(inputFileP,inputFileM, outputFile):

    inFileP = open(inputFileP, 'r')
    inFileM = open(inputFileM, 'r')
    outFile = open(outputFile, 'w')

    line = inFileP.readline()
    i = 0
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        i = i+col1  #count reads on plus strand
        line = inFileP.readline()
    totReadsP = i

    line = inFileM.readline()
    j = 0
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        j = j+col1  #count reads on minus strand
        line = inFileM.readline()
    totReadsM = j

    totalReads = i + j
    outFile.write(str(totalReads))

    inFileP.close()
    inFileM.close()
    outFile.close()

批量计算:

# sample name
outPlus = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
          'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']

outMinus = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
          'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']

sample = ['Ssb1-inter-rep1.totalReads.txt','Ssb1-inter-rep2.totalReads.txt','Ssb2-inter-rep1.totalReads.txt','Ssb2-inter-rep2.totalReads.txt',
          'Ssb1-trans-rep1.totalReads.txt','Ssb1-trans-rep2.totalReads.txt','Ssb2-trans-rep1.totalReads.txt','Ssb2-trans-rep2.totalReads.txt']

# run
for i in range(0,8):
    countReads(''.join(['2.ribo-density-data/',outPlus[i]]),''.join(['2.ribo-density-data/',outMinus[i]]),''.join(['3.totalReads-data/',sample[i]]))

5准备基因信息文件

作者只说了 NCBI 的 ptt 文件, 但我找不到,于是自己写代码准备这样一个基因文件,一个正链一个负链的基因信息文件,三列,基因名,基因起始位点和基因终止位点:

# get protein-coding gene start and end info
inputListP = open('geneListPlus.txt','w')
inputListM = open('geneListMlus.txt','w')

with open('Saccharomyces_cerevisiae.R64-1-1.105.gtf','r') as gtf:
    for line in gtf:
        if line.startswith('#'):
            continue
        fileds = line.split()
        # for Plus strand
        if fileds[2] == 'gene':
            if len(fileds) > 14:
                type = fileds[15]
                gene_name = fileds[11].replace('"','').replace(';','')
            else:
                type = fileds[13]
                gene_name = fileds[9].replace('"','').replace(';','')
            if  type == '"protein_coding";' and fileds[6] == '+':
                gene_name = gene_name
                start = fileds[3]
                end = fileds[4]
                inputListP.write('\t'.join([gene_name,str(start),str(end)]) + '\n')
            elif type == '"protein_coding";' and fileds[6] == '-':
                gene_name = gene_name
                start = fileds[3]
                end = fileds[4]
                inputListM.write('\t'.join([gene_name,str(start),str(end)]) + '\n')
            else:
                pass
        else:
            pass

inputListP.close()
inputListM.close()

输出文件长这样:

THI74 1338274 1339386
PAU10 1523249 1523611
IZH1 1434924 1435874

6计算每个基因的总 density

基因前面的 density 的基因信息文件,对基因位置内的 density 进行加和即可,代码也比较好理解:

"""
Supplementary Note 4: Read density per gene

Authors: Eugene Oh

inputFileP:
read density file for plus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

inputFileM:
read density file for minus strand (Supplementary Note 2)
    col0: position along genome
    col1: read density at that position

inputListP:
E. coli MC4100 gene list of the plus strand
    col0: gene name
    col1: start coordinate of gene
    col2: stop coordinate of gene

inputListM
E. coli MC4100 gene list of the minus strand
    col0: gene name
    col1: start coordinate of gene
    col2: stop coordinate of gene

outputFileP:
read densities per gene on plus strand
    col0: gene name
    col1: start coordinate of gene
    col2: stop coordinate of gene
    col3: sum of read densities

outputFileM:
read densities per gene on minus strand
    col0: gene name
    col1: start coordinate of gene
    col2: stop coordinate of gene
    col3: sum of read densities

"""

def expression(inputFileP, inputFileM, inputListP, inputListM, outputFileP, outputFileM):

### PLUS STRAND ###

  # Upload read density file from plus strand as a dictionary

    inFileDictP = {}         #create dictionary that looks like input file

    inFile = open(inputFileP, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        inFileDictP[col0] = col1
        line = inFile.readline()

  # Upload plus strand gene list as a dictionary and list

    geneDictP = {}         #create dictionary with col0=gene name; col1=read number
    geneListP = []         #create list that looks like input gene list

    inFile = open(inputListP, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        geneListP.append(fields)        #add an item to the end of the list
        col0 = str(fields[0])           #plus strand gene list: gene name
        col1 = int(fields[1])           #plus strand gene list: start
        col2 = int(fields[2])           #plus strand gene list: stop

  # Sum up and write read densities per protein coding region in dictionary
        for Z in range(col1, col2 + 1):
            if Z in inFileDictP and col0 in geneDictP:
                geneDictP[col0] += inFileDictP[Z]
            elif Z in inFileDictP:
                geneDictP[col0] = inFileDictP[Z]
        line = inFile.readline()

  # Assign gene expression levels to all genes

    tupledlistP = geneDictP.items()
    for J in geneListP:
        match = 0
        for K in tupledlistP:
            if J[0] == K[0]:
                match = 1
                J.append(K[1])
        if match == 0:  #list genes that don't have any reads
            J.append(0)

  # Output file for plus strand

    outFile = open(outputFileP, 'w')
    for J in geneListP:
        outFile.write(str(J[0]) + '\t' + str(J[1]) + '\t' + str(J[2]) + '\t' + str(J[3]) + '\n')


### MINUS STRAND ###

  # Upload read density file from minus strand as a dictionary

    inFileDictM = {}

    inFile = open(inputFileM, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        col0 = int(fields[0])
        col1 = float(fields[1])
        inFileDictM[col0] = col1
        line = inFile.readline()

  # Upload minus strand gene list as a dictionary and list

    geneDictM = {} #create dictionary with col0=gene name; col1=read number
    geneListM = [] #create list that looks like input gene list

    inFile = open(inputListM, 'r')
    line = inFile.readline()
    while line != '':
        fields = line.split()
        geneListM.append(fields) #add an item to the end of the list
        col0 = str(fields[0])           #minus strand gene list: gene name
        col1 = int(fields[1])           #minus strand gene list: start
        col2 = int(fields[2])           #minus strand gene list: stop

  # Sum up and write read densities per protein coding region in dictionary

        for Z in range(col2, col1 + 1):
            if Z in inFileDictM and col0 in geneDictM:
                geneDictM[col0] += inFileDictM[Z]
            elif Z in inFileDictM:
                geneDictM[col0] = inFileDictM[Z]
        line = inFile.readline()

  # Assign gene expression levels to all genes

    tupledlistM = geneDictM.items()
    for J in geneListM:
        match = 0
        for K in tupledlistM:
            if J[0] == K[0]:
                match = 1
                J.append(K[1])
        if match == 0:
            J.append(0)

  # Output file for minus strand

    outFile = open(outputFileM, 'w')
    for J in geneListM:
        outFile.write(str(J[0]) + '\t' + str(J[1]) + '\t' + str(J[2]) + '\t' + str(J[3]) + '\n')

批量计算:

# expression(inputFileP, inputFileM, inputListP, inputListM, outputFileP, outputFileM)

# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
          'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']

inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
          'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']

inputListP = 'geneListPlus.txt';inputListM = 'geneListMlus.txt'


outputFileP = ['Ssb1-inter-rep1.geneExpPlus.txt','Ssb1-inter-rep2.geneExpPlus.txt','Ssb2-inter-rep1.geneExpPlus.txt','Ssb2-inter-rep2.geneExpPlus.txt',
          'Ssb1-trans-rep1.geneExpPlus.txt','Ssb1-trans-rep2.geneExpPlus.txt','Ssb2-trans-rep1.geneExpPlus.txt','Ssb2-trans-rep2.geneExpPlus.txt']

outputFileM = ['Ssb1-inter-rep1.geneExpMinus.txt','Ssb1-inter-rep2.geneExpMinus.txt','Ssb2-inter-rep1.geneExpMinus.txt','Ssb2-inter-rep2.geneExpMinus.txt',
          'Ssb1-trans-rep1.geneExpMinus.txt','Ssb1-trans-rep2.geneExpMinus.txt','Ssb2-trans-rep1.geneExpMinus.txt','Ssb2-trans-rep2.geneExpMinus.txt']

# run
for i in range(0,8):
    expression(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
               inputListP,inputListM,
               ''.join(['4.gene-expression-data/',outputFileP[i]]),''.join(['4.gene-expression-data/',outputFileM[i]]))

输出结果:

THI74 1338274 1339386 1.5714285714285714
PAU10 1523249 1523611 0
IZH1 1434924 1435874 436.71428571428805
RBS1 122216 123589 7250.4234126983865
MRPL28 1386073 1386516 192.90277777777777
NUM1 755628 763874 19323.75396825382