Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.
In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808
60.919540
Solution
#!/usr/bin/env python3
f=open("DATA/rosalind_gc.txt", "r")
seq = f.readlines()
gc = 0
n = 0
max_gc = 0
for s in seq:
s = s.strip()
if s[0] == '>':
if n == 0:
desc = s[1:]
continue
else:
if gc*100/n > max_gc:
max_gc = gc*100/n
max_desc = desc
n = 0
gc = 0
desc = s[1:]
else:
n += len(s)
gc += s.count('G')
gc += s.count('C')
if max_gc < gc*100/n:
max_gc = gc*100/n
max_desc = desc
print(max_desc)
print(max_gc)
这道题第一次出现了fasta文件,直接解析文件算gc含量,当我们第二次看到fasta文件的时候,就该写函数来解析了,而不是像上面这样子。预先上biopython的SeqIO来解析fasta,下次我们将自己写一个解析fasta的函数。
#!/usr/bin/env python3
from Bio import SeqIO
fasta_sequences = SeqIO.parse("DATA/rosalind_gc.txt",'fasta')
def gc(seq):
return (seq.count('G') + seq.count('C'))/len(seq)
id_vector = []
gc_vector = []
for fasta in fasta_sequences:
id_vector.append(fasta.id)
gc_vector.append(gc(str(fasta.seq)))
max_gc = max(gc_vector)
max_gc_idx = gc_vector.index(max_gc)
max_gc_id = id_vector[max_gc_idx]
print(max_gc_id)
print(max_gc * 100)