s07 - Finding a Shared Motif
Problem
A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.
Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.
Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.
Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)
Sample Dataset
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
Sample Output
AC
Solution
这道题要求找出最长的共有子序列,解题思路从最长开始找,一旦找到就是答案。而对于每一个长度,先从第一条序列里生成一个kmer集合,然后在其它的序列中寻找。
#!/usr/bin/env python3
def readFASTA(file):
f = open(file)
lines = f.readlines()
des = list()
seqs = list()
first_line = True
for i in range(len(lines)):
line = lines[i].rstrip()
if line[0] == '>' and first_line == True:
des.append(line[1:])
first_line = False
seq = ''
elif line[0] == '>' and first_line == False:
des.append(line[1:])
seqs.append(seq)
seq = ''
else:
seq += line
seqs.append(seq)
return des, seqs
def lcs(strs):
maxK = min([len(x) for x in strs])
kmer = set()
seq = strs[0]
strs = strs[1:]
for k in reversed(range(maxK)):
for i in range(maxK+1-k):
kmer.add(seq[i:i+k])
for kk in kmer:
found = True
for x in strs:
if x.find(kk) == -1:
found = False
break
if found == True:
return(kk)
kmer = set()
if __name__ == "__main__":
description, sequence = readFASTA("DATA/rosalind_lcsm.txt")
print(lcs(sequence))