python - Extracting sequence and header from fasta file by known sequences -
i trying compare 2 files , extract sequences have subset of others. and, want extract identifiers too. however, can being able extract sequences including subsets. example files :
text.fa >header1 ettthaascisattvqeq*tlfrllp >header2 skspcsdsdy**aaa >header3 ssgavaaaptta and,
textref.fa >textref.fa cisa aaap aatp when run code, having output :
ettthaascisattvqeq*tlfrllp ssgavaaaptta however, expected output headers:
>header1 ettthaascisattvqeq*tlfrllp >header3 ssgavaaaptta my code in 2 parts, first create file these sequences, , try extract them headers original fasta file :
def get_nucl(filename): open(filename,'r') fd: nucl = [] line in fd: if line[0]!='>': nucl.append(line.strip()) return nucl def finding(filename,reffile): nucl = get_nucl(filename) open(reffile,'r') reffile2: line in reffile2: element in nucl: if line.strip() in element: yield(element) open('sequencesmatched.txt','w') output: results = finding('text.fa','textref.fa',) res in results: print(res) output.write(res + '\n') so, in sequencesmatched.txt, having sequences of text.fa have substrings of textref.fa. :
ettthaascisattvqeq*tlfrllp ssgavaaaptta so in other part, retrieve respective headers , these sequences :
def finding(filename,seqfile): open(filename,'r') fastafile: open(seqfile,'r') sequf: alls=[] line in fastafile: alls.append(line.strip()) print(alls) sequfs = [] line2 in sequf: sequfs.append(line2.strip()) if str(line.strip()) == str(line2.strip()): num = alls.index(line.strip()) print(alls[num-1] + line) print(finding('text.fa','sequencesmatched.txt')) however, output, can retrieve 1 sequence,which first match:
>header1 ettthaascisattvqeq*tlfrllp maybe without second file, not make right loops sequences , respective headers. therefore, went long way..
i happy if can help!
you can easier if file same structure:
def get_nucl(filename): open(filename, 'r') fd: headers = {} key = '' line in fd.readlines(): if '>' in line: key = line.strip()[1:] # remove '>' else: headers[key] = line.strip() return headers here i'm assuming file begin ">headern" whatever, if not have add test. have dictionnary headers['header1'] = 'ettthaascisattvqeq*tlfrllp'.
so find matches use dict :
def finding(filename, reffile): headers = get_nucl(filename) open(reffile, 'r') f: matches = {} line in f.readlines(): key, value in headers.items(): if line.stip() in value , key not in matches: matches[key] = value return matches so have dict headers matching values, can check in dict if have sub string, , have header value key.
just saw did print(finding(....) , function print, call it.
Comments
Post a Comment